CN116958221A - Cell data analysis method, device, equipment and storage medium

Cell data analysis method, device, equipment and storage medium

Info

Publication number
CN116958221A
CN116958221A
Authority
CN
China
Prior art keywords
cell
sequencing
image
tissue
tissue staining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310593539.1A
Other languages
Chinese (zh)
Inventor
吴子涵
姚建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310593539.1A priority Critical patent/CN116958221A/en
Publication of CN116958221A publication Critical patent/CN116958221A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/35 Determination of transform parameters for the alignment of images, i.e. image registration using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B 25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30024 Cell structures in vitro; Tissue sections in vitro
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30204 Marker

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The embodiment of the application discloses a cell data analysis method, device, equipment and storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: obtaining a tissue staining image and sequencing data corresponding to a tissue sample; performing reference line detection on the sequencing data and the tissue staining image respectively to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image; performing image registration between the sequencing data and the tissue staining image based on the first reference line distribution and the second reference line distribution to obtain a gene expression result corresponding to each cell in the tissue sample; and performing cell type annotation on each cell based on the gene expression result to obtain a cell analysis result corresponding to the tissue sample. The accuracy of image registration between the sequencing data and the tissue staining image is improved, which in turn improves the efficiency and accuracy of cell data analysis.

Description

Cell data analysis method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a cell data analysis method, a device, equipment and a storage medium.
Background
The cell-level spatial transcriptome binning and type annotation workflow mainly comprises three steps: cell segmentation, image registration, and cell type annotation.
In the related art, methods for registering a tissue staining image into the sequencing space can be broadly divided into automatic methods and manual methods. An automatic method performs registration based on image features, obtaining the mapping relation by computing a transformation matrix between pixel-value peaks or extrema of the tissue staining image and of an image converted from the spatial sequencing data; a manual method performs image registration directly through manual alignment operations.
However, the manual operation process consumes manpower and time, its error is uncontrollable, and it cannot handle large-scale data; the image-feature-based method places high quality requirements on the tissue and sequencing images, and its image registration failure rate is high because it generally depends on the tissue section itself having obvious local morphological features.
Disclosure of Invention
The embodiment of the application provides a cell data analysis method, a device, equipment and a storage medium, which can improve the accuracy of image registration between sequencing data and tissue staining images, thereby improving the efficiency and accuracy of cell data analysis. The technical scheme is as follows:
In one aspect, an embodiment of the present application provides a method for analyzing cellular data, the method comprising:
obtaining a tissue staining image corresponding to a tissue sample and sequencing data, wherein the tissue staining image characterizes each cell contained in the tissue sample, and the sequencing data comprises gene expression data and spatial position data of each sequencing point, and each cell corresponds to at least one sequencing point;
performing reference line detection on the sequencing data and the tissue staining image respectively to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image, wherein the first reference line distribution is used for positioning each sequencing point in the sequencing data, and the second reference line distribution is used for positioning each cell in the tissue staining image;
performing image registration on the sequencing data and the tissue staining image based on the first reference line distribution and the second reference line distribution to obtain a gene expression result corresponding to each cell in the tissue sample;
and carrying out cell type annotation on each cell based on the gene expression result to obtain a cell analysis result corresponding to the tissue sample.
In another aspect, an embodiment of the present application provides a cell data analysis apparatus, including:
the tissue staining system comprises an acquisition module, a sequencing module and a detection module, wherein the acquisition module is used for acquiring a tissue staining image corresponding to a tissue sample and sequencing data, the tissue staining image represents each cell contained in the tissue sample, the sequencing data comprises gene expression data and spatial position data of each sequencing point, and each cell corresponds to at least one sequencing point;
the first datum line detection module is used for respectively carrying out datum line detection on the sequencing data and the tissue staining image to obtain a first datum line distribution corresponding to the sequencing data and a second datum line distribution corresponding to the tissue staining image, wherein the first datum line distribution is used for positioning each sequencing point in the sequencing data, and the second datum line distribution is used for positioning each cell in the tissue staining image;
the image registration module is used for carrying out image registration on the sequencing data and the tissue staining images based on the first datum line distribution and the second datum line distribution to obtain gene expression results corresponding to all cells in the tissue sample;
And the type annotation module is used for carrying out cell type annotation on each cell based on the gene expression result to obtain a cell analysis result corresponding to the tissue sample.
In another aspect, embodiments of the present application provide a computer device comprising a processor and a memory having at least one instruction stored therein, the at least one instruction being loaded and executed by the processor to implement a method of cell data analysis as described in the above aspects.
In another aspect, embodiments of the present application provide a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a method of cell data analysis as described in the above aspects.
In another aspect, embodiments of the present application provide a computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the cell data analysis method provided in the above aspect.
In the embodiment of the application, in order to obtain a cell analysis result corresponding to a tissue sample, a tissue staining image and sequencing data corresponding to the tissue sample are first obtained, wherein the tissue staining image characterizes each cell contained in the tissue sample and the sequencing data comprises gene expression data and spatial position data of each sequencing point. Further, in order to determine the gene expression data corresponding to each cell, reference line detection is performed on the sequencing data and the tissue staining image to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image, and image registration is performed between the sequencing data and the tissue staining image according to the first reference line distribution and the second reference line distribution, so as to obtain the gene expression result corresponding to each cell; cell type annotation is then performed on each cell to obtain the cell analysis result corresponding to the tissue sample. Because image registration between the sequencing data and the tissue staining image is realized by means of reference line detection, the accuracy of the image registration is improved, which in turn improves the efficiency and accuracy of cell data analysis.
Drawings
FIG. 1 illustrates a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of a method for analyzing cellular data provided in an exemplary embodiment of the application;
FIG. 3 illustrates sequencing data and a tissue staining image schematic provided by an exemplary embodiment of the present application;
FIG. 4 illustrates a reference line distribution schematic provided by an exemplary embodiment of the present application;
FIG. 5 is a flow chart illustrating a method of analyzing cellular data according to another exemplary embodiment of the present application;
FIG. 6 illustrates a schematic diagram of cell segmentation by an instance segmentation model provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of cell segmentation results provided in accordance with an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of reference line detection on sequencing data according to an exemplary embodiment of the present application;
FIG. 9 illustrates a reference line distribution corresponding to a tissue staining image provided by an exemplary embodiment of the present application;
FIG. 10 illustrates global reference line detection on an ssDNA image provided in accordance with an exemplary embodiment of the present application;
FIG. 11 illustrates a schematic image registration result provided by an exemplary embodiment of the present application;
FIG. 12 shows a schematic representation of ssDNA images and cell segmentation results after image registration provided by an exemplary embodiment of the present application;
FIG. 13 illustrates a flow chart for determining cross entropy loss provided by an exemplary embodiment of the present application;
FIG. 14 is a schematic representation of cell type annotation results provided by an exemplary embodiment of the present application;
FIG. 15 is a flow chart illustrating a method of analyzing cellular data according to an exemplary embodiment of the present application;
FIG. 16 is a block diagram showing the structure of a cell data analysis apparatus according to an exemplary embodiment of the present application;
fig. 17 is a schematic diagram showing the structure of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
With the research and advancement of artificial intelligence technology, it has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, and smart customer service. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important role.
The scheme provided by the embodiment of the application relates to the technology of artificial intelligence such as machine learning, and the like, and is specifically described through the following embodiment.
Referring to FIG. 1, a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application is shown. The implementation environment includes a terminal 120 and a server 140. The data communication between the terminal 120 and the server 140 is performed through a communication network, alternatively, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 120 is an electronic device in which an application program having a cell data analysis function is installed. The cell data analysis function may be a function of an original application in the terminal, or a function of a third party application; the electronic device may be a smart phone, a tablet computer, a personal computer, a wearable device, a vehicle-mounted terminal, or the like, and in fig. 1, the terminal 120 is taken as an example of a personal computer, but the present application is not limited thereto.
The server 140 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligence platforms, and the like. In the embodiment of the present application, the server 140 may be a background server of an application having a cell data analysis function.
In one possible implementation, as shown in fig. 1, there is data interaction between the server 140 and the terminal 120. The terminal 120 obtains a tissue staining image and sequencing data corresponding to a tissue sample and sends them to the server 140. The server 140 performs reference line detection on the tissue staining image and the sequencing data to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image, performs image registration between the sequencing data and the tissue staining image according to the first reference line distribution and the second reference line distribution, determines the gene expression result corresponding to each cell, performs cell type annotation on each cell based on the gene expression result to obtain the cell analysis result corresponding to the tissue sample, and sends the cell analysis result to the terminal 120.
Referring to fig. 2, a flowchart of a method for analyzing cell data according to an exemplary embodiment of the present application is shown, where the method is used in a computer device (including a terminal 120 and/or a server 140), and the method includes the following steps:
step 201, obtaining a tissue staining image and sequencing data corresponding to the tissue sample, wherein the tissue staining image characterizes each cell contained in the tissue sample, and the sequencing data comprises gene expression data and spatial position data of each sequencing point, and each cell corresponds to at least one sequencing point.
Optionally, the tissue sample may be a pathological section such as a brain tissue section, a tumor tissue section, and the specific type of the tissue sample is not limited in the embodiment of the present application.
Sequencing data is obtained by sequencing the tissue sample with a gene sequencing device, and may optionally be acquired by a spatial transcriptome sequencing technique, including but not limited to: seqFISH, MERFISH, seqFISH+, osmFISH, Slide-seq, Visium, STARmap, and HDST. The embodiments of the present application are not limited to a particular type of sequencing data.
Alternatively, the tissue sample may contain on the order of millions to tens of millions of cells, and the spatial transcriptome may have sub-cellular sequencing precision, i.e., multiple sequencing points per cell.
In some embodiments, the sequencing data comprises gene expression data of each sequencing point and spatial position data, wherein the gene expression data is used for representing genes corresponding to transcripts obtained by sequencing at the sequencing points, and the spatial position data is used for representing spatial positions of the sequencing points in the tissue sample.
In one possible embodiment, given that detecting cells directly on the sequencing data may introduce large errors, the computer device may obtain a tissue staining image by additionally staining and imaging the tissue sample, determine the cell contour of each cell in the tissue sample, and map the detected cell contours to the spatial sequencing data by means of image registration.
The tissue staining image may be an image obtained with any staining method; optionally, staining methods include, but are not limited to, hematoxylin-eosin (H&E) staining, ssDNA staining, and 4',6-diamidino-2-phenylindole (DAPI) staining. The staining method may be determined based on the requirements of the actual application scenario, and embodiments of the present application are not limited herein.
Optionally, the tissue staining image characterizes each cell included in the tissue sample. The tissue staining image may be an imaging result at any original resolution; for example, the resolution may correspond to a 10x, 20x, or 40x microscope magnification, and may be determined based on the requirements of the actual application scenario, which embodiments of the present application do not limit.
Alternatively, the tissue staining image is a single-stranded deoxyribonucleic acid staining image (ssDNA image). Schematically, as shown in fig. 3, the spatial transcriptome sequencing data 301 is visualized such that each pixel represents a sequencing point and the brightness of the pixel indicates the total amount of gene expression captured at that sequencing point; the tissue staining image of the corresponding region is ssDNA image 302.
Step 202, performing reference line detection on the sequencing data and the tissue staining image respectively to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image, wherein the first reference line distribution is used for positioning each sequencing point in the sequencing data, and the second reference line distribution is used for positioning each cell in the tissue staining image.
Considering that it is difficult to guarantee the accuracy of the correspondence between gene expression and cells when the gene expression corresponding to each cell in the tissue sample is determined directly from the sequencing data, the computer device may introduce a tissue staining image corresponding to the tissue sample as an aid; by registering the tissue staining image onto the sequencing data and mapping the cell contours determined on the tissue staining image into the sequencing data, the cell to which each piece of gene expression data belongs can be determined.
In the related art, the mapping relation is obtained by calculating a transformation matrix between the pixel-value peaks or extrema of the tissue staining image and of an image converted from the sequencing data, but this requires that the tissue sample itself has obvious local morphological characteristics, so image registration is prone to large errors when the local morphological characteristics of the tissue sample are not obvious. In the embodiment of the application, reference lines are pre-etched on the base plate on which the tissue sample is placed, and reference line detection is performed on the sequencing data and the tissue staining image respectively, so that image registration between the sequencing data and the tissue staining image is realized according to the reference line distributions corresponding to the sequencing data and the tissue staining image.
In one possible embodiment, the reference lines may be pre-etched on the base plate on which the tissue sample is placed along the horizontal and vertical directions. Optionally, the spacings between adjacent reference lines in the horizontal and vertical directions may be the same or different; where the spacings differ, they may be arranged in a periodic pattern in order to improve the efficiency of image registration.
In one possible implementation, after acquiring the sequencing data, the computer device may perform reference line detection on the sequencing data according to a plurality of reference lines in horizontal and vertical directions on the placement base plate, so as to obtain a first reference line distribution corresponding to the sequencing data, and each sequencing point in the sequencing data may be located through the first reference line distribution.
In one possible implementation, the computer device may take a photograph of the tissue sample under the microscope, and after obtaining the tissue staining image, perform reference line detection on the tissue staining image, so as to obtain a second reference line distribution corresponding to the tissue staining image, and through the second reference line distribution, each cell in the tissue staining image may be located.
Illustratively, as shown in FIG. 4, the sequencing data corresponds to a first reference line distribution 401 and the tissue staining image (ssDNA image) corresponds to a second reference line distribution 402.
Step 203, performing image registration on the sequencing data and the tissue staining image based on the first reference line distribution and the second reference line distribution to obtain a gene expression result corresponding to each cell in the tissue sample.
In order to determine gene expression data corresponding to each cell in a tissue sample, after performing reference line detection on the sequencing data and the tissue staining image to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image, the computer device may map each cell contour on the tissue staining image to the sequencing data by means of image registration.
Optionally, each cell in the tissue sample corresponds to at least one sequencing point, and the gene expression data corresponding to a cell is the gene expression data of all the sequencing points corresponding to that cell. In one possible implementation, the computer device performs image registration on the sequencing data and the tissue staining image according to the first reference line distribution and the second reference line distribution, so that each cell contour is mapped onto the sequencing data to obtain the correspondence between cells and sequencing points; the gene expression data and spatial position data of all the sequencing points corresponding to each cell are then taken as the gene expression of that cell, yielding the gene expression result corresponding to each cell in the tissue sample.
And 204, performing cell type annotation on each cell based on the gene expression result to obtain a cell analysis result corresponding to the tissue sample.
Optionally, after determining the gene expression data corresponding to each cell in the tissue sample and obtaining the gene expression result, the computer device may further annotate each cell in the tissue sample with a cell type for further analysis of each cell in the tissue sample.
In one possible embodiment, after determining the gene expression results corresponding to the cells in the tissue sample, the computer device may annotate each cell with a cell type based on its gene expression result according to a single-cell sequencing data definition, wherein the single-cell sequencing data definition characterizes the correspondence between the gene expression data of different cells and cell types.
Alternatively, the single-cell sequencing data definition may come from publicly available single-cell data with annotation results, or may be a data definition obtained by clustering in-house single-cell data, which is not limited in the embodiments of the present application.
In one possible implementation manner, in order to improve accuracy of determining cell types, the computer device may further perform probability prediction on cell types of each cell in the tissue sample through a spatial transcriptome cell annotation method based on self-supervised learning, so as to determine cell types corresponding to each cell according to a probability prediction value, and perform cell type annotation on the cells, so as to obtain a cell analysis result corresponding to the tissue sample.
In summary, in the embodiment of the present application, in order to obtain a cell analysis result corresponding to a tissue sample, a tissue staining image and sequencing data corresponding to the tissue sample are first obtained, where the tissue staining image characterizes each cell included in the tissue sample and the sequencing data includes gene expression data and spatial position data of each sequencing point. Further, in order to determine the gene expression data corresponding to each cell, reference line detection is performed on the sequencing data and the tissue staining image to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image, and image registration is performed between the sequencing data and the tissue staining image according to the two distributions to obtain the gene expression result corresponding to each cell; cell type annotation is then performed on each cell to obtain the cell analysis result corresponding to the tissue sample. Because image registration between the sequencing data and the tissue staining image is realized by means of reference line detection, the accuracy of the image registration is improved, which in turn improves the efficiency and accuracy of cell data analysis.
In some embodiments, to improve the accuracy of the correspondence between cells and sequencing spots, the tissue staining image may also be segmented prior to mapping the tissue staining image to the sequencing data, thereby mapping the segmented cell profile to the sequencing data.
Referring to fig. 5, a flowchart of a method for analyzing cell data according to an exemplary embodiment of the present application is shown, where the method is used in a computer device (including a terminal 120 and/or a server 140), and the method includes the following steps:
step 501, obtaining a tissue staining image and sequencing data corresponding to a tissue sample, wherein the tissue staining image characterizes each cell contained in the tissue sample, and the sequencing data comprises gene expression data and spatial position data of each sequencing point, and each cell corresponds to at least one sequencing point.
Reference may be made to step 201 for the specific implementation of this step, and this embodiment is not described here in detail.
Step 502, inputting the tissue staining image into an instance segmentation model, and determining candidate cell areas in the tissue staining image through a candidate area detection network in the instance segmentation model.
Optionally, in order to improve the accuracy of the gene expression data corresponding to each cell, the computer device may first determine the outline and the position information corresponding to each cell in the tissue staining image, so as to determine the sequencing point data corresponding to each cell.
In one possible implementation, the computer device may determine the contour and position information corresponding to each cell by performing cell segmentation on the cells in the tissue staining image. In the related art, the cells and the background in a tissue staining image are distinguished by a watershed algorithm; however, the watershed algorithm is sensitive to image quality (cell staining effect, image brightness, and the like), making it difficult to distinguish impurities with a morphology similar to cells, and it is prone to problems such as over-segmentation due to pixel-value variation inside cells. In an embodiment of the application, the computer device uses an instance segmentation deep learning model to perform cell segmentation.
Optionally, the instance segmentation model may include a candidate region detection network for detecting cell regions that may exist in the tissue staining image, a region classification network for classifying cell regions and background regions, a bounding box regression network for adjusting the boundaries of the cell regions and background regions, and a segmentation network for delineating cell contours in the detected cell regions.
Alternatively, the instance segmentation model may be a Mask Region-based Convolutional Neural Network (Mask R-CNN) model, the candidate region detection network may be a RoI Align layer, and the region classification network and the bounding box regression network may be convolutional (Conv) layers. In one possible implementation, the computer device may obtain a training set constructed by manual annotation on a small number of tissue staining images and train the Mask R-CNN model on this training set, so that the Mask R-CNN model can perform more accurate cell segmentation on unannotated tissue staining images and obtain the contour and position information of all cells therein.
In one possible embodiment, the computer device inputs the tissue staining image into the instance segmentation model and first detects cell regions that may be present in the tissue staining image through the candidate region detection network in the instance segmentation model, thereby determining the candidate cell regions in the tissue staining image.
Step 503, discriminating and adjusting the candidate cell regions through the region classification network and the bounding box regression network in the instance segmentation model.
Further, in order to improve the accuracy of cell region determination, the computer device discriminates and adjusts the candidate cell regions through the region classification network and the bounding box regression network in the instance segmentation model, so as to divide the tissue staining image into cell regions and background regions.
Step 504, delineating the contours of the cells in the tissue staining image through the segmentation network in the instance segmentation model to obtain a cell segmentation result corresponding to the tissue staining image.
In one possible implementation, while dividing the cell regions, the computer device may further perform contour detection and delineation on each cell in the cell regions through the segmentation network in the instance segmentation model, so as to determine the contour and position information of all cells in the tissue staining image and obtain the cell segmentation result corresponding to the tissue staining image.
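To make the instance segmentation step concrete, the following is a minimal sketch that uses torchvision's off-the-shelf Mask R-CNN as a stand-in for the instance segmentation model described above; the two-class (background/cell) setup, the score threshold, and the checkpoint path are illustrative assumptions rather than details taken from the embodiment.

```python
# A minimal sketch, assuming a torchvision Mask R-CNN stands in for the
# instance segmentation model; class count, threshold and checkpoint are
# illustrative assumptions.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

def segment_cells(stain_image, weights_path=None, score_thresh=0.5):
    """stain_image: float tensor of shape (3, H, W) with values in [0, 1]."""
    # Two classes: background (0) and cell (1).
    model = maskrcnn_resnet50_fpn(num_classes=2)
    if weights_path is not None:
        # Checkpoint assumed to be fine-tuned on manually annotated tissue staining images.
        model.load_state_dict(torch.load(weights_path, map_location="cpu"))
    model.eval()
    with torch.no_grad():
        # Internally this runs candidate region detection (RPN + RoI Align),
        # the classification / box regression heads, and the mask head.
        pred = model([stain_image])[0]
    keep = pred["scores"] >= score_thresh
    masks = pred["masks"][keep, 0] > 0.5   # (N, H, W) boolean mask per detected cell
    boxes = pred["boxes"][keep]            # (N, 4) bounding boxes
    return masks, boxes
```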
Illustratively, as shown in fig. 6, the computer device inputs the tissue staining image 601 into an instance segmentation model 602, so that cell regions that may exist in the tissue staining image 601 are detected by the candidate region detection network 603 in the instance segmentation model 602, the cell regions and background regions are classified by the region classification network 604, the boundaries of the cell regions are adjusted by the bounding box regression network 605, and the cell contours within the cell regions are delineated by the segmentation network 606, thereby obtaining a cell segmentation result 607.
Fig. 7 is a schematic diagram of an exemplary cell segmentation result provided by the embodiment of the present application, where ssDNA image 701 is a tissue staining image; a reliable segmentation result is obtained even for densely packed cells in cell segmentation result 702, and image 703 is obtained by superimposing ssDNA image 701 and cell segmentation result 702. Verifying the instance segmentation model of the embodiment of the present application on a manually annotated test set of 10209 cells yields a precision of 0.912 and a recall of 0.913.
And step 505, performing image conversion on the sequencing data to obtain a sequencing image corresponding to the sequencing data.
Optionally, in order to improve the efficiency of reference line detection on the sequencing data, the computer device may process the sequencing data first, convert the sequencing data into a sequencing image, and then perform reference line detection on the sequencing image.
In one possible implementation, after obtaining the sequencing data corresponding to the tissue sample, the computer device may first determine, from the gene expression data of each sequencing point in the sequencing data, at least one gene contained at that sequencing point and the expression amount of each gene; each sequencing point is then treated as a pixel of the sequencing image, and the sum of the gene expression amounts at that sequencing point is taken as the pixel value of the corresponding pixel, so that the sequencing image is generated from the pixels and their pixel values.
Optionally, the bit depth of the sequencing image may be 8 bits, 16 bits, or 32 bits, and the specific bit depth may be determined based on the requirements of the actual application scenario, which is not limited in this embodiment of the present application. In one possible implementation, the sequencing image is a 16-bit image, so when determining the pixel value of each pixel, the computer device may filter out abnormally high values and keep only the portion in the range 0-65535, thereby generating the 16-bit sequencing image from the filtered pixel values.
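The conversion of sequencing data into a 16-bit sequencing image can be sketched as follows; this is only an illustrative reading of the step above, and the column names ('x', 'y', 'count') for the per-point gene expression records are assumptions.

```python
# A minimal sketch of converting sequencing points into a 16-bit sequencing image:
# each point becomes a pixel whose value is the summed gene expression at that
# point, with abnormally high values clipped to the 0-65535 range.
import numpy as np
import pandas as pd

def sequencing_data_to_image(df: pd.DataFrame) -> np.ndarray:
    """df: one row per (sequencing point, gene) with integer columns
    'x', 'y' (spatial position) and 'count' (expression of that gene)."""
    totals = df.groupby(["x", "y"])["count"].sum().reset_index()   # total expression per point
    height = int(totals["y"].max()) + 1
    width = int(totals["x"].max()) + 1
    image = np.zeros((height, width), dtype=np.uint16)
    values = np.clip(totals["count"].to_numpy(), 0, 65535).astype(np.uint16)
    image[totals["y"].to_numpy(), totals["x"].to_numpy()] = values
    return image
```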
And step 506, performing reference line detection on the sequencing image in the horizontal direction and the vertical direction respectively to obtain a first reference line distribution corresponding to the sequencing image.
Optionally, in the process of pre-engraving the reference line on the placement substrate of the tissue sample, in order to facilitate subsequent reference line detection, to improve the efficiency of reference line detection, reference line pre-engraving may be performed in two directions, horizontal and vertical, so that in the process of performing reference line detection on the sequencing image, the computer device may directly perform reference line detection on the sequencing image in the horizontal and vertical directions.
Alternatively, the computer device may use a Laplacian of Gaussian (LoG) operator for reference line detection on the sequencing image; the LoG operator combines Gaussian smoothing with a Laplacian edge detection operator and can detect the locations of abrupt intensity changes. In one possible implementation, the computer device examines the pixel values of the sequencing image along the horizontal and vertical directions with the Laplacian of Gaussian operator and determines the positions where the pixel values change abruptly as reference line positions, thereby obtaining the first reference line distribution corresponding to the sequencing image.
Optionally, in order to highlight the image features of the sequencing image as much as possible during reference line detection, the computer device may first adjust the brightness of the sequencing image, highlighting the image features by increasing the image brightness, and then apply the Laplacian of Gaussian operator to perform reference line detection in the horizontal and vertical directions respectively.
Schematically, as shown in fig. 8, the computer device obtains a sequencing image 801 by converting the sequencing data into an image and adjusts the brightness of the sequencing image 801 to obtain an adjusted sequencing image 802; brightness signals are then extracted from the adjusted sequencing image 802 to obtain a brightness signal extraction image 803, which is filtered with the Laplacian of Gaussian operator to obtain a filtered graph 804, from which the first reference line distribution 805 corresponding to the sequencing image 801 is determined. The abscissa and ordinate in the sequencing image 801 represent the coordinates of each pixel in the horizontal and vertical directions respectively, while in the brightness signal extraction image 803 and the filtered graph 804 the abscissa represents the coordinate of each pixel in a row or column and the ordinate represents the signal value extracted from the pixel values.
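A hedged sketch of the Laplacian-of-Gaussian reference line detection follows: the image is projected onto each axis to obtain a 1-D brightness profile, the LoG filter is applied to the profile, and strong responses are taken as reference line positions. The sigma value and the peak threshold are assumptions.

```python
# A minimal sketch of LoG-based reference line detection on the (brightness-adjusted)
# sequencing image; sigma and the relative peak threshold are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_laplace

def detect_reference_lines(image: np.ndarray, sigma: float = 3.0, rel_thresh: float = 0.5):
    """Returns candidate reference line coordinates along rows and columns."""
    lines = {}
    for axis, name in ((1, "rows"), (0, "cols")):
        profile = image.astype(np.float64).mean(axis=axis)    # 1-D brightness signal
        response = -gaussian_laplace(profile, sigma=sigma)    # peaks where brightness changes abruptly
        threshold = rel_thresh * response.max()
        # Local maxima above the threshold are taken as reference line positions.
        peaks = [i for i in range(1, len(response) - 1)
                 if response[i] > threshold
                 and response[i] >= response[i - 1]
                 and response[i] >= response[i + 1]]
        lines[name] = peaks
    return lines
```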
Step 507, performing reference line detection on the tissue staining image in the horizontal and vertical directions respectively, and determining a second reference line distribution corresponding to the tissue staining image.
Alternatively, the computer device may likewise use the Laplacian of Gaussian operator to perform reference line detection on the tissue staining image in both the horizontal and vertical directions.
In one possible embodiment, because the tissue staining image is acquired by photographing with a microscope, the reference lines may have a slight inclination, so reference line detection errors may occur when the tissue staining image is examined only along the horizontal and vertical directions. In order to improve the accuracy of reference line detection, the computer device may therefore rotate the tissue staining image by a first rotation angle, within a rotation angle threshold, during reference line detection; when the reference lines are detected to be horizontal or vertical, the computer device determines the current reference line detection result as the second reference line distribution corresponding to the tissue staining image and determines the current rotation angle as the first target rotation angle of the tissue staining image relative to the sequencing data.
In one illustrative example, during reference line detection of the tissue staining image along the horizontal and vertical directions using the Laplacian of Gaussian operator, the computer device rotates the tissue staining image multiple times in steps of a first rotation angle of 5 degrees; when a position with a clear abrupt change in pixel value is detected, i.e., the reference lines are horizontal or vertical, the computer device determines the current rotation angle as the first target rotation angle.
Optionally, to improve the efficiency of reference line detection, the computer device may further determine the first target rotation angle by a multi-step (coarse-to-fine) search. In one possible implementation, the computer device first rotates the tissue staining image in steps of a first rotation angle of 5 degrees; once a clear abrupt change in pixel value is detected, the computer device reduces the rotation step to 1 degree, so as to determine the rotation angle of the tissue staining image at which the reference lines are exactly horizontal or vertical and determine it as the first target rotation angle.
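The coarse-to-fine search for the first target rotation angle can be sketched as below: the tissue staining image is rotated in 5-degree steps, each rotation is scored by how sharp the axis-aligned LoG response is, and the search is then refined in 1-degree steps around the best coarse angle. The scoring criterion and the angle range are assumptions; the embodiment does not spell out an exact objective.

```python
# A minimal sketch of the multi-step (coarse-to-fine) rotation search; the
# alignment score and the search range are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_laplace, rotate

def alignment_score(image: np.ndarray, sigma: float = 3.0) -> float:
    # When the reference lines are exactly horizontal/vertical, the projected
    # 1-D profiles exhibit the strongest LoG responses.
    score = 0.0
    for axis in (0, 1):
        profile = image.astype(np.float64).mean(axis=axis)
        score += float(np.max(np.abs(gaussian_laplace(profile, sigma=sigma))))
    return score

def find_first_target_rotation(image: np.ndarray) -> float:
    coarse = max(np.arange(-45, 46, 5),
                 key=lambda a: alignment_score(rotate(image, float(a), reshape=False)))
    fine = max(np.arange(coarse - 4, coarse + 5, 1),
               key=lambda a: alignment_score(rotate(image, float(a), reshape=False)))
    return float(fine)
```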
Schematically, as shown in fig. 9, a graph for extracting a pixel value signal using a laplacian gaussian operator based on a first rotation angle is displayed in a first tissue staining image 901, and a graph for extracting a pixel value signal using a laplacian gaussian operator based on a second rotation angle is displayed in a second tissue staining image 902.
In one possible implementation, for a tissue sample with an oversized field of view, that is, when the sequencing image and the tissue staining image are large, the computer device may further split the sequencing image and the tissue staining image into at least two sequencing sub-images and at least two tissue staining sub-images in order to improve the accuracy of reference line detection, where each sequencing sub-image is smaller than the sequencing image and each tissue staining sub-image is smaller than the tissue staining image. The computer device then performs reference line detection on the at least two sequencing sub-images and the at least two tissue staining sub-images respectively to obtain a third reference line distribution corresponding to the at least two sequencing sub-images and a fourth reference line distribution corresponding to the at least two tissue staining sub-images. The at least two sequencing sub-images are then matched by reference lines based on the third reference line distribution to obtain the first reference line distribution corresponding to the sequencing image, and the at least two tissue staining sub-images are matched by reference lines based on the fourth reference line distribution to obtain the second reference line distribution corresponding to the tissue staining image.
In one possible implementation, when matching the reference lines of the at least two sequencing sub-images based on the third reference line distribution, the computer device may match, filter, and check the reference lines between adjacent sequencing sub-images according to their adjacency, thereby obtaining the first reference line distribution corresponding to the sequencing image.
In one possible implementation, in order to determine the first target rotation angle of the tissue staining image relative to the sequencing image, the computer device may further determine, while performing reference line detection on the at least two tissue staining sub-images, at least two second target rotation angles of the at least two tissue staining sub-images relative to the at least two sequencing sub-images, where the reference line direction of a tissue staining sub-image rotated by its second target rotation angle is consistent with the reference line direction of the corresponding sequencing sub-image and is horizontal or vertical. The computer device filters out abnormal values among the at least two second target rotation angles and determines the average of the remaining second target rotation angles as the first target rotation angle of the tissue staining image relative to the sequencing data.
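Combining the per-sub-image second target rotation angles into the first target rotation angle can be sketched as follows; the outlier rule (a median absolute deviation filter) is an assumption, since the embodiment only states that abnormal values are filtered before averaging.

```python
# A minimal sketch of filtering abnormal per-tile rotation angles and averaging
# the rest; the MAD-based outlier rule is an illustrative assumption.
import numpy as np

def aggregate_rotation_angles(tile_angles, mad_factor: float = 3.0) -> float:
    angles = np.asarray(tile_angles, dtype=np.float64)
    median = np.median(angles)
    mad = np.median(np.abs(angles - median)) + 1e-9
    kept = angles[np.abs(angles - median) <= mad_factor * mad]   # drop outliers
    return float(kept.mean())                                    # first target rotation angle
```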
Schematically, fig. 10 shows a schematic diagram of global reference line detection results on an ssDNA image according to an exemplary embodiment of the present application, where the left panel is the ssDNA image of a complete tissue sample, the middle panel is a stitched display of the reference line detection results of each sub-image (the reference lines detected in different sub-images are shown in different colors), and the right panel is the reference line distribution obtained after the reference lines of all sub-regions are checked and connected.
Step 508, performing image registration on the sequencing data and the tissue staining image based on the first reference line distribution and the second reference line distribution to obtain a gene expression result corresponding to each cell in the tissue sample.
In one possible implementation, after obtaining the first reference line distribution corresponding to the sequencing data, the second reference line distribution corresponding to the tissue staining image, and the first target rotation angle of the tissue staining image relative to the sequencing data, the computer device may rotate the tissue staining image by the first target rotation angle to obtain a rotated tissue staining image, whose reference line direction is consistent with that of the sequencing data and is horizontal or vertical.
Further, considering that the tissue staining image is obtained by photographing through a microscope, the tissue staining image and the sequencing data may differ in scale. In order to improve the accuracy of image registration between them, the computer device may further scale and translate the rotated tissue staining image according to the first reference line distribution and the second reference line distribution, so that the rotated tissue staining image and the sequencing data are registered by aligning the reference lines; that is, each cell contour on the tissue staining image is mapped onto the sequencing data, the gene expression data of each sequencing point is assigned to the corresponding cell or to the background, and the gene expression data corresponding to each cell is determined, thereby obtaining the gene expression result corresponding to each cell.
In one possible embodiment, in order to improve the efficiency and accuracy of image registration and avoid the misalignment that equal reference line spacings can cause when aligning the reference lines, the reference lines may be pre-etched according to a reference line period, that is, the spacings between reference lines within one period are unequal. In the process of scaling and translating the rotated tissue staining image, the computer device may then scale the rotated tissue staining image according to the reference line period and translate the scaled tissue staining image according to the position of the first reference line within the same period, so as to realize image registration between the tissue staining image and the sequencing data, where the first target rotation angle, the relative scaling, and the translation data constitute the registration parameters of the tissue staining image relative to the sequencing data.
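The final mapping step can be sketched as follows: after the rotation, scaling, and translation implied by the matched reference lines, each sequencing point is assigned to the cell whose registered contour covers it, and its expression counts are summed into that cell's profile. The affine parameterization and the data layout are assumptions rather than the embodiment's exact interfaces.

```python
# A minimal sketch of assigning sequencing points to registered cell contours and
# aggregating gene expression per cell; the registration parameters (angle, scale,
# shift) and input layout are illustrative assumptions.
import numpy as np
from collections import defaultdict

def aggregate_expression_per_cell(points, cell_label_map, angle_deg, scale, shift):
    """points: iterable of (x, y, gene, count) in sequencing coordinates.
    cell_label_map: 2-D int array in registered tissue staining coordinates,
    0 = background, k > 0 = cell id k."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    per_cell = defaultdict(lambda: defaultdict(int))
    height, width = cell_label_map.shape
    for x, y, gene, count in points:
        # Map the sequencing point into the registered tissue staining frame.
        u, v = scale * (rot @ np.array([x, y], dtype=np.float64)) + np.asarray(shift)
        col, row = int(round(u)), int(round(v))
        if 0 <= row < height and 0 <= col < width:
            cell_id = int(cell_label_map[row, col])
            if cell_id > 0:                       # points falling on background are dropped
                per_cell[cell_id][gene] += count
    return per_cell                                # {cell_id: {gene: total expression}}
```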
Schematically, fig. 11 shows a schematic diagram of image registration results provided by an exemplary embodiment of the present application, where the first row shows four exemplary regions and the second row shows partial enlargements of the images.
Schematically, fig. 12 shows a schematic diagram of an image-registered ssDNA image and a cell segmentation result provided by an exemplary embodiment of the present application, where the first left image is the registered ssDNA image, the second left image is the registered cell segmentation result, the second right image is the visualized original spatial transcriptome sequencing data, and the first right image is the result displayed by superposition of the first three images.
Step 509, performing cell type prediction on each cell in the tissue sample through the deep neural network, so as to obtain a first cell type probability value corresponding to each cell output by the deep neural network.
Alternatively, to determine the cell type of each cell in the tissue sample as accurately as possible, the computer device may annotate each cell with a cell type by means of a Spatial-ID (spatial cell type identification) transfer annotation method.
In one possible embodiment, the computer device predicts cell types of individual cells in the tissue sample via a deep neural network (Deep Neural Networks, DNN) to obtain a first cell type probability value corresponding to the individual cells.
Optionally, the input of the deep neural network is the gene expression result corresponding to each cell, and the output is the first cell type probability value corresponding to each cell.
Alternatively, the computer device may train the deep neural network on a sample single-cell dataset. In one possible implementation, the computer device obtains a sample single-cell dataset together with the sample gene expression and sample cell type corresponding to each sample cell, inputs them into the deep neural network, and outputs, through the deep neural network, a first sample cell type probability value corresponding to each sample cell in the dataset. Further, when the sample cell types are imbalanced, the loss function is dominated during training by the types with more samples, so that the loss is small while the recognition accuracy for the rarer types is low. Therefore, in order to alleviate this type-imbalance problem when there are many cell types, the computer device can count the number of cells of each sample cell type based on the sample cell types and the first sample cell type probability values, take the reciprocal of each cell count as the weight of the corresponding sample cell type, and train the deep neural network with a class-weighted first cross entropy loss (Weighted Cross Entropy Loss).
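The class-weighted first cross entropy loss described above can be sketched as below, with each cell type's weight set to the reciprocal of its cell count so that rare types are not drowned out; the PyTorch framing and function names are assumptions, and the network itself is omitted.

```python
# A minimal sketch of the class-weighted cross entropy loss (weights = reciprocal
# of per-type cell counts); PyTorch usage is an illustrative assumption.
import torch
import torch.nn.functional as F

def class_weighted_cross_entropy(logits: torch.Tensor, labels: torch.Tensor, num_types: int) -> torch.Tensor:
    """logits: (N, num_types) cell type scores; labels: (N,) sample cell type ids."""
    counts = torch.bincount(labels, minlength=num_types).float()
    weights = 1.0 / counts.clamp(min=1.0)           # reciprocal of each type's cell count
    return F.cross_entropy(logits, labels, weight=weights)
```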
Step 510, predicting the cell type of each cell in the tissue sample through the graph convolutional network, so as to obtain a second cell type probability value corresponding to each cell output by the graph convolutional network, wherein the graph convolutional network is composed of a deep auto-encoder, a graph auto-encoder and a classifier.
Optionally, in order to further improve the accuracy of cell type prediction, the computer device may further perform cell type prediction on each cell in the tissue sample through a graph convolutional network (Graph Convolutional Network, GCN), so as to obtain a second cell type probability value corresponding to each cell output by the graph convolutional network, where the graph convolutional network is formed by a deep auto-encoder (Deep Auto-Encoder, DAE), a graph auto-encoder (Graph Auto-Encoder, GAE), and a classifier.
In one possible implementation, the computer device uses the graph convolutional network to learn an encoding of the gene expression and spatial neighborhood information of the spatial transcriptome cells, and determines a second cell type probability value corresponding to each cell, which is used to correct the first cell type probability value output by the deep neural network.
Optionally, the input of the graph convolutional network is the gene expression result corresponding to each cell, and the output is the second cell type probability value corresponding to each cell.
Alternatively, the computer device may train the graph convolutional network on a sample single-cell dataset. In one possible embodiment, the computer device obtains the sample single-cell dataset, the sample gene expression corresponding to the sample single-cell dataset, and the sample cell types, inputs the three into the graph convolutional network, and outputs, through the graph convolutional network, a second sample cell type probability value corresponding to each sample cell in the sample single-cell dataset. Further, considering that the graph convolutional network is composed of three parts, namely the deep auto-encoder, the graph auto-encoder and the classifier, the computer device needs to determine the losses corresponding to the deep auto-encoder, the graph auto-encoder and the classifier, respectively, based on the sample cell types and the second sample cell type probability values. The deep auto-encoder and the graph auto-encoder correspond to reconstruction losses (Reconstruction Loss); and, considering that the first cell type probability values obtained through the deep neural network have already been class-weighted, the computer device may, for the classifier, determine its loss as a second cross entropy loss for soft label output (Soft Label Cross Entropy Loss), train the deep auto-encoder and the graph auto-encoder with the reconstruction losses, and train the classifier with the second cross entropy loss for soft label output. A minimal sketch of these losses is given below.
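As an illustrative aid, the following minimal sketch shows plausible forms of the two losses described above: a mean-squared reconstruction loss for the deep auto-encoder and graph auto-encoder branches (one reasonable choice, not prescribed by the embodiment) and a cross entropy loss against soft labels for the classifier, where the soft labels are the first cell type probability values produced by the deep neural network.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(x, x_reconstructed):
    """Reconstruction loss for the deep auto-encoder / graph auto-encoder;
    mean squared error is used here as one plausible choice."""
    return F.mse_loss(x_reconstructed, x)

def soft_label_cross_entropy(logits, soft_targets):
    """Second cross entropy loss for soft label output: cross entropy of the
    classifier's predictions against the (already class-weighted) first cell
    type probability values used as soft targets."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

# Hypothetical training step, assuming `dae_out` and `gae_out` reconstruct
# the gene expression matrix `expr`, `cls_logits` are the classifier outputs,
# and `dnn_probs` are the deep neural network's first probability values:
#   total = reconstruction_loss(expr, dae_out) \
#         + reconstruction_loss(expr, gae_out) \
#         + soft_label_cross_entropy(cls_logits, dnn_probs)
#   total.backward()
```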
Illustratively, as shown in fig. 13, the computer device inputs the sample single-cell dataset 1301 into the deep neural network 1302 and outputs a first sample cell type probability value 1303 through the deep neural network 1302, thereby determining the class-weighted first cross entropy loss based on the first sample cell type probability value 1303 and the sample cell type 1309, and training the deep neural network 1302 with the class-weighted first cross entropy loss; and inputs the sample single-cell dataset 1301 into the graph convolutional network 1304 and outputs a second sample cell type probability value 1308 through the deep auto-encoder 1305, the graph auto-encoder 1306 and the classifier 1307 in the graph convolutional network 1304, thereby determining the second cross entropy loss for soft label output based on the second sample cell type probability value 1308 and the sample cell type 1309, and training the classifier 1307 with the second cross entropy loss for soft label output.
Step 511, correcting the first cell type probability value by using the second cell type probability value to obtain the cell type prediction result corresponding to each cell.
Further, after obtaining the first cell type probability value output by the deep neural network and the second cell type probability value output by the graph convolutional network, the computer device corrects the first cell type probability value with the second cell type probability value, so as to obtain the cell type probability value corresponding to each cell, and determines the cell type prediction result corresponding to each cell according to the corrected cell type probability value.
Step 512, performing cell type annotation on each cell based on the cell type prediction result to obtain a cell analysis result corresponding to the tissue sample.
Optionally, the cell type prediction result of each cell corresponds to its gene expression result. In one possible implementation, in the process of performing cell type annotation on the cells, the computer device may determine, according to the cell type prediction result, the cell type with the largest probability value as the target cell type of the current cell, so as to annotate the cell and obtain the cell analysis result corresponding to the tissue sample.
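As an illustrative aid, the following minimal sketch shows one plausible way of correcting the first cell type probability values with the second cell type probability values and then annotating each cell with the type of largest probability. The convex-combination correction rule and the mixing weight alpha are assumptions of this sketch; the embodiment does not fix the exact correction formula here.

```python
import torch

def correct_probabilities(p_dnn, p_gcn, alpha=0.5):
    """Correct the first cell type probability values (deep neural network)
    with the second cell type probability values (graph convolutional
    network).  A simple convex combination is used here as one plausible
    choice; `alpha` is a hypothetical mixing weight."""
    p = (1.0 - alpha) * p_dnn + alpha * p_gcn
    return p / p.sum(dim=-1, keepdim=True)      # renormalize per cell

def annotate_cells(p_corrected, type_names):
    """For each cell, take the type with the largest corrected probability
    value as the target cell type."""
    idx = p_corrected.argmax(dim=-1)
    return [type_names[i] for i in idx.tolist()]
```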
Schematically, fig. 14 shows a schematic diagram of cell type annotation results provided by an exemplary embodiment of the present application, taking the case that the tissue sample is located in the mouse brain cortex as an example, where image 1401 is the result of directly applying Spatial-ID for cell type annotation, image 1402 is the cell type annotation result obtained after adding category weights to the loss function when training the deep neural network, image 1403 is the cell type annotation result obtained after training the deep neural network with the class-weighted first cross entropy loss, and image 1404 is the cell type annotation result obtained after training the classifier with the second cross entropy loss for soft label output and correcting the first cell type probability value with the second cell type probability value.
Optionally, the cell analysis results include at least the contour and location information, gene expression data, and cell type of each cell in the tissue sample. In one possible embodiment, the computer device may further perform downstream analysis on the cell analysis results; for example, the contour, location information and cell type of the cells may be used for analyzing cell morphology; the location information and gene expression data of the cells may be used for analyzing cell interactions, the cell microenvironment, and the like; and the gene expression data and cell type of the cells may be used for analyzing Marker gene significance, and the like.
In the above embodiments, in the process of determining the cell contours, cell segmentation is performed on the tissue staining image using the instance segmentation network, which improves the accuracy of cell contour segmentation in the cell segmentation result; before image registration with the sequencing data, the sequencing data are converted into a sequencing image and reference line detection is then performed on the sequencing image, which improves the efficiency and accuracy of reference line detection; and by determining the reference line distributions corresponding to the tissue staining image and the sequencing image and performing image registration between them according to the reference line distributions, the accuracy of image registration is improved, which in turn improves the accuracy of the gene expression results corresponding to the cells.
In addition, probability prediction of the cell type of each cell is performed through both the deep neural network and the graph convolutional network, which improves the accuracy of cell type annotation; and by training the deep neural network with the class-weighted first cross entropy loss, the accuracy of cell type prediction by the deep neural network in the case of unbalanced cell type numbers is optimized, thereby improving the quality of the cell analysis results.
Referring to fig. 15, a flowchart of a method for analyzing cell data according to another exemplary embodiment of the present application is shown.
Firstly, the computer device obtains, from a tissue sample 1501, sequencing data 1503 corresponding to the tissue sample 1501 through spatial transcriptome technology, and obtains a tissue staining image 1502 corresponding to the tissue sample 1501 through microscope photographing and image staining techniques. The computer device then performs cell segmentation on the tissue staining image 1502 through the instance segmentation model to obtain a cell segmentation result 1505 corresponding to the tissue sample 1501, performs image conversion based on the sequencing data to obtain a sequencing image, and performs reference line detection on the sequencing image and the tissue staining image 1502 to obtain their respective reference line distributions 1506, so that the cell contours in the tissue staining image 1502 are mapped into the sequencing data 1503 to obtain a gene expression result 1507 corresponding to each cell. The computer device further determines the cell type corresponding to each cell in the tissue sample 1501 according to the gene expression results 1507 and a sample single-cell dataset 1504, performs cell type annotation to obtain a cell type annotation result 1508, and finally determines a cell analysis result 1509 corresponding to the tissue sample 1501.
In combination with the above embodiments, the method for analyzing cell data according to the embodiments of the present application can be divided into three steps of cell segmentation, image registration and cell type annotation.
In the cell segmentation process, segmenting the cell contours with the instance segmentation model improves the accuracy of cell contour segmentation in regions where cells are dense; in the image registration process, reference lines are pre-engraved on the base plate on which the tissue sample is placed, reference line detection is performed on the sequencing image and the tissue staining image, and the tissue staining image is rotated, scaled and translated based on the reference line distributions, so that the registration parameters between the sequencing image and the tissue staining image are determined, which improves the efficiency and accuracy of image registration; in the cell type annotation process, training the deep neural network with the class-weighted first cross entropy loss and training the classifier in the graph convolutional network with the second cross entropy loss for soft label output optimizes the accuracy of cell type prediction in the case of unbalanced cell type numbers.
By adopting the cell data analysis method provided by the embodiments of the present application, cell-level analysis is performed on ultra-high-resolution, large-scale spatial transcriptome data, realizing the complete upstream processing pipeline from gene expression encapsulation to cell type annotation.
Referring to fig. 16, a block diagram of a cell data analysis apparatus according to an exemplary embodiment of the present application is shown, the apparatus comprising:
an obtaining module 1601, configured to obtain a tissue staining image and sequencing data corresponding to a tissue sample, where the tissue staining image characterizes each cell included in the tissue sample, and the sequencing data includes gene expression data and spatial position data of each sequencing point, and each cell corresponds to at least one sequencing point;
a first reference line detection module 1602, configured to perform reference line detection on the sequencing data and the tissue staining image respectively, to obtain a first reference line distribution corresponding to the sequencing data, and a second reference line distribution corresponding to the tissue staining image, where the first reference line distribution is used to locate each sequencing point in the sequencing data, and the second reference line distribution is used to locate each cell in the tissue staining image;
the image registration module 1603 is configured to perform image registration on the sequencing data and the tissue staining image based on the first reference line distribution and the second reference line distribution, so as to obtain a gene expression result corresponding to each cell in the tissue sample;
And a type annotation module 1604, configured to annotate cell types of the cells based on the gene expression result, so as to obtain a cell analysis result corresponding to the tissue sample.
Optionally, the first reference line detection module 1602 includes:
the image conversion unit is used for carrying out image conversion on the sequencing data to obtain a sequencing image corresponding to the sequencing data;
the first reference line detection unit is used for respectively carrying out reference line detection on the sequencing image in the horizontal direction and the vertical direction to obtain the first reference line distribution corresponding to the sequencing image;
and the second reference line detection unit is used for respectively carrying out reference line detection on the tissue staining image in the horizontal direction and the vertical direction and determining the second reference line distribution corresponding to the tissue staining image.
Optionally, the image conversion unit is configured to:
determining at least one gene at each sequencing point and a gene expression level corresponding to the at least one gene based on the sequencing data;
determining the sequencing point as a pixel point corresponding to the sequencing image;
determining the sum of the gene expression amounts at the sequencing points as a pixel value corresponding to the pixel point;
and generating the sequencing image based on the pixel points and the pixel values corresponding to the pixel points (a minimal sketch of this conversion is given below).
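As an illustrative aid, the following minimal sketch shows the conversion performed by the image conversion unit: each sequencing point becomes one pixel of the sequencing image, and its pixel value is the sum of the gene expression amounts captured at that point. The array names and shapes are assumptions of this sketch.

```python
import numpy as np

def sequencing_data_to_image(xs, ys, expression_counts, shape):
    """xs, ys: integer spatial coordinates of each sequencing record;
    expression_counts: gene expression amount of each record;
    shape: (height, width) of the sequencing image to generate."""
    image = np.zeros(shape, dtype=np.float32)
    # Accumulate the expression of all genes recorded at the same point.
    np.add.at(image, (ys, xs), expression_counts)
    return image
```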
Optionally, the second reference line detection unit is configured to:
rotating the tissue staining image at a first rotation angle based on a rotation angle threshold in the process of performing reference line detection on the tissue staining image in the horizontal and vertical directions, respectively;
and under the condition that the reference lines are detected to be in a horizontal or vertical state, determining the current reference line detection result as the second reference line distribution corresponding to the tissue staining image, and determining the current rotation angle as the first target rotation angle of the tissue staining image relative to the sequencing data (a minimal sketch of this rotation search is given below).
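As an illustrative aid, the following minimal sketch shows a rotation search of the kind described above: the tissue staining image is rotated in fixed angular steps within a threshold, and the angle at which the reference lines become horizontal or vertical is kept as the first target rotation angle. The projection-variance score used here to judge axis alignment is an assumption of this sketch, not the reference line detector of the embodiment.

```python
import numpy as np
from scipy.ndimage import rotate

def find_first_target_rotation(gray_image, angle_step=0.5, max_angle=45.0):
    """Rotate the image step by step and return the angle at which the
    reference lines are best aligned with the horizontal/vertical axes."""
    best_angle, best_score = 0.0, -np.inf
    for angle in np.arange(-max_angle, max_angle + 1e-9, angle_step):
        rotated = rotate(gray_image, angle, reshape=False, order=1)
        # When the lines are axis-aligned, their pixels concentrate in a few
        # rows/columns, so the variance of the projections peaks.
        score = np.var(rotated.sum(axis=0)) + np.var(rotated.sum(axis=1))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```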
Optionally, the image registration module 1603 is configured to:
rotating the tissue staining image based on the first target rotation angle to obtain a rotated tissue staining image, wherein the reference line direction corresponding to the rotated tissue staining image is consistent with the reference line direction corresponding to the sequencing data;
and scaling and translating the rotated tissue staining image based on the first reference line distribution and the second reference line distribution, and performing image registration on the rotated tissue staining image and the sequencing data in a reference-line-alignment manner, to obtain the gene expression result corresponding to each cell in the tissue sample.
Optionally, the apparatus further includes:
the image segmentation module is used for carrying out image segmentation on the sequencing image and the tissue staining image to obtain at least two sequencing sub-images and at least two tissue staining sub-images;
the second reference line detection module is used for respectively carrying out reference line detection on the at least two sequencing sub-images and the at least two tissue staining sub-images to obtain a third reference line distribution corresponding to the at least two sequencing sub-images and a fourth reference line distribution corresponding to the at least two tissue staining sub-images;
the first reference line matching module is used for performing reference line matching on the at least two sequencing sub-images based on the third reference line distribution to obtain the first reference line distribution corresponding to the sequencing image;
and the second reference line matching module is used for performing reference line matching on the at least two tissue staining sub-images based on the fourth reference line distribution to obtain the second reference line distribution corresponding to the tissue staining image.
Optionally, the apparatus further includes:
the rotation angle determining module is used for determining at least two second target rotation angles of the at least two tissue staining sub-images relative to the at least two sequencing sub-images, wherein the reference line direction corresponding to the tissue staining sub-images obtained based on the rotation of the second target rotation angles is consistent with the reference line direction corresponding to the sequencing sub-images;
the filtering module is used for filtering abnormal values contained in the at least two second target rotation angles;
and the rotation angle determination module is further used for determining the average value of the filtered second target rotation angles as the first target rotation angle of the tissue staining image relative to the sequencing data (a minimal sketch of this filtering and averaging is given below).
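As an illustrative aid, the following minimal sketch shows one plausible way of filtering abnormal values among the per-sub-image second target rotation angles before averaging them into the first target rotation angle. The median-absolute-deviation rule is an assumption of this sketch; the embodiment does not prescribe a specific outlier filter.

```python
import numpy as np

def fuse_rotation_angles(angles, mad_factor=3.0):
    """Drop outliers by a median-absolute-deviation rule, then average the
    remaining second target rotation angles."""
    angles = np.asarray(angles, dtype=float)
    median = np.median(angles)
    mad = np.median(np.abs(angles - median)) + 1e-9
    kept = angles[np.abs(angles - median) <= mad_factor * mad]
    return kept.mean()

# Hypothetical per-sub-image estimates (degrees), including one outlier:
print(fuse_rotation_angles([1.9, 2.1, 2.0, 2.2, 17.5]))   # ~= 2.05
```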
Optionally, after the obtaining the tissue staining image and the sequencing data corresponding to the tissue sample, the apparatus further includes:
the region detection module is used for inputting the tissue staining image into an instance segmentation model and determining candidate cell regions in the tissue staining image through a candidate region detection network in the instance segmentation model;
the region dividing module is used for judging and adjusting the candidate cell regions through a region classification network and a bounding box regression network in the instance segmentation model;
and the contour demarcation module is used for demarcating the contours of the cells in the tissue staining image through a segmentation network in the instance segmentation model to obtain the cell segmentation result corresponding to the tissue staining image (a minimal sketch of this segmentation step is given below).
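As an illustrative aid, the following minimal sketch uses a Mask R-CNN style model, whose region proposal network, classification and bounding box regression heads, and mask head mirror the candidate region detection network, region classification network, bounding box regression network and segmentation network described above. The use of torchvision's reference model, the two-class setting (background vs. cell) and the confidence threshold are assumptions of this sketch.

```python
import torch
import torchvision

# Untrained Mask R-CNN with two classes: background and cell.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    weights=None, num_classes=2)
model.eval()

# tissue_image: a 3 x H x W float tensor in [0, 1] (hypothetical input).
tissue_image = torch.rand(3, 512, 512)
with torch.no_grad():
    outputs = model([tissue_image])[0]

# Keep confident detections; each mask delineates one cell contour/region.
keep = outputs["scores"] > 0.5
cell_masks = outputs["masks"][keep]    # (n_cells, 1, H, W) soft masks
cell_boxes = outputs["boxes"][keep]    # (n_cells, 4) bounding boxes
```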
Optionally, the type annotation module 1604 is configured to:
cell type prediction is carried out on each cell in the tissue sample through a deep neural network, and a first cell type probability value corresponding to each cell output by the deep neural network is obtained;
cell type prediction is carried out on each cell in the tissue sample through a graph convolutional network, a second cell type probability value corresponding to each cell output by the graph convolutional network is obtained, and the graph convolutional network consists of a deep auto-encoder, a graph auto-encoder and a classifier;
correcting the first cell type probability value by using the second cell type probability value to obtain a cell type prediction result corresponding to each cell;
and carrying out cell type annotation on each cell based on the cell type prediction result to obtain a cell analysis result corresponding to the tissue sample.
Optionally, the apparatus further includes:
the first probability output module is used for inputting a sample single-cell data set, sample gene expression corresponding to the sample single-cell data set and sample cell types into the deep neural network, and outputting first sample cell type probability values corresponding to all sample cells in the sample single-cell data set through the deep neural network;
the first loss determination module is used for counting the number of cells corresponding to each sample cell type based on the sample cell type and the first sample cell type probability value to obtain a first cross entropy loss based on category weighting;
And the first training module is used for training the deep neural network with the first cross entropy loss based on the category weighting.
Optionally, the apparatus further includes:
the second probability output module is used for inputting a sample single-cell dataset, the sample gene expression corresponding to the sample single-cell dataset and the sample cell types into the graph convolutional network, and outputting second sample cell type probability values corresponding to each sample cell in the sample single-cell dataset through the graph convolutional network;
the second loss determination module is used for determining, based on the sample cell types and the second sample cell type probability values, the reconstruction losses corresponding to the deep auto-encoder and the graph auto-encoder in the graph convolutional network, and the second cross entropy loss corresponding to the classifier and used for soft label output;
and the second training module is used for training the deep auto-encoder and the graph auto-encoder according to the reconstruction losses and training the classifier according to the second cross entropy loss for soft label output.
In summary, in the embodiments of the present application, in order to obtain the cell analysis result corresponding to a tissue sample, the sequencing data and the tissue staining image corresponding to the tissue sample are first obtained, where the tissue staining image characterizes each cell contained in the tissue sample and the sequencing data includes gene expression data and spatial position data of each sequencing point. Further, in order to determine the gene expression data corresponding to each cell, reference line detection is performed on the sequencing data and the tissue staining image to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image, and image registration is performed on the sequencing data and the tissue staining image according to the first reference line distribution and the second reference line distribution, so as to obtain the gene expression result corresponding to each cell; cell type annotation is then performed on each cell to obtain the cell analysis result corresponding to the tissue sample. Realizing the image registration between the sequencing data and the tissue staining image by means of reference line detection improves the accuracy of the image registration between them, and further improves the efficiency and accuracy of cell data analysis.
It should be noted that: the apparatus provided in the above embodiment is only exemplified by the division of the above functional modules; in practical application, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided in the foregoing belong to the same concept; their detailed implementation processes are described in the method embodiments and are not repeated herein.
Referring to fig. 17, a schematic diagram of a computer device according to an exemplary embodiment of the present application is shown. Specifically, the computer device 1700 includes a central processing unit (Central Processing Unit, CPU) 1701, a system memory 1704 including a random access memory 1702 and a read only memory 1703, and a system bus 1705 connecting the system memory 1704 and the central processing unit 1701. The computer device 1700 also includes a basic input/output system (I/O system) 1706, which facilitates the transfer of information between the various devices within the computer, and a mass storage device 1707 for storing an operating system 1713, application programs 1714, and other program modules 1715.
The basic input/output system 1706 includes a display 1708 for displaying information and an input device 1709, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1708 and input device 1709 are coupled to the central processing unit 1701 through an input output controller 1710 coupled to the system bus 1705. The basic input/output system 1706 may also include an input/output controller 1710 for receiving and processing input from a keyboard, mouse, or electronic stylus, among many other devices. Similarly, the input output controller 1710 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1707 is connected to the central processing unit 1701 through a mass storage controller (not shown) connected to the system bus 1705. The mass storage device 1707 and its associated computer-readable media provide non-volatile storage for the computer device 1700. That is, the mass storage device 1707 may include a computer readable medium (not shown), such as a hard disk or drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), flash memory or other solid-state memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 1704 and the mass storage device 1707 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1701, the one or more programs containing instructions for implementing the methods described above, the central processing unit 1701 executing the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the application, the computer device 1700 may also operate by being connected to a remote computer on a network, such as the Internet. That is, the computer device 1700 may connect to the network 1711 through a network interface unit 1712 connected to the system bus 1705, or may connect to other types of networks or remote computer systems (not shown) using the network interface unit 1712.
The embodiment of the application also provides a computer readable storage medium, wherein at least one instruction is stored in the readable storage medium, and the at least one instruction is loaded and executed by a processor to realize the cell data analysis method described in the above embodiment.
Alternatively, the computer-readable storage medium may include: a ROM, a RAM, a solid state drive (Solid State Drive, SSD), an optical disc, or the like. The RAM may include, among other things, a resistive random access memory (Resistance Random Access Memory, ReRAM) and a dynamic random access memory (Dynamic Random Access Memory, DRAM).
Embodiments of the present application provide a computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the cell data analysis method described in the above embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application, but is intended to cover all modifications, equivalents, alternatives, and improvements falling within the spirit and principles of the application.

Claims (15)

1. A method of analyzing cellular data, the method comprising:
obtaining a tissue staining image corresponding to a tissue sample and sequencing data, wherein the tissue staining image characterizes each cell contained in the tissue sample, and the sequencing data comprises gene expression data and spatial position data of each sequencing point, and each cell corresponds to at least one sequencing point;
performing reference line detection on the sequencing data and the tissue staining image respectively to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image, wherein the first reference line distribution is used for positioning each sequencing point in the sequencing data, and the second reference line distribution is used for positioning each cell in the tissue staining image;
based on the first reference line distribution and the second reference line distribution, carrying out image registration on the sequencing data and the tissue staining image to obtain a gene expression result corresponding to each cell in the tissue sample;
and carrying out cell type annotation on each cell based on the gene expression result to obtain a cell analysis result corresponding to the tissue sample.
2. The method of claim 1, wherein performing reference line detection on the sequencing data and the tissue staining image respectively to obtain the first reference line distribution corresponding to the sequencing data and the second reference line distribution corresponding to the tissue staining image comprises:
performing image transformation on the sequencing data to obtain a sequencing image corresponding to the sequencing data;
respectively carrying out reference line detection on the sequencing image in the horizontal direction and the vertical direction to obtain the first reference line distribution corresponding to the sequencing image;
and respectively carrying out reference line detection on the tissue staining image in the horizontal direction and the vertical direction, and determining the second reference line distribution corresponding to the tissue staining image.
3. The method of claim 2, wherein performing image transformation on the sequencing data to obtain a sequencing image corresponding to the sequencing data comprises:
determining at least one gene at each sequencing point and a gene expression level corresponding to the at least one gene based on the sequencing data;
determining the sequencing point as a pixel point corresponding to the sequencing image;
determining the sum of the gene expression amounts at the sequencing points as a pixel value corresponding to the pixel point;
and generating the sequencing image based on the pixel points and the pixel values corresponding to the pixel points.
4. The method of claim 2, wherein the performing reference line detection on the tissue staining image in the horizontal direction and the vertical direction respectively, and determining the second reference line distribution corresponding to the tissue staining image comprises:
Rotating the tissue staining image at a first rotation angle based on a rotation angle threshold in the process of performing reference line detection on the tissue staining image in the horizontal and vertical directions, respectively;
and under the condition that the reference line is detected to be in a horizontal or vertical state, determining a current reference line detection result as a second reference line distribution corresponding to the tissue staining image, and determining a current rotation angle as a first target rotation angle of the tissue staining image relative to the sequencing data.
5. The method of claim 4, wherein performing image registration on the sequencing data and the tissue staining image based on the first reference line distribution and the second reference line distribution to obtain the gene expression result corresponding to each cell in the tissue sample comprises:
rotating the tissue staining image based on the first target rotation angle to obtain a rotated tissue staining image, wherein the reference line direction corresponding to the rotated tissue staining image is consistent with the reference line direction corresponding to the sequencing data;
and scaling and translating the rotated tissue staining image based on the first reference line distribution and the second reference line distribution, and performing image registration on the rotated tissue staining image and the sequencing data in a reference-line-alignment manner to obtain the gene expression result corresponding to each cell in the tissue sample.
6. The method according to any one of claims 1 to 5, further comprising:
image segmentation is carried out on the sequencing image and the tissue staining image to obtain at least two sequencing sub-images and at least two tissue staining sub-images;
respectively carrying out reference line detection on the at least two sequencing sub-images and the at least two tissue staining sub-images to obtain a third reference line distribution corresponding to the at least two sequencing sub-images and a fourth reference line distribution corresponding to the at least two tissue staining sub-images;
performing reference line matching on the at least two sequencing sub-images based on the third reference line distribution to obtain the first reference line distribution corresponding to the sequencing image;
and performing reference line matching on the at least two tissue staining sub-images based on the fourth reference line distribution to obtain the second reference line distribution corresponding to the tissue staining image.
7. The method of claim 6, wherein the method further comprises:
determining at least two second target rotation angles of the at least two tissue staining sub-images relative to the at least two sequencing sub-images, wherein a reference line direction corresponding to the tissue staining sub-images obtained based on the second target rotation angles is consistent with a reference line direction corresponding to the sequencing sub-images;
Filtering abnormal values contained in the at least two second target rotation angles;
determining an average value between the filtered at least two second target rotation angles as a first target rotation angle of the tissue staining image relative to the sequencing data.
8. The method of claim 1, wherein after the obtaining the tissue staining image and the sequencing data corresponding to the tissue sample, the method further comprises:
inputting the tissue staining image into an instance segmentation model, and determining candidate cell regions in the tissue staining image through a candidate region detection network in the instance segmentation model;
judging and adjusting the candidate cell regions through a region classification network and a bounding box regression network in the instance segmentation model;
and demarcating the contours of the cells in the tissue staining image through a segmentation network in the instance segmentation model to obtain a cell segmentation result corresponding to the tissue staining image.
9. The method of claim 1, wherein annotating each cell with a cell type based on the gene expression results to obtain a cell analysis result corresponding to the tissue sample, comprising:
cell type prediction is carried out on each cell in the tissue sample through a deep neural network, and a first cell type probability value corresponding to each cell output by the deep neural network is obtained;
cell type prediction is carried out on each cell in the tissue sample through a graph convolutional network, a second cell type probability value corresponding to each cell output by the graph convolutional network is obtained, and the graph convolutional network consists of a deep auto-encoder, a graph auto-encoder and a classifier;
correcting the first cell type probability value by using the second cell type probability value to obtain a cell type prediction result corresponding to each cell;
and carrying out cell type annotation on each cell based on the cell type prediction result to obtain a cell analysis result corresponding to the tissue sample.
10. The method according to claim 9, wherein the method further comprises:
inputting a sample single-cell data set, sample gene expression corresponding to the sample single-cell data set and sample cell types into the deep neural network, and outputting first sample cell type probability values corresponding to each sample cell in the sample single-cell data set through the deep neural network;
Based on the sample cell type and the first sample cell type probability value, counting the cell number corresponding to each sample cell type to obtain a first cross entropy loss based on category weighting;
training the deep neural network with the class-weighted first cross entropy loss.
11. The method according to claim 9, wherein the method further comprises:
inputting a sample single-cell data set, sample gene expression corresponding to the sample single-cell data set and sample cell types into the graph convolutional network, and outputting second sample cell type probability values corresponding to each sample cell in the sample single-cell data set through the graph convolutional network;
determining a reconstruction loss corresponding to the deep auto-encoder and the graph auto-encoder in the graph convolutional network and a second cross entropy loss corresponding to the classifier and used for soft label output, based on the sample cell type and the second sample cell type probability value;
training the deep auto-encoder and the graph auto-encoder with the reconstruction loss, and training the classifier with the second cross entropy loss for soft label output.
12. A cell data analysis device, the device comprising:
the acquisition module is used for acquiring a tissue staining image and sequencing data corresponding to a tissue sample, wherein the tissue staining image represents each cell contained in the tissue sample, the sequencing data comprises gene expression data and spatial position data of each sequencing point, and each cell corresponds to at least one sequencing point;
the first reference line detection module is used for respectively carrying out reference line detection on the sequencing data and the tissue staining image to obtain a first reference line distribution corresponding to the sequencing data and a second reference line distribution corresponding to the tissue staining image, wherein the first reference line distribution is used for positioning each sequencing point in the sequencing data, and the second reference line distribution is used for positioning each cell in the tissue staining image;
the image registration module is used for carrying out image registration on the sequencing data and the tissue staining image based on the first reference line distribution and the second reference line distribution to obtain a gene expression result corresponding to each cell in the tissue sample;
And the type annotation module is used for carrying out cell type annotation on each cell based on the gene expression result to obtain a cell analysis result corresponding to the tissue sample.
13. A computer device, the computer device comprising a processor and a memory; the memory stores at least one instruction for execution by the processor to implement the cell data analysis method of any one of claims 1 to 11.
14. A computer readable storage medium storing at least one instruction for execution by a processor to implement the method of cell data analysis of any one of claims 1 to 11.
15. A computer program product, the computer program product comprising computer instructions stored in a computer readable storage medium; a processor of a computer device reads the computer instructions from the computer readable storage medium, the processor executing the computer instructions, causing the computer device to implement the cell data analysis method according to any one of claims 1 to 11.
CN202310593539.1A 2023-05-23 2023-05-23 Cell data analysis method, device, equipment and storage medium Pending CN116958221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310593539.1A CN116958221A (en) 2023-05-23 2023-05-23 Cell data analysis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310593539.1A CN116958221A (en) 2023-05-23 2023-05-23 Cell data analysis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116958221A true CN116958221A (en) 2023-10-27

Family

ID=88443395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310593539.1A Pending CN116958221A (en) 2023-05-23 2023-05-23 Cell data analysis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116958221A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118096840A (en) * 2024-04-29 2024-05-28 深圳赛陆医疗科技有限公司 Registration method, product, equipment and medium of space group student tissue slice image and RNA-seq heat map
CN118096840B (en) * 2024-04-29 2024-06-25 深圳赛陆医疗科技有限公司 Registration method, product, equipment and medium of space group student tissue slice image and RNA-seq heat map


Legal Events

Date Code Title Description
PB01 Publication