CN115579055A - Cell object classification method, device, electronic equipment and storage medium

Info

Publication number: CN115579055A
Application number: CN202211560115.7A
Authority: CN
Inventors: 刘松明; 贺照人
Original assignee: Baitu Shengke Suzhou Intelligent Technology Co ltd
Current assignee: Baitu Shengke Suzhou Intelligent Technology Co ltd
Priority date: 2022-12-05
Filing date: 2022-12-05
Publication date: 2023-01-06
Anticipated expiration: 2042-12-05
Also published as: CN115579055B

Abstract

The invention provides a cell object classification method and apparatus, an electronic device, and a storage medium. The method comprises: obtaining a cell object data set of a cell object set, each cell object in the set having an attribute value of at least one first attribute; determining an adjacency matrix corresponding to the cell object data set, wherein the vertices of the adjacency matrix are the cell objects; partitioning the cell objects based on the adjacency matrix to obtain at least one cell object community; aggregating the cell objects with a preset aggregation algorithm, based on the adjacency matrix and the cell object communities, to obtain a cell object cluster set; and performing a classification operation for each cell object cluster. The method enables better classification of cell objects.

Description

Cell object classification method, device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for classifying cell objects, an electronic device, and a storage medium.

Background

With the development of internet technology, more and more data are processed by computer. Classifying the data effectively makes it usable, so that large amounts of data can be better analyzed and utilized in subsequent steps.

Existing data classification techniques mainly aggregate all of the data directly, and the algorithms are highly complex, so classification efficiency and accuracy are low for large data sets. Meanwhile, as computer technologies are applied in more and more fields, the types of data to be processed, such as text, audio, and image data, are becoming more diverse. In the field of biomedicine, the rapid development of sequencing technology generates massive amounts of data, so efficient classification methods for massive data sets are urgently needed.

The classification of cells and cell data is an important field of application of the method of the invention. Single cell sequencing technology can study, at the cell level, the transcriptome expression of a research object in diseased and normal states, and provides a powerful tool for research on complex systems such as tumor heterogeneity and immunology.

When single cell sequencing technology is applied to tumor immunity studies, an important challenge is how to distinguish tumor cells from normal cells. For specific cancer types, the distinction can be made through the expression of well-recognized marker genes; for example, the CD38, CD138 and CD56 genes can be used to mark tumor cells in multiple myeloma.

Tumor cells originate from normal cells. They usually arise when the genome of a normal cell is altered, so that its gene expression and function become abnormal and the cell escapes surveillance by the body. Copy Number Variation (CNV) is an important component of genetic variation, and refers to the amplification or reduction of DNA (deoxyribonucleic acid) fragments of 1000 or more bases. CNV is widely present in various tumors and is involved in tumor development by affecting gene expression. On this basis, researchers have proposed methods for inferring cell CNV from single cell sequencing data, so that tumor cells can be identified on the premise that the CNV of tumor cells differs from that of normal cells.

Combined with second-generation sequencing technology, single cell sequencing provides a finer view for studying complex diseases. However, because the total amount of RNA in a single cell is low and specific RNA degrades during cell lysis, among other problems, the cell expression matrix obtained by researchers is large in volume and highly sparse, which poses challenges for fast downstream analysis.

Methods that identify tumor cells based on specific marker genes have two limitations. On the one hand, the selection of marker genes is strictly limited by prior knowledge and is suitable only for certain tumors with well-defined marker genes; tumor cells, however, are often highly heterogeneous, and it is difficult to find one or a few specific genes that represent all tumor cell subtypes. On the other hand, because the data volume of the single cell expression matrix is huge, the expression of marker genes obtained in previous research is not always consistent with expectations in single cell data, which limits the application of such methods.

The running time and resource consumption of existing analysis methods are closely tied to the number of cells involved in the computation, so these methods cannot keep up with the rapidly growing cell numbers in current single cell research. In addition, existing analysis methods do not provide predictions for the various cell classes after classification, so in practice researchers often have to make subjective judgments. Other analysis methods are applicable only to CNV prediction for a single sample.

Disclosure of Invention

The invention provides a cell object classification method, a cell object classification device, an electronic device and a storage medium, which can be used for solving the problems of low classification efficiency, long operation time and high memory requirement in the related art.

In a first aspect, embodiments of the present invention provide a cell object classification method, the method including: obtaining a cell object data set of a cell object set, each of the cell objects having an attribute value of at least one first attribute; determining an adjacency matrix corresponding to the cell object data set, the vertices of the adjacency matrix being the cell objects; partitioning the cell objects based on the adjacency matrix to obtain at least one cell object community, wherein each cell object community comprises at least one cell object; aggregating the cell objects with a preset aggregation algorithm, based on the adjacency matrix and the cell object communities, to obtain a cell object cluster set; and, for each cell object cluster, performing the following classification operation: determining an attribute value of the cell object cluster at each first attribute based on the attribute values of the cell objects in the cell object cluster at the at least one first attribute, and determining a type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each first attribute.

In some alternative embodiments, after obtaining the cell object data set of the cell object set and before determining the adjacency matrix corresponding to the cell object data set, the method further includes: for each cell object data, performing the following preprocessing operations: determining whether the cell object data comes from a plurality of data sources; in response to determining yes, performing multi-data-source merge processing on the cell object data, wherein the source merge processing comprises: the cell object data having at least one source marker, summarizing the source markers of the cell object data, and merging the cell object data having preset source markers based on the source marker summary result; determining whether the cell object data is preset-quality data; and, in response to determining yes, deleting the preset-quality cell object data from the cell object data set.

In some optional embodiments, the preprocessing operation further includes: for each first attribute, performing the following invalid attribute deletion operation: determining the number of cell object data in the cell object data set whose first attribute value under the first attribute equals a preset invalid attribute value, and taking that number as the number of invalid first attribute values of the first attribute; and, in response to the number of invalid first attribute values of the first attribute being greater than a preset threshold, deleting the first attribute and the corresponding first attribute value of each cell object data.

In some alternative embodiments, determining the adjacency matrix corresponding to the cell object data set includes: generating an adjacency matrix corresponding to the cell object data set, wherein the vertices of the adjacency matrix correspond to the respective cell object data; and, for each cell object data, connecting the vertex corresponding to that cell object data in the adjacency matrix to its nearest vertex set, where the nearest vertex set consists of the vertices corresponding to a second preset number of cell object data in the cell object data set that are closest in distance to that cell object data.

In some optional embodiments, partitioning the cell objects based on the adjacency matrix to obtain at least one cell object community, each cell object community comprising at least one cell object, includes: for each cell object, generating a cell object community comprising that cell object; determining a community discrimination degree for each cell object community based on the adjacency matrix; and, for each cell object, performing the following update operation: for each other cell object community, that is, each community other than the one to which the cell object currently belongs, determining an updated community discrimination degree, which is the community discrimination degree of the new community formed by adding the cell object to that other community; determining whether the maximum of the updated community discrimination degrees is greater than the community discrimination degree of the community to which the cell object currently belongs; and, in response to determining yes, moving the cell object to the other cell object community with the highest updated community discrimination degree.

In some optional embodiments, the update operation further includes: determining whether the update breaks the chain of the cell object community to which the cell object belonged before the update, that is, whether that community is no longer connected; and, in response to determining yes, re-partitioning that pre-update cell object community to obtain at least two new cell object communities.

In some alternative embodiments, performing the classification operation for each cell object cluster includes: in response to determining that the cell object cluster set satisfies a preset condition, performing the classification operation for each cell object cluster.

In some optional embodiments, the determining the attribute value of each cell object cluster at each first attribute based on the attribute value of each cell object in the cell object cluster at least at one first attribute includes: for each first attribute, determining the sum of the attribute values of the first attribute of the cell objects in the cell object cluster as the attribute value of the first attribute of the cell object cluster.

In some alternative embodiments, the determining the type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each of the first attributes includes: determining an attribute value of the second attribute for each cell object in the cluster of cell objects based on cell object data for each cell object in the cluster of cell objects; the type of each cellular object in the cluster of cellular objects is determined based on the attribute value of the cluster of cellular objects at each of the first attributes and the attribute value of the second attribute of each cellular object in the cluster of cellular objects.

In some alternative embodiments, the cell object is a cell, the cell object data is a gene expression matrix of the cell, the first attribute is a gene, the attribute value of the first attribute is a gene expression amount, and the second attribute is a CNV.

In a second aspect, an embodiment of the present invention provides a cell object classification apparatus, including: an acquisition unit configured to acquire a cell object data set of a cell object set, each of the cell objects having an attribute value of at least one first attribute; a determination unit configured to determine an adjacency matrix corresponding to the cell object data set, the vertices of the adjacency matrix being the cell objects; a partitioning unit configured to partition the cell objects based on the adjacency matrix to obtain at least one cell object community, each cell object community including at least one cell object; an aggregation unit configured to aggregate the cell objects with a preset aggregation algorithm, based on the adjacency matrix and the cell object communities, to obtain a cell object cluster set; and a classification unit configured to perform, for each cell object cluster, the following classification operation: determining an attribute value of the cell object cluster at each first attribute based on the attribute values of the cell objects in the cell object cluster at the at least one first attribute, and determining a type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each first attribute.

In a third aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation manner of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by one or more processors, implements the method as described in any implementation manner of the first aspect.

In order to realize rapid classification of a large number of cells and overcome the problems of long running time and high resource consumption of the existing analysis method, the invention provides a method, a device, electronic equipment and a storage medium for rapidly classifying cell objects.

Drawings

FIG. 1 is a schematic diagram of an implementation environment in which an embodiment of the present invention may be used.

FIG. 2A is a flow chart of one embodiment of a method of cell object classification according to the present invention.

FIG. 2B is an exploded flow diagram for one embodiment of step 201' in accordance with the present invention.

FIG. 2C is an exploded flow diagram for one embodiment of step 205, in accordance with the present invention.

Fig. 3 is a schematic structural diagram of an embodiment of the cell object sorting apparatus according to the present invention.

FIG. 4 is a schematic block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.

Fig. 5 shows the distribution of CNV scores calculated in an ideal case, in which the abscissa represents the score of CNV, the ordinate represents the density, the left peak represents normal cells, and the right peak represents tumor cells.

Fig. 6 shows the composition of cells in each of the subpopulations of super cells, classified as indeterminate, malignant, suspected malignant, normal and suspected normal cells, after the clustering effect evaluation of the different datasets (dataset 1-dataset 6) with each dataset divided into about 1000 super cell subpopulations.

FIG. 7 shows the running time of the method of the present invention and of common single cell analysis software with the same computational cores. Here, "method of the invention: super cell grouping and common single cell analysis software" is the total time of the method; "method of the invention: common single cell analysis software" is the time required to run the common single cell analysis software within the method of the present invention; and "other methods: common single cell analysis software" is the total time required without the super cell clustering method.

FIG. 8 shows Receiver Operating Characteristic curves (ROC Curve) of the method of the present invention and common single cell analysis software in different datasets, where the abscissa is specificity and the ordinate is sensitivity.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

Illustratively, the first attribute of the cellular object is a gene, the attribute value of the first attribute is a gene expression level, and the second attribute is a CNV.

Fig. 1 is a schematic diagram illustrating an implementation environment of a cell object classification method, an apparatus, an electronic device and a storage medium to which the present invention can be applied.

As shown in fig. 1, the implementation environment includes an electronic device 100, and the cell object classification method in the embodiment of the present invention may be executed by terminal devices 101, 102, 103, and 104. Illustratively, the electronic device 100 may comprise at least one of a terminal device or a server.

The terminal devices 101, 102, 103, and 104 may be hardware or software. When the terminal devices 101, 102, 103, and 104 are hardware, they may be various electronic devices having a display screen and supporting information input (e.g., text input and/or voice input), including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103, and 104 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide cell object classification services) or as a single piece of software or software module. No particular limitation is imposed here.

It should be understood that the number of terminal devices in fig. 1 is merely illustrative. There may be any number of terminal devices, as desired for implementation.

With continuing reference to fig. 2A, a flow 200 of one embodiment of a cell object classification method according to the present invention is shown, the cell object classification method comprising the steps of:

Step 201, obtaining a cell object data set of a cell object set.

In this embodiment, the executing subject of the cell object classification method may first acquire a cell object data set of a cell object set. Here, each of the acquired cell objects has an attribute value of at least one first attribute. The embodiments of the present invention do not limit the cell object data, which includes, but is not limited to, a gene expression matrix of a cell (i.e., cell data).

In some alternative embodiments, the executing subject of the cell object classification method may execute step 201' after executing step 201 and before executing step 202:

step 201', for each of the above-mentioned cell object data, a preprocessing operation is performed.

Here, the pre-processing operation may include the following steps 201'1 to 201'8 as shown in FIG. 2B:

step 201'1, it is determined whether the cell object data is from multiple data sources.

If the determination is yes, the process proceeds to step 201'2.

If the determination is no, the process proceeds to step 201'3.

In step 201'2, the cell object data is subjected to multi-source merge processing.

Here, the source merging process may specifically include: the cell object data has at least one source marker, the cell object data is subjected to source marker summarization, and the cell object data with preset source markers is merged based on the source marker summarization result.
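The source merge processing of step 201'2 can be pictured with a minimal sketch. The record layout `(source_marker, cell_id, expression)`, the function name `merge_sources`, and the idea of prefixing cell identifiers with the source marker are illustrative assumptions and are not specified in this description.

```python
from collections import Counter

def merge_sources(records, preset_sources):
    """Summarize source markers, then merge records whose marker is preset.

    records: list of (source_marker, cell_id, expression_dict) tuples
             (hypothetical layout chosen for this sketch).
    preset_sources: set of source markers whose data should be merged.
    """
    # Source marker summary: how many cell records each source contributes.
    summary = Counter(source for source, _, _ in records)

    # Merge only data carrying a preset source marker. Prefixing the cell id
    # with its source keeps ids unique across sources (an assumption).
    merged = {}
    for source, cell_id, expression in records:
        if source in preset_sources:
            merged[f"{source}:{cell_id}"] = expression
    return summary, merged

records = [
    ("sample1", "cell1", {"CD38": 4}),
    ("sample2", "cell1", {"CD38": 0}),
    ("sample3", "cell9", {"CD38": 2}),
]
summary, merged = merge_sources(records, {"sample1", "sample2"})
```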

Step 201'3, determining whether the cell object data is preset quality data.

For example, in a biological data application scenario the preset quality may be low quality. Low-quality cases include, but are not limited to, doublets, dead cells, and a high percentage of mitochondrial genes resulting from cell death.

If the determination is yes, the process proceeds to step 201'4.

If the determination is no, the process proceeds to step 201'5.

Step 201'4, deleting the preset-quality cell object data from the cell object data set.

After step 201'4 is executed, the process may proceed to step 202.
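As a rough illustration of steps 201'3 and 201'4, the sketch below removes cells whose mitochondrial gene fraction is high. The `MT-` gene-name prefix, the 0.2 cutoff, and the function names are assumptions made for this example; the description itself does not fix a specific quality criterion or threshold.

```python
def mito_fraction(expression):
    """Fraction of a cell's total expression that comes from mitochondrial genes.

    expression: dict mapping gene name -> expression amount.
    Genes are treated as mitochondrial if their name starts with "MT-"
    (an assumed naming convention).
    """
    total = sum(expression.values())
    if total == 0:
        return 1.0  # an empty profile is treated as low quality in this sketch
    mito = sum(v for gene, v in expression.items() if gene.startswith("MT-"))
    return mito / total

def drop_low_quality(cells, max_mito_fraction=0.2):
    """Return a copy of the data set without the preset-quality (low-quality) cells."""
    return {
        cell_id: expression
        for cell_id, expression in cells.items()
        if mito_fraction(expression) <= max_mito_fraction
    }

cells = {
    "a": {"MT-CO1": 1, "CD38": 9},   # 10% mitochondrial: kept
    "b": {"MT-CO1": 5, "CD38": 5},   # 50% mitochondrial: deleted
    "c": {},                          # no expression at all: deleted
}
kept = drop_low_quality(cells, max_mito_fraction=0.2)
```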

Step 201'5, for each of the first attributes, performing an invalid attribute deletion operation.

Here, the invalid attribute deletion operation may specifically include: determining the number of cell object data in the cell object data set whose first attribute value under the first attribute equals a preset invalid attribute value, and taking that number as the number of invalid first attribute values of the first attribute.

After step 201'5 is executed, the process may proceed to step 201'6.

Step 201'6, in response to the number of invalid first attribute values of the first attribute being greater than a preset threshold for the number of invalid first attribute values, deleting the first attribute and the corresponding first attribute value of each cell object data.

After step 201'6 is executed, the process may proceed to step 201'7.
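Steps 201'5 and 201'6 can be sketched as a column filter over a cell-by-gene matrix. The choice of 0 as the preset invalid attribute value, the list-of-rows matrix layout, and the name `drop_invalid_attributes` are assumptions for illustration only.

```python
def drop_invalid_attributes(matrix, genes, invalid_value=0, max_invalid=2):
    """Delete first attributes (genes) that are invalid in too many cells.

    matrix: list of rows, one row per cell, one column per gene.
    genes: list of gene names, aligned with the matrix columns.
    A gene is deleted when the number of cells whose value equals
    invalid_value is greater than max_invalid.
    """
    keep = []
    for j in range(len(genes)):
        # Number of invalid first attribute values for this gene.
        n_invalid = sum(1 for row in matrix if row[j] == invalid_value)
        if n_invalid <= max_invalid:
            keep.append(j)
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    kept_genes = [genes[j] for j in keep]
    return kept_matrix, kept_genes

matrix = [
    [0, 1, 2],
    [0, 3, 0],
    [0, 5, 1],
]
filtered, kept_genes = drop_invalid_attributes(matrix, ["A", "B", "C"], max_invalid=1)
```

In this toy input, gene `A` is zero in all three cells and is removed, while `B` and `C` stay.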

Step 201'7, performing feature selection based on the cell object data set to obtain a first preset number of features of the cell object data set as the feature selection result.

For example, here, the first preset number may be 3000.
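The description does not state which feature selection criterion is used. One common choice, shown here purely as an assumed stand-in, is to keep the genes with the highest variance across cells; the function name `select_top_variable` is hypothetical.

```python
from statistics import pvariance

def select_top_variable(matrix, genes, n_features):
    """Keep the n_features genes with the largest variance across cells.

    Variance ranking is an assumption of this sketch, not a criterion
    taken from the description.
    """
    variances = [pvariance([row[j] for row in matrix]) for j in range(len(genes))]
    ranked = sorted(range(len(genes)), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:n_features])  # restore the original column order
    selected = [[row[j] for j in keep] for row in matrix]
    return selected, [genes[j] for j in keep]

matrix = [
    [1, 0, 5],
    [1, 10, 6],
    [1, 20, 7],
]
selected, selected_genes = select_top_variable(matrix, ["A", "B", "C"], n_features=2)
```

With real data the call would use `n_features=3000`, matching the example value given for the first preset number.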

After step 201'7 is executed, the process may proceed to step 201'8.

Step 201'8, performing dimension reduction processing on each cell object data based on the feature selection result.

For example, each cell object data may be reduced to 50 dimensions.

After step 201'8 is executed, the process may proceed to step 202.

Step 202, determining an adjacency matrix corresponding to the cell object data set, the vertices of the adjacency matrix being the cell objects.

In some alternative embodiments, the vertices of the adjacency matrix correspond to the cell object data, respectively.

In some optional embodiments, for each cell object data, the vertex corresponding to that cell object data in the adjacency matrix is connected to its nearest vertex set, where the nearest vertex set consists of the vertices corresponding to a second preset number (for example, 20) of cell object data in the cell object data set that are closest in distance to that cell object data.
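The nearest-vertex construction of step 202 resembles a k-nearest-neighbor graph. The sketch below assumes Euclidean distance on the reduced coordinates and a symmetric, unweighted adjacency matrix; neither choice is stated in the description, and `knn_adjacency` is a hypothetical name.

```python
import math

def knn_adjacency(points, k):
    """Build a symmetric 0/1 adjacency matrix linking each point to its k nearest points.

    points: list of coordinate tuples, one per cell (e.g. after dimension reduction).
    k: the second preset number of nearest neighbors.
    """
    n = len(points)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        # Distances from cell i to every other cell, nearest first.
        by_distance = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in by_distance[:k]:
            adjacency[i][j] = 1
            adjacency[j][i] = 1  # symmetrized: an assumption of this sketch
    return adjacency

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
adjacency = knn_adjacency(points, k=1)
```

This brute-force version costs O(n²) distance computations; for large cell numbers an approximate nearest-neighbor index would be the usual replacement.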

Step 203, based on the adjacency matrix, partitioning each cell object to obtain at least one cell object community.

Here, the community of cell objects may include at least one cell object.

In some alternative embodiments, step 203 may include: for each cell object, generating a cell object community comprising that cell object; determining a community discrimination degree for each cell object community based on the adjacency matrix; and, for each cell object, performing the following update operation: for each other cell object community, that is, each community other than the one to which the cell object currently belongs, determining an updated community discrimination degree, which is the community discrimination degree of the new community formed by adding the cell object to that other community; determining whether the maximum of the updated community discrimination degrees is greater than the community discrimination degree of the community to which the cell object currently belongs; and, in response to determining yes, moving the cell object to the other cell object community with the highest updated community discrimination degree.
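The update operation has the shape of the local moving phase used in modularity-based community detection. The sketch below substitutes graph modularity of the whole partition for the "community discrimination degree", and only tries communities that contain a neighbor of the cell; both are assumptions of this example, not statements about the patented scoring.

```python
def modularity(adjacency, labels):
    """Newman modularity of a partition of an undirected graph."""
    two_m = sum(sum(row) for row in adjacency)
    if two_m == 0:
        return 0.0
    degree = [sum(row) for row in adjacency]
    n = len(adjacency)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adjacency[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

def local_moving(adjacency, max_sweeps=10):
    """Start with one community per cell, then greedily move cells between communities."""
    n = len(adjacency)
    labels = list(range(n))
    for _ in range(max_sweeps):
        moved = False
        for i in range(n):
            best_label, best_q = labels[i], modularity(adjacency, labels)
            candidates = {
                labels[j] for j in range(n)
                if adjacency[i][j] and labels[j] != labels[i]
            }
            for community in sorted(candidates):
                trial = labels[:]
                trial[i] = community
                q = modularity(adjacency, trial)
                if q > best_q + 1e-12:  # move only on a strict improvement
                    best_label, best_q = community, q
            if best_label != labels[i]:
                labels[i] = best_label
                moved = True
        if not moved:
            break
    return labels

# Two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adjacency[u][v] = adjacency[v][u] = 1
labels = local_moving(adjacency)
```

Recomputing the full score for every trial move is quadratic per trial and is used here only for readability; practical implementations compute the change in score incrementally.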

In some optional embodiments, the update operation may further include: determining whether the update breaks the chain of the cell object community to which the cell object belonged before the update, that is, whether that community is no longer connected; and, in response to determining yes, re-partitioning that pre-update cell object community to obtain at least two new cell object communities.
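One way to read the broken-chain check is as a connectivity test inside each community: if the members of a community no longer form a connected subgraph, the community is split into its connected pieces. The sketch below follows that reading, which is an interpretation rather than something the description spells out; `split_disconnected` is a hypothetical name.

```python
def split_disconnected(adjacency, labels):
    """Relabel cells so that every community is a connected subgraph.

    Cells that share an old label but cannot reach each other through
    edges inside that community receive different new labels.
    """
    n = len(adjacency)
    new_labels = [-1] * n
    next_label = 0
    for start in range(n):
        if new_labels[start] != -1:
            continue
        # Depth-first search restricted to cells with the same old label.
        new_labels[start] = next_label
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if adjacency[u][v] and labels[v] == labels[u] and new_labels[v] == -1:
                    new_labels[v] = next_label
                    stack.append(v)
        next_label += 1
    return new_labels

# Path graph 0 - 1 - 2, where cell 1 has left the community shared by 0 and 2.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
broken = split_disconnected(path, [7, 8, 7])
intact = split_disconnected(path, [7, 7, 7])
```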

Step 204, based on the adjacency matrix and each cell object community, aggregating each cell object by using a preset aggregation algorithm to obtain a cell object cluster set.

In some alternative embodiments, the executing subject of the cell object classification method may perform the following operation after performing step 204 and before performing step 205: in response to determining that the cell object cluster set satisfies a preset condition, performing the classification operation for each of the cell object clusters. Illustratively, the preset condition is that, after step 204 has been repeated, the number of cell object clusters finally obtained matches or approaches the expected number of cell object clusters.
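The preset aggregation algorithm is not specified in this description. A frequently used aggregation step, shown here only as an assumed example, collapses each community into a single vertex and sums the edge weights between communities, so that partitioning can be repeated on the smaller graph until the cluster count approaches the expected number.

```python
def aggregate_graph(adjacency, labels):
    """Collapse each community into one vertex of a smaller weighted graph.

    Returns the community-level adjacency matrix and the list of community
    ids in row order. Diagonal entries hold the within-community weight
    (each internal edge is counted twice because the input is symmetric).
    """
    community_ids = sorted(set(labels))
    index = {community: k for k, community in enumerate(community_ids)}
    size = len(community_ids)
    aggregated = [[0] * size for _ in range(size)]
    for i, row in enumerate(adjacency):
        for j, weight in enumerate(row):
            if weight:
                aggregated[index[labels[i]]][index[labels[j]]] += weight
    return aggregated, community_ids

# Two triangles joined by one edge, already partitioned into two communities.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adjacency[u][v] = adjacency[v][u] = 1
aggregated, community_ids = aggregate_graph(adjacency, [1, 1, 1, 5, 5, 5])
```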

In step 205, a classification operation is performed for each of the cell object clusters.

Here, the classification operation may specifically include: determining an attribute value of the cell object cluster at each first attribute based on the attribute values of the cell objects in the cell object cluster at the at least one first attribute, and determining a type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each first attribute.

In some optional embodiments, step 205 may further include the following steps 2051 to 2053 as shown in fig. 2C:

step 2051, for each first attribute, determining the sum of the attribute values of the first attribute of each cell object in the cell object cluster as the attribute value of the first attribute of the cell object cluster.

Step 2052, an attribute value of the second attribute of each cell object in the cluster of cell objects is determined based on the cell object data for each cell object in the cluster of cell objects.

Step 2053 determines the type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each of the first attributes and the attribute value of the second attribute of each cell object in the cell object cluster.
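Step 2051 amounts to summing the expression rows of all cells that share a cluster. A minimal sketch follows, assuming a list-of-rows matrix and one cluster label per cell; steps 2052 and 2053 are not shown because the description does not give the CNV scoring or the decision rule.

```python
def cluster_attribute_sums(matrix, cluster_labels):
    """Sum each first attribute (gene) over the cells of every cluster.

    matrix: list of rows, one row of gene expression amounts per cell.
    cluster_labels: cluster id of each cell, aligned with the matrix rows.
    Returns a dict mapping cluster id -> list of per-gene sums.
    """
    sums = {}
    for row, cluster in zip(matrix, cluster_labels):
        if cluster not in sums:
            sums[cluster] = [0] * len(row)
        sums[cluster] = [total + value for total, value in zip(sums[cluster], row)]
    return sums

matrix = [
    [1, 2],
    [3, 4],
    [5, 6],
]
sums = cluster_attribute_sums(matrix, [0, 0, 1])
```

Summing over a cluster yields a much denser profile than any single cell, which is one plausible motivation for classifying at the cluster level.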

With further reference to fig. 3, as an implementation of the methods shown in the above-mentioned figures, the present invention provides an embodiment of a cell object classifying device, which corresponds to the embodiment of the method shown in fig. 2A, and which can be applied to various electronic devices.

As shown in fig. 3, the cell object classification apparatus 300 of the present embodiment includes: an obtaining unit 301 configured to obtain a cell object data set of a cell object set, each of the cell objects having an attribute value of at least one first attribute; a determining unit 302 configured to determine an adjacency matrix corresponding to the cell object data set, the vertices of the adjacency matrix being the cell objects; a partitioning unit 303 configured to partition the cell objects based on the adjacency matrix to obtain at least one cell object community, where each cell object community includes at least one cell object; an aggregation unit 304 configured to aggregate the cell objects with a preset aggregation algorithm, based on the adjacency matrix and the cell object communities, to obtain a cell object cluster set; and a classification unit 305 configured to perform the following classification operation for each cell object cluster: determining an attribute value of the cell object cluster at each first attribute based on the attribute values of the cell objects in the cell object cluster at the at least one first attribute, and determining a type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each first attribute.

Referring now to fig. 4, a schematic diagram of a computer system 400 suitable for implementing the electronic device of the present invention is shown. The computer system 400 illustrated in fig. 4 is only an example and should not impose any limitation on the scope of use or functionality of embodiments of the invention.

As shown in fig. 4, computer system 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the computer system 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, tape, hard disk, etc.; and a communication device 409. The communication device 409 may allow the computer system 400 to communicate with other devices, either wirelessly or by wire, to exchange data. While fig. 4 illustrates a computer system 400 having various means of electronic equipment, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, the processes described above with reference to the flowcharts may be implemented as a computer software program according to an embodiment of the present invention. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 409, or from the storage device 408, or from the ROM 402. The computer program, when executed by the processing apparatus 401, performs the functions defined above in the methods of embodiments of the present invention.

It should be noted that the computer readable medium of the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the method for cell object classification as illustrated in the embodiment shown in fig. 2A and its optional implementations.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present invention may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.

The foregoing description is only exemplary of the preferred embodiments of the invention and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by mutually replacing the above features with (but not limited to) features having similar functions disclosed in the present invention.

For example, when the cell object data is a gene expression matrix of cells, the first attribute is a gene, the attribute value of the first attribute is the gene expression level, and the second attribute is copy number variation (CNV). The following examples illustrate the flow and effect of the cell object classification method of the invention in a specific application scenario.

Example 1: the obtained single-cell expression matrix is first preprocessed, which comprises the following steps:

1. when a plurality of samples exist, merging the data of the multiple samples;

2. removing cells of poor quality and genes expressed in only a few cells;

3. screening for highly variable genes, i.e., genes with greater variance between cells, which represent the major differences between cells;

4. principal component analysis; illustratively, the principal components are obtained using the PCA dimension reduction method.

For example, it is first determined whether the input single-cell dataset (as shown in fig. 2B) comes from multiple samples. If so, the multi-sample data is merged according to the data source attribute; if the input dataset comes from a single sample, no merging is performed.
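
The merge decision described above can be sketched as follows. This is only a minimal illustration, assuming each sample is held as a mapping from cell identifier to a per-gene expression dictionary; the function name `merge_samples`, the `source:cell` key format, and the choice to fill genes missing from a sample with 0 are assumptions for the sketch and are not specified by the embodiment.

```python
def merge_samples(samples):
    """Merge per-sample expression tables into one data set.

    `samples` maps a source label (sample name) to {cell_id: {gene: value}}.
    A data set with a single source passes through the same code path.
    """
    # Union of genes across all sources so every cell gets the same columns.
    genes = sorted({g for cells in samples.values()
                    for expr in cells.values() for g in expr})
    merged = {}
    for source, cells in samples.items():
        for cell_id, expr in cells.items():
            # Prefix the cell id with its source label to avoid collisions.
            key = f"{source}:{cell_id}"
            # Assumption: a gene absent from a sample is treated as 0.
            merged[key] = {g: expr.get(g, 0) for g in genes}
    return merged, genes


samples = {
    "sampleA": {"c1": {"G1": 3, "G2": 0}},
    "sampleB": {"c1": {"G2": 5, "G3": 1}},
}
merged, genes = merge_samples(samples)
```

Keeping the source label in the cell key preserves the data source attribute after merging, so later steps can still tell the samples apart.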

Poor-quality cells are then removed. Optional criteria include, but are not limited to: 1. the gene expression quantity of the cell clearly exceeds a preset threshold; 2. the proportion of mitochondrial genes in the cell is high. Cells meeting such criteria are determined to be preset-quality cells and need to be removed. Optionally, genes expressed in only a small number of cells may also be removed. After a large number of highly variable genes are obtained through further screening, their dimensionality is reduced using the PCA dimension reduction method, for example to 50 dimensions, and each dimension is regarded as the joint contribution of several genes according to the gene aggregation attribute.
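
The quality filtering described above might look like the following sketch. The thresholds (`max_genes`, `max_mito_frac`, `min_cells`) are hypothetical values chosen for illustration, the "gene expression quantity" criterion is interpreted here as the number of detected genes per cell, and mitochondrial genes are assumed to be recognizable by the `MT-` prefix used in human gene symbols; none of these specifics come from the embodiment.

```python
def filter_cells_and_genes(counts, max_genes=6000, max_mito_frac=0.2, min_cells=3):
    """counts: {cell_id: {gene: count}}. Returns a filtered copy."""
    kept = {}
    for cell, expr in counts.items():
        total = sum(expr.values())
        n_genes = sum(1 for v in expr.values() if v > 0)
        mito = sum(v for g, v in expr.items() if g.startswith("MT-"))
        mito_frac = mito / total if total else 1.0
        # Drop cells with too many detected genes or too much mitochondrial signal.
        if n_genes <= max_genes and mito_frac <= max_mito_frac:
            kept[cell] = expr
    # Count in how many remaining cells each gene is detected.
    detected = {}
    for expr in kept.values():
        for g, v in expr.items():
            if v > 0:
                detected[g] = detected.get(g, 0) + 1
    good_genes = {g for g, n in detected.items() if n >= min_cells}
    # Keep only genes detected in at least `min_cells` cells.
    return {c: {g: v for g, v in e.items() if g in good_genes}
            for c, e in kept.items()}


counts = {
    "c1": {"G1": 5, "G2": 2, "MT-CO1": 1},
    "c2": {"G1": 4, "G2": 0, "MT-CO1": 1},
    "c3": {"G1": 1, "G2": 0, "MT-CO1": 9},  # mostly mitochondrial signal
}
filtered = filter_cells_and_genes(counts, max_genes=10, max_mito_frac=0.3, min_cells=2)
```

In this toy input, cell `c3` is removed for its high mitochondrial fraction and gene `G2` is removed because only one remaining cell expresses it.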

Illustratively, in the current procedure, 3000 highly variable genes are screened by input data preprocessing, then dimensionality is reduced to 50 principal components, and then the first 30 principal components are selected.
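
A minimal sketch of this selection and reduction step, using NumPy, is given below. It assumes a cells-by-genes matrix that has already been normalized, ranks genes by plain variance (a simplification; analysis tools often use a mean-adjusted dispersion instead), and computes PCA through a singular value decomposition. The function name `select_hvg_and_reduce` and the synthetic data are illustrative assumptions only.

```python
import numpy as np


def select_hvg_and_reduce(X, n_hvg=3000, n_pcs=50, n_keep=30):
    """X: cells x genes expression matrix. Returns (gene indices, PC scores)."""
    X = np.asarray(X, dtype=float)
    # Rank genes by variance across cells and keep the most variable ones.
    variances = X.var(axis=0)
    n_hvg = min(n_hvg, X.shape[1])
    hvg_idx = np.sort(np.argsort(variances)[::-1][:n_hvg])
    Xh = X[:, hvg_idx]
    # PCA via SVD of the column-centered matrix.
    Xc = Xh - Xh.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    n_pcs = min(n_pcs, S.size)
    scores = U[:, :n_pcs] * S[:n_pcs]
    # Keep only the leading components for the downstream graph construction.
    return hvg_idx, scores[:, :min(n_keep, n_pcs)]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
X[:, :10] *= 5.0  # make the first 10 genes clearly more variable
hvg_idx, pcs = select_hvg_and_reduce(X, n_hvg=100, n_pcs=50, n_keep=30)
```

With the defaults, the function mirrors the illustrative numbers in the text: 3000 highly variable genes, 50 principal components, and the first 30 retained.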

In the dimension-reduced space formed by the 30 principal components obtained above, the Gaussian probability of the observed distance between any two data points (each cell corresponds to one data point) is calculated based on the Euclidean distance. Each data point is required to be connected to at least a preset number (for example, 20) of nearest-neighbor data points, which controls the balance between local and global structure: local information is preserved while a certain amount of global structure information is retained. A weighted adjacency matrix of the data point adjacency graph is thereby constructed.

Object (i.e., cell) aggregation is then performed based on the adjacency matrix using an aggregation algorithm. The embodiment of the present invention places no limitation on other micro-organization attribute aggregation algorithms; the aggregation algorithms mentioned in the embodiment include, but are not limited to, convolutional neural network based algorithms, recursive neural network based algorithms, random forest based algorithms, depth-based semantic algorithms, self-attention mechanism based neural network algorithms, graph based neural network algorithms, classical classification algorithms, strongly connected component algorithms, weakly connected component algorithms, triangle counting algorithms, and the like.

In a specific implementation, the preset aggregation algorithm initially regards each data point (node) as an independent community and improves the community discrimination by moving nodes to obtain a new partition result; the higher the community discrimination, the better the clustering result. After a new partition result is formed, the partition is refined based on a determination of whether any node in a partition has broken links, i.e., the existing partition is split in a specific manner to ensure good connectivity within each partition.
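
The construction of the weighted adjacency matrix described above can be sketched as follows. This is an illustration under stated assumptions: a single fixed Gaussian bandwidth `sigma` is used for all points (the embodiment does not give the bandwidth rule), and the graph is symmetrized by keeping an edge whenever either endpoint selects the other as a nearest neighbor. The function name `knn_gaussian_adjacency` is hypothetical.

```python
import math


def knn_gaussian_adjacency(points, k=20, sigma=1.0):
    """Build a symmetric weighted k-nearest-neighbor adjacency matrix."""
    n = len(points)
    k = min(k, n - 1)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Euclidean distances from point i to every other point, nearest first.
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        for d, j in dists[:k]:
            # Gaussian weight of the observed distance (assumed fixed sigma).
            w = math.exp(-d * d / (2.0 * sigma * sigma))
            A[i][j] = max(A[i][j], w)
            A[j][i] = max(A[j][i], w)
    return A


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
A = knn_gaussian_adjacency(points, k=2, sigma=1.0)
```

Each row ends up with at least `k` non-zero entries, which corresponds to the requirement that every data point be connected to at least a preset number of nearest neighbors.
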
When the partition refinement is carried out, the community discrimination does not need to be evaluated again. Network aggregation is then performed based on the partitions, the nodes of the aggregated network are moved, and the above steps are repeated until the partition result cannot be further improved, yielding the final cell grouping result.
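
The node-moving and refinement steps can be illustrated with the simplified sketch below. The embodiment does not name the measure behind "community discrimination"; modularity is assumed here purely for illustration. The refinement is reduced to splitting a community into its connected pieces, and the network aggregation and repetition steps are omitted. All function names are hypothetical.

```python
def modularity(A, labels):
    """Modularity of a partition of a weighted undirected graph (assumed measure)."""
    n = len(A)
    deg = [sum(row) for row in A]
    two_m = sum(deg)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += A[i][j] - deg[i] * deg[j] / two_m
    return q / two_m


def local_move(A, labels):
    """Greedily move single nodes to a neighboring community while the score improves."""
    labels = list(labels)
    improved = True
    while improved:
        improved = False
        for i in range(len(A)):
            best_label, best_q = labels[i], modularity(A, labels)
            # Only the communities of neighbors are candidate targets.
            for c in {labels[j] for j in range(len(A)) if A[i][j] > 0}:
                trial = labels[:i] + [c] + labels[i + 1:]
                q = modularity(A, trial)
                if q > best_q + 1e-12:
                    best_label, best_q = c, q
            if best_label != labels[i]:
                labels[i] = best_label
                improved = True
    return labels


def split_disconnected(A, labels):
    """Split any community whose members are not connected to each other inside it."""
    n = len(A)
    new_labels = [None] * n
    next_id = 0
    for start in range(n):
        if new_labels[start] is not None:
            continue
        new_labels[start] = next_id
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if A[u][v] > 0 and labels[v] == labels[u] and new_labels[v] is None:
                    new_labels[v] = next_id
                    stack.append(v)
        next_id += 1
    return new_labels


# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1.0

labels = local_move(A, list(range(6)))       # start: every node is its own community
refined = split_disconnected(A, [0, 0, 1, 1, 0, 0])
```

Recomputing the full score for every trial move is quadratic per evaluation and only suitable for a toy graph; practical implementations update the score incrementally.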

Based on the above cell clustering result, the cells that the aggregation algorithm of this embodiment groups together as having similar attributes are regarded as one cell population (i.e., an object cluster); in the following, for ease of understanding in a biological context, such a cell population is referred to as a "super cell". The sum of the gene expression levels of all cells in a cell population is regarded as the gene expression intensity of that population, and the resulting cell population gene expression matrix is used as input data for CNV scoring. The scoring evaluates the copy number at the corresponding genome position in each cell (including the tumor cells to be classified) by evaluating the gene expression intensity of the cells and calculating the expression relative to normal cells used as a reference, thereby obtaining the CNV score.
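
The aggregation into super cells and a relative-expression score can be sketched as below. The summation follows the text; the scoring formula does not, since the embodiment gives none. As an assumption, the score is taken as the mean squared, window-smoothed log2 ratio of a super cell profile to a normal reference profile, with genes assumed to be ordered by genomic position. The names `aggregate_supercells` and `cnv_score`, the window size, and the pseudo-count are hypothetical.

```python
import math


def aggregate_supercells(expr, clusters):
    """Sum per-gene expression over the cells of each cluster.

    expr: {cell: [value per gene]}, clusters: {cell: cluster_id}.
    """
    supercells = {}
    for cell, values in expr.items():
        acc = supercells.setdefault(clusters[cell], [0.0] * len(values))
        for g, v in enumerate(values):
            acc[g] += v
    return supercells


def cnv_score(profile, reference, window=3):
    """Assumed score: mean squared smoothed log2 ratio to a normal reference."""
    total_p, total_r = sum(profile), sum(reference)
    # Compare library-size-normalized expression; the pseudo-count avoids log(0).
    ratios = [math.log2((p / total_p + 1e-6) / (r / total_r + 1e-6))
              for p, r in zip(profile, reference)]
    # Average over neighboring genes so that regional shifts dominate.
    smoothed = [sum(ratios[i:i + window]) / window
                for i in range(len(ratios) - window + 1)]
    return sum(x * x for x in smoothed) / len(smoothed)


expr = {
    "n1": [10, 10, 10, 10, 10, 10],
    "n2": [12, 12, 12, 12, 12, 12],
    "t1": [10, 10, 10, 30, 30, 30],  # second half of the region is amplified
    "t2": [9, 9, 9, 27, 27, 27],
}
clusters = {"n1": "S0", "n2": "S0", "t1": "S1", "t2": "S1"}
supercells = aggregate_supercells(expr, clusters)
reference = supercells["S0"]  # assumed to be a known normal super cell
scores = {sc: cnv_score(p, reference) for sc, p in supercells.items()}
```

Summing over the cells of a cluster reduces the sparsity of single-cell measurements before scoring, which is the motivation the text gives for scoring cell populations rather than individual cells.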

Based on the above CNV scoring results, tumor cells and normal cells theoretically have different CNV score distributions, and the CNV scores of all cells should follow a bimodal distribution. As shown in fig. 5, the left peak, with lower CNV scores, corresponds to normal cells, and the right peak, with higher CNV scores, corresponds to tumor cells.
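
One simple way to separate the two peaks is sketched below. The embodiment does not state how the bimodal distribution is split; a one-dimensional two-means procedure is assumed here (a two-component Gaussian mixture would be another option). The function name `split_bimodal` and the example scores are hypothetical.

```python
def split_bimodal(scores, iters=100):
    """Return a threshold between two modes using 1-D two-means (assumed method)."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return lo  # no spread, nothing to separate
    for _ in range(iters):
        thr = (lo + hi) / 2.0
        left = [s for s in scores if s <= thr]
        right = [s for s in scores if s > thr]
        new_lo = sum(left) / len(left)
        new_hi = sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


cnv_scores = [0.01, 0.02, 0.015, 0.03, 0.40, 0.45, 0.50, 0.42]
threshold = split_bimodal(cnv_scores)
calls = ["tumor" if s > threshold else "normal" for s in cnv_scores]
```

Scores above the threshold fall under the right peak and are called tumor; scores at or below it fall under the left peak and are called normal.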

Example 2:

A test dataset was constructed based on published single-cell sequencing data (Kim et al., 2020) to evaluate the effect of the current scheme.

Table 1: basic case of test data currently used

The clustering effect on the different datasets was evaluated first. Each dataset was divided into approximately 1000 super cell subpopulations, and fig. 6 shows the cell composition of each super cell subpopulation. Most super cell subpopulations consist of a single cell type, and only individual subpopulations contain tumor cells and normal cells mixed together so that they cannot be distinguished. This indicates that the super cell subpopulations obtained in the cell grouping step are highly homogeneous, which is the basis for identifying tumor cells in the subsequent steps.
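
The homogeneity of a super cell subpopulation can be quantified, for example, as the fraction of its cells that share the most common annotated cell type. The sketch below assumes per-cell type annotations are available; the function name `supercell_purity` and the toy data are hypothetical and do not reproduce the evaluation in fig. 6.

```python
from collections import Counter


def supercell_purity(assignments, cell_types):
    """Fraction of cells in each super cell that carry its most common type."""
    members = {}
    for cell, supercell in assignments.items():
        members.setdefault(supercell, []).append(cell_types[cell])
    return {sc: max(Counter(types).values()) / len(types)
            for sc, types in members.items()}


assignments = {"c1": "S0", "c2": "S0", "c3": "S1", "c4": "S1", "c5": "S1", "c6": "S1"}
cell_types = {"c1": "tumor", "c2": "tumor", "c3": "normal",
              "c4": "normal", "c5": "normal", "c6": "tumor"}
purity = supercell_purity(assignments, cell_types)
```

A purity of 1.0 means the super cell consists of a single cell type; lower values indicate mixing of tumor and normal cells.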

Second, the time taken by the current scheme and by common single-cell analysis software to process the same dataset was evaluated. Fig. 7 shows the running time of the current scheme (super cell) and of common single-cell analysis software with the same number of computing cores. Since the current scheme consists of two main steps, super cell clustering and running common single-cell analysis software, fig. 7 shows the total time for cell analysis using the current scheme (super cell clustering + running common single-cell analysis software), the time for running common single-cell analysis software within the current scheme, and the total time when only common single-cell analysis software is used without the current scheme. The results show that the total time required by the current scheme is much lower than that required when only common single-cell analysis software is used.

Finally, the accuracy of the current scheme and of common single-cell analysis software in tumor cell prediction was evaluated. CNV scores were first calculated for each cell and compared against the cell annotations given by the authors of the article (Kim et al., 2020). Fig. 8 shows the receiver operating characteristic (ROC) curves of the two on different datasets. On every dataset, the area under the curve (AUC) of the current scheme is higher than that of the original common single-cell analysis software, which means that the tumor cell predictions of the current scheme are closer to the annotations given by the article authors, i.e., the effect is better.
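
For reference, the AUC used in this comparison can be computed directly from scores and binary annotations as the probability that a randomly chosen positive (tumor) cell receives a higher score than a randomly chosen negative (normal) cell, with ties counted as one half. The sketch below uses this pairwise definition; the function name `roc_auc` and the example values are illustrative and are not the data behind fig. 8.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (1 = tumor, 0 = normal)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

An AUC of 1.0 means every tumor cell scores above every normal cell, while 0.5 corresponds to a random ranking.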

Claims

1. A method of cell object classification, the method comprising:

obtaining a cell object data set of a set of cell objects, each cell object in the set of cell objects having an attribute value of at least one first attribute;

determining an adjacency matrix corresponding to the cell object dataset, a vertex of the adjacency matrix being each of the cell objects;

partitioning each cell object based on the adjacency matrix to obtain at least one cell object community, wherein the cell object community comprises at least one cell object;

based on the adjacency matrix and each cell object community, aggregating each cell object by using a preset aggregation algorithm to obtain a cell object cluster set;

for each cell object cluster, performing the following classification operation: determining the attribute value of the cell object cluster at each first attribute based on the attribute value of each cell object in the cell object cluster at the at least one first attribute, and determining the type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each first attribute.

2. The method of claim 1, wherein after the obtaining a cell object dataset of a cell object set, and before the determining an adjacency matrix corresponding to the cell object dataset, the method further comprises:

for each cell object data, the following preprocessing operations are performed:

determining whether the cell object data is from a plurality of data sources; in response to determining that it is, performing multi-data-source merging on the cell object data; the multi-data-source merging comprises: the cell object data carries at least one source mark, the source marks of the cell object data are summarized, and the cell object data having preset source marks are merged based on the source mark summarization result;

determining whether the cell object data is preset-quality cell object data; in response to determining that it is, deleting the preset-quality cell object data from the cell object data set.

3. The method of claim 2, wherein the preprocessing operation further comprises:

for each of the first attributes, performing the following invalid attribute deletion operations: determining the number of cell object data with a first attribute value under the first attribute as a preset invalid attribute value in the cell object data set as the number of invalid first attribute values of the first attribute; deleting the first attribute and the corresponding first attribute value of each cell object data in response to the number of invalid first attribute values of the first attribute being greater than a preset invalid first attribute value number threshold.

4. The method of claim 2 or 3, wherein the preprocessing operation further comprises:

performing feature selection based on the cell object data set to obtain a first preset number of cell object data features as a feature selection result;

and performing dimensionality reduction on each cell object data according to the feature selection result.

5. The method of claim 4, wherein said determining an adjacency matrix corresponding to the cellular object dataset comprises:

generating an adjacency matrix corresponding to the cell object data set, wherein the vertex of the adjacency matrix corresponds to each cell object data respectively;

for each cell object data, connecting a vertex corresponding to the cell object data in the adjacency matrix with a nearest vertex set corresponding to the cell object data in the adjacency matrix, where the nearest vertex set corresponding to the cell object data is each vertex corresponding to a second preset number of cell object data with the nearest distance between the cell object data and the cell object data in the cell object data set in the adjacency matrix.

6. The method of claim 5, wherein said partitioning each of said cell objects based on said adjacency matrix to obtain at least one cell object community, said cell object community comprising at least one cell object, comprises:

for each of said cellular objects, generating a cellular object community comprising the cellular object;

determining a community discrimination for each of the cell object communities based on the adjacency matrix;

for each of said cell objects, performing the following updating operation: for each other cell object community, among the cell object communities, other than the cell object community to which the cell object belongs, determining an updated community discrimination of the other cell object community, wherein the updated community discrimination of the other cell object community is the community discrimination of the new cell object community obtained after the cell object is added to the other cell object community; determining whether the maximum value among the updated community discriminations of the other cell object communities is greater than the community discrimination of the cell object community to which the cell object currently belongs; and in response to determining that it is, moving the cell object to the one of the other cell object communities having the highest updated community discrimination.

7. The method of claim 6, wherein the update operation further comprises:

determining whether the cell object has broken links in the cell object community to which it belonged before the update;

in response to determining that it has, re-partitioning the cell object community before the update to obtain at least two new cell object communities.

8. The method of claim 7, wherein said performing, for each of said cell object clusters, a classification operation comprises:

in response to determining that the set of cell object clusters satisfies a preset condition, performing the classifying operation for each of the cell object clusters.

9. The method of claim 8, wherein said determining the attribute value of the cell object cluster at each first attribute based on the attribute value of each cell object in the cell object cluster at the at least one first attribute comprises:

for each first attribute, determining the sum of the attribute values of the first attribute of the cell objects in the cell object cluster as the attribute value of the first attribute of the cell object cluster.

10. The method of claim 9, wherein said determining the type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each of the first attributes comprises:

determining an attribute value of the second attribute for each cell object in the cluster of cell objects based on cell object data for each cell object in the cluster of cell objects;

determining the type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each of the first attributes and the attribute value of the second attribute of each cell object in the cell object cluster.

11. A cell object classification apparatus, comprising:

an acquisition unit configured to acquire a cell object data set of a cell object set, each cell object in the cell object set having an attribute value of at least one first attribute;

a determination unit configured to determine an adjacency matrix corresponding to the cell object data set, a vertex of the adjacency matrix being each of the cell objects;

a partitioning unit configured to partition each of the cell objects based on the adjacency matrix, resulting in at least one cell object community, the cell object community comprising at least one cell object;

an aggregation unit configured to aggregate each cell object by using a preset aggregation algorithm based on the adjacency matrix and each cell object community to obtain a cell object cluster set;

a classification unit configured to perform, for each of the cell object clusters, the following classification operation: determining the attribute value of the cell object cluster at each first attribute based on the attribute value of each cell object in the cell object cluster at the at least one first attribute, and determining the type of each cell object in the cell object cluster based on the attribute value of the cell object cluster at each first attribute.

12. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-10.

13. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method of any one of claims 1-10.