CN117953965A - Classification prediction method and device for tumors and electronic equipment - Google Patents

Info

Publication number
CN117953965A
Authority
CN
China
Prior art keywords
information
preset
gene expression
enrichment
cell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410112672.5A
Other languages
Chinese (zh)
Inventor
樊嘉
张道涵
梁宸
陆佳成
周俭
施国明
黄晓勇
郭晓军
孟献龙
胡舒阳
叶沐
裴晏梓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongshan Hospital Fudan University
Original Assignee
Zhongshan Hospital Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The application provides a classification prediction method and device for tumors and an electronic device, relating to the technical field of bioinformatics. The method comprises: obtaining original sequencing data of a user to be tested, and obtaining cell gene expression data based on the original sequencing data; obtaining enrichment information to be detected according to the cell gene expression data, preset grouping information and first preset information; obtaining communication intensity information to be detected according to the cell gene expression data, the preset grouping information and second preset information; and substituting the enrichment information to be detected and the communication intensity information to be detected into a tumor classification prediction model to obtain a prediction classification result of the user to be tested. Because the prediction is made from the enrichment information and the communication intensity information, the interpretability and classification accuracy of the prediction classification result are improved, so that a reliable reference can be provided for clinical decisions.

Description

Classification prediction method and device for tumors and electronic equipment
Technical Field
The present invention relates to the technical field of bioinformatics, and in particular to a classification prediction method and device for tumors and an electronic device.
Background
Classification of cancer diseases is a complex problem requiring a combination of factors. With the development of high throughput sequencing technology, gene expression profiling has become an important tool in the study of cancer.
Classical machine learning algorithms such as logistic regression, support vector machine classification algorithms, random forests, and feed forward neural networks all classify and predict samples directly from their gene expression data.
However, the genes in gene expression data interact in complex ways and form network relationships, which conventional machine learning methods do not fully take into account. Classifying cancer tissue directly from gene expression data is therefore limited and one-sided, and its classification accuracy is low.
Accordingly, a classification prediction method and device for tumors and an electronic device are provided.
Disclosure of Invention
The specification provides a classification prediction method and device for tumors and an electronic device. Prediction is based on enrichment information to be detected and communication intensity information to be detected, which improves the interpretability of the prediction. These data are input into a tumor classification prediction model to obtain a prediction classification result of the user to be tested, which improves classification accuracy, provides a reliable reference for clinical decision-making, and assists professionals in making accurate judgments.
The classification prediction method for tumors adopts the following technical scheme:
Acquiring original sequencing data of a user to be tested, and acquiring cell gene expression data based on the original sequencing data;
obtaining enrichment information to be detected according to the cell gene expression data, preset grouping information and first preset information;
Obtaining information of communication intensity to be detected according to the cell gene expression data, the preset grouping information and the second preset information;
Substituting the enrichment information to be detected and the communication intensity information to be detected into a tumor classification prediction model to obtain a prediction classification result of the user to be detected.
Optionally, the obtaining the original sequencing data of the user to be tested, and obtaining the cellular gene expression data based on the original sequencing data includes:
Obtaining a tissue sample to be tested of the user to be tested, and preparing single-cell suspension for the tissue sample to be tested to obtain single-cell suspension;
constructing a gene information library according to the single cell suspension;
carrying out high-throughput sequencing on the gene information library to obtain original sequencing data;
Correcting the original sequencing data to obtain corrected sequencing data;
And determining the expression relation between the cells and the genes based on the corrected sequencing data, and taking the collection of the expression relation as the cell gene expression data.
Optionally, the first preset information includes a preset gene set and preset pathway information;
the obtaining enrichment information to be detected according to the cell gene expression data, the preset grouping information and the first preset information comprises the following steps:
Calculating initial enrichment information by combining the preset gene set with the expression relationships in the cell gene expression data, wherein the initial enrichment information comprises a first enrichment association relationship among cells, preset gene set pathways and initial enrichment scores;
classifying the first enrichment association relationship according to the preset grouping information, and determining a second enrichment association relationship among cell populations, preset gene set pathways and target enrichment scores;
screening the second enrichment association relationship according to the preset pathway information, and collecting the second enrichment association relationships that contain the dimension-reduced pathways as the enrichment information to be detected.
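The three enrichment steps above can be sketched as follows. This is a minimal illustration, not the scoring method of the application: the per-cell score is assumed to be the mean expression of the genes in each set, and the names `enrichment_by_group`, `gene_sets`, `cell_group` and `kept_pathways` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def enrichment_by_group(expr, gene_sets, cell_group, kept_pathways):
    """expr: {cell: {gene: value}}. Returns {group: {pathway: score}}."""
    # Step 1: initial enrichment score per (cell, pathway).
    # ASSUMPTION: mean expression of the set's genes stands in for a real
    # enrichment statistic.
    per_cell = {
        cell: {p: mean(genes.get(g, 0.0) for g in members)
               for p, members in gene_sets.items()}
        for cell, genes in expr.items()
    }
    # Step 2: pool the per-cell scores by cell population.
    pooled = defaultdict(lambda: defaultdict(list))
    for cell, scores in per_cell.items():
        for pathway, score in scores.items():
            pooled[cell_group[cell]][pathway].append(score)
    # Step 3: average within each population and keep only retained pathways.
    return {
        group: {p: mean(v) for p, v in paths.items() if p in kept_pathways}
        for group, paths in pooled.items()
    }


# Toy data; gene, pathway and group names are illustrative only.
expr = {
    "c1": {"A": 2.0, "B": 4.0, "C": 0.0},
    "c2": {"A": 4.0, "B": 6.0, "C": 2.0},
    "c3": {"A": 0.0, "B": 0.0, "C": 8.0},
}
gene_sets = {"path1": ["A", "B"], "path2": ["C"]}
cell_group = {"c1": "T", "c2": "T", "c3": "B"}
scores = enrichment_by_group(expr, gene_sets, cell_group, {"path1"})
```

The result is one score per cell population and retained pathway, which is the shape later used as node features.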
Optionally, the second preset information includes a plurality of preset receptors and a plurality of preset ligands;
the obtaining the communication intensity information to be detected according to the cell gene expression data, the preset grouping information and the second preset information includes:
Mapping the cell gene expression data onto a protein-protein interaction network, and determining the correlation between the preset ligands and the preset receptors in the cell gene expression data as a receptor-ligand pair level;
calculating a communication probability between cells from the receptor-ligand pair level;
And grouping the cells according to the preset grouping information, determining the target communication intensity between two cell populations in combination with the communication probability, and summarizing to generate the communication intensity information to be detected.
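As a rough sketch of the communication steps above, the following assumes that the receptor-ligand pair level is the product of ligand expression in the sending cell and receptor expression in the receiving cell, and that a saturating function maps that level to a probability. Both choices are assumptions made for illustration, the protein-protein interaction network mapping is omitted, and `communication_strength` is a hypothetical name.

```python
from collections import defaultdict


def communication_strength(expr, lr_pairs, cell_group):
    """expr: {cell: {gene: value}}; lr_pairs: [(ligand, receptor)].

    Returns {(sender_group, receiver_group): mean communication probability}.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for sender, s_expr in expr.items():
        for receiver, r_expr in expr.items():
            if sender == receiver:
                continue
            # ASSUMPTION: pair level = ligand (sender) x receptor (receiver),
            # summed over all preset ligand-receptor pairs.
            level = sum(s_expr.get(lig, 0.0) * r_expr.get(rec, 0.0)
                        for lig, rec in lr_pairs)
            # ASSUMPTION: saturating map of the level into [0, 1).
            prob = level / (1.0 + level)
            key = (cell_group[sender], cell_group[receiver])
            total[key] += prob
            count[key] += 1
    # Aggregate cell-to-cell probabilities into population-level strengths.
    return {key: total[key] / count[key] for key in total}


# Toy data; "L" and "R" are placeholder ligand and receptor genes.
expr = {"c1": {"L": 1.0, "R": 0.0}, "c2": {"L": 0.0, "R": 3.0}}
cell_group = {"c1": "T", "c2": "H"}
strength = communication_strength(expr, [("L", "R")], cell_group)
```

The output is directional: the strength from population T to H is stored separately from H to T.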
Optionally, substituting the enrichment information to be detected and the communication intensity information to be detected into a tumor classification prediction model to obtain a prediction classification result includes:
Taking the cell population as an input node;
Extracting target enrichment scores of cell groups in the enrichment information to be detected as node characteristics of the input nodes;
extracting target communication intensity in the communication intensity information to be detected as edge weight between two input nodes;
Carrying out graph convolution operation on each input node, and determining the prediction probability of each classification label;
Screening out the prediction probability meeting the preset condition, and outputting the corresponding classification label as a predicted classification label to obtain a predicted classification result.
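The graph construction and convolution described above can be illustrated with a single untrained layer. This is a sketch under stated assumptions: row-normalised adjacency with self-loops, ReLU activation, mean pooling over nodes, and a probability threshold as the "preset condition". None of these specifics comes from the application, and `graph_conv` and `predict` are hypothetical names.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def graph_conv(adj, feats, weight):
    """One graph convolution: ReLU(D^-1 (A + I) X W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Row-normalise the edge weights (communication intensities).
    a_norm = [[v / sum(row) for v in row] for row in a_hat]
    hidden = matmul(matmul(a_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


def predict(adj, feats, weight, labels, threshold=0.5):
    hidden = graph_conv(adj, feats, weight)
    # Mean-pool node outputs into one logit per classification label.
    logits = [sum(row[k] for row in hidden) / len(hidden)
              for k in range(len(labels))]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    # ASSUMPTION: the preset condition is "probability >= threshold".
    return (labels[best] if probs[best] >= threshold else None), probs


# Two cell populations as nodes; features are enrichment scores and the
# off-diagonal entries are communication intensities (toy values).
adj = [[0.0, 1.0], [1.0, 0.0]]
feats = [[1.0, 0.0], [3.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # placeholder, not trained parameters
label, probs = predict(adj, feats, weight, ["malignant", "benign"])
```

In practice the weight matrix would be learned from the confirmed patient samples; the identity matrix here only makes the arithmetic easy to follow.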
Optionally, the method further comprises:
Constructing a target gene expression matrix according to a diagnosis-confirmed patient sample, wherein the diagnosis-confirmed patient sample comprises an original tissue sample;
Fusing all target gene expression matrixes under the same original tissue sample to obtain a first gene expression matrix;
if the first gene expression matrix does not have a batch effect, obtaining a marker gene from the first gene expression matrix;
And performing unsupervised clustering on the cells through the marker genes to obtain preset grouping information, wherein the preset grouping information comprises the corresponding relation between cell groups.
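To show what the unsupervised clustering step produces, the sketch below groups cells by their marker gene expression with a small k-means routine. The application does not state which clustering algorithm is used, so k-means is a swapped-in assumption, and the function name and data are hypothetical.

```python
def kmeans(points, k, iters=20):
    """Tiny k-means with deterministic initialisation (first k points)."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each cell to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its assigned cells.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign


# Rows are cells, columns are two marker genes (toy values).
cells = ["c1", "c2", "c3", "c4"]
markers = [[9.0, 0.5], [0.2, 7.0], [8.5, 0.1], [0.1, 8.0]]
assignment = kmeans(markers, k=2)
# Preset grouping information: which cell belongs to which cell population.
grouping = {cell: f"cluster_{a}" for cell, a in zip(cells, assignment)}
```

The resulting cell-to-population mapping is what the later enrichment and communication steps consume as preset grouping information.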
The application provides a classification prediction system for tumors, which adopts the following technical scheme and comprises:
the acquisition module is used for acquiring original sequencing data of a user to be tested and acquiring cell gene expression data based on the original sequencing data;
the first processing module is used for obtaining enrichment information to be detected according to the cell gene expression data, the preset grouping information and the first preset information;
The second processing module is used for obtaining the information of the communication intensity to be detected according to the cell gene expression data, the preset grouping information and the second preset information;
and the prediction module is used for substituting the enrichment information to be detected and the communication intensity information to be detected into a tumor classification prediction model to obtain a prediction classification result of the user to be detected.
Optionally, the acquiring module includes:
the acquisition submodule is used for acquiring a tissue sample to be tested of the user to be tested, and preparing single-cell suspension for the tissue sample to be tested to obtain single-cell suspension;
A gene information library construction submodule for constructing a gene information library according to the single cell suspension;
The sequencing submodule is used for carrying out high-throughput sequencing on the gene information library to obtain original sequencing data;
The correction sub-module is used for correcting the original sequencing data to obtain corrected sequencing data;
And the aggregation sub-module is used for determining the expression relation between the cells and the genes based on the corrected sequencing data, and taking the aggregation of the expression relation as the cell gene expression data.
Optionally, the first preset information includes a preset gene set and preset pathway information;
the first processing module includes:
The enrichment processing submodule is used for calculating initial enrichment information by combining the preset gene set with the expression relationships in the cell gene expression data, wherein the initial enrichment information comprises a first enrichment association relationship among cells, preset gene set pathways and initial enrichment scores;
The summarizing sub-module is used for classifying the first enrichment association relationship according to the preset grouping information and determining a second enrichment association relationship among cell populations, preset gene set pathways and target enrichment scores;
And the screening sub-module is used for screening the second enrichment association relationship according to the preset pathway information and collecting the second enrichment association relationships that contain the dimension-reduced pathways as the enrichment information to be detected.
Optionally, the second preset information includes a plurality of preset receptors and a plurality of preset ligands;
The second processing module includes:
A mapping submodule for mapping the cellular gene expression data onto a protein-protein interaction network, and determining the correlation between the preset ligand and the preset receptor in the cellular gene expression data as a receptor-ligand pair level;
A communication probability processing sub-module for calculating a communication probability between cells through the receptor-ligand pair level;
And the grouping sub-module is used for grouping the cells according to the preset grouping information, determining the target communication intensity between the two cell groups by combining the communication probability, and summarizing to generate the to-be-detected communication intensity information.
Optionally, the prediction module includes:
Taking the cell population as an input node;
The first extraction submodule is used for extracting target enrichment scores of cell groups in the enrichment information to be detected and taking the target enrichment scores as node characteristics of the input nodes;
The second extraction submodule is used for extracting target communication intensity in the communication intensity information to be detected as edge weight between the two input nodes;
the graph convolution sub-module is used for carrying out a graph convolution operation on each input node and determining the prediction probability of each classification label;
and the label screening sub-module is used for screening out the prediction probability meeting the preset condition and outputting the corresponding classification label as a predicted classification label to obtain a predicted classification result.
Optionally, the system further comprises a grouping module;
The grouping module comprises:
a matrix construction sub-module for constructing a target gene expression matrix from a diagnosis-confirmed patient sample, the diagnosis-confirmed patient sample comprising an original tissue sample;
the fusion submodule is used for fusing all target gene expression matrixes under the same original tissue sample to obtain a first gene expression matrix;
A judging submodule, configured to obtain a marker gene from the first gene expression matrix if the first gene expression matrix does not have a batch effect;
And the clustering sub-module is used for performing unsupervised clustering on the cells through the marker genes to obtain preset grouping information, wherein the preset grouping information comprises the corresponding relation between cell groups.
The specification also provides an electronic device, wherein the electronic device includes:
A processor; and
A memory storing computer executable instructions that, when executed, cause the processor to perform any of the methods described above.
The present specification also provides a computer readable storage medium storing one or more programs which when executed by a processor implement any of the methods described above.
According to the application, original sequencing data of a user to be tested is acquired, and cell gene expression data is obtained based on the original sequencing data; enrichment information to be detected is obtained according to the cell gene expression data, preset grouping information and first preset information; communication intensity information to be detected is obtained according to the cell gene expression data, the preset grouping information and second preset information; and the enrichment information to be detected and the communication intensity information to be detected are substituted into a tumor classification prediction model to obtain a prediction classification result of the user to be tested. Because the prediction is made from the enrichment information and the communication intensity information, the interpretability and classification accuracy of the prediction classification result are improved, so that a reliable reference can be provided for clinical decisions.
Drawings
Fig. 1 is a schematic diagram of a classification prediction method for tumors according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of a classification prediction system for tumors according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a computer readable medium according to an embodiment of the present disclosure.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art. The basic principles of the invention defined in the following description may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art. The same reference numerals in the drawings denote the same or similar elements, components or portions, and thus a repetitive description thereof will be omitted.
The features, structures, characteristics or other details described in a particular embodiment may be combined in a suitable manner in one or more other embodiments without departing from the technical idea of the invention.
In the description of specific embodiments, features, structures, characteristics or other details are provided so that one skilled in the art can fully understand the embodiments. One skilled in the art may nevertheless practice the present invention without one or more of these specific features, structures, characteristics or other details.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The term "and/or" includes all combinations of any one or more of the associated listed items.
Fig. 1 is a schematic diagram of a classification prediction method for tumor according to an embodiment of the present disclosure, where the method includes:
S2, acquiring original sequencing data of a user to be tested, and acquiring cell gene expression data based on the original sequencing data;
S3, obtaining enrichment information to be detected according to the cell gene expression data, preset grouping information and first preset information;
S4, obtaining information of communication intensity to be detected according to the cell gene expression data, the preset grouping information and the second preset information;
S5, substituting the enrichment information to be detected and the communication intensity information to be detected into a tumor classification prediction model to obtain a prediction classification result of the user to be detected.
The following is a further explanation taking the classification prediction of primary liver tumors as an example:
Primary liver tumors can be divided into two main categories: primary liver cancer and benign space-occupying lesions. In 2020, primary liver cancer was the sixth most common cancer and the third leading cause of cancer death worldwide. Benign space-occupying lesions of the liver, including angiomyolipoma (AML), focal nodular hyperplasia (FNH) and hepatic adenoma, are usually asymptomatic and do not require surgical intervention.
Accurate diagnosis of liver tumors is critical to timely development of treatment and improvement of survival rate of liver tumor patients.
Currently, in tumor imaging examination, primary liver cancer is clinically diagnosed mainly by combining serum alpha-fetoprotein (AFP) with imaging such as color ultrasound, magnetic resonance imaging (MRI) and/or computed tomography (CT). However, although serological markers and imaging features often differ markedly between malignant and benign liver lesions, the specificity of AFP for diagnosing primary liver cancer is only 39%, and some patients may have treatment delayed because their imaging features are atypical. Furthermore, imaging diagnosis requires the assistance of a trained radiologist.
In tumor pathological diagnosis, puncture biopsy of the liver lesion can clarify the nature of the lesion and enable molecular typing of liver cancer, providing valuable information for confirming the diagnosis of a liver tumor, guiding treatment and judging prognosis. However, pathological diagnosis imposes stringent conditions, such as tissue specimens of sufficiently high quality and identifiable histological features. It also requires a highly trained professional pathologist. In addition, the general histological features seen with hematoxylin-eosin (H&E) staining are often insufficient to determine the nature of a tumor, so immunohistochemical staining is frequently required, which makes pathological diagnosis more challenging and time-consuming.
To further improve tumor prediction and classification, the prior art also uses single-cell RNA sequencing to calculate the abundance of tumor-related cell types from gene expression data and thereby obtain a classification label for the tumor. However, the interactions and network relationships between genes are complex, so classifying cancer tissue by relying solely on gene expression profiles has obvious limitations and low accuracy.
Based on this, in order to improve the accuracy of prediction and the interpretability of classification, the invention provides a classification prediction method for tumors, which specifically comprises the following steps:
S1, constructing, training and verifying a tumor classification prediction model;
To improve the robustness of the tumor classification prediction model, the method first performs S11, constructing model training data;
S11-1, constructing a target gene expression matrix according to diagnosis-confirmed patient samples;
S11-1-1, collecting a plurality of confirmed patient samples, wherein the confirmed patient samples comprise original tissue samples and corresponding confirmed labels;
In one embodiment of the present specification, with the approval of the ethics committee and after written informed consent has been obtained from each confirmed patient, tumor tissue from patients who underwent surgery or puncture biopsy for a primary liver tumor is collected as the original tissue sample. The tumor tissue is collected and processed within 90 minutes after surgery. Tumor tissue is composed of many different types of cells that differ markedly in gene expression, growth, differentiation and the like, and is therefore highly heterogeneous.
The confirmed patients include a plurality of hepatocellular carcinoma patients, a plurality of cholangiocellular carcinoma patients and a plurality of focal nodular hyperplasia patients. The hepatocellular carcinoma and cholangiocellular carcinoma patients are patients with malignant primary liver tumors, and the focal nodular hyperplasia patients are patients with benign primary liver tumors.
The original tissue sample of each confirmed patient corresponds to one confirmed label. The confirmed label includes the etiology of the confirmed patient.
The tumor classification prediction model provided by the invention can predict whether a tumor is benign or malignant, and can also predict the specific cancer type of a malignant tumor. The confirmed patient samples can be collected, and the confirmed labels adjusted, according to the actual situation.
Specifically, if a tumor classification prediction model for benign/malignant classification is to be constructed, the confirmed label corresponding to the original tissue sample of a hepatocellular carcinoma or cholangiocellular carcinoma patient is a label representing a malignant tumor, and the confirmed label corresponding to a focal nodular hyperplasia patient is a label representing a benign tumor.
If a tumor classification prediction model for classifying the specific cancer type of a malignant tumor is to be constructed, the confirmed label corresponding to the original tissue sample of a hepatocellular carcinoma patient is a label representing hepatocellular carcinoma, and the confirmed label corresponding to the original tissue sample of a cholangiocellular carcinoma patient is a label representing cholangiocellular carcinoma. The confirmed label corresponding to a focal nodular hyperplasia patient is a label representing a non-malignant tumor, or an empty label. Because focal nodular hyperplasia is a benign tumor, the confirmed patient samples of focal nodular hyperplasia patients may be left out of training when a tumor classification prediction model for the specific cancer type of a malignant tumor is constructed.
The label value of the confirmed label may be a numerical value, a specific symbol or specific text, which is not limited herein. The form of the subsequent classification labels is likewise not limited.
For example, "0" may be used as a tag value corresponding to benign tumor, and "1" may be used as a tag value corresponding to malignant tumor. The "benign" word may be used directly as a label value corresponding to benign tumor, and the "malignant" word may be used as a label value corresponding to malignant tumor.
The number of cases need not be the same for each type of confirmed patient. In a specific embodiment of the present disclosure, original tissue samples of 25 confirmed patients are obtained, including 7 hepatocellular carcinoma patients, 11 cholangiocellular carcinoma patients and 7 focal nodular hyperplasia patients.
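The two labelling schemes described above can be illustrated with the 25-patient cohort of this embodiment. The abbreviations HCC, CCA and FNH, the task names and the function `encode` are hypothetical conveniences; the label values 0 and 1 follow the example given in the text.

```python
from collections import Counter

# HCC: hepatocellular carcinoma, CCA: cholangiocellular carcinoma,
# FNH: the benign nodular hyperplasia group of the cohort.
MALIGNANT = {"HCC", "CCA"}


def encode(diagnosis, task):
    """Map a confirmed diagnosis to a training label for the given task."""
    if task == "benign_vs_malignant":
        return 1 if diagnosis in MALIGNANT else 0
    if task == "cancer_subtype":
        # Benign cases get an empty label and can be dropped from training.
        return diagnosis if diagnosis in MALIGNANT else None
    raise ValueError(f"unknown task: {task}")


cohort = ["HCC"] * 7 + ["CCA"] * 11 + ["FNH"] * 7
binary = Counter(encode(d, "benign_vs_malignant") for d in cohort)
subtype = [encode(d, "cancer_subtype") for d in cohort]
subtype_train = [s for s in subtype if s is not None]
```

With this cohort the benign/malignant task sees 18 malignant and 7 benign samples, while the subtype task trains on the 18 malignant samples only.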
S11-1-2, preparing single cell suspensions of the original tissue samples respectively to obtain single cell suspensions corresponding to the original tissue samples. Wherein each original tissue sample corresponds to a single cell suspension.
The invention provides a method for preparing single cell suspension, which specifically comprises the following steps:
obtaining an original tissue sample, rinsing it with 10-20 ml of cell washing solution, and cutting away blood clots on the sample to obtain a pretreated original tissue sample; then removing the cell washing solution and adding fresh cell washing solution;
selecting structurally intact tissue from the pretreated original tissue sample, adding 5 ml of tissue dissociation solution, and cutting the tissue with scissors into pieces of about 1 mm³ while it is immersed in the tissue dissociation solution, to obtain an original tissue dissociation sample. The tissue dissociation solution is obtained by dissolving 0.1 g of type II collagenase in 50 ml of DMEM medium.
Incubating the original tissue dissociation sample in a shaking water bath at 37 °C for 20 minutes to obtain an original tissue digest; then filtering the digest through a 70 μm filter, rinsing the filter with 5 ml of cell washing solution, and storing the result on ice to obtain an original tissue filtrate.
Centrifuging the original tissue filtrate at 4 °C for 6 minutes and removing the supernatant; resuspending the pellet in 10 ml of cell washing solution to obtain a resuspension of the original tissue sample; centrifuging the resuspension at 4 °C for 10 minutes to obtain a cell resuspension of the original tissue sample; and taking 3 ml of the cell resuspension to resuspend the pellet, thereby preparing the single-cell suspension corresponding to the original tissue sample.
The cell viability of the single-cell suspension is above 85% and the cell concentration is 700-1200 cells/μl, in preparation for loading onto the 10x Genomics Chromium™ system.
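The acceptance criteria and the reagent concentration stated above reduce to simple arithmetic, sketched below. Treating 85% as an inclusive lower bound is an assumption, and `suspension_ready` is a hypothetical helper name.

```python
def suspension_ready(viability_pct, cells_per_ul):
    """Check a single-cell suspension against the stated loading criteria."""
    # ASSUMPTION: "above 85%" is read as 85% or higher.
    return viability_pct >= 85.0 and 700 <= cells_per_ul <= 1200


# Tissue dissociation solution: 0.1 g (100 mg) type II collagenase in 50 ml.
collagenase_mg = 100
medium_ml = 50
collagenase_mg_per_ml = collagenase_mg / medium_ml  # 2.0 mg/ml

ok = suspension_ready(viability_pct=90.0, cells_per_ul=1000)
too_dead = suspension_ready(viability_pct=80.0, cells_per_ul=1000)
too_dense = suspension_ready(viability_pct=90.0, cells_per_ul=1500)
```

A suspension that fails either check would need to be re-prepared or diluted before loading.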
S11-1-3, sequencing single cell suspensions corresponding to each original tissue sample respectively to obtain sequencing data corresponding to the original tissue samples.
Specifically, on the 10x Genomics platform, single cells are captured from the single-cell suspension with gel beads in oil droplets to generate GEMs (gel beads in emulsion), completing single-cell separation. Each cell is encapsulated in its own GEM, that is, one oil droplet per cell, for subsequent molecular manipulation.
After the GEMs are generated, the gel bead in each GEM gradually dissolves and releases the barcode sequences it carries. The mRNA in the cell binds to the barcode sequence and a unique molecular identifier (UMI) in the GEM, forming mRNA tagged with the barcode sequence and UMI; the mRNA is then reverse-transcribed in a PCR instrument to obtain cDNA. The resulting cDNA carries the barcode sequence and UMI, so that it can be accurately identified and detected in subsequent analysis.
When reverse transcription is complete, the GEMs are broken to release the cDNA carrying the barcode sequence and UMI; the oil phase is separated off, and the first-strand cDNA is purified and enriched with magnetic beads. The magnetic beads bind specifically to cDNA, which is separated from other impurities by magnetic force; removing impurities and unwanted nucleic acid molecules improves the accuracy and sensitivity of subsequent steps.
After the impurities are removed, the cDNA is amplified and quality-checked, and library construction is carried out on the cDNA that passes quality inspection. In library construction the cDNA is converted into a gene information library suitable for sequencing, through steps including fragmentation, ligation of sequencing adaptors and sample index PCR. In one embodiment of the present specification, it specifically includes:
① Fragmenting the cDNA to obtain a plurality of cDNA fragments, namely, breaking cDNA molecules into lengths suitable for reading by a sequencing platform; preferably, the fragment distribution is centered between 300-700 bp.
② Modifying the cDNA fragment to obtain a modified cDNA fragment; that is, the linker and other necessary sequences required for sequencing are added;
③ Assigning an index to the single cell suspension; that is, single cell suspensions are assigned a unique Index tag (Index) that is used to distinguish between corresponding sequencing data for different single cell suspensions.
④ Carrying out PCR amplification on the modified cDNA fragment; that is, the modified cDNA fragments corresponding to each index tag are subjected to PCR amplification to obtain a gene information library corresponding to the single cell suspension, and the cDNA fragments of the patient to be diagnosed are stored in the gene information library.
⑤ Performing quality inspection on the gene information library to ensure that its quality and concentration meet the sequencing requirements.
mRNA is captured from single cells through GEMs and converted into a gene information library for high-throughput sequencing, which improves the accuracy and speed of subsequent sequencing.
After the gene information library is obtained, it is sequenced on the Illumina sequencing platform, and one set of sequencing data is obtained for each original tissue sample; the sequencing data is preferably in FASTQ format.
S11-1-4, correcting sequencing data corresponding to each original tissue sample;
(1) Carrying out data quality statistics on the sequencing data to obtain a quality statistics result;
Data quality statistics are computed on the sequencing data with Cell Ranger software. In one embodiment of the present description, the quality statistics include: the proportion of valid barcodes, the percentage of bases with a Phred quality score (Q) greater than 30 among all bases, the proportion of barcode sequence bases above Q30, the proportion of RNA sequence bases above Q30, and the proportion of UMI sequence bases above Q30.
(2) Judging whether the quality statistics result conforms to the quality evaluation rule. If it does not, the evaluation is judged unqualified, and either the single-cell suspension is re-sequenced or a new single-cell suspension is prepared and sequenced. If it does, the evaluation is judged qualified, and the sequencing data is taken as qualified sequencing data for use in the subsequent prediction process.
In one embodiment of the present description, the quality evaluation rule is that the proportion of valid barcodes, the percentage of bases with a Phred quality score greater than 30 among all bases, the proportion of barcode sequence bases above Q30, the proportion of RNA sequence bases above Q30, and the proportion of UMI sequence bases above Q30 each reach more than 90%.
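The quality evaluation rule above can be illustrated with a minimal Python sketch. The metric names in the dictionary are hypothetical labels chosen for illustration and are not Cell Ranger output field names; treating a value of exactly 90% as passing is also an assumption, since the text does not say whether the bound is inclusive.

```python
# Minimal sketch of the quality evaluation rule: every metric must reach 90%.
# Metric names are hypothetical; the inclusive comparison (>=) is an assumption.
QUALITY_THRESHOLD = 0.90


def passes_quality_rules(stats, threshold=QUALITY_THRESHOLD):
    """Return (passed, failed_metrics) for a dict of metric name -> fraction in [0, 1]."""
    failed = [name for name, value in stats.items() if value < threshold]
    return (not failed, failed)


example_stats = {
    "valid_barcodes": 0.96,
    "bases_q30": 0.93,
    "barcode_q30": 0.95,
    "rna_read_q30": 0.91,
    "umi_q30": 0.94,
}
passed, failed = passes_quality_rules(example_stats)
print(passed, failed)
```

A sample whose statistics fail the check would be re-sequenced or re-prepared, as described in step (2).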
(3) Correcting the sequencing data which are qualified in evaluation to obtain corrected sequencing data;
In one embodiment of the present disclosure, the sequencing data that passed quality evaluation is aligned to a reference genome from the Ensembl database to obtain an alignment result; based on the alignment result, data that does not align consistently is deleted from the qualified sequencing data, yielding the corrected sequencing data.
S11-1-5, obtaining the expression relation between cells and genes based on corrected sequencing data, and constructing an original gene expression matrix corresponding to an original tissue sample;
cDNA molecules from the same cell carry the same barcode sequence together with UMI labels. Genes can therefore be assigned to cells on the basis of the barcode sequence.
First, the corrected sequencing data is split according to the Barcode sequence, so that transcription data of each cell is obtained. The transcription data of each cell was deduplicated, removing duplicate transcription data generated due to PCR amplification.
The de-duplicated transcriptional data is measured and counted by identifying the Barcode sequence and the UMI marker to quantify the expression value of each gene in each cell, including the number of mRNA molecules or protein content of the gene. Based on the expression values of the respective genes in each cell, the expression relationship between the cells and the genes is obtained as gene expression data.
Constructing an original gene expression matrix corresponding to the original tissue sample. Specifically, summarizing all gene expression data of an original tissue sample, and constructing an original gene expression matrix; wherein each row of the original gene expression matrix represents a gene, each column of the original gene expression matrix represents a cell, and each numerical value in the original gene expression matrix represents the expression value of the corresponding gene in the corresponding cell.
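The splitting, de-duplication and counting steps above can be sketched as follows. This is an illustrative simplification, assuming each aligned read has already been reduced to a (barcode, UMI, gene) triple; the function name and the toy reads are hypothetical, and a production pipeline such as Cell Ranger additionally corrects sequencing errors in barcodes and UMIs.

```python
from collections import defaultdict


def build_expression_matrix(reads):
    """Build a genes x cells count matrix from (barcode, umi, gene) triples.

    Reads sharing barcode, UMI and gene are treated as PCR duplicates and
    counted once. Returns (genes, barcodes, matrix), where matrix[i][j] is the
    number of unique UMIs of genes[i] observed in the cell barcodes[j].
    """
    unique_molecules = set(reads)  # de-duplicate PCR copies
    counts = defaultdict(int)
    for barcode, _umi, gene in unique_molecules:
        counts[(gene, barcode)] += 1
    genes = sorted({gene for gene, _ in counts})
    barcodes = sorted({barcode for _, barcode in counts})
    matrix = [[counts.get((g, b), 0) for b in barcodes] for g in genes]
    return genes, barcodes, matrix


# Toy input: two cells (barcodes) and two genes.
reads = [
    ("AAAC", "U1", "ALB"),
    ("AAAC", "U1", "ALB"),   # PCR duplicate of the read above
    ("AAAC", "U2", "ALB"),
    ("AAAC", "U3", "CD3E"),
    ("TTTG", "U1", "CD3E"),
]
genes, barcodes, matrix = build_expression_matrix(reads)
print(genes, barcodes, matrix)
```

Rows correspond to genes and columns to cells, matching the layout of the original gene expression matrix described above.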
S11-1-6, constructing a target gene expression matrix based on the original gene expression matrix;
(1) A predetermined number of cells is randomly selected from the original gene expression matrix by resampling, and the gene expression data of the selected cells is used to construct a target gene expression matrix. The diagnosis confirming label of the original gene expression matrix from which a target gene expression matrix is derived is used as the diagnosis confirming label of that target gene expression matrix. The predetermined number is preferably 5000 or more.
(2) Step (1) of S11-1-6 is repeated multiple times, so that multiple target gene expression matrices are obtained from one original gene expression matrix; each original tissue sample therefore corresponds to multiple target gene expression matrices.
The number of repetitions is 5 to 20, preferably 10. Thus, if step S11-1-6 (1) is repeated 10 times, 250 new target gene expression matrices can be obtained based on the original gene expression matrices of 25 original tissue samples.
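The resampling of S11-1-6 can be sketched in Python as below. The case names, the 6000 cells per case and the per-case seeds are hypothetical; sampling without replacement within each subset is also an assumption, since the text does not state how the resampling is drawn.

```python
import random


def resample_cells(cell_ids, n_cells, n_repeats, seed=0):
    """Draw n_repeats random subsets of n_cells cells (no repeats within a subset)."""
    if n_cells > len(cell_ids):
        raise ValueError("not enough cells to sample from")
    rng = random.Random(seed)
    return [rng.sample(cell_ids, n_cells) for _ in range(n_repeats)]


# 25 hypothetical cases with 6000 cells each; 10 repeats of 5000 cells per case.
cases = {
    f"case_{i:02d}": [f"case_{i:02d}_cell_{j}" for j in range(6000)]
    for i in range(25)
}
targets = {
    case: resample_cells(cells, n_cells=5000, n_repeats=10, seed=i)
    for i, (case, cells) in enumerate(cases.items())
}
total_targets = sum(len(subsets) for subsets in targets.values())
print(total_targets)  # 25 cases x 10 repeats = 250
```

Each subset of cell identifiers would then be used to slice the columns of the original gene expression matrix and inherit that matrix's diagnosis confirming label.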
S11-2, cell grouping is carried out based on the target gene expression matrix, and preset grouping information is obtained.
S11-2-1, fusing all target gene expression matrixes under the same original tissue sample to obtain a first gene expression matrix;
Preferably, the Seurat package can be used to fuse all target gene expression matrices under the same original tissue sample into a fused expression matrix; the Harmony package is then used to align and integrate the fused expression matrix, eliminating differences between experimental conditions and technical platforms, to obtain the first gene expression matrix. Each original tissue sample corresponds to one first gene expression matrix.
S11-2-2, analyzing whether the first gene expression matrix has batch effect;
The first gene expression matrix is analyzed by plotting to assess the degree of mixing between different cells. The tumor microenvironment is composed of a variety of cells, including malignant cells, stromal cells, immune cells, and the like. In one embodiment of the present disclosure, the mixing of immune cells with stromal cells can be evaluated. Plotting methods include, but are not limited to, scatter plots, heat maps, and cluster plots.
To assess whether a significant batch effect is present, the different first gene expression matrices are compared. Specifically, a PCA function is used to reduce the different first gene expression matrices to two or three dimensions, and it is observed whether the data points cluster together or are clearly separated.
If the points in the scatter plot are widely dispersed, a significant batch effect may be present. In that case, data integration or normalization is performed to ensure that the different first gene expression matrices are comparable. Data integration applies a linear transformation or normalization to the first gene expression matrices of different batches, so that the first gene expression matrices corresponding to different original tissue samples have the same scale or distribution. Normalization adjusts the scale of the data to the same level to facilitate later comparison and analysis.
S11-2-3, if the first gene expression matrix does not have a batch effect, obtaining marker genes from the first gene expression matrix;
Specifically, the first gene expression matrix is normalized; from the normalized result, 2000 highly variable genes are selected with the FindVariableFeatures function; linear dimension reduction (PCA) is performed on the 2000 highly variable genes with the RunPCA function to identify 50 main genes with marked differences in expression pattern; the first 20 of these 50 main genes are selected and used as marker genes.
S11-2-4, performing unsupervised clustering on cells through the marker genes to obtain preset grouping information;
The preset grouping information comprises a corresponding relation between cell groups and cells; wherein each cell population is associated with at least one cell;
In one embodiment of the present description, cell grouping is performed by unsupervised clustering. Specifically, cells containing a marker gene are determined from the target gene expression matrix and taken as target cells; the proximity relationship between the individual target cells is determined with the FindNeighbors function; the size and number of cell clusters are determined by the granularity of the UMAP clustering; combining the proximity relationship between target cells with the size and number of cell clusters, the FindClusters function divides the target cells into different cell populations. The granularity of the UMAP clustering is preferably 0.5.
Traditional tumor diagnosis relies on pathological analysis of tumor cells, but cancers of unknown primary, which account for 7% of all cancers, are very difficult to diagnose and treat. Single-cell RNA sequencing can comprehensively analyze individual tumor cells and/or immune cells, and can characterize different cell subsets, determine population heterogeneity and dissect cell fate branching points, which goes well beyond conventional molecular or pathological methods. In addition, clinical diagnosis from very small tissue samples is very difficult. Therefore, using the sequencing results to understand in depth the interactions between the tumor and the different cell types in the microenvironment can improve the interpretability and accuracy of tumor prediction classification. To improve interpretability, the invention uses the pathway enrichment matrix and the communication intensity matrix for prediction classification.
S11-3, obtaining a pathway enrichment matrix according to the target gene expression matrix, preset grouping information and first preset information;
Specifically, the first preset information includes a preset gene set and preset pathway information.
Combining the gene expression information of each cell in the preset gene set and the target gene expression matrix, and calculating first enrichment information;
S11-3-1, loading the msigdbr package and using the get_GO_genesets function to obtain a preset gene set, wherein the preset gene set comprises a plurality of preset gene set pathways; preferably, the preset gene set comprises the human C5 gene sets.
S11-3-2, determining a plurality of preset gene set passages through a preset gene set, and determining first enrichment information according to gene expression information of each cell in a target gene expression matrix, wherein the first enrichment information comprises: enrichment scores of the corresponding preset gene set pathways of each cell; preferably, the activity level of the corresponding preset gene set pathway (GSVA score) of each cell is calculated by GSVA function and used as enrichment score.
S11-3-3, classifying and summarizing all the first enrichment information according to preset grouping information to obtain second enrichment information, wherein the second enrichment information comprises enrichment scores of preset gene set passages corresponding to cell clusters;
As described above, the preset grouping information includes the correspondence between cell groups and cells. The cells belonging to the same cell group are determined from the preset grouping information; then, for each preset gene set pathway, the average of the enrichment scores of all cells in the group is taken as the enrichment score of that cell group on the pathway.
S11-3-4, screening the second enrichment information according to the preset pathway information, and taking the second enrichment information that contains the dimension-reducing pathways as target enrichment information;
The preset pathway information is predetermined and comprises a plurality of dimension-reducing pathways.
In one embodiment of the present specification, if the present invention is to be used for predicting benign and malignant classification of tumor (liver tumor), the preset pathway information includes 25 dimension-reducing pathways, as shown in table 1 in detail:
(Table 1)
If the present invention is to be used for predicting a specific cancer species classification of malignancy (liver tumor), the preset pathway information includes 18 dimension-reducing pathways, as shown in table 2 in detail:
No.  Gene pathway ID    No.  Gene pathway ID    No.  Gene pathway ID
1    GO:0031589         7    GO:0050901         13   GO:1902042
2    GO:0002181         8    GO:0051346         14   GO:0043270
3    GO:0030036         9    GO:0001503         15   GO:0016337
4    GO:0007159         10   GO:0045785         16   GO:0052547
5    GO:0030593         11   GO:0007157         17   GO:0050863
6    GO:0050900         12   GO:0001819         18   GO:0042110
(Table 2)
The second enrichment information is screened on the basis of the dimension-reducing pathways to obtain the pathway enrichment matrix, which serves as the target enrichment information. The pathway enrichment matrix comprises the enrichment score of each cell population on each dimension-reducing pathway.
Wherein each row of the pathway enrichment matrix represents a cell population, each column of the pathway enrichment matrix represents a dimension-reducing pathway, and the values of the pathway enrichment matrix represent the enrichment scores of the corresponding cell population in the corresponding dimension-reducing pathway, wherein the enrichment scores are used for displaying the activity degree of the cell population in the dimension-reducing pathway.
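Steps S11-3-3 and S11-3-4 can be sketched together in Python: per-cell enrichment scores are averaged within each cell group, and only the selected dimension-reducing pathways are kept as columns. The function name, the cell and group names and the scores are hypothetical; two pathway IDs are taken from Table 2, and "GO:0000001" stands in for any pathway that is not selected.

```python
from collections import defaultdict


def pathway_enrichment_matrix(cell_scores, cell_to_group, selected_pathways):
    """Average per-cell scores within each group and keep the selected pathways.

    cell_scores: cell -> {pathway_id: enrichment score}
    cell_to_group: cell -> cell group (the preset grouping information)
    Returns (groups, matrix): one row per group, one column per selected pathway.
    """
    members = defaultdict(list)
    for cell, group in cell_to_group.items():
        members[group].append(cell)
    groups = sorted(members)
    matrix = []
    for group in groups:
        cells = members[group]
        row = [
            sum(cell_scores[c][p] for c in cells) / len(cells)
            for p in selected_pathways
        ]
        matrix.append(row)
    return groups, matrix


cell_scores = {
    "c1": {"GO:0031589": 0.25, "GO:0002181": 0.50, "GO:0000001": 0.9},
    "c2": {"GO:0031589": 0.75, "GO:0002181": 0.00, "GO:0000001": 0.1},
    "c3": {"GO:0031589": -0.5, "GO:0002181": 0.25, "GO:0000001": 0.0},
}
cell_to_group = {"c1": "group_0", "c2": "group_0", "c3": "group_1"}
selected = ["GO:0031589", "GO:0002181"]
groups, enrichment = pathway_enrichment_matrix(cell_scores, cell_to_group, selected)
print(groups, enrichment)
```

The resulting rows are cell groups and the columns are dimension-reducing pathways, as in the pathway enrichment matrix described above.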
S11-4, obtaining a communication intensity matrix according to a target gene expression matrix, preset grouping information and a set of marker genes;
Marker genes are genes whose expression levels are significantly higher or lower than normal, and which can serve as over-expressed ligand or receptor candidate genes. Judging the specific expression type of each marker gene, taking the over-expressed ligand as a preset ligand, taking the receptor candidate gene as a preset receptor, and taking the set of the preset ligand and the preset receptor as second preset information.
Cellular gene expression data is mapped onto a protein-protein interaction network (PPI network) to determine the receptor-ligand pair level. In one embodiment of the present description, this mapping determines which cells express which receptors and ligands. By comparing the cell-gene expression relationships of different cells, it can be determined which preset receptors and preset ligands are over-expressed in a particular cell. If a preset ligand or preset receptor is over-expressed, the interaction relationship between the preset ligand and the preset receptor is recognized as a receptor-ligand pair level.
The probability of communication between cells is calculated from the receptor-ligand pair level. In one embodiment of the present disclosure, after the receptor-ligand pair level is determined, the probability of communication between the receptor-containing cell and the ligand-containing cell is calculated from it with the computeCommunProb function. Based on the communication probabilities of the receptor-ligand pairs, a permutation test is performed to infer the biologically meaningful cell-cell communication probability.
Preferably, the communication analysis is performed with the CellChat package. Interactions of receptor-ligand pairs between cell populations are analyzed by importing the Secreted Signaling dataset in CellPhoneDB.
The cells are grouped according to the preset grouping information, and the communication intensity information to be detected between each pair of cell groups is calculated.
In one embodiment of the present disclosure, the average of all communication probabilities between two cell populations is obtained as the intensity of communication between the two cell populations.
A target communication matrix, that is, the communication intensity matrix, is constructed from the communication intensities among the cell groups; each row of the target communication matrix represents one cell group, and each column represents one cell group. Each value of the target communication matrix represents the communication intensity between the cell group of its row and the cell group of its column. The communication intensity reflects the capacity for information exchange between two cell populations.
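The averaging of communication probabilities into a group-by-group matrix can be sketched as follows. This is an illustration only: the function name and the toy probabilities are hypothetical, the matrix is assumed to be directed (row = ligand-side group, column = receptor-side group), and group pairs with no recorded probability are assumed to have intensity 0. It does not reproduce how CellChat computes the probabilities themselves.

```python
def communication_intensity_matrix(pair_probs, cell_to_group):
    """Average cell-to-cell communication probabilities per pair of groups.

    pair_probs: {(sender_cell, receiver_cell): communication probability}
    cell_to_group: cell -> cell group (the preset grouping information)
    Returns (groups, matrix) with matrix[i][j] = mean probability from
    cells of groups[i] to cells of groups[j], or 0.0 if none was recorded.
    """
    groups = sorted(set(cell_to_group.values()))
    index = {group: i for i, group in enumerate(groups)}
    n = len(groups)
    sums = [[0.0] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    for (sender, receiver), prob in pair_probs.items():
        i = index[cell_to_group[sender]]
        j = index[cell_to_group[receiver]]
        sums[i][j] += prob
        counts[i][j] += 1
    matrix = [
        [sums[i][j] / counts[i][j] if counts[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    return groups, matrix


cell_groups = {"a": "T_cell", "b": "T_cell", "c": "macrophage"}
pair_probs = {("a", "c"): 0.25, ("b", "c"): 0.75, ("c", "a"): 0.1}
comm_groups, comm_matrix = communication_intensity_matrix(pair_probs, cell_groups)
print(comm_groups, comm_matrix)
```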
S11-5, taking the diagnosis confirming label corresponding to the target gene expression matrix, together with the pathway enrichment matrix and the communication intensity matrix obtained from that target gene expression matrix, as model training data. That is, one target gene expression matrix corresponds to one item of model training data.
S12, constructing a tumor classification prediction model;
Single cell RNA sequencing can perform comprehensive analysis on single tumor cells and immune cells, and can further determine population heterogeneity, analyze interactions between cells and immune cells, and dissect cell fate branching points. However, in order to process multidimensional single-cell RNA sequencing data, powerful computational methods are required to support. Neural networks are well suited for extracting and learning hidden features directly from massive data.
To mine the relationships among the cell populations, the enrichment scores of the dimension-reducing pathways, the communication intensities among cell populations and the classification labels, and thereby improve the interpretability of the predicted classification, in one embodiment of the present description the tumor classification prediction model is constructed on the basis of a graph neural network (GNN).
In one embodiment of the present description, the tumor classification prediction model comprises an input layer, a hidden layer and an output layer, wherein the hidden layer comprises 2-4 unidirectional graph convolution layers and a single fully connected layer.
Specifically, the input layer receives the pathway enrichment matrix and the communication intensity matrix;
The 2-4 unidirectional graph convolution layers combine the pathway enrichment matrix and the communication intensity matrix into a graph data structure, in which the cell populations are the input nodes, the enrichment scores in the pathway enrichment matrix are the features of the input nodes, and the communication intensities in the communication intensity matrix are the edge weights between input nodes.
A potential master node is added in each graph convolution layer and connected to every input node in the graph, and the edge weight between the master node and each input node is set to the average weight of the connection group.
The potential master node has an all-zero vector as its initial feature value and collects information from all input nodes in the graph. It acts as a global temporary space to which every node writes information but from which no node reads, so that information can be gathered across long distances during the propagation phase.
In each graph convolution layer, the size of the hidden layer exactly matches the number of selected pathways (i.e., the input dimension). That is, the number of nodes in the hidden layer equals the input dimension, which helps preserve the expressive power of the network.
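The graph structure described above can be sketched in plain Python. This is a simplified illustration under several assumptions, not the model itself: self-loops are omitted, "the average weight of the connection group" is interpreted as the mean of all edge weights between input nodes, and the `aggregate` step is a single weighted-sum message pass with no learnable parameters, shown only to make the write-only role of the master node visible. All function names are hypothetical.

```python
def build_graph(enrichment, communication):
    """Combine the two matrices into node features and weighted directed edges.

    enrichment: n_groups x n_pathways (node features)
    communication: n_groups x n_groups (edge weights)
    Returns (features, edges, master), where the master node is the last node.
    """
    n = len(enrichment)
    n_features = len(enrichment[0])
    # Input nodes keep their enrichment scores; the master node starts at zero.
    features = [list(row) for row in enrichment] + [[0.0] * n_features]
    master = n
    edges = {
        (i, j): communication[i][j]
        for i in range(n)
        for j in range(n)
        if i != j and communication[i][j] > 0
    }
    mean_weight = sum(edges.values()) / len(edges) if edges else 0.0
    for i in range(n):
        # One-way edges: input nodes write to the master node, nothing reads back.
        edges[(i, master)] = mean_weight
    return features, edges, master


def aggregate(features, edges):
    """One parameter-free message pass: add weighted features of in-neighbours."""
    out = [list(f) for f in features]
    for (src, dst), weight in edges.items():
        for k, value in enumerate(features[src]):
            out[dst][k] += weight * value
    return out


enrichment_example = [[1.0, 0.0], [0.0, 2.0]]
communication_example = [[0.0, 0.5], [1.5, 0.0]]
features, edges, master = build_graph(enrichment_example, communication_example)
updated = aggregate(features, edges)
print(updated[master])
```

In a trained model, each layer would additionally apply learnable weights and a non-linearity, and the master node's final feature vector would be passed to the fully connected layer.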
Single fully connected layer: the feature vector of the potential master node of the last graph convolution layer is projected to the output layer through the single fully connected layer.
Output layer: calculates and outputs the probability of each classification label through the sigmoid function. The predicted classification label is determined with a probability of 0.5 as the threshold. The classification labels are set with reference to the definitive label.
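The output-layer decision can be sketched as below. The function names are hypothetical, and assigning a probability of exactly 0.5 to the positive class is an assumption, since the text only names 0.5 as the threshold.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def predict_label(logit, threshold=0.5):
    """Return (label, probability); label is 1 when probability >= threshold."""
    probability = sigmoid(logit)
    return (1 if probability >= threshold else 0), probability


print(predict_label(2.0))
print(predict_label(-2.0))
```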
All layers share the same set of learnable parameters. This arrangement helps to avoid overfitting because the entire network only learns one shared parameter set, rather than learning separate parameters for each layer.
The characteristics of the input nodes and the edge weights between the input nodes are adjusted before each training, testing and using the tumor classification prediction model.
Specifically, the adjusting the characteristics of the input node includes:
For each enrichment score in the pathway enrichment matrix, its z-score is calculated; the z-score is used as the normalized value, and the node features of the input nodes are updated accordingly.
Adjusting the edge weights between input nodes comprises: calculating the average communication intensity, normalizing each edge weight by dividing it by the average communication intensity, and updating the edge weight between the two input nodes accordingly.
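The two adjustments can be sketched in Python. The text does not state over which values the mean and standard deviation of the z-score are taken; the sketch assumes they are computed per pathway (per column) across cell groups, uses the population standard deviation, and maps zero-variance columns to 0. These choices and the function names are assumptions for illustration.

```python
import statistics


def zscore_columns(matrix):
    """Standardize each column: z = (x - column mean) / column standard deviation."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in matrix
    ]


def normalize_edges(edges):
    """Divide every edge weight by the average communication intensity."""
    mean_weight = statistics.fmean(list(edges.values()))
    if mean_weight == 0:
        raise ValueError("average communication intensity is zero")
    return {pair: weight / mean_weight for pair, weight in edges.items()}


node_features = zscore_columns([[1.0, 10.0], [3.0, 10.0]])
edge_weights = normalize_edges({(0, 1): 1.0, (1, 0): 3.0})
print(node_features, edge_weights)
```

After this normalization the edge weights average to 1, which keeps graphs built from different samples on a comparable scale.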
The invention obtains the activity degree of the cell population in the dimension-reducing channel and the information communication capability among the cell population based on single-cell RNA sequencing, and takes the activity degree as a prediction basis, thereby improving the robustness of classifying the high-heterogeneity tissue.
S13, dividing model training data, determining a training set and a testing set, training the tumor classification prediction model by using the training set, and evaluating the tumor classification prediction model by using the testing set;
Specifically, one original gene expression matrix is selected each time; the corresponding target gene expression matrices are selected, and their model training data is taken as the test samples. For the remaining, unselected original gene expression matrices, all of their target gene expression matrices are determined, and the corresponding model training data is taken as the training samples. In one embodiment of the present disclosure, each original gene expression matrix corresponds to 10 target gene expression matrices, so 25 original gene expression matrices correspond to 250 target gene expression matrices in total; each time, the 10 target gene expression matrices of one original gene expression matrix are selected as test samples, and all target gene expression matrices of the remaining 24 original gene expression matrices (240 in total) are used as training samples. Because the original tissue samples behind the training set and the test set do not overlap, that is, the two sets are completely separated at the case level, information leakage between them is avoided.
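The case-level split can be sketched as a small generator. The case and replicate names are hypothetical placeholders for the target gene expression matrices of each original tissue sample.

```python
def leave_one_case_out(case_to_samples):
    """Yield (held_out_case, train_samples, test_samples) for every case."""
    for held_out in sorted(case_to_samples):
        test = list(case_to_samples[held_out])
        train = [
            sample
            for case, samples in sorted(case_to_samples.items())
            if case != held_out
            for sample in samples
        ]
        yield held_out, train, test


# 25 hypothetical cases, each with 10 resampled target matrices.
case_samples = {
    f"case_{i:02d}": [f"case_{i:02d}_rep_{r}" for r in range(10)]
    for i in range(25)
}
splits = list(leave_one_case_out(case_samples))
held_out, train, test = splits[0]
print(held_out, len(train), len(test))  # case_00 240 10
```

Because every resampled matrix of the held-out case goes to the test side, no cell from that case can appear in training.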
Summarizing model training data corresponding to all training samples, and constructing a training set; and summarizing model training data corresponding to all the test samples, and constructing a test set.
When the tumor classification prediction model is trained, the node features and the corresponding edge weights of one randomly chosen input node (cell group) are discarded for each target gene expression matrix. The number of training samples may differ between categories; this form of data augmentation can be used to balance them, which removes the adverse effect of imbalanced training data and improves the generalization of the tumor classification prediction model.
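The node-dropping augmentation can be sketched as follows, assuming node features are stored per cell group and edges as (source, target) pairs; the function name and the toy graph are hypothetical.

```python
import random


def drop_random_node(features, edges, rng):
    """Remove one randomly chosen input node together with all of its edges.

    features: {group: [enrichment scores]}
    edges: {(source_group, target_group): weight}
    Returns (dropped_group, remaining_features, remaining_edges).
    """
    dropped = rng.choice(sorted(features))
    kept_features = {g: f for g, f in features.items() if g != dropped}
    kept_edges = {
        (src, dst): w for (src, dst), w in edges.items() if dropped not in (src, dst)
    }
    return dropped, kept_features, kept_edges


graph_features = {"g0": [0.1, 0.2], "g1": [0.3, 0.4], "g2": [0.5, 0.6]}
graph_edges = {("g0", "g1"): 0.7, ("g1", "g2"): 0.2, ("g2", "g0"): 0.9}
dropped, aug_features, aug_edges = drop_random_node(
    graph_features, graph_edges, random.Random(0)
)
print(dropped, sorted(aug_features), sorted(aug_edges))
```

Applying the function several times to samples of an under-represented category yields distinct augmented graphs, which is one way the balancing described above could be realized.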
S13-1, training a tumor classification prediction model by using a training set; verifying the trained model by using the corresponding internal test set;
In one embodiment of the present description, 100 iterations of training are performed when training a tumor classification predictive model. In each iteration, the following steps are performed:
(1) Forward propagation: receiving input data using a tumor classification prediction model (a graph neural network); an original output (logits) is calculated using the current model parameters, where the original output includes the predictive labels and their predictive probabilities.
(2) Calculating loss: the actual definitive label and the raw output (logits) will be used to calculate the loss value.
(3) Back propagation: starting from the calculated loss value, the back propagation algorithm computes the gradient for each parameter, that is, the partial derivative of the loss function with respect to that model parameter.
(4) Updating parameters: model parameters were updated using Adam optimizer.
Preferably, the loss function used to calculate the loss is the binary cross entropy loss with logits (BCEWithLogitsLoss). The Adam optimizer is used, with its parameters set according to practical conditions. Preferably, the initial learning rate is 0.001, the batch size is 4, and the total number of iterations is 100.
These steps will be repeated in each training iteration until a preset number of iterations is reached or other stopping conditions are met. By iterating and parameter updating, the model will gradually learn and improve its predictive performance during the training process.
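The loss computed in step (2) can be illustrated with a plain-Python version of binary cross entropy on logits, using the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)), which equals -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))]. The function name and the example batch are hypothetical; the text does not state which deep learning framework is used, and the gradient and Adam update steps are left to such a framework.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross entropy computed directly from raw logits."""
    if len(logits) != len(targets) or not logits:
        raise ValueError("logits and targets must be non-empty and equally long")
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# One hypothetical mini-batch of 4 samples (the preferred batch size above).
batch_logits = [2.0, -1.0, 0.5, -3.0]
batch_labels = [1.0, 0.0, 1.0, 0.0]
print(bce_with_logits(batch_logits, batch_labels))
```

Working on logits rather than on sigmoid outputs avoids taking the logarithm of values that round to 0 for very large or very small logits.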
In one embodiment of the present description, a leave-one-out cross validation (LOOCV) is performed on the tumor classification prediction model, which employs different training and test sets for each round of LOOCV (one iteration).
S13-2, evaluating the performance and reliability of the tumor classification prediction model based on the verification result.
Specifically, an evaluation index is calculated after every 10 rounds of LOOCV; the evaluation index is computed by pooling all predicted labels and their predicted probabilities from those 10 rounds, and comprises the area under the ROC curve (AUC) and the F1 score.
After 100 rounds of LOOCV, 10 evaluation indices were obtained for evaluating the performance and reliability of the model.
Wherein, the training set and the testing set are both internal data sets.
In one embodiment of the present specification, if the tumor classification prediction model is for benign and malignant classification of a tumor, the evaluation based on the test results specifically comprises: summarizing the area under the ROC curve (AUC) and F1 score for predicting a tumor as benign on the internal dataset; and summarizing the AUC and F1 score for the predicted tumor classification on the internal dataset, thereby confirming the accuracy of the tumor classification prediction model on highly heterogeneous original tissue samples.
If the tumor classification prediction model is for classifying the specific cancer type of a malignant tumor, the evaluation based on the test results specifically comprises: summarizing the AUC and F1 score for predicting the specific cancer type of the malignant tumor as hepatocellular carcinoma on the internal dataset; and summarizing the AUC and F1 score for predicting the specific cancer type of the malignant tumor as intrahepatic cholangiocarcinoma on the internal dataset.
Preferably, a TCGA dataset may also be acquired. The tumor classification prediction model is trained on the internal dataset and tested on the TCGA dataset. The evaluation based on the test results specifically comprises: summarizing the AUC and F1 score for predicting the specific cancer type of the malignant tumor as hepatocellular carcinoma on the TCGA dataset; and summarizing the AUC and F1 score for predicting the specific cancer type of the malignant tumor as intrahepatic cholangiocarcinoma on the TCGA dataset, thereby confirming the accuracy of the tumor classification prediction model on highly heterogeneous original tissue samples.
In addition to using the complete single-cell data described above as test data, model training data may be constructed from the cellular gene expression matrix of selected immune cells only, and from the cellular gene expression matrix of selected non-immune cells only, and each input into the tumor classification prediction model as test data. These two cases represent the two extremes of sample heterogeneity and are used to evaluate the accuracy of the model on highly heterogeneous tissue samples.
In one embodiment of the present description, when the area under ROC curve (AUC) >60% and F1 score >0.6, the evaluation of the tumor classification prediction model is deemed to pass, which can be used to make subsequent classification predictions.
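The evaluation indices and the pass criterion can be sketched in plain Python. The AUC is computed here with the rank-based (Mann-Whitney) formulation, in which tied scores count as half; the F1 score binarizes probabilities at 0.5. The function names and the pooled labels and probabilities are hypothetical examples, not results from the described experiments.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def f1_score(labels, scores, threshold=0.5):
    """F1 = 2*TP / (2*TP + FP + FN) after thresholding the probabilities."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def evaluation_passes(auc, f1, min_auc=0.6, min_f1=0.6):
    """Pass criterion from the text: AUC > 60% and F1 score > 0.6."""
    return auc > min_auc and f1 > min_f1


pooled_labels = [1, 1, 0, 0, 1, 0]
pooled_probs = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(pooled_labels, pooled_probs)
f1 = f1_score(pooled_labels, pooled_probs)
print(auc, f1, evaluation_passes(auc, f1))
```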
S2, acquiring original sequencing data of a user to be tested, and acquiring cell gene expression data based on the original sequencing data;
S21, obtaining a tissue sample to be tested of a user to be tested, and obtaining original sequencing data based on the tissue sample to be tested;
S21-1, obtaining a tissue sample to be tested of the user to be tested, and preparing single-cell suspension for the tissue sample to be tested to obtain single-cell suspension;
The tissue sample to be tested is a physiological tissue sample obtained from the user to be tested by needle biopsy. Compared with traditional needle-biopsy pathological diagnosis, the method places lower requirements on the integrity of the tissue sample to be tested.
The method for preparing the single cell suspension comprises the following steps:
(1) Pretreating the tissue sample to be tested to obtain a pretreated sample. Specifically, the tissue sample to be tested is rinsed with 10-20 ml of cell rinsing solution, blood clots on the sample are cut away, the used rinsing solution is removed, and fresh rinsing solution is added to obtain the pretreated sample;
(2) Cutting the pretreated sample to obtain a dissociated tissue sample to be tested;
(3) Lysing and filtering the dissociated tissue sample to be tested to obtain a filtrate to be tested;
(4) Centrifuging the filtrate to be tested to obtain a centrifuged filtrate;
(5) Pelleting and resuspending the centrifuged filtrate to obtain the single-cell suspension.
The specific method for preparing the single cell suspension can refer to step S11, and will not be described herein.
S21-2, constructing a gene information library according to the single cell suspension;
First, single cells are captured from the single-cell suspension and isolated. In one embodiment of the present disclosure, single-cell isolation is performed on the 10x Genomics platform, where gel beads in oil droplets capture single cells from the single-cell suspension to generate GEMs (Gel Beads-in-Emulsion). mRNA is then captured from the individual cells within the GEMs and converted into a cDNA library suitable for high-throughput sequencing, which serves as the gene information library.
For the construction of the gene information library from the single-cell suspension, refer to step S11-1-3 of the present invention; it is not repeated here.
S21-3, carrying out high-throughput sequencing on the gene information library to obtain original sequencing data.
In one embodiment of the present disclosure, the gene information library is sequenced on an Illumina sequencing platform to obtain the original sequencing data, which are preferably in FASTQ format.
S22, determining the expression relation between cells and genes from the original sequencing data to obtain gene expression data in each cell;
S22-1, correcting the original sequencing data to obtain corrected sequencing data;
Firstly, carrying out data quality statistics on original sequencing data to obtain a quality statistics result;
In one embodiment of the present description, the quality statistics include: the proportion of valid barcodes, the percentage of bases with a Phred quality score (Qphred) above 30 among all bases, and the proportions of bases above Q30 in the barcode sequences, the RNA sequences, and the UMI sequences.
Second, whether the quality statistics satisfy a quality evaluation rule is determined to obtain an evaluation result. If the quality statistics do not satisfy the rule, the evaluation fails, and either the single-cell suspension is re-sequenced or a new single-cell suspension is prepared and sequenced. If the quality statistics satisfy the rule, the evaluation passes, and the original sequencing data are treated as qualified sequencing data for use in the subsequent prediction process.
In one embodiment of the present description, the quality evaluation rule requires that the proportion of valid barcodes, the percentage of bases with Qphred above 30 among all bases, and the Q30 proportions of the barcode sequences, the RNA sequences, and the UMI sequences each reach 90% or more.
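By way of a hedged example, the quality evaluation rule may be checked as in the sketch below. The metric keys, the representation of proportions as fractions between 0 and 1, and the treatment of a missing metric as a failure are assumptions of this sketch rather than features stated in the description.

```python
# Hypothetical keys for the five quality statistics named in the description.
QC_METRICS = (
    "valid_barcodes",      # proportion of valid barcodes
    "q30_bases_total",     # share of all bases with Qphred above 30
    "q30_bases_barcode",   # share of barcode bases above Q30
    "q30_bases_rna",       # share of RNA read bases above Q30
    "q30_bases_umi",       # share of UMI bases above Q30
)


def failed_metrics(stats, minimum=0.90):
    """Return the metrics that are missing or fall below the minimum proportion."""
    return [name for name in QC_METRICS if stats.get(name, 0.0) < minimum]


def passes_quality_rule(stats, minimum=0.90):
    """The run is qualified only when every metric reaches the minimum."""
    return not failed_metrics(stats, minimum)
```

A run that fails would, per the description, be re-sequenced or re-prepared; `failed_metrics` simply reports which statistics caused the failure.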
Then, correcting the sequencing data which are qualified in evaluation to obtain corrected sequencing data;
In one embodiment of the present disclosure, the sequencing data that passed quality evaluation are aligned to a reference genome from the Ensembl database to obtain an alignment result;
The qualified sequencing data are then corrected based on the alignment result: data that do not align consistently are deleted from the qualified sequencing data to obtain the corrected sequencing data.
According to the application, sequencing data are obtained from the tissue sample to be tested, and information on the tumor and its microenvironment is extracted and used for prediction by a biologically interpretable graph neural network based on a prior data set. This effectively overcomes two shortcomings of existing approaches: conventional pathological diagnosis places high requirements on the tissue, and molecular pathological diagnosis based on specific markers is easily affected by tumor heterogeneity.
S22-2, determining the expression relation between cells and genes based on the corrected sequencing data, and taking the collection of the expression relation as the cell gene expression data;
The cDNA molecules from the same cell carry the same barcode sequence, and each also carries a UMI tag.
Thus, the gene information can be grouped by cell based on the barcode sequence. Specifically, the corrected sequencing data are split by barcode sequence to obtain the transcription data of each cell;
The transcription data of each cell are deduplicated to remove duplicate transcription data generated by PCR amplification. Then, the expression value of each gene in each cell is quantified by recognizing the barcode sequence and the UMI tag and counting the transcription data, and the association among cell, gene, and expression value is determined as the expression relationship between the cell and the gene. The expression value includes the number of mRNA molecules or the protein content of the gene. All the expression relationships are summarized to obtain the cell gene expression data.
Preferably, a gene expression matrix is constructed from the expression relationships of all cells and genes. Each row of the gene expression matrix represents one gene (Gene 1 to Gene x), each column represents one cell (Cell 1 to Cell y), and the entry in row i and column j is the expression value of Gene i in Cell j, where 0 < i ≤ x and 0 < j ≤ y.
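The deduplication and counting described above can be illustrated with the following minimal sketch. It assumes that each corrected read has already been reduced to a hypothetical (barcode, UMI, gene) triple, and it treats reads with an identical triple as PCR duplicates; real pipelines also tolerate sequencing errors in barcodes and UMIs, which is omitted here.

```python
from collections import defaultdict


def build_expression_matrix(reads):
    """Build a gene-by-cell UMI count matrix from (barcode, umi, gene) triples."""
    # A set keeps each (barcode, umi, gene) molecule once, dropping PCR duplicates.
    unique_molecules = set(reads)

    counts = defaultdict(int)
    for barcode, umi, gene in unique_molecules:
        counts[(gene, barcode)] += 1  # one unique UMI = one counted molecule

    genes = sorted({gene for gene, _ in counts})
    cells = sorted({barcode for _, barcode in counts})
    # Rows are genes and columns are cells, as in the matrix described above.
    matrix = [[counts.get((g, c), 0) for c in cells] for g in genes]
    return genes, cells, matrix
```

In this sketch the entry `matrix[i][j]` is the number of distinct UMIs observed for gene `genes[i]` in cell `cells[j]`, which corresponds to the mRNA molecule count mentioned in the description.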
S3, obtaining enrichment information to be detected according to the cell gene expression data, preset grouping information and first preset information;
The first preset information includes a preset gene set and preset pathway information.
The cellular gene expression data includes gene expression information in each cell.
S31, calculating initial enrichment information by combining the preset gene set with the expression relationships in the cell gene expression data, wherein the initial enrichment information comprises first enrichment association relationships among cells, preset gene set pathways, and initial enrichment scores;
A preset gene set is acquired. Specifically, the msigdbr package is loaded and a get_GO_genesets function is used to obtain the preset gene set, which comprises a plurality of preset gene set pathways. Preferably, the preset gene set comprises the C5 human gene set.
A plurality of preset gene set pathways are determined from the preset gene set, the initial enrichment score of each cell for each preset gene set pathway is obtained from the expression relationships in the cell gene expression data, and the first enrichment association relationship among the cell, the preset gene set pathway, and the initial enrichment score is determined.
All the first enrichment association relationships are summarized to obtain the initial enrichment information.
Preferably, the GSVA score of each cell for each preset gene set pathway is calculated with the GSVA function and used as the initial enrichment score.
S32, classifying the first enrichment association relationships according to the preset grouping information, and determining second enrichment association relationships among cell groups, preset gene set pathways, and target enrichment scores;
The preset grouping information comprises the correspondence between cell groups and cells. Based on the preset grouping information, the cells belonging to the same cell group are determined, all initial enrichment scores of those cells for a given preset gene set pathway are acquired, and the mean of those initial enrichment scores is taken as the target enrichment score of the cell group for that preset gene set pathway.
The second enrichment association relationship among the cell group, the preset gene set pathway, and the target enrichment score is thereby determined.
All the second enrichment association relationships are summarized to obtain the target enrichment information.
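The averaging in step S32 may be sketched as follows. The nested-dictionary layout and the function name are assumptions chosen for readability; only the rule itself (the target enrichment score is the mean of the per-cell initial enrichment scores within a cell group) comes from the description.

```python
from collections import defaultdict


def group_mean_scores(cell_scores, cell_to_group):
    """Average per-cell pathway scores within each cell group.

    cell_scores:   {cell: {pathway: initial enrichment score}}
    cell_to_group: {cell: cell group}  (the preset grouping information)
    returns:       {cell group: {pathway: target enrichment score}}
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for cell, scores in cell_scores.items():
        group = cell_to_group[cell]
        for pathway, score in scores.items():
            sums[group][pathway] += score
            counts[group][pathway] += 1
    return {
        group: {p: sums[group][p] / counts[group][p] for p in sums[group]}
        for group in sums
    }
```

For instance, two cells of the same group with scores 0.2 and 0.6 for a pathway yield a target enrichment score of 0.4 for that group and pathway.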
S33, screening the second enrichment association relationships according to the preset pathway information, and collecting the second enrichment association relationships that contain a dimension-reduction pathway as the enrichment information to be detected;
The preset pathway information is determined in advance and comprises a plurality of dimension-reduction pathways.
In one embodiment of the present specification, if the present invention is used to predict whether a liver tumor of the user to be tested is benign or malignant, the preset pathway information includes 25 dimension-reduction pathways, as shown in Table 1 above.
If the present invention is used to predict the specific cancer classification of the user to be tested, the preset pathway information includes 18 dimension-reduction pathways, as shown in Table 2 above.
The target enrichment information is screened based on the dimension-reduction pathways in the preset pathway information: the target enrichment score of each cell group for each dimension-reduction pathway is looked up, third enrichment association relationships among cell groups, dimension-reduction pathways, and target enrichment scores are determined, and all the third enrichment association relationships are summarized to obtain the enrichment information to be detected.
Alternatively, the second enrichment association relationships unrelated to the dimension-reduction pathways can be removed from the target enrichment information to obtain the enrichment information to be detected.
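The screening in step S33 amounts to keeping, for every cell group, only the scores of the preset dimension-reduction pathways. A minimal sketch follows; the pathway names used in the usage note are placeholders, since the actual pathways are those listed in Tables 1 and 2, and raising an error for a missing pathway is an assumption of this sketch.

```python
def select_pathways(group_scores, reduced_pathways):
    """Keep only the dimension-reduction pathways, in a fixed column order.

    group_scores:     {cell group: {pathway: target enrichment score}}
    reduced_pathways: ordered list of preset dimension-reduction pathways
    returns:          (sorted cell groups, one feature row per cell group)
    """
    groups = sorted(group_scores)
    features = []
    for group in groups:
        missing = [p for p in reduced_pathways if p not in group_scores[group]]
        if missing:
            raise KeyError(f"group {group!r} lacks scores for {missing}")
        features.append([group_scores[group][p] for p in reduced_pathways])
    return groups, features
```

Calling `select_pathways(scores, ["PATHWAY_C", "PATHWAY_A"])` returns one row per cell group containing only those two scores, in that order, so every cell group ends up with a feature vector of the same length.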
S4, obtaining information of communication intensity to be detected according to the cell gene expression data, the preset grouping information and the second preset information;
The second preset information comprises a plurality of preset receptors and a plurality of preset ligands.
S41, mapping the cell gene expression data onto a protein-protein interaction network, and determining the correlation between the preset ligand and the preset receptor in the cell gene expression data as a receptor-ligand pair level;
In one embodiment of the present description, the cellular gene expression data are mapped onto a protein-protein interaction network (PPI network) to determine which cells express which receptors and ligands. By comparing the cell-gene expression relationships of different cells, it can be determined which preset receptors and preset ligands are over-expressed in a particular cell. If a preset ligand and/or a preset receptor is over-expressed, the correlation between that preset ligand and that preset receptor is taken as the receptor-ligand pair level.
S42, calculating the communication probability between cells through the receptor-ligand pair level;
In one embodiment of the present disclosure, after the receptor-ligand pair level is determined, the probability of communication between the receptor-expressing cell and the ligand-expressing cell is calculated from the receptor-ligand pair level by the computeCommunProb function. Biologically meaningful cell-cell communication is then inferred by performing a permutation test on the communication probability of each receptor-ligand pair.
Preferably, the communication analysis is performed using the CellChat package. Interactions of receptor-ligand pairs between cell groups are analyzed by importing the Secreted Signaling dataset in CellPhoneDB.
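To illustrate the idea of a label-permutation test on a ligand-receptor interaction, a simplified sketch is given below. It is not the computation performed by the CellChat package: the interaction statistic (mean ligand expression in the sender group multiplied by mean receptor expression in the receiver group), the function names, and the default of 999 permutations are all assumptions made only for this example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def interaction_score(ligand, receptor, groups, sender, receiver):
    """Mean ligand level in the sender group times mean receptor level in the receiver group."""
    lig = mean([ligand[i] for i, g in enumerate(groups) if g == sender])
    rec = mean([receptor[i] for i, g in enumerate(groups) if g == receiver])
    return lig * rec


def permutation_p_value(ligand, receptor, groups, sender, receiver,
                        n_perm=999, seed=0):
    """Estimate how often randomly shuffled group labels score at least as high."""
    rng = random.Random(seed)
    observed = interaction_score(ligand, receptor, groups, sender, receiver)
    shuffled = list(groups)  # shuffling keeps the group sizes unchanged
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if interaction_score(ligand, receptor, shuffled, sender, receiver) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value indicates that the observed sender-receiver pairing scores higher than most random relabelings of the cells, which is the sense in which an interaction would be considered significant in this sketch.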
S43, grouping cells according to the preset grouping information, determining target communication intensity between two cell groups by combining the communication probability, and summarizing to generate the to-be-detected communication intensity information.
In one embodiment of the present disclosure, after the cells are grouped according to the preset grouping information, an average value of all communication probabilities between the two cell groups is obtained as the target communication intensity between the two cell groups. And obtaining the communication intensity relation between the two cell groups based on the target communication intensity between the two cell groups, and summarizing all the communication intensity relations to generate the communication intensity information to be detected.
Preferably, a communication intensity matrix is constructed from the target communication intensities between the cell groups. Each row of the communication intensity matrix represents one cell group (Group 1 to Group α), each column represents one cell group (Group 1 to Group β), and the entry in row m and column n is the target communication intensity between Group m and Group n, where 0 < m ≤ α and 0 < n ≤ β.
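The grouping and averaging of step S43 may be sketched as follows, assuming that the cell-level communication probabilities are available as a dictionary keyed by hypothetical (sender cell, receiver cell) pairs. Assigning 0.0 to group pairs with no recorded probability is an assumption of this sketch.

```python
def communication_intensity_matrix(cell_probs, cell_to_group):
    """Average cell-to-cell communication probabilities per pair of cell groups.

    cell_probs:    {(sender cell, receiver cell): communication probability}
    cell_to_group: {cell: cell group}  (the preset grouping information)
    returns:       (sorted cell groups, matrix with rows = sender, columns = receiver)
    """
    groups = sorted(set(cell_to_group.values()))
    index = {group: k for k, group in enumerate(groups)}
    size = len(groups)
    sums = [[0.0] * size for _ in range(size)]
    counts = [[0] * size for _ in range(size)]

    for (sender, receiver), prob in cell_probs.items():
        m = index[cell_to_group[sender]]
        n = index[cell_to_group[receiver]]
        sums[m][n] += prob
        counts[m][n] += 1

    matrix = [
        [sums[m][n] / counts[m][n] if counts[m][n] else 0.0 for n in range(size)]
        for m in range(size)
    ]
    return groups, matrix
```

Here `matrix[m][n]` plays the role of the target communication intensity between Group m and Group n.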
S5, substituting the enrichment information to be detected and the communication intensity information to be detected into a tumor classification prediction model to obtain a prediction classification result.
Each cell population (cell group) is taken as an input node. For each input node, the target enrichment scores of the cell group in the enrichment information to be detected are extracted as the node features of that input node, and the target communication intensity in the communication intensity information to be detected is extracted as the edge weight between two input nodes.
A graph convolution operation is performed for each input node to update the embedding vector of the node, taking into account the information of its neighbor nodes and the edge weights to those neighbors. In each convolution operation, the model updates the embedding vector of the input node using its internal parameters, which were learned during training. After the graph convolution operation, the feature information of the neighbor nodes is aggregated by an aggregation function to obtain a new feature representation of the current node.
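One possible form of such a graph convolution step is sketched below in plain Python. The description does not fix the aggregation function, the normalization, or the activation, so the weighted-mean aggregation with a self-loop and the ReLU activation used here are assumptions, as are the function and parameter names.

```python
def graph_conv_layer(features, edge_weights, weight):
    """One graph convolution step with weighted-mean aggregation and ReLU.

    features:     one feature vector per node (cell group)
    edge_weights: square matrix of non-negative edge weights between nodes
    weight:       learned parameter matrix of shape (in_dim, out_dim)
    """
    n_nodes = len(features)
    in_dim = len(features[0])
    out_dim = len(weight[0])
    updated = []
    for i in range(n_nodes):
        # A self-loop of weight 1 keeps the node's own features in the aggregate.
        coeffs = [edge_weights[i][j] + (1.0 if i == j else 0.0) for j in range(n_nodes)]
        total = sum(coeffs)
        aggregated = [
            sum(coeffs[j] * features[j][k] for j in range(n_nodes)) / total
            for k in range(in_dim)
        ]
        # Linear transform with the learned parameters, then ReLU.
        transformed = [
            sum(aggregated[k] * weight[k][o] for k in range(in_dim))
            for o in range(out_dim)
        ]
        updated.append([max(0.0, value) for value in transformed])
    return updated
```

Stacking several such layers lets each cell group's embedding incorporate information from groups that are more than one edge away.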
The GNN model inputs the node embedding vectors of the last layer into a classifier, which calculates and outputs the predicted probability of each classification label through a sigmoid function. The predicted probabilities that meet a preset condition are screened out, and the corresponding classification labels are output as the predicted classification labels to obtain the predicted classification result.
In one embodiment of the present specification, the preset condition is that the predicted probability of the classification label is greater than or equal to 0.5.
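The final screening step can be illustrated as follows. The classifier outputs (logits) and the label names in the usage note are hypothetical; only the sigmoid function and the threshold of 0.5 come from the description.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, label_names, threshold=0.5):
    """Return the labels whose predicted probability meets the threshold, plus all probabilities."""
    probabilities = {name: sigmoid(z) for name, z in zip(label_names, logits)}
    selected = [name for name, p in probabilities.items() if p >= threshold]
    return selected, probabilities
```

For example, `predict_labels([-1.2, 0.8], ["benign", "malignant"])` selects only `"malignant"`, because the sigmoid of 0.8 is above 0.5 while the sigmoid of -1.2 is below it.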
In practical application, the application scene of the invention is not limited to classification prediction of primary liver tumors. Because the technology is based on single-cell RNA sequencing and a graph neural network GNN, the technology can be widely applied to classification prediction of other types of tumors, and further provides reliable reference opinion for clinical decision making so as to assist accurate judgment of professionals.
Fig. 2 is a schematic structural diagram of a classification prediction system for tumor according to an embodiment of the present disclosure, where the system includes:
The acquisition module 202 is used for acquiring original sequencing data of a user to be tested and acquiring cell gene expression data based on the original sequencing data;
The first processing module 203 is configured to obtain enrichment information to be detected according to the cellular gene expression data, the preset grouping information and the first preset information;
the second processing module 204 is configured to obtain information of communication intensity to be detected according to the cellular gene expression data, the preset grouping information and the second preset information;
And the prediction module 205 is configured to substitute the enrichment information to be detected and the communication intensity information to be detected into a tumor classification prediction model to obtain a prediction classification result of the user to be detected.
Optionally, the acquiring module 202 includes:
the acquisition submodule is used for acquiring a tissue sample to be tested of the user to be tested, and preparing single-cell suspension for the tissue sample to be tested to obtain single-cell suspension;
A gene information library construction submodule for constructing a gene information library according to the single cell suspension;
The sequencing submodule is used for carrying out high-throughput sequencing on the gene information library to obtain original sequencing data;
The correction sub-module is used for correcting the original sequencing data to obtain corrected sequencing data;
And the aggregation sub-module is used for determining the expression relation between the cells and the genes based on the corrected sequencing data, and taking the aggregation of the expression relation as the cell gene expression data.
Optionally, the first preset information includes a preset gene set and preset pathway information;
The first processing module 203 includes:
The enrichment processing submodule is used for calculating initial enrichment information by combining the preset gene set with the expression relationships in the cell gene expression data, wherein the initial enrichment information comprises first enrichment association relationships among cells, preset gene set pathways, and initial enrichment scores;
The summarizing sub-module is used for classifying the first enrichment association relationships according to the preset grouping information and determining second enrichment association relationships among cell groups, preset gene set pathways, and target enrichment scores;
And the screening sub-module is used for screening the second enrichment association relationships according to the preset pathway information and collecting the second enrichment association relationships that contain a dimension-reduction pathway as the enrichment information to be detected.
Optionally, the second preset information includes a plurality of preset receptors and a plurality of preset ligands;
Optionally, the second processing module 204 includes:
A mapping submodule for mapping the cellular gene expression data onto a protein-protein interaction network, and determining the correlation between the preset ligand and the preset receptor in the cellular gene expression data as a receptor-ligand pair level;
A communication probability processing sub-module for calculating a communication probability between cells through the receptor-ligand pair level;
And the grouping sub-module is used for grouping the cells according to the preset grouping information, determining the target communication intensity between the two cell groups by combining the communication probability, and summarizing to generate the to-be-detected communication intensity information.
Optionally, the prediction module 205 includes:
Taking the cell population as an input node;
The first extraction submodule is used for extracting target enrichment scores of cell groups in the enrichment information to be detected and taking the target enrichment scores as node characteristics of the input nodes;
The second extraction submodule is used for extracting target communication intensity in the communication intensity information to be detected as edge weight between the two input nodes;
The graph convolution sub-module is used for performing a graph convolution operation on each input node and determining the predicted probability of each classification label;
and the label screening sub-module is used for screening out the prediction probability meeting the preset condition and outputting the corresponding classification label as a predicted classification label to obtain a predicted classification result.
Optionally, the system further comprises a grouping module;
The grouping module comprises:
A matrix construction sub-module for constructing a target gene expression matrix from a diagnosis-confirmed patient sample, the diagnosis-confirmed patient sample comprising an original tissue sample;
The fusion sub-module is used for fusing all target gene expression matrices under the same original tissue sample to obtain a first gene expression matrix;
A judging sub-module, configured to obtain marker genes from the first gene expression matrix if the first gene expression matrix does not have a batch effect;
And the clustering sub-module is used for performing unsupervised clustering on the cells through the marker genes to obtain the preset grouping information, wherein the preset grouping information comprises the correspondence between cell groups and cells.
The functions of the apparatus according to the embodiments of the present invention have been described in the foregoing method embodiments. For details not exhaustively described in the apparatus embodiments, refer to the related descriptions in the foregoing embodiments; they are not repeated here.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. A method for classifying and predicting a tumor, comprising:
Acquiring original sequencing data of a user to be tested, and acquiring cell gene expression data based on the original sequencing data;
obtaining enrichment information to be detected according to the cell gene expression data, preset grouping information and first preset information;
Obtaining information of communication intensity to be detected according to the cell gene expression data, the preset grouping information and the second preset information;
Substituting the enrichment information to be detected and the communication intensity information to be detected into a tumor classification prediction model to obtain a prediction classification result of the user to be detected.
2. The method for classifying and predicting tumors according to claim 1, wherein the step of obtaining raw sequencing data of a user to be tested and obtaining cellular gene expression data based on the raw sequencing data comprises the steps of:
Obtaining a tissue sample to be tested of the user to be tested, and preparing single-cell suspension for the tissue sample to be tested to obtain single-cell suspension;
constructing a gene information library according to the single cell suspension;
carrying out high-throughput sequencing on the gene information library to obtain original sequencing data;
Correcting the original sequencing data to obtain corrected sequencing data;
And determining the expression relation between the cells and the genes based on the corrected sequencing data, and taking the collection of the expression relation as the cell gene expression data.
3. The method of claim 1, wherein the first predetermined information comprises a predetermined gene set and predetermined pathway information;
the obtaining enrichment information to be detected according to the cell gene expression data, the preset grouping information and the first preset information comprises the following steps:
Calculating initial enrichment information by combining the preset gene set with the expression relationships in the cell gene expression data, wherein the initial enrichment information comprises first enrichment association relationships among cells, preset gene set pathways, and initial enrichment scores;
Classifying the first enrichment association relationships according to the preset grouping information, and determining second enrichment association relationships among cell groups, preset gene set pathways, and target enrichment scores;
Screening the second enrichment association relationships according to the preset pathway information, and collecting the second enrichment association relationships that contain a dimension-reduction pathway as the enrichment information to be detected.
4. The method of claim 3, wherein the second predetermined information comprises a plurality of predetermined receptors and a plurality of predetermined ligands;
The obtaining the information of the communication intensity to be detected according to the cell gene expression data, the preset grouping information and the second preset information comprises the following steps:
Mapping the cellular gene expression data onto a protein-protein interaction network, determining a correlation relationship between the predetermined ligand and the predetermined receptor in the cellular gene expression data as a receptor-ligand pair level;
calculating a probability of communication between cells by the receptor-ligand pair level;
And grouping cells according to the preset grouping information, determining the target communication intensity between two cell groups in combination with the communication probability, and summarizing to generate the communication intensity information to be detected.
5. The method of claim 4, wherein substituting the enrichment information to be tested and the communication intensity information to be tested into the tumor classification prediction model to obtain the prediction classification result comprises:
Taking the cell population as an input node;
Extracting target enrichment scores of cell groups in the enrichment information to be detected as node characteristics of the input nodes;
extracting target communication intensity in the communication intensity information to be detected as edge weight between two input nodes;
Carrying out graph convolution operation on each input node, and determining the prediction probability of each classification label;
Screening out the prediction probability meeting the preset condition, and outputting the corresponding classification label as a predicted classification label to obtain a predicted classification result.
6. The method of claim 1, further comprising:
Constructing a target gene expression matrix according to a diagnosis-confirmed patient sample, wherein the diagnosis-confirmed patient sample comprises an original tissue sample;
Fusing all target gene expression matrixes under the same original tissue sample to obtain a first gene expression matrix;
if the first gene expression matrix does not have a batch effect, obtaining a marker gene from the first gene expression matrix;
And performing unsupervised clustering on the cells through the marker genes to obtain the preset grouping information, wherein the preset grouping information comprises the correspondence between cell groups and cells.
7. A classification and prediction system for tumors, comprising:
the acquisition module is used for acquiring original sequencing data of a user to be tested and acquiring cell gene expression data based on the original sequencing data;
the first processing module is used for obtaining enrichment information to be detected according to the cell gene expression data, the preset grouping information and the first preset information;
The second processing module is used for obtaining the information of the communication intensity to be detected according to the cell gene expression data, the preset grouping information and the second preset information;
and the prediction module is used for substituting the enrichment information to be detected and the communication intensity information to be detected into a tumor classification prediction model to obtain a prediction classification result of the user to be detected.
8. An electronic device, wherein the electronic device comprises:
A processor; and
A memory storing computer executable instructions that, when executed, cause the processor to perform the method of any of claims 1-6.
9. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-6.
CN202410112672.5A 2024-01-26 2024-01-26 Classification prediction method and device for tumors and electronic equipment Pending CN117953965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410112672.5A CN117953965A (en) 2024-01-26 2024-01-26 Classification prediction method and device for tumors and electronic equipment

Publications (1)

Publication Number Publication Date
CN117953965A true CN117953965A (en) 2024-04-30

Family

ID=90802635

Country Status (1)

Country Link
CN (1) CN117953965A (en)

Similar Documents

Publication Publication Date Title
Landau et al. Artificial intelligence in cytopathology: a review of the literature and overview of commercial landscape
CN113454733A (en) Multi-instance learner for prognostic tissue pattern recognition
US11468559B2 (en) Cellular analysis
CN112005306A (en) Method and system for selecting, managing and analyzing high-dimensional data
AU2003214724B2 (en) Medical applications of adaptive learning systems using gene expression data
CN108319813A (en) Method and device for detecting copy number variation of circulating tumor DNA
JP2023507252A (en) Cancer classification using patch convolutional neural networks
WO2003085548A1 (en) Apparatus and method for analyzing data
WO2023179263A1 (en) System, model and kit for evaluating malignancy grade or probability of thyroid nodules
Padmanabhan et al. An active learning approach for rapid characterization of endothelial cells in human tumors
CN107208131A (en) Method for lung cancer typing
US20210118526A1 (en) Calculating cell-type rna profiles for diagnosis and treatment
CN112927757A (en) Gastric cancer biomarker identification method based on gene expression and DNA methylation data
Levy et al. Mixed effects machine learning models for colon cancer metastasis prediction using spatially localized immuno-oncology markers
CN106874705B (en) Method for determining tumor markers based on transcriptome data
Elhadary et al. Revolutionizing chronic lymphocytic leukemia diagnosis: A deep dive into the diverse applications of machine learning
KR101990430B1 (en) System and method of biomarker identification for cancer recurrence prediction
CN116153420B (en) Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model
CN112397153A (en) Method for screening biomarker for predicting esophageal squamous cell carcinoma prognosis
Mavropoulos et al. Artificial intelligence-driven morphology-based enrichment of malignant cells from body fluid
CN110942808A (en) Prognosis prediction method and prediction system based on gene big data
CN108603233A (en) Single-cell genomic profiling of circulating tumor cells (CTCs) in metastatic disease to characterize disease heterogeneity
CN115831232A (en) Cancer primary focus tracing method, device, system and storage medium
CN117953965A (en) Classification prediction method and device for tumors and electronic equipment
WO2023091967A1 (en) Systems and methods for personalized treatment of tumors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination