CN117409961A - Multi-cancer diagnosis method and system based on mass spectrum data and deep learning algorithm - Google Patents
- Publication number
- CN117409961A (application number CN202311720287.0A)
- Authority
- CN
- China
- Prior art keywords
- mass spectrum
- spectrum data
- deep learning
- mass
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/10—Signal processing, e.g. from mass spectrometry [MS] or from PCR
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a multi-cancer diagnosis method and system based on mass spectrum data and a deep learning algorithm. The method comprises: acquiring mass spectrum data sets of multiple cancer tissues and a mass spectrum data set of normal tissue; discretizing the M/Zs of each mass spectrum data and standardizing the Intensities to obtain a corresponding matrix; training a deep learning model with the matrices; acquiring the matrix of the mass spectrum data of an object to be identified; and inputting that matrix into the trained deep learning model to obtain a classification result. The method and system do not rely on marker identification: cancer can be diagnosed rapidly using only raw mass spectrum data as the model input.
Description
Technical Field
The invention belongs to the technical field of mass spectrometry, and particularly relates to a multi-cancer diagnosis method and system based on mass spectrum data and a deep learning algorithm.
Background
Cancer is one of the diseases that severely threaten human health, with extremely high clinical morbidity and mortality. However, early cancer diagnosis remains challenging. Traditional cancer diagnostic methods typically involve tissue biopsy and microscopic pathology analysis, which are time-consuming, costly, invasive to the patient, and subjective in outcome. Mass spectrometry is an important technology widely applied in chemistry, biomedicine, environmental science, and other fields, and can provide distribution information on the chemical components in a tissue sample. It has been widely used in cancer diagnosis and treatment response assessment, but existing mass spectrometry data analysis methods still face challenges in information extraction and in identifying complex biomarkers.
Disclosure of Invention
The invention provides a multi-cancer diagnosis method and system based on mass spectrum data and a deep learning algorithm, which solve the above technical problems and specifically adopt the following technical scheme:
a multi-cancer species diagnosis method based on mass spectrometry data and a deep learning algorithm, comprising:
acquiring mass spectrum data sets of multiple cancer tissues and a mass spectrum data set of normal tissue, the mass spectrum data sets of multiple cancer tissues comprising a mass spectrum data set of lung cancer tissue and a mass spectrum data set of thyroid cancer tissue, wherein each mass spectrum data is an array composed of retention times Rt, mass-to-charge ratios m/z, and peak values (intensity) in one-to-one correspondence with the mass-to-charge ratios m/z; each retention time Rt corresponds to a set of mass-to-charge ratios m/z and the set of intensities corresponding to them; each mass spectrum data comprises R retention times Rt, and the array corresponding to each mass spectrum data is ((Rt_1, (m/z_1, m/z_2, …, m/z_n), (intensity_1, intensity_2, …, intensity_n)), …, (Rt_R, (m/z_1, m/z_2, …, m/z_l), (intensity_1, intensity_2, …, intensity_l))), wherein n and l are the numbers of mass-to-charge ratios m/z at the respective retention times Rt; the set of mass-to-charge ratios (m/z_1, m/z_2, …, m/z_n) at each retention time Rt is denoted M/Zs, and the corresponding set of peak values (intensity_1, intensity_2, …, intensity_n) is denoted Intensities;
discretizing the M/Zs of each mass spectrum data in the mass spectrum data set and standardizing the Intensities, and obtaining a matrix corresponding to each mass spectrum data based on the processed array;
training the deep learning model with the obtained matrix of each mass spectrum data and its corresponding classification information;
acquiring mass spectrum data of an object to be identified, discretizing the M/Zs in the mass spectrum data and standardizing the Intensities to obtain a matrix of the mass spectrum data of the object to be identified;
inputting the matrix corresponding to the mass spectrum data of the object to be identified into the trained deep learning model to obtain a classification result.
Further, for the R M/Zs of each mass spectrum data, the discretized index of each m/z in the M/Zs is calculated by the following formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - m/z_{\min}}{S} \right\rfloor;$$
wherein Index is the index, $\lfloor \cdot \rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $m/z_{\min}$ is the preset minimum of the discretization range;
each intensity in the R Intensities of each mass spectrum data is standardized by the following formula:
$$\mathrm{intensity}_{\mathrm{std}} = \frac{\mathrm{intensity} - \mathrm{intensity}_{\min}}{\mathrm{intensity}_{\max} - \mathrm{intensity}_{\min}};$$
wherein $\mathrm{intensity}_{\max}$ represents the maximum intensity value in the Intensities, $\mathrm{intensity}_{\min}$ represents the minimum intensity value in the Intensities, and $\mathrm{intensity}_{\mathrm{std}}$ represents the standard peak value;
the standard peak values falling in the same m/z index are accumulated by the following formula:
$$I_k = \sum_{i:\,\mathrm{Index}_i = k} \mathrm{intensity}_{\mathrm{std},i};$$
wherein $\mathrm{Index}_i$ and $\mathrm{intensity}_{\mathrm{std},i}$ are the discretized index and standard peak value of the i-th peak, $I_k$ represents the accumulated value of all standard peak values whose discretized index is k, and N represents the maximum value of the index;
the R groups of M/Zs and Intensities in each mass spectrum data are preprocessed by the above method to obtain a matrix of M rows and R columns, wherein R represents the number of Rt in each mass spectrum data, M represents the length of M/Zs after discretization, and M is equal to N.
Further, the deep learning model includes a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module includes position encoding and a multi-head attention mechanism.
Further, the loss function of the deep learning model is calculated by the following formula:
$$L(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij};$$
where n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ represents the value of the j-th class of the real label of sample i, and $\hat{y}_{ij}$ represents the probability predicted by the model that sample i belongs to the j-th class.
Further, training of the deep learning model is complete when the loss function converges or training reaches 200 epochs.
A multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm, comprising:
the data acquisition module, configured to acquire mass spectrum data sets of multiple cancer tissues and a mass spectrum data set of normal tissue, the mass spectrum data sets of multiple cancer tissues comprising a mass spectrum data set of lung cancer tissue and a mass spectrum data set of thyroid cancer tissue, wherein each mass spectrum data is an array composed of retention times Rt, mass-to-charge ratios m/z, and peak values (intensity) in one-to-one correspondence with the mass-to-charge ratios m/z; each retention time Rt corresponds to a set of mass-to-charge ratios m/z and the set of intensities corresponding to them; each mass spectrum data comprises R retention times Rt, and the array corresponding to each mass spectrum data is ((Rt_1, (m/z_1, m/z_2, …, m/z_n), (intensity_1, intensity_2, …, intensity_n)), …, (Rt_R, (m/z_1, m/z_2, …, m/z_l), (intensity_1, intensity_2, …, intensity_l))), wherein n and l are the numbers of mass-to-charge ratios m/z at the respective retention times Rt; the set of mass-to-charge ratios (m/z_1, m/z_2, …, m/z_n) at each retention time Rt is denoted M/Zs, and the corresponding set of peak values (intensity_1, intensity_2, …, intensity_n) is denoted Intensities;
the data preprocessing module, configured to discretize the M/Zs of each mass spectrum data in the mass spectrum data set and standardize the Intensities, and to obtain a matrix corresponding to each mass spectrum data based on the processed array;
a multi-cancer classification module comprising a deep learning model, the deep learning model being trained with the matrix of each mass spectrum data obtained by the data preprocessing module and its corresponding classification information;
after training is completed, mass spectrum data of the object to be identified are acquired by the data acquisition module; the data preprocessing module discretizes the M/Zs in the mass spectrum data and standardizes the Intensities to obtain the matrix of the mass spectrum data of the object to be identified; this matrix is then input to the multi-cancer classification module, which obtains a classification result through the trained deep learning model.
Further, the data preprocessing module operates by the following method:
for the R M/Zs of each mass spectrum data, the discretized index of each m/z in the M/Zs is calculated by the following formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - m/z_{\min}}{S} \right\rfloor;$$
wherein Index is the index, $\lfloor \cdot \rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $m/z_{\min}$ is the preset minimum of the discretization range;
each intensity in the R Intensities of each mass spectrum data is standardized by the following formula:
$$\mathrm{intensity}_{\mathrm{std}} = \frac{\mathrm{intensity} - \mathrm{intensity}_{\min}}{\mathrm{intensity}_{\max} - \mathrm{intensity}_{\min}};$$
wherein $\mathrm{intensity}_{\max}$ represents the maximum intensity value in the Intensities, $\mathrm{intensity}_{\min}$ represents the minimum intensity value in the Intensities, and $\mathrm{intensity}_{\mathrm{std}}$ represents the standard peak value;
the standard peak values falling in the same m/z index are accumulated by the following formula:
$$I_k = \sum_{i:\,\mathrm{Index}_i = k} \mathrm{intensity}_{\mathrm{std},i};$$
wherein $\mathrm{Index}_i$ and $\mathrm{intensity}_{\mathrm{std},i}$ are the discretized index and standard peak value of the i-th peak, $I_k$ represents the accumulated value of all standard peak values whose discretized index is k, and N represents the maximum value of the index;
the R groups of M/Zs and Intensities in each mass spectrum data are preprocessed by the above method to obtain a matrix of M rows and R columns, wherein R represents the number of Rt in each mass spectrum data, M represents the length of M/Zs after discretization, and M is equal to N.
Further, the deep learning model in the multi-cancer classification module comprises a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module comprises position encoding and a multi-head attention mechanism.
Further, the loss function of the deep learning model in the multi-cancer classification module is calculated by the following formula:
$$L(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij};$$
where n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ represents the value of the j-th class of the real label of sample i, and $\hat{y}_{ij}$ represents the probability predicted by the model that sample i belongs to the j-th class.
Further, training of the deep learning model is complete when the loss function converges or training reaches 200 epochs.
The beneficial effect of the invention is that the provided multi-cancer diagnosis method and system based on mass spectrum data and a deep learning algorithm do not rely on marker identification: cancer can be diagnosed rapidly using only raw mass spectrum data as the model input. Features are extracted directly from the raw mass spectrum data of multiple cancer types, without identifying biomarkers for each cancer type.
A further advantage of the invention is that the initial vector encoding of one m/z index can be regarded as [intensity_1, intensity_2, …, intensity_R], where R is the number of Rt contained in one mass spectrum data, and the elements of [intensity_1, intensity_2, …, intensity_R] are then added together to obtain a single value for that m/z index. Therefore, when multiple mass spectrum data are used as model input, no alignment in the Rt dimension is needed, so the method is not limited by the mass spectrum acquisition instrument, works across devices, is easier to train, and generalizes better.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a flow chart of a multi-cancer diagnostic method based on mass spectral data and a deep learning algorithm of the present invention;
FIG. 2 is a graph of the change in the loss function curve of the multi-cancer diagnosis model of the present invention.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
FIG. 1 shows a multi-cancer diagnosis method based on mass spectrum data and a deep learning algorithm, which comprises the following steps:
Mass spectrum data sets of multiple cancer tissues and a mass spectrum data set of normal tissue are acquired. Each mass spectrum data is an array composed of retention times Rt, mass-to-charge ratios m/z, and peak values (intensity) in one-to-one correspondence with the mass-to-charge ratios m/z. Each retention time Rt corresponds to a set of mass-to-charge ratios m/z and the set of intensities corresponding to them. Each mass spectrum data contains R retention times Rt, and the array corresponding to each mass spectrum data is ((Rt_1, (m/z_1, m/z_2, …, m/z_n), (intensity_1, intensity_2, …, intensity_n)), …, (Rt_R, (m/z_1, m/z_2, …, m/z_l), (intensity_1, intensity_2, …, intensity_l))), wherein n and l are the numbers of mass-to-charge ratios m/z at the respective retention times Rt. The set of mass-to-charge ratios (m/z_1, m/z_2, …, m/z_n) at each retention time Rt is denoted M/Zs, and the corresponding set of peak values (intensity_1, intensity_2, …, intensity_n) is denoted Intensities.
In the present application, the mass spectrum data sets of multiple cancer tissues include a mass spectrum data set of lung cancer tissue and a mass spectrum data set of thyroid cancer tissue. The raw mass spectrum data files (.raw format) for lung cancer and thyroid cancer are downloaded from a mass spectrum database; the retention time (Rt) is extracted from the raw mass spectrum data, and R groups of (Rt, M/Zs, Intensities) are constructed from the mass-to-charge ratios of the primary mass spectrum, the mass-to-charge ratios of the secondary mass spectrum, and the corresponding peak value (intensity) sequence information.
Specifically, the data are downloaded as follows: first, the download address is read and it is determined whether it is an address of the prism or the iProX library; if so, the FTP download link is executed and the mass spectrum files with the raw suffix are downloaded into a folder named Multirawtata for storage.
In the embodiment of the invention, a lung cancer data set is used: IPX0001451001; the dataset may be obtained by linking https:// www.iprox.cn// page/scv017.Htmlquery = IPX 0001451001.
Thyroid cancer dataset used in the embodiments of the present invention: IPX0001444001; the dataset may be obtained by linking https:// www.iprox.cn// page/sub-project.
After the IPX0001451001 lung cancer dataset and the IPX0001444001 thyroid cancer dataset are acquired, the data collected on mass spectrometry instruments with a nano electrospray ion source (Nano-ESI) are selected for analysis. The original data files with the raw suffix are downloaded from the public database iProX and converted into the universal format with the mzML suffix using MSConverter software. The mass spectrum data of cancer tissue in the IPX0001451001 lung cancer dataset are labeled as the lung cancer class and the mass spectrum data of tissue adjacent to the cancer as the normal class; similarly, the data of thyroid cancer tissue in the IPX0001444001 thyroid cancer dataset are labeled as the thyroid cancer class and the mass spectra of tissue adjacent to the cancer as the normal class.
The Rt, m/z, and intensity information of the mass spectra is extracted from the mzML files via the Python package proteomics and stored for later use in the format (Rt, (M/Zs, Intensities)).
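As an illustration of the storage format described above, the following minimal sketch builds (Rt, (M/Zs, Intensities)) records from plain Python values. The helper name `make_record` and the sample numbers are hypothetical; the sketch does not reproduce the actual mzML parsing step, which depends on the package used.

```python
def make_record(rt, mzs, intensities):
    """Build one (Rt, (M/Zs, Intensities)) record for a single retention time.

    Hypothetical helper: it only checks that every m/z has exactly one
    intensity, as the one-to-one correspondence in the text requires.
    """
    if len(mzs) != len(intensities):
        raise ValueError("each m/z needs exactly one intensity")
    return (float(rt), (tuple(mzs), tuple(intensities)))


# One mass spectrum data = a list of R such records (here R = 2).
# The number of peaks may differ between retention times (n = 3, l = 2).
spectrum = [
    make_record(0.5, [100.04, 100.07, 250.31], [10.0, 30.0, 20.0]),
    make_record(1.0, [100.02, 400.9], [5.0, 15.0]),
]
```

Keeping M/Zs and Intensities as parallel tuples mirrors the array layout in the text and makes the later per-scan preprocessing a simple `zip` over the two sequences.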
The M/Zs of each mass spectrum data in the mass spectrum data set are discretized and the Intensities are standardized, and a matrix corresponding to each mass spectrum data is obtained based on the processed array.
Specifically, one mass spectrum data contains R M/Zs. For the R M/Zs of each mass spectrum data, the discretized index of each m/z in the M/Zs is calculated by the following formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - m/z_{\min}}{S} \right\rfloor;$$
wherein Index is the index, $\lfloor \cdot \rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $m/z_{\min}$ is the preset minimum of the discretization range. In the present application, S is set to 0.1 and m/z is restricted to a preset range.
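The discretization step can be sketched as follows. S = 0.1 comes from the text; the bounds `MZ_MIN` and `MZ_MAX` are hypothetical placeholders, chosen only so that the bin count matches the 15400 stated elsewhere in the description, and the small epsilon is an added guard against floating-point round-off rather than part of the described method.

```python
import math

S = 0.1           # discretization scale stated in the text
MZ_MIN = 100.0    # hypothetical lower bound of the preset m/z range
MZ_MAX = 1640.0   # hypothetical upper bound (gives 15400 bins at S = 0.1)


def mz_index(mz, mz_min=MZ_MIN, scale=S):
    """Return the 0-based discretized index floor((m/z - min) / S)."""
    # The epsilon keeps values such as 100.3 from landing in bin 2
    # because (100.3 - 100.0) / 0.1 evaluates to 2.9999999999999716.
    return math.floor((mz - mz_min) / scale + 1e-9)


def n_bins(mz_min=MZ_MIN, mz_max=MZ_MAX, scale=S):
    """Number of discretized positions covering [mz_min, mz_max)."""
    return round((mz_max - mz_min) / scale)
```

With these placeholder bounds, m/z values 100.04 and 100.07 share index 0, which is why the later accumulation step is needed.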
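Each intensity in the R Intensities of each mass spectrum data is standardized by the following formula:
$$\mathrm{intensity}_{\mathrm{std}} = \frac{\mathrm{intensity} - \mathrm{intensity}_{\min}}{\mathrm{intensity}_{\max} - \mathrm{intensity}_{\min}};$$
wherein $\mathrm{intensity}_{\max}$ represents the maximum intensity value in the Intensities, $\mathrm{intensity}_{\min}$ represents the minimum intensity value in the Intensities, and $\mathrm{intensity}_{\mathrm{std}}$ represents the standard peak value.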
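A minimal sketch of this min-max standardization for the Intensities of one retention time is shown below. The function name `standardize` is illustrative, and returning zeros when all intensities are equal is an assumption, since the text does not cover that case.

```python
def standardize(intensities):
    """Min-max scale one Intensities array to the range [0, 1]."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        # Assumption: a flat scan carries no relative peak information.
        return [0.0 for _ in intensities]
    return [(x - lo) / (hi - lo) for x in intensities]


standard_peaks = standardize([10.0, 30.0, 20.0])
```

Scaling each retention time separately removes the dependence on absolute signal level, which can differ between instruments and runs.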
The standard peak values falling in the same m/z index are accumulated by the following formula:
$$I_k = \sum_{i:\,\mathrm{Index}_i = k} \mathrm{intensity}_{\mathrm{std},i};$$
wherein $\mathrm{Index}_i$ and $\mathrm{intensity}_{\mathrm{std},i}$ are the discretized index and standard peak value of the i-th peak, $I_k$ represents the accumulated value of all standard peak values whose discretized index is k, and N represents the maximum value of the index.
After the M/Zs are discretized, several peaks may share the same index, so the standard peak values within the same index are summed so that each index corresponds to exactly one accumulated value.
The R groups of M/Zs and Intensities in each mass spectrum data are preprocessed by the above method to obtain a matrix of M rows and R columns, wherein R represents the number of Rt in each mass spectrum data, M represents the length of M/Zs after discretization, and M is equal to N. In the present application, the m/z range is preset, S is 0.1, and M and N are both 15400, which at S = 0.1 corresponds to a range about 1540 m/z units wide.
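Putting the three preprocessing steps together, the following sketch builds the M-row, R-column matrix for one mass spectrum data. It assumes 0-based indices, a scan layout of `(rt, mzs, intensities)` tuples, and that peaks outside the preset range are skipped; none of these details is fixed by the text, and the toy range below is hypothetical.

```python
import math


def build_matrix(scans, mz_min, scale, n_bins):
    """Return a list of n_bins rows, each with one column per retention time."""
    matrix = [[0.0] * len(scans) for _ in range(n_bins)]
    for col, (_rt, mzs, intensities) in enumerate(scans):
        lo, hi = min(intensities), max(intensities)
        span = hi - lo
        for mz, inten in zip(mzs, intensities):
            k = math.floor((mz - mz_min) / scale + 1e-9)  # discretized index
            if not 0 <= k < n_bins:
                continue  # assumption: drop peaks outside the preset range
            std = (inten - lo) / span if span else 0.0     # standard peak value
            matrix[k][col] += std                          # accumulate per index
    return matrix


# Toy example: 5 bins of width 0.1 starting at m/z 100.0, R = 2 scans.
toy_scans = [
    (0.5, [100.04, 100.07, 100.31], [10.0, 30.0, 20.0]),
    (1.0, [100.02, 100.45], [5.0, 15.0]),
]
X = build_matrix(toy_scans, mz_min=100.0, scale=0.1, n_bins=5)
```

In the first scan, the peaks at 100.04 and 100.07 both fall in index 0, so their standard peak values (0.0 and 1.0) are summed into a single entry.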
The deep learning model is trained with the obtained matrix of each mass spectrum data and its corresponding classification information.
In an embodiment of the present application, the deep learning model comprises a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module comprises position encoding (Pos encoding) and a multi-head attention mechanism (multi-head attention).
The loss function of the deep learning model is calculated by the following formula:
$$L(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij};$$
where n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ represents the value (0 or 1) of the j-th class of the real label of sample i, and $\hat{y}_{ij}$ represents the probability predicted by the model that sample i belongs to the j-th class.
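The loss described here matches the usual categorical cross-entropy over one-hot labels. The sketch below averages over the n samples, which is an assumption because the original formula image is not available; the clamp `eps` is likewise an added numerical safeguard.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy over n samples and c classes.

    y_true: one-hot rows (0 or 1); y_pred: predicted class probabilities.
    """
    n = len(y_true)
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y_ij, p_ij in zip(true_row, pred_row):
            total -= y_ij * math.log(max(p_ij, eps))  # clamp to avoid log(0)
    return total / n  # assumption: mean over samples


# Two samples, three classes (for example normal, lung cancer, thyroid cancer).
labels = [[1, 0, 0], [0, 1, 0]]
probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
loss = cross_entropy(labels, probs)
```

Only the probability assigned to the true class contributes to each sample's term, so the loss falls as the model grows more confident in the correct category.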
The number of iterations can be set according to the actual situation. In this application, training of the deep learning model is complete when the loss function converges or training reaches 200 epochs.
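The stopping rule can be sketched as a simple check on the per-epoch loss history. The 200-epoch cap comes from the text; the convergence test (loss range below `tol` over the last `patience` epochs) and both of those parameter values are hypothetical, since the text does not define how convergence is detected.

```python
def should_stop(losses, max_epochs=200, tol=1e-4, patience=5):
    """Return True when training should end.

    losses: one loss value per completed epoch.
    """
    if len(losses) >= max_epochs:
        return True                       # epoch cap stated in the text
    if len(losses) <= patience:
        return False                      # not enough history yet
    recent = losses[-(patience + 1):]
    return max(recent) - min(recent) < tol  # hypothetical convergence test
```

A training loop would append the epoch loss to `losses` after each pass and break as soon as `should_stop(losses)` returns True.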
In the embodiment of the application, the original data are divided into a training set and a test set; the training set is used to train the model and the test set is used to evaluate its performance. FIG. 2 shows the change in the loss function curve when the deep learning model is trained on the training set in an embodiment of the present application. The figure shows that the model converges quickly during training; after the number of iterations reaches 200, the loss function stabilizes.
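In the obtained matrix of M rows and R columns, M represents the length of M/Zs after discretization (M is 15400 in this example) and R is the number of Rt in the mass spectrum data. Accumulating the matrix along the Rt dimension, i.e. summing the R entries in each row, yields a vector of length M.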
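The accumulation along the Rt dimension can be sketched as a row-wise sum, assuming the matrix is stored as a list of M rows with R columns each. The function name is illustrative.

```python
def accumulate_over_rt(matrix):
    """Sum each row (one m/z index) over all R retention-time columns."""
    return [sum(row) for row in matrix]


# Two samples with different numbers of retention times (R = 2 and R = 3)
# still yield vectors of the same length M = 3.
sample_a = [[1.0, 0.0], [0.5, 0.25], [0.0, 1.0]]
sample_b = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [0.25, 0.25, 0.5]]
vec_a = accumulate_over_rt(sample_a)
vec_b = accumulate_over_rt(sample_b)
```

Because the output length depends only on M, samples acquired with different numbers of retention times need no alignment in the Rt dimension before entering the model.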
The accumulated vector is first input to the one-dimensional CNN module for convolution, producing a feature representation in which d is the hidden dimension of the one-dimensional CNN module.
This representation is then input to the Positional Embedding of the Transformer to compute the position encoding matrix P, in which d is likewise the hidden dimension of Positional Embedding. The CNN output is added to P to obtain the input of the multi-head attention in the Transformer.
This input is then fed to the multi-head attention of the Transformer. It is first multiplied by three weight matrices to obtain the matrices Q, K, and V needed to calculate the attention value, where q, k, and v denote the dimensions of Q, K, and V respectively and h denotes the number of heads.
The matrices Q and K are then multiplied and the product is scaled by a scaling factor; the Softmax function is applied to obtain the attention score matrix Attention, and Attention is multiplied by V to obtain the attention output.
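The attention computation can be sketched for a single head in plain Python. The scaling factor used here, the square root of the key dimension, is the one from standard scaled dot-product attention and is an assumption, because the factor in the original text was lost; the projection by the three weight matrices and the split into h heads are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax of one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Return (softmax(Q K^T / sqrt(d_k)) V, attention weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]              # transpose K
    scores = matmul(q, k_t)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]        # each row sums to 1
    return matmul(weights, v), weights


V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], V)
```

With all-zero keys every position receives the same weight, so the output is simply the mean of the rows of V.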
The attention output then passes through a residual connection, i.e. the sub-layer input is added to the sub-layer output, followed by one Layer Normalization layer. Layer normalization mainly transforms the activations of each layer to a distribution with the same mean and variance, which accelerates model convergence.
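The residual addition followed by layer normalization can be sketched for a single feature vector as below. The learnable gain and bias that real implementations usually include are left out, and the `eps` value is an assumption.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_out):
    """Residual connection (x + sub-layer output) followed by layer norm."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


normed = add_and_norm([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Adding a constant to every element leaves the normalized result unchanged, which illustrates that the layer fixes the mean and variance regardless of the input's offset.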
The normalized result passes through a 1-layer feedforward neural network, and the network's output is added to its input and passed through another Layer Normalization layer. The result is then flattened, passed through the 2-layer feedforward neural network, and finally through one Softmax layer to obtain the prediction result, which is an array of probability values for three categories. In the embodiment of the present application, the category with the highest probability value in this array is taken as the prediction result; a preset classification threshold may also be set according to actual requirements, which the embodiment of the invention does not limit.
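The final Softmax and highest-probability selection can be sketched as follows. The three class names follow the labels used in the embodiment (normal, lung cancer, thyroid cancer), but their order in `CLASSES` is hypothetical, and the input `logits` stand in for the output of the 2-layer feedforward network.

```python
import math

CLASSES = ("normal", "lung cancer", "thyroid cancer")  # order is hypothetical


def softmax(logits):
    """Numerically stable softmax over the class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, classes=CLASSES):
    """Return (label with the highest probability, probability array)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs


label, class_probs = predict([0.1, 2.0, -1.0])
```

If a classification threshold is wanted, the caller could compare `max(class_probs)` against it before accepting `label`.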
The performance of the deep learning model on the training/validation set may be evaluated by the following criterion: accuracy (Accuracy), the proportion of correctly predicted samples among all samples. The performance evaluation results are shown in Table 1 below: on the validation set for multi-cancer mass spectrum data, the deep learning model constructed by the invention achieves a classification index result of 87%.
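The accuracy criterion can be computed as in the short sketch below. The labels in the example are made-up toy values, not the patent's data.

```python
def accuracy(y_true, y_pred):
    """Proportion of samples whose predicted label equals the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


toy_true = ["normal", "lung cancer", "normal", "thyroid cancer"]
toy_pred = ["normal", "lung cancer", "thyroid cancer", "thyroid cancer"]
toy_acc = accuracy(toy_true, toy_pred)
```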
TABLE 1 Classification index results on the multi-cancer training/validation set based on the deep learning method
After training is completed, mass spectrum data of an object to be identified are acquired, and the M/Zs in the mass spectrum data are discretized and the Intensities standardized by the above method to obtain a matrix of the mass spectrum data of the object to be identified.
The matrix corresponding to the mass spectrum data of the object to be identified is input into the trained deep learning model to obtain a classification result.
As shown in fig. 1, the present application further discloses a multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm, comprising: the system comprises a data acquisition module, a data preprocessing module and a multi-cancer classification module.
The data acquisition module is used for acquiring mass spectrum data sets of multiple cancer tissues and a mass spectrum data set of normal tissues, wherein the mass spectrum data sets of the multiple cancer tissues comprise a mass spectrum data set of lung cancer tissues and a mass spectrum data set of thyroid cancer tissues. Each mass spectrum data is an array consisting of retention times Rt, mass-to-charge ratios m/z, and the peak intensity values (intensity) corresponding to the mass-to-charge ratios m/z. Each retention time Rt corresponds to a group of mass-to-charge ratios m/z and a group of intensity values in one-to-one correspondence with them. Each mass spectrum data comprises R retention times Rt, and the corresponding array is $((Rt_1, (m/z_1, m/z_2, \dots, m/z_n), (intensity_1, intensity_2, \dots, intensity_n)), \dots, (Rt_R, (m/z_1, m/z_2, \dots, m/z_l), (intensity_1, intensity_2, \dots, intensity_l)))$, where n and l are the numbers of mass-to-charge ratios m/z at the respective retention times Rt. The group of mass-to-charge ratios $(m/z_1, m/z_2, \dots, m/z_n)$ at each retention time Rt is denoted M/Zs, and the corresponding group of intensity values $(intensity_1, intensity_2, \dots, intensity_n)$ is denoted Intensities.
The data preprocessing module is used for discretizing the M/Zs and standardizing the Intensities of each mass spectrum data in the mass spectrum data set, and for obtaining a matrix corresponding to each mass spectrum data based on the processed array.
The multi-cancer classification module comprises a deep learning model, which is trained with the matrix of each mass spectrum data obtained by the data preprocessing module and the corresponding classification information.
After training is completed, mass spectrum data of an object to be identified is acquired by the data acquisition module; the data preprocessing module then discretizes the M/Zs and standardizes the Intensities in the mass spectrum data to obtain a matrix of the mass spectrum data of the object to be identified. This matrix is input to the multi-cancer classification module, which obtains a classification result through the trained deep learning model.
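The array layout handled by the data acquisition module can be pictured with a small data-structure sketch. The class and field names (`Scan`, `MassSpectrum`, `num_retention_times`) are hypothetical and chosen only for illustration; the numeric values are made up.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Scan:
    """One retention time Rt with its M/Zs and the matching Intensities."""
    rt: float
    mzs: List[float]
    intensities: List[float]

    def __post_init__(self):
        # Each m/z must have exactly one intensity value.
        if len(self.mzs) != len(self.intensities):
            raise ValueError("each m/z needs exactly one intensity")

@dataclass
class MassSpectrum:
    """One mass spectrum data: R scans plus its classification label."""
    scans: List[Scan]
    label: str

    @property
    def num_retention_times(self) -> int:
        """R, the number of retention times in this mass spectrum data."""
        return len(self.scans)

# The number of m/z values may differ between retention times (n vs. l).
sample = MassSpectrum(
    scans=[
        Scan(0.5, [100.1, 250.3, 400.7], [1200.0, 300.0, 50.0]),
        Scan(1.0, [150.2, 380.9], [800.0, 90.0]),
    ],
    label="lung cancer",
)
```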
As a preferred embodiment, the data preprocessing module performs the preprocessing as follows:
for the R M/Zs of each mass spectrum data, a discretization index is calculated for each m/z in the M/Zs, with the calculation formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - \mathrm{mz}_{\min}}{S} \right\rfloor;$$
wherein Index is the index, $\lfloor \cdot \rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $\mathrm{mz}_{\min}$ is the preset minimum value of the discretization range.
Each intensity in the R Intensities of each mass spectrum data is standardized, with the standardization formula:
$$intensity_{std} = \frac{intensity - intensity_{\min}}{intensity_{\max} - intensity_{\min}};$$
wherein $intensity_{\max}$ represents the maximum intensity value in the Intensities array, $intensity_{\min}$ represents the minimum intensity value in the Intensities array, and $intensity_{std}$ represents the standard peak value.
The standard peak values falling in the same m/z index are accumulated, with the calculation formula:
$$I_k = \sum_{i:\ \mathrm{Index}_i = k} intensity_{std,i}, \quad k \le N;$$
wherein $I_k$ represents the accumulated value of all standard peak values whose discretization index is k, and N represents the maximum value of the index.
The R groups of M/Zs and Intensities in each mass spectrum data are preprocessed by the above method to obtain a matrix with M rows and R columns, wherein R represents the number of retention times Rt in each mass spectrum data, M represents the length of the M/Zs after discretization, and M is equal to N.
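The three preprocessing steps (discretization, min-max standardization, accumulation per index) can be sketched end to end in pure Python. This is a sketch under stated assumptions: the index is taken as floor((m/z - minimum) / S), indices are 0-based, `mz_max` is a hypothetical upper bound used only to fix the number of rows, peaks outside the range are dropped, and a scan whose intensities are all equal is mapped to zeros. None of these details is confirmed by the source.

```python
import math

def discretize_index(mz, mz_min, scale):
    """Assumed form of the index: floor((m/z - mz_min) / S)."""
    return math.floor((mz - mz_min) / scale)

def min_max_normalize(intensities):
    """Min-max standardization of one Intensities array to [0, 1]."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:  # assumption: a constant array maps to zeros
        return [0.0 for _ in intensities]
    return [(v - lo) / (hi - lo) for v in intensities]

def scan_to_column(mzs, intensities, mz_min, mz_max, scale):
    """Accumulate the standard peak values that share a discretized index."""
    n_bins = math.floor((mz_max - mz_min) / scale) + 1
    column = [0.0] * n_bins
    for mz, peak in zip(mzs, min_max_normalize(intensities)):
        k = discretize_index(mz, mz_min, scale)
        if 0 <= k < n_bins:  # assumption: out-of-range peaks are ignored
            column[k] += peak
    return column

def spectrum_to_matrix(scans, mz_min, mz_max, scale):
    """Build the M x R matrix: one column per retention time."""
    columns = [scan_to_column(m, i, mz_min, mz_max, scale) for m, i in scans]
    return [list(row) for row in zip(*columns)]  # transpose to M rows, R columns

# Two retention times with made-up peaks: (M/Zs, Intensities).
scans = [
    ([100.2, 100.7, 102.5], [10.0, 30.0, 50.0]),
    ([101.1, 103.9], [5.0, 25.0]),
]
matrix = spectrum_to_matrix(scans, mz_min=100.0, mz_max=104.0, scale=1.0)
```

In this toy input the first two peaks of the first scan fall into the same bin, so their standard peak values are summed, which is the accumulation step in miniature.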
As a preferred embodiment, the deep learning model in the multi-cancer classification module comprises a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module comprises position encoding and a multi-head attention mechanism.
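The source does not state which position encoding the Transformer module uses. Purely as an illustration, the sketch below assumes the fixed sinusoidal encoding commonly used with Transformer models; the function name and the dimensions are hypothetical.

```python
import math

def sinusoidal_position_encoding(seq_len, d_model):
    """Return a seq_len x d_model table: sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Assumed toy sizes: 4 sequence positions, 6 feature dimensions.
pe = sinusoidal_position_encoding(seq_len=4, d_model=6)
```

The table is added element-wise to the sequence features before attention, so that otherwise order-agnostic attention layers can distinguish positions.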
As a preferred embodiment, the calculation formula of the loss function of the deep learning model in the multi-cancer classification module is:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c} y_{ij}\log\left(\hat{y}_{ij}\right);$$
where n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ represents the value of the j-th class in the true label of sample i, and $\hat{y}_{ij}$ represents the probability predicted by the model that sample i belongs to the j-th class.
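The symbols defined for the loss (n samples, c categories, true labels and predicted probabilities) match a categorical cross-entropy. The sketch below assumes that form with averaging over the n samples; the averaging and the small clipping constant `eps` are assumptions, and the numbers are made up.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over n samples and c classes."""
    n = len(y_true)
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            # Clip p to avoid log(0); terms with y == 0 contribute nothing.
            total += y * math.log(max(p, eps))
    return -total / n

# Two samples, three classes, one-hot true labels.
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = cross_entropy(y_true, y_pred)
```

With one-hot labels only the probability assigned to the true class enters the sum, so here the loss is the mean of -log(0.7) and -log(0.8).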
As a preferred embodiment, training of the deep learning model is complete when the loss function converges or when the number of training epochs reaches 200.
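The stopping rule can be sketched as a simple loop. The source does not define how convergence is detected, so the tolerance `tol` and the `patience` counter below are assumptions, as are the function names; `run_epoch` stands in for one pass over the training data and returns the loss.

```python
def train(run_epoch, max_epochs=200, tol=1e-4, patience=5):
    """Run epochs until the loss stops changing or max_epochs is reached."""
    history = []
    stalled = 0
    for epoch in range(1, max_epochs + 1):
        loss = run_epoch(epoch)
        # Count consecutive epochs whose loss change is below the tolerance.
        if history and abs(history[-1] - loss) < tol:
            stalled += 1
        else:
            stalled = 0
        history.append(loss)
        if stalled >= patience:  # treated as "the loss function converges"
            break
    return history

def fake_epoch(epoch):
    """Stand-in for a real training epoch: a slowly decaying loss."""
    return 1.0 / epoch

history = train(fake_epoch)
```

Whichever condition is met first ends training, so the loop never runs more than 200 epochs.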
For the specific details of each module in the multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm, refer to the multi-cancer diagnosis method based on mass spectrum data and a deep learning algorithm described above; they are not repeated here.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be appreciated by persons skilled in the art that the above embodiments are not intended to limit the invention in any way, and that all technical solutions obtained by means of equivalent substitutions or equivalent transformations fall within the scope of the invention.
Claims (10)
1. A method for diagnosing multiple cancers based on mass spectrum data and a deep learning algorithm, comprising:
acquiring mass spectrum data sets of multiple cancer tissues and a mass spectrum data set of normal tissues, wherein the mass spectrum data sets of the multiple cancer tissues comprise a mass spectrum data set of lung cancer tissues and a mass spectrum data set of thyroid cancer tissues, each mass spectrum data is an array consisting of retention times Rt, mass-to-charge ratios m/z, and the peak intensity values (intensity) corresponding to the mass-to-charge ratios m/z, each retention time Rt corresponds to a group of mass-to-charge ratios m/z and a group of intensity values in one-to-one correspondence with them, each mass spectrum data comprises R retention times Rt, the group of mass-to-charge ratios at each retention time Rt is denoted M/Zs, and the corresponding group of intensity values is denoted Intensities;
discretizing the M/Zs and standardizing the Intensities of each mass spectrum data in the mass spectrum data set, and obtaining a matrix corresponding to each mass spectrum data based on the processed array;
training the deep learning model with the obtained matrix of each mass spectrum data and the corresponding classification information;
acquiring mass spectrum data of an object to be identified, and discretizing the M/Zs and standardizing the Intensities in the mass spectrum data to obtain a matrix of the mass spectrum data of the object to be identified;
inputting the matrix corresponding to the mass spectrum data of the object to be identified into the trained deep learning model to obtain a classification result.
2. The method for diagnosing multiple cancers based on mass spectrum data and deep learning algorithm according to claim 1, wherein,
for the R M/Zs of each mass spectrum data, a discretization index is calculated for each m/z in the M/Zs, with the calculation formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - \mathrm{mz}_{\min}}{S} \right\rfloor;$$
wherein Index is the index, $\lfloor \cdot \rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $\mathrm{mz}_{\min}$ is the preset minimum value of the discretization range;
each intensity in the R Intensities of each mass spectrum data is standardized, with the standardization formula:
$$intensity_{std} = \frac{intensity - intensity_{\min}}{intensity_{\max} - intensity_{\min}};$$
wherein $intensity_{\max}$ represents the maximum intensity value in the Intensities array, $intensity_{\min}$ represents the minimum intensity value in the Intensities array, and $intensity_{std}$ represents the standard peak value;
the standard peak values falling in the same m/z index are accumulated, with the calculation formula:
$$I_k = \sum_{i:\ \mathrm{Index}_i = k} intensity_{std,i}, \quad k \le N;$$
wherein $I_k$ represents the accumulated value of all standard peak values whose discretization index is k, and N represents the maximum value of the index;
the R groups of M/Zs and Intensities in each mass spectrum data are preprocessed by the above method to obtain a matrix with M rows and R columns, wherein R represents the number of retention times Rt in each mass spectrum data, M represents the length of the M/Zs after discretization, and M is equal to N.
3. The method for diagnosing multiple cancers based on mass spectrum data and deep learning algorithm according to claim 1, wherein,
the deep learning model includes a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module includes position encoding and a multi-head attention mechanism.
4. The method for diagnosing multiple cancers based on mass spectrum data and deep learning algorithm according to claim 1, wherein,
the calculation formula of the loss function of the deep learning model is as follows:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c} y_{ij}\log\left(\hat{y}_{ij}\right);$$
where n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ represents the value of the j-th class in the true label of sample i, and $\hat{y}_{ij}$ represents the probability predicted by the model that sample i belongs to the j-th class.
5. The method for diagnosing multiple cancers based on mass spectrum data and deep learning algorithm according to claim 4, wherein,
training of the deep learning model is complete when the loss function converges or when the number of training epochs reaches 200.
6. A multi-cancer diagnosis system based on mass spectrometry data and a deep learning algorithm, comprising:
a data acquisition module, used for acquiring mass spectrum data sets of multiple cancer tissues and a mass spectrum data set of normal tissues, wherein the mass spectrum data sets of the multiple cancer tissues comprise a mass spectrum data set of lung cancer tissues and a mass spectrum data set of thyroid cancer tissues, each mass spectrum data is an array consisting of retention times Rt, mass-to-charge ratios m/z, and the peak intensity values (intensity) corresponding to the mass-to-charge ratios m/z, each retention time Rt corresponds to a group of mass-to-charge ratios m/z and a group of intensity values in one-to-one correspondence with them, each mass spectrum data comprises R retention times Rt, the group of mass-to-charge ratios at each retention time Rt is denoted M/Zs, and the corresponding group of intensity values is denoted Intensities;
a data preprocessing module, used for discretizing the M/Zs and standardizing the Intensities of each mass spectrum data in the mass spectrum data set, and for obtaining a matrix corresponding to each mass spectrum data based on the processed array;
a multi-cancer classification module comprising a deep learning model, the deep learning model being trained with the matrix of each mass spectrum data obtained by the data preprocessing module and the corresponding classification information;
wherein, after training is completed, mass spectrum data of an object to be identified is acquired by the data acquisition module, the data preprocessing module discretizes the M/Zs and standardizes the Intensities in the mass spectrum data to obtain a matrix of the mass spectrum data of the object to be identified, the matrix is input to the multi-cancer classification module, and the multi-cancer classification module obtains a classification result through the trained deep learning model.
7. The multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm according to claim 6, wherein,
the data preprocessing module performs the preprocessing as follows:
for the R M/Zs of each mass spectrum data, a discretization index is calculated for each m/z in the M/Zs, with the calculation formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - \mathrm{mz}_{\min}}{S} \right\rfloor;$$
wherein Index is the index, $\lfloor \cdot \rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $\mathrm{mz}_{\min}$ is the preset minimum value of the discretization range;
each intensity in the R Intensities of each mass spectrum data is standardized, with the standardization formula:
$$intensity_{std} = \frac{intensity - intensity_{\min}}{intensity_{\max} - intensity_{\min}};$$
wherein $intensity_{\max}$ represents the maximum intensity value in the Intensities array, $intensity_{\min}$ represents the minimum intensity value in the Intensities array, and $intensity_{std}$ represents the standard peak value;
the standard peak values falling in the same m/z index are accumulated, with the calculation formula:
$$I_k = \sum_{i:\ \mathrm{Index}_i = k} intensity_{std,i}, \quad k \le N;$$
wherein $I_k$ represents the accumulated value of all standard peak values whose discretization index is k, and N represents the maximum value of the index;
the R groups of M/Zs and Intensities in each mass spectrum data are preprocessed by the above method to obtain a matrix with M rows and R columns, wherein R represents the number of retention times Rt in each mass spectrum data, M represents the length of the M/Zs after discretization, and M is equal to N.
8. The multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm according to claim 6, wherein,
the deep learning model in the multi-cancer classification module comprises a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module comprises position encoding and a multi-head attention mechanism.
9. The multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm according to claim 8, wherein,
the calculation formula of the loss function of the deep learning model in the multi-cancer classification module is as follows:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c} y_{ij}\log\left(\hat{y}_{ij}\right);$$
where n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ represents the value of the j-th class in the true label of sample i, and $\hat{y}_{ij}$ represents the probability predicted by the model that sample i belongs to the j-th class.
10. The multi-cancer diagnosis system based on mass spectrum data and deep learning algorithm according to claim 9,
training of the deep learning model is complete when the loss function converges or when the number of training epochs reaches 200.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311720287.0A CN117409961A (en) | 2023-12-14 | 2023-12-14 | Multi-cancer diagnosis method and system based on mass spectrum data and deep learning algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117409961A true CN117409961A (en) | 2024-01-16 |
Family
ID=89489446
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311720287.0A Pending CN117409961A (en) | 2023-12-14 | 2023-12-14 | Multi-cancer diagnosis method and system based on mass spectrum data and deep learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117409961A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114997303A (en) * | 2022-05-30 | 2022-09-02 | 杭州电子科技大学 | Bladder cancer metabolic marker screening method and system based on deep learning |
WO2023164665A1 (en) * | 2022-02-25 | 2023-08-31 | Fred Hutchinson Cancer Center | Machine learning applications to predict biological outcomes and elucidate underlying biological mechanisms |
US20230301757A1 (en) * | 2022-03-25 | 2023-09-28 | J. Morita Mfg. Corp. | Identification apparatus and identification method |
CN117034017A (en) * | 2023-09-07 | 2023-11-10 | 云鉴康(杭州)医疗科技有限公司 | Mass spectrogram classification method, system, medium and equipment based on deep learning |
Non-Patent Citations (4)
Title |
---|
丛晓峰: "《PyTorch神经网络实战 移动端图像处理》", 30 June 2022, 机械工业出版社, pages: 210 * |
朝乐门: "《启迪数字学院系列丛书 数据分析原理与实践 基于经典算法及Python编程实现》", 31 July 2022, 机械工业出版社, pages: 126 * |
朱刚;李文;杜守国;崔久强;: "基于深度学习模型DeepAR的时间序列预测及应用实例", 电子商务, no. 07, 31 July 2020 (2020-07-31) * |
王月;王孟轩;张胜;杜?;: "基于BERT的警情文本命名实体识别", 计算机应用, no. 02, 10 February 2020 (2020-02-10), pages 1 - 2 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||