CN117409961A - Multi-cancer diagnosis method and system based on mass spectrum data and deep learning algorithm - Google Patents


Info

Publication number
CN117409961A
Authority
CN
China
Prior art keywords
mass spectrum
spectrum data
deep learning
mass
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311720287.0A
Other languages
Chinese (zh)
Inventor
孙楠楠
段宏亮
张丽英
和涛
居斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Shengao Information Technology Co ltd
Original Assignee
Hangzhou Shengao Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Shengao Information Technology Co ltd filed Critical Hangzhou Shengao Information Technology Co ltd
Priority to CN202311720287.0A priority Critical patent/CN117409961A/en
Publication of CN117409961A publication Critical patent/CN117409961A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a multi-cancer diagnosis method and system based on mass spectrum data and a deep learning algorithm. The method comprises the following steps: acquiring mass spectrum datasets of multiple cancer tissues and a mass spectrum dataset of normal tissue; discretizing the M/Zs of each mass spectrum data and standardizing the Intensities to obtain a corresponding matrix; training a deep learning model with the matrices; obtaining the matrix of the mass spectrum data of an object to be identified; and inputting that matrix into the trained deep learning model to obtain a classification result. The method and system do not rely on the identification of markers, and cancer can be diagnosed rapidly using only raw mass spectrum data as the model input.

Description

Multi-cancer diagnosis method and system based on mass spectrum data and deep learning algorithm
Technical Field
The invention belongs to the technical field of mass spectrometry, and particularly relates to a multi-cancer diagnosis method and system based on mass spectrum data and a deep learning algorithm.
Background
Cancer is one of the diseases that severely threatens human health, with extremely high morbidity and mortality in clinical practice. However, early cancer diagnosis remains a challenging task. Traditional cancer diagnostic methods typically involve tissue biopsies and microscopic pathology analysis, which are time consuming, costly, invasive to the patient, and subjective in outcome. Mass spectrometry is an important technology widely applied to the fields of chemistry, biomedicine, environmental science and the like, and can provide distribution information of chemical components in a tissue sample. This technique has been widely used in cancer diagnosis and treatment response assessment, but existing mass spectrometry data analysis methods still present challenges for information extraction and identification of complex biomarkers.
Disclosure of Invention
The invention provides a multi-cancer diagnosis method and system based on mass spectrum data and a deep learning algorithm, which solve the above technical problems and specifically adopt the following technical scheme:
A multi-cancer diagnosis method based on mass spectrum data and a deep learning algorithm, comprising:
acquiring mass spectrum datasets of multiple cancer tissues and a mass spectrum dataset of normal tissue, wherein the mass spectrum datasets of multiple cancer tissues comprise a mass spectrum dataset of lung cancer tissue and a mass spectrum dataset of thyroid cancer tissue; each mass spectrum data is an array composed of retention times Rt, mass-to-charge ratios m/z and the peak intensity values (intensity) corresponding to the mass-to-charge ratios m/z; each retention time Rt corresponds to a group of mass-to-charge ratios m/z and to the peak intensity values corresponding one-to-one to that group; each mass spectrum data contains R retention times Rt and corresponds to the array $((Rt_1, (m/z_1, m/z_2, \ldots, m/z_n), (intensity_1, intensity_2, \ldots, intensity_n)), \ldots, (Rt_R, (m/z_1, m/z_2, \ldots, m/z_l), (intensity_1, intensity_2, \ldots, intensity_l)))$, wherein n and l are the numbers of mass-to-charge ratios m/z at the respective retention times Rt; the group of mass-to-charge ratios $(m/z_1, m/z_2, \ldots, m/z_n)$ at each retention time Rt is denoted M/Zs, and the corresponding group of peak intensity values $(intensity_1, intensity_2, \ldots, intensity_n)$ is denoted Intensities;
discretizing the M/Zs of each mass spectrum data in the mass spectrum dataset and standardizing the Intensities, and obtaining a matrix corresponding to each mass spectrum data based on the processed array;
training the deep learning model with the obtained matrix of each mass spectrum data and the corresponding classification information;
acquiring mass spectrum data of an object to be identified, discretizing the M/Zs in the mass spectrum data and standardizing the Intensities to obtain a matrix of the mass spectrum data of the object to be identified;
inputting the matrix corresponding to the mass spectrum data of the object to be identified into the trained deep learning model to obtain a classification result.
Further, for the R M/Zs of each mass spectrum data, the discretized index of each m/z in the M/Zs is calculated by the following formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - m/z_{\min}}{S} \right\rfloor$$
wherein Index is the index, $\lfloor \cdot \rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $m/z_{\min}$ is the preset minimum of the discretization range;
each intensity in the R Intensities of each mass spectrum data is standardized by the following formula:
$$intensity_{\mathrm{std}} = \frac{intensity - intensity_{\min}}{intensity_{\max} - intensity_{\min}}$$
wherein $intensity_{\max}$ represents the maximum intensity value in the Intensities array, $intensity_{\min}$ represents the minimum intensity value in the Intensities array, and $intensity_{\mathrm{std}}$ represents the standard peak value;
the standard peak values falling in the same m/z index are accumulated by the following formula:
$$I_k = \sum_{\mathrm{Index} = k} intensity_{\mathrm{std}}$$
wherein $I_k$ represents the accumulated value of all standard peak values whose discretized index is k, and N represents the maximum value of the index;
with the above preprocessing, the R groups of M/Zs and Intensities in each mass spectrum data yield a matrix $X \in \mathbb{R}^{M \times R}$ of M rows and R columns, wherein R represents the number of Rt in each mass spectrum data, M represents the length of the M/Zs after discretization, and M is equal to N.
Further, the deep learning model includes a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module includes position encoding and a multi-head attention mechanism.
Further, the loss function of the deep learning model is calculated as:
$$L(\hat{y}, y) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij}$$
where n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ represents the value of the j-th class in the true label of sample i, and $\hat{y}_{ij}$ represents the probability predicted by the model that sample i belongs to the j-th class.
Further, training of the deep learning model is complete when the loss function converges or when training reaches epoch 200.
A multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm, comprising:
a data acquisition module, used for acquiring mass spectrum datasets of multiple cancer tissues and a mass spectrum dataset of normal tissue, wherein the mass spectrum datasets of multiple cancer tissues comprise a mass spectrum dataset of lung cancer tissue and a mass spectrum dataset of thyroid cancer tissue; each mass spectrum data is an array composed of retention times Rt, mass-to-charge ratios m/z and the peak intensity values (intensity) corresponding to the mass-to-charge ratios m/z; each retention time Rt corresponds to a group of mass-to-charge ratios m/z and to the peak intensity values corresponding one-to-one to that group; each mass spectrum data contains R retention times Rt and corresponds to the array $((Rt_1, (m/z_1, m/z_2, \ldots, m/z_n), (intensity_1, intensity_2, \ldots, intensity_n)), \ldots, (Rt_R, (m/z_1, m/z_2, \ldots, m/z_l), (intensity_1, intensity_2, \ldots, intensity_l)))$, wherein n and l are the numbers of mass-to-charge ratios m/z at the respective retention times Rt; the group of mass-to-charge ratios $(m/z_1, m/z_2, \ldots, m/z_n)$ at each retention time Rt is denoted M/Zs, and the corresponding group of peak intensity values $(intensity_1, intensity_2, \ldots, intensity_n)$ is denoted Intensities;
a data preprocessing module, used for discretizing the M/Zs of each mass spectrum data in the mass spectrum dataset and standardizing the Intensities, and for obtaining a matrix corresponding to each mass spectrum data based on the processed array;
a multi-cancer classification module, comprising a deep learning model, wherein the deep learning model is trained with the matrix of each mass spectrum data obtained by the data preprocessing module and the corresponding classification information;
after training is completed, mass spectrum data of the object to be identified are acquired through the data acquisition module; the data preprocessing module then discretizes the M/Zs in the mass spectrum data and standardizes the Intensities to obtain a matrix of the mass spectrum data of the object to be identified; this matrix is input to the multi-cancer classification module, which obtains a classification result through the trained deep learning model.
Further, the data preprocessing module performs the preprocessing as follows:
for the R M/Zs of each mass spectrum data, the discretized index of each m/z in the M/Zs is calculated by the following formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - m/z_{\min}}{S} \right\rfloor$$
wherein Index is the index, $\lfloor \cdot \rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $m/z_{\min}$ is the preset minimum of the discretization range;
each intensity in the R Intensities of each mass spectrum data is standardized by the following formula:
$$intensity_{\mathrm{std}} = \frac{intensity - intensity_{\min}}{intensity_{\max} - intensity_{\min}}$$
wherein $intensity_{\max}$ represents the maximum intensity value in the Intensities array, $intensity_{\min}$ represents the minimum intensity value in the Intensities array, and $intensity_{\mathrm{std}}$ represents the standard peak value;
the standard peak values falling in the same m/z index are accumulated by the following formula:
$$I_k = \sum_{\mathrm{Index} = k} intensity_{\mathrm{std}}$$
wherein $I_k$ represents the accumulated value of all standard peak values whose discretized index is k, and N represents the maximum value of the index;
with the above preprocessing, the R groups of M/Zs and Intensities in each mass spectrum data yield a matrix $X \in \mathbb{R}^{M \times R}$ of M rows and R columns, wherein R represents the number of Rt in each mass spectrum data, M represents the length of the M/Zs after discretization, and M is equal to N.
Further, the deep learning model in the multi-cancer classification module comprises a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module comprises position encoding and a multi-head attention mechanism.
Further, the loss function of the deep learning model in the multi-cancer classification module is calculated as:
$$L(\hat{y}, y) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij}$$
where n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ represents the value of the j-th class in the true label of sample i, and $\hat{y}_{ij}$ represents the probability predicted by the model that sample i belongs to the j-th class.
Further, training of the deep learning model is complete when the loss function converges or when training reaches epoch 200.
A beneficial effect of the invention is that the provided multi-cancer diagnosis method and system based on mass spectrum data and a deep learning algorithm do not rely on the identification of markers: cancer can be diagnosed rapidly using only raw mass spectrum data as the model input. Features are extracted directly from the raw mass spectrum data of multiple cancer types, without identifying biomarkers for each cancer type.
A further advantage of the invention is that the initial vector encoding of one m/z index can be regarded as $[intensity_1, intensity_2, \ldots, intensity_R]$, where R is the number of Rt contained in one mass spectrum data, and these R values are then added together to give a single accumulated value per m/z index. Therefore, when multiple mass spectrum data are input to the model, no alignment in the Rt dimension is needed. The method is thus not tied to a particular mass spectrum acquisition instrument, works across devices, is easier to train, and generalizes better.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a flow chart of a multi-cancer diagnostic method based on mass spectral data and a deep learning algorithm of the present invention;
FIG. 2 is a graph of the loss function curve of the multi-cancer diagnosis model of the present invention during training.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
FIG. 1 shows a multi-cancer diagnosis method based on mass spectrum data and a deep learning algorithm, which comprises the following steps:
Mass spectrum datasets of multiple cancer tissues and a mass spectrum dataset of normal tissue are acquired. Each mass spectrum data is an array composed of retention times Rt, mass-to-charge ratios m/z and the peak intensity values (intensity) corresponding to the mass-to-charge ratios m/z. Each retention time Rt corresponds to a group of mass-to-charge ratios m/z and to the peak intensity values corresponding one-to-one to that group. Each mass spectrum data contains R retention times Rt and corresponds to the array $((Rt_1, (m/z_1, m/z_2, \ldots, m/z_n), (intensity_1, intensity_2, \ldots, intensity_n)), \ldots, (Rt_R, (m/z_1, m/z_2, \ldots, m/z_l), (intensity_1, intensity_2, \ldots, intensity_l)))$, wherein n and l are the numbers of mass-to-charge ratios m/z at the respective retention times Rt. The group of mass-to-charge ratios $(m/z_1, m/z_2, \ldots, m/z_n)$ at each retention time Rt is denoted M/Zs, and the corresponding group of peak intensity values $(intensity_1, intensity_2, \ldots, intensity_n)$ is denoted Intensities.
In the present application, the mass spectrum datasets of multiple cancer tissues include a mass spectrum dataset of lung cancer tissue and a mass spectrum dataset of thyroid cancer tissue. Raw lung cancer and thyroid cancer mass spectrum files (.raw format) are downloaded from a mass spectrum database. The retention time (Rt), the mass-to-charge ratios of the primary mass spectrum, the mass-to-charge ratios of the secondary mass spectrum, and the corresponding peak value (intensity) sequences are extracted from the raw mass spectrum data to construct R groups of (Rt, M/Zs, Intensities).
Specifically, the data are downloaded in the following order: the download address is read first, and it is determined whether the address belongs to the PRIDE or the iProX repository; if so, the FTP download link is followed, and the mass spectrum files with the .raw suffix are downloaded into a folder named Multirawtata for storage.
The lung cancer dataset used in the embodiment of the invention is IPX0001451001; the dataset may be obtained through the link https:// www.iprox.cn// page/scv017.Htmlquery = IPX 0001451001.
The thyroid cancer dataset used in the embodiment of the invention is IPX0001444001; the dataset may be obtained through the link https:// www.iprox.cn// page/sub-project.
From the IPX0001451001 lung cancer dataset and the IPX0001444001 thyroid cancer dataset, the data collected on mass spectrometry instruments with a nano-electrospray ion source (Nano-ESI) were selected for analysis. The raw data files with the .raw suffix are downloaded from the public database iProX and converted into the universal mzML format using MSConverter software. In the IPX0001451001 lung cancer dataset, the mass spectrum data of cancer tissue are labeled as the lung cancer class and the mass spectrum data of tissue adjacent to the cancer are labeled as the normal class. Similarly, in the IPX0001444001 thyroid cancer dataset, the mass spectrum data of thyroid cancer tissue are labeled as the thyroid cancer class and the mass spectrum data of tissue adjacent to the cancer are labeled as the normal class.
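By way of non-limiting illustration, the labeling rule described above may be sketched as follows. The integer class codes, the function names `label_for` and `one_hot`, and the boolean flag `is_tumor_tissue` are assumptions of this sketch; the application names the three classes but does not specify how they are encoded.

```python
from typing import List

# Hypothetical class order: the application names these three classes
# but does not state their integer codes.
CLASS_NAMES = ["normal", "lung cancer", "thyroid cancer"]


def label_for(dataset_id: str, is_tumor_tissue: bool) -> int:
    """Return a class index for a sample from one of the two iProX datasets."""
    if not is_tumor_tissue:
        # Tissue adjacent to the cancer is labeled normal in both datasets.
        return CLASS_NAMES.index("normal")
    if dataset_id == "IPX0001451001":
        return CLASS_NAMES.index("lung cancer")
    if dataset_id == "IPX0001444001":
        return CLASS_NAMES.index("thyroid cancer")
    raise ValueError(f"unknown dataset: {dataset_id}")


def one_hot(label: int, num_classes: int = len(CLASS_NAMES)) -> List[float]:
    """Encode a class index as a 0/1 vector, the form a cross-entropy loss expects."""
    return [1.0 if j == label else 0.0 for j in range(num_classes)]
```

The one-hot form is shown because the loss function of the application is defined over per-class label values of 0 or 1.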
The Rt, m/z and intensity information of the mass spectra is extracted from the mzML files with a Python proteomics package and stored for later use in the format (Rt, (M/Zs, Intensities)).
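By way of non-limiting illustration, the stored format (Rt, (M/Zs, Intensities)) may be represented in Python as follows. The type alias `Scan`, the helper `check_spectrum` and the numeric values are assumptions of this sketch and are not taken from the datasets.

```python
from typing import List, Tuple

# One scan: (Rt, (M/Zs, Intensities)), mirroring the storage format above.
Scan = Tuple[float, Tuple[List[float], List[float]]]


def check_spectrum(scans: List[Scan]) -> int:
    """Check the one-to-one pairing of m/z and intensity; return R, the number of Rt."""
    for rt, (mzs, intensities) in scans:
        if len(mzs) != len(intensities):
            raise ValueError(
                f"scan at Rt={rt}: {len(mzs)} m/z values vs {len(intensities)} intensities"
            )
    return len(scans)


# Toy values for illustration only. The two scans hold different numbers
# of peaks, corresponding to n and l in the array notation.
example = [
    (0.5, ([400.12, 400.15, 812.40], [1.0e4, 3.0e4, 5.0e3])),
    (1.0, ([400.13, 655.30], [2.0e4, 8.0e3])),
]
```

Each scan may hold a different number of peaks, which is why the later discretization step is needed before the data can form a fixed-height matrix.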
The M/Zs of each mass spectrum data in the mass spectrum dataset are discretized and the Intensities are standardized, and a matrix $X \in \mathbb{R}^{M \times R}$ corresponding to each mass spectrum data is obtained from the processed array.
Specifically, one mass spectrum data contains R M/Zs. For the R M/Zs of each mass spectrum data, the discretized index of each m/z in the M/Zs is calculated by the following formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - m/z_{\min}}{S} \right\rfloor$$
wherein Index is the index, $\lfloor \cdot \rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $m/z_{\min}$ is the preset minimum of the discretization range. In the present application, S is set to 0.1 and m/z is limited to a preset range.
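By way of non-limiting illustration, the discretization of a single m/z value may be sketched as follows. The function name `mz_index` and the lower bound of 400.0 used in the usage note are placeholders of this sketch; only the scale S = 0.1 is stated in the application.

```python
import math


def mz_index(mz: float, mz_min: float, scale: float = 0.1) -> int:
    """Discretized index of one m/z value: floor((m/z - m/z_min) / S)."""
    return math.floor((mz - mz_min) / scale)
```

With a placeholder lower bound of 400.0, `mz_index(400.15, 400.0)` falls in bin 1. Values lying exactly on a bin edge may land in either neighbouring bin because of floating-point rounding, so an implementation may prefer integer arithmetic on scaled values.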
Each intensity in the R Intensities of each mass spectrum data is standardized by the following formula:
$$intensity_{\mathrm{std}} = \frac{intensity - intensity_{\min}}{intensity_{\max} - intensity_{\min}}$$
wherein $intensity_{\max}$ represents the maximum intensity value in the Intensities array, $intensity_{\min}$ represents the minimum intensity value in the Intensities array, and $intensity_{\mathrm{std}}$ represents the standard peak value.
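By way of non-limiting illustration, a min-max standardization of one Intensities array may be sketched as follows. Applying the scaling separately to each Intensities array, and returning zeros when all values are equal, are assumptions of this sketch.

```python
from typing import List


def standardize(intensities: List[float]) -> List[float]:
    """Min-max scale one Intensities array to the interval [0, 1]."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        # Degenerate case, not discussed in the application: every peak
        # has the same intensity, so all standard peaks are set to zero.
        return [0.0 for _ in intensities]
    return [(x - lo) / (hi - lo) for x in intensities]
```

For example, `standardize([10.0, 20.0, 30.0])` maps the smallest peak to 0, the largest to 1 and the middle one to 0.5.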
The standard peak values falling in the same m/z index are accumulated by the following formula:
$$I_k = \sum_{\mathrm{Index} = k} intensity_{\mathrm{std}}$$
wherein $I_k$ represents the accumulated value of all standard peak values whose discretized index is k, and N represents the maximum value of the index.
After the M/Zs are discretized, the same index may contain several peaks, so the standard peak values within the same index are summed so that each index corresponds to exactly one accumulated value $I_k$.
With the above preprocessing, the R groups of M/Zs and Intensities in each mass spectrum data yield a matrix $X \in \mathbb{R}^{M \times R}$ of M rows and R columns, wherein R represents the number of Rt in each mass spectrum data, M represents the length of the M/Zs after discretization, and M is equal to N. In the present application, with the preset m/z range and S set to 0.1, M is 15400 and N is 15400.
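By way of non-limiting illustration, the three preprocessing steps (discretization, standardization and accumulation) may be combined into one routine that builds the M-row, R-column matrix. The function name `preprocess`, the parameter names, and the choice to drop peaks outside the preset range are assumptions of this sketch.

```python
import math
from typing import List, Tuple

Scan = Tuple[float, Tuple[List[float], List[float]]]


def preprocess(scans: List[Scan], mz_min: float, num_bins: int,
               scale: float = 0.1) -> List[List[float]]:
    """Build an M x R matrix: rows are m/z bins, columns are retention times."""
    num_rt = len(scans)
    matrix = [[0.0] * num_rt for _ in range(num_bins)]
    for col, (_rt, (mzs, intensities)) in enumerate(scans):
        if not intensities:
            continue  # an empty scan leaves its column at zero
        lo, hi = min(intensities), max(intensities)
        span = (hi - lo) or 1.0  # avoid division by zero for flat scans
        for mz, inten in zip(mzs, intensities):
            k = math.floor((mz - mz_min) / scale)
            # Assumption: peaks outside the preset m/z range are dropped.
            if 0 <= k < num_bins:
                # Standard peaks sharing the same index are accumulated.
                matrix[k][col] += (inten - lo) / span
    return matrix


# Toy input with a placeholder range; the application uses num_bins = 15400.
toy = [
    (0.5, ([400.12, 400.15, 400.95], [10.0, 20.0, 30.0])),
    (1.0, ([400.55, 400.95], [1.0, 3.0])),
]
X = preprocess(toy, mz_min=400.0, num_bins=10)
```

In the toy input the first two peaks of the first scan fall in the same bin, so their standard peak values are summed into one matrix entry.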
The deep learning model is trained with the obtained matrix of each mass spectrum data and the corresponding classification information.
In an embodiment of the present application, the deep learning model comprises a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module comprises position encoding (positional encoding) and a multi-head attention mechanism (multi-head attention).
The loss function of the deep learning model is calculated as:
$$L(\hat{y}, y) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij}$$
where n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ represents the value (0 or 1) of the j-th class in the true label of sample i, and $\hat{y}_{ij}$ represents the probability predicted by the model that sample i belongs to the j-th class.
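By way of non-limiting illustration, a categorical cross-entropy loss over one-hot labels and predicted class probabilities may be computed as follows. Averaging over the n samples and clamping probabilities at a small `eps` are assumptions of this sketch.

```python
import math
from typing import List


def cross_entropy(y_true: List[List[float]], y_prob: List[List[float]],
                  eps: float = 1e-12) -> float:
    """Mean categorical cross-entropy over n samples and c classes."""
    n = len(y_true)
    total = 0.0
    for labels, probs in zip(y_true, y_prob):
        for y, p in zip(labels, probs):
            # Clamp p so that log(0) cannot occur (assumption of this sketch).
            total += y * math.log(max(p, eps))
    return -total / n
```

Because the labels are 0 or 1, only the predicted probability of the true class contributes to each sample's loss.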
The number of iterations can be set according to the actual situation. In this application, training of the deep learning model is complete when the loss function converges or when training reaches epoch 200.
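By way of non-limiting illustration, the stopping rule may be sketched as follows. The application does not define convergence numerically, so the tolerance `tol` and the window `patience` are hypothetical parameters of this sketch; only the limit of 200 epochs is taken from the text.

```python
from typing import List


def should_stop(losses: List[float], max_epochs: int = 200,
                tol: float = 1e-4, patience: int = 5) -> bool:
    """Decide whether training is complete, given the per-epoch losses so far."""
    if len(losses) >= max_epochs:
        return True  # epoch limit reached
    if len(losses) > patience:
        # Hypothetical convergence test: the loss has moved by less than
        # tol over the last patience + 1 epochs.
        recent = losses[-(patience + 1):]
        return max(recent) - min(recent) < tol
    return False
```

A training loop would append the loss after each epoch and call `should_stop` before starting the next one.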
In the embodiment of the application, the raw data are divided into a training set and a test set; the training set is used to train the model, and the test set is used to evaluate its performance. FIG. 2 shows the loss function curve when the deep learning model is trained on the training set in an embodiment of the present application. The figure shows that the model converges quickly during training, and that the loss function stabilizes once the number of iterations reaches 200.
The obtained matrix $X \in \mathbb{R}^{M \times R}$, where M represents the length of the M/Zs after discretization (M is 15400 in this example) and R is the number of Rt of the mass spectrum data, is accumulated along the Rt dimension to obtain a vector of length M.
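By way of non-limiting illustration, the accumulation along the Rt dimension may be sketched as a row sum. The function name `sum_over_rt` and the toy matrices are assumptions of this sketch.

```python
from typing import List


def sum_over_rt(matrix: List[List[float]]) -> List[float]:
    """Collapse an M x R matrix to a length-M vector by summing each row."""
    return [sum(row) for row in matrix]


# Two toy samples with the same number of m/z bins (M = 2) but
# different numbers of retention times (R = 2 and R = 3).
a = [[1.0, 2.0], [0.0, 0.5]]
b = [[1.0, 1.0, 1.0], [0.2, 0.0, 0.3]]
```

Both samples reduce to vectors of the same length, which illustrates why samples with different numbers of retention times need no alignment in the Rt dimension.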
The accumulated vector is first input to the one-dimensional CNN module for convolution to obtain a feature matrix, wherein d is the hidden dimension of the one-dimensional CNN module.
The feature matrix is input to the Positional Embedding of the Transformer to compute the position encoding matrix P, wherein d is also the hidden dimension of the Positional Embedding. The feature matrix is added to P to obtain the input of the multi-head attention in the Transformer.
This input is then fed to the multi-head attention of the Transformer. It is first multiplied by three weight matrices to obtain the matrices Q, K and V needed to calculate the attention values, wherein q, k and v denote the dimensions of Q, K and V respectively, and h denotes the number of heads.
Then Q is multiplied by the transpose of K, the product is scaled, and the attention score matrix Attention is obtained through the Softmax function; Attention is then multiplied by V to obtain the attention output, $\mathrm{softmax}\left(QK^{\top} / \sqrt{k}\right) V$, wherein the scaling factor is $\sqrt{k}$.
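By way of non-limiting illustration, standard scaled dot-product attention for a single head may be sketched in plain Python as follows. The weight-matrix projections and the splitting into h heads are omitted, and the helper names `softmax`, `matmul` and `attention` are assumptions of this sketch.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V, where d_k is the key dimension."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)
```

Each output row is a weighted average of the rows of V, with weights given by the softmax of the scaled query-key scores.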
The attention output then passes through a residual connection, that is, the sublayer input and the sublayer output Sublayer(·) are added, followed by one Layer Normalization layer. Layer Normalization transforms the activations of each layer to a distribution with the same mean and variance, which accelerates the convergence of the model.
The result passes through a 1-layer feedforward neural network, whose output is added to its input, followed by another Layer Normalization layer. After flattening, the result passes through a 2-layer feedforward neural network and then one Softmax layer to obtain the prediction $\hat{y}$, wherein $\hat{y}$ is an array of probability values for the three categories. In the embodiment of the present application, the category with the highest probability value in $\hat{y}$ is taken as the prediction result; a preset classification threshold can also be set according to actual requirements, which is not limited by the embodiment of the invention.
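By way of non-limiting illustration, the residual addition followed by Layer Normalization, and the final Softmax with selection of the most probable class, may be sketched as follows. The learnable gain and bias of Layer Normalization are omitted, and the function names and the class order are assumptions of this sketch.

```python
import math
from typing import List, Tuple


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x: List[float], sublayer_out: List[float]) -> List[float]:
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


def predict(logits: List[float], class_names: List[str]) -> Tuple[str, List[float]]:
    """Softmax over the final outputs, then pick the most probable class."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs
```

The returned probability array corresponds to the three-class prediction; a preset classification threshold, if used, would be applied to these probabilities instead of taking the maximum.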
The performance of the deep learning model on the validation set may be evaluated by accuracy (Accuracy), that is, the proportion of correctly predicted samples among the total number of samples. The performance evaluation results are shown in Table 1 below: the deep learning model constructed by the invention reaches an accuracy of 87% on the validation set of multi-cancer mass spectrum data.
TABLE 1 Multi-cancer data training verification set Classification index results based on deep learning method
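By way of non-limiting illustration, the accuracy criterion may be computed as follows. The function name `accuracy` and the toy labels in the usage note are assumptions of this sketch and are not results of the application.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Proportion of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

For example, three correct predictions out of four toy samples give an accuracy of 0.75.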
After training is completed, mass spectrum data of an object to be identified are acquired, the M/Zs in the mass spectrum data are discretized and the Intensities are standardized by the above method, and a matrix of the mass spectrum data of the object to be identified is obtained.
The matrix corresponding to the mass spectrum data of the object to be identified is input into the trained deep learning model to obtain a classification result.
As shown in fig. 1, the present application further discloses a multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm, comprising: the system comprises a data acquisition module, a data preprocessing module and a multi-cancer classification module.
The data acquisition module is used for acquiring mass spectrum datasets of multiple cancer tissues and a mass spectrum dataset of normal tissue, wherein the mass spectrum datasets of multiple cancer tissues comprise a mass spectrum dataset of lung cancer tissue and a mass spectrum dataset of thyroid cancer tissue. Each mass spectrum data is an array composed of retention times Rt, mass-to-charge ratios m/z and the peak intensity values (intensity) corresponding to the mass-to-charge ratios m/z. Each retention time Rt corresponds to a group of mass-to-charge ratios m/z and to the peak intensity values corresponding one-to-one to that group. Each mass spectrum data contains R retention times Rt and corresponds to the array $((Rt_1, (m/z_1, m/z_2, \ldots, m/z_n), (intensity_1, intensity_2, \ldots, intensity_n)), \ldots, (Rt_R, (m/z_1, m/z_2, \ldots, m/z_l), (intensity_1, intensity_2, \ldots, intensity_l)))$, wherein n and l are the numbers of mass-to-charge ratios m/z at the respective retention times Rt. The group of mass-to-charge ratios $(m/z_1, m/z_2, \ldots, m/z_n)$ at each retention time Rt is denoted M/Zs, and the corresponding group of peak intensity values $(intensity_1, intensity_2, \ldots, intensity_n)$ is denoted Intensities. The data preprocessing module is used for discretizing the M/Zs of each mass spectrum data in the mass spectrum dataset and standardizing the Intensities, and for obtaining a matrix corresponding to each mass spectrum data based on the processed array. The multi-cancer classification module comprises a deep learning model, which is trained with the matrix of each mass spectrum data obtained by the data preprocessing module and the corresponding classification information.
After training is completed, mass spectrum data of an object to be identified are acquired through the data acquisition module; the data preprocessing module then discretizes the M/Zs in the mass spectrum data and standardizes the Intensities to obtain a matrix of the mass spectrum data of the object to be identified. This matrix is input to the multi-cancer classification module, which obtains a classification result through the trained deep learning model.
As a preferred embodiment, the data preprocessing module performs preprocessing as follows:
For the R groups of M/Zs of each mass spectrum data, the discretization index of each m/z in the M/Zs is calculated by the following formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - m/z_{\min}}{S} \right\rfloor$$
wherein Index is the index, $\lfloor\cdot\rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $m/z_{\min}$ is the preset minimum of the discretization range.
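A minimal sketch of the discretization step, assuming the index takes the form floor((m/z − m/z_min) / S), which is the form suggested by the symbol definitions above. The function name `discretize_index` and the example values are illustrative assumptions.

```python
import math


def discretize_index(mz: float, mz_min: float, scale: float) -> int:
    """Return floor((mz - mz_min) / scale), the assumed form of the index formula."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return math.floor((mz - mz_min) / scale)


# Toy values: range minimum 100.0 and discretization scale 1.0 (both assumed).
idx_low = discretize_index(100.0, mz_min=100.0, scale=1.0)
idx_mid = discretize_index(250.7, mz_min=100.0, scale=1.0)
```

A smaller scale S gives finer bins and therefore a longer discretized M/Zs axis.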
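Each intensity in the R groups of Intensities in each mass spectrum data is standardized by the following formula:
$$\mathrm{intensity}_{\mathrm{std}} = \frac{\mathrm{intensity} - \mathrm{intensity}_{\min}}{\mathrm{intensity}_{\max} - \mathrm{intensity}_{\min}}$$
wherein $\mathrm{intensity}_{\max}$ is the maximum intensity value in the Intensities array, $\mathrm{intensity}_{\min}$ is the minimum intensity value in the Intensities array, and $\mathrm{intensity}_{\mathrm{std}}$ is the standard peak value.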
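The standardization can be sketched as a min-max scaling over one Intensities group, which is the form implied by the maximum and minimum terms defined above. The function name `standardize` and the handling of a group whose values are all equal (returning zeros) are assumptions, since the text does not cover that case.

```python
from typing import List


def standardize(intensities: List[float]) -> List[float]:
    """Min-max scale one Intensities group to the range [0, 1]."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        # Assumption: a flat group carries no relative peak information.
        return [0.0 for _ in intensities]
    return [(x - lo) / (hi - lo) for x in intensities]


standard_peaks = standardize([10.0, 40.0, 25.0])
```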
The standard peak values with the same m/z index are accumulated by the following formula:
$$P_k = \sum_{i:\,\mathrm{Index}_i = k} \mathrm{intensity}_{\mathrm{std},i}$$
wherein $P_k$ is the accumulated value of all standard peaks whose discretized index is k, and N is the maximum value of the index.
The R groups of M/Zs and Intensities in each mass spectrum data are preprocessed by the above preprocessing method to obtain a matrix of M rows and R columns, wherein R represents the number of retention times Rt in each mass spectrum data, M represents the length of the M/Zs after discretization, and M is equal to N.
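Putting the three steps together, the following is a minimal sketch of how the M-row, R-column matrix might be assembled. It assumes a floor-based index, min-max standardization per retention time, and summation of standard peaks that share an index. The function name `build_matrix`, the `n_bins` parameter, and the choice to drop peaks that fall outside the preset range are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Scan = Tuple[Sequence[float], Sequence[float]]  # (M/Zs, Intensities) at one Rt


def build_matrix(scans: List[Scan], mz_min: float, scale: float,
                 n_bins: int) -> List[List[float]]:
    """Return an n_bins x R matrix; column r holds the binned peaks of scan r."""
    matrix = [[0.0] * len(scans) for _ in range(n_bins)]
    for r, (mzs, intensities) in enumerate(scans):
        if not intensities:
            continue  # an empty scan leaves its column at zero
        lo, hi = min(intensities), max(intensities)
        span = hi - lo
        for mz, inten in zip(mzs, intensities):
            k = math.floor((mz - mz_min) / scale)
            if not 0 <= k < n_bins:
                continue  # assumption: peaks outside the preset range are dropped
            std = (inten - lo) / span if span else 0.0
            matrix[k][r] += std  # accumulate standard peaks sharing index k
    return matrix


# Toy input: two retention times, range minimum 100.0, scale 1.0, three bins.
toy_scans = [
    ([100.2, 100.7, 102.5], [10.0, 40.0, 25.0]),
    ([101.1], [7.0]),
]
toy_matrix = build_matrix(toy_scans, mz_min=100.0, scale=1.0, n_bins=3)
```

In the toy input, the first two peaks of the first scan fall into the same bin, so their standard peak values (0.0 and 1.0) are summed in row 0 of column 0.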
As a preferred embodiment, the deep learning model in the multi-cancer classification module comprises a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module comprises positional encoding and a multi-head attention mechanism.
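To illustrate the two components named for the Transformer module, the sketch below computes positional encoding and multi-head self-attention on a toy sequence in plain Python. It is not the patented model. The sinusoidal form of the encoding is one common choice and is an assumption here; the learned query, key, value, and output projections are simplified to identity; and the CNN and feedforward layers are not shown. All function names and the toy values are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def positional_encoding(seq_len: int, d_model: int) -> Matrix:
    """Sinusoidal position codes (a common choice; assumed, not stated in the text)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            pe[pos][i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return pe


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(x: Matrix, n_heads: int) -> Matrix:
    """Scaled dot-product self-attention with the feature axis split into heads.

    Simplification: the learned Q/K/V/output projections are taken as identity.
    """
    d_model = len(x[0])
    if d_model % n_heads:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    out = [[0.0] * d_model for _ in x]
    for h in range(n_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        heads = [row[lo:hi] for row in x]
        for t, q in enumerate(heads):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head)
                      for k in heads]
            weights = softmax(scores)
            for j in range(d_head):
                out[t][lo + j] = sum(w * v[j] for w, v in zip(weights, heads))
    return out


# Toy sequence: 3 positions with 4 features each (values are made up).
tokens = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
pe = positional_encoding(len(tokens), 4)
encoded = [[t + p for t, p in zip(row, pos)] for row, pos in zip(tokens, pe)]
attended = multi_head_self_attention(encoded, n_heads=2)
```

Because the attention weights for each position sum to 1, every output feature is a weighted average of that feature across the sequence, which is how attention lets each position draw on information from the others.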
As a preferred embodiment, the loss function of the deep learning model in the multi-cancer classification module is calculated by the following formula:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c} y_{ij}\,\log \hat{y}_{ij}$$
wherein n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ is the value of the j-th class in the true label of sample i, and $\hat{y}_{ij}$ is the probability predicted by the model that sample i belongs to the j-th class.
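The symbol definitions above match a multi-class cross-entropy loss. The following minimal sketch assumes the loss is averaged over the n samples and that the true labels are one-hot vectors. The function name `cross_entropy`, the `eps` clamp that guards against log(0), and the example values are assumptions.

```python
import math
from typing import List


def cross_entropy(y_true: List[List[float]], y_pred: List[List[float]],
                  eps: float = 1e-12) -> float:
    """Mean over samples of -sum_j y_ij * log(p_ij)."""
    n = len(y_true)
    total = 0.0
    for labels, probs in zip(y_true, y_pred):
        for y, p in zip(labels, probs):
            total += y * math.log(max(p, eps))  # clamp avoids log(0)
    return -total / n


# Two samples, three classes (for example lung cancer, thyroid cancer, normal).
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
loss = cross_entropy(labels, probs)
```

With one-hot labels only the probability assigned to the true class contributes, so here the loss equals −log(0.5) = log 2.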
As a preferred embodiment, the deep learning model completes training when the loss function converges or when training reaches 200 epochs.
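The stopping rule can be sketched as a loop that ends at 200 epochs or earlier once the loss stops improving. The text does not define convergence, so the `tol` and `patience` parameters below are assumptions, and `run_epoch` is a hypothetical callback that trains for one epoch and returns the loss.

```python
from typing import Callable

MAX_EPOCHS = 200  # epoch limit stated in the text


def train(run_epoch: Callable[[int], float], tol: float = 1e-4,
          patience: int = 5) -> int:
    """Run epochs until the loss plateaus or MAX_EPOCHS; return epochs run."""
    best = float("inf")
    stale = 0  # consecutive epochs without a meaningful improvement
    for epoch in range(1, MAX_EPOCHS + 1):
        loss = run_epoch(epoch)
        if best - loss > tol:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch  # treated as converged (assumed criterion)
    return MAX_EPOCHS


# A loss that never changes is treated as converged after `patience` stale epochs.
epochs_flat = train(lambda epoch: 1.0)
# A loss that keeps improving runs to the epoch limit.
epochs_full = train(lambda epoch: -float(epoch))
```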
For the specific contents of each module in the multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm, refer to the multi-cancer diagnosis method based on mass spectrum data and a deep learning algorithm; they are not repeated here.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be appreciated by persons skilled in the art that the above embodiments are not intended to limit the invention in any way, and that all technical solutions obtained by means of equivalent substitutions or equivalent transformations fall within the scope of the invention.

Claims (10)

1. A method for diagnosing multiple cancers based on mass spectrum data and a deep learning algorithm, comprising:
acquiring mass spectrum data sets of multiple cancer tissues and a mass spectrum data set of normal tissues, wherein the mass spectrum data sets of the multiple cancer tissues comprise a mass spectrum data set of lung cancer tissues and a mass spectrum data set of thyroid cancer tissues, each mass spectrum data is an array consisting of retention times Rt, mass-to-charge ratios m/z, and the peak intensity values (intensity) corresponding to the mass-to-charge ratios m/z, each retention time Rt corresponds to a group of mass-to-charge ratios m/z and to the peak intensity values in one-to-one correspondence with that group, each mass spectrum data comprises R retention times Rt, the group of mass-to-charge ratios at each retention time Rt is denoted M/Zs, and the corresponding group of peak intensity values is denoted Intensities;
discretizing the M/Zs and standardizing the Intensities of each mass spectrum data in the mass spectrum data set, and obtaining a matrix corresponding to each mass spectrum data based on the processed array;
training a deep learning model with the obtained matrix of each mass spectrum data and the corresponding classification information;
acquiring mass spectrum data of an object to be identified, and discretizing the M/Zs and standardizing the Intensities in the mass spectrum data to obtain a matrix of the mass spectrum data of the object to be identified;
inputting the matrix corresponding to the mass spectrum data of the object to be identified into the trained deep learning model to obtain a classification result.
2. The method for diagnosing multiple cancers based on mass spectrum data and deep learning algorithm according to claim 1, wherein,
for the R groups of M/Zs of each mass spectrum data, the discretization index of each m/z in the M/Zs is calculated by the following formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - m/z_{\min}}{S} \right\rfloor$$
wherein Index is the index, $\lfloor\cdot\rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $m/z_{\min}$ is the preset minimum of the discretization range;
each intensity in the R groups of Intensities in each mass spectrum data is standardized by the following formula:
$$\mathrm{intensity}_{\mathrm{std}} = \frac{\mathrm{intensity} - \mathrm{intensity}_{\min}}{\mathrm{intensity}_{\max} - \mathrm{intensity}_{\min}}$$
wherein $\mathrm{intensity}_{\max}$ is the maximum intensity value in the Intensities array, $\mathrm{intensity}_{\min}$ is the minimum intensity value in the Intensities array, and $\mathrm{intensity}_{\mathrm{std}}$ is the standard peak value;
the standard peak values with the same m/z index are accumulated by the following formula:
$$P_k = \sum_{i:\,\mathrm{Index}_i = k} \mathrm{intensity}_{\mathrm{std},i}$$
wherein $P_k$ is the accumulated value of all standard peaks whose discretized index is k, and N is the maximum value of the index;
the R groups of M/Zs and Intensities in each mass spectrum data are preprocessed by the above preprocessing method to obtain a matrix of M rows and R columns, wherein R represents the number of retention times Rt in each mass spectrum data, M represents the length of the M/Zs after discretization, and M is equal to N.
3. The method for diagnosing multiple cancers based on mass spectrum data and deep learning algorithm according to claim 1, wherein,
the deep learning model comprises a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module comprises positional encoding and a multi-head attention mechanism.
4. The method for diagnosing multiple cancers based on mass spectrum data and deep learning algorithm according to claim 1, wherein,
the loss function of the deep learning model is calculated by the following formula:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c} y_{ij}\,\log \hat{y}_{ij}$$
wherein n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ is the value of the j-th class in the true label of sample i, and $\hat{y}_{ij}$ is the probability predicted by the model that sample i belongs to the j-th class.
5. The method for diagnosing multiple cancers based on mass spectrum data and deep learning algorithm according to claim 4, wherein,
the deep learning model completes training when the loss function converges or when training reaches 200 epochs.
6. A multi-cancer diagnosis system based on mass spectrometry data and a deep learning algorithm, comprising:
a data acquisition module, used to acquire mass spectrum data sets of multiple cancer tissues and a mass spectrum data set of normal tissues, wherein the mass spectrum data sets of the multiple cancer tissues comprise a mass spectrum data set of lung cancer tissues and a mass spectrum data set of thyroid cancer tissues, each mass spectrum data is an array consisting of retention times Rt, mass-to-charge ratios m/z, and the peak intensity values (intensity) corresponding to the mass-to-charge ratios m/z, each retention time Rt corresponds to a group of mass-to-charge ratios m/z and to the peak intensity values in one-to-one correspondence with that group, each mass spectrum data comprises R retention times Rt, the group of mass-to-charge ratios at each retention time Rt is denoted M/Zs, and the corresponding group of peak intensity values is denoted Intensities;
a data preprocessing module, used to discretize the M/Zs and standardize the Intensities of each mass spectrum data in the mass spectrum data set, and to obtain a matrix corresponding to each mass spectrum data based on the processed array;
a multi-cancer classification module comprising a deep learning model, wherein the deep learning model is trained with the matrix of each mass spectrum data obtained by the data preprocessing module and the corresponding classification information;
after training is completed, mass spectrum data of an object to be identified is acquired through the data acquisition module, the data preprocessing module discretizes the M/Zs and standardizes the Intensities in the mass spectrum data to obtain a matrix of the mass spectrum data of the object to be identified, the matrix corresponding to the mass spectrum data of the object to be identified is input to the multi-cancer classification module, and the multi-cancer classification module obtains a classification result through the trained deep learning model.
7. The multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm according to claim 6, wherein
the data preprocessing module performs preprocessing as follows:
for the R groups of M/Zs of each mass spectrum data, the discretization index of each m/z in the M/Zs is calculated by the following formula:
$$\mathrm{Index} = \left\lfloor \frac{m/z - m/z_{\min}}{S} \right\rfloor$$
wherein Index is the index, $\lfloor\cdot\rfloor$ is the floor (round-down) operation, S is the selected discretization scale, and $m/z_{\min}$ is the preset minimum of the discretization range;
each intensity in the R groups of Intensities in each mass spectrum data is standardized by the following formula:
$$\mathrm{intensity}_{\mathrm{std}} = \frac{\mathrm{intensity} - \mathrm{intensity}_{\min}}{\mathrm{intensity}_{\max} - \mathrm{intensity}_{\min}}$$
wherein $\mathrm{intensity}_{\max}$ is the maximum intensity value in the Intensities array, $\mathrm{intensity}_{\min}$ is the minimum intensity value in the Intensities array, and $\mathrm{intensity}_{\mathrm{std}}$ is the standard peak value;
the standard peak values with the same m/z index are accumulated by the following formula:
$$P_k = \sum_{i:\,\mathrm{Index}_i = k} \mathrm{intensity}_{\mathrm{std},i}$$
wherein $P_k$ is the accumulated value of all standard peaks whose discretized index is k, and N is the maximum value of the index;
the R groups of M/Zs and Intensities in each mass spectrum data are preprocessed by the above preprocessing method to obtain a matrix of M rows and R columns, wherein R represents the number of retention times Rt in each mass spectrum data, M represents the length of the M/Zs after discretization, and M is equal to N.
8. The multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm according to claim 6, wherein
the deep learning model in the multi-cancer classification module comprises a 1-layer one-dimensional CNN module, a 2-layer Transformer module, and a 2-layer feedforward neural network module, wherein the Transformer module comprises positional encoding and a multi-head attention mechanism.
9. The multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm according to claim 8, wherein
the loss function of the deep learning model in the multi-cancer classification module is calculated by the following formula:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c} y_{ij}\,\log \hat{y}_{ij}$$
wherein n is the number of samples, c is the number of categories, $\hat{y}$ is the predicted value, $y$ is the true value, $y_{ij}$ is the value of the j-th class in the true label of sample i, and $\hat{y}_{ij}$ is the probability predicted by the model that sample i belongs to the j-th class.
10. The multi-cancer diagnosis system based on mass spectrum data and a deep learning algorithm according to claim 9, wherein
the deep learning model completes training when the loss function converges or when training reaches 200 epochs.
CN202311720287.0A 2023-12-14 2023-12-14 Multi-cancer diagnosis method and system based on mass spectrum data and deep learning algorithm Pending CN117409961A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311720287.0A CN117409961A (en) 2023-12-14 2023-12-14 Multi-cancer diagnosis method and system based on mass spectrum data and deep learning algorithm

Publications (1)

Publication Number Publication Date
CN117409961A true CN117409961A (en) 2024-01-16

Family

ID=89489446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311720287.0A Pending CN117409961A (en) 2023-12-14 2023-12-14 Multi-cancer diagnosis method and system based on mass spectrum data and deep learning algorithm

Country Status (1)

Country Link
CN (1) CN117409961A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination