CN113793646B - Spectral image unmixing method based on weighted non-negative matrix factorization and application thereof

Info

Publication number
CN113793646B
CN113793646B CN202111150957.0A CN202111150957A CN113793646B CN 113793646 B CN113793646 B CN 113793646B CN 202111150957 A CN202111150957 A CN 202111150957A CN 113793646 B CN113793646 B CN 113793646B
Authority
CN
China
Prior art keywords
spectrum
sample
molecules
standard
sers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111150957.0A
Other languages
Chinese (zh)
Other versions
CN113793646A (en)
Inventor
叶坚
何畅
毕心缘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111150957.0A
Publication of CN113793646A
Priority to PCT/CN2022/122403 (WO2023051661A1)
Application granted
Publication of CN113793646B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60 In silico combinatorial chemistry
    • G16C20/64 Screening of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/90 Programming languages; Computing architectures; Database systems; Data warehousing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Medicinal Chemistry (AREA)
  • Bioethics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Library & Information Science (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Investigating, Analyzing Materials By Fluorescence Or Luminescence (AREA)
  • Investigating Or Analysing Materials By Optical Means (AREA)

Abstract

The application relates to a spectral image unmixing method based on a weighted non-negative matrix factorization (NMF-CLS) algorithm, which comprises the following step: based on a standard spectrum database, the NMF-CLS algorithm is used to unmix the spectrum obtained by testing, so as to obtain the types and relative concentrations of the known molecules contained in the sample to be tested. The NMF-CLS algorithm of the application can obtain the types and relative contents of known molecules in complex samples, and can exclude the influence of molecules not contained in the database on the unmixing result.

Description

Spectral image unmixing method based on weighted non-negative matrix factorization and application thereof
Technical Field
The application relates to the technical field of image processing, in particular to a spectral image unmixing method based on weighted non-negative matrix factorization.
Background
Infrared spectrum and Raman spectrum
Infrared spectroscopy (IR) and Raman spectroscopy are powerful tools for studying molecular structure and chemical composition, and are widely applied in materials, chemical industry, environmental protection, geology and other fields because they are rapid, highly sensitive and require only small sample amounts. From the point of view of analysis and testing, the two complement each other and together provide better information on molecular structure. Both infrared and Raman spectra are molecular vibration spectra, but they differ substantially: the infrared spectrum is an absorption spectrum, whereas the Raman spectrum is a scattering spectrum.
Infrared spectroscopy: when electromagnetic radiation interacts with the molecules of a substance and its energy equals the difference between two vibrational or rotational energy levels of the molecule, the molecule transitions from the lower energy level to the higher one, so that radiation at certain specific wavelengths is absorbed by the substance. Measuring the radiation intensity at different wavelengths yields the infrared absorption spectrum. Because the absorption of infrared radiation causes transitions between vibrational and rotational energy levels, the infrared spectrum is also called the molecular vibration-rotation spectrum.
Raman spectroscopy: when light irradiates a substance, photons collide with the electrons in its molecules. If the collision is inelastic, part of the energy is transferred to the electrons and the frequency of the scattered light differs from that of the incident light. This scattering is called Raman scattering, and the resulting spectrum is called the Raman spectrum.
Raman spectroscopy and infrared spectroscopy are among the most important methods of analytical chemistry, and can provide key structural information, such as the chemical bonds of the system under test. However, their application to the surface chemistry analysis of materials and biological systems often faces the bottleneck of low sensitivity.
Surface-enhanced infrared absorption spectroscopy (SEIRAS)
When molecules adsorb on the surface of rough metal particles, the infrared absorption signal is enhanced by a factor of 10 to 1000; this phenomenon is called the surface-enhanced infrared absorption (SEIRA) effect. Surface-enhanced infrared spectroscopy based on this effect has high surface sensitivity and can detect changes in infrared absorption within the $10^{6}$ order of magnitude; its surface selection rule is simple, which makes it convenient to determine molecular adsorption orientation; and it is not limited by mass-transfer resistance, so it has great value in analytical applications.
Surface-enhanced Raman spectroscopy (SERS)
Surface-enhanced Raman spectroscopy exploits the enhancement of Raman scattering intensity caused by the plasmon resonance interaction between molecules adsorbed on the surface of a metal nanostructure and the metal surface, and is a very effective Raman signal detection technique. By adsorbing the molecules to be detected on the surface of a rough nano-metal material, it can enhance the Raman signal of the analyte by a factor of $10^{6}$ to $10^{15}$. This overcomes the low sensitivity of ordinary Raman spectroscopy, and the detection sensitivity can reach the single-molecule level, which has promoted the application of SERS in food safety, environmental protection, medical detection and other fields.
Surface-enhanced Raman spectroscopy includes targeted SERS and broad-spectrum SERS. Targeted SERS relies on specific binding, for example antibodies modified on the surface of SERS particles that specifically capture antigens in a sample, to detect the content (concentration) of a single molecule or a few molecules; it cannot achieve broad-spectrum detection, and the metabolic information obtained is extremely limited. Broad-spectrum SERS does not depend on specific binding and achieves broad-spectrum detection in biological samples. Its analysis mostly uses principal component analysis, machine learning and similar methods to classify two types of samples directly, but specific metabolite information (types, contents and so on) cannot be obtained.
Spectral analysis
Raman and infrared spectra are widely used because each substance has its own characteristic spectrum. However, most measured Raman or infrared spectra are mixtures of the spectra of different substances, so the spectral images must be unmixed in order to analyze each component accurately.
The basic idea of classical least squares (Classical least squares, CLS) is to consider the spectrum of the mixed component (e.g. raman spectrum) approximately as a linear addition of a series of pure component spectra, the purpose of the algorithm being to find the coefficients of each pure component spectrum such that the sum of the squares of the errors of the reconstructed spectrum and the original spectrum is minimized by the linear addition.
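The classical least squares idea can be illustrated with a minimal Python sketch on synthetic data. The function name `cls_unmix` and all array values are illustrative assumptions, and NumPy's general least-squares solver is used here as a stand-in; note that plain least squares does not by itself force the coefficients to be non-negative.

```python
import numpy as np

def cls_unmix(W, V):
    """Return the coefficients H that minimise ||V - W @ H||_F^2.

    W: (m, r) pure-component spectra arranged in columns (assumed known).
    V: (m, n) mixed spectra arranged in columns.
    """
    H, *_ = np.linalg.lstsq(W, V, rcond=None)
    return H

# Synthetic example: 3 pure components, 4 mixed spectra, 50 points each.
rng = np.random.default_rng(0)
W = rng.random((50, 3))
H_true = rng.random((3, 4))
V = W @ H_true            # noise-free linear mixture

H_est = cls_unmix(W, V)   # recovers H_true up to rounding error
```

With noise-free synthetic mixtures and a full-rank `W`, the recovered coefficients match the ones used to build `V`; with real spectra the fit is only as good as the completeness of the pure-component set.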
Non-negative matrix factorization (NMF) likewise treats a mixed-component spectrum (such as a Raman spectrum) as a linear sum of multiple component spectra, but it computes the component spectra and their corresponding concentrations iteratively. The algorithm decomposes a matrix of multiple mixed spectra arranged in columns into the product of two non-negative matrices: ideally, one holds the spectra of the individual components arranged in columns, and the other holds the corresponding relative concentration of each component in each spectrum. Let the spectrum matrix be an $m \times n$ matrix $V$, representing $n$ spectra in total, each consisting of $m$ points. The matrix of pure-component spectra is an $m \times r$ matrix $W$, representing $r$ components in total. The matrix of relative concentrations is an $r \times n$ matrix $H$, each column of which gives the relative concentrations of the components in one spectrum. The goal of each separation algorithm is to make $V \approx WH$.
For classical least squares, the objective optimization function is set to
where $W$ is known, being the matrix of Raman spectra of the pure components, and the aim is to find the $H$ that minimizes $F$. The partial derivative of $F$ with respect to $H$ is then
For non-negative matrix factorization, the objective optimization function is the same as for classical least squares, equation (1), but both $W$ and $H$ are unknown, so the parameters of both matrices must be calculated by iteration. The aim is to obtain a pair $W$ and $H$ simultaneously that minimizes $F$. The partial derivatives of $F$ with respect to $W$ and $H$ are
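The formulas themselves are not reproduced in this text. As a hedged reconstruction, assume that equation (1) is the unnormalized sum of squared errors described above (the original may differ by a constant factor such as 1/2, which would scale the derivatives accordingly). Under that assumption the derivatives work out as follows:

```latex
% Assumption: equation (1) is the unnormalised squared Frobenius norm.
F = \lVert V - WH \rVert_F^2
  = \sum_{i,j} \bigl( V_{ij} - (WH)_{ij} \bigr)^2

% Classical least squares: W is fixed, differentiate with respect to H.
\frac{\partial F}{\partial H} = -2\, W^{\top} (V - WH)

% Setting this derivative to zero gives the closed form
% (valid when W has full column rank):
H = (W^{\top} W)^{-1} W^{\top} V

% NMF: both W and H are unknown.
\frac{\partial F}{\partial W} = -2\, (V - WH)\, H^{\top}, \qquad
\frac{\partial F}{\partial H} = -2\, W^{\top} (V - WH)
```

For NMF there is no closed form because $W$ and $H$ appear as a product, which is why the two matrices are updated alternately by iteration.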
Classical least squares produces relatively accurate concentration coefficients, but it requires the spectrum of every pure component in order to ensure a good fit, and for biological samples containing at least several hundred components it is difficult to provide a spectrum for each component. The advantage of the non-negative matrix factorization algorithm is that it does not need the Raman spectra of the pure components, but the spectra it calculates often do not match the spectra of the actual components, so the relative concentration of a target component cannot be found accurately. At present, the prior art offers no method that can analyze a Raman or infrared spectrum both effectively and accurately.
Metabolomics
Metabolomics is a new discipline following genomics and proteomics, and is an important component of system biology, mainly examining the dynamic changes of all small molecule metabolites and their contents before and after the biological system is stimulated or disturbed. By carrying out overall qualitative and quantitative analysis on all small molecule metabolites in organisms, the relationship between the metabolites and the physiological and pathological changes can be explored and discovered. Research shows that the metabolome has important application value in the fields of early diagnosis of diseases, biomarker discovery, drug screening, toxicity evaluation, sports medicine, nutrition and the like.
Nuclear magnetic resonance (NMR) spectroscopy is widely applied in metabolomics. Its notable advantages are that it can observe multiple metabolites at once, is highly reproducible and non-destructive, and has a short measurement time. However, low sensitivity has always been an inherent disadvantage and the primary challenge for nuclear magnetic resonance in metabolomics research.
Mass spectrometry has the advantages of high sensitivity and strong specificity, is widely applied to the detection of metabolic components, and can analyze them qualitatively and quantitatively after separation and ionization. However, its use has been limited because mass spectrometry cannot directly detect biological solutions or tissues.
Liquid chromatography-mass spectrometry (LC-MS) is also used for the study of metabolome. In recent years, LC-MS technology has been further improved, and the use of large-scale sample detection is increasing. With the increase of the number of samples to be detected, a series of problems are caused, for example, the detection time of large-scale samples is long, and the conditions of sensitivity reduction, retention time drift and the like occur in the long-time running process of the machine.
The Raman spectrum (Raman spectroscopy) can detect the structure of a compound and the tiny change of the compound based on vibration spectroscopy, has the advantages of no damage to a sample, simple sample pretreatment, high spatial resolution and the like, and is applied to the fields of clinical pathology research, classification and detection of microorganisms, analysis of the compound and the like.
Disclosure of Invention
In view of the technical problems in the prior art, one object of the application is to establish a new spectral image unmixing algorithm, namely the weighted non-negative matrix factorization (NMF-CLS) algorithm, by integrating the classical least squares method with the non-negative matrix factorization method, so as to obtain the relative concentration of a specific component while also achieving a good fit.
A further object of the application is to provide the use of the weighted non-negative matrix factorization algorithm in metabolomics.
In one aspect, the present application provides a spectral image unmixing method based on a weighted non-negative matrix factorization (NMF-CLS) algorithm, comprising: based on a standard spectrum database, unmixing a spectrum obtained by testing by adopting an NMF-CLS algorithm to obtain the types and the relative concentrations of known molecules contained in a sample to be tested; the known molecules are molecules contained in a standard spectrum database, and the standard spectrum database consists of standard spectrums of different molecules;
Wherein the objective function of the NMF-CLS algorithm is set as follows:
wherein the spectrum matrix is an $m \times n$ matrix $V$, representing $n$ spectra in total, each consisting of $m$ points; the $m \times r_1$ matrix $W^{(1)}$ holds the reference spectra of the known molecules arranged in columns, and the $m \times r_2$ matrix $W^{(2)}$ holds the spectra of the unknown molecules arranged in columns; the $r_1 \times n$ matrix $H^{(1)}$ and the $r_2 \times n$ matrix $H^{(2)}$ hold the relative concentrations corresponding to $W^{(1)}$ and $W^{(2)}$, respectively; $r_1$ and $r_2$ are the numbers of known and unknown molecular species, respectively; $\alpha$ is the weight set for the known molecules, with $\alpha \ge 0$. Because the reference spectra $W^{(1)}$ of the known molecules are known, the task is to find the $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ that minimize $F$ in the equation, which yields the relative concentrations of the known molecules.
In certain embodiments, $H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ are calculated by an iterative process.
In certain embodiments, the partial derivatives of $F$ with respect to $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ are:
and the iterative formulas for $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ derived from the partial derivatives are:
$H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ are updated iteratively according to the iterative formulas. Iteration stops when the maximum number of iterations $N$ is reached or $F$ falls to a set threshold $\sigma$; after iteration stops, $H^{(1)}$ is the final result for the relative concentrations of the known components.
In certain embodiments, wherein the maximum number of iterations N is not less than about 20, about 25, about 30, about 40, about 50, about 100, or about 200.
In certain embodiments, wherein the threshold σ does not exceed about 0.01, or about 0.001, or about 0.0001, or about 0.00001, or about 0.000001.
In certain embodiments, the calculation process of $H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ comprises:
1) Inputting the matrix $W^{(1)}$ of the known components, the measured spectrum matrix $V$, the maximum number of iterations $N$ and the threshold $\sigma$;
2) Randomly initializing the coefficient matrix $H^{(1)}$ of the known components, and the spectrum matrix $W^{(2)}$ and coefficient matrix $H^{(2)}$ of the unknown components;
3) Iteratively updating $H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ according to the iterative formulas;
4) Stopping iteration when the maximum number of iterations $N$ is reached or $F$ falls to the set threshold $\sigma$;
5) After iteration stops, taking $H^{(1)}$ as the final result for the relative concentrations of the known components.
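The five steps can be sketched in Python as follows. The objective function and the iterative formulas are not reproduced in this text, so the sketch rests on two loudly stated assumptions: (a) the objective has the form $F = \lVert V - W^{(1)}H^{(1)} - W^{(2)}H^{(2)} \rVert_F^2 + \alpha \lVert V - W^{(1)}H^{(1)} \rVert_F^2$, chosen only because it is consistent with the surrounding description (it becomes an NMF with fixed $W^{(1)}$ when $\alpha = 0$, and a least-squares fit when the unknown part is absent); and (b) standard multiplicative updates are used in place of the original iterative formulas. The names `nmf_cls`, `r2`, `eps` and `seed` are hypothetical.

```python
import numpy as np

def nmf_cls(V, W1, r2, alpha=0.0, max_iter=200, sigma=1e-6, seed=0):
    """Sketch of weighted NMF with a fixed known-spectra block W1.

    ASSUMED objective (not taken from the source):
        F = ||V - W1 H1 - W2 H2||_F^2 + alpha * ||V - W1 H1||_F^2
    V: (m, n) measured spectra, W1: (m, r1) known reference spectra,
    r2: number of unknown components. All inputs must be non-negative.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12                       # guards against division by zero
    m, n = V.shape
    r1 = W1.shape[1]
    # Step 2: random non-negative initialisation.
    H1 = rng.random((r1, n))
    W2 = rng.random((m, r2))
    H2 = rng.random((r2, n))

    def objective():
        full = V - W1 @ H1 - W2 @ H2
        known = V - W1 @ H1
        return float(np.sum(full**2) + alpha * np.sum(known**2))

    history = [objective()]
    for _ in range(max_iter):         # Step 3: multiplicative updates
        R = W1 @ H1 + W2 @ H2
        H1 *= ((1 + alpha) * (W1.T @ V)) / (
            W1.T @ R + alpha * (W1.T @ W1 @ H1) + eps)
        R = W1 @ H1 + W2 @ H2
        W2 *= (V @ H2.T) / (R @ H2.T + eps)
        R = W1 @ H1 + W2 @ H2
        H2 *= (W2.T @ V) / (W2.T @ R + eps)
        history.append(objective())
        if history[-1] <= sigma:      # Step 4: threshold reached
            break
    return H1, W2, H2, history        # Step 5: H1 is the result of interest

# Synthetic example: 2 known and 3 unknown components, 5 spectra.
rng = np.random.default_rng(1)
W1 = rng.random((60, 2))
V = W1 @ rng.random((2, 5)) + rng.random((60, 3)) @ rng.random((3, 5))
H1, W2, H2, history = nmf_cls(V, W1, r2=3, alpha=0.5)
```

Multiplicative updates keep every entry non-negative as long as the inputs and the random initialization are non-negative, which is why no explicit clipping step appears. Because the factorization is not unique, the recovered `H1` should be read as relative, not absolute, concentrations.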
In some embodiments, the detection conditions of the standard spectrum are the same as the detection conditions of the sample to be tested.
In some embodiments, the weight $\alpha$ is inversely related to the ratio of the number of known molecules ($r_1$) to the number of unknown molecules ($r_2$) contained in the sample to be tested.
In certain embodiments, when $W^{(2)}$ and $H^{(2)}$ do not exist, the weight $\alpha$ is set to 0 and the objective function of the NMF-CLS algorithm is set to the classical least squares form:
In some embodiments, when the ratio of the number of known molecules ($r_1$) to the number of unknown molecules ($r_2$) contained in the sample to be tested is not less than 1, the weight $\alpha$ is set to 0, and the objective function of the NMF-CLS algorithm is set to:
In some embodiments, when the ratio of the number of known molecules ($r_1$) to the number of unknown molecules ($r_2$) contained in the sample to be tested is less than 1, the weight $\alpha$ is not 0, and the objective function of the NMF-CLS algorithm is set to:
In some embodiments, the method for setting the weight $\alpha$ comprises:
1) Determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested;
2) Preparing a plurality of simple samples, each containing a few known molecules at different concentration gradients and a certain number of unknown molecules, wherein the ratio of the number of known molecules to the number of unknown molecules in the simple samples equals that of the sample to be tested;
3) Setting different weights $\alpha$, unmixing the measured spectra of the simple samples with the NMF-CLS algorithm to obtain the coefficients corresponding to the known molecules, establishing a regression equation between the coefficients and the actual concentrations of the known molecules, and calculating the R-squared value; the $\alpha$ with the highest R-squared value is taken as the optimal weight for the sample to be tested.
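Step 3 amounts to a one-dimensional search over candidate weights, scored by the R-squared of a linear regression. The Python sketch below illustrates only that selection logic. The names `r_squared`, `select_alpha` and `toy_unmix` are hypothetical, and `toy_unmix` is a synthetic stand-in that fabricates coefficients purely for demonstration; in practice it would be replaced by an actual unmixing routine applied to the measured spectra of the simple samples.

```python
import numpy as np

def r_squared(coeffs, concentrations):
    """R-squared of a straight-line fit of coefficients against concentrations."""
    slope, intercept = np.polyfit(concentrations, coeffs, 1)
    predicted = slope * concentrations + intercept
    ss_res = np.sum((coeffs - predicted) ** 2)
    ss_tot = np.sum((coeffs - np.mean(coeffs)) ** 2)
    return 1.0 - ss_res / ss_tot

def select_alpha(unmix, V, concentrations, alphas):
    """Try each candidate weight and keep the one with the highest R-squared.

    unmix(V, alpha) must return one coefficient per simple sample for the
    known molecule of interest (a hypothetical interface).
    """
    scores = {a: r_squared(unmix(V, a), concentrations) for a in alphas}
    best = max(scores, key=scores.get)
    return best, scores

# Synthetic stand-in: coefficients track concentration best at alpha = 1.
conc = np.array([0.1, 0.5, 1.0, 5.0, 10.0])      # illustrative concentrations
noise = np.array([0.3, -0.2, 0.4, -0.5, 0.1])    # illustrative disturbance

def toy_unmix(V, alpha):
    return 2.0 * conc + abs(alpha - 1.0) * noise

best, scores = select_alpha(toy_unmix, np.zeros((1, 5)), conc,
                            [0.0, 0.5, 1.0, 2.0])
```

Passing the unmixing routine in as a callable keeps the selection logic independent of how the unmixing itself is implemented.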
In certain embodiments, the method for determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested comprises principal component analysis.
In certain embodiments, wherein the ratio of the number of molecules in the simple sample to the number of molecules in the sample to be tested is no more than about 1/2, or about 1/5, or about 1/10.
In certain embodiments, wherein the number of known molecules in the simple sample ranges from 1 to 100, or 1 to 50, or 1 to 20, or 1 to 10.
In certain embodiments, wherein the number of known molecules in the simple sample ranges from 2 to 100, or 2 to 50, or 2 to 20, or 2 to 10.
In some embodiments, establishing the standard spectrum database comprises: collecting a plurality of spectral images of a given molecule and calculating the average spectrum of that molecule; obtaining the average spectra of other molecules in the same way; and incorporating the average spectra into the standard spectrum database.
In certain embodiments, when a spectral image of a molecule is acquired, the concentration of the molecule is between 0.1mM and 10mM.
In certain embodiments, the number of spectral images in which a molecule is acquired is not less than about 10, about 20, about 50, about 100, or about 200.
In certain embodiments, normalizing the intensity of the average spectrum of the obtained molecules is also included.
In some embodiments, establishing the standard spectrum database comprises: collecting a plurality of spectral images of a given molecule, calculating the average spectrum of that molecule and normalizing its intensity, the resulting spectrum being the standard spectrum of the molecule; obtaining the standard spectra of other molecules in the same way; and incorporating the standard spectra into the standard spectrum database.
In some embodiments, establishing the standard spectrum database comprises: collecting a plurality of spectral images of a given molecule, averaging them, and normalizing the intensity of the averaged spectrum to the range [0, 1], the resulting spectrum being the standard spectrum of the molecule; obtaining the standard spectra of other molecules in the same way; and incorporating the standard spectra into the standard spectrum database.
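The averaging and normalization just described can be sketched in Python. The text states only that intensities are normalized to [0, 1]; min-max scaling is assumed here as one way to do that (dividing by the maximum would be another). The names `standard_spectrum` and `build_database` and the numeric values are illustrative.

```python
import numpy as np

def standard_spectrum(spectra):
    """Average k repeated spectra of one molecule and scale to [0, 1].

    spectra: array-like of shape (k, m), one measured spectrum per row.
    Min-max scaling is an assumption about how the normalisation is done.
    """
    mean = np.asarray(spectra, dtype=float).mean(axis=0)
    lo, hi = mean.min(), mean.max()
    if hi == lo:
        raise ValueError("a flat spectrum cannot be scaled to [0, 1]")
    return (mean - lo) / (hi - lo)

def build_database(measurements):
    """Stack the standard spectra of all molecules as columns.

    measurements: dict mapping molecule name -> (k, m) array of spectra.
    Returns the sorted names and an (m, r1) matrix of standard spectra.
    """
    names = sorted(measurements)
    W1 = np.column_stack([standard_spectrum(measurements[n]) for n in names])
    return names, W1

# Illustrative values: two repeated 3-point spectra of one molecule.
s = standard_spectrum([[1.0, 3.0, 5.0], [3.0, 5.0, 7.0]])
names, W1 = build_database({
    "molecule_b": [[1.0, 3.0, 5.0], [3.0, 5.0, 7.0]],
    "molecule_a": [[0.0, 2.0, 1.0]],
})
```

Arranging the standard spectra as columns produces a matrix with the same layout as the reference-spectrum matrix of the known molecules used by the unmixing step.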
In some embodiments, the method includes collecting a plurality of spectra of the sample to be tested, analyzing each spectrum separately with the algorithm to obtain the coefficients of the different known components, and then processing these coefficients to obtain the relative concentrations of the different known molecules in the sample.
In certain embodiments, the processing comprises averaging, summing, ANOVA, and/or Student's t-test.
In certain embodiments, the number of spectra collected from the sample to be tested should be sufficient to ensure that substantially all of the molecular information in the sample has been collected.
In certain embodiments, methods for determining the number of spectra at which substantially complete molecular information has been collected include, but are not limited to, comparison by Pearson coefficients.
In some embodiments, obtaining the Pearson coefficient comprises: taking the mean of the spectra of M samples to be tested as the standard spectrum; drawing n spectra each time, averaging them, and calculating the Pearson coefficient between this average and the standard spectrum; repeating the operation N times; and averaging the N Pearson coefficients to obtain the correlation coefficient corresponding to n spectra.
In certain embodiments, wherein M is about 50 to 500, or about 100 to 400, or about 200 to 300.
In certain embodiments, wherein n is about 3 to 30, or about 4 to 20, or about 5 to 10.
In certain embodiments, wherein the Pearson coefficient is not less than about 0.8, about 0.85, about 0.9, or about 0.95.
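The Pearson-based check can be sketched in Python on synthetic spectra. The function name `subset_correlation`, the random seed, and the choice to draw spectra without replacement are assumptions; the text does not say how the n spectra are drawn.

```python
import numpy as np

def subset_correlation(spectra, n, repeats, seed=0):
    """Mean Pearson coefficient between n-spectrum averages and the full mean.

    spectra: (M, m) array, one spectrum per row.
    n: number of spectra drawn each time (without replacement, an assumption).
    repeats: how many times the draw is repeated before averaging.
    """
    rng = np.random.default_rng(seed)
    spectra = np.asarray(spectra, dtype=float)
    reference = spectra.mean(axis=0)           # mean of all M spectra
    coeffs = []
    for _ in range(repeats):
        idx = rng.choice(len(spectra), size=n, replace=False)
        subset_mean = spectra[idx].mean(axis=0)
        coeffs.append(np.corrcoef(subset_mean, reference)[0, 1])
    return float(np.mean(coeffs))

# Synthetic data: M = 100 noisy copies of one underlying 200-point spectrum.
rng = np.random.default_rng(1)
base = np.sin(np.linspace(0.0, 6.0, 200))
spectra = base + 0.5 * rng.standard_normal((100, 200))

c5 = subset_correlation(spectra, n=5, repeats=20)
c50 = subset_correlation(spectra, n=50, repeats=20)
```

Evaluating the function for increasing n and picking the smallest n whose coefficient exceeds the chosen cutoff gives an estimate of how many spectra are enough for the averaged spectrum to represent the sample.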
In certain embodiments, wherein the sample to be tested comprises a chemical sample or a biological sample.
In certain embodiments, wherein the sample to be tested comprises a liquid sample.
In certain embodiments, the spectral image comprises an infrared spectrum and a Raman spectrum.
In certain embodiments, the infrared spectrum comprises a surface-enhanced infrared spectrum.
In certain embodiments, the Raman spectrum comprises a surface-enhanced Raman spectrum.
In certain embodiments, the surface-enhanced Raman spectrum is a broad-spectrum surface-enhanced Raman spectrum.
In another aspect, the application provides an analysis method based on surface-enhanced Raman spectroscopy, comprising the following step: based on a surface-enhanced Raman spectroscopy (SERS) standard spectrum database, a weighted non-negative matrix factorization (NMF-CLS) algorithm is used to unmix the spectrum obtained by testing, so as to obtain the types and relative concentrations of the known molecules contained in the sample to be tested; the known molecules are molecules contained in the SERS standard spectrum database, and the SERS standard spectrum database consists of the SERS standard spectra of different molecules;
Wherein the objective function of the NMF-CLS algorithm is set as follows:
wherein the spectrum matrix is an $m \times n$ matrix $V$, representing $n$ spectra in total, each consisting of $m$ points; the $m \times r_1$ matrix $W^{(1)}$ holds the reference spectra of the known molecules arranged in columns, and the $m \times r_2$ matrix $W^{(2)}$ holds the spectra of the unknown molecules arranged in columns; the $r_1 \times n$ matrix $H^{(1)}$ and the $r_2 \times n$ matrix $H^{(2)}$ hold the relative concentrations corresponding to $W^{(1)}$ and $W^{(2)}$, respectively; $r_1$ and $r_2$ are the numbers of known and unknown molecular species, respectively; $\alpha$ is the weight set for the known molecules, with $\alpha \ge 0$. Because the reference spectra $W^{(1)}$ of the known molecules are known, the task is to find the $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ that minimize $F$ in the equation, which yields the relative concentrations of the known molecules.
In certain embodiments, $H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ are calculated by an iterative process.
In certain embodiments, the partial derivatives of $F$ with respect to $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ are:
and the iterative formulas for $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ derived from the partial derivatives are:
$H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ are updated iteratively according to the iterative formulas. Iteration stops when the maximum number of iterations $N$ is reached or $F$ falls to a set threshold $\sigma$; after iteration stops, $H^{(1)}$ is the final result for the relative concentrations of the known components.
In certain embodiments, wherein the maximum number of iterations N is not less than about 20, about 25, about 30, about 40, about 50, about 100, or about 200.
In certain embodiments, wherein the threshold σ does not exceed about 0.01, or about 0.001, or about 0.0001, or about 0.00001, or about 0.000001.
In certain embodiments, the calculation process of $H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ comprises:
1) Inputting the matrix $W^{(1)}$ of the known components, the measured spectrum matrix $V$, the maximum number of iterations $N$ and the threshold $\sigma$;
2) Randomly initializing the coefficient matrix $H^{(1)}$ of the known components, and the spectrum matrix $W^{(2)}$ and coefficient matrix $H^{(2)}$ of the unknown components;
3) Iteratively updating $H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ according to the iterative formulas;
4) Stopping iteration when the maximum number of iterations $N$ is reached or $F$ falls to the set threshold $\sigma$;
5) After iteration stops, taking $H^{(1)}$ as the final result for the relative concentrations of the known components.
In some embodiments, the detection conditions of the standard spectrum are the same as the detection conditions of the sample to be tested.
In certain embodiments, wherein SERS employs non-targeted broad-spectrum detection.
In some embodiments, the weight $\alpha$ is inversely related to the ratio of the number of known molecules ($r_1$) to the number of unknown molecules ($r_2$) contained in the sample to be tested.
In some embodiments, the method for setting the weight α includes:
1) Determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested;
2) Preparing a plurality of simple samples, each containing a small number of known molecules at different concentration gradients and a certain number of unknown molecules, wherein the ratio of the number of known molecules to the number of unknown molecules in the simple samples is equal to that in the sample to be tested;
3) Setting different weights α, unmixing the measured spectra of the simple samples with the NMF-CLS algorithm to obtain the coefficients corresponding to the known molecules, establishing a regression equation between the coefficients and the concentrations of the known molecules, and calculating the R² value; the α giving the highest R² is taken as the optimal weight for the sample to be tested.
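The weight-setting procedure above amounts to a search over candidate values of α scored by R². The following sketch shows one way this could be organized. The names `select_alpha` and `r_squared` are hypothetical, and `unmix` is a hypothetical callable standing in for the NMF-CLS unmixing step; it is assumed to return one coefficient of the known molecule per simple sample.

```python
import numpy as np

def r_squared(x, y):
    """R² of the least-squares straight line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def select_alpha(alphas, unmix, spectra, concentrations):
    """Return the candidate weight with the highest R², plus all scores.

    unmix(spectra, alpha) is a hypothetical stand-in for NMF-CLS unmixing;
    it must return one coefficient per simple sample for the known molecule.
    """
    concentrations = np.asarray(concentrations, dtype=float)
    scores = {}
    for alpha in alphas:
        coeffs = np.asarray(unmix(spectra, alpha), dtype=float)
        # Regression between unmixed coefficients and known concentrations
        scores[alpha] = r_squared(concentrations, coeffs)
    best = max(scores, key=scores.get)
    return best, scores
```

For a known molecule prepared at several concentrations, a call such as `select_alpha([0.1, 0.5, 0.9], unmix, spectra, concentrations)` would return the candidate whose coefficients are most nearly linear in concentration.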
In certain embodiments, the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested is determined by a method comprising principal component analysis.
In certain embodiments, the ratio of the number of molecules in the simple sample to the number of molecules in the sample to be tested is no more than about 1/2, or about 1/5, or about 1/10.
In certain embodiments, the number of known molecules in the simple sample ranges from 1 to 100, or 1 to 50, or 1 to 20, or 1 to 10.
In certain embodiments, the number of known molecules in the simple sample ranges from 2 to 100, or 2 to 50, or 2 to 20, or 2 to 10.
In some embodiments, establishing the standard spectrum database comprises: collecting a plurality of SERS spectral images of a given molecule and calculating the SERS average spectrum of that molecule; obtaining the SERS average spectra of other molecules in the same way; and integrating the average spectra to obtain the SERS standard spectrum database.
In certain embodiments, when a SERS spectral image of a molecule is acquired, the concentration of the molecule is between 0.1 mM and 10 mM.
In certain embodiments, the number of SERS spectral images acquired for a molecule is not less than about 10, about 20, about 50, about 100, or about 200.
In certain embodiments, the method further comprises normalizing the intensity of the SERS average spectrum obtained for the molecule.
In some embodiments, establishing the standard spectrum database comprises: collecting a plurality of SERS spectral images of a given molecule, calculating the SERS average spectrum of that molecule and normalizing its intensity, the resulting spectrum being the SERS standard spectrum of the molecule; obtaining the SERS standard spectra of other molecules in the same way; and integrating the standard spectra to obtain the SERS standard spectrum database.
In certain embodiments, establishing the SERS standard spectrum database comprises: collecting a plurality of SERS spectral images of a given molecule, averaging them, and normalizing the intensity of the average spectrum to the range [0,1], the resulting spectrum being the SERS standard spectrum of the molecule; obtaining the SERS standard spectra of other molecules in the same way; and integrating the standard spectra to obtain the SERS standard spectrum database.
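The averaging and normalization described above can be sketched as follows. The text specifies only that the average spectrum is normalized to the range [0,1]; min-max scaling is assumed here as one way of doing so, and the function names and the dictionary layout of the database are illustrative assumptions.

```python
import numpy as np

def standard_spectrum(spectra):
    """Average replicate spectra of one molecule and scale the result to [0, 1].

    spectra : array-like of shape (n_replicates, n_channels)
    """
    mean = np.mean(np.asarray(spectra, dtype=float), axis=0)
    lowest, highest = mean.min(), mean.max()
    if highest == lowest:
        raise ValueError("a flat spectrum cannot be normalized")
    # Min-max scaling (an assumption; the text only requires the range [0, 1])
    return (mean - lowest) / (highest - lowest)

def build_database(replicates_by_molecule):
    """Map each molecule name to its standard spectrum."""
    return {
        name: standard_spectrum(spectra)
        for name, spectra in replicates_by_molecule.items()
    }
```

Stacking the resulting spectra column by column would give a matrix of known component spectra of the kind used as W^(1) in the unmixing step.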
In some embodiments, the method comprises collecting a plurality of SERS spectra of the sample to be tested, performing the algorithmic analysis on each SERS spectrum to obtain the coefficients H^(1) of the known components, and then processing the coefficients to obtain the relative concentrations of the known molecules in the sample.
In certain embodiments, the processing comprises averaging, summation, ANOVA, and/or Student's t-test.
In certain embodiments, the number of spectra collected from the sample to be tested should be sufficient to ensure that substantially all of the molecular information in the sample has been collected.
In certain embodiments, methods for determining the number of spectra needed to collect substantially all of the molecular information in the sample to be tested include, but are not limited to, comparison by Pearson coefficients.
In some embodiments, obtaining the Pearson coefficient comprises: taking the average of M spectra of the sample to be tested as the standard spectrum; drawing N spectra each time, averaging them, and calculating the Pearson coefficient between this average and the standard spectrum; repeating the operation n times; and averaging the n Pearson coefficients to obtain the correlation coefficient corresponding to N spectra.
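A minimal sketch of this Pearson-based check is given below. It assumes that the spectra are drawn at random without replacement, which the text does not state; the function name and the fixed random seed are likewise illustrative assumptions.

```python
import numpy as np

def pearson_for_subset_size(spectra, n_draw, n_repeat=5, seed=0):
    """Mean Pearson coefficient between subset averages and the full average.

    spectra  : array-like of shape (M, n_channels), all spectra of one sample
    n_draw   : number of spectra drawn each time
    n_repeat : number of repetitions whose coefficients are averaged
    """
    spectra = np.asarray(spectra, dtype=float)
    rng = np.random.default_rng(seed)
    reference = spectra.mean(axis=0)  # average of all M spectra
    coefficients = []
    for _ in range(n_repeat):
        # Random draw without replacement (an assumption)
        chosen = rng.choice(spectra.shape[0], size=n_draw, replace=False)
        subset_mean = spectra[chosen].mean(axis=0)
        coefficients.append(np.corrcoef(subset_mean, reference)[0, 1])
    return float(np.mean(coefficients))
```

Evaluating the function for increasing values of `n_draw` and taking the smallest value at which the result reaches the chosen level gives an estimate of how many spectra are needed.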
In certain embodiments, M is about 50 to 500, or about 100 to 400, or about 200 to 300.
In certain embodiments, n is about 3 to 30, or about 4 to 20, or about 5 to 10.
In certain embodiments, the Pearson coefficient is not less than about 0.8, about 0.85, about 0.9, or about 0.95.
In certain embodiments, the number of spectra scanned for the sample to be tested is not less than about 20, or about 30, or about 40, or about 50.
In certain embodiments, the number of spectra scanned for the sample to be tested is about 20 to 200, or about 30 to 160, or about 40 to 120, or about 50 to 80.
In certain embodiments, the SERS spectra of the sample to be tested are scanned at a speed of about 1 to 5 s per spectrum.
In certain embodiments, the sample to be tested comprises a chemical sample or a biological sample.
In certain embodiments, the sample to be tested comprises a liquid sample.
In certain embodiments, the biological sample comprises a cell culture fluid, a cell supernatant, a cell lysate, blood, a blood-derived product, lymph, urine, tears, saliva, cerebrospinal fluid, stool, synovial fluid, sputum, cells, organs or tissues.
In certain embodiments, the molecules in the SERS standard spectrum database comprise metabolites.
In certain embodiments, the molecules in the SERS standard spectrum database comprise small molecule metabolites.
In another aspect, the present application provides a metabonomics data processing method, comprising: unmixing the spectral data of biological samples of the same type using a weighted non-negative matrix factorization algorithm to obtain the types of known molecules in the samples and the intervals of their relative concentrations, thereby obtaining a characteristic spectrum database of the biological samples.
In certain embodiments, the metabonomics data processing method further comprises: obtaining, in the same way, the types and relative concentration intervals of known molecules in other types of biological samples, and incorporating them into the characteristic spectrum database to obtain a characteristic spectrum database containing different types of biological samples.
A metabonomic analysis method, the method comprising: based on a standard spectrum database, unmixing the measured spectrum of a sample to be tested using the NMF-CLS algorithm to obtain the types and relative concentrations of the metabolites contained in the sample to be tested, wherein the metabolites are molecules contained in the standard spectrum database.
In certain embodiments, the spectrum of the sample to be tested is a broad-spectrum SERS spectrum.
In certain embodiments, the metabolite is a molecule contained in the SERS standard spectrum database.
In certain embodiments, the method further comprises performing a relevant biomedical analysis based on the types of metabolites obtained and their relative concentrations.
In certain embodiments, the biomedical analysis comprises analyzing differential metabolite data.
In certain embodiments, the biomedical analysis comprises comparing the metabolite types and relative concentrations of the sample to be tested with a characteristic spectrum database.
In certain embodiments, the biomedical analysis further comprises classifying or staging the sample.
In certain embodiments, the spectral data comprise Raman spectral data and infrared spectral data.
In certain embodiments, the infrared spectral data comprise surface-enhanced infrared spectral data.
In certain embodiments, the Raman spectral data comprise surface-enhanced Raman spectral data.
In certain embodiments, the standard spectrum database comprises a Raman standard spectrum database and an infrared standard spectrum database.
In certain embodiments, the Raman standard spectrum database comprises a SERS standard spectrum database.
In another aspect, the application provides a method of determining a biomarker, comprising:
1) Obtaining spectral data of the sample group samples and the control group samples respectively, and, based on a standard spectrum database, unmixing the measured spectra using a weighted non-negative matrix factorization algorithm to obtain the types and relative concentrations of the known molecules contained in the sample group samples and the control group samples respectively, wherein the known molecules are molecules contained in the standard spectrum database;
2) Screening for differential molecules as biomarkers.
In certain embodiments, the differential molecule comprises a differential metabolite.
In certain embodiments, step 2) comprises cross-selecting a plurality of differential metabolites by ANOVA (ANOVA test) and logistic regression.
In certain embodiments, the ANOVA comprises performing a statistical analysis on the different categories of data to find metabolites that differ statistically between the categories.
In certain embodiments, the logistic regression comprises classifying with the relative concentration data to find metabolites that contribute to distinguishing the categories of data.
In some embodiments, the logistic regression employs L1 regularization, and a metabolite whose weight in the classification has an absolute value greater than 0 is considered to contribute to the classification.
In certain embodiments, the method further comprises validating the differential metabolites obtained.
In some embodiments, the validating comprises regression analysis between the actual concentrations in the sample and the coefficients obtained by unmixing with the weighted non-negative matrix factorization algorithm.
In certain embodiments, the validating comprises analyzing how well the differential metabolites fit the physiology or pathology.
In another aspect, the application provides a method of detecting the presence of a disease or disorder, or assessing the risk of developing a disease or disorder, the method comprising the steps of:
1) Obtaining a spectrum of an individual sample to be tested, and unmixing the spectrum obtained by the test by adopting weighted nonnegative matrix factorization (NMF-CLS) based on a standard spectrum database to obtain the type and the relative concentration of a known metabolite contained in the sample to be tested, wherein the known metabolite is a molecule contained in the standard spectrum database;
2) Comparing the relative concentration of the known metabolite to a normal interval; and
3) Determining whether the individual has, or is at risk of developing, a disease or disorder.
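The comparison in step 2) can be pictured with the short sketch below. It only flags metabolites whose relative concentration falls outside a reference interval; the decision rule of step 3) is not specified in the text and is not modelled. The function name, the metabolite names and all numerical values are placeholders invented for illustration, not data from the application.

```python
def flag_out_of_range(relative_conc, normal_intervals):
    """Return {metabolite: "low" or "high"} for values outside the normal interval.

    relative_conc    : dict mapping metabolite name to relative concentration
    normal_intervals : dict mapping metabolite name to (lower, upper) bounds
    """
    flags = {}
    for name, value in relative_conc.items():
        if name not in normal_intervals:
            continue  # no reference interval available for this metabolite
        lower, upper = normal_intervals[name]
        if value < lower:
            flags[name] = "low"
        elif value > upper:
            flags[name] = "high"
    return flags

# Placeholder values for illustration only
measured = {"metabolite_a": 0.12, "metabolite_b": 0.50, "metabolite_c": 0.90}
reference = {
    "metabolite_a": (0.20, 0.40),
    "metabolite_b": (0.30, 0.60),
    "metabolite_c": (0.10, 0.70),
}
flags = flag_out_of_range(measured, reference)
```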
In another aspect, the application provides a method of determining the stage of a disease or disorder, the method comprising the steps of:
1) Obtaining a spectrum of an individual sample to be tested, and unmixing the spectrum obtained by the test by adopting weighted nonnegative matrix factorization (NMF-CLS) based on a standard spectrum database to obtain the type and the relative concentration of a known metabolite contained in the sample to be tested, wherein the known metabolite is a molecule contained in the standard spectrum database;
2) Comparing the relative concentration of the biomarker to a known stage level; and
3) Determining the stage or type of the disease or disorder.
In certain embodiments, wherein the disease or condition is selected from the group consisting of: infectious diseases, proliferative diseases, neurodegenerative diseases, cancer, psychological diseases, metabolic diseases, autoimmune diseases, sexually transmitted diseases, gastrointestinal diseases, pulmonary diseases, cardiovascular diseases, stress and fatigue related disorders, mycoses, pathogenic diseases and obesity related disorders.
In another aspect, the application provides a method of cell or microorganism analysis, the method comprising the steps of:
1) Obtaining spectral data of a cell sample to be tested, and, based on a standard spectrum database, unmixing the measured spectrum using weighted non-negative matrix factorization (NMF-CLS) to obtain the types and relative concentrations of the known metabolites contained in the sample to be tested, wherein the known metabolites are molecules contained in the standard spectrum database;
2) Comparing the relative concentration of the known metabolite to a normal interval; and
3) Determining the physiological or pathological state, physiological or pathological type of said cell or microorganism.
In certain embodiments, the method further comprises screening the identified cells or microorganisms to obtain the desired cell or microorganism type of interest.
In certain embodiments, the spectral data comprise Raman spectral data and infrared spectral data.
In certain embodiments, the infrared spectral data comprise surface-enhanced infrared spectral data.
In certain embodiments, the Raman spectral data comprise surface-enhanced Raman spectral data.
In certain embodiments, the standard spectrum database comprises a Raman standard spectrum database and an infrared standard spectrum database.
In certain embodiments, the Raman standard spectrum database comprises a SERS standard spectrum database.
In certain embodiments, the spectrum of the sample to be tested comprises a broad-spectrum SERS spectrum.
In another aspect, the application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the aforementioned method.
In certain embodiments, the computer readable storage medium further stores standard spectrum database data.
In certain embodiments, the standard spectrum database comprises a SERS standard spectrum database.
In another aspect, the application provides an apparatus comprising a memory storing a standard spectrum database and a computer program, and a processor implementing the steps of the aforementioned method when the computer program is executed.
In another aspect, the present application provides a spectral unmixing system based on a weighted non-negative matrix factorization algorithm, comprising: a solution optimization module for solving the weighted non-negative matrix factorization algorithm by an iterative method to complete the unmixing of spectral data.
In some embodiments, the system further comprises a weight optimization module for solving for the weights of the known molecules using a linear regression method to determine an optimal weight value.
In certain embodiments, the system further comprises an evaluation module for evaluating the unmixing results using the relative concentrations of known molecules.
In another aspect, the present application provides the use of the aforementioned computer readable storage medium, the aforementioned device, or the aforementioned system in the manufacture of a device for the analysis of compounds and/or the classification and detection of microorganisms.
In another aspect, the application provides the use of the aforementioned computer-readable storage medium, the aforementioned device, or the aforementioned system in the manufacture of a device for metabonomic data processing and/or analysis.
In another aspect, the present application provides a metabonomic analysis device comprising: a data processing module for analyzing the surface-enhanced Raman spectral data of the sample to be tested to obtain the types and relative concentrations of the metabolites in the sample.
In some embodiments, the data processing module includes a solution optimization module for solving the weighted non-negative matrix factorization algorithm by an iterative method to complete the unmixing of spectral data.
In some embodiments, the data processing module includes a weight optimization module for solving for the weights of the known molecules using a linear regression method to determine an optimal weight value.
In certain embodiments, the data processing module includes an evaluation module for evaluating the unmixing results using the relative concentrations of known molecules.
In certain embodiments, the evaluating comprises classifying the test sample using a differential metabolite classification test model.
In some embodiments, the device further includes a spectrum detection module, configured to perform spectrum detection on the sample to be detected, and obtain spectrum data of the sample to be detected.
In certain embodiments, the device further comprises a test sample collection module for collecting a test sample based on a metabonomics method.
Other aspects and advantages of the present application will become readily apparent to those skilled in the art from the following detailed description. Only exemplary embodiments of the present application are shown and described in the following detailed description. As those skilled in the art will recognize, the present disclosure enables one skilled in the art to make modifications to the disclosed embodiments without departing from the spirit and scope of the application as claimed. Accordingly, the drawings and descriptions of the present application are to be regarded as illustrative in nature and not as restrictive.
Drawings
The specific features of the invention to which this application relates are set forth in the appended claims. The features and advantages of the invention can be better understood by reference to the exemplary embodiments described in detail below and to the accompanying drawings. The drawings are briefly described as follows:
FIG. 1 is a flow chart showing the steps of a weighted non-negative matrix factorization algorithm in accordance with the present application;
FIG. 2 shows SERS standard spectra of 89 metabolite molecules constructed in example 1 of the present application;
FIGS. 3a-3g show a flow chart of model verification in example 2 of the present application and the effect of SERS spectra fitting after unmixing according to NMF-CLS;
FIG. 4a shows the effect of fitting the SERS spectra of the sample of example 2 of the present application after unmixing according to CLS;
FIG. 4b shows the effect of the SERS spectra of the sample of example 2 according to the application after unmixing with NMF;
FIG. 4c shows the spectrum of the known components calculated by NMF in example 2 of the present application;
FIGS. 5a-5c show the effect of fitting the SERS spectra of the model of example 3 of the present application after unmixing according to NMF-CLS;
FIGS. 6a-6c show the calculation of the number of spectra required for different cell samples according to example 4 of the present application;
FIG. 7a shows the change in cell morphology of each cell in example 4 of the present application;
FIG. 7b shows the effect of the SERS spectra of the cell culture fluids according to example 4 of the present application after unmixing according to CLS;
FIG. 7c shows SERS spectra obtained from 200 spectra of DAY2 data of each cell culture broth of example 4 of the present application;
FIG. 7d shows the coefficient variation curves for 8 different metabolites in each cell culture broth according to example 4 of the present application;
FIGS. 8a-8c show the calculation of the number of spectra required for different serum samples according to example 5 of the present application;
FIG. 9 is a chart showing SERS spectra obtained by selecting 200 spectra from each of the different serum samples in example 5 of the present application;
FIG. 10 shows the effect of the SERS spectra of the different serum samples of example 5 according to the application after unmixing according to NMF-CLS;
FIG. 11 shows 16 differential metabolites selected from different serum samples according to example 5 of the present application;
FIGS. 12a-12c show the results of a comparison of metabonomic classification with PSA screening in example 5 of the present application.
Detailed Description
Further advantages and effects of the present application will become readily apparent to those skilled in the art from the present disclosure, by describing embodiments of the present application with specific examples.
Definition of terms
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. Also, unless otherwise indicated, the use of "or" includes "and" and vice versa, except within the claims. Non-limiting terms are not to be construed as limiting unless expressly stated or the context clearly indicates otherwise (e.g., "comprising," "having," and "including" typically indicate "including but not limited to"). In the claims, the singular forms such as "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. To aid in understanding and preparing the present application, the following illustrative, non-limiting examples are provided.
In the present application, the term "Metabolome" generally refers to the collection of all metabolites in a biological cell, tissue, organ or organism, and generally refers to the collective term for small molecule metabolites having a relative molecular mass of less than about 1500Da (Da: daltons).
The term "small molecule metabolite" includes organic and inorganic molecules that are present in a cell, cell compartment, or organelle, typically having a molecular weight below 2000 or 1500. The term does not include macromolecules such as large proteins (e.g., proteins having a molecular mass exceeding 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, or 10,000), large nucleic acids (e.g., nucleic acids having a molecular mass exceeding 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, or 10,000), or large polysaccharides (e.g., polysaccharides having a molecular mass exceeding 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, or 10,000). Small molecule metabolites of cells are typically found free in solution in the cytoplasm or in other organelles (e.g., mitochondria), where they form a pool of intermediates that can be further metabolized or used to produce larger molecules, called macromolecules. The term "small molecule metabolites" includes signal molecules and intermediates in chemical reactions that convert energy derived from food into a useful form. Examples of small molecule metabolites include phospholipids, glycerophospholipids, lipids, plasmalogens, sugars, fatty acids, amino acids, nucleotides, intermediates formed during cellular processes, isomers and other small molecules found within cells. In one embodiment, the small molecules of the application are isolated. Preferred metabolites include lipids and fatty acids.
By way of non-limiting example, the small molecule metabolite may be selected from: 1, 3-uric acid dimethyl ester, L-glucan, 1-methylnicotinamide, 2-hydroxyisobutyrate, 2-oxoglutarate, 3-aminoisobutyrate, 3-hydroxybutyrate, 3-hydroxyisovalerate, 3-indolesulfonate, 4-hydroxyphenylacetate, 4-hydroxyphenyllactic acid, 4-picolinate, acetate, acetoacetate, acetone, adipate, alanine, allantoin, asparagine, betaine, carnitine, citrate, myo-cine, creatinine, dimethylamine, ethanolamine, formate, trehalose, fumarate, glucose, glutamine, glycine, hippurate, histidine, hypoxanthine, isoleucine, lactate, leucine, lysine, mannitol, N, N-dimethylglycine, O-acetylcarnitine, pantothenate, propylene glycol, pyroglutamate, pyruvate, quinolinate, serine, succinate, sucrose, taurine, threonine, trigonelline, trimethylamine-N-oxide, tryptophan, tyrosine, cytosine, uracil, urea, valine, xylose, cis-aconitic acid, inositol, trans-aconitic acid, 1-methylhistidine, 3-methylhistidine, ascorbate, phenylacetylglutamine, 4-hydroxyproline, gluconate, galactose, galactitol, plant galactose, lactose, phenylalanine, praline betaine, trimethylamine, butyrate, propionate, isopropanol, mannose, 3-methylxanthine, ethanol, benzoate, glutamate, glycerol, guanosine, guanine, xanthine, adenine, uric acid, adenosine, inosine, inosinic acid, CO2, H2O, N-carbamoyl-beta-alanine, ammonia, beta-aminoisobutyric acid, putrescine, spermidine, spermine, methionine, S-adenosylmethionine, decarboxylated S-adenosylmethionine, arginine, ornithine, putrescine, N1-acetylspermidine, N1-acetylspermine, elF5A (Lys), elF5A (Dhp), elF5A (Hpu), N1N 2-diacetyl spermine, 3-aminopropionic acid, 3-acetamidopropionic acid, acrolein, FDP-lysine protein, threo-Ds-isocitrate, oxalyl-succinate, 2-oxo-glutarate, oxalyl-acetate, L-glutamate, 2-hydroxy-glutarate, acetyl-CoA, cis-aconitic acid, D-isocitrate, alpha-ketoglutarate, succinyl-CoA, malate, (-) O-acetyl-carnitine, itaconic acid hydrochloride, glycolate, oxalyl-CoA, 6P, 
2-hydroxy-phosphate (3-phosphate), 2-D-glycerate, 6P-phosphate (3-phosphate), 2-hydroxy-phosphate (3-phosphate), 2-P-phosphate (3-phosphate), 2-hydroxy-phosphate (3-phosphate) 2 (P), 2-hydroxy-phosphate (3-phosphate) and 2-hydroxy-phosphate (P-6) phosphate (3-phosphate) respectively D-glucose, D-glucono-1, 5-lactone, D-gluconate, alpha-D-mannose 6-P, D-mannose, D-fructose, D-sorbitol, glycerone-P, sn-glycero-3P, D-glyceraldehyde, 1, 2-propane-diol, 2-hydroxypropionaldehyde, 3-P-serine, 3-P-hydroxy pyruvate, D-glycinate, hydroxy pyruvate, L-alanine, L-alanyl-tRNA, L-glutamate, 2-oxoglutarate, L-lactate, D-lactate, adenosine Triphosphate (ATP), adenosine Diphosphate (ADP), H+, succinate, O2, NADH, NAD+, NADP+, NADPH 6-phosphogluconolactone, 6-phosphogluconate, ribulose-5-phosphate, ribose-5-phosphate, xylulose-5-phosphate, glyceraldehyde 3-phosphate, sedoheptulose-7-phosphate, fructose-6-phosphate, erythrose 4-phosphate, xylulose-5-phosphate, D-ribulose, D-ribitol, D-ribose, L-ribulose, sedoheptulose 1,7P2, 3-oxo-6-P-hexulose, L-ornithine, carbamyl phosphate, L-citrulline, argininosuccinic acid, L-arginine, L-aspartate, adenosine Monophosphate (AMP), pyrophosphate, trans- Δ2-enoyl-CoA, L-beta-hydroxyalkyl CoA, beta-ketoethyl CoA, FADH2, acyl-CoA, propionyl-CoA, inosine Monophosphate (IMP), xanthosine Monophosphate (XMP), guanosine Monophosphate (GMP), xanthosine, adenylsuccinic acid, uridine Monophosphate (UMP), thymidine, thymine, deoxyribose-1-phosphate, deoxythymidine monophosphate (dTMP), deoxycytidine monophosphate (dCMP), retinyl palmitate, palmitoyl-CoA, isotretinoin, beta-glucuronide, retinal, beta-carotene, retinoic acid, calcitol, 25-hydroxyergocalciferol, calcitriol, methylcobalamin, 5 '-deoxyadenosyl cobalamin, alpha-CECH, NH4+, alpha-ketoglutarate, oxaloacetate, glutamate gamma-semialdehyde, delta 1-pyrroline-5-carboxylate, citrulline, NH3, N5, N10-methylene THF, 3-phosphoglycerate, alpha-ketobutyrate, alpha-amino-beta-ketobutyrate, aminoacetone, 
cysteinesulfonic acid, beta-sulfinylacetonate, bisulfite, sulfite, sulfate, glutathione, taurine, adenosine 5' -phosphosulfate, 3 '-phosphoadenyl5' -phosphosulfate, homocysteine, alpha-keto-beta-methylvalerate, alpha-ketoisocaproic acid, alpha-ketoisovalerate, alpha-methylbutyl-CoA, methylcrotonyl-CoA, 3-methyl-3-hydroxybutyrate acyl-CoA, 2-methylacetoacetyl-CoA, isovaleryl-CoA, 3-methylcrotonyl-CoA, 3-methylpentenedioyl-CoA, 3-hydroxy-3-methylglutaryl-CoA, acetoacetate, isobutyryl CoA, methylpropyl-CoA, 3-hydroxyisobutyryl-CoA, methylmalonic acid monoaldehyde, p-hydroxyphenylpyruvate, homogentisate, 4-maleoacetate, 4-fumarate, 3-hydroxytrimethyllysine, 4-N-trimethylaminobutyraldehyde, gamma-butyryl betaine, urocanic acid ester, 4-imidazolidinone-5-propionate, N-iminomethyl-L-glutamate, N5-iminomethyl-tetrahydrofolate, histamine, N-formate-canine urea, 3-hydroxycanine urea, anthranilate, 3-hydroxyphthalamide, glutaryl CoA, acetyl-CoA, and combinations thereof.
In the present application, the term "biological sample" or "chemical sample" may include various biological samples or chemical samples suitable for observation (e.g., imaging) or examination. Chemical samples include any chemical mixture or compound. Biological samples include, but are not limited to, cell cultures or extracts thereof; biopsy material obtained from an animal (e.g., mammal) or an extract thereof; and blood, saliva, urine, stool, semen, tears, or other bodily fluids or extracts thereof. For example, the term "biological sample" refers to any solid or fluid sample obtained from, excreted from, or secreted by any living organism, including unicellular microorganisms (such as bacteria and yeast) and multicellular organisms (such as plants and animals, e.g., vertebrates or mammals, and particularly healthy or apparently healthy human subjects or human patients affected by the condition or disease to be diagnosed or studied). The biological sample may be in any form, including solid materials such as tissues, cells, cell aggregates, cell extracts, cell homogenates or cell fractions; or a biopsy body, or a biological fluid. Biological fluids may be obtained from any site (e.g. blood, saliva (or buccal wash containing buccal cells), tears, plasma, serum, urine, bile, cerebrospinal fluid, amniotic fluid, peritoneal fluid and pleural fluid, or cells therefrom, aqueous or vitreous fluid, or any bodily secretion), exudates, secretions (e.g. fluids obtained from abscess or any other site of infection or inflammation) or from joints (e.g. normal joints or joints affected by diseases such as rheumatoid arthritis, osteoarthritis, gouty or septic arthritis). Biological samples may be obtained from any organ or tissue (including biopsy or autopsy specimens) or may comprise cells (whether primary or cultured) or media conditioned by any cell, tissue or organ. 
Biological samples may also include tissue sections, such as frozen sections taken for histological purposes. Biological samples also include mixtures of biomolecules including proteins, lipids, carbohydrates and nucleic acids produced by partially or completely fractionating a cell or tissue homogenate. While the sample is preferably taken from a human individual, the biological sample may be from any animal, plant, microorganism, cell, virus, yeast, or the like.
In the present application, the term "subject" generally refers to humans as well as non-human animals at any stage of development, including, for example, mammals, birds, reptiles, amphibians, fish, worms, and single cells. Cell cultures and living tissue samples are considered to be pluralities of animals. In certain exemplary embodiments, the non-human animal is a mammal (e.g., rodent, mouse, rat, rabbit, monkey, dog, cat, sheep, cow, primate, or pig). The animal may be a transgenic animal or a human clone. If desired, the biological sample may be subjected to preliminary treatments, including preliminary separation techniques.
In the present application, the terms "microbial" and "microorganism" include all microorganisms including bacteria, viruses and fungi.
In the present application, the term "cell" generally refers to its meaning as generally recognized in the art. The cells may be prokaryotic (e.g., bacterial cells) or eukaryotic (e.g., mammalian or plant cells). Cells may be of somatic or germ line origin, totipotent or multipotent, dividing or non-dividing. The cells may also be derived from or may comprise gametes or embryos, stem cells, or fully differentiated cells.
In the present application, the terms "disease" and "disorder" are used interchangeably and generally refer to any deviation of a subject from a normal state, such as any change in the state of the body or of certain organs that impedes or disrupts the performance of functions and/or causes symptoms such as discomfort, dysfunction, pain, or even death in the person afflicted or in those in contact with that person. A disease or disorder may also be referred to as a distemper, ailment, malady, sickness, illness, complaint, indisposition, or affection. The term "stage" generally refers to the particular stage to which a disease is identified as having progressed.
In the present application, the term "Pearson correlation coefficient" or "Pearson coefficient" generally refers to the quotient of the covariance of two variables and the product of their standard deviations. The sign of the Pearson coefficient is interpreted as follows: a positive value indicates that the two variables are positively correlated, i.e., they tend to increase together; a negative value indicates a negative correlation, i.e., one tends to decrease as the other increases. In the application, the Pearson coefficient is used to judge whether the number of spectra is sufficient: the average of 200 spectra is used as the standard spectrum; N spectra are drawn each time and averaged, and the Pearson coefficient between this averaged spectrum and the standard spectrum is calculated; the operation is repeated 5 times, and the average of the 5 Pearson coefficients is taken as the correlation coefficient corresponding to N spectra. The number of spectra is considered sufficient when the Pearson coefficient is greater than 0.8 or converges.
The absolute value of the Pearson coefficient indicates a different degree of correlation in each interval:
Absolute value of Pearson coefficient    Degree of correlation
0-0.2                                    Weak correlation
0.2-0.5                                  Moderate correlation
0.5-0.8                                  Strong correlation
0.8-1.0                                  Extremely strong correlation
In the present application, the term "about" generally means approximately or in the vicinity of. When the term "about" is used to refer to a range of values, a cutoff value, or a particular value, it indicates that the actual value may differ from the recited value by as much as 10%. Thus, the term "about" may be used to encompass variations from a particular value of ±10% or less, variations of ±5% or less, variations of ±1% or less, variations of ±0.5% or less, or variations of ±0.1% or less.
Detailed Description
Weighted non-negative matrix factorization (NMF-CLS)
In one aspect, the application provides a novel spectral unmixing method, namely a weighted non-negative matrix factorization algorithm, which obtains the relative concentrations corresponding to specific components while achieving a better fit. On the basis of the objective function of the original equation (1), a weight is added for the spectra of the known components, and the reference spectra of the unknown components are calculated by the algorithm, so that a better fit and the type and relative concentration information of the known components are obtained at the same time. The weighted objective function is:
where the m × r1 matrix W(1) represents the reference spectra of the known components (taken from the standard database) arranged in columns, and the m × r2 matrix W(2) represents the spectra of the unknown components arranged in columns, which are calculated by the algorithm. The r1 × n matrix H(1) and the r2 × n matrix H(2) represent the relative concentrations corresponding to W(1) and W(2), respectively, and α represents the weight set for the known components.
Since the reference spectra of the known components are accurate, the purpose of the algorithm is to find W(2), H(1) and H(2) such that F in equation (4) is minimized. The partial derivatives of F with respect to W(2), H(1) and H(2) are:
Following the multiplicative update rule proposed by Lee and Seung (Lee, D. D.; Seung, H. S. Nature 1999, 401, 788-791), the iterative formulas for W(2), H(1) and H(2) can be derived from the partial derivatives as:
H(1), W(2) and H(2) are updated iteratively according to the iterative formulas. Iteration stops when the maximum number of iterations N is reached or F falls to the set threshold σ; after iteration stops, H(1) is the final result of the relative concentrations corresponding to the known components.
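The equations themselves are not reproduced in this text. As a worked illustration only, suppose the weighted objective takes the form below. This form is an assumption chosen to agree with the definitions of V, W(1), W(2), H(1), H(2) and α given here and with the α = 0 cases described later; it is not the equation of the application. Under that assumption the partial derivatives and Lee-Seung-style multiplicative updates follow directly, where ⊙ and the fraction bar denote element-wise product and division:

```latex
\begin{aligned}
F &= \tfrac{1}{2}\,\lVert V - W^{(1)}H^{(1)} - W^{(2)}H^{(2)} \rVert_F^2
   + \tfrac{\alpha}{2}\,\lVert V - W^{(1)}H^{(1)} \rVert_F^2 \\[4pt]
\frac{\partial F}{\partial W^{(2)}} &= -\bigl(V - W^{(1)}H^{(1)} - W^{(2)}H^{(2)}\bigr)\,H^{(2)\top} \\
\frac{\partial F}{\partial H^{(1)}} &= -(1+\alpha)\,W^{(1)\top}\bigl(V - W^{(1)}H^{(1)}\bigr) + W^{(1)\top}W^{(2)}H^{(2)} \\
\frac{\partial F}{\partial H^{(2)}} &= -W^{(2)\top}\bigl(V - W^{(1)}H^{(1)} - W^{(2)}H^{(2)}\bigr) \\[4pt]
W^{(2)} &\leftarrow W^{(2)} \odot
  \frac{V H^{(2)\top}}{W^{(1)}H^{(1)}H^{(2)\top} + W^{(2)}H^{(2)}H^{(2)\top}} \\
H^{(1)} &\leftarrow H^{(1)} \odot
  \frac{(1+\alpha)\,W^{(1)\top}V}{(1+\alpha)\,W^{(1)\top}W^{(1)}H^{(1)} + W^{(1)\top}W^{(2)}H^{(2)}} \\
H^{(2)} &\leftarrow H^{(2)} \odot
  \frac{W^{(2)\top}V}{W^{(2)\top}W^{(1)}H^{(1)} + W^{(2)\top}W^{(2)}H^{(2)}}
\end{aligned}
```

Each update moves the negative part of the corresponding gradient into the numerator and the positive part into the denominator, so non-negative initial values stay non-negative.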
In certain embodiments, wherein the maximum number of iterations N is not less than about 20, about 25, about 30, about 40, about 50, about 100, or about 200.
In certain embodiments, wherein the threshold σ does not exceed about 0.01, or about 0.001, or about 0.0001, or about 0.00001, or about 0.000001.
The algorithm defines a spectral matrix W(1) of the known components with its corresponding coefficient (relative concentration) matrix H(1), and a spectral matrix W(2) of the unknown components with its corresponding coefficient (relative concentration) matrix H(2). H(1), W(2) and H(2) are calculated by the algorithm during iteration. The implementation steps and flow chart of the algorithm are shown in Fig. 1:
1) Input the spectral matrix W(1) of the known components, the measured spectral matrix V, the maximum number of iterations N, and the threshold σ;
2) Randomly initialize the coefficient matrix H(1) of the known components, the spectral matrix W(2) of the unknown components, and its coefficient matrix H(2);
3) Iteratively update H(1), W(2) and H(2) according to the multiplicative update formulas derived in the application;
4) Stop iterating when the maximum number of iterations N (20) is reached or F falls to the set threshold σ (0.000001);
5) After iteration stops, H(1) is the final result of the coefficients (relative concentrations) corresponding to the known components.
W(1) and H(1) represent the spectra of the known components and the calculated coefficients of the known components, respectively. Note that each column of W(1) is the Raman spectrum of one component, and the corresponding row of H(1) holds the coefficients of that component in all spectra; that is, the i-th row of H(1) contains the coefficients of the i-th component of W(1). Therefore, for any known component, no further calculation is required: it is only necessary to take out the row of H(1) corresponding to the target component.
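The five steps above can be sketched in NumPy. Because the objective and update formulas are not reproduced in this text, the sketch assumes an objective of the form F = ½‖V − W(1)H(1) − W(2)H(2)‖² + (α/2)‖V − W(1)H(1)‖² and the multiplicative updates that follow from it; the function name `nmf_cls`, its parameters and the small constant `eps` are illustrative and not taken from the application.

```python
import numpy as np

def nmf_cls(V, W1, r2, alpha=1.0, max_iter=20, sigma=1e-6, seed=0):
    """Weighted NMF with fixed known spectra W1 (assumed objective, see lead-in).

    V  : (m, n) measured spectra, one spectrum per column, non-negative
    W1 : (m, r1) reference spectra of known components, non-negative
    r2 : number of unknown components to estimate
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    r1 = W1.shape[1]
    eps = 1e-12  # avoids division by zero in the multiplicative updates

    # Step 2: random non-negative initialization
    H1 = rng.random((r1, n))
    W2 = rng.random((m, r2))
    H2 = rng.random((r2, n))

    def objective():
        full = V - W1 @ H1 - W2 @ H2   # residual of the full model
        known = V - W1 @ H1            # residual of the known part only
        return 0.5 * np.sum(full**2) + 0.5 * alpha * np.sum(known**2)

    history = [objective()]
    for _ in range(max_iter):
        # Step 3: multiplicative updates (assumed form)
        H1 *= ((1 + alpha) * (W1.T @ V)) / (
            (1 + alpha) * (W1.T @ W1 @ H1) + W1.T @ W2 @ H2 + eps)
        W2 *= (V @ H2.T) / (W1 @ H1 @ H2.T + W2 @ H2 @ H2.T + eps)
        H2 *= (W2.T @ V) / (W2.T @ W1 @ H1 + W2.T @ W2 @ H2 + eps)
        history.append(objective())
        # Step 4: stop once F has fallen to the threshold
        if history[-1] <= sigma:
            break
    # Step 5: H1 holds the relative concentrations of the known components
    return H1, W2, H2, history
```

After the call, `H1[i]` holds the coefficients of the i-th known component across all n spectra, matching the row convention described above.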
In some embodiments, the detection conditions of the standard spectra are the same as those of the sample to be tested. For example, both the standard spectra and the metabolic sample are measured by broad-spectrum SERS.
In some embodiments, the weight α is inversely related to the ratio of the number of known molecules (r1) to the number of unknown molecules (r2) contained in the sample to be tested. In some embodiments, the weight α is non-linearly related to that ratio.
In certain embodiments, when neither W(2) nor H(2) exists, the weight α is set to 0 and the objective function of the NMF-CLS algorithm is set to the classical least squares objective.
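For the case with no unknown components, a minimal sketch of classical least squares is given below. It assumes the usual objective ‖V − W(1)H(1)‖² and solves it in closed form with `numpy.linalg.lstsq`; the function name `cls_unmix` is illustrative. Unlike the multiplicative updates, this unconstrained solve does not enforce non-negative coefficients.

```python
import numpy as np

def cls_unmix(V, W1):
    """Classical least squares: H1 minimizing ||V - W1 @ H1||_F^2."""
    # lstsq solves every column of V against the same reference spectra W1
    H1, *_ = np.linalg.lstsq(W1, V, rcond=None)
    return H1
```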
In some embodiments, when the ratio of the number of known molecules (r1) to the number of unknown molecules (r2) contained in the sample to be tested is not less than 1, the weight α is set to 0, and the objective function of the NMF-CLS algorithm is set to:
In some embodiments, when the ratio of the number of known molecules (r1) to the number of unknown molecules (r2) contained in the sample to be tested is less than 1, the weight α is not 0, and the objective function of the NMF-CLS algorithm is set to:
In some embodiments, creating the standard spectrum database comprises: collecting a plurality of spectral images of a given molecule, calculating the average spectrum of that molecule, obtaining the average spectra of other molecules in the same way, and incorporating them into the standard spectrum database.
In certain embodiments, when a spectral image of a molecule is acquired, the concentration of the molecule is between 0.1 mM and 10 mM.
In certain embodiments, the number of spectral images acquired for a molecule is not less than about 10, about 20, about 50, about 100, or about 200.
In certain embodiments, creating the standard spectrum database further comprises collecting open-source spectral data from the literature.
In certain embodiments, the method further comprises normalizing the intensity of the average spectrum obtained for each molecule.
In some embodiments, creating the standard spectrum database comprises: collecting a plurality of spectral images of a given molecule, calculating the average spectrum of that molecule, and normalizing its intensity, the resulting spectrum being the standard spectrum of the molecule; obtaining the standard spectra of other molecules in the same way; and incorporating them into the standard spectrum database.
In some embodiments, creating the standard spectrum database comprises: collecting a plurality of spectral images of a given molecule, averaging them, and normalizing the intensity of the averaged spectrum to the range [0, 1], the resulting spectrum being the standard spectrum of the molecule; obtaining the standard spectra of other molecules in the same way; and incorporating them into the standard spectrum database.
For example, creating the standard spectrum database may include: obtaining the average spectrum of a given molecule and normalizing its intensity to the range [0, 1], the resulting spectrum being the standard spectrum of the molecule; obtaining the standard spectra of other molecules in the same way; and incorporating them into the standard spectrum database. The average spectrum may be obtained from open-source data, or by collecting a plurality of standard spectral images and averaging them.
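The database construction described above can be sketched as follows. The sketch assumes that "normalizing to [0, 1]" means min-max scaling of the averaged spectrum, which is one common reading and is not specified in the text; the function names and the dictionary layout are illustrative.

```python
import numpy as np

def standard_spectrum(spectra):
    """Average repeated spectra (one per row) of a molecule and scale to [0, 1]."""
    mean = np.asarray(spectra, dtype=float).mean(axis=0)
    span = mean.max() - mean.min()
    if span == 0:
        raise ValueError("a flat spectrum cannot be min-max normalized")
    return (mean - mean.min()) / span  # assumed min-max normalization

def build_database(measurements):
    """Build the database from {molecule name: repeated spectra}.

    Returns the name -> standard spectrum mapping, the ordered names,
    and the matrix W1 with one standard spectrum per column.
    """
    names = sorted(measurements)
    database = {name: standard_spectrum(measurements[name]) for name in names}
    W1 = np.column_stack([database[name] for name in names])
    return database, names, W1
```

The returned `W1` has the column layout expected for the known-component matrix W(1).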
In some embodiments, the method includes collecting a plurality of spectra of the sample to be tested, analyzing each spectrum separately with the algorithm to obtain the coefficients of the different known components, processing the coefficients (e.g., by averaging, summing, ANOVA and/or Student's t-test), and finally obtaining the relative concentrations of the different known molecules in the sample.
In certain embodiments, the number of spectra collected from the sample to be tested must be sufficient to ensure that substantially all of the molecular information in the sample has been collected.
In certain embodiments, methods for determining that the number of spectra is sufficient to have collected substantially all of the molecular information in the sample include, but are not limited to, comparison by Pearson coefficients.
In some embodiments, obtaining the Pearson coefficient comprises: taking the average of M spectra of the sample to be tested as the standard spectrum, drawing N spectra each time and averaging them, calculating the Pearson coefficient between this averaged spectrum and the standard spectrum, repeating the operation n times, and averaging the n Pearson coefficients to obtain the correlation coefficient corresponding to N spectra.
In certain embodiments, wherein M is about 50 to 500, or about 100 to 400, or about 200 to 300.
In certain embodiments, wherein n is about 3 to 30, or about 4 to 20, or about 5 to 10.
In certain embodiments, wherein the Pearson coefficient is not less than about 0.8, about 0.85, about 0.9, or about 0.95.
In certain embodiments, the number of spectra collected from the sample to be tested is about 50 to 200.
For example, a spectral image unmixing method based on the NMF-CLS algorithm may include: collecting a plurality of spectra of the sample to be tested; based on a standard spectrum database, analyzing each spectrum separately with the NMF-CLS algorithm to obtain the coefficients of the known components of each spectrum; processing the coefficients (e.g., by averaging or summing); and finally obtaining the relative concentrations of the known molecules in the sample. The known molecules are molecules contained in the standard spectrum database, which consists of the standard spectra of different molecules. The number of spectra collected from the sample to be tested must be sufficient to ensure that substantially all of the molecular information in the sample has been collected (e.g., about 50, about 100, or about 200 spectra).
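The Pearson-based sufficiency check can be sketched as follows, with M spectra stored as rows, N spectra drawn per trial and n repetitions. The function names, the random sampling without replacement and the default threshold of 0.8 are illustrative choices based on the description; the convergence criterion mentioned in the text is not implemented here.

```python
import numpy as np

def subset_correlation(spectra, N, n_repeats=5, seed=0):
    """Mean Pearson coefficient between the mean of N random spectra
    and the mean of all M spectra (the standard spectrum)."""
    spectra = np.asarray(spectra, dtype=float)  # shape (M, m)
    rng = np.random.default_rng(seed)
    reference = spectra.mean(axis=0)
    coefficients = []
    for _ in range(n_repeats):
        picked = rng.choice(len(spectra), size=N, replace=False)
        subset_mean = spectra[picked].mean(axis=0)
        coefficients.append(np.corrcoef(subset_mean, reference)[0, 1])
    return float(np.mean(coefficients))

def sufficient_count(spectra, threshold=0.8, **kwargs):
    """Smallest N whose averaged coefficient reaches the threshold, else None."""
    for N in range(1, len(spectra) + 1):
        if subset_correlation(spectra, N, **kwargs) >= threshold:
            return N
    return None
```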
In certain embodiments, wherein the sample to be tested comprises a chemical sample or a biological sample.
For example, biological or chemical samples include biomolecules, nucleosides, nucleic acids, polynucleotides, oligonucleotides, proteins, enzymes, polypeptides, antibodies, antigens, ligands, receptors, polysaccharides, carbohydrates, polyphosphates, nanopores, organelles, lipid layers, tissues, organs, organisms, and body fluids. The term "biological or chemical sample" may include biologically active compound(s), such as analogs or mimics of the aforementioned species. As used herein, the term "biological sample" may include samples such as cell lysates, whole cells, organisms, organs, tissues and body fluids. "Body fluids" may include, but are not limited to, blood, dried blood, clotted blood, serum, plasma, saliva, cerebrospinal fluid, pleural fluid, tears, ductal fluid, lymph, sputum, urine, amniotic fluid and semen. The sample may comprise "acellular" body fluids. An "acellular body fluid" includes less than about 1% (w/w) whole cell material. Plasma or serum is an example of an acellular body fluid. The sample may be of natural or synthetic origin (i.e., a cellular sample made acellular). In some embodiments, the biological sample may be from a human or non-human source. In some embodiments, the biological sample may be from a human patient. In some embodiments, the biological sample may be from a human neonate.
In certain embodiments, wherein the sample to be tested comprises a liquid sample.
In certain embodiments, the spectral images comprise infrared spectra and Raman spectra.
In certain embodiments, the infrared spectrum comprises a surface-enhanced infrared spectrum.
In certain embodiments, the Raman spectrum comprises a surface-enhanced Raman spectrum.
In certain embodiments, the surface-enhanced Raman spectrum is a broad-spectrum surface-enhanced Raman spectrum.
Setting of the weight alpha
In some embodiments, the method for setting the weight α includes:
1) Determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested;
2) Preparing a plurality of simple samples containing a few known molecules at different concentration gradients and a certain number of unknown molecules, wherein the ratio of the number of known molecules to the number of unknown molecules in the simple samples is equal to that of the sample to be tested;
3) Setting different weights α, unmixing the measured spectra of the simple samples with the NMF-CLS algorithm to obtain the coefficients corresponding to the known molecules, establishing a regression equation between the coefficients and the actual concentrations of the known molecules, and calculating the R-squared (R²) value; the α giving the highest R² is taken as the optimal weight for the sample to be tested.
In certain embodiments, the method for determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested comprises principal component analysis.
In certain embodiments, wherein the ratio of the number of molecules in the simple sample to the number of molecules in the sample to be tested is no more than about 1/2, or about 1/5, or about 1/10.
In certain embodiments, wherein the number of known molecules in the simple sample ranges from 1 to 100, or 1 to 50, or 1 to 20, or 1 to 10.
In certain embodiments, wherein the number of known molecules in the simple sample ranges from 2 to 100, or 2 to 50, or 2 to 20, or 2 to 10.
For example, the number of known molecules in the simple sample may be 2 to 10, and the ratio of the number of molecules in the simple sample to the number of molecules in the sample to be tested may not exceed 1/2. In some embodiments, the number of known molecules in the simple sample may be 2, and the number of molecules in the sample to be tested may be 4,5,6,7 or more.
Since the value of α is related to the ratio of the number of known components to the number of unknown components, in some embodiments α may be determined from a simple model with the same ratio, as follows:
1. Calculate the total number of components in the complex sample by means such as principal component analysis and, from the number of known components, calculate the ratio of the number of known components to the number of unknown components;
2. Select a small number (e.g., 3) of known molecules and obtain their standard spectra, add a certain number of unknown molecules so that the ratio of known to unknown molecules equals that of the complex sample, and manually prepare solutions with different concentrations of the known molecules;
3. Unmix the Raman spectra of the solutions using different α values, take out the coefficients corresponding to the known molecules, establish a regression equation between the coefficients and the concentrations, and calculate its R-squared (R²) value; the α giving the highest R² is taken as the optimal weight for the complex sample.
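The α selection in the steps above amounts to a grid search scored by R². The sketch below assumes a straight-line regression of the mean coefficient of one target molecule against its prepared concentration; `select_alpha`, `r_squared` and the caller-supplied `unmix(V, W1, alpha)` function (returning the coefficient matrix H(1)) are hypothetical names introduced for illustration.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination of a straight-line fit y ~ a*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    total = y - y.mean()
    return 1.0 - np.sum(residual**2) / np.sum(total**2)

def select_alpha(spectra_by_concentration, concentrations, W1, target, alphas, unmix):
    """Return (best_alpha, scores), where scores maps alpha -> R^2.

    spectra_by_concentration : list of (m, n_k) matrices, one per prepared solution
    concentrations           : actual concentration of the target molecule in each
    target                   : row index of the target molecule in H1
    unmix                    : hypothetical callable, unmix(V, W1, alpha) -> H1
    """
    scores = {}
    for alpha in alphas:
        # mean coefficient of the target molecule over the spectra of each solution
        mean_coeffs = [unmix(V, W1, alpha)[target].mean()
                       for V in spectra_by_concentration]
        scores[alpha] = r_squared(concentrations, mean_coeffs)
    best = max(scores, key=scores.get)
    return best, scores
```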
Analysis method based on surface enhanced Raman spectrum
In another aspect, the application provides an analysis method based on surface-enhanced Raman spectroscopy, comprising the steps of: based on a surface-enhanced Raman spectroscopy (SERS) standard spectrum database, unmixing the measured spectra with a weighted non-negative matrix factorization algorithm (NMF-CLS), thereby obtaining the types and relative concentrations of the known molecules contained in the sample to be tested; the known molecules are molecules contained in the SERS standard spectrum database, which consists of the SERS standard spectra of different molecules;
Wherein the objective function of the NMF-CLS algorithm is set as follows:
wherein the spectral matrix is set to be an m × n matrix V, representing a total of n spectra, each consisting of m points; the m × r1 matrix W(1) represents the reference spectra of the known molecules arranged in columns, and the m × r2 matrix W(2) represents the spectra of the unknown molecules arranged in columns; the r1 × n matrix H(1) and the r2 × n matrix H(2) represent the relative concentrations corresponding to W(1) and W(2), respectively; r1 and r2 denote the numbers of known and unknown molecular species, respectively; α represents the weight set for the known molecules, and α is not less than 0; since the reference spectra W(1) of the known molecules are known, W(2), H(1) and H(2) are sought such that F in the equation is minimized, thereby obtaining the relative concentrations corresponding to the known molecules.
In certain embodiments, H(1), W(2) and H(2) are calculated in an iterative process.
In certain embodiments, the partial derivatives of F with respect to W(2), H(1) and H(2) are:
and the iterative formulas for W(2), H(1) and H(2) derived from the partial derivatives are:
H(1), W(2) and H(2) are updated iteratively according to the iterative formulas; iteration stops when the maximum number of iterations N is reached or F falls to the set threshold σ, and after iteration stops, H(1) is the final result of the relative concentrations of the known components.
In certain embodiments, wherein the maximum number of iterations N is not less than about 20, about 25, about 30, about 40, about 50, about 100, or about 200.
In certain embodiments, wherein the threshold σ does not exceed about 0.01, or about 0.001, or about 0.0001, or about 0.00001, or about 0.000001.
In certain embodiments, the calculation of H(1), W(2) and H(2) comprises:
1) Inputting the spectral matrix W(1) of the known components, the measured spectral matrix V, the maximum number of iterations N, and the threshold σ;
2) Randomly initializing the coefficient matrix H(1) of the known components, the spectral matrix W(2) of the unknown components, and its coefficient matrix H(2);
3) Iteratively updating H(1), W(2) and H(2) according to the iterative formulas;
4) Stopping iteration when the maximum number of iterations N is reached or F falls to the set threshold σ;
5) After iteration stops, H(1) is the final result of the relative concentrations of the known components.
In some embodiments, the detection conditions of the standard spectrum are the same as the detection conditions of the sample to be tested.
In certain embodiments, wherein SERS employs non-targeted broad-spectrum detection.
In some embodiments, the weight α is inversely related to the ratio of the number of known molecules (r1) to the number of unknown molecules (r2) contained in the sample to be tested.
In some embodiments, the method for setting the weight α includes:
1) Determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested;
2) Preparing a plurality of simple samples containing a few known molecules at different concentration gradients and a certain number of unknown molecules, wherein the ratio of the number of known molecules to the number of unknown molecules in the simple samples is equal to that of the sample to be tested;
3) Setting different weights α, unmixing the measured spectra of the simple samples with the NMF-CLS algorithm to obtain the coefficients corresponding to the known molecules, establishing a regression equation between the coefficients and the concentrations of the known molecules, and calculating the R-squared (R²) value; the α giving the highest R² is taken as the optimal weight for the sample to be tested.
In certain embodiments, the method for determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested comprises principal component analysis.
In certain embodiments, wherein the ratio of the number of molecules in the simple sample to the number of molecules in the sample to be tested is no more than about 1/2, or about 1/5, or about 1/10.
In certain embodiments, wherein the number of known molecules in the simple sample ranges from 1 to 100, or 1 to 50, or 1 to 20, or 1 to 10.
In certain embodiments, wherein the number of known molecules in the simple sample ranges from 2 to 100, or 2 to 50, or 2 to 20, or 2 to 10.
In some embodiments, creating the SERS standard spectrum database comprises: collecting a plurality of SERS spectral images of a given molecule, calculating the SERS average spectrum of that molecule, obtaining the SERS average spectra of other molecules in the same way, and incorporating them into the SERS standard spectrum database.
In certain embodiments, when a SERS spectral image of a molecule is acquired, the concentration of the molecule is between 0.1 mM and 10 mM.
In certain embodiments, the number of SERS spectral images acquired for a molecule is not less than about 10, about 20, about 50, about 100, or about 200.
In certain embodiments, the method further comprises normalizing the intensity of the SERS average spectrum obtained for each molecule.
In some embodiments, creating the SERS standard spectrum database comprises: collecting a plurality of SERS spectral images of a given molecule, calculating the SERS average spectrum of that molecule, and normalizing its intensity, the resulting spectrum being the SERS standard spectrum of the molecule; obtaining the SERS standard spectra of other molecules in the same way; and incorporating them into the SERS standard spectrum database.
In certain embodiments, creating the SERS standard spectrum database comprises: collecting a plurality of SERS spectral images of a given molecule, averaging them, and normalizing the intensity of the averaged spectrum to the range [0, 1], the resulting spectrum being the SERS standard spectrum of the molecule; obtaining the SERS standard spectra of other molecules in the same way; and incorporating them into the SERS standard spectrum database.
For example, creating the SERS standard spectrum database may include: obtaining the SERS average spectrum of a given molecule and normalizing its intensity to the range [0, 1], the resulting spectrum being the SERS standard spectrum of the molecule; obtaining the SERS standard spectra of other molecules in the same way; and incorporating them into the SERS standard spectrum database. The SERS average spectrum may be obtained from open-source data, or by collecting a plurality of standard SERS spectral images and averaging them.
In some embodiments, the method comprises collecting a plurality of SERS spectra of the sample to be tested, analyzing each SERS spectrum separately with the algorithm to obtain the coefficients H(1) of the known components, and then processing the coefficients (e.g., by averaging, summing, ANOVA and/or Student's t-test) to obtain the relative concentrations of the known molecules in the sample.
In certain embodiments, the number of spectra collected from the sample to be tested must be sufficient to ensure that substantially all of the molecular information in the sample has been collected.
In certain embodiments, methods for determining that the number of spectra is sufficient to have collected substantially all of the molecular information in the sample include, but are not limited to, comparison by Pearson coefficients.
In some embodiments, obtaining the Pearson coefficient comprises: taking the average of M spectra of the sample to be tested as the standard spectrum, drawing N spectra each time and averaging them, calculating the Pearson coefficient between this averaged spectrum and the standard spectrum, repeating the operation n times, and averaging the n Pearson coefficients to obtain the correlation coefficient corresponding to N spectra.
In certain embodiments, wherein M is about 50 to 500, or about 100 to 400, or about 200 to 300.
In certain embodiments, wherein n is about 3 to 30, or about 4 to 20, or about 5 to 10.
In certain embodiments, wherein the Pearson coefficient is not less than about 0.8, about 0.85, about 0.9, or about 0.95.
In certain embodiments, the number of spectra scanned from the sample to be tested is not less than about 20, or about 30, or about 40, or about 50.
In certain embodiments, the number of spectra scanned from the sample to be tested is about 20 to 200, or about 30 to 160, or about 40 to 120, or about 50 to 80.
In certain embodiments, the SERS spectra of the sample to be tested are scanned at a speed of about 1 to 5 s per spectrum.
For example, an analysis method based on surface-enhanced Raman spectroscopy may include: collecting a plurality of SERS spectra of the sample to be tested; based on a SERS standard spectrum database, analyzing each SERS spectrum separately with the NMF-CLS algorithm to obtain the coefficients of the known components of each SERS spectrum; processing the coefficients (e.g., by averaging or summing); and finally obtaining the relative concentrations of the known molecules in the sample. The known molecules are molecules contained in the SERS standard spectrum database, which consists of the standard spectra of different molecules. The number of spectra collected from the sample to be tested must be sufficient to ensure that substantially all of the molecular information in the sample has been collected (e.g., about 50, about 100, or about 200 spectra).
In certain embodiments, wherein the sample to be tested comprises a chemical sample or a biological sample.
In certain embodiments, wherein the sample to be tested comprises a liquid sample.
In certain embodiments, wherein the biological sample comprises a cell culture fluid, a cell supernatant, a cell lysate, blood, a blood-derived product (e.g., buffy coat, serum or plasma), lymph, urine, tears, saliva, cerebrospinal fluid, stool, synovial fluid, sputum, cells, organs or tissues.
For example, wherein the biological sample may be selected from the group consisting of: blood, plasma, urine, saliva, tears, and cerebrospinal fluid.
In certain embodiments, the molecules in the SERS standard spectrum database comprise metabolites.
In certain embodiments, the molecules in the SERS standard spectrum database comprise small-molecule metabolites.
Metabonomics processing or analysis methods
In another aspect, the present application provides a metabonomics data processing method, comprising: unmixing the spectral data of biological samples of the same type with a weighted non-negative matrix factorization algorithm to obtain the types of known molecules in the samples and the intervals of their relative concentrations, thereby obtaining a characteristic spectrum database of the biological samples.
In certain embodiments, the metabonomics data processing method further comprises: obtaining, in the same way, the types and relative concentration intervals of known molecules in other types of biological samples, and incorporating them into the characteristic spectrum database to obtain a characteristic spectrum database containing different types of biological samples.
In certain embodiments, the spectrum is a SERS spectrum.
For example, the biological sample of the same type may be serum or a cell culture fluid.
For example, the biological samples of the same type may be derived from different sources, such as serum samples from different populations (healthy vs. diseased) or cell culture fluids from different cell types (normal cells vs. tumor cells).
For example, the characteristic spectrum database may comprise a SERS characteristic spectrum database of serum samples of healthy or diseased persons.
In another aspect, the application provides a metabonomic analysis method, comprising: based on a standard spectrum database, unmixing the measured spectra of the sample to be tested with the NMF-CLS algorithm, thereby obtaining the types and relative concentrations of the metabolites contained in the sample to be tested, wherein the metabolites are molecules contained in the standard spectrum database.
In certain embodiments, wherein the spectrum of the sample to be tested is a broad spectrum SERS spectrum.
In certain embodiments, wherein the metabolite is a molecule contained within the SERS standard spectral database.
For example, the metabonomic analysis method may include: collecting SERS spectra of a plurality of samples to be tested, based on a SERS standard spectrum database, adopting an NMF-CLS algorithm to carry out algorithm analysis on each SERS spectrum independently, obtaining coefficients of known components of each SERS spectrum, then carrying out processing (such as averaging or summation), and finally obtaining analysis results of relative concentrations of known molecules (metabolite molecules) in the samples; the known molecules are molecules contained in a SERS standard spectrum database, and the SERS standard spectrum database consists of standard spectrums of different molecules; the collection of the spectral number of the sample to be measured is required to ensure that molecular information has been collected substantially throughout the sample to be measured (e.g., a spectral number of about 50, about 100, or about 200).
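As a purely illustrative sketch of the averaging or summation step described above, the per-spectrum coefficients of the known components could be collapsed as follows. The function name `summarize_coefficients`, its arguments and the molecule labels are hypothetical; `H1` is assumed to hold one column of known-molecule coefficients per acquired spectrum.

```python
import numpy as np

def summarize_coefficients(H1, molecule_names, how="mean"):
    """Collapse per-spectrum coefficients (r1 x n) into one value per known molecule."""
    H1 = np.asarray(H1, dtype=float)
    if how == "mean":
        values = H1.mean(axis=1)   # average over the n spectra
    elif how == "sum":
        values = H1.sum(axis=1)    # sum over the n spectra
    else:
        raise ValueError("how must be 'mean' or 'sum'")
    return dict(zip(molecule_names, values.tolist()))
```

Averaging keeps the result independent of how many spectra were acquired, whereas summation scales with the spectrum count; which is preferable depends on how the relative concentrations are compared afterwards.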
In certain embodiments, the method further comprises performing a relevant biomedical analysis based on the obtained types of metabolites and their relative concentrations.
In certain embodiments, the biomedical analysis comprises analyzing differential metabolite data.
In certain embodiments, the biomedical analysis comprises comparing the metabolite types and relative concentrations of the sample to be tested with a characteristic spectrum database.
In certain embodiments, the biomedical analysis further comprises classifying or staging the sample.
In certain embodiments, the spectral data comprises Raman spectral data and infrared spectral data.
In certain embodiments, the infrared spectral data comprises surface-enhanced infrared spectral data.
In certain embodiments, the Raman spectral data comprises surface-enhanced Raman spectral data.
In certain embodiments, the standard spectrum database comprises a Raman standard spectrum database and an infrared standard spectrum database.
In certain embodiments, the Raman standard spectrum database comprises a SERS standard spectrum database.
In another aspect, the application provides a method of determining a biomarker, comprising:
1) Obtaining spectral data of the sample-group samples and the control-group samples respectively, and, based on a standard spectrum database, unmixing the measured spectra using a weighted non-negative matrix factorization algorithm, thereby obtaining the types and relative concentrations of the known molecules contained in the sample-group samples and the control-group samples respectively, wherein the known molecules are molecules contained in the standard spectrum database;
2) Screening for differential molecules as biomarkers.
In certain embodiments, the differential molecule comprises a differential metabolite.
In certain embodiments, step 2) comprises cross-selecting a plurality of differential metabolites by analysis of variance (ANOVA) and logistic regression.
In certain embodiments, the ANOVA comprises performing a statistical analysis on the data of the different categories to find metabolites that differ statistically between the categories.
In certain embodiments, the logistic regression comprises classifying using the relative concentration data to find metabolites that contribute to distinguishing the data categories.
In some embodiments, the logistic regression employs L1 regularization, and a metabolite whose weight in the classification has an absolute value greater than 0 is considered to contribute to the classification.
In certain embodiments, the method further comprises validating the differential metabolites obtained.
In some embodiments, the validating includes a regression analysis between the actual concentrations in the sample and the coefficients obtained by unmixing with the weighted non-negative matrix factorization algorithm.
In certain embodiments, the validating comprises analyzing how well the differential metabolite fits the relevant physiology or pathology.
For example, the method of determining a biomarker may comprise:
1) Obtaining spectral data of the sample-group samples and the control-group samples respectively, with a plurality of spectra acquired for each sample; based on a standard spectrum database, analyzing each SERS spectrum separately with the NMF-CLS algorithm to obtain the coefficients of the known components of each SERS spectrum, and then processing the coefficients (for example by averaging or summation), thereby obtaining the types and relative concentrations of the known molecules contained in the sample-group samples and the control-group samples respectively, wherein the known molecules are molecules contained in the standard spectrum database; the number of spectra collected for each sample to be tested must be sufficient to ensure that substantially all molecular information in the sample has been collected (e.g., about 50, about 100, or about 200 spectra);
2) Performing statistical analysis on the data of the different categories by ANOVA to find metabolites that differ statistically between the categories; classifying with logistic regression on the relative concentration data to find metabolites that contribute to distinguishing the data categories, wherein the logistic regression employs L1 regularization and a metabolite whose weight has an absolute value greater than 0 is considered to contribute to the classification; and taking the intersection of the differential metabolites selected by both ANOVA and logistic regression as the biomarkers.
ANOVA and logistic regression screening
ANOVA performs statistical analysis on the data of the different categories to find metabolites that differ statistically between categories. Given data from the different categories as input, ANOVA returns a significance level, i.e. a p-value. The smaller the p-value, the stronger the evidence of a difference between groups; the difference between groups is considered significant when the p-value is less than 0.05. The present application calculates, for the relative concentration data of each metabolite, the p-value between the different categories, and retains the metabolites with a p-value less than 0.05 as metabolites with significant differences.
Logistic regression classifies the samples using the relative concentration data and finds metabolites that contribute to distinguishing the different categories. When a logistic regression algorithm is used for classification, a weight is assigned to the relative concentration data of each metabolite; this weight reflects the importance of the metabolite in the classification, and the larger its absolute value, the higher the importance. With L1 regularization, the algorithm sets the weights of metabolites of low importance to 0, so a metabolite whose weight has an absolute value greater than 0 is considered to contribute to the classification and is taken as a possible differential metabolite.
The intersection of the differential metabolites selected by both ANOVA and logistic regression is taken as the final metabolite screening result.
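The two-stage screening described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the application: the ANOVA p-values come from SciPy's `scipy.stats.f_oneway`, while the L1-regularized logistic regression is fitted with a simple proximal-gradient loop written here for self-containment (a swapped-in solver). The function names, the penalty strength `lam` and the iteration count are hypothetical choices, and the class labels in `y` are assumed to be 0 and 1.

```python
import numpy as np
from scipy.stats import f_oneway

def anova_select(X, y, p_cut=0.05):
    """Indices of metabolites whose one-way ANOVA p-value across classes is below p_cut."""
    groups = [X[y == label] for label in np.unique(y)]
    pvals = np.array([f_oneway(*[g[:, j] for g in groups]).pvalue
                      for j in range(X.shape[1])])
    return set(np.flatnonzero(pvals < p_cut).tolist())

def l1_logistic_select(X, y, lam=0.05, n_iter=2000):
    """Indices with non-zero weight in an L1-penalised logistic model (proximal gradient)."""
    n, d = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # assumes no constant column
    Xa = np.hstack([Xs, np.ones((n, 1))])
    step = 4.0 * n / np.linalg.norm(Xa, 2) ** 2    # 1 / Lipschitz bound of the loss gradient
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))    # predicted probability of class 1
        grad_w = Xs.T @ (p - y) / n
        grad_b = np.mean(p - y)
        w = w - step * grad_w
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft threshold (L1)
        b = b - step * grad_b                      # intercept is not penalised
    return set(np.flatnonzero(w != 0).tolist())

def screen_metabolites(X, y):
    """Intersection of the ANOVA and L1-logistic selections."""
    return anova_select(X, y) & l1_logistic_select(X, y)
```

Here `X` holds one row per sample and one column per metabolite relative concentration. Taking the intersection is conservative: a metabolite must both differ significantly between groups and carry a non-zero classification weight.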
Use of the same
In another aspect, the application provides a method of detecting the presence of a disease or disorder, or assessing the risk of developing a disease or disorder, the method comprising the steps of:
1) Obtaining a spectrum of an individual sample to be tested, and unmixing the spectrum obtained by the test by adopting weighted nonnegative matrix factorization (NMF-CLS) based on a standard spectrum database to obtain the type and the relative concentration of a known metabolite contained in the sample to be tested, wherein the known metabolite is a molecule contained in the standard spectrum database;
2) Comparing the relative concentration of the known metabolite to a normal interval; and
3) Determining whether the individual has, or is at risk of developing, a disease or disorder.
For example, the method of detecting the presence of a disease or disorder, or assessing the risk of occurrence of a disease or disorder, may comprise the steps of:
1) Obtaining a surface enhanced Raman spectrum of an individual sample to be detected, and unmixing the spectrum obtained by the test by adopting weighted nonnegative matrix factorization (NMF-CLS) based on an SERS standard database to obtain the type and the relative concentration of a known metabolite in the sample to be detected, wherein the known metabolite is a molecule contained in the SERS standard database;
2) Comparing the relative concentration of the known metabolite to a normal interval; and
3) Determining whether the individual has, or is at risk of developing, a disease or disorder.
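The comparison in step 2) can be illustrated with a short Python sketch. The function name `flag_out_of_range`, the metabolite labels and the interval values used with it are hypothetical; how flagged metabolites translate into a determination in step 3) is not specified here and is left open.

```python
def flag_out_of_range(relative_conc, normal_intervals):
    """Return metabolites whose relative concentration lies outside its normal interval."""
    flags = {}
    for name, value in relative_conc.items():
        low, high = normal_intervals[name]   # (lower bound, upper bound), both inclusive
        if value < low:
            flags[name] = "below"
        elif value > high:
            flags[name] = "above"
    return flags
```

For instance, with an assumed normal interval of (0.8, 1.2), a relative concentration of 0.5 would be flagged as "below".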
In another aspect, the application provides a method of determining the stage of a disease or disorder, the method comprising the steps of:
1) Obtaining a spectrum of an individual sample to be tested, and unmixing the spectrum obtained by the test by adopting weighted nonnegative matrix factorization (NMF-CLS) based on a standard spectrum database to obtain the type and the relative concentration of a known metabolite contained in the sample to be tested, wherein the known metabolite is a molecule contained in the standard spectrum database;
2) Comparing the relative concentration of the biomarker to a known stage level; and
3) Determining the stage or type of the disease or disorder.
In certain embodiments, wherein the disease or condition is selected from the group consisting of: infectious diseases, proliferative diseases, neurodegenerative diseases, cancer, psychological diseases, metabolic diseases, autoimmune diseases, sexually transmitted diseases, gastrointestinal diseases, pulmonary diseases, cardiovascular diseases, stress and fatigue related disorders, mycoses, pathogenic diseases and obesity related disorders.
For example, the method of determining the stage of a disease or disorder may comprise the steps of:
1) Obtaining a surface enhanced Raman spectrum of an individual sample to be detected, and unmixing the spectrum obtained by the test by adopting weighted nonnegative matrix factorization (NMF-CLS) based on an SERS standard database to obtain the type and the relative concentration of a known metabolite in the sample to be detected, wherein the known metabolite is a molecule contained in the SERS standard database;
2) Comparing the relative concentration of the biomarker to a known stage level; and
3) Determining the stage or type of the disease or disorder.
In certain embodiments, wherein the disease or condition is selected from the group consisting of: infectious diseases, proliferative diseases, neurodegenerative diseases, cancer, psychological diseases, metabolic diseases, autoimmune diseases, sexually transmitted diseases, gastrointestinal diseases, pulmonary diseases, cardiovascular diseases, stress and fatigue related disorders, mycoses, pathogenic diseases and obesity related disorders.
In another aspect, the application provides a method of cell or microorganism analysis, the method comprising the steps of:
1) Obtaining spectral data (such as a SERS spectrum) of a cell or microorganism sample to be tested, and, based on a standard spectrum database (such as a SERS standard spectrum database), unmixing the measured spectrum by weighted non-negative matrix factorization (NMF-CLS) to obtain the types and relative concentrations of the known metabolites contained in the sample to be tested, wherein the known metabolites are molecules contained in the standard spectrum database;
2) Comparing the relative concentration of the known metabolite to a normal interval; and
3) Determining the physiological or pathological state, physiological or pathological type of said cell or microorganism.
In certain embodiments, the method further comprises screening the identified cells or microorganisms to obtain the desired cell or microorganism type of interest.
In certain embodiments, the spectral data comprises Raman spectral data and infrared spectral data.
In certain embodiments, the infrared spectral data comprises surface-enhanced infrared spectral data.
In certain embodiments, the Raman spectral data comprises surface-enhanced Raman spectral data.
In certain embodiments, the standard spectrum database comprises a Raman standard spectrum database and an infrared standard spectrum database.
In certain embodiments, the Raman standard spectrum database comprises a SERS standard spectrum database.
In certain embodiments, the spectrum of the sample to be tested comprises a broad-spectrum SERS spectrum.
Computer-readable storage medium, apparatus and system
In another aspect, the application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the aforementioned method.
In certain embodiments, the computer readable storage medium further stores standard spectrum database data.
In certain embodiments, the standard spectrum database comprises a SERS standard spectrum database.
In another aspect, the application provides an apparatus comprising a memory storing a standard spectrum database and a computer program, and a processor implementing the steps of the aforementioned method when the computer program is executed.
In another aspect, the present application provides a spectral unmixing system based on a weighted non-negative matrix factorization algorithm, comprising: a solving and optimization module for solving the weighted non-negative matrix factorization algorithm by an iterative method to complete the unmixing of spectral data.
In some embodiments, the system further comprises a weight optimization module for determining the optimal value of the weight for the known molecules by linear regression.
In certain embodiments, the system further comprises an evaluation module for evaluating the unmixing results using the relative concentrations of the known molecules.
In another aspect, the present application provides the use of the aforementioned computer readable storage medium, the aforementioned device, or the aforementioned system in the manufacture of a device for the analysis of compounds and/or the classification and detection of microorganisms.
In another aspect, the application provides the use of the aforementioned computer-readable storage medium, the aforementioned device, or the aforementioned system in the manufacture of a device for metabonomic data processing and/or analysis.
In another aspect, the present application provides a metabonomic analysis device comprising: a data processing module for analyzing the surface-enhanced Raman spectral data of the sample to be tested to obtain the types and relative concentrations of the metabolites in the sample.
In some embodiments, the data processing module includes a solving and optimization module for solving the weighted non-negative matrix factorization algorithm by an iterative method to complete the unmixing of spectral data.
In some embodiments, the data processing module includes a weight optimization module for determining the optimal value of the weight for the known molecules by linear regression.
In certain embodiments, the data processing module includes an evaluation module for evaluating the unmixing results using the relative concentrations of the known molecules.
In certain embodiments, the evaluating comprises classifying the test sample using a differential metabolite classification test model.
In some embodiments, the device further includes a spectrum detection module, configured to perform spectrum detection on the sample to be detected, and obtain spectrum data of the sample to be detected.
In certain embodiments, the device further comprises a test sample collection module for collecting a test sample based on a metabonomics method.
For example, the metabonomic analysis device may include:
1) A data processing module for analyzing the surface-enhanced Raman spectral data of the sample to be tested to obtain the types and relative concentrations of the metabolites in the sample; the data processing module comprises:
i) a solving and optimization module for solving the weighted non-negative matrix factorization algorithm by an iterative method to complete the unmixing of spectral data; ii) a weight optimization module for determining the optimal value of the weight for the known molecules by linear regression; iii) an evaluation module for evaluating the unmixing results using the relative concentrations of the known molecules;
2) A spectrum detection module for performing spectral detection on the sample to be tested and obtaining its spectral data;
3) A sample collection module for collecting the sample to be tested based on a metabonomics method.
The application also discloses the following embodiments:
1. A spectral image unmixing method based on a weighted non-negative matrix factorization algorithm (NMF-CLS), comprising: based on a standard spectrum database, unmixing the measured spectrum using the NMF-CLS algorithm to obtain the types and relative concentrations of the known molecules contained in the sample to be tested; the known molecules are molecules contained in the standard spectrum database, and the standard spectrum database consists of standard spectra of different molecules;
wherein the objective function of the NMF-CLS algorithm is set as follows:
wherein the spectral matrix V is an m × n matrix representing n spectra in total, each consisting of m points; the m × r_1 matrix W^(1) represents the reference spectra of the known molecules arranged in columns, and the m × r_2 matrix W^(2) represents the spectra of the unknown molecules arranged in columns; the r_1 × n matrix H^(1) and the r_2 × n matrix H^(2) represent the relative concentrations corresponding to W^(1) and W^(2), respectively; r_1 and r_2 denote the numbers of known and unknown molecular species, respectively; α represents the weight set for the known molecules, α ≥ 0; since the reference spectra W^(1) of the known molecules are known, W^(2), H^(1) and H^(2) are sought such that F in the equation is minimized, which yields the relative concentrations corresponding to the known molecules.
2. The method of embodiment 1, wherein H^(1), W^(2) and H^(2) are calculated by an iterative process.
3. The method of any of embodiments 1-2, wherein the partial derivatives of F with respect to W^(2), H^(1) and H^(2) are:
and the iterative formulas for W^(2), H^(1) and H^(2) derived from the partial derivatives are:
H^(1), W^(2) and H^(2) are updated iteratively according to the iterative formulas; iteration stops when the maximum number of iterations N is reached or F falls to the set threshold σ; after iteration stops, H^(1) is the final result of the relative concentrations of the known components.
4. The method of embodiment 3, wherein the maximum number of iterations N is not less than about 20, about 25, about 30, about 40, about 50, about 100, or about 200.
5. The method of embodiment 3, wherein the threshold σ does not exceed about 0.01, or about 0.001, or about 0.0001, or about 0.00001, or about 0.000001.
6. The method of any one of embodiments 1-5, wherein the calculation of H^(1), W^(2) and H^(2) comprises:
1) Inputting the known-component matrix W^(1), the measured spectral matrix V, the maximum number of iterations N and the threshold σ;
2) Randomly initializing the coefficient matrix H^(1) of the known components, the spectral matrix W^(2) of the unknown components and the coefficient matrix H^(2) of the unknown components;
3) Iteratively updating H^(1), W^(2) and H^(2) according to the iterative formulas;
4) Stopping the iteration when the maximum number of iterations N is reached or F falls to the set threshold σ;
5) After the iteration stops, H^(1) is the final result of the relative concentrations of the known components.
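The five steps above can be sketched in Python with NumPy. The objective function and iterative formulas of the application are not reproduced in this text, so the sketch rests on an assumed objective that is consistent with the symbol definitions given, F = ||V − W^(1)H^(1) − W^(2)H^(2)||² + α·||V − W^(1)H^(1)||², and on Lee-Seung-style multiplicative updates derived for that assumed objective. The function name `nmf_cls`, its parameters and the small constant `eps` are hypothetical and may differ from the actual method.

```python
import numpy as np

def nmf_cls(V, W1, r2, alpha=0.0, max_iter=200, sigma=1e-6, seed=0):
    """Weighted NMF with fixed known spectra W1 (assumed objective, see text above)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    r1 = W1.shape[1]
    eps = 1e-12                      # guards against division by zero
    # Step 2: random non-negative initialisation
    H1 = rng.random((r1, n))
    W2 = rng.random((m, r2))
    H2 = rng.random((r2, n))

    def objective():
        full = V - W1 @ H1 - W2 @ H2
        known = V - W1 @ H1
        return np.sum(full ** 2) + alpha * np.sum(known ** 2)

    history = [objective()]
    for _ in range(max_iter):        # Step 3: multiplicative updates
        R = W1 @ H1 + W2 @ H2
        H1 *= ((1 + alpha) * (W1.T @ V)) / (W1.T @ R + alpha * (W1.T @ W1 @ H1) + eps)
        R = W1 @ H1 + W2 @ H2
        W2 *= (V @ H2.T) / (R @ H2.T + eps)
        R = W1 @ H1 + W2 @ H2
        H2 *= (W2.T @ V) / (W2.T @ R + eps)
        history.append(objective())
        if history[-1] <= sigma:     # Step 4: stop once F falls to the threshold
            break
    return H1, W2, H2, history       # Step 5: H1 holds the known-molecule coefficients
```

Multiplicative updates keep all factors non-negative as long as V and W^(1) are non-negative, which is why no explicit projection step appears in the loop.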
7. The method according to any one of embodiments 1 to 6, wherein the detection conditions of the standard spectrum are the same as the detection conditions of the sample to be tested.
8. The method according to any one of embodiments 1-7, wherein the weight α is inversely related to the ratio of the number of known molecules (r_1) to the number of unknown molecules (r_2) contained in the sample to be tested.
9. The method according to any one of embodiments 1-8, wherein when W^(2) and H^(2) do not exist, the weight α is set to 0, and the objective function of the NMF-CLS algorithm is that of classical least squares.
10. The method according to any one of embodiments 1 to 8, wherein when the ratio of the number of known molecules (r_1) to the number of unknown molecules (r_2) contained in the sample to be tested is not less than 1, the weight α is set to 0, and the objective function of the NMF-CLS algorithm is set to:
11. The method according to any one of embodiments 1 to 8, wherein the weight α is not 0 when the ratio of the number of known molecules (r_1) to the number of unknown molecules (r_2) contained in the sample to be tested is less than 1, and the objective function of the NMF-CLS algorithm is set to:
12. The method according to any one of embodiments 1 to 11, wherein the method of setting the weight α includes:
1) Determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested;
2) Preparing a plurality of simple samples containing a small number of known molecules at different concentration gradients and a certain number of unknown molecules, wherein the ratio of the number of known molecules to the number of unknown molecules in the simple samples is equal to that in the sample to be tested;
3) Setting different weights α, unmixing the measured spectra of the simple samples with the NMF-CLS algorithm to obtain the coefficients corresponding to the known molecules, establishing a regression equation between the coefficients and the actual concentrations of the known molecules, and calculating the R² value; the α giving the highest R² is taken as the optimal weight value for the sample to be tested.
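Step 3) amounts to a grid search over α scored by R². A minimal Python sketch follows, assuming a caller-supplied `unmix(V, W1, alpha)` function that returns the r_1 × n coefficient matrix of the known molecules. `unmix`, `select_alpha`, `r_squared` and the averaging of R² over molecules are hypothetical choices made for illustration.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the least-squares line y ~ slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - np.mean(y)) ** 2)

def select_alpha(alphas, unmix, V, W1, true_conc):
    """Pick the alpha whose unmixed coefficients best track the actual concentrations."""
    scores = {}
    for a in alphas:
        H1 = unmix(V, W1, a)         # r1 x n coefficients of the known molecules
        scores[a] = float(np.mean([r_squared(H1[k], true_conc[k])
                                   for k in range(H1.shape[0])]))
    best = max(scores, key=scores.get)
    return best, scores
```

Here `true_conc` has the same shape as the coefficient matrix, with one row of actual concentrations per known molecule across the simple samples.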
13. The method of embodiment 12, wherein the method of determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested comprises a principal component analysis method.
14. The method of any of embodiments 12-13, wherein the ratio of the number of molecules in the simple sample to the number of molecules in the sample to be tested is no more than about 1/2, or about 1/5, or about 1/10.
15. The method of any one of embodiments 12-14, wherein the number of known molecules in the simple sample ranges from 1 to 100, or 1 to 50, or 1 to 20, or 1 to 10.
16. The method of any one of embodiments 12-15, wherein the number of known molecules in the simple sample ranges from 2 to 100, or 2 to 50, or 2 to 20, or 2 to 10.
17. The method of any one of embodiments 1-16, wherein establishing the standard spectrum database comprises: collecting a plurality of spectral images of a given molecule and calculating the average spectrum of the molecule; similarly obtaining the average spectra of other molecules; and incorporating the average spectra into a database to obtain the standard spectrum database.
18. The method of embodiment 17, wherein the concentration of a molecule is 0.1mM-10mM when the spectral image of the molecule is acquired.
19. The method of any of embodiments 17-18, wherein the number of spectral images acquired for a molecule is not less than about 10, about 20, about 50, about 100, or about 200.
20. The method of any one of embodiments 17-19, further comprising normalizing the intensity of the average spectrum of the obtained molecules.
21. The method of any of embodiments 17-20, wherein establishing the standard spectrum database comprises: collecting a plurality of spectral images of a given molecule, calculating the average spectrum of the molecule and normalizing its intensity, the resulting spectrum being the standard spectrum of the molecule; similarly obtaining the standard spectra of other molecules; and incorporating the standard spectra into a database to obtain the standard spectrum database.
22. The method of any one of embodiments 17-21, wherein establishing the standard spectrum database comprises: collecting a plurality of spectral images of a given molecule, averaging the spectral images obtained and normalizing the intensity of the averaged spectrum to the range [0,1], the resulting spectrum being the standard spectrum of the molecule; similarly obtaining the standard spectra of other molecules; and incorporating the standard spectra into a database to obtain the standard spectrum database.
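The database construction described in embodiments 17 to 22 can be sketched in Python as follows. Min-max scaling is used here as one possible way of normalizing to [0,1]; the text does not state which normalization is applied, so that choice, like the names `standard_spectrum` and `build_database`, is an assumption.

```python
import numpy as np

def standard_spectrum(spectra):
    """Average repeated spectra (rows) of one molecule and rescale intensity to [0, 1]."""
    mean = np.asarray(spectra, dtype=float).mean(axis=0)
    span = mean.max() - mean.min()
    if span == 0:
        raise ValueError("a flat spectrum cannot be normalised")
    return (mean - mean.min()) / span   # min-max scaling (assumed normalisation)

def build_database(spectra_by_molecule):
    """Stack standard spectra as columns of W1 (m x r1), keeping the molecule order."""
    names = list(spectra_by_molecule)
    W1 = np.column_stack([standard_spectrum(spectra_by_molecule[k]) for k in names])
    return names, W1
```

The returned matrix has the column layout of the reference-spectrum matrix W^(1) used by the unmixing step, with one column per known molecule.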
23. The method of any of embodiments 17-22, comprising collecting a plurality of spectra of the sample to be tested, analyzing each spectrum separately with the algorithm to obtain the coefficients of the different known components, then processing the coefficients, and finally obtaining the relative concentrations of the different known molecules in the sample.
24. The method of embodiment 23, wherein the processing comprises: averaging, summation, ANOVA, and/or Student's t-test.
25. The method of embodiment 23, wherein the number of spectra collected from the sample to be tested is sufficient to ensure that substantially all of the molecular information in the sample has been collected.
26. The method of embodiment 25, wherein the number of spectra needed to collect substantially all of the molecular information in the sample to be tested is determined by methods including, but not limited to, comparison of Pearson coefficients.
27. The method of embodiment 26, wherein obtaining the Pearson coefficient comprises: taking the mean of M spectra of the sample to be tested as a standard spectrum; drawing N spectra at a time, averaging them and calculating the Pearson coefficient between this average and the standard spectrum; repeating the operation n times; and averaging the n Pearson coefficients to obtain the correlation coefficient corresponding to N spectra.
28. The method of embodiment 27, wherein the M is about 50 to 500, or about 100 to 400, or about 200 to 300.
29. The method of embodiment 27, wherein n is about 3 to 30, or about 4 to 20, or about 5 to 10.
30. The method of any of embodiments 26-29, wherein the Pearson coefficient is not less than about 0.8, about 0.85, about 0.9, or about 0.95.
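The Pearson-coefficient check of embodiments 26 to 30 can be sketched in Python as follows, reading N as the number of spectra averaged per draw and n as the number of repeated draws (an interpretation of the text). Drawing without replacement, the function name `mean_pearson` and the `seed` parameter are assumptions.

```python
import numpy as np

def mean_pearson(spectra, N, n, seed=0):
    """Average Pearson r between the mean of N random spectra and the mean of all M spectra."""
    spectra = np.asarray(spectra, dtype=float)     # shape (M, m): M spectra of m points
    rng = np.random.default_rng(seed)
    reference = spectra.mean(axis=0)               # mean of all M spectra
    coeffs = []
    for _ in range(n):
        idx = rng.choice(len(spectra), size=N, replace=False)
        subset_mean = spectra[idx].mean(axis=0)
        coeffs.append(np.corrcoef(subset_mean, reference)[0, 1])
    return float(np.mean(coeffs))
```

One way to use it is to increase N until the returned value reaches the chosen cut-off (for example 0.9) and take that N as the number of spectra to acquire per sample.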
31. The method of any one of embodiments 1-30, wherein the sample to be tested comprises a chemical sample or a biological sample.
32. The method of any one of embodiments 1-31, wherein the sample to be tested comprises a liquid sample.
33. The method of any of embodiments 1-32, wherein the spectral image comprises an infrared spectrum and a Raman spectrum.
34. The method of embodiment 33, wherein the infrared spectrum comprises a surface enhanced infrared spectrum.
35. The method of embodiment 33, wherein the Raman spectrum comprises a surface-enhanced Raman spectrum.
36. The method of embodiment 35, wherein the surface-enhanced Raman spectrum is a broad-spectrum surface-enhanced Raman spectrum.
37. An analysis method based on surface-enhanced Raman spectroscopy, comprising: based on a surface-enhanced Raman spectroscopy (SERS) standard spectrum database, unmixing the measured spectrum using a weighted non-negative matrix factorization algorithm (NMF-CLS) to obtain the types and relative concentrations of the known molecules contained in the sample to be tested; the known molecules are molecules contained in the SERS standard spectrum database, and the SERS standard spectrum database consists of SERS standard spectra of different molecules;
wherein the objective function of the NMF-CLS algorithm is set as follows:
wherein the spectral matrix V is an m × n matrix representing n spectra in total, each consisting of m points; the m × r_1 matrix W^(1) represents the reference spectra of the known molecules arranged in columns, and the m × r_2 matrix W^(2) represents the spectra of the unknown molecules arranged in columns; the r_1 × n matrix H^(1) and the r_2 × n matrix H^(2) represent the relative concentrations corresponding to W^(1) and W^(2), respectively; r_1 and r_2 denote the numbers of known and unknown molecular species, respectively; α represents the weight set for the known molecules, α ≥ 0; since the reference spectra W^(1) of the known molecules are known, W^(2), H^(1) and H^(2) are sought such that F in the equation is minimized, which yields the relative concentrations of the known molecules.
38. The method of embodiment 37, wherein H^(1), W^(2) and H^(2) are calculated by an iterative process.
39. The method of any of embodiments 37-38, wherein the partial derivatives of F with respect to W^(2), H^(1) and H^(2) are:
and the iterative formulas for W^(2), H^(1) and H^(2) derived from the partial derivatives are:
H^(1), W^(2) and H^(2) are updated iteratively according to the iterative formulas; iteration stops when the maximum number of iterations N is reached or F falls to the set threshold σ; after iteration stops, H^(1) is the final result of the relative concentrations of the known components.
40. The method of embodiment 39, wherein the maximum number of iterations N is not less than about 20, about 25, about 30, about 40, about 50, about 100, or about 200.
41. The method of embodiment 39, wherein the threshold σ does not exceed about 0.01, or about 0.001, or about 0.0001, or about 0.00001, or about 0.000001.
42. The method of any one of embodiments 37-41, wherein the calculation of H^(1), W^(2) and H^(2) comprises:
1) Inputting the known-component matrix W^(1), the measured spectral matrix V, the maximum number of iterations N and the threshold σ;
2) Randomly initializing the coefficient matrix H^(1) of the known components, the spectral matrix W^(2) of the unknown components and the coefficient matrix H^(2) of the unknown components;
3) Iteratively updating H^(1), W^(2) and H^(2) according to the iterative formulas;
4) Stopping the iteration when the maximum number of iterations N is reached or F falls to the set threshold σ;
5) After the iteration stops, H^(1) is the final result of the relative concentrations of the known components.
43. The method according to any one of embodiments 37-42, wherein the detection conditions of the standard spectrum are the same as the detection conditions of the sample to be tested.
44. The method of any of embodiments 37-43, wherein SERS employs non-targeted broad spectrum detection.
45. The method according to any one of embodiments 37-44, wherein the weight α is inversely related to the ratio of the number of known molecules (r_1) to the number of unknown molecules (r_2) contained in the sample to be tested.
46. The method according to any one of embodiments 37 to 45, wherein the method of setting the weight α includes:
1) Determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested;
2) Preparing a plurality of simple samples containing a small number of known molecules at different concentration gradients and a certain number of unknown molecules, wherein the ratio of the number of known molecules to the number of unknown molecules in the simple samples is equal to that in the sample to be tested;
3) Setting different weights α, unmixing the measured spectra of the simple samples with the NMF-CLS algorithm to obtain the coefficients corresponding to the known molecules, establishing a regression equation between the coefficients and the concentrations of the known molecules, and calculating the R² value; the α giving the highest R² is taken as the optimal weight value for the sample to be tested.
47. The method of embodiment 46, wherein the method of determining the ratio of the number of known molecules to the number of unknown molecules in the sample to be tested comprises a principal component analysis method.
48. The method of any of embodiments 46-47, wherein the ratio of the number of molecules in the simple sample to the number of molecules in the test sample is no more than about 1/2, or about 1/5, or about 1/10.
49. The method of any one of embodiments 46-48, wherein the number of known molecules in the simple sample ranges from 1 to 100, or 1 to 50, or 1 to 20, or 1 to 10.
50. The method of any one of embodiments 46-49, wherein the number of known molecules in the simple sample ranges from 2 to 100, or 2 to 50, or 2 to 20, or 2 to 10.
51. The method of any one of embodiments 37-50, wherein creating the standard spectral database comprises: collecting a plurality of SERS spectrum images of a given molecule and calculating the SERS average spectrum of the molecule; obtaining the SERS average spectra of other molecules in the same way; and integrating them to obtain the SERS standard spectrum database.
52. The method of embodiment 51, wherein the concentration of the molecule is 0.1 mM-10 mM when a SERS spectrum image of the molecule is acquired.
53. The method of any of embodiments 51-52, wherein the number of SERS spectral images of a molecule is not less than about 10, about 20, about 50, about 100, or about 200.
54. The method of any of embodiments 51-53, further comprising normalizing the intensity of the SERS average spectrum of the obtained molecule.
55. The method of any one of embodiments 51-54, wherein creating the standard spectral database comprises: collecting a plurality of SERS spectrum images of a given molecule, calculating the SERS average spectrum of the molecule and normalizing its intensity, the resulting spectrum being the SERS standard spectrum of the molecule; obtaining the SERS standard spectra of other molecules in the same way; and integrating them to obtain the SERS standard spectrum database.
56. The method of any of embodiments 51-55, wherein creating the SERS standard spectral database comprises: collecting a plurality of SERS spectrum images of a given molecule, averaging them, and normalizing the intensity of the average spectrum to the range [0,1], the resulting spectrum being the SERS standard spectrum of the molecule; obtaining the SERS standard spectra of other molecules in the same way; and integrating them to obtain the SERS standard spectrum database.
57. The method of any of embodiments 37-56, comprising collecting SERS spectra of a plurality of samples to be tested, performing the algorithmic analysis on each SERS spectrum separately to obtain the coefficients H^(1) of the known components, and then processing the coefficients to obtain the relative concentrations of the known molecules in the sample.
58. The method of embodiment 57, wherein the processing comprises: averaging, summation, ANOVA, and/or Student's t-test.
59. The method of any of embodiments 37-58, wherein the number of spectra collected from the sample to be tested is sufficient to ensure that substantially all molecular information in the sample to be tested has been collected.
60. The method of embodiment 59, wherein determining the number of spectra at which the molecular information in the sample to be tested has been substantially collected includes, but is not limited to, comparison by Pearson coefficients.
61. The method of embodiment 60, wherein obtaining the Pearson coefficients comprises: taking the average of M spectra of the sample to be tested as a standard spectrum; taking out N spectra each time, averaging them, and calculating the Pearson coefficient between this average and the standard spectrum; repeating the operation n times; and averaging the n Pearson coefficients to obtain the correlation coefficient corresponding to N spectra.
62. The method of embodiment 61, wherein the M is about 50 to 500, or about 100 to 400, or about 200 to 300.
63. The method of embodiment 61, wherein the n is about 3 to 30, or about 4 to 20, or about 5 to 10.
64. The method of any of embodiments 60-63, wherein the Pearson coefficient is not less than about 0.8, about 0.85, about 0.9, or about 0.95.
65. The method of any of embodiments 59-64, wherein the sample to be tested is scanned for a number of spectra that is not less than about 20, or about 30, or about 40, or about 50.
66. The method of any of embodiments 59-65, wherein the number of spectra of the scanned sample to be tested is about 20 to 200, or about 30 to 160, or about 40 to 120, or about 50 to 80.
67. The method of any of embodiments 37-66, wherein the speed of scanning the SERS spectrum of the sample under test is about 1-5 s per spectrum.
68. The method of any one of embodiments 37-67, wherein the sample to be tested comprises a chemical sample or a biological sample.
69. The method of any one of embodiments 37-68, wherein the sample to be tested comprises a liquid sample.
70. The method of embodiment 68, wherein the biological sample comprises a cell culture fluid, a cell supernatant, a cell lysate, blood, a blood-derived product, lymph, urine, tears, saliva, cerebrospinal fluid, stool, synovial fluid, sputum, a cell, an organ, or a tissue.
71. The method of any of embodiments 37-70, wherein the molecules in the SERS standard database comprise a metabolite.
72. The method of embodiment 71, wherein the molecules in the SERS standard database comprise small molecule metabolites.
73. A metabonomic data processing method, comprising: unmixing the spectrum data of biological samples of the same type with a weighted non-negative matrix factorization algorithm to obtain the types of known molecules in the samples and the intervals of their relative concentrations, thereby obtaining a characteristic spectrum database of the biological samples.
74. The method of embodiment 73, further comprising: obtaining, in the same way, the types and relative concentration intervals of known molecules in other types of biological samples, and incorporating them into the characteristic spectrum database to obtain a characteristic spectrum database covering different types of biological samples.
75. A metabonomic analysis method, comprising: based on a standard spectrum database, unmixing the measured spectrum of the sample to be tested with the NMF-CLS algorithm to obtain the types and relative concentrations of the metabolites contained in the sample to be tested, wherein the metabolites are molecules contained in the standard spectrum database.
76. The method of embodiment 75, further comprising performing a relevant biomedical analysis based on the obtained species of metabolite and its relative concentration.
77. The method of embodiment 76, wherein the biomedical analysis comprises analyzing differential metabolite data.
78. The method of embodiment 76, wherein the biomedical analysis comprises comparing the metabolite species and relative concentrations of the test sample to a characteristic spectrum database.
79. The method of embodiment 76, wherein the biomedical analysis further comprises classifying or staging the sample.
80. A method of determining a biomarker, comprising:
1) Obtaining spectrum data of sample-group samples and control-group samples, respectively, and unmixing the measured spectra with a weighted non-negative matrix factorization algorithm based on a standard spectrum database to obtain the types and relative concentrations of the known molecules contained in the sample-group samples and in the control-group samples, wherein the known molecules are molecules contained in the standard spectrum database;
2) Screening for differential molecules as biomarkers.
81. The method of embodiment 80, wherein the differential molecule comprises a differential metabolite.
82. The method of embodiment 80, wherein said step 2) comprises selecting a plurality of differential metabolites by taking the intersection of the results of ANOVA and logistic regression.
83. The method of embodiment 82, wherein the ANOVA analysis comprises performing a statistical analysis on the different categories of data to find metabolites in which statistical differences occur between the different categories.
84. The method of embodiment 82, wherein the logistic regression comprises classifying using the relative concentration data to find metabolites that contribute to distinguishing categories of data.
85. The method of embodiment 84, wherein the logistic regression employs L1 regularization, and a metabolite whose calculated weight has an absolute value greater than 0 is considered to contribute to the classification.
86. The method of embodiment 80, further comprising validating the differential metabolite obtained.
87. The method of embodiment 86, wherein the validating comprises regression analysis of the actual concentration of the sample with coefficients unmixed by a weighted non-negative matrix factorization algorithm.
88. The method of embodiment 86, wherein the validating comprises analyzing whether the differential metabolite is consistent with known physiology or pathology.
89. A method of detecting the presence of a disease or disorder, or assessing the risk of occurrence of a disease or disorder, the method comprising the steps of:
1) Obtaining a spectrum of an individual sample to be tested, and unmixing the spectrum obtained by the test by adopting weighted nonnegative matrix factorization (NMF-CLS) based on a standard spectrum database to obtain the type and the relative concentration of a known metabolite contained in the sample to be tested, wherein the known metabolite is a molecule contained in the standard spectrum database;
2) Comparing the relative concentration of the known metabolite to a normal interval; and
3) Determining whether the individual has, or is at risk of developing, a disease or disorder.
90. A method of determining the stage of a disease or disorder, the method comprising the steps of:
1) Obtaining a spectrum of an individual sample to be tested, and unmixing the spectrum obtained by the test by adopting weighted nonnegative matrix factorization (NMF-CLS) based on a standard spectrum database to obtain the type and the relative concentration of a known metabolite contained in the sample to be tested, wherein the known metabolite is a molecule contained in the standard spectrum database;
2) Comparing the relative concentration of the biomarker to a known stage level; and
3) Determining the stage or type of the disease or disorder.
91. The method of any of embodiments 89-90, wherein the disease or disorder is selected from the group consisting of: infectious diseases, proliferative diseases, neurodegenerative diseases, cancer, psychological diseases, metabolic diseases, autoimmune diseases, sexually transmitted diseases, gastrointestinal diseases, pulmonary diseases, cardiovascular diseases, stress and fatigue related disorders, mycoses, pathogenic diseases and obesity related disorders.
92. A method of cell or microorganism analysis, the method comprising the steps of:
1) Obtaining spectrum data of a cell or microorganism sample to be tested, and unmixing the measured spectrum by weighted non-negative matrix factorization (NMF-CLS) based on a standard spectrum database to obtain the types and relative concentrations of the known metabolites contained in the sample to be tested, wherein the known metabolites are molecules contained in the standard spectrum database;
2) Comparing the relative concentration of the known metabolite to a normal interval; and
3) Determining the physiological or pathological state, physiological or pathological type of said cell or microorganism.
93. The method of embodiment 92, further comprising further screening the identified cells or microorganisms to obtain a desired target cell or microorganism type.
94. The method of any of embodiments 73-93, wherein the spectral data comprises Raman spectral data and infrared spectral data.
95. The method of embodiment 94, wherein the infrared spectral data comprises surface enhanced infrared spectral data.
96. The method of embodiment 94, wherein the Raman spectral data comprises surface enhanced Raman spectral data.
97. The method of any of embodiments 73-93, wherein the standard spectral database comprises a Raman spectrum standard database and an infrared spectrum standard database.
98. The method of embodiment 97, wherein the Raman spectrum standard database comprises a SERS standard spectrum database.
99. The method of any of embodiments 93-98, wherein the spectrum of the sample to be tested comprises a broad spectrum SERS spectrum.
100. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of embodiments 1 to 99.
101. The computer readable storage medium of embodiment 100, further having stored thereon standard spectrum database data.
102. The computer readable storage medium of embodiment 101, wherein the standard spectrum database comprises a SERS standard spectrum database.
103. An apparatus comprising a memory storing a standard spectrum database and a computer program, and a processor implementing the steps of the method of any one of embodiments 1 to 99 when the computer program is executed.
104. A spectral unmixing system based on a weighted non-negative matrix factorization algorithm, comprising: a solution optimization module for solving the weighted non-negative matrix factorization algorithm by an iterative method to complete spectral data unmixing.
105. The system of embodiment 104, further comprising a weight optimization module for solving for the weight of the known molecules by linear regression to determine the optimal weight value.
106. The system of embodiment 104, further comprising an evaluation module for evaluating the unmixing results using the relative concentrations of known molecules.
107. Use of the computer readable storage medium of any one of embodiments 100-102, the device of embodiment 103, or the system of any one of embodiments 104-106 in the manufacture of a device for compound analysis and/or for the classification and detection of microorganisms.
108. Use of the computer readable storage medium of any one of embodiments 100-102, the device of embodiment 103, or the system of any one of embodiments 104-106 in the preparation of a device for metabonomic data processing and/or analysis.
109. A metabonomic analysis device, comprising: a data processing module for analyzing the spectrum data of the sample to be tested to obtain the types of the metabolites in the sample and their relative concentrations.
110. The apparatus of embodiment 109 wherein the data processing module includes a solution optimization module for solving a weighted non-negative matrix factorization algorithm using an iterative method to complete spectral data unmixing.
111. The apparatus of embodiment 109 wherein the data processing module includes a weight optimization module for solving for the weight of the known molecules by linear regression to determine the optimal weight value.
112. The apparatus of embodiment 109 wherein the data processing module comprises an evaluation module for evaluating the unmixing results using the relative concentrations of known molecules.
113. The device of embodiment 112, wherein the evaluating comprises classifying the test sample with a differential metabolite classification test model.
114. The apparatus according to embodiment 109, further comprising a spectrum detection module configured to perform spectrum detection on the sample to be detected, and obtain spectrum data of the sample to be detected.
115. The device of embodiment 109, further comprising a sample collection module for collecting the sample based on a metabonomics method.
Without intending to be limited by any theory, the following examples are meant to illustrate the methods and uses of the present application and the like and are not intended to limit the scope of the application.
Examples
Example 1 SERS standard database creation
The SERS standard database contains the SERS spectra of 89 metabolite molecules (Fig. 2). The spectra of 57 molecules are taken from open-source data in the literature; the other 32 were measured in our laboratory by SERS testing of purchased metabolite molecular standards (purity of 98% or higher), as follows:
(1) Dissolve the metabolite molecule in water at a given concentration (0.1 mM-10 mM) and mix it with silver nanoparticles in a given proportion;
(2) Measure the SERS spectrum of the mixed sample; the test parameters may be: 638 nm laser, integration time of 1 second, 201 Raman spectra acquired in total;
(3) To prevent intensity differences between standard spectra from making the unmixed coefficients inconsistent, average the 201 spectra obtained from the sample and normalize the intensity of the average spectrum to the interval [0,1]; the resulting spectrum is the standard SERS spectrum of the metabolite molecule and is incorporated into the SERS standard database;
(4) Repeat the same procedure for the other purchased metabolite molecular standards;
(5) The SERS standard database can be continuously extended.
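The averaging and normalization in steps (2) to (4) can be sketched as follows. This is a minimal illustration, not the applicant's code: the function names `standard_spectrum` and `build_database` are hypothetical, and min-max scaling is an assumption about how the intensity is mapped onto [0,1] (dividing by the maximum alone would be another reading).

```python
import numpy as np

def standard_spectrum(spectra):
    """Average repeated SERS spectra of one molecule and rescale to [0, 1].

    spectra: array of shape (n_spectra, n_wavenumbers), e.g. 201 rows.
    Min-max scaling is an assumption; the text only states the target range.
    """
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=0)
    span = mean.max() - mean.min()
    if span == 0:
        raise ValueError("a flat spectrum cannot be normalized")
    return (mean - mean.min()) / span

def build_database(measurements):
    """Stack the standard spectra of several molecules column by column.

    measurements: dict mapping molecule name -> (n_spectra, n_wavenumbers).
    Returns the sorted molecule names and W1 of shape (n_wavenumbers, n_molecules).
    """
    names = sorted(measurements)
    W1 = np.column_stack([standard_spectrum(measurements[name]) for name in names])
    return names, W1
```

Because every column of `W1` spans the same [0,1] range, coefficients obtained for different molecules are on a comparable scale, which is the stated purpose of step (3).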
Example 2 Model verification 1
2.1 Model verification
2.1.1 Establishment of the SERS database used for model validation:
For model validation, the database was designed to contain only two known molecules, cysteine and arginine.
2.1.2 Preparation of samples to be tested:
As shown in Fig. 3a, four samples were prepared and numbered (1) to (4). The concentration of arginine decreased with the sample number, while the concentration of cysteine increased with it. Each sample also contained five other substances (not included in the SERS standard database used for this model validation) at unknown concentrations.
2.1.3 SERS spectral test:
The prepared sample was mixed with silver particles and then subjected to Raman spectrum testing with the following parameters: laser wavelength 638 nm, laser power 100%, integration time 1 s, 10× objective, 201 Raman spectra per sample. The 201 spectra obtained were averaged to give the average spectrum of the detected SERS spectra.
In Fig. 3b, i shows the SERS spectrum of cysteine, iii shows the SERS spectrum of arginine, and ii shows a portion of a SERS spectrum taken at random from the background.
2.1.4 weight optimization:
The α value was adjusted to calculate the coefficients of cysteine and arginine in the different samples; the coefficients were compared with the known standard concentrations and a regression curve was established. Taking the R-squared value of the regression fit as the criterion, the α value giving the best regression was selected as the weight for subsequent analysis. Here α was set to 0.5, and the objective function with the weight added is as follows:
2.1.5 algorithm resolution:
Each of the 201 scanned spectra was unmixed with the algorithm of 2.1.4 based on the SERS database to obtain the cysteine and arginine coefficients of each SERS spectrum. These were then averaged to obtain the average coefficients (i.e. relative contents) of the known molecules (cysteine and arginine) contained in the SERS database used for model verification. The results are shown in Table 1 and Figs. 3c-3e.
The thin black line in Fig. 3c is the average of the detected SERS spectra for each of the four mixed samples with different concentrations; the colored line is the sum of the products of the reference spectra and the corresponding coefficients fitted by the algorithm, i.e. the average of the reconstructed (fitted) SERS spectra. The two almost coincide, which indicates a good fit and shows that the algorithm unmixes the spectra effectively.
Fig. 3d shows randomly selected representative spectra of the four samples, which also show a good fit.
Table 1 Coefficients of cysteine and arginine in model validation
As shown in Table 1 and Fig. 3e, the cysteine and arginine coefficients calculated by the algorithm based on the SERS database were linearly fitted against the actual concentrations in the four samples. The relative contents obtained by the algorithm show a good linear correlation with the metabolite concentrations designed when preparing the samples to be tested.
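The unmixing step of 2.1.5 follows the iterative procedure of embodiment 42: W^(1) is held fixed while H^(1), W^(2) and H^(2) are updated until the iteration limit N or the threshold σ is reached. The sketch below illustrates that loop under stated assumptions. The weighted objective is not reproduced in this section, so the code substitutes a hypothetical ridge penalty, F = 0.5·‖V − W1·H1 − W2·H2‖² + 0.5·α·‖H2‖², with standard multiplicative updates; it also assumes the columns of V are the individual spectra. The function name `nmf_cls` and its parameters are illustrative only.

```python
import numpy as np

def nmf_cls(V, W1, r2, alpha=0.5, max_iter=200, sigma=1e-4, seed=0):
    """Partially fixed NMF: V ~ W1 @ H1 + W2 @ H2 with W1 known and fixed.

    V  : (n_wavenumbers, n_spectra) non-negative measured spectra.
    W1 : (n_wavenumbers, r1) standard spectra of the known molecules.
    r2 : assumed number of unknown components.
    The alpha term is a stand-in ridge penalty on H2 (an assumption),
    not the objective given in the application.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12                                   # avoids division by zero
    m, n = V.shape
    H1 = rng.random((W1.shape[1], n))             # step 2: random initialization
    W2 = rng.random((m, r2))
    H2 = rng.random((r2, n))
    history = []
    for _ in range(max_iter):                     # step 3: iterative updates
        R = W1 @ H1 + W2 @ H2
        H1 *= (W1.T @ V) / (W1.T @ R + eps)
        R = W1 @ H1 + W2 @ H2
        H2 *= (W2.T @ V) / (W2.T @ R + alpha * H2 + eps)
        R = W1 @ H1 + W2 @ H2
        W2 *= (V @ H2.T) / (R @ H2.T + eps)
        R = W1 @ H1 + W2 @ H2
        F = 0.5 * np.sum((V - R) ** 2) + 0.5 * alpha * np.sum(H2 ** 2)
        history.append(F)
        if F <= sigma:                            # step 4: stop at threshold
            break
    return H1, W2, H2, history
```

With the spectra as columns of V, `H1.mean(axis=1)` gives one average coefficient per known molecule, which corresponds to the averaged relative content reported in Table 1.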
2.2 Comparison of weight settings
If unknown components are present in a relatively large proportion and the weight α is not set (i.e. α is 0), the unmixed coefficients may fail to show a good linear relationship with the true concentrations even though the spectral fit is still good. The effect of different α values on the unmixing was verified by comparing the unmixing results for arginine from the model verification above: the result with the weight (α=0.5) is shown in Fig. 3f and the result without the weight (α=0) in Fig. 3g. The coefficients obtained without the weight do not effectively represent the relative concentration, whereas those obtained with the weight do.
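The weight selection used here (try several α values, regress the unmixed coefficients against the known concentrations, keep the α with the highest R-squared) can be sketched as below. The names `r_squared` and `select_alpha` are hypothetical, and `unmix` is an assumed callable that returns one averaged coefficient per sample for a given α; it stands in for whatever unmixing routine is actually used.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a straight-line fit of y against x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def select_alpha(alphas, unmix, concentrations):
    """Return the alpha whose coefficients are most linear in concentration.

    unmix(alpha) -> sequence with one coefficient per sample (assumed callable).
    concentrations: known concentration of the molecule in each sample.
    """
    scores = {a: r_squared(concentrations, unmix(a)) for a in alphas}
    best = max(scores, key=scores.get)
    return best, scores
```

Scoring by linearity rather than by spectral fit matches the observation above: a good fit can be reached at α=0 as well, so the fit residual alone cannot distinguish useful coefficients from misleading ones.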
2.3 Comparison of unmixing methods
Fig. 4a shows the fit obtained for the four samples containing different concentrations of cysteine and arginine in Example 2.1 after unmixing by the least squares method (CLS). The black line is the average of the detected SERS spectra, the red line is the average of the spectra fitted by the CLS algorithm, and the blue line is the difference between the two; the fit is poor.
Fig. 4b shows the fit for the same four samples after unmixing by non-negative matrix factorization (NMF). The fit is very good, but the known-component spectra calculated by NMF in Fig. 4c show no significant Raman peaks, whereas the standard Raman spectrum of the actual molecule (e.g. cysteine) has significant Raman peaks. The component spectra resolved by NMF alone therefore cannot be matched to the Raman spectra of actual molecules.
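For comparison, the least squares baseline of Fig. 4a can be sketched in a few lines. This is an illustration under assumptions: it uses an ordinary (unconstrained) least squares fit of the measured spectra onto the known standard spectra only, and the function name `cls_unmix` is hypothetical.

```python
import numpy as np

def cls_unmix(V, W1):
    """Least squares fit of the measured spectra using only the known spectra.

    V  : (n_wavenumbers, n_spectra) measured spectra as columns.
    W1 : (n_wavenumbers, r1) standard spectra of the known molecules.
    Returns the coefficients H1 and the part of V the fit cannot explain.
    """
    H1, *_ = np.linalg.lstsq(W1, V, rcond=None)
    residual = V - W1 @ H1
    return H1, residual
```

When the sample contains only database molecules the residual vanishes, but any unknown component leaves a residual that the model has no term for, which is consistent with the poor fit reported for CLS.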
Example 3 Model validation 2
3.1 Establishment of the SERS database used for model validation:
For model validation, the database was designed to contain only five known molecules: tyrosine, guanine, cytosine, asparagine and adenine.
3.2 preparation of samples to be tested:
Tyrosine, guanine, cytosine, asparagine and adenine were added to 3 samples at different concentrations, and solutions of 15 additional molecules of unknown concentration were mixed in.
3.3 SERS spectral test:
The prepared sample was mixed with silver particles and then subjected to Raman spectrum testing with the following parameters: laser wavelength 638 nm, laser power 100%, integration time 1 s, 10× objective, 201 Raman spectra per sample. The 201 spectra obtained were averaged to give the average spectrum of the detected SERS spectra.
In Fig. 5a, i shows the SERS standard spectra of the 5 known molecules, and ii shows a randomly selected Raman spectrum of a mixed solution containing the 5 target molecules and 15 unknown molecules; its background is visibly complex.
3.4 weight optimization:
The α value was adjusted to calculate the coefficients of the target molecules in the different samples; the coefficients were compared with the known standard concentrations and a regression curve was established. Taking the R-squared value of the regression fit as the criterion, the α value giving the best regression was selected as the weight for subsequent analysis. Here α was set to 0.5, and the objective function with the weight added is as follows:
3.5 algorithm analysis:
Each of the 201 scanned spectra was unmixed with the algorithm of 3.4 based on the SERS database to obtain the coefficients of the known molecules (tyrosine, guanine, cytosine, asparagine and adenine) in each SERS spectrum. These were then averaged to obtain the average coefficients (i.e. relative contents) of the known molecules contained in the SERS database used for this model verification. The concentrations in the same samples were measured by mass spectrometry, and regression analysis was performed between these concentrations and the coefficients unmixed by the present algorithm.
Fig. 5b shows the fit of the SERS spectra of the three different mixtures. The thin black line is the average of the detected SERS spectra of each mixture; the colored line is the sum of the products of the reference spectra and the corresponding coefficients fitted by the algorithm, i.e. the average of the reconstructed (fitted) SERS spectra. The two almost coincide, which indicates a good fit and shows that the algorithm unmixes the spectra effectively.
Fig. 5c compares the calculated average coefficients with the concentrations detected by mass spectrometry. The average coefficients (relative contents) calculated by the algorithm match the concentrations detected by mass spectrometry well, while SERS offers higher sensitivity and can therefore achieve more sensitive detection.
Example 4 cell experiment
4.1 As described in example 1, a SERS database was created for cell verification;
4.2 Preparation of the sample to be tested: the extracellular metabolic behavior of three groups of cells, as reflected in the cell culture fluid, was compared over several days. The cell groups were the LO2 (normal human liver cell) group, the HepG2 (human liver cancer cell) group and the HepG2+MTX (human liver cancer cells treated with the anticancer drug methotrexate) group.
Each group underwent continuous metabolic behavior testing for 5 days, i.e. 400 μl of cell culture fluid was taken from the cell culture dish each day. Dead cells and cell debris were removed from the culture fluid sample by gradient centrifugation, and protein molecules were then removed by ultrafiltration (3 kDa cutoff), leaving the metabolic molecule components of the sample, which served as the sample to be tested;
4.3 SERS spectral test:
After the sample to be tested was mixed with silver nanoparticles, SERS testing was performed with the following parameters: laser wavelength 638 nm, laser power 100%, integration time 5 s, 10× objective, 201 Raman spectra per sample;
200 spectra were selected from the DAY2 data of each culture fluid to produce a SERS spectrum heat map (Fig. 7c), in which the abscissa is the Raman shift, the ordinate is the spectrum number, each row of pixels represents one spectrum, and the pixel color represents the Raman intensity. Peak positions and peak intensities fluctuate between spectra within the same measurement. This suggests that the types and numbers of molecules present in the Raman-enhancing hot spot regions differ from moment to moment, depending on the concentration and type of the molecules, so the Raman spectrum must be measured many times to reflect the molecular composition in serum.
4.3.1 Spectrum quantity calculation
This part calculates how many spectra must be acquired in each measurement to ensure that the full information of the sample is captured.
The average of 200 spectra was used as the standard spectrum. Each time, N spectra were taken and averaged, and the Pearson coefficient between this average and the standard spectrum was calculated; the operation was repeated 5 times and the 5 Pearson coefficients were averaged to give the correlation coefficient for N spectra.
The curve is considered to have converged when the Pearson coefficient exceeds 0.8, meaning that the required information has essentially been obtained. As shown in Fig. 6, for the different types of data the spectra have essentially converged (Pearson coefficient greater than 0.8) at around 50 spectra.
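The convergence check described in 4.3.1 can be sketched as follows. The function names are hypothetical, and drawing the N spectra at random without replacement is an assumption, since the text does not say how the spectra are picked.

```python
import numpy as np

def subsample_correlation(spectra, n_take, repeats=5, seed=0):
    """Mean Pearson coefficient between the average of n_take spectra and
    the average of all spectra, over `repeats` random draws.

    spectra: (n_spectra, n_wavenumbers), e.g. 200 rows.
    Random draws without replacement are an assumption.
    """
    rng = np.random.default_rng(seed)
    spectra = np.asarray(spectra, dtype=float)
    reference = spectra.mean(axis=0)              # the "standard spectrum"
    scores = []
    for _ in range(repeats):
        idx = rng.choice(len(spectra), size=n_take, replace=False)
        sub_mean = spectra[idx].mean(axis=0)
        scores.append(np.corrcoef(sub_mean, reference)[0, 1])
    return float(np.mean(scores))

def spectra_needed(spectra, threshold=0.8, repeats=5, seed=0):
    """Smallest number of spectra whose mean correlation exceeds the threshold."""
    for n_take in range(1, len(spectra) + 1):
        if subsample_correlation(spectra, n_take, repeats, seed) > threshold:
            return n_take
    return None
```

Plotting `subsample_correlation` against `n_take` would give a curve of the kind described for Fig. 6, and `spectra_needed` reads off where it first crosses 0.8.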
4.4 weight optimization:
The α value was adjusted to calculate the coefficients of the target molecules in the different samples; the coefficients were compared with the known standard concentrations and a regression curve was established. Taking the R-squared value of the regression fit as the criterion, the α value giving the best regression was selected as the weight for subsequent analysis. Here α was set to 0.5, and the objective function with the weight added is as follows:
4.5 algorithm analysis:
Each of the 201 scanned spectra was unmixed with the algorithm of 4.4 based on the SERS database to obtain the coefficients of the known molecules in each SERS spectrum. These were then averaged to obtain the average coefficients (i.e. relative contents) of the known molecules contained in the SERS database used. The relative contents obtained can be used for further biomedical analysis, such as the metabolic differences between normal cells and tumor cells, or monitoring changes in the metabolic behavior of tumor cells after treatment with anti-tumor drugs.
Fig. 7b shows the fit of the SERS spectra of the cell culture fluid of the three cell groups on each day. The thin black line is the average of the detected SERS spectra; the colored line is the sum of the products of the reference spectra and the corresponding coefficients fitted by the algorithm, i.e. the average of the reconstructed (fitted) SERS spectra. The two almost coincide, which indicates a good fit and shows that the algorithm unmixes the spectra effectively.
4.6 differential metabolite screening
ANOVA performs statistical analysis on the data for the different categories to find metabolites where statistical differences occur between the different categories. Inputting data of different categories, ANOVA will give a test level, i.e. a p-value. The smaller the p value, the higher the inter-group variability, and it is considered that there is a significant difference between groups when the p value is less than 0.05. The present application calculates the p-value of the test level between the different classes for each metabolite relative concentration data, and considers the metabolites with p-value less than 0.05 as the metabolites with significant differences, and retains them.
Logistic regression classifies the samples using the relative coefficient data and identifies the metabolites that contribute to distinguishing the categories. When classifying, the logistic regression algorithm assigns a calculated weight to the relative concentration data of each metabolite; this weight reflects the importance of the metabolite in the classification, and the larger its absolute value, the higher the importance. Under L1 regularization, the weights of metabolites of low importance are set to 0 by the algorithm, so metabolites whose calculated weight has an absolute value greater than 0 are taken as metabolites with possible differences.
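The way L1 regularization drives unimportant weights to exactly 0 can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the solver used in the application: it handles only two classes (the serum example has three), fits by proximal gradient descent with soft thresholding, and the name `l1_logistic_weights` and the parameters `lam`, `lr` and `n_iter` are assumptions chosen for illustration.

```python
import numpy as np

def l1_logistic_weights(X, y, lam=0.1, lr=0.1, n_iter=2000):
    """Binary L1-regularised logistic regression by proximal gradient descent.

    X: (n_samples, n_metabolites) relative concentrations; y: 0/1 labels.
    Returns (weights, intercept); a weight of exactly 0 marks a metabolite
    that the L1 penalty has discarded.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardise columns so the penalty treats all metabolites alike.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        w = w - lr * (X.T @ (p - y)) / n_samples  # gradient step on the loss
        # Soft thresholding: the proximal step of the L1 penalty.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
        b -= lr * np.mean(p - y)                  # intercept is not penalised
    return w, b
```

Metabolites with `np.abs(w) > 0` would be the candidates from this step; intersecting them with the metabolites retained by ANOVA gives the final screening result described in the text.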
The intersection of the differential metabolites selected by both ANOVA and logistic regression was taken as the final metabolite screening result. As shown in Fig. 7d, 8 exemplary differential metabolites were screened, and the changes in their calculated coefficients (i.e., relative concentrations) were analyzed.
Example 5 serum experiments
5.1 As described in example 1, a SERS database was created for cell verification;
5.2 preparation of sample to be measured:
The serum test used serum samples stored at -80 ℃ (from 85 BPH patients, 85 PCa patients and 75 healthy subjects). The serum was thawed at 4 ℃ and then ultrafiltered (3 kDa cutoff) to remove the proteins, yielding the metabolite molecules of the serum, which were used as the samples to be tested;
5.3 SERS spectral test:
The sample to be tested was mixed with silver nanoparticles and then subjected to SERS spectral testing with the following parameters: laser wavelength 638 nm, laser power 100%, integration time 5 s, 10× objective, 201 Raman spectra per sample;
For the DAY2 data of each culture broth, 200 spectra were taken to obtain SERS spectral heat maps (Fig. 9). The abscissa is the Raman shift and the ordinate is the spectrum number. Peak positions and peak intensities fluctuate between spectra within the same measurement, so the types and numbers of molecules present in the Raman-enhancing hot-spot regions can be considered to differ from moment to moment, depending on the concentration and type of the molecules. Raman spectra therefore need to be measured multiple times to reflect the molecular composition of the serum.
5.3.1 Spectrum quantity calculation
This part calculates how many spectra must be acquired in each measurement to ensure that the full information of the sample is captured.
The average of the 200 spectra was used as the standard spectrum. N spectra were taken and averaged, and the Pearson coefficient between this average and the standard spectrum was calculated; this operation was repeated 5 times, and the mean of the 5 Pearson coefficients was taken as the correlation coefficient for N spectra.
When the curve converges (Pearson coefficient greater than 0.8), the required information is considered to have been substantially obtained. As shown in Fig. 8, at N of about 50 the spectra have substantially converged (Pearson coefficient greater than 0.99) for the different types of data.
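The spectrum-number calculation of 5.3.1 can be sketched in Python as below. The name `subset_correlation` is hypothetical, and drawing the N spectra at random without replacement is an assumption, since the text only says that N spectra are taken each time.

```python
import numpy as np

def subset_correlation(spectra, n_subset, n_repeats=5, seed=0):
    """Mean Pearson coefficient between the average of n_subset randomly
    chosen spectra and the average of all spectra (the standard spectrum).

    spectra: (n_spectra, n_points) array, one measured spectrum per row.
    """
    rng = np.random.default_rng(seed)
    standard = spectra.mean(axis=0)
    coefficients = []
    for _ in range(n_repeats):
        # Assumption: the N spectra are drawn at random, without replacement.
        idx = rng.choice(spectra.shape[0], size=n_subset, replace=False)
        subset_mean = spectra[idx].mean(axis=0)
        coefficients.append(np.corrcoef(subset_mean, standard)[0, 1])
    return float(np.mean(coefficients))
```

Evaluating this function for increasing N and plotting the result against N gives a convergence curve of the kind referred to in Fig. 8.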
5.4 weight optimization:
The α value was adjusted and, for each value, the coefficients of the target molecules in the different samples were calculated and compared with the known standard concentrations to build a regression curve. The R² of the regression fit was used as the criterion, and the α value giving the best regression was selected as the weight for subsequent analysis; here α was set to 0.5. The weight was added to the objective function as follows:
[Equation image not reproduced: objective function F of the NMF-CLS algorithm with weight α]
5.5 algorithm analysis:
Each of the 201 spectra obtained by scanning was unmixed with the algorithm of 5.4, based on the SERS database, to obtain the coefficients of the known molecules contained in each SERS spectrum. The coefficients were then averaged to obtain the average coefficient (i.e., the relative content) of each known molecule in the SERS database used for model verification. The relative contents obtained can be used for further biomedical analysis, such as early screening, typing and staging of diseases.
Fig. 10 shows the fitting of the SERS spectra of the serum of the three populations. The thin black line is the average of the measured SERS spectra; the colored line is the sum of the products of the reference spectra fitted by the algorithm and their corresponding coefficients, i.e., the average of the reconstructed (fitted) SERS spectra. The two nearly coincide, indicating a good fit and that the algorithm has effectively unmixed the spectra.
5.6 differential metabolite screening
ANOVA performs a statistical analysis on the data of the different categories to find metabolites that differ statistically between categories. Given the data of the different categories as input, ANOVA returns a significance level, i.e., a p-value. The smaller the p-value, the stronger the evidence for a difference between groups; the difference between groups is considered significant when the p-value is less than 0.05. The present application calculates, for the relative concentration data of each metabolite, the p-value between the different categories, and retains the metabolites with a p-value less than 0.05 as metabolites with significant differences.
Logistic regression classifies the samples using the relative coefficient data and identifies the metabolites that contribute to distinguishing the categories. When classifying, the logistic regression algorithm assigns a calculated weight to the relative concentration data of each metabolite; this weight reflects the importance of the metabolite in the classification, and the larger its absolute value, the higher the importance. Under L1 regularization, the weights of metabolites of low importance are set to 0 by the algorithm, so metabolites whose calculated weight has an absolute value greater than 0 are taken as metabolites with possible differences.
The intersection of the differential metabolites selected by both ANOVA and logistic regression was taken as the final metabolite screening result. After unmixing the surface-enhanced Raman spectra of the serum, 16 differential metabolites were screened out by intersecting the ANOVA and logistic regression results (Fig. 11). For further analysis, the coefficients of these 16 differential metabolites were taken for all samples (Fig. 12a). As shown in Fig. 12b and 12c, samples from prostate cancer patients, benign prostatic hyperplasia patients and healthy subjects were classified using the data of these 16 differential metabolites, and the results were compared with those of clinical PSA screening by liquid biopsy. The results showed that the analysis based on the screened metabolites outperformed PSA screening.

Claims (10)

1. A spectral image unmixing method based on a weighted non-negative matrix factorization (NMF-CLS) algorithm, comprising: based on a standard spectrum database, unmixing the spectra obtained by testing with the NMF-CLS algorithm to obtain the types and relative concentrations of the known molecules contained in a sample to be tested; the known molecules are molecules contained in the standard spectrum database, and the standard spectrum database consists of standard spectra of different molecules;
wherein the objective function of the NMF-CLS algorithm is set as follows:
[Equation image not reproduced: objective function F of the NMF-CLS algorithm]
wherein the spectrum matrix is an $m \times n$ matrix $V$, representing $n$ spectra in total, each consisting of $m$ points; the $m \times r_1$ matrix $W^{(1)}$ represents the reference spectra of the known molecules arranged in columns, and the $m \times r_2$ matrix $W^{(2)}$ represents the spectra of the unknown molecules arranged in columns; the $r_1 \times n$ matrix $H^{(1)}$ and the $r_2 \times n$ matrix $H^{(2)}$ represent the relative concentrations corresponding to $W^{(1)}$ and $W^{(2)}$, respectively; $r_1$ and $r_2$ denote $r_1$ kinds of known molecules and $r_2$ kinds of unknown molecules, respectively; $\alpha$ denotes the weight set for the known molecules, $\alpha \ge 0$; since the reference spectra $W^{(1)}$ of the known molecules are known, $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ are found such that $F$ in the equation is minimized, giving the relative concentrations corresponding to the known molecules.
2. The method of claim 1, wherein $H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ are calculated by an iterative process.
3. The method of claim 1, wherein the partial derivatives of $F$ with respect to $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ are:
[Equation image not reproduced: partial derivatives of F]
the iterative formulas for $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ derived from the partial derivatives are:
[Equation image not reproduced: iterative update formulas]
$H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ are updated iteratively according to the iterative formulas; the iteration stops when the maximum number of iterations $N$ is reached or $F$ falls to a set threshold $\sigma$; after the iteration stops, $H^{(1)}$ is the final result for the relative concentrations of the known components.
4. A method according to any one of claims 1-3, wherein the calculation process of $H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ comprises:
1) inputting the matrix $W^{(1)}$ of the known components, the measured spectrum matrix $V$, the maximum number of iterations $N$ and the threshold $\sigma$;
2) randomly initializing the coefficient matrix $H^{(1)}$ of the known components, and the spectrum matrix $W^{(2)}$ and coefficient matrix $H^{(2)}$ of the unknown components;
3) iteratively updating $H^{(1)}$, $W^{(2)}$ and $H^{(2)}$ according to the iterative formulas;
4) stopping the iteration when the maximum number of iterations $N$ is reached or $F$ falls to the set threshold $\sigma$;
5) after the iteration stops, taking $H^{(1)}$ as the final result for the relative concentrations of the known components.
5. A method according to any one of claims 1 to 3, wherein the sample to be tested comprises a chemical sample or a biological sample.
6. A method according to any one of claims 1-3, wherein the spectral image comprises an infrared spectrum and a Raman spectrum.
7. An analysis method based on surface-enhanced Raman spectroscopy, comprising: based on a surface-enhanced Raman spectroscopy (SERS) standard spectrum database, unmixing the spectra obtained by testing with a weighted non-negative matrix factorization (NMF-CLS) algorithm to obtain the types and relative concentrations of the known molecules contained in the sample to be tested; the known molecules are molecules contained in the SERS standard spectrum database, and the SERS standard spectrum database consists of SERS standard spectra of different molecules;
wherein the objective function of the NMF-CLS algorithm is set as follows:
[Equation image not reproduced: objective function F of the NMF-CLS algorithm]
wherein the spectrum matrix is an $m \times n$ matrix $V$, representing $n$ spectra in total, each consisting of $m$ points; the $m \times r_1$ matrix $W^{(1)}$ represents the reference spectra of the known molecules arranged in columns, and the $m \times r_2$ matrix $W^{(2)}$ represents the spectra of the unknown molecules arranged in columns; the $r_1 \times n$ matrix $H^{(1)}$ and the $r_2 \times n$ matrix $H^{(2)}$ represent the relative concentrations corresponding to $W^{(1)}$ and $W^{(2)}$, respectively; $r_1$ and $r_2$ denote $r_1$ kinds of known molecules and $r_2$ kinds of unknown molecules, respectively; $\alpha$ denotes the weight set for the known molecules, $\alpha \ge 0$; since the reference spectra $W^{(1)}$ of the known molecules are known, $W^{(2)}$, $H^{(1)}$ and $H^{(2)}$ are found such that $F$ in the equation is minimized, giving the relative concentrations corresponding to the known molecules.
8. The method of claim 7, wherein the molecules in the SERS standard spectrum database comprise metabolites.
9. A metabonomic analysis method, the method comprising: based on a standard spectrum database, the spectrum of the sample to be tested obtained by the test is unmixed by the method of claim 1, so that the type and the relative concentration of the metabolite contained in the sample to be tested are obtained, wherein the metabolite is a molecule contained in the standard spectrum database.
10. A method of determining a biomarker, comprising:
1) Obtaining spectral data of the sample-group samples and the control-group samples, respectively, and, based on a standard spectrum database, unmixing the spectra obtained by testing with the method of claim 1 to obtain, respectively, the types and relative concentrations of the known molecules contained in the sample-group samples and the control-group samples, wherein the known molecules are molecules contained in the standard spectrum database;
2) Screening for differential molecules as biomarkers.
CN202111150957.0A 2021-09-29 2021-09-29 Spectral image unmixing method based on weighted non-negative matrix factorization and application thereof Active CN113793646B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111150957.0A CN113793646B (en) 2021-09-29 2021-09-29 Spectral image unmixing method based on weighted non-negative matrix factorization and application thereof
PCT/CN2022/122403 WO2023051661A1 (en) 2021-09-29 2022-09-29 Spectral image de-mixing method based on weighted non-negative matrix factorization, and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111150957.0A CN113793646B (en) 2021-09-29 2021-09-29 Spectral image unmixing method based on weighted non-negative matrix factorization and application thereof

Publications (2)

Publication Number Publication Date
CN113793646A CN113793646A (en) 2021-12-14
CN113793646B true CN113793646B (en) 2023-11-28

Family

ID=78877556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111150957.0A Active CN113793646B (en) 2021-09-29 2021-09-29 Spectral image unmixing method based on weighted non-negative matrix factorization and application thereof

Country Status (2)

Country Link
CN (1) CN113793646B (en)
WO (1) WO2023051661A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793646B (en) * 2021-09-29 2023-11-28 上海交通大学 Spectral image unmixing method based on weighted non-negative matrix factorization and application thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101697008A (en) * 2009-10-20 2010-04-21 北京航空航天大学 Hyperspectral unmixing method for estimating regularized parameter automatically
JP2011174906A (en) * 2010-02-25 2011-09-08 Olympus Corp Vibration spectrum analysis method
CN103413117A (en) * 2013-07-17 2013-11-27 浙江工业大学 Incremental learning and face recognition method based on locality preserving nonnegative matrix factorization ( LPNMF)
CN113066142A (en) * 2021-02-24 2021-07-02 西安电子科技大学 Optical function imaging method combining spatial regularization and semi-blind spectrum unmixing

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8140272B2 (en) * 2008-03-27 2012-03-20 Nellcor Puritan Bennett Llc System and method for unmixing spectroscopic observations with nonnegative matrix factorization
CN105809105B (en) * 2016-02-06 2019-05-21 黑龙江科技大学 High spectrum image solution mixing method based on end member constrained non-negative matrix decomposition
US10776718B2 (en) * 2016-08-30 2020-09-15 Triad National Security, Llc Source identification by non-negative matrix factorization combined with semi-supervised clustering
CN112750091A (en) * 2021-01-12 2021-05-04 云南电网有限责任公司电力科学研究院 Hyperspectral image unmixing method
CN113793646B (en) * 2021-09-29 2023-11-28 上海交通大学 Spectral image unmixing method based on weighted non-negative matrix factorization and application thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
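Research on multi-objective network security assessment based on an attack graph model; 程叶霞; 姜文; 薛质; 程叶坚; Journal of Computer Research and Development (Issue S2); full text *
Research progress on mixed-pixel decomposition of hyperspectral remote sensing imagery; 蓝金辉; 邹金霖; 郝彦爽; 曾溢良; 张玉珍; 董铭巍; Journal of Remote Sensing (Issue 01); full text *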

Also Published As

Publication number Publication date
CN113793646A (en) 2021-12-14
WO2023051661A1 (en) 2023-04-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant