CN113793646A - Spectral image unmixing method based on weighted nonnegative matrix decomposition and application thereof

Info

Publication number: CN113793646A
Application number: CN202111150957.0A
Authority: CN
Inventors: 叶坚; 何畅; 毕心缘
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2021-12-14
Anticipated expiration: 2041-09-29
Also published as: CN113793646B; WO2023051661A1

Abstract

The application relates to a spectral image unmixing method based on a weighted nonnegative matrix factorization algorithm (NMF-CLS), which comprises: based on a standard spectrum database, unmixing the measured spectrum with the NMF-CLS algorithm to obtain the types and relative concentrations of the known molecules contained in the sample to be tested. The NMF-CLS algorithm can obtain the types and relative contents of known molecules in a complex sample, and can eliminate the influence on unmixing of molecules not contained in the database.

Description

Spectral image unmixing method based on weighted nonnegative matrix decomposition and application thereof

Technical Field

The application relates to the technical field of image processing, in particular to a spectral image unmixing method based on weighted nonnegative matrix factorization.

Background

Infrared and Raman spectroscopy

Infrared (IR) spectroscopy and Raman spectroscopy are powerful tools for studying molecular structure and chemical composition, and are widely used in materials science, chemical engineering, environmental protection, geology and other fields because they are rapid, highly sensitive and require only small sample amounts. From the viewpoint of analytical testing, combining the two methods provides more complete information on molecular structure. Both are molecular vibrational spectra, but they differ substantially: the infrared spectrum is an absorption spectrum, whereas the Raman spectrum is a scattering spectrum.

Infrared spectrum: when electromagnetic radiation interacts with the molecules of a substance and its energy matches the difference between vibrational or rotational energy levels of the molecules, the molecules jump from a lower level to a higher level, so that radiation at certain specific wavelengths is absorbed. Measuring the radiation intensity at different wavelengths gives the infrared absorption spectrum. Because the molecules undergo transitions of both vibrational and rotational energy levels after absorbing infrared radiation, the infrared spectrum is also called the molecular vibration-rotation spectrum.

Raman spectroscopy: when light strikes a substance, photons collide with electrons in its molecules. If the collision is inelastic, part of the photon's energy is transferred to the electron, so the frequency of the scattered light differs from that of the incident light. This is called Raman scattering, and the resulting spectrum is called the Raman spectrum.

Raman spectroscopy and infrared spectroscopy are among the most important methods of analytical chemistry and can provide key structural information, such as the chemical bonds of the system under test. However, they often face a bottleneck of low sensitivity when applied to the surface chemistry analysis of materials and biological systems.

Surface-enhanced infrared absorption spectroscopy (SEIRAS)

When molecules are adsorbed on the surface of rough metal particles, the infrared absorption signal is enhanced by a factor of 10 to 1000; this phenomenon is called the surface-enhanced infrared absorption (SEIRA) effect. Surface-enhanced infrared spectroscopy, which is based on this effect, has high surface sensitivity and can detect changes in infrared absorption at the 10⁶ order of magnitude; its surface selection rule is simple, which makes it convenient to determine the adsorption orientation of molecules; and it is not limited by mass-transfer resistance. It therefore has great value in analytical applications.

Surface-enhanced Raman spectroscopy (SERS)

Surface-enhanced Raman spectroscopy exploits the enhancement of Raman scattering intensity caused by the plasmon resonance interaction between molecules adsorbed on a metal nanostructure and the metal surface, and is a very effective technique for detecting Raman signals. By adsorbing the molecules to be detected on the surface of a rough nano-metal material, it can enhance their Raman signal by a factor of 10⁶ to 10¹⁵. This overcomes the low sensitivity of ordinary Raman spectroscopy, and the detection sensitivity can reach the single-molecule level, which has promoted the application of SERS in food safety, environmental protection, medical detection and other fields.

Surface-enhanced Raman spectroscopy includes targeted SERS and broad-spectrum SERS. Targeted SERS relies on specific binding: for example, antibodies modified on the surface of SERS particles specifically capture antigens in a sample, so that the content (concentration) of one or a few molecules can be detected. It cannot achieve broad-spectrum detection, however, and the metabolic information obtained is very limited. Broad-spectrum SERS does not depend on specific binding and achieves broad-spectrum detection in biological samples, but its analysis mainly uses methods such as principal component analysis and machine learning to directly classify two types of samples, and cannot yield specific metabolite information (including type, content and the like).

Spectral analysis

Raman spectroscopy and infrared spectroscopy are widely used because different substances produce unique spectral characteristics. However, most Raman or infrared spectral images measured in practice are mixtures of contributions from different substances, so the spectral images must be unmixed in order to analyze each component more accurately.

The basic idea of the Classical Least Squares (CLS) method is to approximate the spectrum of a mixture of components (e.g. raman spectrum) as a linear addition of a series of pure component spectra, and the algorithm aims to find the coefficient of each pure component spectrum so that the sum of the squares of the errors of the reconstructed spectrum and the original spectrum is minimized by the linear addition.
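The idea can be illustrated with a minimal NumPy sketch on synthetic data. The function name `cls_unmix` and the Gaussian "pure spectra" are illustrative assumptions, and plain least squares as used here does not enforce non-negative coefficients.

```python
import numpy as np

def cls_unmix(V, W):
    """Return coefficients H minimizing ||V - W @ H||^2.

    V: (m, n) array, n mixed spectra in columns, m points each.
    W: (m, r) array, r pure-component spectra in columns.
    """
    H, *_ = np.linalg.lstsq(W, V, rcond=None)
    return H

# Synthetic example: two Gaussian "pure spectra" mixed with known coefficients.
x = np.linspace(0.0, 1.0, 200)
W = np.column_stack([
    np.exp(-((x - 0.3) / 0.05) ** 2),
    np.exp(-((x - 0.7) / 0.05) ** 2),
])
H_true = np.array([[1.0, 0.2],
                   [0.5, 2.0]])
V = W @ H_true            # two noise-free mixed spectra
H_est = cls_unmix(V, W)   # recovers H_true up to rounding
```

Because every pure-component spectrum is supplied here, the coefficients are recovered exactly; the fit degrades when components present in the mixture are missing from `W`.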

Non-negative matrix factorization (NMF) likewise treats a mixed-component spectrum (e.g. a Raman spectrum) as a linear addition of multiple component spectra, as in classical least squares, but computes the component spectra and their corresponding concentrations iteratively. The algorithm iteratively decomposes a matrix composed of several mixed spectra arranged in columns into a product of two non-negative matrices: ideally, one is composed of the spectra of the individual components arranged in columns, and the other holds the corresponding relative concentration of each component in each spectrum. Let the spectrum matrix be an m × n matrix V, representing n spectra in total, each consisting of m points. The matrix of pure-component spectra is an m × r matrix W, representing r components in total. The matrix of relative concentrations is an r × n matrix H, each column of which gives the relative concentrations of the components in one spectrum. The goal of each unmixing algorithm is to have V ≈ WH.
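The factorization V ≈ WH can be sketched with the widely used Lee-Seung multiplicative updates for the squared-error objective. This is a generic illustration of NMF, not necessarily the exact iteration used in this application; the function name `nmf`, the `eps` guard and the random seed are assumptions.

```python
import numpy as np

def nmf(V, r, n_iter=500, eps=1e-12, seed=0):
    """Factor a non-negative (m, n) matrix V into W (m, r) and H (r, n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps   # random non-negative start
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic rank-3 data: 20 "spectra" of 50 points each.
data_rng = np.random.default_rng(1)
V = data_rng.random((50, 3)) @ data_rng.random((3, 20))
W, H = nmf(V, r=3)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The columns of `W` returned this way are only determined up to scaling and permutation, which is one reason the recovered spectra need not match the spectra of actual components.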

For the classical least squares method, the objective optimization function is set to
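A form of this objective consistent with the description above (the sum of squared errors between the reconstructed and original spectra) is sketched here. The factor 1/2 is a common convention assumed for illustration, and the label (1) follows the later reference to equation (1).

```latex
F = \frac{1}{2}\,\lVert V - WH \rVert_F^2
  = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\bigl(V_{ij} - (WH)_{ij}\bigr)^2
\tag{1}
```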

where W is known, being the matrix composed of the Raman spectra of the various pure components, and the aim is to find the H that minimizes F. The partial derivative of F with respect to H can then be found to be
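Assuming the objective takes the conventional form F = ½‖V − WH‖² in the Frobenius norm (the factor 1/2 is an assumption; without it the result is simply doubled), this derivative would read:

```latex
\frac{\partial F}{\partial H} = W^{\mathsf T} W H - W^{\mathsf T} V
                              = -\,W^{\mathsf T}\,(V - WH)
```

Setting the derivative to zero gives the normal equations WᵀWH = WᵀV, the usual closed-form condition for a least-squares solution.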

For non-negative matrix factorization, the objective optimization function is the same as the classical least squares method, i.e. equation (1), but for this algorithm, W and H are both unknown, requiring iterative computation of the parameters of both matrices. The objective is to find a set of W and H simultaneously, minimizing F. It can be found that the partial derivatives of F with respect to W and H are respectively
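Under the same assumed form F = ½‖V − WH‖² in the Frobenius norm, the two partial derivatives would be:

```latex
\frac{\partial F}{\partial W} = (WH - V)\,H^{\mathsf T},
\qquad
\frac{\partial F}{\partial H} = W^{\mathsf T}\,(WH - V)
```

Each derivative splits into a non-negative term and a subtracted non-negative term, which is what allows multiplicative update rules to keep W and H non-negative during iteration.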

Although the classical least squares method can obtain relatively accurate concentration coefficients, it is necessary to provide a spectrum for each pure component to ensure a good fitting effect, and for a biological sample containing at least several hundreds of components, it is difficult to provide a spectrum for each component. The non-negative matrix factorization algorithm has the advantage that a Raman spectrum of a pure component does not need to be provided, but the calculated spectrum is often not matched with the spectrum of an actual component, and relative concentration data of a target component cannot be accurately found. At present, no method capable of accurately analyzing a raman spectrum or an infrared spectrum exists in the prior art.

Metabolomics

Metabolomics is a discipline that emerged after genomics and proteomics and is an important component of systems biology. It mainly examines the dynamic changes in all small-molecule metabolites and their contents before and after a biological system is stimulated or perturbed. Through overall qualitative and quantitative analysis of all small-molecule metabolites in an organism, the relationship between metabolites and physiological and pathological changes can be explored. Research shows that metabolomics has important application value in early disease diagnosis, biomarker discovery, drug screening, toxicity evaluation, sports medicine, nutrition and other fields.

Nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry are the two most important analytical technologies for detecting metabolites. NMR spectroscopy is widely applied in metabolomics; its clear advantages are that it can observe many metabolites at once, is highly reproducible and non-destructive, and requires only a short measurement time. Low sensitivity, however, is an inherent disadvantage and the primary challenge for applying NMR in metabolomics research.

The mass spectrometry has the advantages of high sensitivity, strong specificity and the like, is widely applied to detecting the metabolic components, and can carry out qualitative and quantitative analysis on the metabolic components after separation and ionization treatment. However, mass spectrometry has been limited in its application because it does not allow direct detection of biological solutions or tissues.

Liquid chromatography-mass spectrometry (LC-MS) is also used for metabolome studies. In recent years, the LC-MS technology is further improved, and the detection application of large-scale samples is more and more. With the increase of the number of samples to be detected, a series of problems are generated, for example, the detection time of large-scale samples is long, and the machine has the situations of sensitivity reduction, retention time drift and the like in the long-time operation process.

Raman spectroscopy is a vibrational spectroscopy that can detect the structure of a compound and small changes in it. It does not damage the sample, requires only simple sample pretreatment and offers high spatial resolution, and it has been applied in clinical pathology research, classification and detection of microorganisms, compound analysis and other fields.

Disclosure of Invention

In view of the technical problems in the prior art, one object of the application is to establish a new spectral image unmixing algorithm, namely a weighted non-negative matrix factorization (NMF-CLS) algorithm, by combining the classical least squares method with non-negative matrix factorization, so as to obtain the relative concentrations corresponding to specific components while achieving a better fit.

It is another object of the present application to provide the use of weighted non-negative matrix algorithms in metabolomics.

In one aspect, the present application provides a spectral image unmixing method based on a weighted non-negative matrix factorization (NMF-CLS) algorithm, including: based on a standard spectrum database, unmixing the spectrum obtained by testing by adopting an NMF-CLS algorithm to obtain the type and relative concentration of the known molecules contained in the sample to be tested; the known molecules are molecules contained in a standard spectrum database, and the standard spectrum database consists of standard spectra of different molecules;

wherein the objective function of the NMF-CLS algorithm is set to:

where the spectrum matrix is an m × n matrix V representing n spectra in total, each consisting of m points; the m × r₁ matrix W⁽¹⁾ represents the reference spectra of the known molecules arranged in columns, and the m × r₂ matrix W⁽²⁾ represents the spectra of the unknown molecules arranged in columns; the r₁ × n matrix H⁽¹⁾ and the r₂ × n matrix H⁽²⁾ represent the relative concentrations corresponding to W⁽¹⁾ and W⁽²⁾, respectively; r₁ and r₂ denote the numbers of kinds of known molecules and unknown molecules, respectively; α represents the weight set for the known molecules, α ≥ 0. Since the reference spectra W⁽¹⁾ of the known molecules are known, the aim is to find the W⁽²⁾, H⁽¹⁾ and H⁽²⁾ that minimize F in the equation, which yields the relative concentrations corresponding to the known molecules.

In certain embodiments, H⁽¹⁾, W⁽²⁾ and H⁽²⁾ are calculated by an iterative process.

In certain embodiments, the partial derivatives of F with respect to W⁽²⁾, H⁽¹⁾ and H⁽²⁾ are:

The iterative formulas for W⁽²⁾, H⁽¹⁾ and H⁽²⁾ derived from the partial derivatives are:

H⁽¹⁾, W⁽²⁾ and H⁽²⁾ are updated iteratively according to the iterative formulas; iteration stops when the maximum number of iterations N is reached or F falls to the set threshold σ; after iteration stops, H⁽¹⁾ is the final result for the relative concentrations of the known components.

In certain embodiments, wherein the maximum number of iterations N is not less than about 20, about 25, about 30, about 40, about 50, about 100, or about 200.

In certain embodiments, wherein the threshold σ is no more than about 0.01, or about 0.001, or about 0.0001, or about 0.00001, or about 0.000001.

In certain embodiments, the calculation of H⁽¹⁾, W⁽²⁾ and H⁽²⁾ includes:

1) inputting the known-component matrix W⁽¹⁾, the measured spectrum matrix V, the maximum number of iterations N and the threshold σ;

2) randomly initializing the coefficient matrix H⁽¹⁾ of the known components, the spectrum matrix W⁽²⁾ of the unknown components and the coefficient matrix H⁽²⁾;

3) iteratively updating H⁽¹⁾, W⁽²⁾ and H⁽²⁾ according to the iterative formulas;

4) stopping iteration when the maximum number of iterations N is reached or F falls to the set threshold σ;

5) after iteration has stopped, taking H⁽¹⁾ as the final result for the relative concentrations of the known components.
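The five steps above can be sketched as follows. The objective function and iterative formulas are not reproduced in this text, so the sketch assumes a hypothetical weighted objective, F = ½‖V − W⁽¹⁾H⁽¹⁾ − W⁽²⁾H⁽²⁾‖² + (α/2)‖V − W⁽¹⁾H⁽¹⁾‖², with multiplicative updates derived from it. Neither is asserted to be the formula of this application, and the function names, the `eps` guard and the seed are likewise illustrative.

```python
import numpy as np

def assumed_objective(V, W1, H1, W2, H2, alpha):
    """Hypothetical weighted objective (an assumption, see the lead-in)."""
    full = V - W1 @ H1 - W2 @ H2
    known_only = V - W1 @ H1
    return 0.5 * np.sum(full ** 2) + 0.5 * alpha * np.sum(known_only ** 2)

def nmf_cls_sketch(V, W1, r2, alpha=0.0, max_iter=200, sigma=1e-6,
                   seed=0, eps=1e-12):
    """Steps 1-5 with assumed multiplicative updates; W1 stays fixed."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    r1 = W1.shape[1]
    # Step 2: random non-negative initialization.
    H1 = rng.random((r1, n)) + eps
    W2 = rng.random((m, r2)) + eps
    H2 = rng.random((r2, n)) + eps
    # Steps 3-4: iterate until max_iter is reached or F <= sigma.
    for _ in range(max_iter):
        recon = W1 @ H1 + W2 @ H2
        H1 *= ((1.0 + alpha) * (W1.T @ V)) / (
            W1.T @ recon + alpha * (W1.T @ W1 @ H1) + eps)
        recon = W1 @ H1 + W2 @ H2
        W2 *= (V @ H2.T) / (recon @ H2.T + eps)
        recon = W1 @ H1 + W2 @ H2
        H2 *= (W2.T @ V) / (W2.T @ recon + eps)
        if assumed_objective(V, W1, H1, W2, H2, alpha) <= sigma:
            break
    # Step 5: H1 holds the relative concentrations of the known components.
    return H1, W2, H2

# Synthetic test data: two known peaks plus one peak absent from the database.
x = np.linspace(0.0, 1.0, 120)

def peak(center):
    return np.exp(-((x - center) / 0.04) ** 2)

W1 = np.column_stack([peak(0.25), peak(0.6)])
data_rng = np.random.default_rng(3)
V = (W1 @ (data_rng.random((2, 8)) + 0.1)
     + peak(0.8)[:, None] @ (data_rng.random((1, 8)) + 0.1))
H1, W2, H2 = nmf_cls_sketch(V, W1, r2=1, alpha=0.5, max_iter=300, sigma=0.0)
```

With `alpha=0` and no unknown block, the assumed objective reduces to a non-negative form of classical least squares, which matches the limiting case described in the text.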

In certain embodiments, the standard spectrum is detected under the same conditions as the sample to be detected.

In certain embodiments, the weight α is inversely related to the ratio of the number of known molecules (r1) to the number of unknown molecules (r2) contained in the test sample.

In certain embodiments, when W⁽²⁾ and H⁽²⁾ are not present, the weight α is set to 0, and the objective function of the NMF-CLS algorithm is set as:

i.e. the classical least squares method.

In certain embodiments, when the ratio of the number of known molecules (r1) and the number of unknown molecules (r2) contained in the sample to be tested is not less than 1, the weight α is set to 0, and the objective function of the NMF-CLS algorithm is set to:

In certain embodiments, when the ratio of the number of known molecules (r1) to the number of unknown molecules (r2) contained in the sample to be tested is less than 1, the weight α is not 0, and the objective function of the NMF-CLS algorithm is set as:

In some embodiments, the method for setting the weight α includes:

1) determining the ratio of the number of known molecules to the number of unknown molecules in a sample to be detected;

2) preparing a plurality of simple samples containing a few known molecules at different concentration gradients and a certain number of unknown molecules, wherein the ratio of the number of known molecules to the number of unknown molecules in the simple samples equals that in the sample to be tested;

3) setting different weights α, unmixing the measured spectra of the simple samples with the NMF-CLS algorithm to obtain the coefficient corresponding to each known molecule, establishing a regression equation between the coefficients and the actual concentrations of the known molecules, and calculating the R-squared value of the regression equation; the α giving the highest R-squared is taken as the optimal weight for the sample to be tested.
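Step 3) amounts to a grid search over α scored by R-squared, which can be sketched as below. The unmixing routine is passed in as a caller-supplied function `unmix(V, W1, alpha)`, a hypothetical interface; averaging R-squared over the known molecules is also an assumption, and `fake_unmix` is only a synthetic stand-in used to exercise the selection logic.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of the simple linear regression y ~ a * x + b."""
    a, b = np.polyfit(x, y, 1)
    residual = y - (a * x + b)
    total = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - np.sum(residual ** 2) / total

def select_alpha(unmix, V, W1, true_conc, alphas):
    """Return the alpha with the highest mean R-squared, plus all scores.

    unmix(V, W1, alpha) must return H1 of shape (r1, n);
    true_conc holds the actual concentrations in the same shape.
    """
    scores = {}
    for alpha in alphas:
        H1 = unmix(V, W1, alpha)
        per_molecule = [r_squared(h, c) for h, c in zip(H1, true_conc)]
        scores[alpha] = float(np.mean(per_molecule))
    best = max(scores, key=scores.get)
    return best, scores

# Synthetic stand-in: coefficients track concentration best at alpha = 0.5.
true_conc = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])
distortion = np.array([[0.3, -0.4, 0.5, -0.2, 0.1]])

def fake_unmix(V, W1, alpha):
    return 2.0 * true_conc + abs(alpha - 0.5) * distortion

best_alpha, scores = select_alpha(fake_unmix, None, None, true_conc,
                                  [0.0, 0.5, 1.0])
```

In practice `unmix` would wrap the actual NMF-CLS routine and return only the coefficient matrix of the known molecules.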

In certain embodiments, the method for determining the ratio of the number of known molecules to the number of unknown molecules in the test sample comprises principal component analysis.

In certain embodiments, wherein the ratio of the number of molecules in the simple sample to the number of molecules in the test sample is no more than about 1/2, or about 1/5, or about 1/10.

In certain embodiments, wherein the number of known molecules in the simple sample ranges from 1 to 100, or 1 to 50, or 1 to 20, or 1 to 10.

In certain embodiments, wherein the number of known molecules in the simple sample ranges from 2 to 100, or 2 to 50, or 2 to 20, or 2 to 10.

In certain embodiments, wherein the standard spectra database establishment comprises: collecting a plurality of spectral images of a certain molecule, calculating to obtain an average spectrum of the molecule, obtaining average spectra of other molecules in the same way, and incorporating the average spectra into a standard spectrum database to obtain the standard spectrum database.

In certain embodiments, the concentration of a molecule is between 0.1mM and 10mM when a spectral image of the molecule is acquired.

In certain embodiments, the number of spectral images collected for a molecule is no less than about 10, about 20, about 50, about 100, or about 200.

In certain embodiments, further comprising normalizing the intensity of the obtained average spectrum of the molecules.

In certain embodiments, wherein the standard spectra database establishment comprises: collecting multiple spectrum images of a certain molecule, calculating to obtain the intensity of the average spectrum of the molecule, normalizing, wherein the obtained spectrum is the standard spectrum of the molecule, obtaining the standard spectra of other molecules in the same way, and incorporating the standard spectra into a standard spectrum database to obtain the standard spectrum database.

In certain embodiments, wherein the standard spectra database establishment comprises: collecting a plurality of spectral images of a certain molecule, averaging the obtained plurality of spectral images, normalizing the intensity of the spectrum to an interval [0,1], wherein the obtained spectrum is the standard spectrum of the molecule, obtaining the standard spectra of other molecules in the same way, and bringing the standard spectra into a standard spectrum database to obtain the standard spectrum database.
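Building such a database can be sketched as follows. The text specifies averaging and normalization to [0, 1] but not the scaling method, so min-max scaling is an assumption here; the function names and the molecule labels are hypothetical.

```python
import numpy as np

def standard_spectrum(spectra):
    """Average repeated spectra of one molecule, then scale to [0, 1].

    spectra: (k, m) array-like, k repeated measurements of m points each.
    """
    mean = np.mean(np.asarray(spectra, dtype=float), axis=0)
    span = mean.max() - mean.min()
    if span == 0:
        raise ValueError("a flat spectrum cannot be normalized")
    return (mean - mean.min()) / span   # min-max scaling (assumed)

def build_database(measurements):
    """Map {name: (k, m) spectra} to (names, W1) with W1 of shape (m, r1)."""
    names = sorted(measurements)
    W1 = np.column_stack([standard_spectrum(measurements[name])
                          for name in names])
    return names, W1

# Tiny made-up example with three-point "spectra".
demo = {
    "molecule_A": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    "molecule_B": [[0.0, 10.0, 0.0], [0.0, 20.0, 0.0]],
}
names, W1 = build_database(demo)
```

The resulting `W1` has one column per database molecule and can serve as the known-component matrix W⁽¹⁾.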

In some embodiments, the method includes collecting spectra of a plurality of samples to be tested, performing algorithm analysis on each spectrum separately, obtaining coefficients of different known components, processing the coefficients, and finally obtaining an analysis result of relative concentrations of different known molecules in the sample.

In certain embodiments, wherein the processing comprises: averaging, summing, ANOVA analysis, and/or student t-test.

In certain embodiments, the number of spectra collected from the test sample must be sufficient to ensure that substantially complete molecular information of the test sample has been collected.

In certain embodiments, methods for determining the number of spectra needed to collect substantially complete molecular information of the test sample include, but are not limited to, comparison of Pearson coefficients.

In certain embodiments, obtaining the Pearson coefficient comprises: taking the average of M spectra of the sample to be tested as a standard spectrum; taking N spectra at a time and averaging them; calculating the Pearson coefficient between this N-spectrum average and the standard spectrum; repeating the operation n times; and averaging the n Pearson coefficients to serve as the correlation coefficient corresponding to N spectra.
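One reading of this procedure is sketched below: a random subset of spectra is averaged, correlated with the average of all spectra, and the result is averaged over several repetitions. Drawing subsets at random without replacement is an assumption, and the parameter names `n_spectra` and `n_repeats` are illustrative.

```python
import numpy as np

def subsample_correlation(spectra, n_spectra, n_repeats, seed=0):
    """Mean Pearson r between subset averages and the full-set average.

    spectra: (M, m) array, M spectra of m points each from one sample.
    """
    rng = np.random.default_rng(seed)
    spectra = np.asarray(spectra, dtype=float)
    reference = spectra.mean(axis=0)   # average of all M spectra
    coefficients = []
    for _ in range(n_repeats):
        chosen = rng.choice(spectra.shape[0], size=n_spectra, replace=False)
        subset_mean = spectra[chosen].mean(axis=0)
        coefficients.append(np.corrcoef(subset_mean, reference)[0, 1])
    return float(np.mean(coefficients))

# Synthetic noisy spectra around one Gaussian band.
data_rng = np.random.default_rng(42)
base = np.exp(-((np.linspace(0.0, 1.0, 100) - 0.5) / 0.1) ** 2)
spectra = base + 0.2 * data_rng.standard_normal((60, 100))
r_small = subsample_correlation(spectra, n_spectra=5, n_repeats=10)
r_full = subsample_correlation(spectra, n_spectra=60, n_repeats=3)
```

Repeating the call for increasing `n_spectra` and picking the smallest value whose coefficient exceeds a chosen cutoff gives an estimate of how many spectra are enough.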

In certain embodiments, wherein M is about 50 to 500, or about 100 to 400, or about 200 to 300.

In certain embodiments, wherein n is from about 3 to 30, or from about 4 to 20, or from about 5 to 10.

In certain embodiments, wherein the Pearson coefficient is not less than about 0.8, about 0.85, about 0.9, or about 0.95.

In certain embodiments, wherein the sample to be tested comprises a chemical sample or a biological sample.

In certain embodiments, wherein the sample to be tested comprises a liquid sample.

In certain embodiments, the spectral image comprises an infrared spectrum and a raman spectrum.

In certain embodiments, the infrared spectroscopy comprises surface enhanced infrared spectroscopy.

In certain embodiments, the raman spectrum comprises a surface enhanced raman spectrum.

In certain embodiments, the surface enhanced raman spectrum is a broad spectrum surface enhanced raman spectrum.

In another aspect, the present application provides a method for analyzing based on surface enhanced raman spectroscopy, comprising the following steps: based on a Surface Enhanced Raman Spectroscopy (SERS) standard spectrum database, unmixing the spectrum obtained by testing by adopting a weighted nonnegative matrix factorization algorithm (NMF-CLS) to obtain the type and relative concentration of known molecules contained in a sample to be tested; the known molecules are molecules contained in an SERS standard spectrum database, and the SERS standard spectrum database consists of SERS standard spectra of different molecules;

wherein the objective function of the NMF-CLS algorithm is set to:

where the spectrum matrix is an m × n matrix V representing n spectra in total, each consisting of m points; the m × r₁ matrix W⁽¹⁾ represents the reference spectra of the known molecules arranged in columns, and the m × r₂ matrix W⁽²⁾ represents the spectra of the unknown molecules arranged in columns; the r₁ × n matrix H⁽¹⁾ and the r₂ × n matrix H⁽²⁾ represent the relative concentrations corresponding to W⁽¹⁾ and W⁽²⁾, respectively; r₁ and r₂ denote the numbers of kinds of known molecules and unknown molecules, respectively; α represents the weight set for the known molecules, α ≥ 0. Since the reference spectra W⁽¹⁾ of the known molecules are known, the aim is to find the W⁽²⁾, H⁽¹⁾ and H⁽²⁾ that minimize F in the equation, which yields the relative concentrations corresponding to the known molecules.

In certain embodiments, wherein SERS employs untargeted broad spectrum detection.

In some embodiments, the method for setting the weight α includes:

3) setting different weights α, unmixing the measured spectra of the simple samples with the NMF-CLS algorithm to obtain the coefficient corresponding to each known molecule, establishing a regression equation between the coefficients and the concentrations of the known molecules, and calculating the R-squared value of the regression equation; the α giving the highest R-squared is taken as the optimal weight for the sample to be tested.

In certain embodiments, wherein the standard spectra database establishment comprises: collecting a plurality of SERS spectrum images of a certain molecule, calculating to obtain an SERS average spectrum of the molecule, obtaining SERS average spectra of other molecules in the same way, and bringing the SERS average spectra into a standard spectrum database to obtain an SERS standard spectrum database.

In certain embodiments, the concentration of a molecule is between 0.1mM and 10mM when an image of the SERS spectrum of that molecule is taken.

In certain embodiments, the number of SERS spectral images collected for a molecule is no less than about 10, about 20, about 50, about 100, or about 200.

In certain embodiments, further comprising normalizing the intensity of the SERS average spectrum of the obtained molecule.

In certain embodiments, wherein the standard spectra database establishment comprises: collecting a plurality of SERS spectrum images of a certain molecule, calculating to obtain the intensity of the SERS average spectrum of the molecule and normalizing, wherein the obtained spectrum is the SERS standard spectrum of the molecule, and obtaining the SERS standard spectra of other molecules in the same way and bringing the SERS standard spectra into the SERS standard spectrum database to obtain the SERS standard spectrum database.

In certain embodiments, wherein the SERS standard spectral database establishment comprises: collecting a plurality of SERS spectral images of a certain molecule, averaging the plurality of SERS spectral images, normalizing the intensity of the average spectrum to an interval [0,1], obtaining the spectrum which is the SERS standard spectrum of the molecule, obtaining the SERS standard spectra of other molecules in the same way, and bringing the SERS standard spectra into a standard spectrum database to obtain the SERS standard spectrum database.

In some embodiments, the method comprises collecting SERS spectra of a plurality of samples to be measured, analyzing each SERS spectrum separately with the algorithm to obtain the coefficients H⁽¹⁾ of the known components, processing the coefficients, and finally obtaining the analysis result of the relative concentrations of the known molecules in the sample.

In certain embodiments, the number of spectra collected from the test sample must be sufficient to ensure that substantially complete molecular information of the test sample has been collected.

In certain embodiments, methods for determining the number of spectra needed to collect complete molecular information of the test sample include, but are not limited to, comparison of Pearson coefficients.

In certain embodiments, the number of spectra scanned from the sample to be tested is not less than about 20, or about 30, or about 40, or about 50.

In certain embodiments, the number of spectra scanned from the sample to be tested is from about 20 to 200, or from about 30 to 160, or from about 40 to 120, or from about 50 to 80.

In some embodiments, the SERS spectra of the sample to be measured are scanned at a speed of about 1 to 5 s per spectrum.

In certain embodiments, wherein the biological sample comprises a cell culture fluid, a cell supernatant, a cell lysate, blood, a blood-derived product, lymph, urine, tears, saliva, cerebrospinal fluid, stool, synovial fluid, sputum, a cell, an organ, or a tissue.

In certain embodiments, wherein the molecules in the SERS standard database comprise metabolites.

In certain embodiments, wherein the molecules in the SERS standard database comprise small molecule metabolites.

In another aspect, the present application provides a metabolomics data processing method, comprising: unmixing the spectral data of biological samples of the same type with the weighted nonnegative matrix factorization algorithm to obtain the types and relative concentration intervals of the known molecules in the samples, thereby obtaining a characteristic spectrum database of the biological samples.

In certain embodiments, the metabolomics data processing method further comprises: obtaining the types and relative concentration intervals of the known molecules in other types of biological samples in the same way, and incorporating them into the characteristic spectrum database to obtain a characteristic spectrum database containing different types of biological samples.

A metabolomic analysis method, comprising: based on a standard spectrum database, unmixing the spectrum of the sample to be tested obtained by testing by adopting an NMF-CLS algorithm to obtain the type and relative concentration of the metabolite contained in the sample to be tested, wherein the metabolite is a molecule contained in the standard spectrum database.

In certain embodiments, the spectrum of the sample to be tested is a broad spectrum SERS spectrum.

In certain embodiments, wherein the metabolite is a molecule contained within the SERS standard spectral database.

In certain embodiments, it further comprises performing relevant biomedical analyses based on the obtained species of metabolites and their relative concentrations.

In certain embodiments, wherein the biomedical analysis comprises analysis of differential metabolite data.

In certain embodiments, the biomedical analysis comprises comparison of the metabolite species and relative concentrations of the test sample to a signature spectrum database.

In certain embodiments, wherein the biomedical analysis further comprises classifying or staging the sample.

In certain embodiments, wherein the spectral data comprises raman spectral data and infrared spectral data.

In certain embodiments, wherein the infrared spectroscopy data comprises surface enhanced infrared spectroscopy data.

In certain embodiments, wherein the raman spectral data comprises surface enhanced raman spectral data.

In certain embodiments, wherein the standard spectra database comprises a raman spectra standard spectra database and an infrared spectra standard database.

In certain embodiments, wherein the raman spectroscopy standard spectral database comprises a SERS standard spectral database.

In another aspect, the present application provides a method of determining a biomarker, comprising:

1) obtaining spectral data of the sample group samples and of the control group samples, respectively, and, based on a standard spectrum database, unmixing the measured spectra with the weighted nonnegative matrix factorization algorithm to obtain the types and relative concentrations of the known molecules contained in the sample group samples and in the control group samples, wherein the known molecules are molecules contained in the standard spectrum database;

2) screening differential molecules as biomarkers.

In certain embodiments, the differential molecules comprise differential metabolites.

In certain embodiments, step 2) comprises cross-selecting a plurality of differential metabolites by analysis of variance (ANOVA) and logistic regression.

In certain embodiments, wherein the ANOVA analysis comprises a statistical analysis of the data from the different classes to identify metabolites in which statistical differences occur between the different classes.

In certain embodiments, wherein the logistic regression comprises using the relative concentration data for classification, finding metabolites that contribute to distinguishing data classes.

In some embodiments, wherein the logistic regression is normalized using L1, the absolute value weight is greater than 0 when classifying, which is considered to contribute to classification.
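The cross-selection described above can be illustrated by the following minimal sketch. It is not the implementation of the present application: the function names, the F-statistic cutoff `f_crit`, the penalty strength `lam`, and the proximal-gradient solver are all assumptions chosen for illustration, and the sketch handles only two classes labeled 0 and 1. Rows of `X` are samples and columns are relative concentrations of metabolites.

```python
import numpy as np

def anova_f(X, y):
    """One-way ANOVA F-statistic for each column of X, grouped by the labels y."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += Xc.shape[0] * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def l1_logistic_weights(X, y, lam=0.1, lr=0.1, n_iter=2000):
    """Binary logistic regression (y in {0, 1}) with an L1 penalty,
    fitted by proximal gradient descent on standardized features."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
        grad_w = Xs.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
        w = w - lr * grad_w
        # Soft-thresholding step: drives uninformative weights exactly to 0.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

def cross_select(X, y, f_crit=4.0, lam=0.1):
    """Indices of metabolites passing both criteria (f_crit is a hypothetical cutoff)."""
    f = anova_f(X, y)
    w = l1_logistic_weights(X, y, lam=lam)
    return [j for j in range(X.shape[1]) if f[j] > f_crit and abs(w[j]) > 0]
```

In practice the F cutoff would be derived from the F distribution at a chosen significance level rather than fixed, and a library solver could replace the hand-written loop.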

In certain embodiments, the method further comprises validating the obtained differential metabolites.

In certain embodiments, the validating comprises performing a regression analysis between the actual concentrations of the samples and the coefficients obtained by unmixing with the weighted non-negative matrix factorization algorithm.

In certain embodiments, the validating comprises analyzing whether the differential metabolites are consistent with the known physiology or pathology.

In another aspect, the present application provides a method of detecting the presence of, or assessing the risk of developing, a disease or condition, the method comprising the steps of:

1) obtaining the spectrum of an individual sample to be tested, and based on a standard spectrum database, unmixing the spectrum obtained by testing by adopting weighted nonnegative matrix factorization (NMF-CLS) to obtain the type and relative concentration of a known metabolite contained in the sample to be tested, wherein the known metabolite is a molecule contained in the standard spectrum database;

2) comparing the relative concentration of the known metabolite to a normal interval; and

3) determining whether the individual has, or is at risk for developing, a disease or condition.

In another aspect, the present application provides a method of determining the stage of a disease or condition, the method comprising the steps of:

2) comparing the relative concentration of the biomarker to a known stage level; and

3) determining the stage or type of the disease or disorder.

In certain embodiments, the disease or disorder is selected from the group consisting of: infectious diseases, proliferative diseases, neurodegenerative diseases, cancer, psychological diseases, metabolic diseases, autoimmune diseases, sexually transmitted diseases, gastrointestinal diseases, pulmonary diseases, cardiovascular diseases, stress and fatigue-related disorders, mycoses, pathogenic diseases and obesity-related disorders.

In another aspect, the present application provides a method of cell or microorganism analysis, the method comprising the steps of:

1) obtaining the spectral data of the sample to be tested and, based on the standard spectral database, unmixing the measured spectrum using weighted non-negative matrix factorization (NMF-CLS) to obtain the types and relative concentrations of the known metabolites contained in the sample to be tested, wherein the known metabolites are molecules contained in the standard spectral database;

3) determining the physiological or pathological state, physiological or pathological type of said cell or microorganism.

In certain embodiments, the method further comprises further screening the identified cells or microorganisms for a desired type of target cell or microorganism.

In certain embodiments, the spectrum of the sample to be tested comprises a broad spectrum SERS spectrum.

In another aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the aforementioned method.

In certain embodiments, the computer-readable storage medium further stores standard spectral database data.

In certain embodiments, the standard spectral database comprises a SERS standard spectral database.

In another aspect, the present application provides an apparatus comprising a memory storing a standard spectra database and a computer program, and a processor implementing the steps of the aforementioned method when executing the computer program.

In another aspect, the present application provides a spectral unmixing system based on a weighted non-negative matrix factorization algorithm, which includes a solution optimization module for solving the weighted non-negative matrix factorization algorithm by an iterative method to complete the unmixing of the spectral data.

In some embodiments, the system further includes a weight optimization module configured to solve for the weight of the known molecules by linear regression and to determine an optimal weight value.

In certain embodiments, the system further comprises an evaluation module for evaluating the unmixing results using the relative concentrations of known molecules.

In another aspect, the present application provides the use of the aforementioned computer-readable storage medium, the aforementioned device, or the aforementioned system in the manufacture of a device for the analysis of compounds and/or the classification and detection of microorganisms.

In another aspect, the present application provides the use of the aforementioned computer-readable storage medium, the aforementioned device, or the aforementioned system in the preparation of a device for metabolomic data processing and/or analysis.

In another aspect, the present application provides a metabolomics analysis device, comprising a data processing module for analyzing the acquired surface-enhanced Raman spectral data of the sample to be tested to obtain the types and relative concentrations of the metabolites in the sample.

In some embodiments, the data processing module includes a solution optimization module for solving the weighted non-negative matrix factorization algorithm by using an iterative method to complete spectral data unmixing.

In some embodiments, the data processing module includes a weight optimization module configured to solve for the weight of the known molecules by linear regression and to determine an optimal weight value.

In certain embodiments, the data processing module comprises an evaluation module for evaluating the unmixing results using the relative concentrations of known molecules.

In certain embodiments, the evaluating comprises classifying the test sample using a differential metabolite classification test model.

In some embodiments, the apparatus further includes a spectrum detection module, configured to perform spectrum detection on the sample to be detected, and obtain spectrum data of the sample to be detected.

In some embodiments, the device further comprises a test sample collection module for collecting a test sample based on a metabolomics approach.

Other aspects and advantages of the present application will be readily apparent to those skilled in the art from the following detailed description. Only exemplary embodiments of the present application are shown and described in the following detailed description. As those skilled in the art will recognize, the disclosure of the present application enables those skilled in the art to make changes to the specific embodiments disclosed without departing from the spirit and scope of the invention to which the present application relates. Accordingly, the drawings and the descriptions in the specification of the present application are illustrative only and not limiting.

Drawings

The specific features of the invention to which this application relates are set forth in the appended claims. The features and advantages of the invention to which this application relates will be better understood by reference to the exemplary embodiments described in detail below and the accompanying drawings. The drawings are briefly described as follows:

FIG. 1 is a flow chart showing the steps of the weighted non-negative matrix factorization algorithm described herein;

FIG. 2 shows SERS standard spectra of 89 metabolite molecules constructed in example 1 of the present application;

FIGS. 3a-3g show a flow chart of model validation and the fitting effect of SERS spectra after unmixing according to NMF-CLS in example 2 of the present application;

FIG. 4a shows the fitting effect of the SERS spectra of the sample in example 2 of the present application after unmixing according to CLS;

FIG. 4b shows the fitting effect of the SERS spectra of the sample in example 2 of the present application after unmixing by NMF;

FIG. 4c shows the spectrum of a known component calculated by NMF in example 2 of the present application;

FIGS. 5a-5c show the fitting effect of the model SERS spectra in example 3 after unmixing according to NMF-CLS;

FIGS. 6a-6c show the calculation of the number of spectra required for different cell samples in example 4 of the present application;

FIG. 7a shows the cell morphology change of each cell in example 4 of the present application;

FIG. 7b shows the fitting effect of the SERS spectra of each cell culture fluid after unmixing according to CLS in example 4 of the present application;

FIG. 7c shows a SERS spectral heat map obtained from 200 spectra of the DAY2 data of each cell culture solution in example 4 of the present application;

FIG. 7d is a graph showing the coefficient change curves of 8 different metabolites in each cell culture solution in example 4 of the present application;

FIGS. 8a-8c show the calculation of the number of spectra required for different serum samples according to example 5 of the present application;

FIG. 9 shows a SERS spectral heat map obtained by selecting 200 spectra from different serum samples in example 5 of the present application;

FIG. 10 shows the fitting effect of the SERS spectra of different serum samples according to NMF-CLS unmixing in example 5 of the present application;

FIG. 11 shows 16 differential metabolites screened from different serum samples according to example 5 of the present application;

FIGS. 12a-12c show the results of metabolomic classification and PSA screening in example 5 of the present application.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification.

Definition of terms

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Also, unless otherwise specified, except within the claims, the use of "or" includes "and" and vice versa. Non-limiting terms are not to be construed as limiting unless expressly stated or otherwise clearly indicated by the context (e.g., "comprising," "having," and "including" generally indicate "including, but not limited to"). Singular forms such as "a," "an," and "the" in the claims include plural referents unless expressly stated otherwise. To assist in understanding and preparing the invention, the following illustrative, non-limiting examples are provided.

In the present application, the term "metabolome" generally refers to the collection of all metabolites in a cell, tissue, organ or organism, and typically to the totality of small molecule metabolites having a relative molecular mass of less than about 1500 Da (Da: Dalton).

The term "small molecule metabolite" includes both organic and inorganic molecules that are present in cells, cellular compartments, or organelles and that typically have a molecular weight below 2000 or 1500. The term does not include macromolecules such as large proteins (e.g., proteins having a molecular mass of greater than 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, or 10,000), large nucleic acids (e.g., nucleic acids having a molecular mass of greater than 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, or 10,000), or large polysaccharides (e.g., polysaccharides having a molecular mass of greater than 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, or 10,000). Small molecule metabolites of cells are often found free in solution in the cytoplasm or in other organelles (e.g., mitochondria), where they form a pool of intermediates that can be further metabolized or used to produce large molecules (referred to as macromolecules). The term "small molecule metabolite" includes signal molecules and intermediates in the chemical reactions that convert energy from food into a useful form. Examples of small molecule metabolites include phospholipids, glycerophospholipids, lipids, plasmalogens, sugars, fatty acids, amino acids, nucleotides, intermediates formed during cellular processes, isomers, and other small molecules found within cells. In one embodiment, the small molecules of the invention are isolated. Preferred metabolites include lipids and fatty acids.

By way of non-limiting example, the small molecule metabolite may be selected from: 1, 3-uric acid dimethyl ester, levoglucosan, 1-methylnicotinamide, 2-hydroxyisobutyrate, 2-oxoglutarate, 3-aminoisobutyrate, 3-hydroxybutyrate, 3-hydroxyisovalerate, 3-indole sulfate, 4-hydroxyphenylacetate, 4-hydroxyphenyllactic acid, 4-pyridineoxolate, acetate, acetoacetate, acetone, adipate, alanine, allantoin, asparagine, betaine, carnitine, citrate, sarcocin, creatinine, dimethylamine, ethanolamine, formate, trehalose, fumarate, glucose, glutamine, glycine, hippurate, histidine, hypoxanthine, isoleucine, lactate, leucine, lysine, mannitol, N-dimethylglycine, arginine, glycine, arginine, glycine, arginine, glycine, arginine, glycine, arginine, glycine, arginine, glycine, arginine, lysine, glycine, O-acetylcarnitine, pantothenate, propylene glycol, pyroglutamate, pyruvate, quinolinate, serine, succinate, sucrose, taurine, threonine, trigonelline, trimethylamine-N-oxide, tryptophan, tyrosine, cytosine, uracil, urea, valine, xylose, aconitic acid, inositol, trans-aconitic acid, 1-methylhistidine, 3-methylhistidine, ascorbate, phenylacetylglutamine, 4-hydroxyproline, gluconate, galactose, galactitol, plant galactose, lactose, phenylalanine, proline betaine, trimethylamine, butyrate, propionate, isopropanol, mannose, 3-methylxanthine, ethanol, benzoate, glutamate, glycerol, guanosine, guanine, xanthine, adenine, uric acid, adenosine, arginine, and the like, Inosine, inosinic acid, CO2, H2O, N-carbamoyl-beta-alanine, ammonia, beta-aminoisobutyric acid, putrescine, spermidine, spermine, methionine, S-adenosylmethionine, decarboxylated S-adenosylmethionine, arginine, ornithine, putrescine, N1-acetylspermidine, N1-acetylspermine, elF5A (Lys), elF5A (Dhp), elF5A (84), N1N 2-diacetylspermine, 3-aminopropionaldehyde, 3-acetylaminopropionaldehyde, acrolein, FDP-lysine protein, threo-Ds-isocitrate, oxalyl-succinate, 2-oxo-glutarate, oxalyl-acetate, L-glutamate, 2-hydroxy-glutarate, 
acetyl-CoA, beta-alanine, ammonia, beta-aminoisobutyric acid, putrescine, spermine, spermidine, spermine, elF 5-3538 (Dhp), elF 5-Hpu), N-acetyl-arginine, 3-aminopropionaldehyde, acrolein, FDP-lysine, and L-glutamate, Cis-aconitic acid, D-isocitric acid salt,. alpha. -ketoglutarate salt, succinyl-CoA, malate, (-) O-acetyl-carnitine, itaconate salt, glycolate salt, glyoxylate salt, oxalate salt, oxalyl-CoA, formyloxy-CoA, glucose 6-phosphate (G6P), fructose 6-phosphate (F6P), fructose 1, 6-biphosphate (F1,6BP), glyceraldehyde 3-phosphate (GADP), dihydroxyacetone phosphate (DHAP), 1, 3-biphosphoglycerate (1,3BPG), 3-phosphoglycerate (3PG), 2-phosphoglycerate (2PG), phosphoenolpyruvate (PEP), D-glucose, D-glucono-1, 5-lactone, D-gluconate, alpha-D-mannose 6-P, D-mannose, D-fructose, D-sorbitol, glycerone-P, sn-glycero-3P, D-glyceraldehyde, 1,2 propane-diol, 2-hydroxypropanal, 3-P-serine, 3-P-hydroxyacetonate, D-glycerate, hydroxyacetonate, L-alanine, L-alanyl-tRNA, L-glutamate, 2-oxoglutarate, L-lactate, D-lactate, adenosine triphosphate (ADP), Adenosine Diphosphate (ADP), H +, succinate, O2, NADH, NAD +, NADPH, 6-phosphogluconolactone, 6-phosphogluconate, ribulose 5-phosphate, glucose-D-glucuronate, L-glyceraldehyde, L-1, 2-alanyl-tRNA, L-glutamate, L-oxoglutarate, L-lactate, ATP-lactate, adenosine triphosphate (ADP), H +, succinate, O2, NADH +, NAD +, NADPH +, 6-phosphogluconolactone, 6-phosphogluconate, ribulose 5-phosphate, glucose-D-phosphate, D-glucuronate, D-glucuronate, D, L-L, Ribose-5-phosphate, xylulose-5-phosphate, glyceraldehyde 3-phosphate, sedoheptulose 7-phosphate, fructose 6-phosphate, erythrose 4-phosphate, xylulose 5-phosphate, D-ribulose, D-ribitol, D-ribose, L-ribulose, sedoheptulose 1,7P2, 3-oxo-6-P-hexulose, L-ornithine, carbamyl phosphate, L-citrulline, argininosuccinic acid, L-arginine, L-aspartate, Adenosine Monophosphate (AMP), pyrophosphate, trans- Δ 2-enoyl-CoA, L- β -hydroxyalkyl-CoA, β -ketoethyl CoA, FADH2, acyl-CoA, propionyl-CoA, 
Adenosine Monophosphate (AMP), pyrophosphate, trans- Δ 2-enoyl-CoA, and pharmaceutically acceptable salts thereof, Inosine Monophosphate (IMP), Xanthosine Monophosphate (XMP), Guanosine Monophosphate (GMP), xanthosine, adenylosuccinic acid, Uridine Monophosphate (UMP), thymidine, thymine, deoxyribose-1-phosphate, deoxythymidine monophosphate (dTMP), deoxycytidine monophosphate (dCMP), retinyl palmitate, palmityl-CoA, isotretinoin, beta-glucuronide, retinal, beta-carotene, retinoic acid, calcifediol, 25-hydroxyergocalciferol, calcitriol, methylcobalamin, 5' -deoxyadenosylcobalamin, alpha-CECH, NH4+, alpha-ketoglutarate, oxaloacetate, gamma-semialdehyde, delta-1-pyrroline-5-carboxylate, citrulline, NH3, N glutamate 5, N10-methylene THF, alpha-ketoglutarate, oxaloacetate, gamma-semialdehyde, alpha-pyrroline-5-carboxylate, citrulline, N3, N glutamate 5, N10-methylene THF, 3-phosphoglycerate, alpha-ketobutyrate, alpha-amino-beta-ketobutyrate, aminoacetone, cysteine, beta-sulfinyl pyruvate, bisulfite, sulfite, sulfate, glutathione, hypotaurine, adenosine 5 ' -phosphosulfate, adenosine 3 ' -phosphosulfate, homocysteine, alpha-keto-beta-methylvalerate, alpha-keto isocaproate, alpha-keto isovalerate, alpha-methylbutyryl-CoA, methylcrotonyl-CoA, 3-methyl-3-hydroxybutyric acid acyl-CoA, 2-methylacetoacetyl-CoA, isovaleryl-CoA, 3-methylcrotonyl-CoA, 3-methylpentadienyl-CoA, alpha-amino-beta-ketobutyrate, aminoacetone, cysteine, beta-sulfinate, bisulfite, sulfite, sulfate, glutathione, hypotaurine, adenosine 5 ' -phosphosulfate, 3 ' -phosphoadenosine 5 ' -phosphosulfate, homocysteine, alpha-keto-isovalerate, alpha-ketobutyrate, alpha-methylbutyryl-CoA, 3-methyl-3-acetyl-CoA, 3-methylglutaryl-CoA, isovaleryl-CoA, 3-ketoglutaryl-CoA, 3-ketobutyrate, beta-ketobutyrate, L-CoA, L-D-CoA, L-D, L-D, L-CoA, L-D, L-CoA, L-CoA, L-D, L-CoA, L-D-CoA, L-D, L-CoA, L-D-L-D-CoA, L-D, L-D-L-, 3-hydroxy-3-methylglutaryl-CoA, acetoacetate, isobutyryl CoA, methylpropenyl-CoA, 
3-hydroxyisobutyryl-CoA, methylmalonate monoaldehyde, p-hydroxyphenylpyruvate, homogentisate, 4-maleylacetoacetate, 4-fumarylacetoacetate, fumarate, 3-hydroxytrimethyllysine, 4-trimethylaminobutyraldehyde, gamma-butyrobetaine, urocanic acid ester, 4-imidazolidinone-5-propionate, N-formimidomethyl-L-glutamate, N5-formimidomethyl-tetrahydrofolate, histamine, N-formimido-canine urea, kynurenic acid, 3-hydroxycanine urea, anthranilate, dihydrogenate, dihydrogennicotinate, dihydrogennicotinaldehyde, dihydrogencarbamate, or a salt of a, 3-hydroxyanthranilate, glutaryl-CoA, acetoacetyl-CoA, and combinations thereof.

In the present application, the term "biological sample" or "chemical sample" may include various biological samples or chemical samples suitable for observation (e.g., imaging) or examination. Chemical samples include any chemical mixture or compound. Biological samples include, but are not limited to, cell cultures or extracts thereof; biopsy material obtained from an animal (e.g., a mammal) or an extract thereof; and blood, saliva, urine, feces, semen, tears, or other body fluids or extracts thereof. For example, the term "biological sample" refers to any solid or fluid sample obtained from, excreted by, or secreted by any living organism, including unicellular microorganisms (such as bacteria and yeast) and multicellular organisms (such as plants and animals, e.g., vertebrates or mammals, and in particular healthy or apparently healthy human subjects or human patients affected by a condition or disease to be diagnosed or studied). The biological sample may be in any form, including solid materials such as tissues, cells, cell aggregates, cell extracts, cell homogenates, or cell fractions; or a biopsy, or a biological fluid. Biological fluids may be obtained from fluids obtained from any site (e.g., blood, saliva (or oral washes containing buccal cells), tears, plasma, serum, urine, biliary fluid, cerebrospinal fluid, amniotic fluid, peritoneal fluid and pleural fluid, or cells therefrom, aqueous or vitreous fluid, or any bodily secretion), exudate (e.g., fluids obtained from abscesses or any other site of infection or inflammation), or from joints (e.g., normal joints or joints affected by diseases such as rheumatoid arthritis, osteoarthritis, gouty or septic arthritis). The biological sample may be obtained from any organ or tissue (including biopsy or autopsy specimens) or may comprise cells (whether primary cells or cultured cells) or media conditioned by any cell, tissue, or organ. 
Biological samples may also include tissue sections, such as frozen sections taken for histological purposes. Biological samples also include mixtures of biomolecules including proteins, lipids, carbohydrates and nucleic acids produced by partially or completely fractionating cells or tissue homogenates. Although the sample is preferably taken from a human individual, the biological sample may be from any animal, plant, microorganism, cell, virus, yeast, etc.

In the present application, the term "subject" generally refers to humans as well as non-human animals at any developmental stage, including, for example, mammals, birds, reptiles, amphibians, fish, worms, and single cells. Cell cultures and biopsy (living tissue) samples are regarded as pluralities of animals. In certain exemplary embodiments, the non-human animal is a mammal (e.g., a rodent, a mouse, a rat, a rabbit, a monkey, a dog, a cat, a sheep, a cow, a primate, or a pig). The animal may be a transgenic animal or a human clone. If desired, the biological sample may be subjected to preliminary processing, including preliminary separation techniques.

In the present application, the terms "microbial" and "microorganism" include all microorganisms, including bacteria, viruses and fungi.

In the present application, the term "cell" generally refers to its meaning as generally recognized in the art. The cell may be prokaryotic (e.g., a bacterial cell) or eukaryotic (e.g., a mammalian or plant cell). The cells may be of somatic or germline origin, totipotent or pluripotent, dividing or non-dividing. The cell may also be derived from or may comprise a gamete or embryo, a stem cell, or a fully differentiated cell.

In the present application, the terms "disease" and "disorder" are used interchangeably and generally refer to any deviation of a subject from a normal state, such as any change in the state of the body or of certain organs, that prevents or disturbs the performance of a function and/or causes symptoms such as discomfort, dysfunction, distress or even death in the person afflicted or in those in contact with that person. A disease or disorder may also be referred to as a distemper, ailing, ailment, malady, sickness, illness, complaint, indisposition, or affectation. The term "staging" generally refers to identifying the particular stage to which a disease has progressed.

In this application, the term "Pearson correlation coefficient" or "Pearson coefficient" generally refers to the quotient of the covariance of two variables and the product of their standard deviations. The sign of the Pearson coefficient is interpreted as follows: a positive value means that the two variables are positively correlated, i.e., one increases as the other increases; a negative value means that they are negatively correlated, i.e., one decreases as the other increases. In the present application, the Pearson coefficient is used to determine the number of spectra required: the average of 200 spectra is used as the standard spectrum; N spectra are taken out and averaged, and the Pearson coefficient between this average and the standard spectrum is calculated; this operation is repeated 5 times, and the 5 Pearson coefficients are averaged to give the correlation coefficient corresponding to N spectra. The number of spectra is considered sufficient when the Pearson coefficient is greater than 0.8 or has converged.

The absolute value of the Pearson coefficient indicates the degree of correlation, according to the following intervals:

Absolute value of Pearson coefficient	Degree of correlation
0-0.2	Weak correlation
0.2-0.5	Moderate correlation
0.5-0.8	Strong correlation
0.8-1.0	Very strong correlation
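The procedure for judging the number of spectra can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and drawing the N spectra at random without replacement is an assumption, since the text states only that N spectra are taken out each time. Each row of `spectra` is one measured spectrum.

```python
import numpy as np

def pearson(a, b):
    """Pearson coefficient: covariance divided by the product of standard deviations."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def mean_pearson_for_n(spectra, n, repeats=5, rng=None):
    """Average Pearson coefficient between the mean of n spectra and the
    standard spectrum (the mean of all spectra), over `repeats` draws."""
    rng = np.random.default_rng() if rng is None else rng
    standard = spectra.mean(axis=0)
    coeffs = []
    for _ in range(repeats):
        # Assumption: each draw samples n distinct spectra at random.
        idx = rng.choice(spectra.shape[0], size=n, replace=False)
        coeffs.append(pearson(spectra[idx].mean(axis=0), standard))
    return float(np.mean(coeffs))
```

Evaluating `mean_pearson_for_n` for increasing values of N and checking when the result exceeds 0.8 or stops changing gives the sufficiency criterion described above.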

In this application, the term "about" generally means approximately, nearly, roughly, or around. When the term "about" is used in reference to a numerical value or range, it indicates that the actual value may differ from the recited value by as much as 10%. Thus, the term "about" can be used to encompass variations of ± 10% or less, variations of ± 5% or less, variations of ± 1% or less, variations of ± 0.5% or less, or variations of ± 0.1% or less from a particular value.

Detailed Description

Weighted non-negative matrix factorization algorithm (NMF-CLS)

In one aspect, the application provides a new spectral unmixing method, namely the weighted non-negative matrix factorization algorithm, which obtains the relative concentrations corresponding to specific components while achieving a better fitting effect. Building on the objective function of the original equation (1), a weight for the spectra of the known components is added, and the reference spectra of the unknown components are calculated by the algorithm, so that a better fit and the types and relative concentrations of the known components are obtained simultaneously. The weighted objective function is:

wherein the m × r1 matrix W⁽¹⁾ represents the reference spectra of the known components (known components in the standard database) arranged in columns, and the m × r2 matrix W⁽²⁾ represents the spectra of the unknown components arranged in columns, which are calculated by the algorithm. The r1 × n matrix H⁽¹⁾ and the r2 × n matrix H⁽²⁾ represent the relative concentrations corresponding to W⁽¹⁾ and W⁽²⁾, respectively, and α represents the weight set for the known components.

Since the reference spectra of the known components are accurate, the objective of the algorithm is to find W⁽²⁾, H⁽¹⁾ and H⁽²⁾ such that F in equation (4) is minimized. The partial derivatives of F with respect to W⁽²⁾, H⁽¹⁾ and H⁽²⁾ are:

Following the multiplicative update rule proposed by Lee and Seung (Lee, D.D.; Seung, H.S. Nature 1999, 401, 788), the iterative formulas for W⁽²⁾, H⁽¹⁾ and H⁽²⁾ are:

H⁽¹⁾, W⁽²⁾ and H⁽²⁾ are updated iteratively according to the iterative formulas, and the iteration stops when the maximum number of iterations N is reached or F falls to a set threshold σ. After the iteration stops, H⁽¹⁾ is the final result of the relative concentrations corresponding to the known components.

The algorithm involves the spectral matrix W⁽¹⁾ of the known components and its corresponding coefficient (relative concentration) matrix H⁽¹⁾, and the spectral matrix W⁽²⁾ of the unknown components and its corresponding coefficient (relative concentration) matrix H⁽²⁾. H⁽¹⁾, W⁽²⁾ and H⁽²⁾ are calculated by the algorithm in an iterative process. The implementation steps are as follows, and the flow chart is shown in FIG. 1:

2) randomly initializing the coefficient matrix H⁽¹⁾ of the known components, the spectral matrix W⁽²⁾ of the unknown components, and the coefficient matrix H⁽²⁾;

3) iteratively updating H⁽¹⁾, W⁽²⁾ and H⁽²⁾ according to the multiplicative update formulas given in the present application;

4) stopping the iteration when the maximum number of iterations N (e.g., 20) is reached or F falls below a set threshold σ (e.g., 0.000001);

5) after the iteration has stopped, taking H⁽¹⁾ as the final result of the coefficients (relative concentrations) of the known components.

W⁽¹⁾ and H⁽¹⁾ are, respectively, the spectra of the known components and the coefficients calculated for the known components. Note that each column of W⁽¹⁾ is the Raman spectrum of one component, and the corresponding row of H⁽¹⁾ contains the coefficients of that component in all spectra; that is, the i-th row of H⁽¹⁾ gives the coefficients of the i-th component of W⁽¹⁾. Therefore, for any known component, no further calculation is needed: it is only necessary to take out the row of H⁽¹⁾ corresponding to the target component.
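The iterative steps can be illustrated by the following minimal sketch. It assumes that the weighted objective takes the form F = ½‖V − αW⁽¹⁾H⁽¹⁾ − W⁽²⁾H⁽²⁾‖², which may differ from the form used in the present application, and it applies Lee-Seung style multiplicative updates derived for that assumed objective. The function name `nmf_cls`, its parameters, and the small constant `eps` are hypothetical. V is the m × n matrix of measured spectra arranged in columns.

```python
import numpy as np

def nmf_cls(V, W1, r2, alpha=1.0, max_iter=20, tol=1e-6, seed=0):
    """Weighted NMF with fixed known spectra W1 (assumed objective, see lead-in).

    Returns H1, W2, H2 and the history of the objective value F."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    r1 = W1.shape[1]
    eps = 1e-12                      # guards against division by zero
    # Step 2: random non-negative initialization.
    H1 = rng.random((r1, n))
    W2 = rng.random((m, r2))
    H2 = rng.random((r2, n))
    A = alpha * W1                   # weighted known spectra, never updated
    history = []
    for _ in range(max_iter):
        # Step 3: multiplicative updates keep all factors non-negative.
        R = A @ H1 + W2 @ H2
        H1 *= (A.T @ V) / (A.T @ R + eps)
        R = A @ H1 + W2 @ H2
        W2 *= (V @ H2.T) / (R @ H2.T + eps)
        R = A @ H1 + W2 @ H2
        H2 *= (W2.T @ V) / (W2.T @ R + eps)
        F = 0.5 * np.linalg.norm(V - A @ H1 - W2 @ H2) ** 2
        history.append(F)
        # Step 4: stop early once F falls below the threshold.
        if F < tol:
            break
    # Step 5: H1 holds the coefficients of the known components.
    return H1, W2, H2, history
```

With this layout, `H1[i]` is the row of coefficients of the i-th known component across all n spectra, so reading out a target component requires only indexing.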

In certain embodiments, the standard spectrum is measured under the same conditions as the sample to be tested. For example, both the standard spectrum and the metabolic sample are measured using broad-spectrum SERS.

In certain embodiments, the weight α is inversely related to the ratio of the number of known molecules (r1) to the number of unknown molecules (r2) contained in the sample to be tested. In some embodiments, the weight α is non-linearly related to this ratio.

i.e. the classical least squares method.
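For reference, classical least squares (CLS) unmixing fits the coefficients to a fixed set of reference spectra without estimating any unknown component. The following minimal sketch illustrates this with an unconstrained least-squares solve; the function name is hypothetical, and no non-negativity constraint is applied, which is an assumption of this illustration.

```python
import numpy as np

def cls_unmix(V, W):
    """Coefficients H minimizing ||V - W H|| for fixed reference spectra W.

    V: m x n measured spectra (columns); W: m x r reference spectra (columns)."""
    H, *_ = np.linalg.lstsq(W, V, rcond=None)
    return H
```

Because every measured spectrum is forced onto the reference spectra alone, any contribution from molecules missing from the database is absorbed into the fitted coefficients.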

In certain embodiments, the standard spectra database creation further comprises collecting open source data derived from spectra in other literature.

For example, establishing the standard spectral database may include: obtaining an average spectrum of a given molecule and normalizing its intensity to the interval [0,1], the resulting spectrum being the standard spectrum of that molecule; and obtaining the standard spectra of other molecules in the same way and incorporating them to form the standard spectral database. The average spectrum can be obtained from open source data, or by collecting a plurality of spectra of a standard substance and averaging them.
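The database construction described above can be sketched as follows. The sketch assumes min-max scaling as the way of normalizing to [0,1], which the text does not specify, and the function names and the dictionary layout are hypothetical.

```python
import numpy as np

def standard_spectrum(spectra):
    """Average several spectra of one molecule and scale the result to [0, 1]."""
    mean = np.asarray(spectra, dtype=float).mean(axis=0)
    span = mean.max() - mean.min()
    if span == 0:
        raise ValueError("a flat spectrum cannot be normalized")
    # Assumption: min-max scaling is used for the normalization.
    return (mean - mean.min()) / span

def build_database(measurements):
    """Map each molecule name to its standard spectrum."""
    return {name: standard_spectrum(s) for name, s in measurements.items()}
```

For instance, `build_database({"adenine": [[1, 3, 5], [3, 5, 7]]})` averages the two illustrative spectra to `[2, 4, 6]` and stores `[0.0, 0.5, 1.0]` under the key `"adenine"`.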

In some embodiments, the method includes collecting spectra of a plurality of samples to be tested, performing algorithm analysis on each spectrum separately, obtaining coefficients of different known components, and then performing processing (such as averaging, summing, ANOVA analysis and/or student t-test) to obtain an analysis result of the relative concentrations of different known molecules in the sample.

In certain embodiments, the number of spectra collected from the sample to be tested is about 50 to 200.

For example, a spectral image unmixing method based on the NMF-CLS algorithm may include: collecting a plurality of spectra of the sample to be tested; based on a standard spectral database, analyzing each spectrum individually using the NMF-CLS algorithm to obtain the coefficients of the known components for each spectrum; and then processing the coefficients (such as by averaging or summation) to obtain an analysis result of the relative concentrations of the known molecules in the sample. The known molecules are molecules contained in the standard spectral database, and the standard spectral database consists of the standard spectra of different molecules. The number of spectra collected must be sufficient to ensure that molecular information has been collected from substantially the entire sample to be tested (e.g., the number of spectra is about 50, about 100, or about 200).
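The final processing step can be illustrated by the following minimal sketch, which aggregates the per-spectrum coefficients of the known components by averaging or summation. The function name, the `how` parameter, and the dictionary output are hypothetical; `H1` is assumed to hold one row per known component and one column per spectrum.

```python
import numpy as np

def relative_concentrations(H1, names, how="mean"):
    """Aggregate coefficients across spectra into one value per known molecule."""
    H1 = np.asarray(H1, dtype=float)
    if how == "mean":
        agg = H1.mean(axis=1)
    elif how == "sum":
        agg = H1.sum(axis=1)
    else:
        raise ValueError("how must be 'mean' or 'sum'")
    return dict(zip(names, agg.tolist()))
```

Statistical comparisons between sample groups, such as ANOVA or a t-test, would operate on the per-spectrum rows of `H1` rather than on these aggregated values.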

For example, biological or chemical samples include biomolecules, nucleosides, nucleic acids, polynucleotides, oligonucleotides, proteins, enzymes, polypeptides, antibodies, antigens, ligands, receptors, polysaccharides, carbohydrates, polyphosphates, nanopores, organelles, lipid layers, tissues, organs, organisms, and bodily fluids. The term "biological or chemical sample" may include biologically active compound(s), such as analogs or mimetics of the foregoing species. As used herein, the term "biological sample" may include samples such as cell lysates, intact cells, organisms, organs, tissues, and bodily fluids. "Bodily fluids" may include, but are not limited to, blood, dry blood, blood clots, serum, plasma, saliva, cerebrospinal fluid, pleural fluid, tears, ductal fluid of the breast, lymph, sputum, urine, amniotic fluid, and semen. The sample may comprise a "non-cellular" bodily fluid. "Non-cellular bodily fluids" include less than about 1% (w/w) whole cell material. Plasma and serum are examples of non-cellular bodily fluids. The sample may comprise a sample of natural or synthetic origin (i.e., a sample of cells made acellular). In some embodiments, the biological sample may be from a human or non-human source. In some embodiments, the biological sample may be from a human patient. In some embodiments, the biological sample may be from a human neonate.

Setting of the weight alpha

In some embodiments, the method for setting the weight α includes:

For example, the number of known molecules in the simple sample may be 2 to 10, and the ratio of the number of molecules in the simple sample to the number of molecules in the sample to be tested does not exceed 1/2. In some embodiments, the number of known molecules in the simple sample may be 2, and the number of molecules in the sample to be tested may be 4, 5, 6, 7 or more.

Since the value of α is related to the ratio of the amounts of known and unknown components, in some embodiments, α can be calculated from a simple model that establishes the same ratio, as follows:

1. Calculating the total number of components in the complex sample by means such as principal component analysis, and, from the number of known components, calculating the ratio of the number of known components to the number of unknown components;

2. Selecting a small number (such as 3) of known molecules and obtaining their standard spectra, additionally setting a number of unknown molecules such that the number ratio of known to unknown molecules equals that of the complex sample, and manually preparing solutions containing the known molecules at different concentrations;

3. Unmixing the Raman spectra of the solutions using different α values, extracting the coefficients corresponding to the known molecules, establishing a regression equation between the coefficients and the concentrations, and calculating its R-squared value; the α giving the highest R-squared is taken as the optimal weight value suitable for the complex model.
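Step 3 amounts to a grid search over α scored by the R-squared of a linear regression. The sketch below is illustrative only: `select_alpha` and `r_squared` are hypothetical names, the unmixing routine is passed in as a callable `unmix(V, W1, alpha)` that returns the coefficients of the known molecules, and averaging R-squared over the known molecules is an assumption (the text does not say how several molecules are combined).

```python
import numpy as np

def r_squared(x, y):
    """R-squared of the straight-line fit y ~ slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def select_alpha(V, W1, concentrations, alphas, unmix):
    """Return the alpha whose unmixing coefficients track concentration best.

    V: (m, n) spectra of the n prepared solutions.
    W1: (m, r1) standard spectra of the known molecules.
    concentrations: (r1, n) prepared concentration of each known molecule.
    alphas: candidate weight values to try.
    unmix: callable (V, W1, alpha) -> (r1, n) coefficients (assumed interface).
    """
    concentrations = np.asarray(concentrations, dtype=float)
    best_alpha, best_score = None, -np.inf
    for alpha in alphas:
        H1 = np.asarray(unmix(V, W1, alpha), dtype=float)
        # ASSUMPTION: average the per-molecule R-squared values.
        score = np.mean([r_squared(concentrations[k], H1[k])
                         for k in range(concentrations.shape[0])])
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score
```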

Analysis method based on surface enhanced Raman spectroscopy

wherein the objective function of the NMF-CLS algorithm is set to:

In some embodiments, the method for setting the weight α includes:

For example, establishing the SERS standard spectrum database may include: obtaining an average SERS spectrum of a given molecule and normalizing its intensity to the interval [0,1], the resulting spectrum being the SERS standard spectrum of that molecule; obtaining SERS standard spectra of other molecules in the same way; and incorporating these spectra to form the SERS standard spectrum database. The average SERS spectrum can be obtained from open-source data, or by collecting a plurality of SERS spectral images of the standard substance and averaging them.

In some embodiments, the method comprises collecting a plurality of SERS spectra of the sample to be tested, performing algorithm analysis on each SERS spectrum separately to obtain the coefficients H⁽¹⁾ of the known components, and then processing the coefficients (e.g., by averaging, summing, ANOVA and/or Student's t-test) to arrive at an analysis result of the relative concentrations of known molecules in the sample.

For example, an analysis method based on surface enhanced Raman spectroscopy may include: collecting a plurality of SERS spectra of the sample to be tested; based on a SERS standard spectrum database, analyzing each SERS spectrum independently with the NMF-CLS algorithm to obtain the coefficients of the known components of each SERS spectrum; and then processing the coefficients (such as by averaging or summation) to obtain an analysis result of the relative concentrations of the known molecules in the sample. The known molecules are molecules contained in the SERS standard spectrum database, and the SERS standard spectrum database consists of standard spectra of different molecules. The number of spectra collected from the test sample must be sufficient to ensure that molecular information has been collected from substantially the entire test sample (e.g., about 50, about 100, or about 200 spectra).

In certain embodiments, the biological sample comprises a cell culture fluid, a cell supernatant, a cell lysate, blood, a blood-derived product (such as buffy coat, serum, or plasma), lymph, urine, tears, saliva, cerebrospinal fluid, stool, synovial fluid, sputum, a cell, an organ, or a tissue.

For example, the biological sample may be selected from the group consisting of: blood, plasma, urine, saliva, tears, and cerebrospinal fluid.

Metabolomics data processing or analysis method

In certain embodiments, the spectrum is a SERS spectrum.

For example, the same type of biological sample may be serum or cell culture fluid.

For example, the biological samples of the same type may be obtained from different sources, such as serum samples from different populations (healthy vs diseased population), or cell culture fluids from different cell types (normal vs tumor cells).

For example, the characteristic spectrum database may include a SERS characteristic spectrum database of serum samples of healthy or diseased people.

For example, the metabolomic analysis method may comprise: collecting a plurality of SERS spectra of the sample to be tested; based on a SERS standard spectrum database, analyzing each SERS spectrum independently with the NMF-CLS algorithm to obtain the coefficients of the known components of each SERS spectrum; and then processing the coefficients (such as by averaging or summation) to obtain an analysis result of the relative concentrations of the known molecules (metabolite molecules) in the sample. The known molecules are molecules contained in the SERS standard spectrum database, and the SERS standard spectrum database consists of standard spectra of different molecules. The number of spectra collected from the test sample must be sufficient to ensure that molecular information has been collected from substantially the entire test sample (e.g., about 50, about 100, or about 200 spectra).

2) screening differential molecules as biomarkers.

For example, the method of determining a biomarker may comprise:

1) respectively obtaining spectral data of sample-group samples and comparison-group samples, a plurality of spectra being collected for each sample; based on a standard spectral database, analyzing each SERS spectrum with the NMF-CLS algorithm to obtain the coefficients of the known components of each SERS spectrum, and then processing the coefficients (such as by averaging or summation) to obtain the types and relative concentrations of the known molecules contained in the sample-group samples and the comparison-group samples, respectively, the known molecules being molecules contained in the standard spectral database; the number of spectra collected for each test sample must be sufficient to ensure that molecular information has been collected from substantially the entire test sample (e.g., about 50, about 100, or about 200 spectra);

2) performing statistical analysis on the data of the different classes by ANOVA to find metabolites that differ statistically between the classes; performing logistic regression, which classifies using the relative concentration data to find metabolites that contribute to distinguishing the data classes, wherein the logistic regression uses L1 regularization and a metabolite is considered to contribute to the classification when the absolute value of its weight is greater than 0; and taking the intersection of the differential metabolites selected by ANOVA and by logistic regression as the biomarkers.

ANOVA and logistic regression screening

ANOVA performs statistical analysis on the data of the different classes to find metabolites that differ statistically between the classes. Given the data of the different classes as input, ANOVA returns the test level, i.e. the p-value. The smaller the p-value, the stronger the evidence of a difference between groups; the difference between groups is generally considered significant when the p-value is less than 0.05. The application calculates the p-value of each metabolite across the different classes from the relative concentration data of the metabolites; metabolites with a p-value less than 0.05 are considered significantly different and are retained.

Logistic regression classifies using the relative concentration data to find metabolites that contribute to distinguishing the different classes of data. When the logistic regression algorithm performs classification, it assigns a calculated weight to the relative concentration data of each metabolite; this weight reflects the importance of the metabolite in the classification, and the larger its absolute value, the higher the importance. Under L1 regularization, the algorithm sets the weights of metabolites of low importance to 0, so metabolites whose calculated weight has an absolute value greater than 0 are considered to contribute to the classification and are taken as potentially differential metabolites.

The intersection of the differential metabolites selected by ANOVA and by logistic regression is taken as the final metabolite screening result.
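The two-filter screening can be sketched with SciPy and scikit-learn. This is an illustrative sketch, not the applicant's implementation: the function name `screen_metabolites`, the regularization strength `C=1.0` and the `liblinear` solver are assumptions; only the p < 0.05 cut-off, the L1 penalty and the intersection rule come from the text.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.linear_model import LogisticRegression

def screen_metabolites(X, y, p_cut=0.05, C=1.0):
    """Keep metabolites selected by BOTH one-way ANOVA and L1 logistic regression.

    X: (n_samples, n_metabolites) relative concentrations.
    y: (n_samples,) class labels (e.g. healthy vs diseased).
    Returns the selected column indices, the ANOVA p-values and the LR weights.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)

    # One-way ANOVA per metabolite across the classes.
    pvals = np.array([
        f_oneway(*[X[y == c, j] for c in classes]).pvalue
        for j in range(X.shape[1])
    ])
    anova_hits = set(np.flatnonzero(pvals < p_cut).tolist())

    # L1-regularized logistic regression; unimportant weights shrink to exactly 0.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    lr_hits = set(np.flatnonzero(np.any(clf.coef_ != 0, axis=0)).tolist())

    return sorted(anova_hits & lr_hits), pvals, clf.coef_
```

How many weights the L1 penalty drives to zero depends on `C`, so in practice this parameter would need to be chosen for the data at hand.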

Use of

1) comparing the relative concentration of the known metabolite to a normal interval; and

For example, the method of detecting the presence of a disease or condition, or assessing the risk of developing a disease or condition, may comprise the steps of:

1) obtaining a surface enhanced Raman spectrum of an individual sample to be tested, and based on an SERS standard database, unmixing the spectrum obtained by testing by adopting weighted nonnegative matrix factorization (NMF-CLS) to obtain the type and relative concentration of a known metabolite contained in the sample to be tested, wherein the known metabolite is a molecule contained in the SERS standard database;

3) determining the stage or type of the disease or disorder.

For example, the method of determining the stage of a disease or disorder may comprise the steps of:

3) determining the stage or type of the disease or disorder.

1) obtaining spectral data (such as SERS spectrum) of a cell sample to be tested, and based on a standard spectral database (such as SERS standard), unmixing the spectrum obtained by testing by adopting weighted nonnegative matrix factorization (NMF-CLS) to obtain the type and relative concentration of a known metabolite contained in the sample to be tested, wherein the known metabolite is a molecule contained in the standard spectral database;

Computer readable storage medium, apparatus and system

For example, the metabolomic analysis device may comprise:

1) a data processing module for analyzing the acquired surface enhanced Raman spectrum data of the sample to be tested to obtain the types and relative concentrations of metabolites in the sample, the data processing module comprising:

i) a solving optimization module for solving the weighted nonnegative matrix factorization algorithm by an iterative method to complete the unmixing of the spectral data; ii) a weight optimization module for determining the optimal value of the weight of the known molecules by linear regression; iii) an evaluation module for evaluating the unmixing results using the relative concentrations of known molecules;

2) a spectrum detection module for performing spectral detection on the sample to be tested to obtain its spectral data;

3) a sample acquisition module for acquiring the sample to be tested based on a metabolomics method.

The application also discloses the following embodiments:

1. A spectral image unmixing method based on a weighted non-negative matrix factorization algorithm (NMF-CLS), comprising: based on a standard spectrum database, unmixing the spectrum obtained by testing by adopting an NMF-CLS algorithm to obtain the type and relative concentration of the known molecules contained in the sample to be tested; the known molecules are molecules contained in a standard spectrum database, and the standard spectrum database consists of standard spectra of different molecules;

wherein the objective function of the NMF-CLS algorithm is set to:

wherein the spectrum matrix V is an m × n matrix representing n spectra in total, each spectrum consisting of m points; the m × r₁ matrix W⁽¹⁾ represents the reference spectra of the known molecules arranged in columns, and the m × r₂ matrix W⁽²⁾ represents the spectra of the unknown molecules arranged in columns; the r₁ × n matrix H⁽¹⁾ and the r₂ × n matrix H⁽²⁾ represent the relative concentrations corresponding to W⁽¹⁾ and W⁽²⁾, respectively; r₁ and r₂ denote the number of known molecular species and the number of unknown molecular species, respectively; α represents the weight set for the known molecules, α ≥ 0; since the reference spectra W⁽¹⁾ of the known molecules are known, W⁽²⁾, H⁽¹⁾ and H⁽²⁾ are sought such that F in the equation is minimized, whereby the relative concentrations corresponding to the known molecules are obtained.

2. The method of embodiment 1, wherein H⁽¹⁾, W⁽²⁾ and H⁽²⁾ are calculated by an iterative process.

3. The method of any one of embodiments 1-2, wherein the partial derivatives of F with respect to W⁽²⁾, H⁽¹⁾ and H⁽²⁾ are:

H⁽¹⁾, W⁽²⁾ and H⁽²⁾ are iteratively updated according to the iterative formula; the iteration stops when the maximum number of iterations N is reached or F falls to a set threshold σ; after the iteration stops, H⁽¹⁾ is the final result for the relative concentration of each known component.
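The iterate-until-N-or-σ scheme can be illustrated with the sketch below. It rests on loudly stated assumptions, because the objective function and the update formulas are not reproduced in this text: the objective is ASSUMED to be F = ½‖V − W⁽¹⁾H⁽¹⁾ − W⁽²⁾H⁽²⁾‖² + (α/2)‖V − W⁽¹⁾H⁽¹⁾‖², a form chosen only because it is consistent with the special cases described in the embodiments (α = 0 gives an unweighted factorization, and removing W⁽²⁾H⁽²⁾ gives classical least squares), and the updates are standard multiplicative NMF-style updates derived from that assumed F. The names `nmf_cls` and `objective` are hypothetical.

```python
import numpy as np

def objective(V, W1, H1, W2, H2, alpha):
    """ASSUMED objective; the patent's actual formula is not shown in this text."""
    full = V - W1 @ H1 - W2 @ H2      # residual using known + unknown parts
    known = V - W1 @ H1               # residual using the known part only
    return 0.5 * np.sum(full ** 2) + 0.5 * alpha * np.sum(known ** 2)

def nmf_cls(V, W1, r2, alpha=0.0, max_iter=200, sigma=1e-6, seed=0):
    """Estimate H1, W2, H2 with W1 fixed, using multiplicative updates.

    V: (m, n) nonnegative spectra; W1: (m, r1) nonnegative reference spectra.
    r2: assumed number of unknown components.
    Stops after max_iter iterations (N) or once F <= sigma.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    r1 = W1.shape[1]
    H1 = rng.random((r1, n))
    W2 = rng.random((m, r2))
    H2 = rng.random((r2, n))
    eps = 1e-12  # guards against division by zero
    history = [objective(V, W1, H1, W2, H2, alpha)]
    for _ in range(max_iter):
        # Each update multiplies by (negative gradient part) / (positive part),
        # which keeps all factors nonnegative.
        H1 *= ((1 + alpha) * (W1.T @ V)) / (
            (1 + alpha) * (W1.T @ W1 @ H1) + W1.T @ W2 @ H2 + eps)
        H2 *= (W2.T @ V) / (W2.T @ (W1 @ H1 + W2 @ H2) + eps)
        W2 *= (V @ H2.T) / ((W1 @ H1 + W2 @ H2) @ H2.T + eps)
        F = objective(V, W1, H1, W2, H2, alpha)
        history.append(F)
        if F <= sigma:
            break
    return H1, W2, H2, history
```

The returned `history` makes it possible to check how F evolves over the iterations; reading "F is reduced to σ" as the condition F ≤ σ is itself an interpretation of the text.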

4. The method of embodiment 3, wherein the maximum number of iterations N is not less than about 20, about 25, about 30, about 40, about 50, about 100, or about 200.

5. The method of embodiment 3, wherein the threshold σ is no more than about 0.01, or about 0.001, or about 0.0001, or about 0.00001, or about 0.000001.

6. The method of any one of embodiments 1-5, wherein the calculation process of H⁽¹⁾, W⁽²⁾ and H⁽²⁾ includes:

7. The method according to any one of embodiments 1 to 6, wherein the detection conditions of the standard spectrum are the same as the detection conditions of the sample to be tested.

8. The method according to any one of embodiments 1 to 7, wherein the weight α is inversely related to the ratio of the number of known molecules (r1) and the number of unknown molecules (r2) contained in the sample to be tested.

9. The method of any one of embodiments 1-8, wherein, when W⁽²⁾ and H⁽²⁾ are not present, the weight α is set to 0, and the objective function of the NMF-CLS algorithm is set to:

i.e. the classical least squares method.
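For illustration, the classical least squares special case (no unknown components) reduces to solving V ≈ W⁽¹⁾H⁽¹⁾ for H⁽¹⁾. The sketch below uses NumPy's unconstrained least-squares solver; the function name `cls_unmix` is hypothetical, and note that this plain solver does not enforce non-negativity of the coefficients.

```python
import numpy as np

def cls_unmix(V, W1):
    """Classical least squares: minimize ||V - W1 @ H1||^2 over H1.

    V: (m, n) measured spectra in columns; W1: (m, r1) reference spectra.
    Returns H1 of shape (r1, n). Coefficients are NOT constrained to be >= 0.
    """
    H1, *_ = np.linalg.lstsq(W1, V, rcond=None)
    return H1
```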

10. The method according to any one of embodiments 1 to 8, wherein when the ratio of the number of known molecules (r1) and the number of unknown molecules (r2) contained in the sample to be tested is not less than 1, the weight α is set to 0, and the objective function of the NMF-CLS algorithm is set to:

11. The method according to any of embodiments 1-8, wherein when the ratio of the number of known molecules (r1) and the number of unknown molecules (r2) contained in the sample to be tested is less than 1, the weight α is not 0, and the objective function of the NMF-CLS algorithm is set as:

12. The method according to any one of embodiments 1 to 11, wherein the method of setting the weight α includes:

13. The method of embodiment 12, wherein the method of determining the ratio of the number of known molecules to the number of unknown molecules in the test sample comprises a principal component analysis method.

14. The method of any one of embodiments 12-13, wherein the ratio of the number of molecules in the simple sample to the number of molecules in the test sample is no more than about 1/2, or about 1/5, or about 1/10.

15. The method of any one of embodiments 12-14, wherein the number of known molecules in the simple sample ranges from 1 to 100, or from 1 to 50, or from 1 to 20, or from 1 to 10.

16. The method of any one of embodiments 12-15, wherein the number of known molecules in the simple sample ranges from 2 to 100, or from 2 to 50, or from 2 to 20, or from 2 to 10.

17. The method according to any of embodiments 1-16, wherein the standard spectral database building comprises: collecting a plurality of spectral images of a certain molecule, calculating to obtain an average spectrum of the molecule, obtaining average spectra of other molecules in the same way, and incorporating the average spectra into a standard spectrum database to obtain the standard spectrum database.

18. The method of embodiment 17, wherein the molecule is present at a concentration of 0.1mM to 10mM when the spectral image of the molecule is acquired.

19. The method of any one of embodiments 17-18, wherein the number of spectral images collected for a molecule is no less than about 10, about 20, about 50, about 100, or about 200.

20. The method of any one of embodiments 17-19, further comprising normalizing the intensity of the average spectrum of the obtained molecules.

21. The method according to any of embodiments 17-20, wherein the standard spectral database building comprises: collecting multiple spectrum images of a certain molecule, calculating to obtain the intensity of the average spectrum of the molecule, normalizing, wherein the obtained spectrum is the standard spectrum of the molecule, obtaining the standard spectra of other molecules in the same way, and incorporating the standard spectra into a standard spectrum database to obtain the standard spectrum database.

22. The method according to any one of embodiments 17-21, wherein the standard spectral database building comprises: collecting a plurality of spectral images of a certain molecule, averaging the obtained plurality of spectral images, normalizing the intensity of the spectrum to an interval [0,1], wherein the obtained spectrum is the standard spectrum of the molecule, obtaining the standard spectra of other molecules in the same way, and bringing the standard spectra into a standard spectrum database to obtain the standard spectrum database.

23. The method according to any one of embodiments 17-22, comprising collecting spectra of a plurality of samples to be tested, performing algorithm analysis on each spectrum separately, obtaining coefficients of different known components, and then processing the coefficients, and finally obtaining an analysis result of the relative concentrations of different known molecules in the sample.

24. The method of embodiment 23, wherein the processing comprises: averaging, summing, ANOVA analysis, and/or student t-test.

25. The method of embodiment 23, wherein collecting the number of spectra of the test sample is required to ensure that molecular information has been collected in substantially the entire test sample.

26. The method of embodiment 25, wherein the number of spectra sufficient to collect molecular information from substantially the entire test sample is determined by methods including, but not limited to, Pearson coefficient comparison.

27. The method of embodiment 26, wherein obtaining the Pearson coefficients comprises: taking the average of M spectra of the sample to be tested as a standard spectrum; each time, taking N spectra and averaging them, and calculating the Pearson coefficient between this average and the standard spectrum; repeating the operation n times; and averaging the n Pearson coefficients to serve as the correlation coefficient corresponding to N spectra.
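The Pearson-coefficient check can be sketched as follows. The sketch uses the descriptive names `subset_size` and `n_repeats` instead of single letters; drawing each subset at random without replacement, and the helper names `subset_correlation` and `spectra_needed`, are assumptions made for illustration.

```python
import numpy as np

def subset_correlation(spectra, subset_size, n_repeats, seed=0):
    """Mean Pearson coefficient between subset averages and the full average.

    spectra: (M, m) array, one collected spectrum per row.
    subset_size: how many spectra are averaged in each draw.
    n_repeats: how many draws are made and averaged.
    """
    rng = np.random.default_rng(seed)
    spectra = np.asarray(spectra, dtype=float)
    reference = spectra.mean(axis=0)  # average of all M spectra
    coeffs = []
    for _ in range(n_repeats):
        # ASSUMPTION: random draw without replacement.
        idx = rng.choice(len(spectra), size=subset_size, replace=False)
        subset_mean = spectra[idx].mean(axis=0)
        coeffs.append(np.corrcoef(subset_mean, reference)[0, 1])
    return float(np.mean(coeffs))

def spectra_needed(spectra, threshold=0.95, n_repeats=10, seed=0):
    """Smallest subset size whose mean Pearson coefficient reaches the threshold."""
    for size in range(1, len(spectra) + 1):
        if subset_correlation(spectra, size, n_repeats, seed) >= threshold:
            return size
    return None
```

A threshold such as 0.8, 0.85, 0.9 or 0.95 (the values listed in the embodiments) can be passed as `threshold` to estimate how many spectra are enough for a given sample.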

28. The method of embodiment 27, wherein said M is about 50 to 500, or about 100 to 400, or about 200 to 300.

29. The method of embodiment 27, wherein n is about 3 to 30, or about 4 to 20, or about 5 to 10.

30. The method of any one of embodiments 26-29, wherein the Pearson coefficient is not less than about 0.8, about 0.85, about 0.9, or about 0.95.

31. The method of any one of embodiments 1-30, wherein the sample to be tested comprises a chemical or biological sample.

32. The method of any of embodiments 1-31, wherein the sample to be tested comprises a liquid sample.

33. The method of any one of embodiments 1-32, wherein the spectral images comprise infrared spectra and Raman spectra.

34. The method of embodiment 33, wherein the infrared spectroscopy comprises surface enhanced infrared spectroscopy.

35. The method of embodiment 33, wherein the Raman spectroscopy comprises surface enhanced Raman spectroscopy.

36. The method of embodiment 35, wherein the surface enhanced Raman spectroscopy is broad-spectrum surface enhanced Raman spectroscopy.

37. An analysis method based on surface enhanced Raman spectroscopy comprises the following steps: based on a Surface Enhanced Raman Spectroscopy (SERS) standard spectrum database, unmixing the spectrum obtained by testing by adopting a weighted nonnegative matrix factorization algorithm (NMF-CLS) to obtain the type and relative concentration of known molecules contained in a sample to be tested; the known molecules are molecules contained in an SERS standard spectrum database, and the SERS standard spectrum database consists of SERS standard spectra of different molecules;

wherein the objective function of the NMF-CLS algorithm is set to:

38. The method of embodiment 37, wherein H⁽¹⁾, W⁽²⁾ and H⁽²⁾ are calculated by an iterative process.

39. The method of any one of embodiments 37-38, wherein the partial derivatives of F with respect to W⁽²⁾, H⁽¹⁾ and H⁽²⁾ are:

40. The method of embodiment 39, wherein the maximum number of iterations N is not less than about 20, about 25, about 30, about 40, about 50, about 100, or about 200.

41. The method of embodiment 39, wherein the threshold σ is not more than about 0.01, or about 0.001, or about 0.0001, or about 0.00001, or about 0.000001.

42. The method of any one of embodiments 37-41, wherein the calculation process of H⁽¹⁾, W⁽²⁾ and H⁽²⁾ includes:

43. The method of any one of embodiments 37-42, wherein the standard spectrum is detected under the same conditions as the sample to be tested.

44. The method of any one of embodiments 37-43, wherein SERS employs non-targeted broad spectrum detection.

45. The method according to any one of embodiments 37-44, wherein the weight α is inversely related to the ratio of the number of known molecules (r1) and the number of unknown molecules (r2) contained in the test sample.

46. The method of any of embodiments 37-45, wherein the weight α is set by a method comprising:

2) determining the ratio of the number of known molecules to the number of unknown molecules in a sample to be detected;

3) configuring a plurality of simple samples containing a few known molecules with different concentration gradients and a certain number of unknown molecules, wherein the number ratio of the known molecules to the unknown molecules in the simple samples is equal to that of the samples to be detected;

4) setting different weight alpha, unmixing the spectrum of the simple sample obtained by testing by adopting an NMF-CLS algorithm, obtaining the respective corresponding coefficients of the known molecules, establishing a regression equation by using the coefficients and the concentration of the known molecules, and calculating the R-square value of the regression equation to obtain the alpha of the highest R-square as the optimal weight value of the sample to be tested.

47. The method of embodiment 46, wherein the method of determining the ratio of the number of known molecules to the number of unknown molecules in the test sample comprises a principal component analysis method.

48. The method of any one of embodiments 46-47, wherein the ratio of the number of molecules in the simple sample to the number of molecules in the test sample is no more than about 1/2, or about 1/5, or about 1/10.

49. The method of any one of embodiments 46-48, wherein the number of known molecules in the simple sample ranges from 1 to 100, or from 1 to 50, or from 1 to 20, or from 1 to 10.

50. The method of any one of embodiments 46-49, wherein the number of known molecules in the simple sample ranges from 2 to 100, or from 2 to 50, or from 2 to 20, or from 2 to 10.

51. The method according to any one of embodiments 37-50, wherein said standard spectral database building comprises: collecting a plurality of SERS spectrum images of a certain molecule, calculating to obtain an SERS average spectrum of the molecule, obtaining SERS average spectra of other molecules in the same way, and bringing the SERS average spectra into a standard spectrum database to obtain an SERS standard spectrum database.

52. The method of embodiment 51, wherein the molecule has a concentration of 0.1mM to 10mM when the SERS spectrum image of the molecule is collected.

53. The method of any one of embodiments 51-52, wherein the number of SERS spectral images of a molecule collected is no less than about 10, about 20, about 50, about 100, or about 200.

54. The method according to any one of embodiments 51-53, further comprising normalizing the intensity of the obtained SERS average spectrum of the molecule.

55. The method according to any of embodiments 51-54, wherein said standard spectral database building comprises: collecting a plurality of SERS spectrum images of a certain molecule, calculating to obtain the intensity of the SERS average spectrum of the molecule and normalizing, wherein the obtained spectrum is the SERS standard spectrum of the molecule, and obtaining the SERS standard spectra of other molecules in the same way and bringing the SERS standard spectra into the SERS standard spectrum database to obtain the SERS standard spectrum database.

56. The method according to any one of embodiments 51-55, wherein said SERS standard spectral database establishing comprises: collecting a plurality of SERS spectral images of a certain molecule, averaging the plurality of SERS spectral images, normalizing the intensity of the average spectrum to an interval [0,1], obtaining the spectrum which is the SERS standard spectrum of the molecule, obtaining the SERS standard spectra of other molecules in the same way, and bringing the SERS standard spectra into a standard spectrum database to obtain the SERS standard spectrum database.

57. The method according to any of embodiments 37-56, comprising collecting a plurality of SERS spectra of the sample to be tested, performing algorithm analysis on each SERS spectrum separately to obtain the coefficients H⁽¹⁾ of the known components, and then processing the coefficients to finally obtain an analysis result of the relative concentrations of the known molecules in the sample.

58. The method of embodiment 57, wherein the processing comprises: averaging, summing, ANOVA analysis, and/or student t-test.

59. The method of any one of embodiments 37-58, wherein the number of spectra collected from the test sample must be sufficient to ensure that molecular information has been collected from substantially the entire test sample.

60. The method of embodiment 59, wherein the number of spectra sufficient to collect molecular information from substantially the entire test sample is determined by methods including, but not limited to, Pearson coefficient comparison.

61. The method of embodiment 60, wherein obtaining the Pearson coefficients comprises: taking the average of M spectra of the sample to be tested as a standard spectrum; each time, taking N spectra and averaging them, and calculating the Pearson coefficient between this average and the standard spectrum; repeating the operation n times; and averaging the n Pearson coefficients to serve as the correlation coefficient corresponding to N spectra.

62. The method of embodiment 61, wherein said M is about 50 to 500, or about 100 to 400, or about 200 to 300.

63. The method of embodiment 61, wherein n is about 3 to 30, or about 4 to 20, or about 5 to 10.

64. The method of any one of embodiments 60-63, wherein the Pearson coefficient is not less than about 0.8, about 0.85, about 0.9, or about 0.95.

65. The method of any one of embodiments 59-64 wherein the sample to be tested is scanned for a number of spectra of not less than about 20, or about 30, or about 40, or about 50.

66. The method of any one of embodiments 59-65, wherein the number of spectra scanned from the sample to be tested is about 20 to 200, or about 30 to 160, or about 40 to 120, or about 50 to 80.

67. The method of any one of embodiments 37-66, wherein the SERS spectra of the sample to be tested are scanned at a speed of about 1-5 s per spectrum.

68. The method of any one of embodiments 37-67, wherein said sample to be tested comprises a chemical or biological sample.

69. The method of any one of embodiments 37-68, wherein the sample to be tested comprises a liquid sample.

70. The method of embodiment 68, wherein said biological sample comprises a cell culture fluid, a cell supernatant, a cell lysate, blood, a blood-derived product, lymph, urine, tears, saliva, cerebrospinal fluid, stool, synovial fluid, sputum, a cell, an organ, or a tissue.

71. The method according to any one of embodiments 37-70, wherein the molecules in the SERS standards database comprise metabolites.

72. The method of embodiment 71, wherein the molecules in the SERS standards database comprise small molecule metabolites.

73. A metabolomics data processing method, comprising: unmixing the spectral data of biological samples of the same type by a weighted nonnegative matrix factorization algorithm to obtain the types and relative concentration intervals of the known molecules in the samples, thereby obtaining a characteristic spectrum database of the biological samples.

74. The method of embodiment 73, further comprising: obtaining the types and relative concentration intervals of the known molecules in other types of biological samples in the same way, and incorporating them into the characteristic spectrum database to obtain a characteristic spectrum database covering different types of biological samples.

75. A metabolomic analysis method, comprising: based on a standard spectrum database, unmixing the spectrum of the sample to be tested obtained by testing by adopting an NMF-CLS algorithm to obtain the type and relative concentration of the metabolite contained in the sample to be tested, wherein the metabolite is a molecule contained in the standard spectrum database.

76. The method of embodiment 75, further comprising performing relevant biomedical analyses based on the obtained species of metabolites and their relative concentrations.

77. The method of embodiment 76, wherein said biomedical analysis comprises analysis of differential metabolite data.

78. The method of embodiment 76, wherein said biomedical analysis comprises analysis by comparing the metabolite species and relative concentrations of a test sample to a signature spectrum database.

79. The method of embodiment 76, wherein the biomedical analysis further comprises classifying or staging the sample.

80. A method of determining a biomarker, comprising:

1) respectively obtaining spectral data of samples from a sample group and a comparison group, and, based on a standard spectral database, unmixing the measured spectra using a weighted non-negative matrix factorization algorithm to obtain the types and relative concentrations of the known molecules contained in the sample-group and comparison-group samples, wherein the known molecules are molecules contained in the standard spectral database;

2) screening differential molecules as biomarkers.

81. The method of embodiment 80, wherein the differential molecules comprise differential metabolites.

82. The method of embodiment 80, wherein step 2) comprises cross-selecting a plurality of differential metabolites by analysis of variance (ANOVA) and logistic regression.

83. The method of embodiment 82, wherein said ANOVA analysis comprises statistical analysis of data from different classes to find metabolites where statistical differences occur between different classes.

84. The method of embodiment 82, wherein the logistic regression comprises using relative concentration data for classification, finding metabolites that contribute to distinguishing data classes.

85. The method of embodiment 84, wherein the logistic regression uses L1 regularization, and a metabolite whose weight has an absolute value greater than 0 is considered to contribute to the classification.

86. The method of embodiment 80, further comprising validating the obtained differential metabolites.

87. The method of embodiment 86, wherein said validating comprises performing a regression analysis of the actual concentrations of the samples with coefficients unmixed by a weighted non-negative matrix factorization algorithm.

88. The method of embodiment 86, wherein said validating comprises analyzing whether the differential metabolites are consistent with physiology or pathology.

89. A method of detecting the presence of, or assessing the risk of developing, a disease or condition, the method comprising the steps of:

90. A method of determining the stage of a disease or condition, the method comprising the steps of:

3) determining the stage or type of the disease or disorder.

91. The method according to any one of embodiments 89-90, wherein the disease or disorder is selected from the group consisting of: infectious diseases, proliferative diseases, neurodegenerative diseases, cancer, psychological diseases, metabolic diseases, autoimmune diseases, sexually transmitted diseases, gastrointestinal diseases, pulmonary diseases, cardiovascular diseases, stress and fatigue-related disorders, mycoses, pathogenic diseases and obesity-related disorders.

92. A method of cell or microorganism analysis, the method comprising the steps of:

1) obtaining spectral data of a cell sample to be tested, and based on a standard spectral database, unmixing the spectrum obtained by testing by adopting weighted nonnegative matrix factorization (NMF-CLS) to obtain the type and relative concentration of known metabolites contained in the sample to be tested, wherein the known metabolites are molecules contained in the standard spectral database;

93. The method of embodiment 92, further comprising screening the identified cells or microorganisms for a desired type of target cell or microorganism.

94. The method of any one of embodiments 73-93, wherein the spectroscopic data comprises Raman spectroscopic data and infrared spectroscopic data.

95. The method of embodiment 94, wherein said infrared spectroscopy data comprises surface enhanced infrared spectroscopy data.

96. The method of embodiment 94, wherein the Raman spectral data comprises surface enhanced Raman spectral data.

97. The method of any one of embodiments 73-93, wherein the standard spectral database comprises a Raman standard spectral database and an infrared standard spectral database.

98. The method of embodiment 97, wherein the Raman standard spectral database comprises a SERS standard spectral database.

99. The method of any of embodiments 93-98, wherein the spectrum of the sample under test comprises a broad spectrum SERS spectrum.

100. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of embodiments 1 to 99.

101. The computer readable storage medium of embodiment 100, further having standard spectral database data stored thereon.

102. The computer readable storage medium of embodiment 101, the standard spectral database comprising a SERS standard spectral database.

103. An apparatus comprising a memory storing a standard spectral database and a computer program, and a processor implementing the steps of the method of any of embodiments 1 to 99 when executing the computer program.

104. A spectral unmixing system based on a weighted non-negative matrix factorization algorithm, comprising: a solving optimization module for solving the weighted non-negative matrix factorization algorithm by an iterative method to complete spectral data unmixing.

105. The system of embodiment 104, further comprising a weight optimization module that solves for the weight of the known molecules using linear regression to determine the optimal weight value.

106. The system of embodiment 104, further comprising an evaluation module for evaluating the unmixing results using the relative concentrations of known molecules.

107. Use of the computer readable storage medium of any one of embodiments 100-102, the device of embodiment 103, or the system of any one of embodiments 104-106 in the preparation of a device for analysis of a compound and/or classification and detection of a microorganism.

108. Use of the computer readable storage medium of any one of embodiments 100-102, the device of embodiment 103, or the system of any one of embodiments 104-106 in the preparation of a device for metabolomic data processing and/or analysis.

109. A metabolomic analysis device, comprising: a data processing module for analyzing the acquired spectral data of the sample to be tested to obtain the types and relative concentrations of the metabolites in the sample.

110. The apparatus of embodiment 109, wherein the data processing module comprises a solving optimization module configured to solve the weighted non-negative matrix factorization algorithm using an iterative method to complete spectral data unmixing.

111. The apparatus according to embodiment 109, wherein the data processing module comprises a weight optimization module configured to solve for the weight of the known molecules by linear regression to determine the optimal weight value.

112. The apparatus of embodiment 109, wherein the data processing module comprises an evaluation module for evaluating the unmixing results using the relative concentrations of known molecules.

113. The device of embodiment 112, wherein said evaluating comprises classifying the test sample using a differential metabolite classification test model.

114. The apparatus according to embodiment 109, further comprising a spectrum detection module configured to perform spectrum detection on the sample to be detected, so as to obtain spectrum data of the sample to be detected.

115. The device of embodiment 109, further comprising a test sample collection module configured to collect a test sample based on a metabolomics approach.

Without wishing to be bound by any theory, the following examples are intended only to illustrate the methods, uses, etc. of the present application and are not intended to limit the scope of the invention of the present application.

Examples

Example 1 Establishment of the SERS standard database

The SERS standard database comprises the SERS spectra of 89 metabolite molecules in total (Fig. 2). The SERS spectra of 57 of these molecules are taken from open-source data in the literature; the spectra of the remaining 32 were measured in the laboratory by SERS testing of purchased metabolite standards (purity 98% or above), as follows:

(1) dissolving the metabolite molecule in water at a given concentration (0.1 mM-10 mM) and mixing it with silver nanoparticles in a given proportion;

(2) measuring the SERS spectrum of the mixed sample, where the test parameters can be: 638 nm laser, integration time 1 second, 201 Raman spectra collected in total;

(3) to prevent intensity differences between standard spectra from making the unmixed coefficients non-uniform, averaging the 201 spectra obtained for the sample and normalizing the spectral intensity to the interval [0,1]; the resulting spectrum is the standard SERS spectrum of the metabolite molecule and is included in the SERS standard database;

(4) performing the same procedure for the other purchased metabolite standards;

(5) the SERS standard database can be continuously expanded.
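Step (3) above can be illustrated with a minimal Python sketch. The function name `build_standard_spectrum` is hypothetical, and min-max scaling is an assumption: the text only states that the intensity is normalized to [0,1], not how.

```python
import numpy as np

def build_standard_spectrum(spectra):
    """Average repeated SERS acquisitions and scale the mean spectrum to [0, 1].

    spectra: array of shape (n_acquisitions, n_points), e.g. (201, m).
    Min-max scaling is an assumption; the source only says "[0,1]".
    """
    spectra = np.asarray(spectra, dtype=float)
    mean_spectrum = spectra.mean(axis=0)          # average the repeated spectra
    lo, hi = mean_spectrum.min(), mean_spectrum.max()
    if hi == lo:
        raise ValueError("a flat spectrum cannot be normalized")
    return (mean_spectrum - lo) / (hi - lo)       # lowest point 0, highest point 1
```

Scaling every reference spectrum to the same range means that a coefficient obtained for one molecule is on a comparable intensity scale to that of another, which is the stated purpose of the normalization.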

Example 2 Model verification one

2.1 Model validation

2.1.1 Establishment of the SERS database used for model validation:

for model validation, the database was designed to contain only two known molecules, cysteine and arginine.

2.1.2 Preparation of the samples to be tested:

as shown in Fig. 3a, 4 samples were prepared and numbered (i) to (iv). The arginine concentration decreases with increasing sample number and the cysteine concentration increases with increasing sample number. Each sample also contains five other substances (not contained in the SERS standard database used for model validation) at unknown concentrations;

2.1.3 SERS spectroscopy test:

the prepared sample was mixed with silver particles and then tested by Raman spectroscopy, with the following test parameters: laser wavelength 638 nm, laser power 100%, integration time 1 s, 10x objective, 201 Raman spectra per sample. The 201 measured spectra were averaged to obtain the average spectrum of the measured SERS spectra.

In Fig. 3b, i is the SERS standard spectrum of cysteine, iii is the SERS standard spectrum of arginine, and ii is a randomly selected portion of the SERS spectra of the background.

2.1.4 weight optimization:

the α value is adjusted and the coefficients of cysteine and arginine are calculated for the different samples at each value. The coefficients are compared with the known standard concentrations to build a regression curve, and the goodness of fit (R² value) of the regression curve is used as the criterion: the α value giving the best regression is selected as the weight for subsequent analysis. Here α is set to 0.5, and the weighted objective function is as follows:

2.1.5 Algorithm analysis:

using the weight determined in 2.1.4, each of the 201 scanned spectra was unmixed by the SERS-database-based algorithm to obtain the cysteine and arginine coefficients for each SERS spectrum. These coefficients were averaged to give the average coefficient (i.e., relative content) of the known molecules (cysteine and arginine) contained in the SERS database used for model validation. The results are shown in Table 1 and Fig. 3c-3e.
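The unmixing step can be sketched in Python as follows. This is an illustration under stated assumptions, not the applicant's implementation: the objective function is not reproduced in this text, so the form F = ||V - W1 H1 - W2 H2||² + α ||V - W1 H1||² is assumed (it reduces to unweighted factorization at α = 0), and Lee-Seung-style multiplicative updates are swapped in for the iteration. The function name `nmf_cls_unmix` and the synthetic data are hypothetical.

```python
import numpy as np

def nmf_cls_unmix(V, W1, r2, alpha=0.5, n_iter=500, eps=1e-12, seed=0):
    """Weighted NMF sketch with fixed reference spectra of the known molecules.

    ASSUMED objective (not shown in the source text):
        F = ||V - W1 H1 - W2 H2||_F^2 + alpha * ||V - W1 H1||_F^2
    V  : (m, n) non-negative measured spectra, one spectrum per column
    W1 : (m, r1) non-negative reference spectra of known molecules (kept fixed)
    r2 : number of unknown components to model
    """
    V = np.asarray(V, dtype=float)
    W1 = np.asarray(W1, dtype=float)
    m, n = V.shape
    rng = np.random.default_rng(seed)
    H1 = rng.random((W1.shape[1], n)) + eps
    W2 = rng.random((m, r2)) + eps
    H2 = rng.random((r2, n)) + eps

    def loss():
        known = W1 @ H1
        return (np.linalg.norm(V - known - W2 @ H2) ** 2
                + alpha * np.linalg.norm(V - known) ** 2)

    history = [loss()]
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        full = W1 @ H1 + W2 @ H2
        H1 *= ((1 + alpha) * (W1.T @ V)) / (W1.T @ full + alpha * (W1.T @ W1 @ H1) + eps)
        full = W1 @ H1 + W2 @ H2
        H2 *= (W2.T @ V) / (W2.T @ full + eps)
        full = W1 @ H1 + W2 @ H2
        W2 *= (V @ H2.T) / (full @ H2.T + eps)
        history.append(loss())
    return H1, W2, H2, history

# Synthetic demonstration data (random, not from the source).
rng = np.random.default_rng(1)
W1_demo = rng.random((60, 2))                        # 2 "known" reference spectra
V_demo = W1_demo @ rng.random((2, 20)) + rng.random((60, 3)) @ rng.random((3, 20))
H1, W2, H2, history = nmf_cls_unmix(V_demo, W1_demo, r2=3, alpha=0.5)
relative_content = H1.mean(axis=1)                   # average coefficient per known molecule
```

Averaging the rows of `H1` over all spectra corresponds to the average coefficient (relative content) described above.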

The thin black lines in Fig. 3c are the average measured SERS spectra of the four samples mixed at different concentrations. The colored lines are the sum of the products of the reference spectra and the corresponding coefficients fitted by the algorithm, i.e., the reconstructed average of the fitted SERS spectra. The two nearly coincide, which shows a good fit and indicates that the algorithm has decomposed the spectra effectively.

Fig. 3d shows representative spectra randomly selected from the four samples, which again show a good fit.

Table 1 Coefficients of cysteine and arginine in model validation

As shown in Table 1 and Fig. 3e, the coefficients of cysteine and arginine calculated for the four samples by the SERS-database-based algorithm were fitted linearly against the actual concentrations. The relative content obtained by the algorithm correlates well linearly with the metabolite concentrations designed when the samples were prepared.

2.2 Comparison of weight settings

When unknown components are present and account for a large proportion of the sample, failing to set the weight α (i.e., setting α to 0) can yield unmixed coefficients that lack a good linear relationship with the true concentrations, even though the spectral fit remains good. The effect of the weight α on unmixing was checked by comparing the arginine unmixing results from the model validation: Fig. 3f shows the result with the weight (α = 0.5) and Fig. 3g the result without it (α = 0). The coefficients obtained without weighting do not effectively represent the relative concentration, whereas those obtained with weighting do.
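The weight selection criterion (best R² of the regression between unmixed coefficients and known concentrations) can be sketched as below. The helper names `r_squared` and `select_alpha` are hypothetical, and the sketch assumes the coefficients for each candidate α have already been computed by the unmixing step.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the least-squares line y ~ slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def select_alpha(concentrations, coefficients_by_alpha):
    """Pick the alpha whose coefficients regress best on the known concentrations.

    coefficients_by_alpha: dict mapping each candidate alpha to the average
    coefficient obtained for every sample at that alpha.
    """
    scores = {a: r_squared(concentrations, c)
              for a, c in coefficients_by_alpha.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

In this sketch a single molecule is scored at a time; with several known molecules, one option would be to average the per-molecule R² values before taking the maximum.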

2.3 Comparison of unmixing methods

Fig. 4a shows the fit for the four samples of Example 2.1, which contain cysteine and arginine at different concentrations, after unmixing by the least squares method (CLS). The black line is the average of the measured SERS spectra, the red line is the average of the spectra fitted by the CLS algorithm, and the blue line is the difference between the two. The fit is clearly poor.

Fig. 4b shows the fit for the same four samples after unmixing by non-negative matrix factorization (NMF). The fit is good, but the known-component spectra calculated by NMF (Fig. 4c) show no distinct Raman peaks, whereas the standard Raman spectrum of an actual molecule (e.g., cysteine) has distinct Raman peaks. The component spectra resolved by NMF therefore cannot be matched to the Raman spectra of actual molecules (their standard spectra).
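Why a least-squares fit against the known reference spectra alone performs poorly when unknown components are present can be seen in a small sketch. Ordinary unconstrained least squares is assumed here for CLS, and the function name `cls_unmix` is hypothetical.

```python
import numpy as np

def cls_unmix(V, W1):
    """Least-squares fit V ~ W1 @ H using only the known reference spectra.

    V  : (m, n) measured spectra, one per column
    W1 : (m, r1) reference spectra of the known molecules
    Returns the coefficients H and the unexplained residual.
    """
    H, *_ = np.linalg.lstsq(W1, V, rcond=None)
    residual = V - W1 @ H
    return H, residual
```

If the measured spectra contain only the known molecules, the residual vanishes and the coefficients are recovered. Any contribution from molecules outside the database is either left in the residual or absorbed into the known coefficients, which biases them.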

Example 3 Model verification two

3.1 Establishment of the SERS database used for model validation:

for model validation, the database was designed to contain only five known molecules: tyrosine, guanine, cytosine, asparagine and adenine.

3.2 Preparation of the samples to be tested:

tyrosine, guanine, cytosine, asparagine and adenine were added at different concentrations to 3 samples, and solutions of another 15 molecules at unknown concentrations were mixed in.

3.3 SERS spectroscopy test:

In Fig. 5a, i shows the SERS standard spectra of the 5 known molecules, and ii shows randomly selected Raman spectra of the mixed solution containing the 5 target molecules and 15 unknown molecules. The background is clearly complex.

3.4 weight optimization:

the α value is adjusted and the coefficients of the target molecules are calculated for the different samples at each value. The coefficients are compared with the known standard concentrations to build a regression curve, and the goodness of fit (R² value) of the regression curve is used as the criterion: the α value giving the best regression is selected as the weight for subsequent analysis. Here α is set to 0.5, and the weighted objective function is as follows:

3.5 Algorithm analysis:

using the weight determined in 3.4, each of the 201 scanned spectra was unmixed by the SERS-database-based algorithm to obtain the coefficients of the known molecules (tyrosine, guanine, cytosine, asparagine and adenine) in each SERS spectrum. These coefficients were averaged to give the average coefficient (relative content) of the known molecules contained in the SERS database used for model validation. The concentrations in the same samples were measured by mass spectrometry, and a regression analysis was performed against the coefficients unmixed by the algorithm.

Fig. 5b shows the fit of the SERS spectra of the three different mixed solutions. The thin black lines are the average measured SERS spectra of the three solutions. The colored lines are the sum of the products of the reference spectra and the corresponding coefficients fitted by the algorithm, i.e., the reconstructed average of the fitted SERS spectra. The two nearly coincide, which shows a good fit and indicates that the algorithm has decomposed the spectra effectively.

Fig. 5c compares the calculated average coefficients with the concentrations measured by mass spectrometry. The average coefficients (relative content) calculated by the algorithm agree well with the sample concentrations measured by mass spectrometry, and since SERS has higher sensitivity, more sensitive detection can be achieved.

Example 4 Cell experiments

4.1 Establishing the SERS database used for cell validation as described in Example 1;

4.2 Preparation of the samples to be tested: three groups of cells were set up in parallel to compare how their extracellular metabolic behavior, as reflected in the cell culture medium, changed over the days of culture. The cell groups were LO2 (normal human liver cells), HepG2 (human liver cancer cells) and HepG2+MTX (human liver cancer cells treated with the anticancer drug methotrexate).

Each group was tested for 5 consecutive days of metabolic activity by taking 400 μL of cell culture fluid from the culture dish each day. Dead cells and cell debris were removed from the cell culture fluid sample by gradient centrifugation, and protein molecules were then removed by ultrafiltration (3 kDa cutoff), yielding the metabolic molecule components of the cell culture fluid sample as the sample to be tested;

4.3 SERS spectroscopy test:

the sample to be tested was mixed with silver nanoparticles and tested by SERS, with the following test parameters: laser wavelength 638 nm, laser power 100%, integration time 5 s, 10x objective, 201 Raman spectra per sample;

200 spectra of the DAY2 data of each culture solution were selected to obtain a SERS spectral heat map (Fig. 7c), in which the abscissa is the Raman shift, the ordinate is the spectrum number, each row of pixels is one spectrum, and the color of each pixel represents the Raman intensity at that point. The peak positions and peak intensities fluctuate between spectra within the same measurement. It can therefore be considered that the kinds and amounts of molecules appearing in the Raman enhancement hot-spot region differ from moment to moment, depending on the concentration and type of the molecules, so the Raman spectrum must be measured many times to reflect the molecular composition of the sample.

4.3.1 Calculation of the number of spectra

This part calculates how many spectra must be acquired in each measurement to capture the overall information of the sample.

The average of the 200 spectra is used as the standard spectrum. Each time, N spectra are taken and averaged, and the Pearson coefficient between this N-spectrum average and the standard spectrum is calculated. This is repeated 5 times, and the 5 Pearson coefficients are averaged to give the correlation coefficient for N spectra.

The curve is considered to have converged when the Pearson coefficient exceeds 0.8, which indicates that the required information has essentially been acquired. The results are shown in Fig. 6: for the different types of data, the spectra have essentially converged (Pearson coefficient greater than 0.8) at around 50 spectra.
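The spectrum-count calculation described above can be sketched as follows. The function names are hypothetical, and drawing the N spectra at random without replacement is an assumption; the text does not say how the N spectra are chosen.

```python
import numpy as np

def subset_correlation(spectra, n_subset, n_repeats=5, seed=0):
    """Mean Pearson correlation between the average of N randomly drawn spectra
    and the average of all spectra, over n_repeats draws."""
    spectra = np.asarray(spectra, dtype=float)
    reference = spectra.mean(axis=0)                 # "standard spectrum"
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(spectra.shape[0], size=n_subset, replace=False)
        subset_mean = spectra[idx].mean(axis=0)
        scores.append(np.corrcoef(subset_mean, reference)[0, 1])
    return float(np.mean(scores))

def minimum_spectra(spectra, threshold=0.8, **kwargs):
    """Smallest N whose mean correlation exceeds the threshold, or None."""
    for n in range(1, len(spectra) + 1):
        if subset_correlation(spectra, n, **kwargs) > threshold:
            return n
    return None
```

Plotting `subset_correlation` against N gives the kind of convergence curve described for Fig. 6, and `minimum_spectra` reads off where the curve first crosses the 0.8 threshold.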

4.4 weight optimization:

4.5 Algorithm analysis:

using the weight determined in 4.4, each of the 201 scanned spectra was unmixed by the SERS-database-based algorithm to obtain the coefficients of the known molecules in each SERS spectrum. These coefficients were averaged to give the average coefficient (i.e., relative content) of the known molecules contained in the SERS database used for model verification. The relative content obtained can be used for further biomedical analysis, such as the metabolic differences between normal cells and tumor cells, or monitoring changes in the metabolic behavior of tumor cells after treatment with an antitumor drug.

Fig. 7b shows the daily fit of the SERS spectra of the three cell culture fluids. The thin black lines are the average of the measured SERS spectra. The colored lines are the sum of the products of the reference spectra and the corresponding coefficients fitted by the algorithm, i.e., the reconstructed average of the fitted SERS spectra. The two nearly coincide, which shows a good fit and indicates that the algorithm has decomposed the spectra effectively.

4.6 Differential metabolite screening

The intersection of the differential metabolites selected by ANOVA and by logistic regression is taken as the final metabolite screening result. As shown in Fig. 7d, 8 exemplary differential metabolites were screened, and the curves of their calculated coefficients (i.e., relative concentrations) over time were analyzed.
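The cross-selection can be sketched as below, assuming SciPy is available. The function names and the 0.05 significance threshold are assumptions, and the L1 weights are taken as an input produced by an L1-regularized logistic regression fitted separately.

```python
import numpy as np
from scipy.stats import f_oneway

def anova_selected(coeffs, labels, p_threshold=0.05):
    """Indices of metabolites whose coefficients differ between classes (one-way ANOVA).

    coeffs : (n_samples, n_metabolites) relative concentrations
    labels : class label of each sample
    p_threshold is an assumed significance level, not taken from the source.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    labels = np.asarray(labels)
    groups = [coeffs[labels == c] for c in np.unique(labels)]
    selected = set()
    for j in range(coeffs.shape[1]):
        _, p = f_oneway(*(g[:, j] for g in groups))
        if p < p_threshold:
            selected.add(j)
    return selected

def cross_select(coeffs, labels, l1_weights, p_threshold=0.05):
    """Intersection of ANOVA-selected metabolites and those with a non-zero L1 weight."""
    l1_weights = np.atleast_2d(np.asarray(l1_weights, dtype=float))
    nonzero = set(np.flatnonzero(np.any(np.abs(l1_weights) > 0, axis=0)).tolist())
    return sorted(anova_selected(coeffs, labels, p_threshold) & nonzero)
```

Taking the intersection keeps only metabolites that both differ statistically between classes and contribute to the classifier, which is stricter than either criterion alone.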

Example 5 Serum experiments

5.1 Establishing the SERS database used for cell validation as described in Example 1;

5.2 Preparation of the samples to be tested:

in the serum test, serum samples stored at -80 °C (from 85 BPH patients, 85 PCa patients and 75 healthy subjects) were thawed at 4 °C and then ultrafiltered (3 kDa cutoff) to remove the proteins, yielding the metabolite molecule components of the serum, which were used as the samples to be tested;

5.3 SERS spectroscopy test:

200 spectra of the DAY2 data of each culture solution were taken to obtain a SERS spectral heat map (Fig. 9). The abscissa is the Raman shift and the ordinate is the spectrum number. The peak positions and peak intensities fluctuate between spectra within the same measurement. It can therefore be considered that the kinds and amounts of molecules appearing in the Raman enhancement hot-spot region differ from moment to moment, depending on the concentration and type of the molecules, so the Raman spectrum must be measured many times to reflect the molecular composition of the serum.

5.3.1 Calculation of the number of spectra

The curve is considered to have converged when the Pearson coefficient exceeds 0.8, which indicates that the required information has essentially been acquired. The results are shown in Fig. 8: for the different types of data, the spectra have essentially converged (Pearson coefficient greater than 0.99) at around 50 spectra.

5.4 weight optimization:

5.5 Algorithm analysis:

using the weight determined in 5.4, each of the 201 scanned spectra was unmixed by the SERS-database-based algorithm to obtain the coefficients of the known molecules in each SERS spectrum. These coefficients were averaged to give the average coefficient (i.e., relative content) of the known molecules contained in the SERS database used for model verification. The relative content obtained can be used for further biomedical analysis, such as early screening, typing and staging of diseases.

Fig. 10 shows the fit of the SERS spectra of the sera of three people. The thin black lines are the average of the measured SERS spectra. The colored lines are the sum of the products of the reference spectra and the corresponding coefficients fitted by the algorithm, i.e., the reconstructed average of the fitted SERS spectra. The two nearly coincide, which shows a good fit and indicates that the algorithm has decomposed the spectra effectively.

5.6 Differential metabolite screening

The intersection of the differential metabolites selected by ANOVA and by logistic regression is taken as the final metabolite screening result. After unmixing the surface enhanced Raman spectra of the sera, the present application screened 16 differential metabolites by ANOVA and logistic regression cross analysis (Fig. 11). To analyze the results further, the coefficients of these 16 differential metabolites were extracted for all samples (Fig. 12a). The data for these 16 differential metabolites were used to classify the prostate cancer, benign prostatic hyperplasia and healthy samples, and the classification was compared with the results of PSA screening in clinical liquid biopsy, as shown in Fig. 12b and 12c. The results show that analysis based on the screened metabolites outperforms PSA screening.

Claims

wherein the objective function of the NMF-CLS algorithm is set to:

the spectral matrix is an m × n matrix V representing n spectra in total, each consisting of m points; the m × r₁ matrix W⁽¹⁾ represents the reference spectra of the known molecules arranged in columns, and the m × r₂ matrix W⁽²⁾ represents the spectra of the unknown molecules arranged in columns; the r₁ × n matrix H⁽¹⁾ and the r₂ × n matrix H⁽²⁾ represent the relative concentrations corresponding to W⁽¹⁾ and W⁽²⁾, respectively; r₁ and r₂ denote r₁ kinds of known molecules and r₂ kinds of unknown molecules, respectively; α represents the weight set for the known molecules, α ≥ 0; since the reference spectra W⁽¹⁾ of the known molecules are known, W⁽²⁾, H⁽¹⁾ and H⁽²⁾ are solved for so that F in the equation is minimized, thereby obtaining the relative concentrations corresponding to the known molecules.

2. The method of claim 1, wherein H⁽¹⁾, W⁽²⁾ and H⁽²⁾ are calculated by an iterative process.

3. The method of any one of claims 1-2, wherein the partial derivatives of F with respect to W⁽²⁾, H⁽¹⁾ and H⁽²⁾ are:

4. The method of any one of claims 1-3, wherein the calculation of H⁽¹⁾, W⁽²⁾ and H⁽²⁾ comprises:

5. The method of any one of claims 1-4, wherein the sample to be tested comprises a chemical sample or a biological sample.

6. The method of any one of claims 1-5, wherein the spectral images comprise infrared spectra and Raman spectra.

7. An analysis method based on surface enhanced Raman spectroscopy, comprising: based on a surface enhanced Raman spectroscopy (SERS) standard spectral database, unmixing the measured spectra using a weighted non-negative matrix factorization algorithm (NMF-CLS) to obtain the types and relative concentrations of the known molecules contained in a sample to be tested; wherein the known molecules are molecules contained in the SERS standard spectral database, and the SERS standard spectral database consists of the SERS standard spectra of different molecules;

wherein the objective function of the NMF-CLS algorithm is set to:

8. The method of claim 7, wherein the molecules in the SERS standards database comprise metabolites.

9. A metabolomic analysis method, comprising: based on a standard spectrum database, unmixing the spectrum of the sample to be tested obtained by testing by adopting an NMF-CLS algorithm to obtain the type and relative concentration of the metabolite contained in the sample to be tested, wherein the metabolite is a molecule contained in the standard spectrum database.

10. A method of determining a biomarker, comprising:

2) screening differential molecules as biomarkers.