CN111220756A

CN111220756A - Radix rehmanniae identification platform and radix rehmanniae identification method using same

Info

Publication number: CN111220756A
Application number: CN201811419256.0A
Authority: CN
Inventors: 张晓哲; 赵楠; 程孟春
Original assignee: Dalian Institute of Chemical Physics of CAS
Current assignee: Dalian Institute of Chemical Physics of CAS
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2020-06-02

Abstract

The application discloses a radix rehmanniae identification platform and a method for identifying radix rehmanniae using the platform. The platform comprises a known sample information database module, an unknown sample information database module, a known sample chromatogram-mass spectrum image module, an unknown sample chromatogram-mass spectrum image module and an unknown sample identification module. The unknown sample is identified by comparing the chromatogram-mass spectrum data image generated for the unknown sample with that of a known sample, thereby determining whether the chromatogram-mass spectrum data of the two samples match. Using a traditional Chinese medicine chromatography-mass spectrometry high-dimensional image technology, the method comprehensively characterizes the spatial information among the large number of compounds in a radix rehmanniae sample and uses this spatial information to match and identify unknown samples, with advantages such as speed, high throughput, high precision and high reliability.

Description

Radix rehmanniae identification platform and radix rehmanniae identification method using same

Technical Field

The application relates to the technical field of traditional Chinese medicine detection, in particular to a radix rehmanniae identification platform and a radix rehmanniae identification method.

Background

The compound composition of complex samples is extremely intricate. Traditional Chinese medicine is a typical complex sample: its components are extremely complex and structurally diverse, and come in many types, commonly including phenols, alkaloids, saponins, terpenoids, flavonoids, lactones, anthrones, organic acids and tannins. A single traditional Chinese medicine contains hundreds to thousands of secondary metabolites and small-molecule components, and a compound preparation combining several traditional Chinese medicines contains even more. A complex sample therefore carries a large amount of information, which encompasses scientific questions such as the interrelations among the compounds of Chinese herbal medicines, the differences in properties and effects between different herbs, the differences in chemical components within the same herb, and the influence of producing area, year and growth environment on herb quality.

Current research on complex samples faces two important bottlenecks. On the one hand, most research relies on fragmented, point-like low-dimensional data, such as chromatographic retention time, m/z value and fragment (daughter) ion information; such low-dimensional data neglect, or cannot reflect at all, the correlations among the large number of chemical components. High-dimensional data, by contrast, is a powerful carrier of massive amounts of information: compared with low-dimensional data, it can effectively represent the spatial information of the data points in a sample and so reflect their spatial relationships. It is therefore worthwhile to acquire, process and mine high-dimensional data on the compounds of complex samples. On the other hand, the data resources generated by experiments are huge but scattered, and data from related studies cannot be integrated and reused, so scientific research consumes considerable manpower, material resources and time with little to show for it. Database technology is a method for computer-aided data management and integration. Combining high-dimensional data with database technology to build a high-dimensional database is the direction for solving these problems.

Acquiring high-dimensional data requires hyphenated instruments. Chromatography-mass spectrometry combines chromatography, a separation method with an extremely wide range of application, with mass spectrometry, which is sensitive and specific and can provide molecular weight and structural information; it is clearly an ideal means of acquiring high-dimensional data from complex samples. Some databases based on chromatography-mass spectrometry are already available, and they can be roughly divided into two types:

1. Standard compound mass spectral databases: for example, the NIST standard compound mass spectral database published by the National Institute of Standards and Technology (NIST) of the United States records mass spectra of tens of thousands of standard substances and plays a major role in metabolomics research on GC-MS platforms; likewise, the Human Metabolome Database (HMDB) is currently the most complete and comprehensive database of human metabolites and human metabolism. Such databases are widely used in many research areas. However, the number of compounds they can provide is limited, and they do not provide chromatographic retention information for the compounds. Zhang Jia Yuan et al. (Acta Pharmaceutica Sinica, 2012, 47(9): 1187-1192) used high performance liquid chromatography-electrospray ion trap tandem mass spectrometry (HPLC-ESI-IT-MS/MS), with a commercial workstation library editor as the platform, to establish a liquid chromatography-mass spectrometry database (LC-MS-DS) containing 636 natural compounds (covering common types of natural products such as flavones, coumarins, lignans, terpenes and their glycosides, steroids and their glycosides, organic acids, alkaloids, anthraquinones and amino acids) for the identification and targeted separation of unknown components of natural products. This database is a standard compound mass spectral database. The reliability of a library search can be evaluated by matching the retention time and ultraviolet absorption spectrum of the unknown component against those of a reference substance, or by comparing whether the main fragment ions in their multi-stage mass spectra are the same, which improves the reliability of the result. However, this database can only be used to identify compounds and cannot be used to identify biological samples, including natural products.

2. Compound information databases: the UNIFI traditional Chinese medicine database introduced by Waters Corporation contains all the herbs listed in the 2010 edition of the Chinese Pharmacopoeia and information on thousands of compounds related to these herbs (the main compounds reported in the literature). To use the database, chromatography-mass spectrometry data of the traditional Chinese medicine under test are acquired by ultra performance liquid chromatography (UPLC) and quadrupole time-of-flight mass spectrometry (QTOF MS); the molecular formula is inferred from the accurate molecular weight and matched against the compound structures in the database, and theoretical fragments calculated by software are matched against the acquired secondary ions for confirmation. The advantage of the database is that it integrates all the herbs and main compounds of the 2010 edition of the Chinese Pharmacopoeia, with compounds numbering in the thousands. Compared with standard compound mass spectral databases, which are limited by the availability of standard substances, it is clearly more feasible to scale up the number of compounds. However, the database contains no real chromatography-mass spectrometry data for its compounds: identification relies only on inferring the molecular formula from the accurate molecular weight given by high-resolution mass spectrometry, with reliability improved by matching theoretically calculated secondary fragments. Although high-resolution mass spectrometry provides the accurate molecular weight from which possible molecular formulas can be predicted, many candidate compounds can share the same molecular formula; and although the database holds thousands of compounds in total, each herb has on average only tens of compounds, most of them common, high-content compounds. The chemical components of traditional Chinese medicines are typically complex and diverse, and each medicine may contain hundreds of components, so the compounds in the database may cover only a small part of the chemical components of the medicine under test, and the ability to identify medium- and low-content components is very limited. Moreover, the theoretical calculation of secondary fragments is not yet mature and its accuracy is not high, so matching results may deviate, causing false positives or false negatives. The database also has a compatibility problem: it is suitable only for the Waters workstation system.

CellTie et al. invented a database construction method suitable for mass spectrometry data analysis of natural products (application No. 201510443268.7). The method downloads all related compounds from the PubChem, CA or Reaxys compound databases, performs computer-simulated fragmentation of the compounds based on fragmentation rules, obtains the fragments of the compounds, records the related information of the compounds and fragments, and then establishes the database. Compared with the UNIFI traditional Chinese medicine database, this method covers more compounds, its fragmentation rules are taken from the existing literature, and compound identification is completed in combination with computer-simulated fragmentation, so the reliability of the result is relatively improved. However, like the UNIFI traditional Chinese medicine database, it is based only on compound structure information and contains no actual compound spectra; in addition, different instruments and parameters strongly influence the fragmentation behavior of compounds, and the adaptability of the database to data from different sources (instruments, experimental conditions, etc.) is not clear.

Existing chromatography-mass spectrometry databases take the compound as the main body, focus on single-dimension features of the data, store only part of the multi-dimensional data, and do not convert the multi-dimensional data into high-dimensional data for integrated use. The traditional Chinese medicine chromatogram-mass spectrum high-dimensional image database established by the invention takes the traditional Chinese medicine as the main body and comprises both the overall information of the traditional Chinese medicine and the single-point information of its compounds. It can be used for various studies such as identification, classification, quality control and deep data mining of traditional Chinese medicines.

It should be noted that the traditional Chinese medicine identification method of the invention can be applied to data obtained under identical or similar sample analysis conditions, which greatly improves the applicability of the method.

Disclosure of Invention

To solve the problems in the prior art, one aspect of the present invention provides a radix rehmanniae identification platform, which comprises the following modules:

the system comprises a known sample information database module, an unknown sample information database module, a known sample chromatogram-mass spectrum image module, an unknown sample chromatogram-mass spectrum image module and an unknown sample identification module;

the known sample information database module transmits chromatogram-mass spectrum data of a known sample to the known sample chromatogram-mass spectrum image module, and the known sample chromatogram-mass spectrum image module outputs a first data image;

the unknown sample information database module transmits the chromatography-mass spectrometry data of the unknown sample to the unknown sample chromatography-mass spectrometry image module, and the unknown sample chromatography-mass spectrometry image module outputs a second data image;

the unknown sample identification module is used for recording the sample information of the known sample and the generated first data image, and comparing the generated second data image with the first data image to determine whether the chromatogram-mass spectrum data of the unknown sample is matched with the chromatogram-mass spectrum data of the known sample.

In a preferred embodiment, the chromatography-mass spectrometry data of the known sample comprises raw chromatography-mass spectrometry information of the known sample and the chromatography-mass spectrometry data of the unknown sample comprises raw chromatography-mass spectrometry information of the unknown sample.

In a preferred embodiment, the chromatography-mass spectrometry data of the known sample further comprises high dimensional data of each compound in the known sample, and the chromatography-mass spectrometry data of the unknown sample further comprises high dimensional data of each compound in the unknown sample.

The high-dimensional data expresses spatial information among data points in the sample, and the spatial information is a matrix formed by at least one of the following information: distance information between data points; angular relationship information between data points; coordinate position information of the data points; density information of the data points; edge range information for the set of data points; intensity information of the data points.

Preferably, the distance information between data points comprises at least one of chromatographic retention time t, m/z value, m value, z value, peak intensity I.

Preferably, the intensity information of a data point includes at least one of the information reflected by the size of the data point and the information reflected by its brightness.

Preferably, the high-dimensional data may be stored as a table file or a text file; further preferably, the table file is one or more of .xls, .xlsx, .csv and .xml, and the text file is at least one of .docx, .txt and .rtf.
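As an illustration of the spatial information described above, the following minimal Python sketch computes the distance and angular relationship between every pair of data points in the t-m/z plane and stores the result as .csv text. The peak values, the function name `pairwise_spatial_info` and the use of unscaled t and m/z units are assumptions made for this example and are not taken from the application.

```python
import csv
import io
import math

# Hypothetical peak list: (retention time t in s, m/z, intensity I).
peaks = [(10.0, 100.0, 500.0), (13.0, 104.0, 250.0), (10.0, 110.0, 1000.0)]

def pairwise_spatial_info(points):
    """Return rows (i, j, distance, angle_deg) for each pair of points in the t-m/z plane."""
    rows = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dt = points[j][0] - points[i][0]
            dmz = points[j][1] - points[i][1]
            distance = math.hypot(dt, dmz)             # Euclidean distance, unscaled axes
            angle = math.degrees(math.atan2(dmz, dt))  # angle of the line i -> j
            rows.append((i, j, distance, angle))
    return rows

info = pairwise_spatial_info(peaks)

# Store the spatial information as .csv text (one of the table formats named above).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["i", "j", "distance", "angle_deg"])
writer.writerows(info)
csv_text = buffer.getvalue()
```

In practice the two axes would probably need to be scaled before distances are compared, since retention time and m/z have different units.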

In a preferred embodiment, the high-dimensional data image generated by the high-dimensional data includes at least one of an original image generated by the high-dimensional data, an image generated based on image features, an image generated by performing conversion processing on the image, and an image constructed by using a function.

Preferably, the image features comprise point clusters of data points, common ions and sample contours.

Preferably, the image conversion process includes at least one of a blurring process of the image or a process of subjecting the image to different resolutions.

Preferably, the function is a function of at least one of chromatographic retention time t, m/z, m and I.

Preferably, the high-dimensional image is an image of more than two dimensions;

preferably, the image file may be stored in any image file format.

In a preferred embodiment, the known sample comprises at least one of a standard and a known chinese medicine sample.

Preferably, the standard substance comprises at least one of a reference substance of traditional Chinese medicine, traditional Chinese medicine marking components and main chemical components of traditional Chinese medicine in '2015 edition Chinese pharmacopoeia'.

Preferably, the known traditional Chinese medicine sample is a sample with definite category information, and the category information comprises at least one of the species, the origin, the part and the processing mode of the sample;

preferably, the known TCM sample comprises at least one of TCM raw material, decoction pieces and powder. Further preferably, the known chinese medicine sample includes at least one of different parts of chinese medicine and their processed products.

In a preferred embodiment, the unknown sample identification module comprises an image segmentation tool or a clustering tool.

In a preferred embodiment, the database type of each database module in the radix rehmanniae identification platform provided by the present invention comprises at least one of a folder data set, a web page database, a database based on a commercial workstation and a database based on a user-developed workstation.

Preferably, the database format includes at least one of text, Excel, Oracle, MySQL, SQLite and Microsoft SQL Server.

Another aspect of the present invention provides a method for identifying radix rehmanniae using the radix rehmanniae identification platform, the method comprising at least the following steps:

1) acquiring raw chromatograph-mass spectrum data of a known sample and an unknown sample using chromatography and mass spectrometry;

2) generating chromatogram-mass spectrum high-dimensional data of a known sample and an unknown sample, wherein the chromatogram-mass spectrum high-dimensional data expresses spatial information among data points;

3) generating a chromatogram-mass spectrum high-dimensional data image of a known sample and an unknown sample, enabling each ion in the high-dimensional data to correspond to a point in a formed image one by one, enabling each point to have own coordinate information, enabling the intensity of each point to be represented by the size or/and the intensity of the brightness of the point, and enabling the point in the high-dimensional data image to correspond to the high-dimensional data one by one;

4) dividing points in the chromatogram-mass spectrum high-dimensional image of the unknown sample into n point clusters (n is an integer more than or equal to 1) by using an image dividing tool or a clustering tool, and respectively scanning and matching the chromatogram-mass spectrum high-dimensional image of the unknown sample after the point clusters are extracted and the mass spectrum-chromatogram high-dimensional image of the known sample one by one;

5) ranking the known samples matched with the unknown sample by matching degree; in ranked order, searching the raw chromatogram-mass spectrum data and/or the high-dimensional data of the unknown sample for the marker compounds of each known sample, where each known sample has one or more marker compounds; when the marker compound is found in the unknown sample, the unknown sample is accepted as that known sample and the search stops; if the marker compound of the first-ranked known sample is not found in the unknown sample, the marker compound of the second-ranked known sample is searched for in the unknown sample, and so on until a marker compound is found; if none of the marker compounds of the matched known samples is found in the unknown sample, the established database is considered not to contain the unknown sample;
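Step 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the dictionary layout, the names `has_ion` and `identify`, the tolerance defaults and all numeric values are hypothetical, and the sketch assumes that every listed marker compound of a known sample must be found for that sample to be accepted.

```python
def has_ion(ions, marker, t_tol, mz_tol):
    """True if some ion (t, m/z) of the unknown sample lies within tolerance of the marker."""
    t, mz = marker
    return any(abs(t - ti) <= t_tol and abs(mz - mzi) <= mz_tol for ti, mzi in ions)

def identify(unknown_ions, known_samples, t_tol=0.2, mz_tol=0.01):
    """Walk the known samples in descending matching degree and return the first
    one whose marker compounds are all found in the unknown sample, else None."""
    ranked = sorted(known_samples, key=lambda s: s["score"], reverse=True)
    for sample in ranked:
        if all(has_ion(unknown_ions, m, t_tol, mz_tol) for m in sample["markers"]):
            return sample["name"]   # accepted; the search stops here
    return None                     # the database does not contain the unknown sample

# Hypothetical data: ions of the unknown sample and two ranked known samples.
unknown_ions = [(120.0, 361.11), (300.5, 345.12)]
known_samples = [
    {"name": "A", "score": 0.9, "markers": [(200.0, 500.0)]},
    {"name": "B", "score": 0.8, "markers": [(120.1, 361.115)]},
]
result = identify(unknown_ions, known_samples)
```

Here sample A ranks first but its marker is absent, so the search moves on to sample B, whose marker is found within tolerance.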

in a preferred embodiment, the coordinate information includes at least one of distance information between data points, angular relationship information between data points, coordinate position information of data points, density information of data points, edge range information of a set of data points, and intensity information of data points.

In a preferred embodiment, a point cluster is a collection of spatially close data points, where the number of data points n ≧ 3 within the point cluster.

Preferably, each of said clusters of points has its own centre point.

Preferably, the shape of the dot clusters is arbitrary.

In a preferred embodiment, raw chromatograph-mass spectrometry data of a known sample and an unknown sample are obtained by:

separating the mixed molecules in the known and unknown samples by selective action by using a chromatograph and an ion mobility spectrometry instrument to obtain different chromatographic retention time information t;

separating and detecting compounds in a sample according to different mass-to-charge ratios of molecules by using the electromagnetic field effect of a mass spectrometer to obtain different mass-to-charge ratio information m/z;

analyzing the sample extract by using a chromatography-mass spectrometer to obtain original chromatography-mass spectrometry data;

in a preferred embodiment, the time t used for chromatographic separation is in the range of 1 to 10000s and the m/z scan of the ions is in the range of 50 to 10000 Da.

In a preferred embodiment, the method may further comprise subjecting the acquired raw chromatography-mass spectrometry data to at least one of retention time correction, filtering and normalization.
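A minimal Python sketch of such preprocessing is given below. The application does not specify the algorithms, so the choices here are assumptions made only for illustration: retention time is corrected by a constant offset derived from one internal standard, filtering drops peaks below an intensity cutoff, and normalization divides by the largest remaining intensity. The function name `preprocess` and all values are hypothetical.

```python
def preprocess(peaks, standard_t_observed, standard_t_expected, min_intensity):
    """Apply retention time correction, filtering and normalization to (t, m/z, I) peaks."""
    # Retention time correction: shift all peaks by the offset of the internal standard.
    shift = standard_t_expected - standard_t_observed
    corrected = [(t + shift, mz, inten) for t, mz, inten in peaks]
    # Filtering: keep only peaks at or above the intensity cutoff.
    kept = [p for p in corrected if p[2] >= min_intensity]
    if not kept:
        return []
    # Normalization: scale intensities so the strongest peak equals 1.0.
    top = max(p[2] for p in kept)
    return [(t, mz, inten / top) for t, mz, inten in kept]

# Hypothetical raw peaks; the internal standard eluted at 121.0 s instead of 120.0 s.
raw = [(61.0, 200.0, 50.0), (121.0, 300.0, 400.0), (181.0, 400.0, 800.0)]
clean = preprocess(raw, standard_t_observed=121.0, standard_t_expected=120.0,
                   min_intensity=100.0)
```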

In a preferred embodiment, the method may further comprise the step of using quality control samples and internal standards of mixed standards.

Preferably, the quality control sample comprises at least one of a known sample or a mixture thereof, an unknown sample or a mixture thereof, and a mixture of two or more standards, and is used to evaluate the quality of data.

Preferably, when mixed standards are employed, internal standards of the mixed standards can be used to improve the reproducibility of the assay and to perform retention time correction.

In a preferred embodiment, the unknown sample is at least one of a raw herb, a decoction piece, a powder, a preparation, a different part of a herb, and a processed product thereof.

Preferably, the preparation comprises traditional Chinese medicine granules or traditional Chinese medicine injection.

The beneficial effects that this application can produce include:

1) The radix rehmanniae identification platform established by the invention comprises a traditional Chinese medicine chromatogram-mass spectrum high-dimensional image database, which takes the traditional Chinese medicine as the main body and comprises both the overall information of the traditional Chinese medicine and the single-point information of its compounds. The identification platform can therefore reveal the correlations among the complex components of the traditional Chinese medicine and comprehensively characterize the spatial information among the large number of compounds in a traditional Chinese medicine sample.

2) The radix rehmanniae chromatogram-mass spectrum high-dimensional image database can be used for various studies such as identification, classification, quality control and deep data mining of traditional Chinese medicines.

3) The radix rehmanniae identification method is suitable for data obtained under identical or similar sample analysis conditions, which greatly improves the applicability of the method.

4) The radix rehmanniae identification method provided by the invention matches and identifies known and unknown samples using the spatial information of the samples, with advantages such as speed, high throughput, high precision and high reliability.

Drawings

Fig. 1 is a schematic diagram illustrating the inventive concept.

Detailed Description

The present application will be described in detail with reference to examples, but the present application is not limited to these examples.

The relevant terms are uniformly interpreted as follows:

In the present application, "high-dimensional" refers to two or more dimensions, and "low-dimensional" refers to one dimension.

"Common ions" refer to the same component (with the same retention time and m/z) in the high-dimensional images of the same or different samples.

"Sample contours" refer to the contours of the high-dimensional image produced by a sample.

A schematic diagram of the inventive concept is shown in fig. 1.

1. Establishing a Chinese medicine chromatogram-mass spectrum high-dimensional image database:

1) acquiring and processing raw chromatography-mass spectrometry (X-MS) data of a known traditional Chinese medicine sample in the known sample information database module 20: acquiring raw X-MS data of the known traditional Chinese medicine sample by chromatography and mass spectrometry, importing the raw X-MS data into peak extraction software such as Progenesis QI, and processing the data;

2) generating high-dimensional data 200 of the known traditional Chinese medicine sample and generating a high-dimensional data image in the known sample chromatogram-mass spectrum image module 22: obtaining m/z, t, I, m and z values of each compound in a sample, generating a high-dimensional data matrix (such as an m/z-t-I matrix, an m-z-t-I matrix or an m-t-I matrix), and generating known traditional Chinese medicine sample chromatography-mass spectrometry combined high-dimensional data 200; the high dimensional data 200 is imported into image generation software such as Matlab to generate a first data image 220. Enabling each ion in the high-dimensional data to correspond to a point in a constructed image one by one, wherein each point has own coordinate information (such as t, m/z or m and z), the intensity of each point is represented by the size or/and the intensity of the brightness of the point, and the points in the high-dimensional data image correspond to the high-dimensional data one by one;

3) establishing a chromatogram-mass spectrum high-dimensional image database of known traditional Chinese medicine samples: the high-dimensional data images obtained for one or more types of known traditional Chinese medicine samples are taken as the traditional Chinese medicine chromatogram-mass spectrum high-dimensional image database, where each type comprises one or more samples; the database comprises the sample information, raw X-MS data information, high-dimensional data information and high-dimensional image data information of the known traditional Chinese medicine samples;
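The image generation of step 2) can be illustrated with a small Python sketch that places each ion of an m/z-t-I matrix on a pixel grid, with the pixel value standing for brightness. The application names Matlab as an example tool; this sketch is only a stand-in under stated assumptions. The function name `rasterize`, the grid size and the peak values are hypothetical, and on a coarse grid two ions can fall on the same pixel (the sketch keeps the stronger one), so a strict one-to-one correspondence would need a fine enough grid or a scatter-style image.

```python
def rasterize(peaks, t_max, mz_min, mz_max, width, height):
    """Map (t, m/z, I) points onto a height x width grid; pixel value = intensity."""
    image = [[0.0] * width for _ in range(height)]
    for t, mz, intensity in peaks:
        # Column from retention time, row from m/z; clamp the upper edge into the grid.
        col = min(int(t / t_max * width), width - 1)
        row = min(int((mz - mz_min) / (mz_max - mz_min) * height), height - 1)
        # If two ions share a pixel, keep the brighter one.
        image[row][col] = max(image[row][col], intensity)
    return image

# Hypothetical m/z-t-I data for three ions.
peaks = [(0.0, 50.0, 10.0), (50.0, 100.0, 20.0), (100.0, 150.0, 30.0)]
image = rasterize(peaks, t_max=100.0, mz_min=50.0, mz_max=150.0, width=4, height=4)
```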

2. Rapid identification of radix rehmanniae:

1) acquisition of unknown sample high-dimensional image data 400: using operating parameters and conditions the same as or similar to those in section 1, and following steps 1) to 2) of section 1, the unknown sample is analyzed to obtain its raw X-MS data and high-dimensional data; an X-MS second data image 420 of the unknown sample is then generated from the X-MS data using image generation software;

2) identifying the unknown sample in the unknown sample identification module 60;

A. dividing the points in the X-MS high-dimensional image of the unknown sample into n point clusters (n is an integer ≥ 1) using an image segmentation tool, such as the machine-learning segmentation program built into Matlab 2016b, or a clustering tool such as K-Means, DBSCAN or Fanny;

the point cluster refers to a set of points close to each other in space, and the number n of the points in the point cluster is more than or equal to 3;

each point cluster can have a central point, and the shape of the point cluster can be any shape;
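Step A can be illustrated with the following Python sketch. It does not use K-Means, DBSCAN or Fanny; instead it groups points by a simple neighborhood rule (points within `t_eps` in t and `mz_eps` in m/z are linked, and linked points form one cluster), which is a simplified stand-in chosen for this example. The function names, the thresholds and the point values are hypothetical; only the requirement that a point cluster contain at least 3 points comes from the text.

```python
def point_clusters(points, t_eps, mz_eps, min_points=3):
    """Group (t, m/z) points into clusters of linked neighbors; drop clusters
    with fewer than min_points members. Returns lists of point indices."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if abs(points[i][0] - points[j][0]) <= t_eps
                    and abs(points[i][1] - points[j][1]) <= mz_eps]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
            members.extend(near)
        if len(members) >= min_points:
            clusters.append(sorted(members))
    return sorted(clusters)

def center(points, cluster):
    """Geometric center (mean t, mean m/z) of a point cluster."""
    n = len(cluster)
    return (sum(points[i][0] for i in cluster) / n,
            sum(points[i][1] for i in cluster) / n)

# Hypothetical points: two groups of three and one isolated point.
points = [(10.0, 100.0), (11.0, 101.0), (12.0, 102.0),
          (50.0, 300.0), (51.0, 301.0), (52.0, 299.0),
          (90.0, 500.0)]
clusters = point_clusters(points, t_eps=2.0, mz_eps=2.0)
```

The isolated point forms a group of one and is discarded because it does not reach the minimum of 3 points.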

B. respectively scanning and matching the unknown sample X-MS second data image 420 after the point cluster extraction and the known traditional Chinese medicine sample X-MS first data image 220 in the traditional Chinese medicine X-MS high-dimensional image database one by one;

during scanning, aligning the origin, the t axis and the m/z (m) axis of the two X-MS high-dimensional images;

during scanning, the point cluster is taken as a whole, and its moving range is 0 to T_k, where T_k is the maximum analysis time corresponding to the known traditional Chinese medicine sample;

during scanning, each point cluster of the unknown sample retains its position on the m/z (or m) axis and its geometric shape, and is scanned along the time axis (t);

through scanning, common points are sought whose t and m/z (or m) match accurately between the point cluster of the unknown sample and the X-MS high-dimensional image of the known traditional Chinese medicine sample; during scanning, when a point in a point cluster of the unknown sample is matched against a point in the X-MS high-dimensional image of the known traditional Chinese medicine sample, the absolute deviation in t allowed for each point (t tolerance) is greater than or equal to T, where T equals the sum of the average retention time deviation allowed by the chromatograph during acquisition of the unknown sample X-MS data and the average retention time deviation allowed by the chromatograph during acquisition of the known traditional Chinese medicine sample X-MS data (each an absolute value, which can be calculated from repeated measurements of one or more standards or of one or more compounds in a given sample);

during scanning, when a point in a point cluster of the unknown sample is matched against a point in the X-MS high-dimensional image of the known traditional Chinese medicine sample, the absolute measurement error in m/z (or m) allowed for each point [m/z (or m) tolerance] is greater than or equal to A, where A equals the sum of the average mass deviations allowed during mass spectrometry scanning in the acquisition of the X-MS data of the unknown sample and of the known traditional Chinese medicine sample (each an absolute value, which can be obtained from repeated measurements of the calibration solution used by the instrument);

when one point in the unknown sample point cluster and a certain point of the known traditional Chinese medicine sample meet t deviation and m/z (or m) deviation, the point is considered to meet the matching requirement;

during scanning, the scanning step of the point cluster along the time axis (t) is less than or equal to T; normally, 0 s < T < 10000 s;
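The tolerances T and A and the per-point matching rule of step B can be sketched as follows. This Python example rests on assumptions: the "average deviation" is computed here as the mean absolute deviation of repeated measurements from their mean, the smallest permitted tolerances (exactly T and A) are used, and all measurement values and names are hypothetical.

```python
def mean_abs_deviation(values):
    """Mean absolute deviation of repeated measurements from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

# Hypothetical repeated retention times (s) of one standard in each acquisition series.
rt_unknown_runs = [120.0, 120.2, 119.8]
rt_known_runs = [120.1, 119.9, 120.0]
# T: sum of the average retention time deviations of the two acquisitions.
T = mean_abs_deviation(rt_unknown_runs) + mean_abs_deviation(rt_known_runs)

# Hypothetical repeated m/z readings of a calibration solution in each acquisition series.
mz_unknown_runs = [500.000, 500.002, 499.998]
mz_known_runs = [500.001, 499.999, 500.000]
# A: sum of the average mass deviations of the two acquisitions.
A = mean_abs_deviation(mz_unknown_runs) + mean_abs_deviation(mz_known_runs)

def point_matches(p, q, t_tol, mz_tol):
    """A point (t, m/z) matches another when both deviations are within tolerance."""
    return abs(p[0] - q[0]) <= t_tol and abs(p[1] - q[1]) <= mz_tol

matched = point_matches((120.1, 500.001), (120.0, 500.0), T, A)
```

With these values T is about 0.2 s and A about 0.002, so the example pair of points satisfies both the t deviation and the m/z deviation.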

C. when a point cluster moves to each position of the t axis of the known traditional Chinese medicine sample X-MS high-dimensional image, recording the number of matching points, the coordinates of each matching point and the coordinates of the geometric center point of the point cluster;
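Steps B and C together can be sketched in Python as a scan of one point cluster along the time axis. The sketch assumes that "moving range 0 to T_k" means the earliest point of the cluster is placed at positions from 0 to T_k in steps no larger than T; this reading, the function name `scan_cluster`, the record layout and all values are assumptions made for illustration.

```python
def scan_cluster(cluster, known, t_k, step, t_tol, mz_tol):
    """Slide a point cluster along t over [0, t_k], keeping its m/z positions and
    shape. For each position record the number of matching points, their
    coordinates and the geometric center of the moved cluster."""
    t0 = min(t for t, _ in cluster)
    records = []
    for k in range(int(t_k / step) + 1):
        start = k * step
        moved = [(t - t0 + start, mz) for t, mz in cluster]
        matches = [p for p in moved
                   if any(abs(p[0] - q[0]) <= t_tol and abs(p[1] - q[1]) <= mz_tol
                          for q in known)]
        cx = sum(t for t, _ in moved) / len(moved)
        cy = sum(mz for _, mz in moved) / len(moved)
        records.append({"start": start, "count": len(matches),
                        "matches": matches, "center": (cx, cy)})
    return records

# Hypothetical unknown cluster and known-sample points as (t, m/z).
cluster = [(5.0, 100.0), (6.0, 150.0), (7.0, 200.0)]
known = [(20.0, 100.0), (21.0, 150.0), (22.0, 200.0), (40.0, 100.0)]
records = scan_cluster(cluster, known, t_k=50.0, step=0.5, t_tol=0.5, mz_tol=0.01)
best = max(records, key=lambda r: r["count"])
```

Because the step equals the t tolerance here, neighboring positions can reach the same maximum count; `max` simply returns the first of them.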

D. at each position, the matching degree (Si) between a point cluster i (i is an integer ≥ 1) of the unknown sample and the X-MS high-dimensional image of the known traditional Chinese medicine sample can be calculated; using a statistical tool such as Matlab, the matching degree can be computed from one or more of the number of matching points, the similarity (e.g. the Euclidean distance method used in image similarity calculation) and the correlation (e.g. the 2D correlation coefficient in Matlab) between point cluster i and the X-MS high-dimensional image of the known traditional Chinese medicine sample;

the matching degrees obtained by the three methods are represented by the number of points (or a function of the number of points), the similarity and the correlation, respectively;

the matching degree of a point cluster is linearly or nonlinearly related to four variables: the number of matching points, the coordinate position (t, m/z) and the intensity; the number of points (or a function of it), the similarity and the correlation are all calculated from transformations of these four variables;

different matching degree calculation methods can be selected to respectively calculate the overall matching degree of the point clusters and the X-MS high-dimensional images of the known traditional Chinese medicine samples;

the number of matching points is the number of points in the point cluster that meet the matching condition; based on the above steps, the maximum matching degree (Si) of each point cluster in the X-MS high-dimensional image of the unknown sample is mathematically weighted (for example, by addition, averaging or taking the logarithm) to obtain the overall matching degree (Sc) between the X-MS high-dimensional image of the unknown sample and that of the known traditional Chinese medicine sample;

E. repeat the above steps to analyze, one by one, the matching degree between the X-MS high-dimensional image of the unknown sample and those of the other known traditional Chinese medicine samples, obtaining the overall matching degree (Sc) between the unknown sample and each known traditional Chinese medicine sample;

F. the class to which the unknown sample belongs may be determined without or with the aid of a threshold value;

when no threshold is used, the unknown sample is matched with the known traditional Chinese medicine samples by the above steps and the matching degrees are sorted from largest to smallest; the higher a known traditional Chinese medicine sample ranks, the more likely the unknown sample is that sample, and the lower it ranks, the less likely;

when a threshold is used, a threshold gamma is set to define the range within which a match between unknown samples from different sources and a known traditional Chinese medicine sample of the same kind is considered credible;

the threshold can be set statistically: following steps 1) to 2) and using the same or similar operating parameters and conditions, select two or more known traditional Chinese medicine samples of the same type as training samples for that type and analyze them to obtain X-MS raw data; convert the X-MS raw data or multi-dimensional information text into X-MS high-dimensional images with image generation software (such as Matlab 2016b) to obtain a training set of X-MS high-dimensional images for that type; match the training set against the X-MS high-dimensional image of the known traditional Chinese medicine sample of the same type, determine the distribution interval of the matching degree by a statistical method (such as probability or ratio), and take the lower limit of the matching degree in that interval as the threshold gamma for that type of sample;

alternatively, the threshold can be obtained from literature reports or experimental observation: determine the distribution interval (n ≥ 2) of the matching degree between samples of a given type and the known traditional Chinese medicine sample (the analysis results being obtained according to steps 1) to 2) with the same or similar operating parameters and conditions), and take the lower limit of the matching degree in that interval as the threshold gamma for that type of sample;

the unknown sample is matched with the known traditional Chinese medicine samples and the matching degrees are sorted from largest to smallest; the higher a known traditional Chinese medicine sample ranks, provided that Sc is greater than the threshold gamma determined for that known sample, the more likely the unknown sample is that sample; otherwise the likelihood is lower;
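Steps B to F can be illustrated with a minimal Python sketch. It is not the implementation of the invention: it assumes that the matching degree Si is the plain number of matching points (one of the three options named in step D), that Sc is the sum of the per-cluster maxima (addition being one of the listed weightings), and that the cluster is positioned by its geometric centre on the t axis. All function and parameter names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (retention time t in seconds, m/z)


def count_matches(cluster: List[Point], reference: List[Point],
                  shift: float, t_tol: float, mz_tol: float) -> int:
    """Count cluster points that, after shifting t by `shift`, lie within
    both tolerances of at least one reference point."""
    return sum(
        any(abs(t + shift - rt) <= t_tol and abs(mz - rmz) <= mz_tol
            for rt, rmz in reference)
        for t, mz in cluster
    )


def max_cluster_match(cluster: List[Point], reference: List[Point],
                      t_max: float, step: float,
                      t_tol: float, mz_tol: float) -> int:
    """S_i as a plain point count (assumption): move the cluster centre from
    0 to t_max along t, keeping m/z and shape, and keep the best count."""
    centre_t = sum(t for t, _ in cluster) / len(cluster)
    best = 0
    for k in range(int(t_max // step) + 1):
        shift = k * step - centre_t
        best = max(best, count_matches(cluster, reference, shift, t_tol, mz_tol))
    return best


def overall_match(clusters: List[List[Point]], reference: List[Point],
                  **scan: float) -> int:
    """S_c as the sum of the per-cluster maxima (assumption: addition)."""
    return sum(max_cluster_match(c, reference, **scan) for c in clusters)


def rank_references(clusters: List[List[Point]],
                    references: Dict[str, List[Point]],
                    gamma: Optional[Dict[str, float]] = None,
                    **scan: float) -> List[Tuple[str, int]]:
    """Sort references by S_c, largest first. If thresholds are given, keep
    only references whose S_c exceeds their own threshold gamma."""
    scored = [(name, overall_match(clusters, pts, **scan))
              for name, pts in references.items()]
    if gamma is not None:
        scored = [(n, s) for n, s in scored if s > gamma.get(n, float("-inf"))]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `rank_references(clusters, references, t_max=1000.0, step=1.0, t_tol=30.0, mz_tol=0.01)` returns the candidate samples ordered by overall matching degree; passing `gamma` additionally applies the per-sample threshold.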

3) verification of unknown sample identification results

Arrange the known traditional Chinese medicine samples matched with the unknown sample in step 2 by matching degree rank and, in that order, search the raw X-MS data and/or high-dimensional data of the unknown sample for the marker compounds (one or more) of each known traditional Chinese medicine sample; when a marker compound is found in the unknown sample, accept the unknown sample as that known traditional Chinese medicine sample and stop searching; if the marker compound of the first-ranked known sample is not found in the unknown sample, search for the marker compound of the second-ranked known sample, and so on until a marker compound is found; if none of the marker compounds of the matched known traditional Chinese medicine samples is found in the unknown sample, the established database is considered not to contain the unknown sample.

In step 2, the search differs slightly depending on whether the known sample database contains a standard for the marker compound:

searching for marker compounds with a standard: obtain the high-dimensional data of the standard by the method of step 1; match the high-dimensional data of the marker compound against the high-dimensional data of the unknown sample, and search the unknown sample for ions whose retention time t and m/z both fall within the threshold window of the marker compound;

searching for marker compounds without a standard: search the unknown sample for the m/z value of the marker compound, looking for ions whose retention time t and m/z both fall within the threshold window of the marker compound in the known traditional Chinese medicine sample.
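The verification logic can be sketched as follows. This is an illustration only: it assumes one marker compound per known sample and symmetric windows around the marker's t and m/z, and the names `find_marker` and `verify` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Ion = Tuple[float, float]  # (retention time t, m/z)


def find_marker(ions: List[Ion], marker: Ion,
                t_window: float, mz_window: float) -> List[Ion]:
    """Return the ions whose t and m/z both lie within the windows
    around the marker compound."""
    marker_t, marker_mz = marker
    return [(t, mz) for t, mz in ions
            if abs(t - marker_t) <= t_window and abs(mz - marker_mz) <= mz_window]


def verify(ranked_names: List[str], markers: Dict[str, Ion],
           unknown_ions: List[Ion],
           t_window: float, mz_window: float) -> Optional[str]:
    """Walk the candidates in rank order and accept the first one whose
    marker compound is found in the unknown sample; None means the
    database is considered not to contain the unknown sample."""
    for name in ranked_names:
        if find_marker(unknown_ions, markers[name], t_window, mz_window):
            return name
    return None
```

The search stops at the first hit, so a lower-ranked candidate is only examined when the marker of every higher-ranked candidate is absent.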

In step 1, in order to make the unknown sample comparable to the known traditional Chinese medicine sample, the same or similar repeatable sample processing, raw data acquisition and data processing methods should be adopted for each sample during the preparation of the unknown sample, the acquisition of the raw data and the data processing.

In step 1, the mean deviation (absolute value) of the chromatographic retention time is the mean of the absolute time deviations of each compound when the same sample is measured repeatedly on the chromatograph under the same conditions; it can be determined with a mixed standard.

In step 1, raw chromatography-mass spectrometry data is obtained by:

1) separating the mixed molecules in the traditional Chinese medicine sample by a chromatograph and an ion mobility spectrometer through a selective action to obtain different retention time information t;

2) the mass spectrometer separates and detects according to different mass-to-charge ratios of molecules under the action of an electric field or a magnetic field to obtain different mass-to-charge ratio information m/z;

3) analyzing the Chinese medicinal sample extract with a chromatography-mass spectrometer, wherein the time (t) range for chromatographic separation is 1-10000s, and the ion (m/z) scanning range is 50-10000 Da; chromatography-mass spectrometry (X-MS) data were obtained.

In step 1, the acquired raw data may be subjected to one or more of retention time correction, filtering, normalization and other data processing; the retention time correction may be based on two or more compounds in the sample being analyzed, on a mixed standard, or on another retention time correction method.
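One possible form of retention time correction based on two or more compounds is a straight-line fit between observed and expected retention times. The text does not specify the correction model, so the linear model and the names below are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def fit_rt_correction(observed: Sequence[float],
                      expected: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line expected = slope * observed + intercept,
    fitted on two or more reference compounds."""
    if len(observed) != len(expected) or len(observed) < 2:
        raise ValueError("need two or more paired retention times")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(expected) / n
    sxx = sum((o - mean_o) ** 2 for o in observed)
    if sxx == 0:
        raise ValueError("observed retention times must not all be equal")
    sxy = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, expected))
    slope = sxy / sxx
    return slope, mean_e - slope * mean_o


def correct_rt(t: float, slope: float, intercept: float) -> float:
    """Apply the fitted correction to one retention time."""
    return slope * t + intercept
```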

The high-dimensional data may include all ions in the high-dimensional data matrix, or the ions in the high-dimensional data matrix may be selectively retained.

The position of a spot in the high-dimensional data image is determined by the properties of the compound: the vertical axis represents chromatographic retention time, and compounds are distributed along it in order of decreasing polarity; the horizontal axis represents m/z, and compounds are distributed along it in order of increasing m/z; the same compound can appear in a mass spectrum in several forms, such as quasi-molecular ions, adduct ions and fragment ions, so a compound can give spots at the same vertical position but different horizontal positions; compounds (spots) of similar properties form regional point clusters that represent a certain class of substance.

The more ions the chromatography-mass spectrometry data contain, the richer the information in the constructed chromatography-mass spectrometry high-dimensional image and the more it benefits identification.

Noise can cause recognition deviation; denoising at an early stage, using the signal-to-noise ratio or isotope distribution pattern of each ion in the raw chromatography-mass spectrometry data, improves recognition accuracy.

Retention time correction is not mandatory in step 1.

The chromatogram-mass spectrum information or ion mobility spectrum-mass spectrum information in the database can be expanded into two dimensions, three dimensions or higher dimensions.

Example 1 establishment of Chinese medicine chromatography-Mass Spectrometry high-dimensional image database

First, preparation of known traditional Chinese medicine samples

Methods for preparing the traditional Chinese medicine samples include, but are not limited to, solvent extraction, and cover any method suitable for preparing traditional Chinese medicine samples. The known traditional Chinese medicine samples in the database of the present invention were prepared from 547 varieties of reference medicinal materials from the National Institutes for Food and Drug Control (see Table 1). For each reference medicinal material, 100 mg of powder was mixed with 0.5 ml of 50% (v/v) methanol, extracted ultrasonically for 10 min and centrifuged at 15000 r/min for 10 min, and the supernatant was collected; a further 0.5 ml of 50% (v/v) methanol was added to the residue, which was again extracted ultrasonically for 10 min and centrifuged at 15000 r/min for 10 min, and the supernatant was collected. The two supernatants were combined.

Second, acquisition and processing of chromatography-mass spectrometry raw data of known traditional Chinese medicine samples

The raw data of the known traditional Chinese medicine samples were acquired by combined chromatography-mass spectrometry. The known traditional Chinese medicine samples must be analyzed under the same conditions to obtain comparable chromatography-mass spectrometry high-dimensional images. An Agilent 1290 ultra performance liquid chromatography system (Agilent, Waldbronn, Germany) was coupled to a 6520 Q-TOF-MS (Agilent Corp, USA).

1. Chromatographic process

An Agilent ZORBAX Eclipse Plus C18 column (3.0 × 150 mm, 1.8 μm) was used; mobile phase A was water containing 0.5% acetic acid and mobile phase B was acetonitrile. Gradient elution: 0 to 15 min, 5% to 100% B; 15 to 20 min, 100% B; 20 to 21 min, 100% to 5% B; 21 to 25 min, 5% B. The flow rate was 0.3 ml/min, the column temperature was 60 ℃ and the injection volume was 2 μl.
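The gradient program can be written as a small lookup that returns the percentage of phase B at any time. The sketch assumes that each ramp is linear between its end points, which the text does not state explicitly; `GRADIENT` and `percent_b` are hypothetical names.

```python
from typing import List, Tuple

# (time in minutes, % phase B) break points taken from the gradient program.
GRADIENT: List[Tuple[float, float]] = [
    (0.0, 5.0), (15.0, 100.0), (20.0, 100.0), (21.0, 5.0), (25.0, 5.0),
]


def percent_b(t_min: float) -> float:
    """Percentage of phase B at time t_min, assuming linear ramps
    between consecutive break points."""
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise ValueError("time outside the 0 to 25 min program")
```

Under this assumption the composition halfway through the first ramp (7.5 min) is 52.5% B.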

2. Mass spectrometry method

Data were collected with an ESI ion source in negative ion mode over the range m/z 100-3200. The temperature was 350 ℃, the drying gas flow rate was 8 L/min, the nebulizer gas pressure was 40 psi, the capillary voltage was 3500 V, the fragmentor voltage was 200 V and the skimmer voltage was 65 V.

3. Data processing of chromatography-mass spectrometry raw data of known traditional Chinese medicine sample

The raw data of the present invention include chromatographic information (such as retention time and peak intensity) and mass spectral information (such as mass-to-charge ratio) for each compound in the sample extract. Raw data processing includes correction, filtering and normalization. The raw data were imported into the peak extraction software Progenesis QI, with the threshold set to 0.005% of the base peak intensity to remove noise signals; the m/z, t and I values of each compound in the sample were obtained, and the resulting m/z-t-I data matrix was saved as a .csv file (Excel table format).
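The noise filter and the export of the m/z-t-I matrix can be mimicked outside the peak extraction software. The sketch below is not Progenesis QI's algorithm; it simply drops peaks below 0.005% of the base peak intensity and serializes the rest as CSV text. The column names and the choice of keeping peaks exactly at the cutoff are assumptions.

```python
import csv
import io
from typing import List, Tuple

Peak = Tuple[float, float, float]  # (m/z, retention time t, intensity I)


def remove_noise(peaks: List[Peak], percent_of_base: float = 0.005) -> List[Peak]:
    """Keep peaks whose intensity is at least `percent_of_base` percent
    of the most intense (base) peak."""
    base_peak = max(intensity for _, _, intensity in peaks)
    cutoff = base_peak * percent_of_base / 100.0
    return [p for p in peaks if p[2] >= cutoff]


def to_csv(peaks: List[Peak]) -> str:
    """Serialize the m/z-t-I matrix as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["m/z", "t", "I"])  # hypothetical column names
    writer.writerows(peaks)
    return buffer.getvalue()
```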

Third, acquisition of high-dimensional data and chromatography-mass spectrometry high-dimensional images of known traditional Chinese medicine samples

1. Acquisition of high dimensional data

The file produced by the raw data processing step was imported into Matlab, and the 2000 ions with the highest intensity were retained.
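Retaining the most intense ions is a simple top-N selection. The example uses Matlab in the original workflow; the following Python equivalent is only a sketch, and `top_ions` is a hypothetical name.

```python
import heapq
from typing import List, Tuple

Peak = Tuple[float, float, float]  # (m/z, retention time t, intensity I)


def top_ions(peaks: List[Peak], n: int = 2000) -> List[Peak]:
    """Return the n most intense ions, most intense first
    (all ions if fewer than n are present)."""
    return heapq.nlargest(n, peaks, key=lambda p: p[2])
```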

2. Creation of high dimensional data images

Points in the chromatography-mass spectrometry high-dimensional image correspond one-to-one to the high-dimensional data. The high-dimensional data were imported into Matlab and an m/z-t-I plot of the sample was drawn with m/z and t as coordinates; each detectable compound has specific mass and time coordinates, and its mass spectral signal intensity (peak value) I is represented by the area or the color value of the point.

3. Conversion of high-dimensional image of chromatogram-mass spectrum

The original image created in the above steps may also be converted; conversion methods include image blurring, conversion to a different resolution, and the like.
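As an illustration of these two conversions, the sketch below bins the m/z-t-I points onto a grid whose bin sizes set the resolution, then applies a 3 × 3 mean filter as a simple form of blurring. The text does not specify either operation in detail, so the binning scheme, the filter and all names are assumptions; points are assumed to lie within the stated maxima.

```python
from typing import List, Tuple

Peak = Tuple[float, float, float]  # (m/z, retention time t, intensity I)


def rasterize(points: List[Peak], mz_max: float, t_max: float,
              mz_bin: float, t_bin: float) -> List[List[float]]:
    """Sum intensities into a grid; rows index t (vertical axis) and
    columns index m/z (horizontal axis). Larger bins give lower resolution."""
    n_rows = int(t_max / t_bin) + 1
    n_cols = int(mz_max / mz_bin) + 1
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for mz, t, intensity in points:
        grid[int(t / t_bin)][int(mz / mz_bin)] += intensity
    return grid


def box_blur(grid: List[List[float]]) -> List[List[float]]:
    """3 x 3 mean filter; neighbours outside the grid are ignored."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            cells = [grid[rr][cc]
                     for rr in range(max(0, r - 1), min(n_rows, r + 2))
                     for cc in range(max(0, c - 1), min(n_cols, c + 2))]
            out[r][c] = sum(cells) / len(cells)
    return out
```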

Fourth, spatial information of the chromatography-mass spectrometry high-dimensional image

The X-MS high-dimensional image of the present invention includes, but is not limited to, spots and point clusters. Each spot is generated by one compound, but each compound can generate one or more spots. The position of a spot is determined by the properties of the compound: the vertical axis represents chromatographic retention time, and compounds are distributed along it in order of decreasing polarity; the horizontal axis represents m/z, and compounds are distributed along it in order of increasing m/z; the same compound can appear in a mass spectrum in several forms, such as quasi-molecular ions, adduct ions and fragment ions, so a compound can give spots at the same vertical position but different horizontal positions. Compounds (spots) of similar properties form regional point clusters that represent a certain class of substance.

Fifth, establishment of the traditional Chinese medicine chromatography-mass spectrometry high-dimensional image database

The database established in this embodiment may be in a format including, but not limited to, text, Excel, Oracle, MySQL, SQLite or Microsoft SQL Server. A traditional Chinese medicine chromatography-mass spectrometry high-dimensional image database of 547 varieties of reference medicinal materials was obtained, comprising: 1) a sample information database in Excel format, containing sample number, name, source, specification, medicinal part, order, family, genus and species; 2) a raw chromatography-mass spectrometry data database of all varieties, in folder format; 3) a high-dimensional image database of the high-dimensional data of all varieties, in folder format.

Example 2 rapid identification of a raw rehmannia root sample

First, preparation of unknown sample

The unknown sample was prepared in the same way as the known traditional Chinese medicine samples. In this example, the unknown sample was designated SS2-6520-006-0007. 100 mg of the unknown sample powder was mixed with 0.5 ml of 50% (v/v) methanol, extracted ultrasonically for 10 min and centrifuged at 15000 r/min for 10 min, and the supernatant was collected; a further 0.5 ml of 50% (v/v) methanol was added to the residue, which was again extracted ultrasonically for 10 min and centrifuged at 15000 r/min for 10 min, and the supernatant was collected. The two supernatants were combined.

Second, acquisition and processing of chromatography-mass spectrometry raw data of the unknown sample

The raw data of the unknown sample were acquired by chromatography-mass spectrometry. The unknown sample must be analyzed under the same or similar conditions as the known traditional Chinese medicine samples to obtain a comparable chromatography-mass spectrometry high-dimensional image. Unknown sample SS2-6520-006-0007 was analyzed on an Agilent 1290 ultra performance liquid chromatography system (Agilent, Waldbronn, Germany) coupled to a 6520 Q-TOF-MS (Agilent Corp, USA).

1. Chromatographic process

2. Mass spectrometry method

The Agilent 6520 Q-TOF-MS collected data with an ESI ion source in negative ion mode over the range m/z 100-3200. The temperature was 350 ℃, the drying gas flow rate was 8 L/min, the nebulizer gas pressure was 40 psi, the capillary voltage was 3500 V, the fragmentor voltage was 200 V and the skimmer voltage was 65 V.

3. Data processing of unknown sample chromatography-mass spectrometry raw data

The raw data include chromatographic information (such as retention time and peak intensity) and mass spectral information (such as mass-to-charge ratio) for each compound in the sample extract. Raw data processing includes correction, filtering and normalization. The raw data were imported into the peak extraction software Progenesis QI, with the threshold set to 0.005% of the base peak intensity to remove noise signals; the m/z, t and I values of each compound in the sample were obtained, and the resulting m/z-t-I data matrix was saved as a .csv file (Excel table format).

Third, acquisition of high-dimensional data and the chromatography-mass spectrometry high-dimensional image of the unknown sample

1. Acquisition of high dimensional data

2. Creation of high dimensional data images

Points in the chromatography-mass spectrometry high-dimensional image correspond one-to-one to the high-dimensional data. The high-dimensional data were imported into Matlab and an m/z-t-I plot of the sample was drawn with m/z and t as coordinates; each detectable compound has specific mass and time coordinates, and its mass spectral signal intensity (peak value) I is represented by the area or the color value of the point.

3. Conversion of high-dimensional image of chromatogram-mass spectrum

The original image created in the above steps may also be converted; conversion methods include image blurring, conversion to a different resolution, and the like. In this example, the original chromatography-mass spectrometry high-dimensional image generated from the high-dimensional data is used without conversion.

Fourth, identification of unknown samples

1. first, the points in the X-MS high-dimensional image of the sample to be detected, SS2-6520-006-0007, are divided into 34 point clusters with the machine learning clustering tool Clusterdp; each point cluster contains n ≥ 10 points;

2. respectively scanning and matching the chromatogram-mass spectrum high-dimensional image of the sample to be detected after the point cluster extraction and the chromatogram-mass spectrum high-dimensional image of the reference sample (m);

3. during scanning, the origin, t axis and m/z axis of the two chromatography-mass spectrometry high-dimensional images are aligned; each point cluster of the sample to be detected then keeps its m/z position and geometric shape while it is scanned continuously along the time axis (t); scanning finds the common points, accurately matched in both t and m/z, between the point cluster of the sample to be detected and the chromatography-mass spectrometry high-dimensional image of the reference sample (m);

4. during scanning, the point cluster as a whole moves within the range 0 to T_k, where T_k is the effective analysis time of the sample; in this example T_k = 1000 s;

5. During scanning, the scanning step length of the point cluster along a time axis (t) is 1 s;

6. during scanning, when a point of a point cluster in the sample to be detected is matched against a point in the chromatography-mass spectrometry high-dimensional image of the reference sample (m), the minimum allowed t deviation (t tolerance) of each point is ±30 s and the minimum allowed m/z (or m) deviation [m/z (or m) tolerance] is ±0.01 Da;

7. when a point cluster moves to each position of the t axis of the X-MS high-dimensional image of the reference sample (m), recording the number of matching points, the coordinate of each matching point and the coordinate of the geometric center point of the point cluster;

8. calculating the correlation between a point cluster (i) of the sample to be detected and a reference sample (m) in a traditional Chinese medicine X-MS high-dimensional image database by using a 2D correlation function in Matlab;

9. calculating the maximum correlation degree of each point cluster of the sample to be detected and a reference sample chromatogram-mass spectrum high-dimensional image in the direction of the t axis;

10. using the point-count method, calculate the matching degree (S_i) between each point cluster in the X-MS high-dimensional image of the sample to be detected and the chromatography-mass spectrometry high-dimensional image of the reference sample, at the position where the point cluster attains its maximum correlation;

S_i is the matching degree of the i-th point cluster; k is the number of points in the point cluster that meet the matching requirement; each matching point contributes the value of a function of its m/z (or m), t (chromatographic retention time) and I (ion signal intensity), evaluated at the j-th point (j = 1 to k);

in this function, x, y and z are the exponents of the three variables I, m/z and t, where x ≥ 0, y ≥ 0 and z ≥ 0;

in this embodiment, x = 0, y = 1/2 and z = 1/2;

11. following the above steps, calculate the overall matching degree (S_c) between the X-MS high-dimensional image of the sample to be detected and the X-MS high-dimensional image of the reference sample (m);

n is the total number of matching points over all point clusters at their maximum matching degree, and the function value corresponding to each of these points (1 to n), obtained from the point clusters, enters the calculation;

12. repeat the above steps to obtain the matching degree with each reference sample.
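The per-point function in steps 10 and 11 is described only through its variables and exponents. One form consistent with that description, offered purely as an assumption, is a product of powers I^x · (m/z)^y · t^z summed over the matching points. The sketch below implements that assumed form without any normalization, so its values are not expected to reproduce the percentages reported in this example; all names are hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[float, float, float]  # (m/z, retention time t, intensity I)


def point_weight(intensity: float, mz: float, t: float,
                 x: float = 0.0, y: float = 0.5, z: float = 0.5) -> float:
    """Assumed per-point function I**x * (m/z)**y * t**z; the defaults are
    the exponents used in this embodiment (x = 0, y = 1/2, z = 1/2)."""
    return (intensity ** x) * (mz ** y) * (t ** z)


def cluster_score(matched_points: List[Peak],
                  x: float = 0.0, y: float = 0.5, z: float = 0.5) -> float:
    """Assumed S_i: sum of the per-point values over the k matching points."""
    return sum(point_weight(i, mz, t, x, y, z) for mz, t, i in matched_points)
```

With x = 0 the intensity drops out, so under this assumed form each matching point is weighted only by its position in the m/z-t plane.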

Sample SS2-6520-006-0007 was matched against each of the 547 reference samples; the matching degree was highest with the rehmannia glutinosa reference sample DB-A2-6-0004-03, at 195.05% (the matching degrees with all reference samples are shown in Table 2).

Fifth, verification of the unknown sample identification result

In order of matching degree, the known sample with the highest matching degree to unknown sample SS2-6520-006-0007 is rehmannia glutinosa. The main component of the known rehmannia glutinosa sample (t 7.29 min, m/z 623.1978) was searched for in unknown sample SS2-6520-006-0007, and a compound at t 6.23 min, m/z 623.1974 was found; since the compound found lies within the acceptable retention time and m/z windows, unknown sample SS2-6520-006-0007 is accepted as rehmannia glutinosa. Comparison with the medicinal material information of the unknown sample confirms that the rehmannia glutinosa sample was correctly identified.

Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

TABLE 1

TABLE 2

Claims

1. A raw rehmannia root identification platform, comprising:

2. The radix rehmanniae identification platform of claim 1 wherein the chromatography-mass spectrometry data of the known sample comprises raw chromatography-mass spectrometry information of the known sample and the chromatography-mass spectrometry data of the unknown sample comprises raw chromatography-mass spectrometry information of the unknown sample;

preferably, the chromatography-mass spectrometry data of the known sample further comprises high-dimensional data of each compound in the known sample, and the chromatography-mass spectrometry data of the unknown sample further comprises high-dimensional data of each compound in the unknown sample;

further preferably, the high-dimensional data express spatial information among the data points in the sample, the spatial information being a matrix formed from at least one of the following:

distance information between data points;

angular relationship information between data points;

coordinate position information of the data points;

density information of the data points;

edge range information for the set of data points;

intensity information of the data points;

preferably, the distance information between the data points comprises at least one of chromatographic retention time t, m/z value, m value, z value, peak intensity I;

3. The raw rehmannia glutinosa recognition platform according to claim 2, wherein the high-dimensional data image generated from the high-dimensional data comprises at least one of an original image generated from the high-dimensional data, an image generated based on image features, an image generated by converting the image, and an image constructed using a function;

preferably, the image features comprise point clusters of data points, common particles, sample contours;

preferably, the image conversion process includes at least one of a process of blurring the image and a process of subjecting the image to different resolutions;

preferably, the function comprises at least one of chromatographic retention time t, m/z, m, peak intensity I;

preferably, the high-dimensional image is an image of more than two dimensions;

preferably, the image file is stored in an image file format.

4. The raw rehmannia glutinosa recognition platform according to claim 1, wherein the known sample comprises at least one of a standard or a known chinese herb sample;

preferably, the standard comprises at least one of a reference substance, a marker component of a traditional Chinese medicine, and a main chemical component of a traditional Chinese medicine in the Chinese Pharmacopoeia (2015 edition);

preferably, the known traditional Chinese medicine sample is a sample with definite category information, and the category information comprises at least one of the species, the producing area, the part and the processing mode of the sample;

preferably, the known traditional Chinese medicine sample comprises at least one of raw traditional Chinese medicine materials, decoction pieces and powder, and further preferably, the known traditional Chinese medicine sample comprises at least one of different parts of traditional Chinese medicines and processed products thereof.

5. The radix rehmanniae identification platform of claim 1 wherein the unknown sample identification module comprises an image segmentation tool or a clustering tool.

6. The raw rehmannia glutinosa recognition platform according to claim 1, wherein the database type in each database module comprises at least one of a folder data set, a web page database, a database based on a commercialization workstation or a database based on a user self-developed workstation.

7. A method for identifying dried rehmannia root by using the dried rehmannia root identification platform of any one of claims 1 to 6, wherein the method comprises at least the following steps:

2) generating chromatography-mass spectrometry high-dimensional data of a known sample and an unknown sample, wherein the chromatography-mass spectrometry high-dimensional data expresses spatial information among data points;

4) dividing points in the chromatogram-mass spectrum high-dimensional image of the unknown sample into n point clusters by using an image dividing tool or a clustering tool, wherein n is an integer more than or equal to 1, and respectively scanning and matching the chromatogram-mass spectrum high-dimensional image of the unknown sample after the point clusters are extracted and the mass spectrum-chromatogram high-dimensional image of the known sample one by one;

preferably, the coordinate information includes at least one of distance information between data points, angular relationship information between data points, coordinate position information of data points, density information of data points, edge range information of a data point set, and intensity information of data points;

preferably, the point cluster is a set of data points close in space, and the number n of the data points in the point cluster is more than or equal to 3;

preferably, each of said clusters has its own centre point;

preferably, the shape of the dot clusters is arbitrary.

8. The method of claim 7, wherein the raw chromatography-mass spectrometry data of the known and unknown samples is obtained by:

preferably, the time t used for chromatographic separation is in the range of 1-10000s and the m/z scan of the ions is in the range of 50-10000 Da.

9. The method of claim 8, further comprising subjecting the acquired raw chromatography-mass spectrometry data to at least one of retention time correction, filtering, and normalization;

preferably, the method further comprises the step of using quality control samples and internal standards from mixed standards;

preferably, the quality control sample comprises at least one of a known sample or a mixture thereof, an unknown sample or a mixture thereof, and a mixture of two or more standards, and the quality control sample is used for evaluating data quality;

preferably, when mixed standards are employed, their internal standards are used to improve the reproducibility of the assay and to perform retention time correction.

10. The method of claim 8, wherein the unknown sample is at least one of a raw herb, a decoction piece, a powder, a preparation, different parts of a herb, and processed products thereof;

preferably, the preparation comprises traditional Chinese medicine granules or a traditional Chinese medicine injection.