WO2023207453A1 - Method and system for analyzing traditional Chinese medicine components based on spectral clustering
- Publication number
- WO2023207453A1 (PCT/CN2023/083467)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- chinese medicine
- traditional chinese
- sample
- samples
- new
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/20—Identification of molecular entities, parts thereof or of chemical compositions
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N21/00—Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
- G01N21/17—Systems in which incident light is modified in accordance with the properties of the material investigated
- G01N21/25—Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands
- G01N21/31—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry
- G01N21/35—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light
- G01N21/359—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light using near infrared light
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Definitions
- The present invention relates to the technical field of near-infrared spectroscopy analysis, and in particular to a method and system for analyzing traditional Chinese medicine components based on spectral clustering.
- Near-infrared (NIR) radiation is electromagnetic radiation with wavelengths from 780 nm to 2526 nm. NIR spectra mainly reflect absorption by the overtones and combination bands of C-H, O-H and N-H vibrations. NIR spectroscopy is fast, low-cost, simple to operate, non-destructive and reproducible, and it conforms to the concept of green analytical chemistry. As a rapid analysis technology, it has been widely used in pharmaceutical science, food science, petrochemistry and other fields, and it has shown great potential in the qualitative identification, quantitative analysis and real-time online analysis of traditional Chinese medicine and food.
- This disclosure proposes a method and system for analyzing traditional Chinese medicine ingredients based on spectral clustering. The sample closest to the center of each spectral category is selected as a candidate sample and added to the calibration set divided from the original sample set, which updates the original calibration set. The traditional Chinese medicine component analysis model is then updated and retrained, so that the trained model is more accurate and has better predictive performance.
- A method for analyzing traditional Chinese medicine ingredients based on spectral clustering, including: acquiring the near-infrared spectrum of the traditional Chinese medicine, and obtaining the ingredient analysis results from that spectrum and a trained traditional Chinese medicine component analysis model.
- The specific process of obtaining the trained traditional Chinese medicine component analysis model is: obtaining near-infrared spectrum samples of traditional Chinese medicine components; dividing these samples into an original sample set and a new sample set; dividing the original sample set into a calibration set and a validation set, and using the calibration set and validation set to construct the traditional Chinese medicine component analysis model; performing cluster analysis on the new sample set to obtain different sample categories; selecting the sample closest to the center of each category as a candidate sample; adding the candidate samples to the calibration set divided from the original sample set to form a new calibration set, and using the remaining samples in the new sample set, other than the candidate samples, as the test set; and training the traditional Chinese medicine component analysis model with the new calibration set and the test set to obtain the trained model.
- A traditional Chinese medicine component analysis system based on spectral clustering, including:
- a data acquisition module, used to acquire the near-infrared spectrum of the traditional Chinese medicine;
- a result acquisition module, used to obtain the analysis results of traditional Chinese medicine ingredients based on the near-infrared spectrum of the traditional Chinese medicine and the trained traditional Chinese medicine ingredient analysis model.
- The trained traditional Chinese medicine component analysis model is obtained by the same process as described for the method above.
- An electronic device, including a memory, a processor, and computer instructions stored in the memory and run on the processor. When the computer instructions are run by the processor, the steps of the traditional Chinese medicine component analysis method based on spectral clustering are completed.
- A computer-readable storage medium for storing computer instructions. When the computer instructions are executed by a processor, the steps of the traditional Chinese medicine component analysis method based on spectral clustering are completed.
- The present disclosure first trains the original model on the original sample set to obtain the traditional Chinese medicine ingredient analysis model, and then selects the sample closest to the center of each spectral category in the new sample set as a candidate sample.
- The candidate samples are added to the calibration set divided from the original sample set to form a new calibration set.
- The traditional Chinese medicine ingredient analysis model is updated and trained with the new calibration set, which yields the trained traditional Chinese medicine ingredient analysis model.
- The trained model obtained in this way has better prediction performance, which improves its practical application value.
- Figure 1 is a flow chart of the method disclosed in Embodiment 1;
- Figure 2 shows the near-infrared spectra of all samples in verification example 1 of Embodiment 1;
- Figure 3 shows the distribution of all samples of verification example 1 in the space of the first and second principal components (PCs);
- Figure 4 is a dendrogram of the clustering results obtained with the Ward method in verification example 1;
- Figure 5 is a dendrogram of the clustering results obtained with the Average method in verification example 1;
- Figure 6 shows the distribution, in the space of the first and second PCs, of the samples selected by different methods in verification example 1;
- Figure 7 shows the near-infrared spectra of all samples in verification example 2 of Embodiment 1;
- Figure 8 shows the distribution of all samples of verification example 2 in the space of the first and second PCs;
- Figure 9 is a dendrogram of the clustering results obtained with the Ward method in verification example 2;
- Figure 10 is a dendrogram of the clustering results obtained with the Average method in verification example 2;
- Figure 11 shows the distribution, in the space of the first and second PCs, of the samples selected by different methods in verification example 2.
- Embodiment 1 discloses a method for analyzing traditional Chinese medicine ingredients based on spectral clustering.
- The specific process of obtaining the trained traditional Chinese medicine component analysis model is: obtaining near-infrared spectrum samples of traditional Chinese medicine components; dividing these samples into an original sample set and a new sample set; dividing the original sample set into a calibration set and a validation set, and using the calibration set and validation set to construct the traditional Chinese medicine component analysis model; performing cluster analysis on the new sample set to obtain different sample categories; selecting the sample closest to the center of each category as a candidate sample; adding the candidate samples to the calibration set divided from the original sample set to form a new calibration set, and using the remaining samples in the new sample set, other than the candidate samples, as the test set; and training the traditional Chinese medicine component analysis model with the new calibration set and the test set to obtain the trained model.
- The Ward method or the Average method is used to perform cluster analysis on the new sample set.
- The traditional Chinese medicine ingredient analysis model adopts a PLS model, a neural network model or a support vector machine model.
- The near-infrared spectrum samples of traditional Chinese medicine ingredients are preprocessed, and the original sample set and the new sample set are constructed from the preprocessed samples.
- The samples in the original sample set and the new sample set do not overlap.
- The traditional Chinese medicine ingredient analysis method based on spectral clustering includes the following:
- A spectrometer is used to obtain the near-infrared spectrum of the traditional Chinese medicine.
- The traditional Chinese medicine component analysis model adopts a PLS model, a neural network model, a support vector machine model, or the like.
- The near-infrared spectrum samples of traditional Chinese medicine ingredients include each ingredient index and the reference value of that index measured using industry-standard detection methods.
- The near-infrared spectrum samples of traditional Chinese medicine ingredients can be divided directly into the original sample set and the new sample set. Alternatively, the near-infrared spectra can be preprocessed first, and the preprocessed spectra then divided into the original sample set and the new sample set.
- The preprocessing methods for the near-infrared spectra of traditional Chinese medicine ingredients include any one, or a combination of several, of: smoothing, first-order derivative calculation, second-order derivative calculation, standardization, baseline drift correction, standard normal variate (SNV) transformation, multiplicative scatter correction (MSC), and the like.
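As an illustration of one of the listed preprocessing options, the following is a minimal sketch of multiplicative scatter correction. The text does not specify how the correction is carried out, so the function name `msc`, the use of the mean spectrum as the default reference, and the toy spectra are all assumptions made for this example.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a
    reference spectrum and remove the fitted offset and slope."""
    spectra = np.asarray(spectra, dtype=float)
    # Assumption: the mean spectrum is the reference when none is supplied.
    if reference is None:
        ref = spectra.mean(axis=0)
    else:
        ref = np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)  # s ~ slope * ref + intercept
        corrected[i] = (s - intercept) / slope
    return corrected

# Toy data: one base spectrum distorted by additive and multiplicative effects.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
raw = np.array([1.2 * base + 0.3, 0.8 * base - 0.1, 1.0 * base + 0.5])
corrected = msc(raw, reference=base)
```

With a known reference, the three distorted toy spectra collapse back onto the base spectrum, which is the effect the correction is meant to have on scatter-induced offsets and slopes.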
- The number of samples in the calibration set $X_{cal}$ is greater than or equal to the number of samples in the validation set $X_{val}$; the ratio of calibration samples to validation samples is 2:1 or above.
- S241: Perform hierarchical cluster analysis (HCA) on the sample spectra in the new sample set, and divide the new samples into different categories $X_{new,i}$ according to the clustering results and the selected number of categories, where $i$ indexes the categories. Either the Ward method or the Average method is used for the clustering.
- Here $x_{center,i}$ represents the sample center of category $i$, and $N$ represents the number of samples in that category.
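The equations that these symbol definitions belong to are not present in this text. Given the definitions above and the Euclidean distance named in step S244, they would most plausibly take the following form. This is a reconstruction from the surrounding definitions, not the original notation; the index $j$ over the samples $x_j$ of category $X_{new,i}$ is an assumption.

```latex
x_{center,i} = \frac{1}{N} \sum_{j=1}^{N} x_{j}, \qquad
d_{x(j)} = \left\lVert x_{j} - x_{center,i} \right\rVert_{2},
\qquad x_{j} \in X_{new,i}
```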
- S244: Sort the calculated Euclidean distances and select the sample with the smallest Euclidean distance in each category as a candidate sample $X_{sel}$.
- S245: Add all candidate samples to the calibration set divided from the original sample set to form a new calibration set, and use the remaining samples in the new sample set, other than the candidate samples, as the test set $X_{test}$. Train the traditional Chinese medicine component analysis model with the new calibration set and the test set to obtain the trained model.
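The selection steps S241 to S245 can be sketched roughly as follows. This is only an illustrative sketch: it assumes SciPy's hierarchical clustering as the HCA implementation, and the function name `select_center_samples`, the toy two-dimensional "spectra" and the choice of three categories are hypothetical rather than taken from the patent.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def select_center_samples(X_new, n_clusters, method="ward"):
    """Cluster the new samples and return, per category, the index of the
    sample with the smallest Euclidean distance to the category center."""
    Z = linkage(X_new, method=method)  # method may be "ward" or "average"
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    selected = []
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        center = X_new[members].mean(axis=0)                    # category center
        dist = np.linalg.norm(X_new[members] - center, axis=1)  # distances to it
        selected.append(int(members[np.argmin(dist)]))
    return sorted(selected), labels

# Toy data standing in for preprocessed spectra: three separated groups.
rng = np.random.default_rng(0)
group_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_new = np.vstack([c + rng.normal(scale=0.3, size=(8, 2)) for c in group_centers])

sel_idx, labels = select_center_samples(X_new, n_clusters=3)
test_idx = [i for i in range(len(X_new)) if i not in sel_idx]
# X_new[sel_idx] would be appended to the calibration set;
# X_new[test_idx] would serve as the test set.
```

Taking exactly one sample per category keeps the number of added samples equal to the number of categories, which matches the stated goal of selecting as few new samples as possible.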
- This embodiment discloses a traditional Chinese medicine component analysis method based on spectral clustering. The sample closest to the center of each spectral category is selected as a candidate sample and added to the calibration set divided from the original sample set, and the traditional Chinese medicine component analysis model is then updated. As a result, the trained model has better prediction performance for unknown new samples and has more practical application value.
- The method disclosed in this embodiment was verified using commercially available Astragalus membranaceus extract (RAE) as an example.
- A total of 82 RAE samples were measured, covering 9 batches collected from 5 manufacturers.
- The specific information is shown in Table 1. The 53 samples from batches S1 to S6 are used as the original sample set $X$, which is used to establish the traditional Chinese medicine component analysis model, and the remaining 29 samples (batches S7 to S9) are used as the new sample set $X_{new}$.
- The near-infrared spectra of the original samples and the new samples were measured with an Antaris II AA-NIR spectrometer (Thermo Fisher Scientific Co., Ltd., USA). The measured spectra are shown in Figure 2, where the solid lines are the original samples and the dotted lines are the new samples.
- Astragaloside IV (ASA IV), calycosin glucoside (CG) and astragalus polysaccharide (APS) are used as reference ingredient indicators.
- The original sample set is divided into a calibration set $X_{cal}$ and a validation set $X_{val}$ using the commonly used KS method; these are used to develop and validate the traditional Chinese medicine ingredient analysis model, respectively.
- The calibration set contains 36 samples and the validation set contains 17 samples.
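The text names the KS method without describing it. Assuming it refers to the Kennard-Stone algorithm, as is usual in chemometrics, a 36/17 split of 53 samples could be sketched as follows; the function name `kennard_stone` and the random stand-in data are illustrative assumptions only.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Kennard-Stone selection: start from the two most distant samples,
    then repeatedly add the sample farthest from those already selected."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Stand-in for 53 preprocessed spectra with 5 variables each (random toy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(53, 5))
cal_idx, val_idx = kennard_stone(X, n_select=36)  # 36 calibration, 17 validation
```

Because the selection works only on distances between spectra, the calibration set tends to span the spectral space of the original samples, while the samples left over form the validation set.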
- The near-infrared spectra are preprocessed using SNV combined with first-order derivatives.
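A minimal sketch of this preprocessing chain is shown below. The text does not say how the derivative is computed; simple finite differences via `numpy.gradient` are assumed here (Savitzky-Golay derivatives are a common alternative), and the function names and the synthetic spectra are hypothetical.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row)
    by its own mean and standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def first_derivative(spectra, step=1.0):
    """First-order derivative along the wavelength axis (finite differences)."""
    return np.gradient(np.asarray(spectra, dtype=float), step, axis=1)

# Toy spectra: the second row is the first with a gain and an offset applied.
wavelengths = np.linspace(1000.0, 2500.0, 200)
base = np.exp(-((wavelengths - 1700.0) / 150.0) ** 2)
raw = np.vstack([base, 2.0 * base + 3.0])

processed = first_derivative(snv(raw), step=wavelengths[1] - wavelengths[0])
```

In this toy case SNV removes the gain and offset, so both rows become identical before the derivative is taken.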
- PCA: principal component analysis.
- Figures 4 and 5 show the clustering results for the preprocessed new samples obtained with the Ward and Average methods.
- The three cut lines in each dendrogram (two broken-line styles and one solid line) indicate that the new samples are divided into 4, 5 and 6 categories respectively; in each case the new samples are divided into different categories $X_{new,i}$.
- As Figures 4 and 5 show, although the dendrograms formed by the two clustering methods differ, the classification results for categories 1-5 are the same, so both methods assign the same samples to these categories.
- The sample center $x_{center,i}$ of each category is calculated, then the Euclidean distance $d_{x(j)}$ from each sample to its category center is calculated and sorted, and the sample closest to the center of each category is selected.
- The selected samples are added to the calibration set $X_{cal}$ as candidate samples $X_{sel}$ to form a new calibration set, which is used to update the traditional Chinese medicine component analysis model; the remaining samples are used as the test set $X_{test}$ to verify the updated model.
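The update step can be illustrated with a small sketch. It uses a single-response PLS model fitted with the NIPALS algorithm as a stand-in for the component analysis model (the text allows PLS, neural network or support vector machine models); the helper names, the synthetic data and the hard-coded indices of the candidate samples are all assumptions made for this example, not values from the patent.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)      # weight vector
        t = Xk @ w                     # scores
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        q_k = (yk @ t) / tt            # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - q_k * t              # deflate y
        W.append(w)
        P.append(p)
        q.append(q_k)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

def rmse(y_ref, y_pred):
    return float(np.sqrt(np.mean((y_ref - y_pred) ** 2)))

# Synthetic data: the third variable never varies in the original samples,
# so the original model cannot learn its contribution.
rng = np.random.default_rng(1)
b = np.array([1.5, -2.0, 5.0])
X_cal = np.column_stack([rng.uniform(0, 1, (20, 2)), np.zeros(20)])
y_cal = X_cal @ b
X_new = np.column_stack([rng.uniform(0, 1, (12, 2)), rng.uniform(1, 2, 12)])
y_new = X_new @ b

sel = [0, 1, 2]  # hypothetical indices of the candidate samples X_sel
rest = [i for i in range(len(X_new)) if i not in sel]  # test set X_test

original = pls1_fit(X_cal, y_cal, n_components=2)
rmse_before = rmse(y_new[rest], pls1_predict(original, X_new[rest]))

X_cal_new = np.vstack([X_cal, X_new[sel]])           # new calibration set
y_cal_new = np.concatenate([y_cal, y_new[sel]])
updated = pls1_fit(X_cal_new, y_cal_new, n_components=3)
rmse_after = rmse(y_new[rest], pls1_predict(updated, X_new[rest]))
```

In this noise-free toy setting the prediction error on the held-out new samples drops sharply once three new samples are added to the calibration set. It illustrates the mechanism only and says nothing about the size of the improvement on real spectra.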
- When a model is updated in order to predict new samples, as few new samples as possible should be selected. Here, 3 (about 10% of the 29 new samples) to 6 samples (about 20%) were selected for the model update. The results are shown in Table 3.
- The $R_t$ and $RPD_t$ values are both higher than those obtained by direct prediction with the original traditional Chinese medicine component analysis model, and the RMSEA value is significantly reduced, indicating that the updated model greatly improves the content prediction for new samples.
- The RMSEA values of the three components ASA IV, CG and APS decreased from 0.0637, 0.0261 and 4.1141 to 0.0063, 0.0011 and 1.0133 respectively, which shows that the method disclosed in this embodiment can greatly improve the model's prediction ability for unknown new samples.
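The text reports $R_t$, RMSEA and $RPD_t$ without defining them. The sketch below uses definitions that are common in chemometrics (correlation coefficient, root-mean-square error, and the ratio of the reference standard deviation to that error); whether the patent uses exactly these definitions is an assumption, and the function name and numbers are illustrative only.

```python
import numpy as np

def prediction_metrics(y_ref, y_pred):
    """Return (r, rmse, rpd) for reference and predicted values."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_ref - y_pred) ** 2)))  # error of prediction
    r = float(np.corrcoef(y_ref, y_pred)[0, 1])            # correlation coefficient
    rpd = float(np.std(y_ref, ddof=1) / rmse)              # SD of reference / error
    return r, rmse, rpd

# Illustrative values only, not data from the patent.
y_ref = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
r, rmse, rpd = prediction_metrics(y_ref, y_pred)
```

Under these definitions a lower error directly raises the RPD for a fixed set of reference values, which is consistent with the two metrics moving in opposite directions in the reported results.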
- The method disclosed in this embodiment is compared with commonly used methods such as the RS method, the SPXY method and the KS method. Because of the randomness of the RS method, ten repeated samplings are performed and the average of the ten results is compared with the other methods. For the other three methods, the same number of samples (3-6) as for the method disclosed in this embodiment is selected and added to the original calibration set, and the representativeness of the selected samples is evaluated through the performance of the updated model, in order to compare the modeling performance and predictive capabilities of the different methods. The relevant results are shown in Table 3.
- The results of the four methods after model update are compared for the case in which the minimum number of samples (3 samples) is selected.
- The results are shown in Table 4. Table 4 shows that, when the number of selected samples is smallest, the CCD method (the method disclosed in this embodiment) has obvious advantages over the other three methods.
- The $RPD_t$ values of the model updated with the method disclosed in this embodiment are all greater than 3.5, indicating that the method greatly improves the applicability of the updated model.
- Figure 6 shows the distribution of the samples selected by the three methods CCD, SPXY and KS in the space of the first and second PCs.
- The enlarged panels show S7 (a-c), S8 (d-f) and S9 (g-i), where (a), (d) and (g) represent ASA IV; (b), (e) and (h) represent CG; and (c), (f) and (i) represent APS. The figure shows that the samples selected by the method disclosed in this embodiment are generally closer to the center of each category and may therefore better represent the samples of the corresponding category, which leads to better results.
- The commercially available Astragalus membranaceus extract was again used as an example to verify the method disclosed in this embodiment.
- A total of 82 RAE samples were measured, covering 9 batches collected from 5 manufacturers; the specific information is shown in Table 1.
- The near-infrared spectra of the original samples and the new samples were measured with a Micro-NIR 1700 micro near-infrared spectrometer (VIAVI, USA). The measured spectra are shown in Figure 7, where the solid lines are the original samples and the dotted lines are the new samples.
- Astragaloside IV (ASA IV), calycosin glucoside (CG) and astragalus polysaccharide (APS) are used as reference ingredient indicators.
- The original sample set is divided into a calibration set $X_{cal}$ and a validation set $X_{val}$ using the commonly used KS method; these are used to develop and validate the original model, respectively.
- The calibration set contains 36 samples and the validation set contains 17 samples.
- The near-infrared spectra are preprocessed using SNV combined with first-order derivatives. Taking APS as an example, the principal component analysis (PCA) score plot of all sample spectra after preprocessing is shown in Figure 8.
- The new samples do not fall within the spectral space of the original samples but form separate clusters, and there is basically no overlap between the original sample set and the new sample set, indicating that there may be systematic differences between the new samples and the original samples.
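A score plot of this kind rests on projecting the preprocessed spectra onto the first two principal components. The following is a minimal sketch of that projection using a singular value decomposition; the function name `pca_scores` and the synthetic data, in which the "new" samples are simply shifted relative to the "original" ones, are assumptions for illustration.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project mean-centered data onto its first principal components."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # sample coordinates per PC
    explained = S ** 2 / np.sum(S ** 2)              # fraction of variance per PC
    return scores, explained[:n_components]

# Toy data: 20 "original" samples and 10 "new" samples with a systematic offset.
rng = np.random.default_rng(0)
original = rng.normal(size=(20, 6))
new = rng.normal(size=(10, 6)) + 8.0
scores, explained = pca_scores(np.vstack([original, new]))

pc1_original = scores[:20, 0]
pc1_new = scores[20:, 0]
```

In the toy data the two groups end up far apart along the first principal component, which is the kind of separation the text interprets as a systematic difference between new and original samples.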
- Figures 9 and 10 show the clustering results for the preprocessed new samples obtained with the Ward and Average methods.
- The three cut lines in each dendrogram (two broken-line styles and one solid line) indicate that the new samples are divided into 4, 5 and 6 categories respectively; in each case the new samples are divided into different categories $X_{new,i}$.
- The classification results for categories 1-5 are the same for both methods, so both methods assign the same samples to these categories.
- Since HCA divides the new samples into different categories according to the chosen number of categories, representative samples are selected as follows: the sample center $x_{center,i}$ of each category is first calculated, then the Euclidean distance from each sample to its category center is calculated and sorted, and the sample closest to each category center is selected as a candidate representative sample $X_{sel}$ and added to the calibration set $X_{cal}$ divided from the original sample set to form a new calibration set.
- The traditional Chinese medicine ingredient analysis model is then updated. Specifically, 3 (approximately 10% of the 29 new samples) to 6 samples (approximately 20%) are selected for the model update. Table 5 shows the optimal content prediction results for the three active ingredients in the remaining unselected samples $X_{test}$ of the new sample set after the model update.
- The $R_t$ and $RPD_t$ values are higher than those obtained by direct prediction with the original model, and the RMSEA value is significantly reduced, indicating that the updated model improves the content prediction for new samples.
- The RMSEA values of the three components ASA IV, CG and APS decreased from 0.0507, 0.0268 and 3.6572 to 0.0085, 0.0029 and 1.2583 respectively; the $R_t$ values increased from 0.9428, 0.5250 and 0.8827 to 0.9931, 0.9876 and 0.9723; and the $RPD_t$ values increased from 0.47, 0.12 and 1.01 to 4.66, 4.39 and 3.13 respectively. This shows that updating the model with the method disclosed in this embodiment can greatly improve the performance of the model and its prediction of unknown new samples.
- The method disclosed in this embodiment is compared with classic methods such as the RS method, the SPXY method and the KS method.
- For the RS method, ten repeated samplings are performed and the average of the ten results is compared with the other methods.
- For each method, the same number of samples as for the method disclosed in this embodiment is selected and added to the calibration set divided from the original sample set to form a new calibration set. The representativeness of the selected samples is evaluated through the performance of the model updated with the new calibration set, in order to compare the modeling performance and prediction capabilities of the different methods. The relevant results are shown in Table 6.
- The enlarged panels show S7 (a-c), S8 (d-f) and S9 (g-i), where (a), (d) and (g) represent ASA IV; (b), (e) and (h) represent CG; and (c), (f) and (i) represent APS. The figure shows that the samples selected by the method of this embodiment are generally closer to the center of each category and may therefore better represent the samples of the corresponding category, which leads to better results.
- The new samples do have certain systematic differences from the original samples, so their spectra fall into different categories and the original traditional Chinese medicine component analysis model cannot be applied to them directly.
- The two verification examples above use different instruments to acquire the sample spectra, and the method is verified in both.
- This embodiment updates the traditional Chinese medicine ingredient analysis model using the original calibration set combined with a small number of selected new samples. The sample closest to each category center is selected as a candidate sample to update the calibration set divided from the original sample set, which makes the selected samples representative, and the prediction results of the updated model are all good.
- Compared with the RS, SPXY and KS methods, the method disclosed in this embodiment has certain advantages.
- The sample selection and model updating method based on the spectral clustering center disclosed in this embodiment can be extended to various fields and has more practical significance.
- A traditional Chinese medicine component analysis system based on spectral clustering, including:
- a data acquisition module, used to acquire the near-infrared spectrum of the traditional Chinese medicine;
- a result acquisition module, used to obtain the analysis results of traditional Chinese medicine ingredients based on the near-infrared spectrum of the traditional Chinese medicine and the trained traditional Chinese medicine ingredient analysis model.
- The trained traditional Chinese medicine component analysis model is obtained by the process described in Embodiment 1: candidate samples closest to the category centers of the clustered new sample set are added to the calibration set divided from the original sample set to form a new calibration set, the remaining new samples are used as the test set, and the model is trained with the new calibration set and the test set.
- an electronic device including a memory, a processor, and computer instructions stored in the memory and executed on the processor.
- the computer instructions are executed by the processor, a method disclosed in Embodiment 1 is completed.
- a computer-readable storage medium for storing computer instructions.
- the steps of the traditional Chinese medicine component analysis method based on spectral clustering disclosed in Embodiment 1 are completed. the steps described.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Crystallography & Structural Chemistry (AREA)
- Pathology (AREA)
- Evolutionary Biology (AREA)
- Immunology (AREA)
- Biochemistry (AREA)
- Analytical Chemistry (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Databases & Information Systems (AREA)
- Investigating Or Analysing Materials By Optical Means (AREA)
Abstract
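The present disclosure provides a traditional Chinese medicine ingredient analysis method and system based on spectral clustering. The trained traditional Chinese medicine ingredient analysis model is obtained as follows: near-infrared spectrum samples of traditional Chinese medicine ingredients are acquired; the samples are divided into an original sample set and a new sample set; the original sample set is divided into a calibration set and a validation set, which are used to construct a traditional Chinese medicine ingredient analysis model; cluster analysis is performed on the new sample set to obtain different sample categories; the sample closest to the center of each category is selected as a candidate sample; the candidate samples are added to the calibration set divided from the original sample set to form a new calibration set, and the remaining samples of the new sample set are used as the test set; the model is then trained with the new calibration set and the test set to obtain the trained traditional Chinese medicine ingredient analysis model. Using this trained model for traditional Chinese medicine ingredient analysis improves the prediction accuracy of the model.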
Description
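The present invention claims priority to Chinese patent application No. 202210461016.7, entitled "A traditional Chinese medicine ingredient analysis method and system based on spectral clustering", filed with the China Patent Office on April 28, 2022, the entire contents of which are incorporated herein by reference.
The present invention relates to the technical field of near-infrared spectroscopy analysis, and in particular to a traditional Chinese medicine ingredient analysis method and system based on spectral clustering.
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Near-infrared (NIR) light is electromagnetic radiation with wavelengths from 780 nm to 2526 nm. NIR spectra mainly reflect the absorption of the overtones and combination bands of C-H, O-H and N-H vibrations. NIR spectroscopy is fast, low-cost, simple to operate, non-destructive and reproducible, and it is consistent with the principles of green analytical chemistry. As a rapid analysis technique, it has been widely used in pharmaceutical science, food science, petrochemistry and other fields, and has shown great potential for the qualitative identification, quantitative analysis and real-time online analysis of traditional Chinese medicine, food and similar products.
Establishing an effective NIR quantitative model is the key to applying NIR spectroscopy to the quality monitoring of traditional Chinese medicine, food and similar products. A variety of modeling methods have been introduced for this purpose, but whichever method is used, the calibration samples of the model must cover the characteristic information of the samples to be predicted. In practice, newly measured samples often fail to meet this requirement: because of differences in origin, growth year, climatic conditions, extraction method and so on, the spectral data and quality attributes of new samples may differ, sometimes greatly, which reduces the accuracy of the original model.
Two approaches are commonly used to address the loss of model accuracy caused by systematic differences between newly measured samples and the original samples. The first is to rebuild the model using only the new samples, without the original calibration samples; this discards the information in the original model and wastes the time and effort invested in it. The second is model updating, in which the original calibration samples are combined with a small number of selected new samples to update the original model and improve its accuracy. Because only a few new samples need to be selected, model updating costs less time and money than rebuilding the model and is better suited to practical use. Selecting representative samples from a large number of new samples is the key problem in model updating. Existing selection methods, however, do not take the spectral information of the new samples into account, so it is hard to tell whether the selected samples are representative, and the accuracy of the updated model remains low.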
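Summary of the Invention
To solve the above problems, the present disclosure proposes a traditional Chinese medicine ingredient analysis method and system based on spectral clustering. The samples closest to the center of each spectral category are selected as candidate samples and added to the calibration set divided from the original sample set, which updates the original calibration set. The traditional Chinese medicine ingredient analysis model is then updated by training, so that the trained model is more accurate and has better predictive performance.
To achieve the above objective, the present disclosure adopts the following technical solutions:
In a first aspect, a traditional Chinese medicine ingredient analysis method based on spectral clustering is disclosed, including:
acquiring the near-infrared spectrum of a traditional Chinese medicine;
obtaining the traditional Chinese medicine ingredient analysis result according to the near-infrared spectrum of the traditional Chinese medicine and a trained traditional Chinese medicine ingredient analysis model;
wherein the trained traditional Chinese medicine ingredient analysis model is obtained as follows: near-infrared spectrum samples of traditional Chinese medicine ingredients are acquired; the samples are divided into an original sample set and a new sample set; the original sample set is divided into a calibration set and a validation set, which are used to construct a traditional Chinese medicine ingredient analysis model; cluster analysis is performed on the new sample set to obtain different sample categories; the sample closest to the center of each category is selected as a candidate sample; the candidate samples are added to the calibration set divided from the original sample set to form a new calibration set, and the remaining samples of the new sample set are used as the test set; the model is then trained with the new calibration set and the test set to obtain the trained traditional Chinese medicine ingredient analysis model.
In a second aspect, a traditional Chinese medicine ingredient analysis system based on spectral clustering is provided, including:
a data acquisition module, used to acquire the near-infrared spectrum of a traditional Chinese medicine;
a result acquisition module, used to obtain the traditional Chinese medicine ingredient analysis result according to the near-infrared spectrum of the traditional Chinese medicine and a trained traditional Chinese medicine ingredient analysis model;
wherein the trained traditional Chinese medicine ingredient analysis model is obtained as follows: near-infrared spectrum samples of traditional Chinese medicine ingredients are acquired; the samples are divided into an original sample set and a new sample set; the original sample set is divided into a calibration set and a validation set, which are used to construct a traditional Chinese medicine ingredient analysis model; cluster analysis is performed on the new sample set to obtain different sample categories; the sample closest to the center of each category is selected as a candidate sample; the candidate samples are added to the calibration set divided from the original sample set to form a new calibration set, and the remaining samples of the new sample set are used as the test set; the model is then trained with the new calibration set and the test set to obtain the trained traditional Chinese medicine ingredient analysis model.
In a third aspect, an electronic device is provided, including a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of the traditional Chinese medicine ingredient analysis method based on spectral clustering are completed.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions; when the computer instructions are executed by a processor, the steps of the traditional Chinese medicine ingredient analysis method based on spectral clustering are completed.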
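Compared with the prior art, the beneficial effects of the present disclosure are:
1. To obtain the trained traditional Chinese medicine ingredient analysis model, the present disclosure first trains an original model on the original sample set to obtain a traditional Chinese medicine ingredient analysis model. It then selects, from the new sample set, the samples closest to the center of each spectral category as candidate samples, adds them to the calibration set divided from the original sample set to form a new calibration set, and uses the new calibration set to update the model by training. The resulting trained traditional Chinese medicine ingredient analysis model has better predictive performance and greater practical value.
Advantages of additional aspects of the present invention will be set forth in part in the description below, and in part will become apparent from that description or be learned through practice of the invention.
The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it.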
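Figure 1 is a flow chart of the method disclosed in Embodiment 1;
Figure 2 shows the near-infrared spectra of all samples in the first example of Embodiment 1;
Figure 3 shows the distribution of all samples in the first example of Embodiment 1 in the space of the first and second principal components (PCs);
Figure 4 is a dendrogram of the clustering result for all samples in the first example of Embodiment 1 using the Ward method;
Figure 5 is a dendrogram of the clustering result for all samples in the first example of Embodiment 1 using the Average method;
Figure 6 shows the distribution, in the first and second PC space, of the samples selected by different methods in Embodiment 1;
Figure 7 shows the near-infrared spectra of all samples in the second example of Embodiment 1;
Figure 8 shows the distribution of all samples in the second example of Embodiment 1 in the first and second PC space;
Figure 9 is a dendrogram of the clustering result for all samples in the second example of Embodiment 1 using the Ward method;
Figure 10 is a dendrogram of the clustering result for all samples in the second example of Embodiment 1 using the Average method;
Figure 11 shows the distribution, in the first and second PC space, of the samples selected by different methods in Embodiment 1.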
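The present disclosure is further described below with reference to the drawings and embodiments.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the application belongs.
It should also be noted that the terms used here are only for describing specific embodiments and are not intended to limit the exemplary embodiments according to the application. As used herein, unless the context clearly indicates otherwise, singular forms are intended to include plural forms as well. It should further be understood that when the terms "comprise" and/or "include" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.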
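Embodiment 1
This embodiment discloses a traditional Chinese medicine ingredient analysis method based on spectral clustering, including:
acquiring the near-infrared spectrum of a traditional Chinese medicine;
obtaining the traditional Chinese medicine ingredient analysis result according to the near-infrared spectrum of the traditional Chinese medicine and a trained traditional Chinese medicine ingredient analysis model;
wherein the trained traditional Chinese medicine ingredient analysis model is obtained as follows: near-infrared spectrum samples of traditional Chinese medicine ingredients are acquired; the samples are divided into an original sample set and a new sample set; the original sample set is divided into a calibration set and a validation set, which are used to construct a traditional Chinese medicine ingredient analysis model; cluster analysis is performed on the new sample set to obtain different sample categories; the sample closest to the center of each category is selected as a candidate sample; the candidate samples are added to the calibration set divided from the original sample set to form a new calibration set, and the remaining samples of the new sample set are used as the test set; the model is then trained with the new calibration set and the test set to obtain the trained traditional Chinese medicine ingredient analysis model.
Further, the sample closest to the center of each category is selected as the candidate sample as follows:
calculate the sample center of each category;
calculate the Euclidean distance from each sample to the sample center of its own category;
sort the calculated Euclidean distances;
select the sample with the smallest Euclidean distance in each category as the candidate sample.
Further, the Ward method or the Average method is used to perform cluster analysis on the new sample set.
Further, the traditional Chinese medicine ingredient analysis model is constructed from the original sample set as follows:
establish an original traditional Chinese medicine ingredient analysis model;
divide the original sample set into a calibration set and a validation set, and train the original traditional Chinese medicine ingredient analysis model to obtain the traditional Chinese medicine ingredient analysis model.
Further, the traditional Chinese medicine ingredient analysis model is a PLS model, a neural network model or a support vector machine model.
Further, the near-infrared spectrum samples of traditional Chinese medicine ingredients are preprocessed, and the original sample set and the new sample set are constructed from the preprocessed samples.
Further, the samples in the original sample set and the new sample set do not overlap.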
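The traditional Chinese medicine ingredient analysis method based on spectral clustering disclosed in this embodiment is described in detail below.
As shown in Figure 1, the traditional Chinese medicine ingredient analysis method based on spectral clustering includes:
S1: Acquire the near-infrared spectrum of the traditional Chinese medicine.
In a specific implementation, a spectrometer is used to acquire the near-infrared spectrum of the traditional Chinese medicine.
S2: Obtain the traditional Chinese medicine ingredient analysis result according to the near-infrared spectrum of the traditional Chinese medicine and the trained traditional Chinese medicine ingredient analysis model.
The traditional Chinese medicine ingredient analysis model may be a PLS model, a neural network model, a support vector machine model, or the like.
An original traditional Chinese medicine ingredient analysis model is established and trained to obtain the trained traditional Chinese medicine ingredient analysis model. The specific process is:
S21: Acquire near-infrared spectrum samples of traditional Chinese medicine ingredients; these samples are used for subsequent model training.
The near-infrared spectrum samples of traditional Chinese medicine ingredients include the ingredient indicators determined by industry-standard test methods and the reference values of those indicators.
S22: Divide the near-infrared spectrum samples of traditional Chinese medicine ingredients into an original sample set X and a new sample set Xnew.
In a specific implementation, the samples can be divided directly into the original sample set and the new sample set; alternatively, the near-infrared spectra can be preprocessed first and the preprocessed spectra then divided to obtain the original sample set and the new sample set.
The preprocessing applied to the near-infrared spectra of traditional Chinese medicine ingredients includes any one or a combination of: smoothing, first-derivative calculation, second-derivative calculation, normalization, baseline drift correction, standard normal variate (SNV) transformation, multiplicative scatter correction, and the like.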
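As an illustration of two of the preprocessing options listed above, the following minimal sketch applies an SNV transformation followed by a first derivative to a single spectrum. The function names `snv` and `first_derivative` and the toy absorbance values are assumptions made for illustration only. The source does not specify how the derivative is computed, so a simple first-order difference is used here; a smoothing derivative filter would be a common alternative.

```python
import math

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and std."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    # Sample standard deviation (n - 1 in the denominator).
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]

def first_derivative(spectrum):
    """First-order difference between neighbouring points (assumes unit spacing)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

# Hypothetical absorbance values for one sample.
raw = [0.10, 0.12, 0.18, 0.30, 0.26]
processed = first_derivative(snv(raw))
print(processed)
```

Each spectrum is processed independently, so the same two calls can be applied row by row to the original and new sample sets.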
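S23: Train the established original traditional Chinese medicine ingredient analysis model with the original sample set to obtain the traditional Chinese medicine ingredient analysis model.
In a specific implementation, the original sample set X is divided into a calibration set Xcal and a validation set Xval, and the established original traditional Chinese medicine ingredient analysis model is trained to obtain the traditional Chinese medicine ingredient analysis model.
The number of samples in the calibration set Xcal is greater than or equal to that in the validation set Xval, and the ratio of calibration samples to validation samples is set to 2:1 or higher.
The original sample set X can be divided into the calibration set Xcal and the validation set Xval by any one of the KS method, the Rank-KS method, the SPXY method, the Rank-SPXY method and the content gradient method.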
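The KS (Kennard-Stone) option named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this embodiment: the function name `kennard_stone`, the use of plain Euclidean distance between spectra, and the toy one-dimensional data are all hypothetical.

```python
import math

def kennard_stone(samples, n_select):
    """Return (selected, remaining) sample indices chosen by the Kennard-Stone rule.

    Assumes n_select >= 2 and at least two samples.
    """
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the two samples that are farthest apart.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda pair: dist[pair[0]][pair[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Add the sample whose nearest already-selected neighbour is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Six hypothetical samples split 4:2, i.e. a 2:1 calibration-to-validation ratio.
spectra = [[0.0], [1.0], [2.0], [10.0], [5.0], [9.0]]
cal_idx, val_idx = kennard_stone(spectra, 4)
print(cal_idx, val_idx)
```

The selected indices would form Xcal and the remaining indices Xval; the other splitting methods listed in the text differ mainly in how the distance is defined.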
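S24: Further train the traditional Chinese medicine ingredient analysis model with the new sample set Xnew to obtain the trained traditional Chinese medicine ingredient analysis model, specifically:
S241: Perform hierarchical cluster analysis (HCA) on the sample spectra in the new sample set and, according to the selected clustering result and category information, divide the new samples into different subsets Xnew,i, where "i" denotes the category.
In a specific implementation, the samples are clustered using any one of the Ward method, the Average method, and the like.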
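To make the clustering step S241 concrete, the sketch below implements a naive agglomerative clustering with average linkage and cuts it at a chosen number of categories. It is an illustration only: the function name `average_linkage` and the toy two-dimensional data are assumptions, the Ward criterion is not shown, and a practical implementation would normally rely on a library routine rather than this cubic-time loop.

```python
import math

def average_linkage(samples, n_clusters):
    """Naive agglomerative clustering (average linkage); returns one label per sample."""
    clusters = [[i] for i in range(len(samples))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between the members of two clusters.
        total = sum(math.dist(samples[i], samples[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge the second into the first.
        p, q = min(((p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        merged = clusters.pop(q)  # q > p, so index p is unaffected by the pop
        clusters[p].extend(merged)

    labels = [0] * len(samples)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Five hypothetical preprocessed spectra reduced to two variables each.
new_samples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
labels = average_linkage(new_samples, 3)
print(labels)
```

The returned labels play the role of the category index "i" that partitions Xnew into the subsets Xnew,i.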
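S242: Calculate the sample center xcenter,i of each category, which can be obtained from the following formula:
$x_{\mathrm{center},i} = \frac{1}{N} \sum_{x_{\mathrm{new},j} \in X_{\mathrm{new},i}} x_{\mathrm{new},j}$ (1)
where xcenter,i is the sample center of category i and "N" is the number of samples in that category.
S243: Calculate the Euclidean distance dx(j) from each sample xnew,j to the center of its own category: $d_x(j) = \lVert x_{\mathrm{new},j} - x_{\mathrm{center},i} \rVert_2$
S244: Sort the calculated Euclidean distances, and select the sample with the smallest Euclidean distance in each category as the candidate sample Xsel.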
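Steps S242 to S244 can be sketched as follows. This is a minimal illustration assuming that each spectrum is a list of floats and that category labels are already available from the clustering step; the function name `select_candidates` and the toy data are hypothetical.

```python
import math

def select_candidates(samples, labels):
    """For each category, return the index of the sample nearest the category center."""
    selected = {}
    dim = len(samples[0])
    for label in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == label]
        n = len(members)
        # S242: the category center is the mean spectrum of the category.
        center = [sum(samples[i][d] for i in members) / n for d in range(dim)]
        # S243 and S244: rank members by Euclidean distance to the center, keep the closest.
        ranked = sorted(members, key=lambda i: math.dist(samples[i], center))
        selected[label] = ranked[0]
    return selected

# Hypothetical new samples (one variable each) and their category labels.
new_samples = [[0.0], [1.0], [2.1], [10.0], [12.0], [11.5]]
labels = [0, 0, 0, 1, 1, 1]
print(select_candidates(new_samples, labels))
```

One candidate is returned per category, so choosing between 3 and 6 categories in the clustering step yields between 3 and 6 candidate samples.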
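S245: Add all candidate samples to the calibration set divided from the original sample set to form a new calibration set, and use the remaining samples of the new sample set as the test set Xtest. Train the traditional Chinese medicine ingredient analysis model with the new calibration set and the test set to obtain the trained traditional Chinese medicine ingredient analysis model.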
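The bookkeeping in S245 can be sketched as below. The set sizes follow the worked example later in this embodiment (36 calibration samples, 29 new samples, 3 candidates), but the function name `update_sets`, the integer sample identifiers and the particular candidate identifiers are assumptions made for illustration.

```python
def update_sets(cal_ids, new_ids, candidate_ids):
    """Add candidates to the calibration set; the other new samples form the test set."""
    chosen = set(candidate_ids)
    new_cal = list(cal_ids) + [i for i in new_ids if i in chosen]
    test = [i for i in new_ids if i not in chosen]
    return new_cal, test

cal_ids = list(range(36))        # 36 calibration samples from the original set
new_ids = list(range(53, 82))    # 29 new samples
candidate_ids = [55, 63, 78]     # hypothetical candidates, one per category
new_cal, test = update_sets(cal_ids, new_ids, candidate_ids)
print(len(new_cal), len(test))
```

With three candidates, the new calibration set holds 39 samples and the test set holds the other 26 new samples.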
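The method disclosed in this embodiment selects the samples closest to the center of each spectral category as candidate samples, adds them to the calibration set divided from the original sample set, and then updates the traditional Chinese medicine ingredient analysis model, so that the trained model predicts unknown new samples better and has greater practical value.
The method disclosed in this embodiment was verified using commercially available Astragalus extract (RAE) as an example.
A total of 82 RAE samples were measured, covering 9 batches collected from 5 manufacturers; details are given in Table 1. The 53 samples of S1 to S6 formed the original sample set X, used to build the traditional Chinese medicine ingredient analysis model, and the remaining 29 samples (S7 to S9) were used as the new sample set Xnew. The near-infrared spectra of the original and new samples were measured with an Antaris Ⅱ AA-NIR spectrometer (Thermo Fisher Scientific, USA) and are shown in Figure 2, where solid lines are original samples and dashed lines are new samples. Astragaloside IV (ASA IV), calycosin glucoside (CG) and Astragalus polysaccharides (APS) served as the reference ingredient indicators Y. The ASA IV and CG contents of the extracts were determined by HPLC on a high-performance liquid chromatograph (1260, Agilent Technologies, USA), and the APS content was determined by the phenol-sulfuric acid method. The results are shown in Table 1.
Table 1. RAE sample information
Note: a concentration ratio of 10:1 means that 10 parts by weight of raw material are concentrated to 1 part by weight, and so on. Manufacturers A, B, C, D and E are all located in Shaanxi Province.
The commonly used KS method was used to divide the original sample set into a calibration set Xcal and a validation set Xval, used respectively to develop and validate the traditional Chinese medicine ingredient analysis model; the calibration set contained 36 samples and the validation (prediction) set 17. The near-infrared spectra were preprocessed with SNV combined with the first derivative.
Taking APS as an example, the principal component analysis (PCA) score plot of all preprocessed sample spectra is shown in Figure 3. The new samples fall outside the spectral space of the original samples and form separate clusters, with essentially no overlap between the original and new sample sets, which suggests that the new samples may differ systematically from the original samples. Using the traditional Chinese medicine ingredient analysis model to predict the new samples directly would therefore give poor results.
For this reason, cluster analysis was first performed on the new RAE sample set. Figures 4 and 5 show the clustering results for the preprocessed new samples obtained with the Ward and Average methods. The dash-dotted, dashed and solid lines indicate division of the new samples into 4, 5 and 6 categories respectively, giving the different categories Xnew,i. As Figures 4 and 5 show, although the dendrograms produced by the two clustering methods differ, the sample assignments for categories 1 to 5 are the same, so categories 1 to 5 should contain the same samples.
Based on the clustering results, the sample center xcenter,i of each category was calculated, and the Euclidean distance dx(j) from each sample to its category center was then calculated and sorted. The sample closest to each category center was selected as a candidate sample Xsel and added to the calibration set Xcal to form a new calibration set for updating the traditional Chinese medicine ingredient analysis model, and the remaining samples served as the test set Xtest for validating the updated model. When a model is updated to predict new samples, as few new samples as possible should be selected; we selected 3 (about 10% of the 29 new samples) to 6 (about 20%) samples for model updating. The results are shown in Table 2. After model updating with the method disclosed in this embodiment (the CCD method), the Rt and RPDt values are higher than those obtained by direct prediction with the traditional Chinese medicine ingredient analysis model, and the RMSEA values are greatly reduced, indicating that the trained model predicts the content of new samples much better. Specifically, the RMSEA values for ASA IV, CG and APS decreased from 0.0637, 0.0261 and 4.1141 to 0.0063, 0.0011 and 1.0133 respectively, which shows that the disclosed method greatly improves the model's ability to predict unknown new samples.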
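The source does not define the evaluation metrics it reports. As a hedged sketch, the code below assumes the definitions commonly used in chemometrics: root mean square error of prediction, the Pearson correlation coefficient between reference and predicted values, and RPD as the standard deviation of the reference values divided by the RMSE. The function names and the toy data are hypothetical, and the definitions actually used for RMSEA, Rt and RPDt in this embodiment may differ.

```python
import math

def rmse(y_ref, y_pred):
    """Root mean square error between reference and predicted contents."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(y_ref, y_pred)) / len(y_ref))

def correlation(y_ref, y_pred):
    """Pearson correlation coefficient between reference and predicted contents."""
    n = len(y_ref)
    mr, mp = sum(y_ref) / n, sum(y_pred) / n
    cov = sum((r - mr) * (p - mp) for r, p in zip(y_ref, y_pred))
    var_r = sum((r - mr) ** 2 for r in y_ref)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_r * var_p)

def rpd(y_ref, y_pred):
    """Sample standard deviation of the reference values divided by the RMSE."""
    n = len(y_ref)
    mr = sum(y_ref) / n
    sd = math.sqrt(sum((r - mr) ** 2 for r in y_ref) / (n - 1))
    return sd / rmse(y_ref, y_pred)

# Hypothetical reference and predicted contents for four test samples.
y_ref = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
print(rmse(y_ref, y_pred), correlation(y_ref, y_pred), rpd(y_ref, y_pred))
```

Under these assumed definitions, a lower RMSE and a higher RPD on the test set both indicate that the updated model predicts the new samples better.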
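Table 2. Comparison of predicted content for the test set samples before and after model updating
Note: "-" indicates that the number of selected samples is 0.
To evaluate the performance of the method disclosed in this embodiment, it was compared with commonly used methods, namely the RS, SPXY and KS methods. Because the RS method is random, sampling was repeated ten times and the average of the ten results was compared with the other methods. Each of the three other methods selected the same number of samples as the disclosed method (3 to 6) to add to the original calibration set, and the representativeness of the selected samples was assessed from the performance of the updated model, allowing the modeling performance and predictive ability of the different methods to be compared. The results are shown in Table 3.
Table 3. Best results of model updating with different methods
As Table 3 shows, all four sample selection methods enable the updated model to predict the ingredient contents of the new samples better than the original model, indicating that the model updating strategy is feasible and that the updated model can be applied effectively to new samples. Among the four methods, model updating with the method disclosed in this embodiment achieved the best prediction results compared with the RS, SPXY and KS methods, with lower RMSEA values and higher RPDt values.
To further demonstrate the practicality of the method disclosed in this embodiment, the four methods were compared when the minimum number of samples (3 samples) was selected for model updating; the results are shown in Table 4. As Table 4 shows, when the number of selected samples is smallest, the CCD method has a clear advantage over the other three methods. In addition, the RPDt values after model updating with the disclosed method are all greater than 3.5, indicating that the method greatly improves the applicability of the updated model. Figure 6 shows the distribution, in the first and second PC space, of the samples selected by the CCD, SPXY and KS methods; the insets are enlarged views of S7 (a-c), S8 (d-f) and S9 (g-i), where (a), (d) and (g) represent ASA IV; (b), (e) and (h) represent CG; and (c), (f) and (i) represent APS. As the figure shows, the samples selected by the disclosed method are generally closer to the center of each category and may therefore represent the samples of that category better, giving better results.
Table 4. Results of model updating with different methods when 3 samples are selected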
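The method disclosed in this embodiment was verified again using commercially available Astragalus extract (RAE). A total of 82 RAE samples were measured, covering 9 batches collected from 5 manufacturers; details are given in Table 1. The near-infrared spectra of the original and new samples were measured with a Micro-NIR 1700 miniature near-infrared spectrometer (VIAVI, USA) and are shown in Figure 7, where solid lines are original samples and dashed lines are new samples. Astragaloside IV (ASA IV), calycosin glucoside (CG) and Astragalus polysaccharides (APS) served as the reference ingredient indicators.
The commonly used KS method was used to divide the original sample set into a calibration set Xcal and a validation set Xval, used respectively to develop and validate the original model; the calibration set contained 36 samples and the validation (prediction) set 17. The near-infrared spectra were preprocessed with SNV combined with the first derivative. Taking APS as an example, the principal component analysis (PCA) score plot of all preprocessed sample spectra is shown in Figure 8. The new samples all fall outside the spectral space of the original samples and form separate clusters, with essentially no overlap between the original and new sample sets, which suggests that systematic differences may exist between the new and original samples.
First, cluster analysis was performed on the new RAE sample set. Figures 9 and 10 show the clustering results for the preprocessed new samples obtained with the Ward and Average methods. The dash-dotted, dashed and solid lines indicate division of the new samples into 4, 5 and 6 categories respectively, giving the different categories Xnew,i. As the figures show, the sample assignments for categories 1 to 5 are the same, so categories 1 to 5 should contain the same samples.
Because HCA divides the new samples into different categories depending on the number of categories chosen, representative samples of each category were selected as follows. The sample center xcenter,i of each category was calculated first; the Euclidean distance from each sample to its category center was then calculated and sorted; and the one sample closest to each category center was selected as a candidate representative sample Xsel and added to the calibration set Xcal divided from the original sample set to form a new calibration set, which was used to update the traditional Chinese medicine ingredient analysis model. Specifically, 3 (about 10% of the 29 new samples) to 6 (about 20%) samples were selected for model updating. Table 5 shows the best content predictions for the three active ingredients in the remaining unselected new samples Xtest after model updating. After model updating with the method disclosed in this embodiment (the CCD method), the Rt and RPDt values are higher than those obtained by direct prediction with the original model, and the RMSEA values are greatly reduced, indicating that the model predicts the content of new samples much better. The RMSEA values for ASA IV, CG and APS decreased from 0.0507, 0.0268 and 3.6572 to 0.0085, 0.0029 and 1.2583; the Rt values increased from 0.9428, 0.5250 and 0.8827 to 0.9931, 0.9876 and 0.9723; and the RPDt values rose from 0.47, 0.12 and 1.01 to 4.66, 4.39 and 3.13 respectively. This shows that model updating with the disclosed method greatly improves the model's performance and its prediction of unknown new samples.
Table 5. Comparison of predicted content for the test set samples before and after model updating
The method disclosed in this embodiment was compared with classical methods, namely the RS, SPXY and KS methods. For the RS method, sampling was repeated ten times and the average of the ten results was compared with the other methods. The same number of samples as selected by the disclosed method was added to the calibration set divided from the original sample set to form a new calibration set, and the representativeness of the selected samples was assessed from the performance of the model updated with the new calibration set, allowing the modeling performance and predictive ability of the different methods to be compared. The results are shown in Table 6.
Table 6. Best results of model updating with different methods
As Table 6 shows, all four methods enable the updated model to predict the ingredient contents of the new samples better than the original model. Except for the CG ingredient, the predictions after model updating with the method disclosed in this embodiment were the best when compared with the RS, SPXY and KS methods, with lower RMSEA values and higher RPDt values; the results for CG were similar to those of the other methods.
In addition, the four methods were compared when the minimum number of samples (3 samples) was selected for model updating; the results are shown in Table 7. The results show that, with only the minimum number of samples selected, the disclosed method gives results similar to or better than those of the other three methods. Moreover, the RPDt values of the CCD-updated models are all greater than 2, indicating that the model updated with the disclosed method can be used to predict the content of unknown new samples. Figure 11 shows the distribution, in the first and second PC space, of the samples selected by the CCD, SPXY and KS methods; the insets are enlarged views of S7 (a-c), S8 (d-f) and S9 (g-i), where (a), (d) and (g) represent ASA IV; (b), (e) and (h) represent CG; and (c), (f) and (i) represent APS.
As the figure shows, the samples selected by the method of this embodiment are generally closer to the center of each category and may therefore represent the samples of that category better, giving better results.
Table 7. Results of model updating with different methods when 3 samples are selected
The two examples above show that the new samples do differ systematically from the original samples to some extent, so that the sample spectra fall into different categories and the traditional Chinese medicine ingredient analysis model is no longer applicable. The two validation examples used different instruments to acquire the samples, yet both confirm that this embodiment, which updates the traditional Chinese medicine ingredient analysis model using the original calibration set combined with a small number of selected new samples and selects the sample closest to each category center as a candidate sample to update the calibration set divided from the original sample set, yields representative samples and good predictions from the updated model. In addition, compared with the RS, SPXY and KS methods, the method disclosed in this embodiment has certain advantages. Furthermore, the sample selection and model updating method based on spectral clustering centers disclosed in this embodiment can be extended to other fields, which gives it further practical significance.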
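Embodiment 2
This embodiment discloses a traditional Chinese medicine ingredient analysis system based on spectral clustering, including:
a data acquisition module, used to acquire the near-infrared spectrum of a traditional Chinese medicine;
a result acquisition module, used to obtain the traditional Chinese medicine ingredient analysis result according to the near-infrared spectrum of the traditional Chinese medicine and a trained traditional Chinese medicine ingredient analysis model;
wherein the trained traditional Chinese medicine ingredient analysis model is obtained as follows: near-infrared spectrum samples of traditional Chinese medicine ingredients are acquired; the samples are divided into an original sample set and a new sample set; the original sample set is divided into a calibration set and a validation set, which are used to construct a traditional Chinese medicine ingredient analysis model; cluster analysis is performed on the new sample set to obtain different sample categories; the sample closest to the center of each category is selected as a candidate sample; the candidate samples are added to the calibration set divided from the original sample set to form a new calibration set, and the remaining samples of the new sample set are used as the test set; the model is then trained with the new calibration set and the test set to obtain the trained traditional Chinese medicine ingredient analysis model.
Embodiment 3
This embodiment discloses an electronic device, including a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of the traditional Chinese medicine ingredient analysis method based on spectral clustering disclosed in Embodiment 1 are completed.
Embodiment 4
This embodiment discloses a computer-readable storage medium for storing computer instructions; when the computer instructions are executed by a processor, the steps of the traditional Chinese medicine ingredient analysis method based on spectral clustering disclosed in Embodiment 1 are completed.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the invention has been described in detail with reference to the above embodiments, a person of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of protection of the claims of the invention.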
Claims (10)
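- A traditional Chinese medicine ingredient analysis method based on spectral clustering, characterized by including: acquiring the near-infrared spectrum of a traditional Chinese medicine; and obtaining the traditional Chinese medicine ingredient analysis result according to the near-infrared spectrum of the traditional Chinese medicine and a trained traditional Chinese medicine ingredient analysis model; wherein the trained traditional Chinese medicine ingredient analysis model is obtained as follows: near-infrared spectrum samples of traditional Chinese medicine ingredients are acquired; the samples are divided into an original sample set and a new sample set; the original sample set is divided into a calibration set and a validation set, which are used to construct a traditional Chinese medicine ingredient analysis model; cluster analysis is performed on the new sample set to obtain different sample categories; the sample closest to the center of each category is selected as a candidate sample; the candidate samples are added to the calibration set divided from the original sample set to form a new calibration set, and the remaining samples of the new sample set are used as the test set; the model is then trained with the new calibration set and the test set to obtain the trained traditional Chinese medicine ingredient analysis model.
- The traditional Chinese medicine ingredient analysis method based on spectral clustering according to claim 1, characterized in that the sample closest to the center of each category is selected as the candidate sample as follows: the sample center of each category is calculated; the Euclidean distance from each sample to the sample center of its own category is calculated; the calculated Euclidean distances are sorted; and the sample with the smallest Euclidean distance in each category is selected as the candidate sample.
- The traditional Chinese medicine ingredient analysis method based on spectral clustering according to claim 1, characterized in that the Ward method or the Average method is used to perform cluster analysis on the new sample set.
- The traditional Chinese medicine ingredient analysis method based on spectral clustering according to claim 1, characterized in that the traditional Chinese medicine ingredient analysis model is constructed from the original sample set as follows: an original traditional Chinese medicine ingredient analysis model is established; the original sample set is divided into a calibration set and a validation set, and the original traditional Chinese medicine ingredient analysis model is trained to obtain the traditional Chinese medicine ingredient analysis model.
- The traditional Chinese medicine ingredient analysis method based on spectral clustering according to claim 1, characterized in that the traditional Chinese medicine ingredient analysis model is a PLS model, a neural network model or a support vector machine model.
- The traditional Chinese medicine ingredient analysis method based on spectral clustering according to claim 1, characterized in that the near-infrared spectrum samples of traditional Chinese medicine ingredients are preprocessed, and the original sample set and the new sample set are constructed from the preprocessed samples.
- The traditional Chinese medicine ingredient analysis method based on spectral clustering according to claim 1, characterized in that the samples in the original sample set and the new sample set do not overlap.
- A traditional Chinese medicine ingredient analysis system based on spectral clustering, characterized by including: a data acquisition module, used to acquire the near-infrared spectrum of a traditional Chinese medicine; and a result acquisition module, used to obtain the traditional Chinese medicine ingredient analysis result according to the near-infrared spectrum of the traditional Chinese medicine and a trained traditional Chinese medicine ingredient analysis model; wherein the trained traditional Chinese medicine ingredient analysis model is obtained as follows: near-infrared spectrum samples of traditional Chinese medicine ingredients are acquired; the samples are divided into an original sample set and a new sample set; the original sample set is divided into a calibration set and a validation set, which are used to construct a traditional Chinese medicine ingredient analysis model; cluster analysis is performed on the new sample set to obtain different sample categories; the sample closest to the center of each category is selected as a candidate sample; the candidate samples are added to the calibration set divided from the original sample set to form a new calibration set, and the remaining samples of the new sample set are used as the test set; the model is then trained with the new calibration set and the test set to obtain the trained traditional Chinese medicine ingredient analysis model.
- An electronic device, characterized by including a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of the traditional Chinese medicine ingredient analysis method based on spectral clustering according to any one of claims 1-7 are completed.
- A computer-readable storage medium, characterized in that it is used for storing computer instructions; when the computer instructions are executed by a processor, the steps of the traditional Chinese medicine ingredient analysis method based on spectral clustering according to any one of claims 1-7 are completed.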
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210461016.7A CN114783539B (zh) | 2022-04-28 | 2022-04-28 | A traditional Chinese medicine ingredient analysis method and system based on spectral clustering |
CN202210461016.7 | 2022-04-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023207453A1 (zh) | 2023-11-02 |
Family
ID=82434752
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/083467 WO2023207453A1 (zh) | A traditional Chinese medicine ingredient analysis method and system based on spectral clustering | 2022-04-28 | 2023-03-23 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114783539B (zh) |
WO (1) | WO2023207453A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115017671A (zh) * | 2021-12-31 | 2022-09-06 | 昆明理工大学 | Industrial process soft-sensor modeling method and system based on online cluster analysis of data streams |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114783539B (zh) * | 2022-04-28 | 2024-09-27 | 山东大学 | A traditional Chinese medicine ingredient analysis method and system based on spectral clustering |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
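CN107563448A (zh) * | 2017-09-11 | 2018-01-09 | 广州讯动网络科技有限公司 | Sample space clustering division method based on near-infrared spectroscopy analysis |
CN109540836A (zh) * | 2018-11-30 | 2019-03-29 | 济南大学 | Near-infrared spectroscopy sugar content detection method and system based on BP artificial neural network |
CN110220866A (zh) * | 2019-06-05 | 2019-09-10 | 温州大学 | Rapid quality detection method for Epimedium medicinal material based on the CARS-SVM algorithm |
WO2019192433A1 (zh) * | 2018-04-03 | 2019-10-10 | 深圳市药品检验研究院(深圳市医疗器械检测中心) | Method for chemical pattern recognition of the authenticity of the traditional Chinese medicine Gleditsiae Spina based on near-infrared spectroscopy |
US20210404952A1 (en) * | 2019-10-17 | 2021-12-30 | Shandong University | Method for selection of calibration set and validation set based on spectral similarity and modeling |
CN114783539A (zh) * | 2022-04-28 | 2022-07-22 | 山东大学 | A traditional Chinese medicine ingredient analysis method and system based on spectral clustering |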
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
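CN101532954B (zh) * | 2008-03-13 | 2011-11-30 | 天津天士力现代中药资源有限公司 | Method for identifying Chinese medicinal materials using infrared spectroscopy combined with cluster analysis |
CN104849234A (zh) * | 2015-04-30 | 2015-08-19 | 江苏扬农化工集团有限公司 | Method for determining the main component content of imidacloprid technical material based on near-infrared spectroscopy analysis |
CN113376117A (zh) * | 2021-02-27 | 2021-09-10 | 南京海源中药饮片有限公司 | Near-infrared online quality detection method for Angelica sinensis |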
- 2022
  - 2022-04-28 CN CN202210461016.7A patent/CN114783539B/zh active Active
- 2023
  - 2023-03-23 WO PCT/CN2023/083467 patent/WO2023207453A1/zh unknown
Also Published As
Publication number | Publication date |
---|---|
CN114783539A (zh) | 2022-07-22 |
CN114783539B (zh) | 2024-09-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23794888; Country of ref document: EP; Kind code of ref document: A1 |