CN107918718B

CN107918718B - Sample component content determination method based on online sequential extreme learning machine

Info

Publication number: CN107918718B
Application number: CN201711068234.XA
Authority: CN
Inventors: 单鹏; 赵煜辉; 张贝; 淳宝生
Original assignee: Northeastern University Qinhuangdao Branch
Current assignee: Northeastern University Qinhuangdao Branch
Priority date: 2017-11-03
Filing date: 2017-11-03
Publication date: 2020-05-22
Anticipated expiration: 2037-11-03
Also published as: CN107918718A

Abstract

The invention discloses a sample component content determination method based on an online sequential extreme learning machine, which comprises the following steps: collecting spectral data samples of a sample and modeling them with the online sequential extreme learning machine algorithm; and determining the component content of the sample with the established model. Because modeling uses the online sequential extreme learning machine algorithm, only the learned knowledge is retained for later use and the data already used need not be stored. When new spectral data arrive, only the hidden-layer output of the newly arrived data needs to be calculated, and the output weights between the hidden layer and the output layer are then updated dynamically from the learned knowledge, so that modeling is fast. Compared with traditional modeling methods, the modeling speed is improved, unnecessary repeated computation and data storage are reduced, the precision and generalization performance of the model are improved, and the method can process data that arrive one sample at a time as well as data that arrive block by block.

Description

Sample component content determination method based on online sequential extreme learning machine

Technical Field

The invention relates to a sample component content determination method based on an online sequence extreme learning machine, and belongs to the technical field of sample component content determination.

Background

Near infrared (NIR) spectroscopy is a rapid, non-destructive and low-cost indirect analysis technique. The near infrared spectrum of a sample can be measured rapidly with an infrared spectrometer, and a multivariate calibration model between the near infrared spectrum and the content of the effective components is established with chemometric methods, so that the corresponding components of an unknown sample can be predicted. In actual use, however, near infrared spectral data are not generated all at once but arrive as a stream. If a model has been established on the existing data samples and new data samples are generated over time, the newly generated data and the previous data need to be modeled together in order to improve the generalization performance and prediction accuracy of the model. The simplest and most direct method is to rerun the original algorithm on all existing data, which is acceptable only when the data volume is small. If the existing data are measured in GB and the newly arrived data samples amount to several MB, modeling the original data and the new data together is time-consuming and laborious; sometimes the previously arrived data have not yet been processed when still newer data arrive, and complete re-modeling is clearly impossible in that case. Online streaming algorithms have also increasingly been applied to feedforward neural networks with radial basis function (RBF) nodes. Many online streaming learning algorithms have appeared, typical ones being the GAP-RBF algorithm and the GGAP-RBF algorithm. These algorithms aim to simplify the learning process and increase the learning speed, but they require information on the distribution of the input samples or on the order of the input samples. In addition, their modeling speed is still slow and their generalization performance is mediocre, and they can only process new data one sample at a time rather than block by block.

In addition, when near infrared spectra are measured, the original multivariate calibration model becomes invalid if the measuring instrument or the measuring conditions change. Re-establishing the model is time-consuming and laborious, and sometimes re-modeling is not feasible at all. A more acceptable approach is calibration transfer (also called calibration migration), which corrects the spectral data between the main instrument and another instrument (the slave instrument). In essence, the spectra of the slave instrument are transformed so that they look more like the data of the main spectrometer and can then be processed with the model of the main spectrometer. Over the past few years, various calibration migration techniques have been developed. Common methods include multiplicative scatter correction (MSC), direct standardization (DS), piecewise direct standardization (PDS), canonical correlation analysis (CCA), and the like. However, the existing calibration migration methods still suffer from poor accuracy and stability of the component content prediction. In addition, multiplicative scatter correction requires an ideal spectrum of the sample to be measured, which is then used to correct the other measured spectra, but such an ideal spectrum is difficult to obtain in practice.

Disclosure of Invention

The invention aims to provide a sample component content determination method based on an online sequential extreme learning machine that effectively solves the problems of the prior art, in particular that existing algorithms model slowly, generalize only moderately, and can process new data only one sample at a time rather than block by block.

In order to solve the technical problems, the invention adopts the following technical scheme: a sample component content determination method based on an online sequence extreme learning machine is characterized in that a spectral data sample of a sample is collected, and modeling is carried out by utilizing an online sequence extreme learning machine algorithm; and determining the component content of the sample by using the established model.

The method for measuring the content of the sample components based on the online sequential extreme learning machine specifically comprises the following steps:

S1, according to the initial main spectrum SP_master(0), the corresponding sample component content y_0 and the number L of hidden-layer nodes, calculating an initial weight matrix α^(0) from the hidden layer to the output layer, wherein SP_master(0) and y_0 comprise M_0 samples;

S2, when a new main spectrum SP_master(k+1) and the corresponding sample component content y_(k+1) arrive, calculating the weight matrix α^(k+1) from the hidden layer to the output layer according to the online sequential extreme learning machine algorithm, wherein the (k+1)-th arriving data SP_master(k+1) and y_(k+1) comprise M_(k+1) samples and k ≥ 0;

S3, if a further new main spectrum SP_master(k+1) and corresponding sample component content y_(k+1) arrive, setting k = k + 1 and going to S2; otherwise going to S4;

S4, calculating the predicted component content pre_y corresponding to the spectral data sp_masterPre of the sample according to the following formula:

pre_y = H_pre α^(new);

wherein H_pre is the hidden-layer output matrix computed from sp_masterPre with the activation function g(w, sp_masterPre, b); α^(new) is the latest hidden-to-output weight matrix; sp_masterPre comprises N samples; and w and b are respectively a randomly generated orthogonal input weight matrix and an offset.
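Steps S1 to S4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the claimed implementation: the class `OSELM` and its method names are hypothetical, the recursive update in `update` is the commonly published OS-ELM recursion, the input weights are plain Gaussian draws rather than an orthogonalized matrix, and the data are synthetic.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid activation, the choice named in the text."""
    return 1.0 / (1.0 + np.exp(-z))


class OSELM:
    """Illustrative OS-ELM regressor; class and method names are hypothetical."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Random input weights w and offsets b, fixed once generated.
        # Assumption: plain Gaussian draws, without orthogonalization.
        self.w = rng.standard_normal((n_inputs, n_hidden))
        self.b = rng.standard_normal(n_hidden)
        self.P = None       # running (H^T H)^-1, the retained "knowledge"
        self.alpha = None   # hidden-to-output weight matrix

    def hidden(self, X):
        # Hidden-layer output matrix H, one row per sample.
        return sigmoid(X @ self.w + self.b)

    def fit_initial(self, X0, y0):
        # S1: alpha(0) = (H0^T H0)^-1 H0^T y0, which needs M0 >= L.
        H0 = self.hidden(X0)
        self.P = np.linalg.inv(H0.T @ H0)
        self.alpha = self.P @ H0.T @ y0

    def update(self, X_new, y_new):
        # S2: recursive least-squares update that uses only the new block.
        H = self.hidden(X_new)
        K = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.alpha = self.alpha + self.P @ H.T @ (y_new - H @ self.alpha)

    def predict(self, X):
        # S4: pre_y = H_pre @ alpha(new)
        return self.hidden(X) @ self.alpha


# Synthetic stand-in for spectra and contents (not the data of the experiments).
rng = np.random.default_rng(42)
X = rng.standard_normal((60, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.01 * rng.standard_normal(60)

model = OSELM(n_inputs=3, n_hidden=5)
model.fit_initial(X[:20], y[:20])              # S1 with M0 = 20
for start in range(20, 60, 8):                 # S2/S3: blocks of 8 samples
    model.update(X[start:start + 8], y[start:start + 8])
pre_y = model.predict(X)                       # S4
```

Only `P` and `alpha` are carried from one block to the next, so the samples of earlier blocks do not need to be stored.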

Preferably, the method further comprises the following steps: modeling spectral data samples of the same samples collected on the main spectrometer and on the slave spectrometer with the online sequential extreme learning machine algorithm, so as to migrate the spectral data of the slave spectrometer to the spectral data space of the main spectrometer; and then determining the component content of the sample with the content prediction model established on the main spectrometer.

More preferably, modeling the spectral data samples collected on the main spectrometer and on the slave spectrometer with the online sequential extreme learning machine algorithm, so as to migrate the spectral data of the slave spectrometer to the spectral data space of the main spectrometer, includes the following steps:

S01, according to the initial main spectrum SP_master0, the initial slave spectrum sp_slave0 and the number L of hidden-layer nodes, generating a weight matrix β^(0) from the hidden layer to the output layer, wherein SP_master0 contains M_0 samples;

S02, when new SP_master(k+1) and sp_slave(k+1) containing M_(k+1) samples arrive, calculating the weight matrix β^(k+1) from the hidden layer to the output layer according to the online sequential extreme learning machine algorithm, wherein k ≥ 0;

S03, if further new SP_master(k+1) and sp_slave(k+1) arrive, setting k = k + 1 and going to S02; otherwise going to S04;

S04, migrating the test data sp_slaveTest containing N samples according to the following formula:

sp_slaveTomaster = H'_test β^(new)

wherein sp_slaveTomaster represents the spectral data after migration; β^(new) represents the latest weight matrix from the hidden layer to the output layer; H'_test is the hidden-layer output matrix computed from sp_slaveTest with the activation function g(w, sp_slaveTest, b); and w and b are respectively a randomly generated orthogonal input weight matrix and an offset.
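The migration steps S01 to S04 differ from the content model mainly in that the target is a whole spectrum rather than a single content value. The sketch below illustrates this with a multi-output weight matrix and an input weight matrix made orthogonal by QR decomposition. The function names, the QR construction, the recursive update (the commonly published OS-ELM recursion) and the synthetic spectra with an artificial gain and offset are all assumptions made for illustration.

```python
import numpy as np


def orthogonal_weights(n_inputs, n_hidden, rng):
    """Random matrix with orthonormal columns (rows if n_hidden > n_inputs)."""
    a = rng.standard_normal((max(n_inputs, n_hidden), min(n_inputs, n_hidden)))
    q, _ = np.linalg.qr(a)
    return q if n_inputs >= n_hidden else q.T


def hidden_output(sp, w, b):
    # Sigmoid hidden layer applied to each spectrum (row of sp).
    return 1.0 / (1.0 + np.exp(-(sp @ w + b)))


def init_transfer(sp_slave0, sp_master0, w, b):
    # S01: beta(0) = (H'0^T H'0)^-1 H'0^T SP_master0
    H0 = hidden_output(sp_slave0, w, b)
    P = np.linalg.inv(H0.T @ H0)
    return P, P @ H0.T @ sp_master0


def update_transfer(P, beta, sp_slave_new, sp_master_new, w, b):
    # S02: update beta from the newly arrived block only.
    H = hidden_output(sp_slave_new, w, b)
    K = np.linalg.inv(np.eye(H.shape[0]) + H @ P @ H.T)
    P = P - P @ H.T @ K @ H @ P
    beta = beta + P @ H.T @ (sp_master_new - H @ beta)
    return P, beta


def migrate(sp_slave_test, beta, w, b):
    # S04: sp_slaveTomaster = H'_test @ beta(new)
    return hidden_output(sp_slave_test, w, b) @ beta


rng = np.random.default_rng(7)
n_channels, L = 30, 8
sp_master = rng.uniform(0.0, 1.0, size=(48, n_channels))
sp_slave = 1.05 * sp_master + 0.02      # assumed instrument difference
w = orthogonal_weights(n_channels, L, rng)
b = rng.standard_normal(L)

P, beta = init_transfer(sp_slave[:20], sp_master[:20], w, b)     # M0 = 20
for start in range(20, 40, 5):                                   # blocks of 5
    P, beta = update_transfer(P, beta, sp_slave[start:start + 5],
                              sp_master[start:start + 5], w, b)
sp_slave_to_master = migrate(sp_slave[40:], beta, w, b)
```

The weight matrix `beta` has one column per wavelength channel, so a single update adjusts the mapping for the whole spectrum at once.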

In the above method for determining the component content of a sample based on the online sequential extreme learning machine, the number L of hidden-layer nodes is less than or equal to the number M_0 of initial samples. As a result, the OSELM-based sample component content prediction model computes faster and consumes fewer system resources.
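One way to see why L should not exceed M_0 is to look at the rank of the initial hidden-layer output matrix: the initial weight formula inverts H_0^T H_0, which has the same rank as H_0 and therefore cannot be invertible when H_0 has fewer rows than columns. The following sketch checks this numerically on random data; the sizes, the scaling of `w` and the helper name `hidden_rank` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, L = 50, 10
# Random input weights and offsets (scaled so the sigmoid does not saturate).
w = rng.standard_normal((n_channels, L)) / np.sqrt(n_channels)
b = rng.standard_normal(L)


def hidden_rank(M0):
    """Rank of the M0 x L hidden-layer output matrix for random inputs."""
    X0 = rng.standard_normal((M0, n_channels))
    H0 = 1.0 / (1.0 + np.exp(-(X0 @ w + b)))
    return np.linalg.matrix_rank(H0)


rank_small = hidden_rank(6)     # M0 < L: rank is at most 6, so H0^T H0 is singular
rank_large = hidden_rank(25)    # M0 >= L: full column rank is possible
```

With fewer initial samples than hidden nodes, the L x L matrix H_0^T H_0 has rank below L and the closed-form initialization is not available.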

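Preferably, in step S1,

α^(0) = (H_0^T H_0)^(-1) H_0^T y_0;

wherein H_0 is the hidden-layer output matrix of the initial main spectrum SP_master(0).

Preferably, in step S2, α^(k+1) is obtained from α^(k) by the online sequential extreme learning machine update, which uses only the hidden-layer output matrix of the newly arrived data SP_master(k+1) and the corresponding content y_(k+1).

Preferably, in step S01,

β^(0) = (H'_0^T H'_0)^(-1) H'_0^T SP_master0;

wherein H'_0 is the hidden-layer output matrix of the initial slave spectrum sp_slave0.

Preferably, in step S02, β^(k+1) is obtained from β^(k) in the same way, using only the hidden-layer output matrix of the newly arrived slave spectrum sp_slave(k+1) and the corresponding main spectrum SP_master(k+1).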
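The update formulas for steps S2 and S02 are not reproduced in the text. A commonly published form of the OS-ELM recursion, given here as an assumption rather than as the claimed formula, is shown below, where H_{k+1} denotes the hidden-layer output matrix of the (k+1)-th block and P_k is an auxiliary L × L matrix:

```latex
\begin{aligned}
P_0 &= \bigl(H_0^{T} H_0\bigr)^{-1}, \qquad \alpha^{(0)} = P_0 H_0^{T} y_0,\\
P_{k+1} &= P_k - P_k H_{k+1}^{T}\bigl(I + H_{k+1} P_k H_{k+1}^{T}\bigr)^{-1} H_{k+1} P_k,\\
\alpha^{(k+1)} &= \alpha^{(k)} + P_{k+1} H_{k+1}^{T}\bigl(y_{k+1} - H_{k+1}\,\alpha^{(k)}\bigr).
\end{aligned}
```

For the migration model the same recursion applies with H' in place of H, β in place of α, and the main spectrum block SP_master(k+1) in place of y_{k+1}. The matrix inverted in each update has the size of the new block (M_(k+1) × M_(k+1)), so single samples and blocks are handled by the same formula.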

In the invention, the optimal number L of hidden-layer nodes is determined by k-fold cross validation, and the sigmoid function is adopted as the activation function, which can improve the prediction precision of the sample component content.
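Selecting L by k-fold cross validation can be sketched as follows. This is an illustrative outline only: the helper names, the candidate list and the synthetic data are assumptions, and `numpy.linalg.lstsq` is used for the least-squares solve, which matches the closed-form solution when the hidden-layer output matrix has full column rank.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elm_fit_predict(X_train, y_train, X_val, n_hidden, seed=0):
    """Fit a batch ELM on the training fold and predict the validation fold."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((X_train.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = sigmoid(X_train @ w + b)
    alpha, *_ = np.linalg.lstsq(H, y_train, rcond=None)
    return sigmoid(X_val @ w + b) @ alpha


def kfold_rmse(X, y, n_hidden, k=8, seed=0):
    """Root mean square error pooled over k validation folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    sq_err = 0.0
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = elm_fit_predict(X[train], y[train], X[val], n_hidden, seed)
        sq_err += np.sum((pred - y[val]) ** 2)
    return np.sqrt(sq_err / len(X))


def select_hidden_nodes(X, y, candidates, k=8):
    """Return the candidate L with the lowest cross-validated RMSE."""
    scores = {L: kfold_rmse(X, y, L, k) for L in candidates}
    return min(scores, key=scores.get), scores


# Synthetic stand-in data (not the spectra used in the experiments).
rng = np.random.default_rng(3)
X = rng.standard_normal((64, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(64)
best_L, scores = select_hidden_nodes(X, y, candidates=[2, 4, 8, 16, 32])
```

Plotting `scores` against the candidate values gives a curve of error versus number of hidden nodes from which the minimum can be read off.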

The method for measuring the component content of the sample is suitable for all kinds of spectrum samples in online detection, and particularly has better effect on measuring the component content of tablets and corns.

Preferably, the spectral data of the slave spectrometer, after being migrated to the spectral data space of the main spectrometer, are taken as a new main spectrum SP_master(k+1); the component content of the sample is determined with the content prediction model established on the main spectrometer to obtain the corresponding sample component content y_(k+1); k is then set to k + 1 and the method goes to S2. The prediction accuracy of the model can thereby be further improved.

Compared with the prior art, the method performs modeling with the online sequential extreme learning machine algorithm, so that the previously used data need not be retained and only the previously learned knowledge is kept for later use. When new spectral data arrive, only the hidden-layer output of the newly arrived data needs to be calculated, and the output weights between the hidden layer and the output layer are then updated dynamically from the learned knowledge, so that modeling is fast. Compared with traditional modeling methods, the modeling speed is increased, unnecessary repeated computation and data storage are reduced, the precision and generalization performance of the model are improved, and the method can process data that arrive one sample at a time as well as data that arrive block by block. In addition, the invention migrates the spectral data of the slave spectrometer to the spectral data space of the main spectrometer by modeling, with the online sequential extreme learning machine algorithm, the spectral data samples collected on the main spectrometer and on the slave spectrometer; the sample component content is then determined with the content prediction model established on the main spectrometer, which improves the precision and stability of the sample component content prediction. The experimental results show that the calibration migration method based on the online sequential extreme learning machine algorithm performs better on the tablet data set and the corn data set than PDS and the CCA-based spectral migration algorithm.

Drawings

FIG. 1 shows spectra in the maize dataset (a) for MP5, (b) for M5, (c) for MP 6;

FIG. 2 shows the tablet data set: (a) master spectrum, (b) slave spectrum, (c) deviation spectrum;

FIG. 3 is a diagram illustrating RMSEP as a function of the number of hidden nodes;

FIG. 4 is the M5 and MP5 deviations after migration (a) and the M5 and MP5 deviations without migration (b) of the maize dataset;

FIG. 5 is a schematic illustration of the effect of modeled sample number on RMSEP;

FIG. 6 is a diagram of the relationship between RMSEP and the number of hidden nodes;

FIG. 7 is a schematic diagram of predicted values of various types of spectra with respect to water;

FIG. 8 is a schematic diagram showing the predicted values of protein (a), starch (b), oil (c), and water (d);

FIG. 9 is a schematic diagram of an ELM-based migration algorithm versus OSELM-based migration algorithm modeling runtime comparison;

FIG. 10 is a schematic representation of the relationship of hidden nodes to the spectrum RMSEP in a tablet data set;

FIG. 11 is a schematic diagram of (a) the residual after spectral migration and (b) the residual without migration, based on the tablet data;

FIG. 12 is a graphical representation comparing the predicted value to the actual value of the first active ingredient in a pharmaceutical tablet;

FIG. 13 is a graphical representation comparing the predicted value and the actual value of (a) a second active ingredient and (b) a third active ingredient in a pharmaceutical tablet;

FIG. 14 is a graphical comparison of predicted results of PDS, CCA, TLOSELM based on moisture content of corn data;

FIG. 15 is a graphical comparison of predicted results for PDS, CCA, TLOSELM based on the third active ingredient of the tablet;

FIG. 16 is a schematic flow chart of the method of the present invention.

The invention is further described with reference to the following figures and detailed description.

Detailed Description

Example 1 of the invention: a sample component content determination method based on an online sequence extreme learning machine is disclosed, as shown in FIG. 16, a spectral data sample of a sample is collected, and modeling is performed by utilizing an online sequence extreme learning machine algorithm; and determining the component content of the sample by using the established model.

The method specifically comprises steps S1 to S4 as set out above, the predicted component content being obtained in step S4 as

pre_y = H_pre α^(new)

wherein H_pre is the hidden-layer output matrix of the spectral data to be predicted and α^(new) is the latest hidden-to-output weight matrix.

In order for the OSELM-based sample component content prediction model to compute faster and consume fewer system resources, the number L of hidden-layer nodes is less than or equal to the number M_0 of initial samples.

In step S1,

α^(0) = (H_0^T H_0)^(-1) H_0^T y_0;

wherein H_0 is the hidden-layer output matrix of the initial main spectrum.

In step S2, α^(k+1) is calculated from α^(k) according to the online sequential extreme learning machine algorithm.

In order to perform accurate content prediction on spectral data collected on a slave spectrometer, the method further comprises the following steps: modeling spectral data samples of the same samples collected on the main spectrometer and on the slave spectrometer with the online sequential extreme learning machine algorithm, so as to migrate the spectral data of the slave spectrometer to the spectral data space of the main spectrometer; and then determining the component content of the sample with the content prediction model established on the main spectrometer.

This modeling comprises steps S01 to S04 as set out above, the test data being migrated in step S04 as

sp_slaveTomaster = H'_test β^(new)

Specifically, in step S01,

β^(0) = (H'_0^T H'_0)^(-1) H'_0^T SP_master0;

wherein H'_0 is the hidden-layer output matrix of the initial slave spectrum.

In step S02, β^(k+1) is calculated from β^(k) according to the online sequential extreme learning machine algorithm.

In the method, the optimal number L of hidden-layer nodes is determined by k-fold cross validation, and the activation function is the sigmoid function.

Example 2: a sample component content determination method based on an online sequence extreme learning machine is characterized in that a spectral data sample of a sample is collected, and modeling is carried out by utilizing an online sequence extreme learning machine algorithm; and determining the component content of the sample by using the established model.

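The method specifically comprises steps S1 to S4 as set out above, the predicted component content being obtained in step S4 as

pre_y = H_pre α^(new)

wherein H_pre is the hidden-layer output matrix of the spectral data to be predicted and α^(new) is the latest hidden-to-output weight matrix.

Specifically, in step S1,

α^(0) = (H_0^T H_0)^(-1) H_0^T y_0;

wherein H_0 is the hidden-layer output matrix of the initial main spectrum.

Specifically, in step S2, α^(k+1) is calculated from α^(k) according to the online sequential extreme learning machine algorithm.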

To verify the effect of the invention, the inventors compared the OSELM-based (online sequential extreme learning machine) migration algorithm of the invention with the PDS algorithm and the CCA-based migration algorithm. In the experiments, the slave spectra are migrated to the master space with the PDS, CCA and OSELM based models. The migrated spectral data are then substituted into the prediction model established on the master spectra, which relates spectra to component content, in order to predict the component content.

1.1 Experimental Environment

This experiment was performed with Python 2.7. The computer ran 64-bit Windows 8.1 with an AMD A84500 CPU and 8 GB of memory. Several commonly used Python packages were used for the experiments, such as numpy, matplotlib and sklearn. The programs used in the experiment were developed in the Eclipse integrated development environment.

1.2 Experimental data

In this experiment, a corn NIR spectral data set containing eighty NIR data samples was used. The data set contains three NIR spectral data tables, M5, MP5 and MP6 (NIR spectra of the same substance, i.e. corn, measured on different spectrometers), together with the physicochemical properties corresponding to the eighty spectral data samples: water content (water), oil content (oil), protein content (protein) and starch content (starch). The spectra cover the wavelength range from 1100 nm to 2498 nm at 2 nm intervals (700 channels). The MP5 spectra, from a FOSS NIRSystems 5000, serve as the master spectra, while M5 and MP6, from a FOSS NIRSystems 6000 and a FOSS NIRSystems 5000 respectively, serve as the slave spectra.
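The stated channel count follows directly from the wavelength range and spacing, as this small check shows (the variable names are illustrative):

```python
import numpy as np

# Wavelength axis from 1100 nm to 2498 nm inclusive, in 2 nm steps.
wavelengths_nm = np.arange(1100, 2498 + 2, 2)
n_channels = wavelengths_nm.size
```

Each spectrum is therefore a vector with one value per entry of `wavelengths_nm`, and a table of 80 samples is an 80 × 700 matrix.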

In FIG. 1, subgraph (a) is the spectrum of MP5, subgraph (b) the spectrum of M5, and subgraph (c) the spectrum of MP6. FIG. 1 shows that the spectra of MP5 and MP6 are very similar, because they were measured on the same type of spectrometer. The M5 spectra differ considerably from the MP5 spectra: compared with MP5, M5 shows a clear upward shift.

A tablet spectral data set was also used in this experiment. It is a set of NIR spectral data published by IDRC in 2002 and includes 654 tablets measured on two spectrometers, FOSS NIRSs and Silver-Spring. The NIR spectral data from the two spectrometers are divided into a calibration set (155 spectral data samples), a test set (460 spectral data samples) and a validation set (40 spectral data samples). The spectral wavelengths in the data set are concentrated between 1100 nm and 1750 nm.

In FIG. 2, subgraphs (a) and (b) show the spectra of the validation set in the tablet data set collected for the same substances on spectrometer SPEC_1 and spectrometer SPEC_2, respectively. Subgraph (c) of FIG. 2 shows the difference between the spectra measured on SPEC_1 and those measured on SPEC_2. Subgraph (c) shows that the spectra obtained on the two spectrometers do not differ significantly over most of the wavelength range. The two sets of spectra can even be called very similar, since the difference stays close to 0 and only becomes relatively large in the band above 1700 nm.

1.3 design of the experiment

1.3.1 corn-based experiments

The NIR spectral data of corn were divided into three parts: a calibration set, a validation set and a test set. The calibration set contained 56 NIR spectral data samples, the validation set 8 and the test set 16. The calibration set is used to build the online migration model, and the validation set is used to select the optimal number of hidden nodes. The test set is used to assess the generalization performance of the algorithm.
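A 56/8/16 division of the 80 corn samples can be produced as in the sketch below. The text does not say how samples were assigned to the three sets; the random permutation, the seed and the function name `split_indices` are assumptions made for illustration.

```python
import numpy as np


def split_indices(n_samples=80, n_cal=56, n_val=8, seed=0):
    """Shuffle sample indices and cut them into calibration/validation/test."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return idx[:n_cal], idx[n_cal:n_cal + n_val], idx[n_cal + n_val:]


cal_idx, val_idx, test_idx = split_indices()
```

The same index arrays would be applied to the master spectra, the slave spectra and the physicochemical values so that the rows stay paired.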

1. Selection of the optimal number L of hidden nodes

To select the optimal number of hidden nodes for the migration model, the inventors used 8-fold cross validation and plotted the root mean square error obtained by cross validation for different numbers of hidden-layer nodes. The root mean square error of the spectra after migration is calculated with equation (1), where NumSample represents the number of samples in the batch of NIR spectra to be migrated. FIG. 3 shows that as the number of hidden nodes increases, RMSEP first decreases and then slowly increases, and beyond a certain number of hidden nodes RMSEP rises markedly faster. It follows from the figure that, in terms of the root mean square error of the spectral migration, 19 hidden nodes is the most suitable setting.
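Equation (1) itself is not reproduced in the text. One plausible form, assumed here purely for illustration, sums the squared differences between migrated and master spectra over all channels and divides by NumSample before taking the square root; whether the original also normalizes by the number of channels is not known.

```python
import numpy as np


def rmsep(sp_migrated, sp_master):
    """Root mean square error of migrated spectra against master spectra.

    Assumed definition: sqrt(sum of squared differences / NumSample),
    where NumSample is the number of spectra (rows).
    """
    diff = np.asarray(sp_migrated, dtype=float) - np.asarray(sp_master, dtype=float)
    num_sample = diff.shape[0]
    return np.sqrt(np.sum(diff ** 2) / num_sample)


# Tiny hand-checkable example: one spectrum with two channels.
example = rmsep([[3.0, 4.0]], [[0.0, 0.0]])
```

A perfect migration gives a value of zero, and larger values indicate that the migrated spectra lie further from the master spectra.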

Therefore, the number L of hidden nodes of the model is set to the optimal value 19. The algorithm is initially given 32 sp_master NIR spectral data samples and 32 sp_slave NIR spectral data samples. The sp_slave samples are the input of the model and the sp_master samples the output, so the model is essentially a mapping from sp_slave to sp_master. Establishing the initial migration model yields the initial output weight matrix β^(0). Then, assuming that one NIR spectral data sample arrives each time, the updated output weight matrix β^(k) is calculated following the OSELM approach. When a slave spectrum needs to be migrated, it is substituted into the established migration model. In this way the NIR spectral data are migrated to the master spectral space by the OSELM algorithm.

Subgraph (a) at the top of FIG. 4 shows the result of subtracting the corresponding sp_masterTest from the migrated sp_slaveTest, and subgraph (b) at the bottom shows the result of subtracting sp_masterTest directly from sp_slaveTest. Subgraph (a) shows that the transfer learning model realized with OSELM migrates most of sp_slaveTest well: after migration, the difference between sp_slaveTest and sp_masterTest fluctuates only slightly around zero, whereas the difference between the slave spectrum without migration and the master spectrum fluctuates between 0.04 and 0.06.

2. Influence of initial sample number on algorithm precision during calibration and migration of spectrum

The model is a transfer learning model built on OSELM. One initialization issue in the OSELM algorithm is how many samples to use as the initial samples from which the base model is generated. Different numbers of initial samples may affect the established migration model differently; Table 1 below shows the effect of the initial sample number on the model.

The results are consistent with the essence of the OSELM algorithm: in online modeling of an NIR spectral data set that arrives as a stream, all incoming data samples are ultimately utilized. OSELM differs from ELM in that, as an online modeling algorithm, it reduces unnecessary repeated computation. However, if the number of initial NIR samples is smaller than the number of hidden nodes, solving for the output matrix β^(0) may involve inverting a singular matrix (H'_0^T H'_0 cannot have full rank when there are fewer samples than hidden nodes), so that the root mean square error of the migration result becomes larger.

TABLE 1 initialization sample number vs. RMSEP

3. Influence of each sequentially arriving data block size on algorithm precision

The above experiments assume that samples arrive one by one. The following examines whether the manner in which the NIR spectral data arrive influences the accuracy of the model. By default, the experiment uses 15 hidden nodes and 32 initial samples. Step in Table 2 represents the number of NIR spectral samples arriving each time. The table shows that whether the NIR spectral data samples arrive one by one or block by block has no effect on the performance of the algorithm. All incoming NIR samples are eventually used, so the root mean square error of prediction on the test set is the same as long as the same number of NIR samples and the same number L of hidden nodes are used for modeling.

TABLE 2 influence of streaming data sample size on RMSEP
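The independence from the block size can be checked numerically. The sketch below runs an OS-ELM style fit with 32 initial samples and 15 hidden nodes, as in the default setting above, and feeds the remaining samples in blocks of different sizes. The recursive update used is the commonly published OS-ELM recursion, and the function names and random data are assumptions for illustration.

```python
import numpy as np


def hidden(X, w, b):
    # Sigmoid hidden-layer output matrix.
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def oselm_fit(H, T, n_initial, step):
    """Initial batch solve, then sequential updates in blocks of `step` rows."""
    H0, T0 = H[:n_initial], T[:n_initial]
    P = np.linalg.inv(H0.T @ H0)
    beta = P @ H0.T @ T0
    for start in range(n_initial, len(H), step):
        Hk, Tk = H[start:start + step], T[start:start + step]
        K = np.linalg.inv(np.eye(len(Hk)) + Hk @ P @ Hk.T)
        P = P - P @ Hk.T @ K @ Hk @ P
        beta = beta + P @ Hk.T @ (Tk - Hk @ beta)
    return beta


rng = np.random.default_rng(1)
X = rng.standard_normal((56, 6))        # stand-in inputs, 56 samples
T = rng.standard_normal((56, 3))        # stand-in targets
w = rng.standard_normal((6, 15))        # 15 hidden nodes
b = rng.standard_normal(15)
H = hidden(X, w, b)

# Same data, different block sizes ("Step").
betas = {step: oselm_fit(H, T, n_initial=32, step=step) for step in (1, 2, 4, 8)}
fitted = {step: H @ beta for step, beta in betas.items()}
```

Up to floating-point rounding, every block size leads to the same weights, because each variant solves the same least-squares problem over all 56 samples.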

4. Influence of number of samples on prediction precision based on OSELM migration model

FIG. 5 shows the root mean square error between the NIR spectral data migrated by OSELM and sp_masterTest as the NIR spectral data samples arrive sequentially. As online learning progresses and more and more NIR data samples accumulate, the root mean square error of the migrated test set shows a generally decreasing trend. The figure also shows that the prediction error increases at about the 55th and 56th samples, possibly because relatively abnormal data samples occurred and made the result worse. In terms of the general trend, however, the longer the online learning runs, the more knowledge about the corn spectral data is accumulated and the more accurate the transfer learning becomes. From the data in the figure it can be concluded that the more samples arrive, the more accurate the model and the more stable its generalization capability. This conclusion matches the requirement of online modeling: if data arrive not all at once but one sample or one block at a time, and predictions are needed in the meantime, a model must be built on the sample data available so far; when the next samples arrive, the model is updated rather than rebuilt from scratch, which is the essence of OSELM.

The previous experiments discussed the root mean square error between the migrated spectrum and the main spectrum, but the ultimate goal of the migration model is that sp_slave collected on SPEC_slave can use the online-updated model built on SPEC_master, i.e. that the physicochemical properties associated with sp_slave can be predicted with the model built on SPEC_master. The following tests therefore evaluate the effect of migration at the level of physicochemical properties. Because the migration from sp_slave to sp_master is performed online, the prediction model relating sp_master to the corresponding physicochemical property y also needs to be updated dynamically. In the following experiment, sp_slave collected on the slave spectrometer is first migrated online to the sp_master space, and the migrated spectrum is then substituted into the prediction model built on sp_master and y. For the corn data set, the moisture, protein, oil and starch contents are to be predicted from the sp_slave data.

5. Performance of the migration model in terms of physicochemical indices

First, the optimal number of hidden nodes of the prediction model from the main spectrum to the physicochemical properties is selected by k-fold cross validation. In the following experiment, the migrated spectral data are substituted into the newly established physicochemical property prediction model. In Table 3, L represents the number of hidden nodes of the migration model and RMSEP represents the root mean square error of the predicted physicochemical properties. If the number of hidden nodes is set too small, the root mean square error of the prediction is relatively large, and once the number of hidden nodes exceeds a certain point, the prediction error starts to increase again. The table shows very clearly that when the number of hidden nodes exceeds the number of initial samples, the calculated root mean square error increases sharply.

TABLE 3 influence of number of hidden nodes on RMSEP of physicochemical indices

In order to show the result more clearly, the relationship between RMSEP and the number L of hidden nodes can be plotted by a broken line. The trend can be seen more intuitively in fig. 6.

The abscissa in FIG. 7 represents water_test and the ordinate represents the corresponding predicted value pre_water. Triangles represent the predicted values obtained by substituting sp_slaveTest directly into the physicochemical property prediction model, and five-pointed stars represent the predicted values obtained by migrating sp_slaveTest first and then substituting it into the prediction model. Plus signs represent the predicted values obtained by substituting sp_masterTest into the model. FIG. 7 shows that the physicochemical properties predicted from the migrated spectral data are very close to the true values, whereas if the spectra collected on SPEC_slave are fed directly into the physicochemical property prediction model of SPEC_master without migration, the prediction error is very large. FIG. 7 thus shows that the OSELM-based migration model performs very well on the maize data set.

In FIG. 8, the triangles, stars and plus signs in the subgraphs have the same meaning as in FIG. 7. The abscissas of subgraphs (a), (b), (c) and (d) are protein, starch, oil and water content in turn, and the ordinate of each subgraph is the predicted value of the corresponding quantity. Subgraph (a) depicts the prediction model established for protein, subgraph (b) the model for starch, and subgraph (c) the model for oil content in the corn data set. Subgraph (d) is the same as FIG. 7. The effect of spectral data migration can be clearly seen in FIG. 8: in all four subgraphs the triangles and the stars are clearly separated.

6. Operational efficiency of OSELM

In the screenshot of FIG. 9, the first row gives the time required to run the conventional ELM once, the second row the time required for OSELM to perform 11 online migrations, and the third row the time required to model 11 times with the ELM algorithm. FIG. 9 shows that, for this experiment on the corn data set, running the ELM migration algorithm once takes approximately 0.02 seconds, while the OSELM migration algorithm takes approximately 0.04 seconds for 11 modeling runs. A simple calculation (11 ELM runs at roughly 0.02 seconds each would take about 0.22 seconds) shows that the OSELM-based migration model is much faster than a migration model that is rebuilt with ELM alone. OSELM is faster than repeated modeling with the traditional ELM because it avoids repeated computation: it retains the knowledge learned in the previous modeling and, when new data arrive, only needs to learn from the new data.
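The way OSELM keeps only the learned knowledge between updates can be illustrated with a minimal sketch. It assumes the standard OS-ELM recursive least squares update, a sigmoid activation, and plain uniform random input weights (the orthogonalization of the input weights is omitted for brevity); the class name `OSELM`, its method names and the synthetic "slave" and "master" arrays are illustrative assumptions and are not taken from the patent.

```python
import numpy as np


class OSELM:
    """Illustrative OS-ELM regressor: only W, b, P and beta are kept between updates."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Random input weights and biases, fixed once generated (assumption: uniform).
        self.W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))
        self.b = rng.uniform(-1.0, 1.0, size=n_hidden)
        self.P = None      # inverse of the accumulated H^T H
        self.beta = None   # hidden-to-output weight matrix

    def _hidden(self, X):
        # Sigmoid hidden layer output for a block of samples.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit_initial(self, X0, T0):
        # Initial batch; requires at least as many samples as hidden nodes.
        H0 = self._hidden(X0)
        self.P = np.linalg.inv(H0.T @ H0)
        self.beta = self.P @ H0.T @ T0

    def partial_fit(self, X, T):
        # Update with a new block only; earlier data are not needed.
        H = self._hidden(X)
        gain = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ gain @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta


# Synthetic stand-in for slave and master spectra (not real NIR data).
rng = np.random.default_rng(1)
slave = rng.normal(size=(74, 6))
master = 1.1 * slave + 0.05

model = OSELM(n_inputs=6, n_hidden=8)
model.fit_initial(slave[:30], master[:30])
for start in range(30, 74, 4):  # 11 online updates of 4 samples each
    model.partial_fit(slave[start:start + 4], master[start:start + 4])
migrated = model.predict(slave)
```

Each call to `partial_fit` touches only the new block and the stored `P` and `beta`, which is why 11 online updates can cost far less than 11 full remodeling runs.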

7. Summary of experiments based on maize dataset:

From the above experimental results, it can be seen that the OSELM-based NIR spectral data migration model performs well on the corn data set. The results obtained by substituting the migrated spectra and the unmigrated spectra into the physicochemical property prediction model are clearly different. Compared with traditional online learning algorithms, the algorithm is faster and more adaptable. On the corn data set, the model can handle data that arrive one sample at a time as well as two, three or more samples at a time, and the algorithm does not need to be told the size of the next data block before it starts running. The algorithm makes it possible to predict the water, oil, protein and starch content of slave NIR spectra directly with the model built on the master instrument, without repeated modeling.

1.3.2 tablet-based experiments

The data in the tablet data set were collected on two different spectrometers. For convenience, spectrometer 1 is denoted SPEC₁ and spectrometer 2 is denoted SPEC₂; the NIR spectral data collected on SPEC₁ are denoted sp₁ and those collected on SPEC₂ are denoted sp₂. As distributed, the data set contains 9 tables: a calibration set (calibrate_1, calibrate_2, calibrate_Y), a validation set (valid_1, valid_2, valid_Y) and a test set (test_1, test_2, test_Y), where the suffixes 1 and 2 denote NIR spectral data from SPEC₁ and SPEC₂ respectively. In machine learning, the calibration set is typically used to build the model and the validation set to select the model's hyper-parameters. The test set is typically used to test the performance of the model: the test data are fed into the established model, and the result is used to judge the generalization ability and accuracy of the model and whether overfitting has occurred. However, the calibration set contains only 155 samples while the test set contains as many as 460. This does not match the usual proportions in machine learning, so the experimental data are re-divided in this experiment. The validation set is not used for the time being; the effect of changing the hyper-parameters on the performance of the algorithm is shown instead through the results of repeated experiments. The test data in this experiment consist of 40 NIR spectral data samples.

1. Selecting optimal number of hidden layer nodes based on spectrum migration

The optimal number of hidden nodes differs between data sets. The tablet data set has more NIR spectral samples than the corn data set, so the initial number of samples used to compute β⁽⁰⁾ when building the preliminary migration model is set to 100. Under this setting, the relationship between the root mean square error of the spectral migration and the number L of hidden nodes is explored.

In FIG. 10, the abscissa hiddenNum is the number of hidden nodes in the migration algorithm, and the ordinate RMSEP is the root mean square error between the migrated spectrum and the corresponding master spectrum. FIG. 10 shows that RMSEP fluctuates as the number of hidden nodes increases but follows a downward trend overall, reaching its minimum when hiddenNum is about 59. As hiddenNum increases further, RMSEP begins to rise slowly, and it increases dramatically once the number of hidden nodes exceeds the initial number of samples.
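A scan of this kind can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain batch ELM solved with a pseudo-inverse rather than the OSELM migration model, and the function names `rmsep`, `elm_fit_predict` and `scan_hidden_nodes` are hypothetical.

```python
import numpy as np


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over all entries."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def elm_fit_predict(X_train, T_train, X_test, n_hidden, rng):
    """Fit a batch ELM with sigmoid hidden nodes and predict on X_test."""
    W = rng.uniform(-1.0, 1.0, size=(X_train.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)

    def hidden(X):
        return 1.0 / (1.0 + np.exp(-(X @ W + b)))

    # Least squares output weights via the Moore-Penrose pseudo-inverse.
    beta = np.linalg.pinv(hidden(X_train)) @ T_train
    return hidden(X_test) @ beta


def scan_hidden_nodes(X_train, T_train, X_test, T_test, candidates, seed=0):
    """Return the candidate L with the lowest RMSEP and all scores."""
    scores = {}
    for n_hidden in candidates:
        rng = np.random.default_rng(seed)  # same random stream for every candidate
        pred = elm_fit_predict(X_train, T_train, X_test, n_hidden, rng)
        scores[n_hidden] = rmsep(T_test, pred)
    best = min(scores, key=scores.get)
    return best, scores
```

Plotting `scores` against the candidate values would give a curve of the same kind as FIG. 10; the shape of that curve depends on the data and on the random weights, so no particular optimum is implied here.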

FIG. 11 shows that the NIR spectral data after migration by the OSELM algorithm mostly fluctuate around 0, generally within [-0.02, 0.02]. However, the migration effect is not very good for a few samples, probably because the difference in the original data is itself small. A careful look at the figure shows that the difference between the NIR data collected on SPEC₁ and those collected on SPEC₂ lies within [-0.1, 0.1], and that the difference between the spectra is close to 0 in most bands.

2. Effect of selection of activation function on algorithm accuracy

The choice of activation function can directly affect the performance of the algorithm. For example, the tanh function maps its input into [-1, 1] while the sigmoid function maps it into [0, 1], so the results can differ considerably. This section therefore studies the influence of the activation function on the accuracy of the algorithm, using the common sigmoid, tanh, tribas and hardlim functions, starting with the selection of the optimal activation function. In Table 4, RMSEP is the root mean square error of the predicted value of the first physicochemical property in the tablet data set and N is the number of samples used in modeling; 20 NIR samples are input initially and the number L of hidden nodes is set to 15. The experimental results show that RMSEP decreases as the number of samples learned by the model increases, whichever activation function is selected. Comparing the data in the table shows that the sigmoid function performs better on the tablet data set than the tanh, tribas and hardlim functions.
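For reference, the four activation functions can be written as below. This is a sketch that assumes tribas and hardlim follow the MATLAB transfer functions of the same names (triangular basis and hard limit); the dictionary name `ACTIVATIONS` is an illustrative choice.

```python
import numpy as np


def sigmoid(x):
    # Maps any real input into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))


def tribas(x):
    # Triangular basis: 1 - |x| on [-1, 1], 0 elsewhere.
    return np.maximum(1.0 - np.abs(x), 0.0)


def hardlim(x):
    # Hard limit: 1 where x >= 0, otherwise 0.
    return (np.asarray(x) >= 0).astype(float)


ACTIVATIONS = {
    "sigmoid": sigmoid,
    "tanh": np.tanh,  # maps into (-1, 1)
    "tribas": tribas,
    "hardlim": hardlim,
}
```

Swapping the hidden layer function for another entry of `ACTIVATIONS` while keeping everything else fixed is one way to set up the kind of comparison reported in Table 4.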

TABLE 4 Effect of activation function on Algorithm accuracy

3. Initializing the relation between the input sample number and the hidden node

The following experiment explores how the number of NIR spectral samples input at initialization and the number L of hidden nodes affect the performance of the algorithm when the migration model migrates the tablet NIR spectral data. In Table 5, the horizontal axis is the number L of hidden nodes, Sn is the number of NIR samples input at initialization, and the entries are RMSEP values. An empty cell in Table 5 indicates that the error there is very large. The values in the table show that RMSEP increases sharply once the number of hidden nodes exceeds the initial number of samples, with the error sometimes reaching the thousands. This again confirms that the number of hidden nodes must not exceed the initial number of samples. The table also shows that, for a fixed number of hidden nodes, changing the number of NIR samples input at initialization has essentially no effect on RMSEP (provided the number of samples is greater than or equal to the number of hidden nodes).
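One way to see why the number of hidden nodes is bounded by the initial sample count is to look at the rank of the initial hidden layer output matrix. The sketch below assumes the standard OS-ELM initialization, which inverts HᵀH for the initial block H; the function name `initial_hidden_rank` and the synthetic inputs are illustrative.

```python
import numpy as np


def initial_hidden_rank(n_samples, n_hidden, n_inputs=10, seed=0):
    """Rank of the initial hidden layer output matrix H0 (n_samples x n_hidden)."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(size=(n_samples, n_inputs))  # synthetic stand-in for spectra
    W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H0 = 1.0 / (1.0 + np.exp(-(X0 @ W + b)))
    # H0^T H0 is n_hidden x n_hidden and has the same rank as H0.
    return int(np.linalg.matrix_rank(H0))


rank_over = initial_hidden_rank(n_samples=20, n_hidden=30)  # more nodes than samples
rank_ok = initial_hidden_rank(n_samples=30, n_hidden=5)     # fewer nodes than samples
```

With 20 samples and 30 hidden nodes the rank cannot exceed 20, so the 30 x 30 matrix HᵀH is singular and its inverse is numerically meaningless, which is consistent with the very large errors reported in Table 5.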

TABLE 5 tablet data set initialization input sample number versus hidden layer node relationship

4. Effect of each incoming data block size on algorithm accuracy

Table 6 is used to examine whether the size of each incoming data block affects the accuracy of the algorithm. In this experiment, the number of hidden nodes is set to 90 and the number of NIR samples used for initialization is 100.

As can be seen from the data in Table 6, the size of each incoming block does not affect the accuracy of the algorithm when streaming data are processed by the OSELM algorithm, and the algorithm does not need to know the size of the next incoming block. Unlike most other algorithms, OSELM can handle data that arrive one sample at a time as well as data blocks of varying size.
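The independence from block size can also be argued from the update itself. The derivation below is a sketch in the standard OS-ELM notation, which is assumed here and may differ from the symbols used elsewhere in this document: H_k is the hidden layer output of the k-th block, T_k its targets, and P_k the stored inverse matrix.

```latex
% Recursive update for block k+1 (standard OS-ELM form, assumed notation)
P_{k+1} = P_k - P_k H_{k+1}^{T}\,\bigl(I + H_{k+1} P_k H_{k+1}^{T}\bigr)^{-1} H_{k+1} P_k ,
\qquad
\beta^{(k+1)} = \beta^{(k)} + P_{k+1} H_{k+1}^{T}\bigl(T_{k+1} - H_{k+1}\beta^{(k)}\bigr).

% By the Woodbury identity the first equation is equivalent to
P_{k+1}^{-1} = P_k^{-1} + H_{k+1}^{T} H_{k+1} = \sum_{j=0}^{k+1} H_j^{T} H_j .

% Multiplying the second equation by P_{k+1}^{-1} and using \beta^{(k)} = P_k \sum_{j \le k} H_j^{T} T_j :
P_{k+1}^{-1}\beta^{(k+1)} = P_k^{-1}\beta^{(k)} + H_{k+1}^{T} T_{k+1} = \sum_{j=0}^{k+1} H_j^{T} T_j ,
\qquad\text{so}\qquad
\beta^{(k+1)} = \Bigl(\sum_{j=0}^{k+1} H_j^{T} H_j\Bigr)^{-1} \sum_{j=0}^{k+1} H_j^{T} T_j .
```

Both sums run over all samples seen so far and do not depend on how those samples were split into blocks, so in exact arithmetic the final weights equal the batch least squares solution for any block size.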

TABLE 6 Block size of streaming data vs. RMSEP

5. Performance of the OSELM-based migration model in terms of physicochemical indices

Using the tablet NIR data set, the spectral data measured on the slave instrument SPEC₂ are first migrated online to the spectral space of the master instrument SPEC₁ and then substituted into the prediction model for the first active ingredient in the tablets. The prediction results for the first active ingredient can be observed in FIG. 12. In FIG. 12, SP1_predict denotes the predicted values obtained by substituting the test data of the master instrument into the prediction model for the first active ingredient; TLSP2_predict denotes the predicted values obtained from that model after applying the OSELM migration model; and SP2_predict denotes the predicted values obtained by substituting SP2 directly into the model without migration. The figure shows a clear separation between the triangles and the plus signs and stars, so the effect of migrating the slave spectra is still significant.

FIG. 13 plots the predicted values of SP1_test, TLSP2_test and SP2 for the second and third active ingredients in the tablets. The stars and plus signs do not fall on the straight line, which only indicates that the parameters are not well chosen or that the data set is poorly suited to the physicochemical property prediction model. Even so, the first subgraph shows that the predicted values for the plus signs and stars are essentially the same, while the triangles are clearly separated from both. The second subgraph shows much the same behavior as the first and is not discussed further.

1.3.3 comparative test

Whether the model established above is superior to other algorithms can only be determined by comparison. The PDS and CCA algorithms described in the background section are used in the comparative experiments. The window size is a hyper-parameter of the PDS algorithm, and different window sizes affect its results to some extent; the tables below show the performance of PDS for different window sizes. PDS, CCA and TLOSELM (the OSELM-based migration algorithm provided by the invention) are each used to migrate the sample spectral data, and the migrated spectra are then substituted into the online-established prediction model for the physicochemical properties (whose hyper-parameters are selected by cross-validation) to predict the physicochemical property values corresponding to the migrated spectra. For the corn data set, only the root mean square errors of two physicochemical properties, moisture and protein, are compared. For the tablet data set, the root mean square errors of the predicted values of the three active ingredients are compared.

Tables 7 and 8 compare the prediction errors of the three algorithms for corn moisture and protein content, respectively. Tables 9, 10 and 11 show the prediction errors of PDS, CCA and TLOSELM on the tablet data set for the first, second and third active ingredients, respectively. In these tables, N is the number of samples used to construct the migration model and RMSEP is the root mean square error of the predicted physicochemical property. From the data in Table 7 and FIG. 9, it can be concluded that, in the corn data set, the migration model based on the CCA algorithm is better than the PDS algorithm when the M5 spectrum is migrated to the MP5 spectrum, but PDS is better when the MP6 spectrum is migrated to the MP5 spectrum. TLOSELM (the online sequential extreme learning machine-based migration algorithm of the invention) performs best of the three algorithms on the entire corn NIR spectral data set. As the number of modeling samples increases, the RMSEP of the TLOSELM and CCA models decreases steadily, whereas the PDS algorithm is unstable: its RMSEP first decreases as the number of samples increases and then begins to rise as the number continues to increase. Tables 9, 10 and 11 show that TLOSELM is still the best of the three algorithms on the tablet data set. CCA is better than PDS for the first active ingredient, but PDS is better than CCA for the second and third active ingredients. FIG. 14 shows the prediction results of the three algorithms for corn moisture content, and FIG. 15 shows their prediction results for the third active ingredient of the tablet data.

TABLE 7 Prediction error for moisture in the corn data set under different algorithms

TABLE 8 prediction error of maize data set for protein under different algorithms

TABLE 9 prediction error of first active ingredient of tablet data set

TABLE 10 prediction error of second active ingredient in tablet data set

TABLE 11 prediction error of third active ingredient for tablet data set

The above experiments lead to the following conclusion: on the corn and tablet data sets, the TLOSELM algorithm of the invention migrates slave spectra to the master spectrum better than the PDS and CCA algorithms. When MP6 is the slave spectrum and MP5 the master spectrum in the corn data set, TLOSELM is still the best, but CCA is not much better than PDS. On the tablet data set, or when M5 is the slave spectrum and MP5 the master spectrum, CCA and TLOSELM have a great advantage over PDS, and the TLOSELM algorithm of the invention performs best.

Claims

1. A sample component content determination method based on an online sequential extreme learning machine, characterized by comprising the following steps: collecting spectral data samples of a sample, and modeling the spectral data samples collected on the master spectrometer and the slave spectrometer with the online sequential extreme learning machine algorithm so as to migrate the spectral data of the slave spectrometer to the spectral data space of the master spectrometer; then determining the component content of the sample with a content prediction model established on the master spectrometer; wherein modeling the spectral data samples collected on the master spectrometer and the slave spectrometer with the online sequential extreme learning machine algorithm, so as to migrate the spectral data of the slave spectrometer to the spectral data space of the master spectrometer, comprises the following steps:

s01, according to the initial main spectrum

And from the spectrum

And the number L of hidden layer nodes, and generating a weight matrix from the hidden layer to the output layer

Wherein

The number of samples contained in

; wherein,

；

；

s02, when there is a new inclusion

Of a sample

And

when arriving, the weight matrix from the hidden layer to the output layer is calculated according to the algorithm of the online sequence extreme learning machine

; wherein,

；

，

；

；

；

s03, if there is new sample

And

arrive, let k = k + 1 and go to S02; otherwise go to S04;

s04, testing data containing N samples according to the following formula

Carrying out migration:

wherein,

representing the migrated spectral data;

a weight matrix representing the latest hidden layer to the output layer;

; w and b are respectively a randomly generated orthogonal input weight matrix and an offset;

is an activation function.

2. The method for measuring the content of the components in the sample based on the online sequential extreme learning machine as claimed in claim 1, which comprises the following steps:

s1, according to the initial main spectrum

And the corresponding sample component content

And the number L of nodes of the hidden layer, and calculating an initial weight matrix from the hidden layer to the output layer

Wherein, in the step (A),

and

comprises

samples;

s2, when there is a new main spectrum

And the corresponding sample component content

; wherein the (k+1)-th arriving data

And

comprises

samples;

；

s3, if there is also a new main spectrum

And corresponding sample component content

If yes, let k = k +1, go to S2, otherwise go to S4;

s4, calculating and obtaining the spectral data of the sample according to the following formula

Corresponding component content prediction value

：

wherein,

；

the latest hidden-to-output layer weight matrix,

comprises N samples; w and b are respectively a randomly generated orthogonal input weight matrix and an offset;

is an activation function.

3. The method for measuring the content of the sample components based on the online sequential extreme learning machine according to claim 1 or 2, wherein the number L of hidden nodes is less than or equal to the number of initial samples

。

4. The method for determining the content of a sample component based on an on-line sequential extreme learning machine according to claim 3, wherein in step S1,

；

wherein,

。

5. The method for determining the content of a sample component based on an on-line sequential extreme learning machine according to claim 4, wherein in step S2,

wherein,

；

；

。

6. The method for measuring the content of the sample components based on the online sequential extreme learning machine according to claim 1 or 2, characterized in that the optimal number L of hidden nodes is determined by a k-fold cross validation method, and the activation function is a sigmoid function.