WO2007106942A1 - Analysis of grape quality using neural network

Info

Publication number: WO2007106942A1
Application number: PCT/AU2007/000349
Authority: WO
Inventors: Les Janik
Original assignee: Commonwealth Scientific And Industrial Research Organisation; Grape And Wine Research And Development Corporation; The Australian Wine Research Institute; The University Of Adelaide; National Wine And Grape Industry Centre; State Of Victoria As Represented By Department Of Natural Resources And Environment; Minister For Primary Industries, Natural Resources And Regional Development As Represented By The Department Of Primary Industries And Resources South Australia; Horticulture Australia Limited; Winemakers' Federation Of Australia Inc; The Australian Dried Fruits Association, Inc.
Priority date: 2006-03-21
Filing date: 2007-03-21
Publication date: 2007-09-27

Abstract

Prediction of non-linear properties of fruit samples, such as total anthocyanin concentration in grapes, Brix or pH. A set of data such as near infrared spectroscopic data is obtained from a training set of fruit samples, and from that data a reduced set of variables is derived which have co-correlation with the non-linear property. The reduced set of variables can be partial least squares (PLS) scores obtained by applying a PLS regression, and/or wavelength specific portions of the raw data, determined to have co-correlation with the non-linear property. A feed-forward back-propagation artificial neural network (ANN) is trained by using the reduced set of variables as inputs to the ANN. The ANN, once calibrated, is used to predict the non-linear property in data obtained from a prediction set of fruit samples.

Description

Analysis of grape quality using Neural Network

Cross-Reference to Related Applications

The present application claims priority from Australian Provisional Patent Application No 2006901444 filed on 21 March 2006, the content of which is incorporated herein by reference.

Technical Field

The present invention relates to prediction of fruit properties from a fruit sample, and in particular relates to the prediction of total anthocyanin concentration in fruit such as red wine grapes.

Background of the Invention

Anthocyanins are the natural phenolic glycoside compounds found in the skins of red wine grapes which most strongly influence a red wine's colour. The quality of wine produced from red grapes is related to total anthocyanin concentration. The prediction of °Brix, pH and total anthocyanin concentration as a measure of colour has been reported for grape homogenates using partial least squares (PLS) regression models of their visible and near infrared (NIR) spectra and reference laboratory calibration data. These predictions, particularly that of total anthocyanin concentration, are considered to have the potential to be an indicator of grape quality for use as a payment criterion for the grower. Acceptance of NIR predicted data is, however, contingent upon the NIR PLS method meeting minimum standards of accuracy and robustness, particularly in predictions for early new vintage samples.

The current global PLS method, as used for red-grape homogenates, has been shown to have a number of deficiencies. First, the global PLS method for the prediction of total anthocyanins shows a moderate regression curvature, leading to under-predictions for low and high values. Second, prediction errors are fairly evenly distributed across the full anthocyanin concentration range, leading to high relative errors at low anthocyanin concentrations. Third, prediction of new-season anthocyanin data from previous vintages often results in high errors, with low regression slope and high bias. The first two of these deficiencies in global PLS can be partly corrected by using a local PLS model fitted to only a subset of the data in the vicinity of the sample requiring prediction. Local calibrations also improve the predictions for new vintages, but can be affected by outlier samples within the individual local calibration sets, resulting in high prediction errors.

Part of the problem in predicting new vintage data accurately is that the assumptions about the performance of the global PLS models are often based on cross-validation during calibration training of the PLS model. The leave-one-out cross-validation process uses all the samples in the calibration, in turn, as unknowns, and therefore runs a high risk of significant correlation between samples, in that they are derived from the same vintages with similar properties. Cross-validation error statistics may thus appear optimistic and over-fitting may occur, leading to poorer predictions than suggested by the cross-validation. Another problem is that data for the new season may be non-linearly related to environmental or anthropogenic influences. As a linear regression method, PLS regression may not cope adequately with such non-linear effects.

Thus, prediction of new season data from global PLS calibrations of combined vintage samples is often unsuccessful or inaccurate, particularly without the inclusion of some samples from the new vintage.

Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is solely for the purpose of providing a context for the present invention. It is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed before the priority date of each claim of this application.

Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.

Summary of the Invention

According to a first aspect the present invention provides a method of prediction of at least one non-linear property of a prediction set of fruit samples, the method comprising: obtaining a set of data from a training set of fruit samples; deriving from the set of data a reduced set of variables possessing co-correlation with the at least one non-linear property; training a feed-forward back-propagation artificial neural network (ANN) using the reduced set of variables as inputs to the ANN, to obtain a calibrated ANN; and inputting data from the prediction set of fruit samples into the calibrated ANN to predict the at least one non-linear property of the prediction set of fruit samples.

According to a second aspect the present invention provides a system for predicting at least one non-linear property of a prediction set of fruit samples, the system comprising: a sampling device for obtaining a set of data from a training set of fruit samples; a processor for deriving from the set of data a reduced set of variables possessing co-correlation with the at least one non-linear property, and for training a feed-forward back-propagation artificial neural network (ANN) using the reduced set of variables as inputs to the ANN, to obtain a calibrated ANN; and an input for inputting data from the prediction set of fruit samples into the calibrated ANN to predict the at least one non-linear property of the prediction set of fruit samples.

According to a third aspect the present invention provides a computer program for predicting at least one non-linear property of a prediction set of fruit samples, the computer program comprising: code for obtaining a set of data from a training set of fruit samples; code for deriving from the set of data a reduced set of variables possessing co-correlation with the at least one non-linear property; code for training a feed-forward back-propagation artificial neural network (ANN) using the reduced set of variables as inputs to the ANN, to obtain a calibrated ANN; and code for inputting data from the prediction set of fruit samples into the calibrated ANN to predict the at least one non-linear property of the prediction set of fruit samples.

Preferably, the set of data from the training set of fruit samples comprises spectral and reference data, with the reduced set of variables being derived from the spectral data, and the ANN being validated upon the reference data.

The present invention recognises that such a calibration process for the ANN provides for a calibrated ANN which is sufficiently robust that the ANN calibration may be 'transferred' to a prediction set of fruit samples distinct from the training set of fruit samples.

For example, transferability of the ANN calibration may be exploited in embodiments where the training set of fruit samples comprises fruit samples from one or more preceding fruit seasons, with the prediction set of fruit samples comprising a current fruit season. In such embodiments the data from the prediction set may comprise data obtained from an early part of the current season, and for example predictions thus obtained from the calibrated ANN may be used to determine pricing of the entire current fruit season.

In other embodiments, the ANN calibration may be so transferred from the training set of fruit samples which is obtained from a first region, to the prediction set of fruit samples which is obtained from a second region distinct from the first region. For example, the first region and the second region may be in distinct parts of a single country or state. Alternatively, the first region and second region may be located in different countries or continents.

In still further embodiments, the ANN calibration may be transferred where the training set of fruit samples is obtained by a first type of sampling device, and the prediction set of fruit samples is obtained by a second type of sampling device. For example, the first type of sampling device may comprise a FOSS NIRSystems 6500 NIR spectrometer. The second type of sampling device may comprise a Zeiss NIR spectrometer. Whereas ANN calibrations have in the past proved to be of minimal value when transferred from one type of spectrometer to another, such embodiments of the present invention may provide for a robust calibration which can be so transferred to yield useful predictions.

In yet other embodiments, the ANN calibration may be transferred by utilising homogenate fruit samples as the training set of fruit samples, with the prediction set of fruit samples comprising whole, or substantially whole, fruit.

The at least one non-linear property is preferably related to anthocyanin concentration in the fruit samples. For example the at least one non-linear property may comprise fruit colour, which is closely influenced by anthocyanin concentration and in many applications is an important variable in determining grower payments.

The set of data from the training set of fruit samples, and the data from the prediction set of fruit samples, is preferably NIR data. The set of data from the training set of fruit samples serves the purpose of calibration reference data.

According to a fourth aspect the present invention provides a method of prediction of at least one non-linear property of grape samples, the method comprising: obtaining a set of data from the grape samples; deriving from the set of data a reduced set of variables possessing co-correlation with the at least one non-linear property; and predicting at least one non-linear property of the grape samples with a feedforward back-propagation artificial neural network (ANN) using the reduced set of variables as inputs to the ANN.

The reduced set of variables is preferably derived in a manner to maximise covariance between the reduced set of variables and the at least one non-linear property or analyte.

In some embodiments of the invention, the reduced set of variables may be derived by applying a PLS regression analysis to the set of data, to obtain PLS scores to serve as the reduced set of variables. Preferably, such embodiments of the method further comprise developing an optimal PLS calibration for the set of data and subsequently selecting an optimal subset of the PLS scores or spectral data for ANN prediction.

Back propagation training or calibration of the ANN is preferably carried out on the reduced set of variables. The ANN preferably utilises a sigmoid function for non-linear prediction.

In preferred embodiments of the invention, the set of data comprises spectroscopic NIR data. The present invention has particular application where the at least one non-linear property to be predicted comprises a concentration of total anthocyanin, for example to evaluate a colour of red grape samples. The at least one non-linear property to be predicted may comprise total anthocyanin concentration in red grapes, °Brix or pH.

Embodiments of the invention preferably comprise obtaining between 5 and 25 PLS scores as inputs to the ANN, and for example may comprise obtaining between 10 and 20 PLS scores for use as inputs to the ANN. Most preferably, the method comprises obtaining 12 optimum PLS scores from a total of 25 available for use as inputs to the ANN for total anthocyanins. In such embodiments the ANN preferably adopts a 12:3:1 architecture. Some PLS scores may be de-selected from the full set of PLS scores and alternative ANN architectures may be used. In alternative embodiments, the reduced set of variables input to the ANN may comprise a subset of the raw NIR spectral data obtained from within at least one wavelength region correlating with the at least one non-linear property. In such embodiments, the at least one wavelength region may be determined by a search method to identify one or more wavelength regions having maximum co-correlation with the at least one non-linear property.
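The specification does not say which search method is used to find wavelength regions with maximum co-correlation. As one possible illustration only, the sketch below ranks individual wavelengths by the absolute Pearson correlation between absorbance and the reference property and keeps the top k. The function names, the ranking criterion and the toy data are assumptions made for this example.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    if suu == 0 or svv == 0:
        return 0.0
    return suv / math.sqrt(suu * svv)

def select_wavelengths(spectra, wavelengths, y, k):
    """Return the k wavelengths whose absorbance correlates most strongly with y.

    spectra     : list of samples, each a list of absorbances (one per wavelength)
    wavelengths : list of wavelength labels in nm, same order as the columns
    y           : reference property values, one per sample
    """
    ranked = sorted(
        range(len(wavelengths)),
        key=lambda c: abs(pearson([row[c] for row in spectra], y)),
        reverse=True,
    )
    return [wavelengths[c] for c in sorted(ranked[:k])]

# Hypothetical toy data: the 700 nm column tracks y exactly.
spectra = [[0.1, 1.0, 5.0], [0.1, 2.0, 4.0], [0.1, 3.0, 7.0], [0.1, 4.0, 3.0]]
y = [1.0, 2.0, 3.0, 4.0]
best = select_wavelengths(spectra, [600, 700, 800], y, 1)
```

A region-based search (contiguous windows rather than single wavelengths) would follow the same pattern, scoring each window instead of each column.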

Thus, some embodiments of the present invention may provide for the prediction of total anthocyanin concentration in red-grape homogenates using near-infrared spectroscopy and feed-forward back-propagation neural networks, to predict future grape or fruit colour and assist in valuation of a given grape or fruit harvest.

In a further aspect, the present invention provides a NIR spectrometer for predicting at least one non-linear property of a prediction set of fruit samples, the NIR spectrometer comprising: a memory storing a calibrated ANN, the ANN having been obtained by feedforward back-propagation training upon a reduced set of NIR training variables possessing co-correlation with the at least one non-linear property; means for obtaining NIR spectral data from the prediction set of fruit samples; and a processor for deriving from the NIR spectral data a reduced set of prediction variables and for inputting the reduced set of prediction variables into the calibrated ANN.

The memory may store a plurality of ANNs, each calibrated by training the ANN upon a distinct training set of fruit samples.

Brief Description of the Drawings

An example of the invention will now be described with reference to the accompanying drawings, in which:

Figure 1 illustrates a feed-forward back-propagation ANN model;

Figure 2a is a plot of PLS score 2 vs. PLS score 1 for both the 1999-2003 vintages (CAL samples) and the 2004 vintage (VAL samples), together with histograms of the distribution of each PLS score;

Figure 2b is a plot of PLS score 3 vs. PLS score 1 for both the 1999-2003 vintages (CAL samples) and the 2004 vintage (VAL samples), together with histograms of the distribution of each PLS score;

Figure 3a is a global PLS regression plot for total anthocyanin concentration for leave-one-out cross-validation predicted values vs. reference laboratory values in the 1999-2003 vintage red-grape homogenates calibration;

Figure 3b is a global PLS regression plot for predicted concentrations vs. reference laboratory values in the 2004 vintage samples from 1999-2003 vintage red-grape homogenates;

Figure 4a is a regression plot of a randomly selected set of ANN training 1999-2003 validation data (VLD) for total anthocyanin concentrations vs. reference laboratory values for the ANN calibration model using 1999-2003 sample PLS scores and laboratory reference data;

Figure 4b is a regression plot of ANN predicted anthocyanin concentrations vs. reference laboratory values for the 2004 vintage (TST) samples, using the PLS predicted scores as inputs to the ANN calibration model of a first embodiment of the invention trained on the 1999-2003 data;

Figure 5 is a regression plot for PLS-predicted total anthocyanin concentrations vs. reference laboratory values for the 2004 vintage, obtained from a Win-ISI "LOCAL" PLS model calibrated to the 1999-2003 data;

Figure 6 is a regression plot for predicted total anthocyanin concentrations vs. 2004 vintage reference laboratory values obtained by using full spectrum raw spectral data as inputs to a prior art ANN trained on the 1999-2003 data;

Figure 7 is a regression plot for predicted total anthocyanin concentrations vs. reference laboratory values for the 2004 vintage, using raw spectral data reduced to 20 wavelengths between 604 nm and 904 nm as inputs to an ANN network trained on the 1999-2003 data in accordance with a second embodiment of the present invention;

Figure 8a is a regression plot of ANN training validation data for total anthocyanin concentrations vs. reference laboratory values using a calibration model, different to the present invention, which uses PCA scores and laboratory reference data as inputs to a 1999-2003 vintage ANN model;

Figure 8b is a regression plot of predicted anthocyanin concentrations vs. reference laboratory values for the 2004 vintage samples, using the same 1999-2003 vintage trained PCA + ANN calibration model as for Fig. 8a;

Figures 9a to 9c illustrate the ability of a PLS model, trained on the previous Australian 1999 to 2004 vintage samples, to predict the total anthocyanin concentrations of a non-Australian data set, with Figure 9a depicting the regression between PLS calibration model values versus reference data, and Figures 9b and 9c illustrating the regressions for PLS predictions for a small subset of non-Australian samples and a larger set of non-Australian samples respectively; and

Figures 10a to 10c illustrate the ability of an ANN model trained in accordance with another embodiment of the invention to predict total anthocyanin concentrations of the same non-Australian data sets as for Figures 9a to 9c, with Figure 10a depicting the regression between the ANN training values for the 1999-2004 vintage versus reference data, Figure 10b illustrating the regression for the small Validation subset of non-Australian samples, and Figure 10c illustrating the regression for values predicted for the non-Australian Test set.

Description of the Preferred Embodiments

Non-linearity in prediction algorithms often results when there are significant spectral variations corresponding to low concentration sample spectra compared to those for high concentrations, or when there are non-linear sample matrix scattering/reflection characteristics. The prediction of grape anthocyanin concentration exhibits this phenomenon, whereby a "banana-shaped" regression curve between reference and predicted values is obtained. One solution to this curvature has been to develop specific partial least squares (PLS) calibrations for concentration segments, for example one PLS calibration for low colour and another PLS calibration for high colour, or to use a local PLS algorithm where only the samples most similar to the unknown are used for calibration building and prediction.

The present invention recognises that use of a feed-forward back propagation artificial neural network (ANN) offers an improved chemometric method compared to PLS for dealing with predicting non-linear sample properties from spectra. In the context of this invention, an ANN deals with non-linear NIR spectral responses by weighting the spectral inputs with non-linear sigmoid functions and summing them to derive a nonlinear response.

A simple multilayer perceptron is depicted in Figure 1, which illustrates a feed-forward back-propagation ANN model 100. In the architecture illustrated in Equation 1, I input variables x_i are connected to a single output node (O) through two layers of nodes: a single layer of I (in this case three) input nodes (g_i), connected by weights w_ij, and a single layer of J (in this example also three) hidden nodes (h_j), connected by weighting coefficients W_j. The output from node O is the required predicted value (Y). There are also coefficients a_j and b accounting for modelling bias in the input and hidden layers, and a modelling error term E. The I input variables x_i can be, for example, spectral data and Y the required analyte concentrations (one for each of the I samples in this case).

The output Y of the neural network 100 can be expressed as:

$$Y = f_o\left(b + \sum_{j=1}^{J} W_j\, f_h\left(a_j + \sum_{i=1}^{I} w_{ij} x_i\right)\right) + E \qquad [1]$$

where a_j, b, w_ij, W_j and E are determined by training the network. The coefficients w_11 to w_IJ are the input weights by which each x_i element is multiplied, W_1 to W_J are the weights by which the content of each hidden node h_j is multiplied, and E is an error term. The functions f_h and f_o can be either linear or non-linear; in this study both are the non-linear logistic (sigmoid) function, although a linear function or another non-linear function such as the hyperbolic function can be used. The sigmoid function can be defined as:

$$f(x) = \frac{1}{1 + e^{-x}} \qquad [2]$$

where x is the input data value and e ≈ 2.71828 is the base of the natural logarithm. Although the input and output neurons are interconnected, there are no direct connections between the input, hidden and output nodes within a layer.
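As a minimal sketch of the forward pass, reading Equation 1 as Y = f_o(b + Σ_j W_j f_h(a_j + Σ_i w_ij x_i)) with both functions taken to be the logistic function of Equation 2 and the error term E left out, the calculation could be written as below. The function names and the example weights are illustrative assumptions and are not taken from the specification.

```python
import math

def sigmoid(x):
    """Logistic function of Equation 2: f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def ann_predict(x, w, a, W, b):
    """Forward pass of a one-hidden-layer network (Equation 1 without E).

    x : list of I inputs x_i
    w : I x J nested list, w[i][j] weights input i into hidden node j
    a : list of J hidden-layer bias terms a_j
    W : list of J hidden-to-output weights W_j
    b : output bias
    """
    hidden = [
        sigmoid(a[j] + sum(w[i][j] * x[i] for i in range(len(x))))
        for j in range(len(W))
    ]
    return sigmoid(b + sum(W[j] * hidden[j] for j in range(len(W))))

# Hypothetical 3:3:1 network with all-zero weights: every sigmoid receives 0
# and returns 0.5, so the network output is sigmoid(0) = 0.5.
zero_w = [[0.0] * 3 for _ in range(3)]
y = ann_predict([1.0, 2.0, 3.0], zero_w, [0.0] * 3, [0.0] * 3, 0.0)
```

Because a logistic output node only produces values between 0 and 1, reference values such as anthocyanin concentrations would need to be scaled into that range before training; that scaling step is an assumption of this sketch rather than something the text specifies.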

In order to use the network 100 for prediction, w_ij and W_j must first be determined by calibration training, i.e. adjusting w_ij and W_j for each variable to give a minimum E value. This is usually done by trial and error, by testing the predicted Y values against those determined by reference methods in a validation set, with due regard to overfitting or underfitting the calibration.

The learning rule for trial and error testing is a process called back-propagation, whereby the x_i values are tested one by one in random order and the values of w_ij, W_j, a_j and b are updated each time to calculate and minimise the error E. This error is then back-propagated to the hidden layer and the process repeated for all the w_ij and W_j values. Learning constraints such as learning momentum, learning rate and the choice of minimisation algorithm regulate the speed and stability of convergence.
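The update cycle described above can be illustrated with plain stochastic gradient descent on a squared-error loss. This is a sketch under stated assumptions: it uses a fixed learning rate with no momentum, it is not the Quasi-Newton algorithm used in the experiments, and the function names, starting weight range and toy data are all hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w, a, W, b):
    """Return hidden activations h_j and network output y for one sample."""
    h = [sigmoid(a[j] + sum(w[i][j] * x[i] for i in range(len(x))))
         for j in range(len(W))]
    y = sigmoid(b + sum(W[j] * h[j] for j in range(len(W))))
    return h, y

def train(samples, n_hidden=3, rate=0.5, epochs=2000, seed=0):
    """Back-propagation with per-sample updates; samples are (x, target) pairs."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    a = [0.0] * n_hidden
    W = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b = 0.0
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)  # present the samples one by one in random order
        for x, target in order:
            h, y = forward(x, w, a, W, b)
            # Error signal at the output node (derivative of 0.5 * (y - target)^2).
            delta_out = (y - target) * y * (1.0 - y)
            # Propagate the error back to each hidden node before updating W.
            delta_hid = [delta_out * W[j] * h[j] * (1.0 - h[j])
                         for j in range(n_hidden)]
            for j in range(n_hidden):
                W[j] -= rate * delta_out * h[j]
                a[j] -= rate * delta_hid[j]
                for i in range(n_in):
                    w[i][j] -= rate * delta_hid[j] * x[i]
            b -= rate * delta_out
    return w, a, W, b

def mse(samples, params):
    """Mean squared error of the network defined by params over the samples."""
    return sum((forward(x, *params)[1] - t) ** 2 for x, t in samples) / len(samples)

# Hypothetical toy data: one input, targets already scaled into (0, 1).
samples = [([-1.0], 0.1), ([-0.5], 0.2), ([0.0], 0.5), ([0.5], 0.8), ([1.0], 0.9)]
before = mse(samples, train(samples, epochs=0))    # untrained starting weights
after = mse(samples, train(samples, epochs=2000))  # same start, after training
```

Comparing `before` and `after` on a held-out validation set, rather than on the training samples as here, is what guards against the overfitting discussed in the text.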

In ANN training, there is no restriction on the values of w_ij and W_j, so that an ANN is prone to the risk of overfitting and also lacks the ability to provide qualitative information on the most relevant spectral input information with regard to Y. Further, there are a number of difficulties in ANN calibration training using infrared spectral data. The time taken to train the calibration is dependent on the square of the size of the x-variable data set, and proportional to the number of samples. If the inputs used are simply the raw spectral data, then extremely long computation times and vast computer memory resources are required. The spectral data can be pruned according to input selection methods provided for the particular ANN method used, for example in order to reduce the full spectra to those wavelengths having maximum covariance with the analyte values. However, the present invention exploits the recognition that a particularly applicable technique of reducing computation resources and improving the robustness of the ANN calibration is to use PLS scores as x_i inputs. A PLS algorithm can be expressed as:

$$Y = \sum_{j} b_j \left(\sum_{i} w_{ij} x_i\right) + E = \sum_{j} b_j t_j + E \qquad [3]$$

where the w_ij terms are the (I) PLS loading weights for each factor j and are equivalent to the ANN weights, the b_j terms are the PLS regression coefficients, equivalent to the constants in the ANN model linking the hidden layer nodes to the output, the t_j terms are the PLS scores, equivalent to the ANN input node terms, w_ij and t_j are orthogonal to each other, and there is no non-linear f function. The learning procedure for PLS is quite different to that of ANN in that w_ij is determined to give as much relevance to the scores for prediction as possible, and thus to maximise the covariance between Y and x_i.

By using PLS scores as the x_i inputs to an ANN such as shown in Figure 1, not only is the input data conveniently and easily reduced, often from thousands of points to perhaps 10-20 points, but the data is already pre-processed, for example in the case of scatter correction, with derivative conversion. Furthermore, if the scores are derived from PLS regression, the first few scores are those with maximum covariance with analyte concentration and the corresponding loading weights can provide valuable qualitative spectral information, unlike a standard ANN implementation.
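To make the data reduction concrete, the sketch below computes PLS scores with a basic single-response NIPALS (PLS1) routine and then projects a sample onto the fitted factors to obtain the scores that would be passed to the ANN. It is an illustration only: the specification obtains its scores from Grams PLSplus/AI, and the function names, the mean-centring choice and the tiny data set here are assumptions.

```python
import math

def pls1_fit(X, y, n_factors):
    """Fit a PLS1 model by NIPALS; return the model and the training scores."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[k] for row in X) / n for k in range(p)]
    y_mean = sum(y) / n
    E = [[row[k] - x_mean[k] for k in range(p)] for row in X]  # centred spectra
    f = [v - y_mean for v in y]                                # centred reference
    factors, scores = [], [[] for _ in range(n)]
    for _ in range(n_factors):
        # Loading weights: direction of maximum covariance with the residual y.
        w = [sum(E[i][k] * f[i] for i in range(n)) for k in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        t = [sum(E[i][k] * w[k] for k in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][k] * t[i] for i in range(n)) / tt for k in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt  # regression coefficient
        for i in range(n):  # deflate X and y before extracting the next factor
            for k in range(p):
                E[i][k] -= t[i] * load[k]
            f[i] -= q * t[i]
            scores[i].append(t[i])
        factors.append((w, load, q))
    return {"x_mean": x_mean, "y_mean": y_mean, "factors": factors}, scores

def pls1_scores(model, x):
    """Project one spectrum onto the fitted factors to get its PLS scores."""
    e = [x[k] - model["x_mean"][k] for k in range(len(x))]
    out = []
    for w, load, _ in model["factors"]:
        t = sum(e[k] * w[k] for k in range(len(e)))
        e = [e[k] - t * load[k] for k in range(len(e))]
        out.append(t)
    return out

# Hypothetical data: 5 samples, 3 "wavelengths", reduced to 2 scores per sample.
X = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.0], [3.0, 5.0, 1.0],
     [4.0, 0.0, 2.0], [5.0, 3.0, 4.0]]
y = [1.0, 2.1, 2.9, 4.2, 5.0]
model, train_scores = pls1_fit(X, y, 2)
new_scores = pls1_scores(model, [2.0, 2.0, 2.0])  # scores for an unseen sample
```

The same `pls1_scores` projection is applied to calibration samples and unknowns alike, so that the ANN always receives inputs expressed in the score space of the calibration model.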

In the preferred embodiment, to prepare the PLS score data for ANN inputs, an optimum PLS calibration model is first developed for the raw (or pre-processed) spectral data obtained from the grape samples. This provides a set of PLS scores for a model with maximum covariance between the x and Y data for the optimum number of PLS factors, and for the lowest residual for a linear model. All the PLS calibration statistics thus apply. The scores for the optimised PLS model are then used as the input x_i data for the ANN, together with any reference laboratory Y data to be used for the calibration training and testing. The advantage of this approach is that the ANN is simply weighting the input PLS scores with weighting terms through the sigmoid functions to account for the regression non-linearity. In practice, this requires that the PLS calibration model is used to first predict the scores of the calibration samples and unknowns for the optimum number of factors, and that this data is then entered as inputs to the ANN. The ANN training uses a validation set randomly selected for calibration training and then a separate "test" set for independent validation. In practice, not all PLS scores are used in the ANN calibration since some of the PLS scores account for PLS non-linearity. The following experiments show that, using PLS scores as inputs to an ANN, regressions between reference and predicted "test" data are more linear, more robust, less biased and more accurate than with global PLS, and are faster, more robust and more accurate than when raw full-spectrum data are used as input to an ANN.

Experimental

Samples

Grape samples (n = 3134) from the 1999-2003 vintages and the 2004 vintage (n = 250), comprising 9 varieties and 11 regions, were frozen for two months at -18°C, then thawed and homogenised at room temperature using a Retsch high-speed homogeniser for 20 sec. Aqueous ethanol (10 mL, 50% v/v) was added to extract the anthocyanins from 1 g of the homogenates, and the contents mixed for one hour. The samples were then centrifuged (Universal 32R, Hettich, Tuttlingen, Germany) at 4000 rpm for 10 min. A 200 µL aliquot of the supernatant was acidified with 3.8 mL of 1 M HCl, and the concentration of total anthocyanins (expressed as equivalent malvidin-3-glucoside) was determined using a UV/visible spectrophotometer (Cary 300, Varian, Australia) at 520 nm. A descriptive summary of the samples and reference anthocyanin data, showing year of vintage, number of samples (N), minimum (Min), maximum (Max), median, skewness (Skew) and standard deviation (SD), is presented in Table 1.

Year    N      Min    Max    Median   Skew    SD
1999    146    0.19   1.52   0.59      0.81   0.25
2000    781    0.20   2.28   1.24     -0.09   0.42
2001    1405   0.28   2.65   1.12      0.42   0.43
2002    410    0.13   3.10   1.51     -0.09   0.56
2003    392    0.25   2.88   1.46      0.03   0.58
2004    250    0.23   3.03   1.33      0.52   0.55

Table 1

Spectra

Homogenised sub-samples were scanned on a FOSS NIRSystems 6500 (Foss NIRSystems, Silver Springs, MD, USA) consisting of a monochromator (wavelength range 400-2500 nm), a 10 mm path length quartz cuvette and software for collecting, storing and processing data (Win-ISI II™, Version 1.5, FOSS NIRSystems, Silver Springs, USA, 2000). The NIR spectra and data were imported into Unscrambler™ (CAMO A/S, Trondheim, Norway) software and then exported in JCAMP format into Grams/AI™ Version 7.00 (Thermo-Electron) for global PLS analysis. The spectra were also imported into Win-ISI™ LOCAL (WinISI II™, Version 1.5, FOSS NIRSystems, Silver Springs, USA, 2000) for "LOCAL" PLS analysis together with the laboratory reference data for total anthocyanin concentration, °Brix and pH. Spectral data was not preprocessed, except for mean centring and data point reduction.

Chemometrics

The resulting Grams Multifile from JCAMP import of spectral data and the laboratory reference data were entered into a Grams PLSplus/AI™ spreadsheet for global PLS analysis. PLS regression analysis was carried out on the full spectrum range using the 1999-2003 vintage samples (CAL) for calibration, with "leave-one-out" cross-validation for global PLS training (Grams PLSplus/AI™). A randomly selected subset of samples from the 1999-2003 data was used as a validation (VLD) set during ANN training. The 2004 vintage samples were used as an independent test set for the ANN (TST set), the global Grams/AI (VAL) and LOCAL Win-ISI™ (VAL) validation.

The PLS scores were determined from the global Grams PLSplus/AI™ calibration model for the optimum number of model factors. These scores were copied into an Excel spreadsheet, together with columns of calibration reference data for total anthocyanin concentrations. The Excel spreadsheet was then saved for use as the input to the ANN software (NeuroIntelligence™ Ver-2.1, Alyuda Research Inc). The software provides options for filtering spectral outliers from the input data and the input PLS scores, and for partitioning the data into sets of training (TRN), validation (VLD) and test (TST) samples. Automatic random partitioning was used for the TRN and VLD samples and 300 iterations were performed for each test. For this present study, all the 1999-2003 vintage samples were partitioned randomly as TRN (2350) and VLD (784) samples, and all the 2004 vintage (250) samples were allocated to the TST set. No samples were removed as outliers but data were missing for some samples. ANN training was used to test the number of training iterations, the types of training algorithms, and parameters such as learning rate and randomisation range, and the final selection of scores to be used for the optimum ANN model.
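The partitioning described above can be sketched as a seeded random split. The sample identifiers, function name and seed below are hypothetical; only the set sizes (2350 TRN, 784 VLD, 250 TST) come from the text, and the ANN software's own partitioning routine may differ.

```python
import random

def partition(cal_ids, n_train, seed=0):
    """Randomly split calibration sample ids into training and validation sets."""
    ids = list(cal_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers: 3134 samples from 1999-2003, 250 samples from 2004.
cal_ids = [("1999-2003", k) for k in range(3134)]
tst = [("2004", k) for k in range(250)]  # the whole 2004 vintage is the test set

trn, vld = partition(cal_ids, 2350)
```

Keeping the 2004 vintage entirely out of the random split is what makes the TST set an independent check on how the calibration transfers to a new season.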

Results

Global PLS score versus score plots for the first three PCs for total anthocyanin concentration (mg/g) are presented in Figure 2. In particular, Figures 2a and 2b depict the relationship between PLS score 2 vs. PLS score 1 (Fig. 2a), and PLS score 3 vs. PLS score 1 (Fig. 2b) for both the 1999-2003 vintages and the 2004 vintage red-grape homogenates, together with histograms of the distribution of each PLS score. The histogram plots of Figures 2a and 2b show that there is a significant positive bias of the VAL sample PLS score-2 and PLS score-3 data with respect to those from the CAL set.

The global PLS regression plots obtained are presented in Figure 3 for total anthocyanin concentration, first using leave-one-out cross-validation for the CAL samples, and then predicting for the VAL samples. In particular, Figure 3a is a global PLS regression plot for total anthocyanin concentration (mg/g) for leave-one-out cross-validation predicted values vs. reference laboratory values in the 1999-2003 vintage red-grape homogenates, while Figure 3b is a global PLS regression plot for predicted concentrations vs. reference laboratory values in the 2004 vintage samples from 1999-2003 vintage red-grape homogenates.

Using 12 PLS factors, the standard error of cross-validation (SECV) for the PLS cross-validation for total anthocyanins was 0.16, with a coefficient of determination (R²) of 0.91, a slope of 0.89, an intercept of 0.14, and bias close to zero. The prediction error for the VAL set was of similar magnitude with an SEP = 0.16 and R² = 0.88, but the slope was now 0.78, the intercept was only 0.74, and the bias was very high at 0.44 mg/g. A significant negative curvature was clearly observed in all the PLS plots for total anthocyanin concentration, with values under-predicted for low and high colour samples. The curvature was also significant for the VAL sample predictions, with predicted values as low as 0.4 units below the reference values near 0.50 mg/g total anthocyanin. The curvature could be slightly reduced by using more PLS terms but this could risk overfitting the model and therefore could not be justified.
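The specification reports slope, intercept, bias, SEP and R² without giving their formulas. The sketch below uses common chemometric conventions as an assumption: predicted values are regressed on reference values, bias is the mean of predicted minus reference, SEP is the standard deviation of the bias-corrected residuals, and R² is the squared Pearson correlation. The function name and example values are hypothetical.

```python
import math

def regression_stats(reference, predicted):
    """Summary statistics for a predicted-versus-reference regression plot."""
    n = len(reference)
    mr = sum(reference) / n
    mp = sum(predicted) / n
    sxx = sum((r - mr) ** 2 for r in reference)
    syy = sum((p - mp) ** 2 for p in predicted)
    sxy = sum((r - mr) * (p - mp) for r, p in zip(reference, predicted))
    slope = sxy / sxx                  # least-squares slope of predicted on reference
    intercept = mp - slope * mr
    bias = mp - mr                     # mean(predicted - reference)
    # SEP: spread of the residuals once the bias has been removed (n - 1 divisor).
    sep = math.sqrt(
        sum((p - r - bias) ** 2 for r, p in zip(reference, predicted)) / (n - 1)
    )
    r2 = sxy ** 2 / (sxx * syy)        # squared Pearson correlation
    return {"slope": slope, "intercept": intercept,
            "bias": bias, "sep": sep, "r2": r2}

# Hypothetical values in mg/g: every prediction is exactly 0.5 too high.
stats = regression_stats([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

In this toy case the slope is 1 and the SEP is 0 while the bias is 0.5, which shows how a calibration can look precise and still be systematically offset, as with the 2004 global PLS predictions.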

Figures 3a and 3b thus illustrate the non-linearity, high bias, and lack in accuracy of the global PLS regression.

By contrast, the plots in Figure 4 illustrate the results for the ANN model, depicting the predicted versus reference laboratory values for the model validation, and the predicted versus known reference laboratory values for the test set using the 1999-2003 vintage ANN calibration model. In particular, Figure 4a is a regression plot of ANN training validation for total anthocyanin concentrations vs. reference laboratory values, for the calibration model using the pre-determined PLS scores and laboratory reference data for inputs. Figure 4b is a regression plot of predicted anthocyanin concentrations vs. reference laboratory values for the 2004 vintage, using the 1999-2003 vintage ANN calibration model. For these plots, the PLS scores used for the ANN inputs in accordance with the present invention were predicted from the raw spectra using the global 1999-2003 vintage PLS model. The ANN model, using the PLS scores as inputs and a 12:3:1 architecture over 250 iterations and with a Quasi-Newton minimisation algorithm, resulted in a VAL prediction SEP = 0.18 and R² = 0.90, but with a significantly improved slope of 0.98, almost zero bias, and very little apparent regression curvature.

Figure 5 is a regression plot for predicted total anthocyanin concentrations (mg/g) vs. reference laboratory values for the 2004 vintage, obtained from a Win-ISI "LOCAL" PLS model. This model was calibrated to the 1999-2003 data, in order to further illustrate the performance of the present invention relative to alternative techniques. The regression between predicted versus known total anthocyanin concentrations depicted in Figure 5 shows a plot with a slope still close to one (0.91) but with an R² = 0.82, an increased SEP = 0.23, and bias = 0.10.

Thus, the ANN model for total anthocyanins gives a tighter regression and, in this application, outperforms both the global PLS model and the local PLS model.

The prediction of grape quality parameters, such as total anthocyanin concentration in red-grapes, °Brix and pH, using infrared methods with PLS regression has been proposed for use in payment purposes for the grape-grower. The success of such a payment method is, however, contingent upon the predictions being accurate and the prediction model calibrations obtained from past vintage(s) being useable for the coming vintage.

The present invention recognises two problems which compromise the validity of the infrared method. The first problem is the non-linearity ("banana-shape") of a PLS regression for predicted total anthocyanin concentration, and the second is the change in the chemical and physical characteristics of the new season product compared to previous season(s). When using PLS techniques, these two factors usually require that PLS calibrations must be re-modelled with some new samples each year, in order to tailor the calibration for the range of anthocyanin concentrations in the new season samples. The process of adding new-season samples to the existing calibrations is intended to reduce the regression curvature for total anthocyanin concentration, and hopefully account for other physical changes in the grape matrix with the new season, but is nevertheless an undesirable additional step.

Previously, "local" PLS regression has been proposed to improve regression linearity and accuracy using a large calibration database and focussed "local" calibration samples similar to those from the new season. Figure 5 shows that these requirements are partly met, with the prediction regression essentially linear and the prediction errors for 2004 vintage samples acceptable in comparison to the total anthocyanin predictions from the global PLS depicted in Figure 3b, where there are serious slope and bias problems. However, one difficulty of using local PLS regression for routine predictions is that such software is usually instrument-specific, for example being part of the FOSS NIRSystems 6500 spectrometer system or requiring spectrum data export into Win-ISI format. This presents a challenge in implementing local PLS regression for other instruments, which is an important consideration when selecting the prediction technique for routine use.

The present invention shows that feed-forward back-propagation ANN, when using PLS scores as inputs (Figure 4), outperforms PLS analysis of red-grape parameters from NIR spectra, whether by global PLS analysis (Figure 3) or local PLS analysis (Figure 5). The use of full-spectrum raw spectral points as inputs to the ANN models for training can substantially monopolise or even exceed available computing resources, and significantly increases the time taken for training. The use of PLS scores as inputs is therefore seen as a means of reducing the input data while still retaining a large amount of the spectral information, as well as ensuring that the maximum correlation between the analyte values and the spectral information is contained in the input data.

To confirm the difficulty in using spectral data inputs to an ANN, calibrations using raw spectral data were performed. Initial attempts to train an ANN using all 3000 spectral data points were unsuccessful due to the sheer size of the ANN model, so the raw spectral data were first reduced to each fifth data point prior to input. Figure 6 is a regression plot for predicted total anthocyanin concentrations (mg/g) vs. reference laboratory values obtained by using raw spectral data from the 2004 vintage as inputs to a prior art ANN trained on the 1999-2003 data. The training for raw spectral inputs, using a 210:4:1 ANN architecture for 500 iterations, resulted in a relatively poor prediction, with regression R² = 0.83, SEP = 0.28, slope = 1.13, intercept = 0.27 and bias = 0.45. A major problem identified was the very large curvature in the regression plot, as is clearly visible in Figure 6, this curvature in fact being far larger than for the prediction using global PLS shown in Figure 3b.
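The effect of input count on model size can be made concrete with a short worked calculation. The sketch assumes a fully connected feed-forward network with one bias term per hidden and output unit; the helper name `n_weights` is hypothetical, and the 3000:4:1 entry is an illustrative case for all 3000 spectral points with an assumed 4 hidden units, not an architecture reported in the text.

```python
def n_weights(layers):
    """Count trainable weights and biases in a fully connected network.

    `layers` lists the units per layer, e.g. (12, 3, 1) for a 12:3:1 network.
    Each unit in a layer receives one weight per unit in the previous layer
    plus one bias term (assumed).
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

architectures = {
    "12:3:1 (PLS score inputs)": (12, 3, 1),
    "210:4:1 (raw spectral inputs)": (210, 4, 1),
    "3000:4:1 (illustrative, all points)": (3000, 4, 1),
}
counts = {name: n_weights(layers) for name, layers in architectures.items()}
for name, count in counts.items():
    print(f"{name}: {count} parameters")
```

Under these assumptions the raw-spectrum networks carry roughly twenty to several hundred times as many free parameters as the 12:3:1 network, which is consistent with the longer training times described.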

The present invention recognises that not all PLS scores are required as inputs to the ANN. By optimising the ANN inputs to be those with the highest correlation with the analyte data, the computing requirements and training time can be greatly reduced, and at the same time the robustness of the calibration increased.

As an alternative embodiment of the invention, rather than using PLS scores as ANN inputs, the raw NIR data was pruned, using a step-forward search method, to obtain a data subset including only those wavelength regions correlating most strongly with analyte concentration. The step-forward search revealed an important wavelength region between 654 nm and 904 nm for the best prediction. An ANN was trained using only this wavelength region for input into a 20:5:1 network architecture, trained over 500 iterations with a Quasi-Newton minimisation algorithm. Figure 7 is a regression plot for the predicted total anthocyanin concentrations (mg/g) vs. reference laboratory values for the 2004 vintage, using optimised raw spectral data reduced to the 604 nm to 904 nm wavelength region as inputs to an ANN trained on the 1999-2003 data. The spectral data was first reduced by a factor of 5 before entering the data into the ANN training spreadsheet and then optimised by forward stepwise search to give 20 spectral wavelength points in the 604-904 nm range. This embodiment of the invention yields R² = 0.90, SEP = 0.16, slope = 0.87 and a very low bias = -0.02. Regression curvature was almost completely eliminated, as illustrated by the substantially linear distribution of points in Figure 7. Figure 7 compares well with the results for PLS score inputs to the ANN shown in Figure 4, which possess SEP = 0.18, R² = 0.90, slope = 0.98 and almost zero bias.

The stepwise forward search method used for the test illustrated in Figure 7 is by no means exhaustive and other wavelengths, excluded in this example, may also have a significant predictive contribution. By searching for the optimum wavelengths with the ANN searching algorithm, we are in effect also aiming to force a co-variance between the spectrum inputs and analyte concentrations, similar to that achieved formally by PLS regression, although with a less rigid protocol.
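A greedy step-forward search of this general kind can be sketched as below. This is not the ANN searching algorithm referred to in the text: as a loudly stated simplification, each candidate wavelength is scored with an ordinary least-squares fit instead of an ANN, and the data are synthetic. The function name `forward_select`, the window of 100 candidate points and the two "informative" columns are all hypothetical.

```python
import numpy as np

def forward_select(X, y, n_select):
    """Greedy step-forward search over the columns of X.

    At each step, add the column that most lowers the RMS fit error of a
    least-squares model (a linear stand-in for the ANN scoring step).
    """
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best_col, best_err = None, np.inf
        for col in remaining:
            cols = selected + [col]
            A = np.column_stack([X[:, cols], np.ones(len(y))])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            err = np.sqrt(np.mean((A @ coef - y) ** 2))
            if err < best_err:
                best_col, best_err = col, err
        selected.append(best_col)
        remaining.remove(best_col)
    return selected

rng = np.random.default_rng(0)
spectra = rng.normal(size=(120, 3000))  # synthetic stand-in for raw spectra
thinned = spectra[:, ::5]               # keep each fifth data point
window = thinned[:, 100:200]            # hypothetical candidate region

# Synthetic property depending on two points inside the window.
y = 2.0 * window[:, 30] - 1.0 * window[:, 50] + 0.01 * rng.normal(size=120)

chosen = forward_select(window, y, n_select=2)
print(chosen)
```

Because the search is greedy, a column that is only useful in combination with others can be missed, which matches the remark above that the search is not exhaustive.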

To further assess the importance of using ANN inputs correlated with the property to be predicted, the use of PCA scores as ANN inputs was also assessed. Notably, there is no assurance that PCA scores are in any way correlated with analyte concentration scores. Figure 8a is a regression plot of ANN training 'leave one out' validation for total anthocyanin concentrations (mg/g) vs. reference laboratory values for a calibration model different to the present invention where PCA scores and 1999-2003 vintage laboratory reference data are used for inputs to an ANN. Figure 8b is a regression plot of predicted anthocyanin concentrations vs. reference laboratory values for the 2004 vintage, using the same PCA + ANN calibration model as Fig. 8a and trained on the 1999-2003 data. The PCA scores used for the ANN inputs were predicted from the raw spectra by the 1999-2003 vintage model.

Training validation accuracy for total anthocyanin concentration using the PCA score inputs in the ANN model is shown in Figure 8(a), with R² = 0.94, SEV = 0.11, slope = 0.96 and zero bias. While the calibration accuracy seemed good, the prediction accuracy for the test set was very poor, with R² = 0.81, SEP = 0.36, slope = 1.35 and bias increasing to 0.58. More importantly, the regression plot, as shown in Figure 8(b), showed extreme "S" shaped curvature, suggesting that the modelling was based significantly on non-relevant PCA scores having little correlation with the concentrations to be predicted. While there is no assurance that PCA scores are in any way correlated with analyte concentration, PLS scores are derived according to co-correlations between spectral intensities at certain wavelengths and the analyte concentrations, and thus in accordance with the present invention selected PLS scores and specific wavelength data used as inputs to an ANN yield improved performance.
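The distinction between PCA scores and PLS scores can be illustrated with a small synthetic example. This is a sketch under stated assumptions, not a reproduction of the reported experiment: the three-column data set is invented so that the highest-variance column is unrelated to the property, the first PCA loading is taken from a singular value decomposition, and the first PLS weight vector is taken as the normalised covariance between the centred data and the centred property.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

nuisance = 2.0 * rng.normal(size=n)  # high variance, unrelated to the property
signal = rng.normal(size=n)          # lower variance, drives the property
minor = 0.5 * rng.normal(size=n)
X = np.column_stack([nuisance, signal, minor])
y = signal + 0.1 * rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PCA score: direction of maximum variance in X, chosen without reference to y.
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
pca_score = Xc @ vt[0]

# First PLS score: direction of maximum covariance between X and y.
w = Xc.T @ yc
w /= np.linalg.norm(w)
pls_score = Xc @ w

corr_pca = np.corrcoef(pca_score, y)[0, 1]
corr_pls = np.corrcoef(pls_score, y)[0, 1]
print(f"|corr(PCA score 1, y)| = {abs(corr_pca):.2f}")
print(f"|corr(PLS score 1, y)| = {abs(corr_pls):.2f}")
```

In this constructed case the first PCA score follows the nuisance column and carries almost no information about the property, whereas the first PLS score tracks it closely.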

Thus, the PLS regression analysis for the prediction of total anthocyanins in red-grape homogenates from NIR spectra illustrated in Figure 3 indicates that the prediction of new season data from previous vintage calibrations is often unsuccessful or inaccurate without the inclusion of some samples from the new vintage. Prediction using feed-forward back-propagation ANN illustrated in Figure 4 reveals the improved capability of ANN to model the non-linear covariance between spectral intensities and anthocyanin concentrations. The improved prediction accuracy and robustness of ANN over global PLS regression using PLS scores as inputs to the ANN models is thus demonstrated.

Further, while ANN models using raw full spectra inputs fail to successfully predict new vintage samples, careful wavelength selection based on maximal correlation or covariance with the property in question results in prediction accuracy similar to that obtained from PLS score inputs. The ANN predictions were found to be more linear, robust and more accurate than global PLS for prediction of new vintage data, with regression slope close to one and low prediction bias. Furthermore, the ANN model was found to outperform even local PLS regression for this application.

Thus, the ANN method proposed here exploits the use of PLS scores as inputs to the ANN models, or a wavelength specific subset of the NIR data as inputs to the ANN models, rather than using PCA scores or the full spectrum raw spectral data as inputs to the ANN, to predict new season grape homogenate data.

To further investigate the transferability of the ANN models trained in accordance with the present invention, the ability of artificial neural networks to predict total anthocyanin concentrations in non-Australian red-grape homogenates scanned overseas, using the Australian 1999-2004 calibration data set for ANN training, was tested. Figures 9 and 10 illustrate the results of this testing.

Partial Least-Squares (PLS) and Artificial Neural Networks (ANN) regression were used to predict total anthocyanin concentrations in red-grape homogenates scanned overseas. The NIR homogenate spectra were obtained using a cloned FOSS InfraXact spectrometer during the 2004 vintage season. Total anthocyanin concentrations were predicted using the 1999-2004 vintage Australian FOSS NIRSystems 6500 calibration model, and results compared for PLS and ANN.

PLS scores were generated for all samples in the Australian and non-Australian data sets from the Australian PLS calibration model. Six of these scores (scores 1 to 4, 7 and 9) were selected from the first 18 PLS loading scores. The predicted data was split into three sets:

1. Six scores selected from the first 18 PLS scores calculated for the Australian 1999-2004 data set to be used as the training set, including the reference total anthocyanin concentrations and the PLS-predicted values and scores.

2. Six scores selected from the first 18 PLS scores calculated for a set of 45 overseas samples to be used as a validation set, including the reference total anthocyanin concentrations and the PLS-predicted values and scores.

3. Six scores selected from the first 18 PLS scores calculated for another independent overseas set of 658 samples to be used as an independent Test set, including the reference total anthocyanin concentrations and the PLS-predicted values and scores.
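The selection of six scores and the three-way split described above can be sketched as follows. The score values here are random placeholders rather than real PLS scores, the helper name `select_scores` is hypothetical, and the training set size of 3385 is taken from the N value reported in Table 2; the validation and test sizes are the 45 and 658 overseas samples named in the list.

```python
import numpy as np

SELECTED_SCORES = [1, 2, 3, 4, 7, 9]        # 1-based score numbers from the text
columns = [k - 1 for k in SELECTED_SCORES]  # 0-based column indices

def select_scores(all_scores):
    """Keep only the selected PLS score columns from an (n_samples, 18) array."""
    return all_scores[:, columns]

rng = np.random.default_rng(0)
# Placeholder arrays standing in for the first 18 PLS scores of each set.
score_sets = {
    "training": rng.normal(size=(3385, 18)),   # Australian 1999-2004 data
    "validation": rng.normal(size=(45, 18)),   # small overseas set
    "test": rng.normal(size=(658, 18)),        # independent overseas set
}
ann_inputs = {name: select_scores(scores) for name, scores in score_sets.items()}

for name, inputs in ann_inputs.items():
    print(name, inputs.shape)
```

Six input columns per sample correspond to the six input nodes of the 6:4:1 network used for these predictions.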

Table 2 gives statistics for predictions using a global PLS model. The PLS model was trained by cross-validation using the full 1999-2004 vintage AWRI data set. The predictions were tested on two overseas sample sets: a small set of 45 samples and a larger set of 658 samples used as independent test samples.

Statistic          Calibration-Training set   Validation set        Test set
                   (PLS AWRI 1999-2004)       (PLS overseas data)   (PLS overseas data)
N                  3385                       45                    658
Min                0.13                       0.36                  0.30
Max                3.10                       1.58                  2.18
Mean               1.25                       0.98                  1.00
Median             1.22                       0.94                  0.97
SD                 0.50                       0.31                  0.30
R²                 0.886                      0.430                 0.457
SEC / SEV / SEP    0.16 (SEC)                 0.20 (SEV)            0.20 (SEP)
Slope              0.88                       0.55                  0.59
Intercept          0.13                       0.98                  0.87
Bias               -0.02                      0.54                  0.46

Table 2

Table 3 gives statistics for predictions using a feed-forward ANN model with a 6:4:1 network architecture of the selected PLS score inputs. The ANN model was trained with PLS scores from the full 1999-2004 vintage AWRI data set and validated during training with predicted scores from the 45-sample overseas data set. The predictions were tested on a larger set of 658 overseas samples as an independent test set.

Statistic          Calibration-Training set   Validation set        Test set
                   (ANN AWRI 1999-2004)       (Overseas data)       (Overseas data)
N                  3385                       45                    658
Min                0.13                       0.36                  0.30
Max                3.10                       1.58                  2.18
Mean               1.25                       0.98                  1.00
Median             1.22                       0.94                  0.97
SD                 0.50                       0.31                  0.30
R²                 0.878                      0.716                 0.806
SEC / SEV / SEP    0.16 (SEC)                 0.14 (SEV)            0.12 (SEP)
Slope              0.88                       0.71                  0.79
Intercept          0.16                       0.26                  0.22
Bias               0.00                       -0.03                 0.01

Table 3

The statistics presented in Tables 2 and 3, and the plots of Figures 9 and 10, show that there is a major improvement in the prediction for the ANN model trained in accordance with the present invention, compared to predictions using the normal PLS method.

It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. Notably, while the specific embodiments illustrate the transferability of the trained ANN model from the preceding 1999-2003 seasons to the subsequent 2004 season, other ANN models may in other embodiments of the invention be exploited for transferring an ANN model trained on a training set of fruit data, in order to predict one or more non-linear properties of another set of fruit, the prediction set, based on data from the prediction set of fruit. For example, after being trained on a set of data from fruit in a first region, the trained ANN may be used for predictions of a set of fruit from a second region distinct from the first region. Further, the training data may be obtained by use of a first spectrometer, with the data from the prediction set of fruit being obtained by use of a different model of spectrometer.

The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

CLAIMS:

1. A method of prediction of at least one non-linear property of a prediction set of fruit samples, the method comprising: obtaining a set of data from a training set of fruit samples; deriving from the set of data a reduced set of variables possessing co-correlation with the at least one non-linear property; training a feed-forward back-propagation artificial neural network (ANN) using the reduced set of variables as inputs to the ANN, to obtain a calibrated ANN; and inputting data from the prediction set of fruit samples into the calibrated ANN to predict the at least one non-linear property of the prediction set of fruit samples.

2. The method of claim 1 wherein the set of data from the training set of fruit samples comprises spectral and reference data, with the reduced set of variables being derived from the spectral data, and the ANN being validated upon the reference data.

3. The method of claim 1 or claim 2 wherein the training set of fruit samples comprises fruit samples from one or more preceding fruit seasons, with the prediction set of fruit samples comprising a current fruit season.

4. The method of claim 3 wherein the data from the prediction set comprises data obtained from an early part of the current fruit season.

5. The method of claim 4 wherein predictions obtained from the calibrated ANN are used to determine pricing of substantially the entire current fruit season.

6. The method of any one of claims 1 to 5 wherein the training set of fruit samples is obtained from a first region, and the prediction set of fruit samples is obtained from a second region distinct from the first region.

7. The method of claim 6 wherein the first region and the second region are in distinct parts of a single country or state.

8. The method of claim 6 wherein the first region and second region are located in different countries or continents.

9. The method of any one of claims 1 to 8 wherein the training set of fruit samples is obtained by a first type of sampling device, and the prediction set of fruit samples is obtained by a second type of sampling device distinct from the first type.

10. The method of claim 9 wherein one of the first and second types of sampling device comprises a FOSS NIRSystems 6500 NIR spectrometer, and the other of the first and second types of sampling device comprises a Zeiss NIR spectrometer.

11. The method of any one of claims 1 to 10 wherein the training set of fruit samples comprises homogenate fruit samples, and the prediction set of fruit samples comprises substantially whole fruit.

12. The method of any one of claims 1 to 11 wherein the at least one non-linear property is related to anthocyanin concentration in the fruit samples.

13. The method of claim 12 wherein the at least one non-linear property comprises fruit colour.

14. The method of any one of claims 1 to 13 wherein the set of data from the training set of fruit samples, and the data from the prediction set of fruit samples, is NIR data.

15. A system for predicting at least one non-linear property of a prediction set of fruit samples, the system comprising: a sampling device for obtaining a set of data from a training set of fruit samples; a processor for deriving from the set of data a reduced set of variables possessing co-correlation with the at least one non-linear property, and for training a feed-forward back-propagation artificial neural network (ANN) using the reduced set of variables as inputs to the ANN, to obtain a calibrated ANN; and an input for inputting data from the prediction set of fruit samples into the calibrated ANN to predict the at least one non-linear property of the prediction set of fruit samples.

16. The system of claim 15 wherein the set of data from the training set of fruit samples comprises spectral and reference data, with the reduced set of variables being derived from the spectral data, and the ANN being validated upon the reference data.

17. The system of claim 15 or claim 16 wherein the training set of fruit samples comprises fruit samples from one or more preceding fruit seasons, with the prediction set of fruit samples comprising a current fruit season.

18. The system of claim 17 wherein the data from the prediction set comprises data obtained from an early part of the current fruit season.

19. The system of claim 18 wherein predictions obtained from the calibrated ANN are used to determine pricing of substantially the entire current fruit season.

20. The system of any one of claims 15 to 19 wherein the training set of fruit samples is obtained from a first region, and the prediction set of fruit samples is obtained from a second region distinct from the first region.

21. The system of claim 20 wherein the first region and the second region are in distinct parts of a single country or state.

22. The system of claim 20 wherein the first region and second region are located in different countries or continents.

23. The system of any one of claims 15 to 22 wherein the training set of fruit samples is obtained by a first type of sampling device, and the prediction set of fruit samples is obtained by a second type of sampling device distinct from the first type.

24. The system of claim 23 wherein one of the first and second types of sampling device comprises a FOSS NIRSystems 6500 NIR spectrometer, and the other of the first and second types of sampling device comprises a Zeiss NIR spectrometer.

25. The system of any one of claims 15 to 24 wherein the training set of fruit samples comprises homogenate fruit samples, and the prediction set of fruit samples comprises substantially whole fruit.

26. The system of any one of claims 15 to 25 wherein the at least one non-linear property is related to anthocyanin concentration in the fruit samples.

27. The system of claim 26 wherein the at least one non-linear property comprises fruit colour.

28. The system of any one of claims 15 to 27 wherein the set of data from the training set of fruit samples, and the data from the prediction set of fruit samples, is NIR data.

29. A computer program for predicting at least one non-linear property of a prediction set of fruit samples, the computer program comprising: code for obtaining a set of data from a training set of fruit samples; code for deriving from the set of data a reduced set of variables possessing co-correlation with the at least one non-linear property; code for training a feed-forward back-propagation artificial neural network (ANN) using the reduced set of variables as inputs to the ANN, to obtain a calibrated ANN; and code for inputting data from the prediction set of fruit samples into the calibrated ANN to predict the at least one non-linear property of the prediction set of fruit samples.

30. A method of prediction of at least one non-linear property of grape samples, the method comprising: obtaining a set of data from the grape samples; deriving from the set of data a reduced set of variables possessing co-correlation with the at least one non-linear property; and predicting at least one non-linear property of the grape samples with a feed-forward back-propagation artificial neural network (ANN) using the reduced set of variables as inputs to the ANN.

31. The method of claim 30 wherein the reduced set of variables are derived in a manner to maximise co-variance between the reduced set of variables and the at least one non-linear property or analyte.

32. The method of claim 30 or claim 31 wherein the reduced set of variables is derived by applying a PLS regression analysis to the set of data, to obtain PLS scores to serve as the reduced set of variables.

33. The method of claim 32 further comprising developing an optimal PLS calibration for the set of data and subsequently selecting an optimal subset of the PLS scores or spectral data for ANN prediction.

34. The method of any one of claims 30 to 33 further comprising carrying out back propagation training of the ANN on the reduced set of variables.

35. The method of any one of claims 30 to 34 wherein the ANN utilises a sigmoid function for non-linear prediction.

36. The method of any one of claims 30 to 35 wherein the set of data comprises spectroscopic NIR data.

37. The method of any one of claims 30 to 36 wherein the at least one non-linear property to be predicted comprises at least one of: total anthocyanin concentration in red-grapes, °Brix and pH.

38. The method of any one of claims 30 to 37, further comprising obtaining between 5 and 25 PLS scores as inputs to the ANN.

39. The method of claim 38 wherein between 10 and 20 PLS scores are obtained for use as inputs to the ANN.

40. The method of claim 38 wherein 12 optimum PLS scores are obtained from a total of 25 available for use as inputs to the ANN.

41. The method of claim 40 wherein the ANN adopts a 12:3:1 architecture.

42. The method of any one of claims 30 to 41 wherein the reduced set of variables input to the ANN comprise a subset of raw NIR spectral data, obtained from within at least one wavelength region correlating with the at least one non-linear property.

43. The method of claim 42 wherein the at least one wavelength region is determined by a search method to identify one or more wavelength regions having maximum co-correlation with the at least one non-linear property.

44. A NIR spectrometer for predicting at least one non-linear property of a prediction set of fruit samples, the NIR spectrometer comprising: a memory storing a calibrated ANN, the ANN having been obtained by feed-forward back-propagation training upon a reduced set of NIR training variables possessing co-correlation with the at least one non-linear property; means for obtaining NIR spectral data from the prediction set of fruit samples; and a processor for deriving from the NIR spectral data a reduced set of prediction variables and for inputting the reduced set of prediction variables into the calibrated ANN.

45. The NIR spectrometer of claim 44 wherein the memory stores a plurality of ANNs, each calibrated by training the ANN upon a distinct training set of fruit samples.