WO2019229766A1 - Method for prediction of intracellular analyte content class - Google Patents

Method for prediction of intracellular analyte content class

Info

Publication number
WO2019229766A1
Authority
WO
WIPO (PCT)
Prior art keywords
low
analyte
plant
content
plant part
Application number
PCT/IN2019/050412
Other languages
French (fr)
Inventor
Binay PANDA
Neeraja M KRISHNAN
Original Assignee
Panda Binay
Krishnan Neeraja M
Application filed by Panda Binay, Krishnan Neeraja M filed Critical Panda Binay
Publication of WO2019229766A1 publication Critical patent/WO2019229766A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24137 Distances to cluster centroids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/68 Food, e.g. fruit or vegetables


Abstract

The present invention demonstrates a test case using a deep learning approach to estimate the intracellular concentration class (high or low) of a metabolite, azadirachtin. We divided the input data randomly into learning and test sets ten times to avoid sample bias and to optimize the model parameters during cross-validation. Models built from the first training set yielded prediction errors of 12.12% and 6.67% (for leaf) and 13.33% each (for fruit), while the models built from the more robust second training set yielded 19.13% and 15.11% (for leaf) and 8% and 26.67% (for fruit), for the low and high metabolite classes, respectively. The validation accuracies were 67.85% and 70% for the first and second training sets, respectively. We developed a desktop application and a mobile application for real-time prediction of the metabolite class.

Description

METHOD FOR PREDICTION OF INTRACELLULAR
ANALYTE CONTENT CLASS
FIELD OF INVENTION: The present invention generally relates to the field of artificial intelligence and biology. The invention relates to a method for predicting the concentration class of an intracellular analyte using image-based deep learning. More specifically, this invention relates to using images of leaves and/or fruits of a plant to predict an intracellular metabolite's content class in those organs through a deep learning-based artificial intelligence approach utilizing Convolutional Neural Networks (CNNs).
BACKGROUND OF THE INVENTION
Measuring the concentration of an analyte (metabolite, enzyme, protein, chemical moiety, etc.) within plant, animal and microbial cells is a frequent practice in biology. This is routinely achieved using chemical, biochemical, immunological or imaging-based methods with various readouts. Although accurate and precise, many of these methods are time-consuming and require extensive sample handling and preparation, expensive reagents and equipment, and skilled manpower. In circumstances where only a quick estimate of an intracellular metabolite is required, such methods are excessive but currently unavoidable. Ideally, a method would take an image of the organ of choice and provide a quick readout of the intracellular analyte class. Such a method is currently not available.
Machine learning and artificial intelligence-based methods have found their utility in biology. Algorithms based on deep learning learn from raw biological data (genomic data, images, proteins, etc.) to predict patterns in new, unknown data and generate biological inferences. In the current invention, we used the same approach to learn from images of leaves and fruits of sampled plants, alongside their corresponding concentrations of an intracellular metabolite, and to build models that predict the metabolite concentration class of a new incoming leaf or fruit. Metabolites are intermediate products of metabolic reactions mediated by the catalytic action of enzymes. They can be classified as primary or secondary. Examples of primary metabolites are amino acids, vitamins, organic acids, and nucleotides. The cell synthesizes most of the primary metabolites, since they are required for growth. It also produces many secondary metabolites that are not required for primary metabolic processes; examples include drugs, fragrances, dyes, pigments, pesticides and food additives. Secondary metabolites, especially those derived from microbial and plant sources, have a wide range of applications in the agricultural and pharmaceutical industries. The commercial importance of secondary metabolites has grown in recent years. This has prompted exploration of the possibility of producing bioactive plant metabolites synthetically in the lab, either by total synthesis, plant tissue culture or metabolic engineering. The increasing demand for plant-based metabolites as a cheaper, non-toxic and environment-friendly alternative to chemically synthesized products has made it necessary to identify high metabolite-yielding plants. In the current work, we focused on azadirachtin, a potent and versatile tetranortriterpenoid metabolite found predominantly in the fruits and, to a lesser degree, in the leaves of the tree Azadirachta indica.
OBJECT OF THE INVENTION
The main object of the present invention is to provide a quick scientific method to identify organs that have a high concentration of a specific analyte using artificial intelligence. More specifically, we used the method to identify high metabolite-producing plants by correlating images of leaves and/or fruits of a plant to a specific metabolite content in those organs using deep learning with CNNs.
Another object of the present invention is to learn, through deep learning, from images of leaves and fruits of a plant alongside their corresponding intracellular concentration of a metabolite, and to build models that can predict the metabolite concentration class of a new incoming organ, a leaf and/or a fruit.
SUMMARY OF INVENTION
The present invention provides a method for estimation of intracellular analyte content class using image-based deep learning. This invention demonstrates the use of deep learning utilizing the image of an organ to estimate the intracellular concentration class of a specific analyte within its cells.
The main embodiment of the present invention provides a quick scientific method to identify organs that have a high concentration of a specific analyte using artificial intelligence. The method comprises the steps of (i) collection of plant material from a broad area and taking pictures of the representative samples; (ii) estimation of analyte content in selected samples and tabulation of results; (iii) learning and test set classification for the low and high analyte categories; (iv) algorithm development using the TensorFlow library and a CNN architecture; and (v) development of desktop and mobile apps to provide real-time classification of the analyte content by pointing the camera at a leaf or a fruit.
A further embodiment of the invention provides the processing of selected samples for estimation of analyte content.
Another embodiment of the invention provides the learning and test set classification by division of the dataset into ten random learning and test sets, each for the low and high analyte categories respectively, for training and prediction using the TensorFlow machine-learning library. The minimum and maximum values were determined for the analyte content. The average (avg) was determined as the mid-point of the min-max extremes. An analyte value below avg − 0.1 × avg was considered 'low', and a value above avg + 0.1 × avg was considered 'high'. Accordingly, an analyte value above 0.65% was determined 'high', and a value below 0.53% was determined 'low'. A further embodiment of the invention provides the development of the algorithm using the TensorFlow library and a CNN architecture. The tool in this invention is named "AzaClassifier" after the test case of classifying azadirachtin. AzaClassifier (desktop version) was developed using a Java Swing GUI for its native look and is backed by Linux shell scripting and Python 3 in its backend. TensorFlow 1.6.0 packages in Python 3 were used to build the models and to classify leaves and fruits under the above-said categories.
Another embodiment of invention provides for the development of desktop and mobile apps to provide real-time classification of the analyte content by pointing the camera to a leaf or a fruit.
In an alternate embodiment of the present invention, images were augmented by synthetic distortions. The background of the images was removed using an online editing tool, PhotoScissors, and then a Python module, Augmentor, was used to augment the background-removed images with 7 types of distortions, namely: Rotate [rotate (probability=1, max_left_rotation=1..25, max_right_rotation=1..25), rotate90 and rotate270]; Zoom [percentage_area=0.8, or min_factor=1.1 and max_factor=1.5]; Flip [left-right or top-bottom]; Distortion [random (grid_width=4, grid_height=4, magnitude=8) or gaussian (grid_width=5, grid_height=7, magnitude=2, corner="bell", method="in" or "out", mex=0.5, mey=0.5, sdx=0.05, sdy=0.05)]; Crop [random (percentage_area=0.5), or by_size (width=5, 10, 15, 20, 25, 30, 35 or 40; height=6.25, 12.5, 18.75, 25, 31.25, 37.5, 43.75 or 50; centre=True), or crop_centre (percentage_area=0.8, randomise_percentage_area=False)]; Brightness [random (min_factor=0.5, max_factor=1.6)]; and Resize [width=60..220, height=100..375].
BRIEF DESCRIPTION OF DRAWINGS
Figure 1 shows a map of India with the area selected for sampling highlighted (A) and the zoomed-in sampling area where actual sampling was done (B). Figure 2 shows representative images of leaves and fruits of Azadirachta indica for four azadirachtin categories (0.20-0.40, 0.40-0.60, 0.60-0.80 and 0.80-1.00).
Figure 3 shows a desktop interface of the classifier app.
Figure 4 shows a mobile interface of the classifier app.
DETAILED DESCRIPTION OF INVENTION
In describing the embodiments of the invention, specific terminology is resorted to for the sake of clarity. However, it is not intended that the invention be limited to the specific terms so selected, and it is to be understood that each specific term includes all technical equivalents that operate in a similar manner to accomplish a similar purpose. The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying figures and detailed in the following description. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments may be practiced and to further enable those skilled in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
Machine learning and artificial intelligence-based methods have found their utility in biology and medicine (Ma et al. 2014; McKinney et al. 2006; Nesbeth et al. 2016). Algorithms based on deep learning and using raw biological data (genomic data, images, proteins, etc.) can predict patterns within unknown data to generate biological inferences. The present invention relates to a quick scientific method to identify high analyte-producing plants by deep learning from the images of leaves and fruits of a plant alongside their corresponding intracellular concentration of a metabolite, and building models to predict the metabolite concentration class of a new incoming leaf or fruit. The method of the present invention has explored the possibility of estimating the secondary metabolite content class a priori from leaf and/or fruit images using deep learning. The present invention has focused on azadirachtin, a potent and versatile tetranortriterpenoid metabolite found predominantly in the fruits and, to a lesser degree, in the leaves of the tree Azadirachta indica. The different embodiments of the invention herein are further described by taking Azadirachta indica (Neem) and its analyte azadirachtin as a non-limiting example.
PLANT MATERIAL COLLECTION
A total of 209 neem trees were sampled in an area spanning roughly 560,000 sq km (Figure 1). We collected five leaves and fruits per tree. Both leaves and fruits were photographed under controlled light conditions, using a light box with a mounted camera attached to a tripod and focused on the sample. This minimized variability due to image capture conditions. Representative images of leaves and fruits for varying levels of azadirachtin, post measurement, are shown in Figure 2.
AZADIRACHTIN CONTENT ESTIMATION
Azadirachtin (both azadirachtin A and azadirachtin B) was measured following a multi-step process as described below. All plant samples were processed for analysis as soon as they arrived in the laboratory from the field (usually within 48 hrs of collection).
Processing of Fruits
Ripe, mature fruits (yellow) were de-pulped by hand within 48 hrs of collection. The de-pulped seeds were dried under shade for 1-3 days in a well-ventilated (8-10% moisture content) room, well-sealed from any animals and pests. The room was maintained at ambient temperature. Following this, the seeds were divided into two equal batches. One batch was decorticated to yield seed kernels and the other was stored at 4°C. The seed kernels were ground into a fine paste using a mixer-grinder and divided into two equal parts that served as duplicate samples processed in parallel, for both moisture and metabolite content analysis. The fine seed paste was transferred into a 250 ml glass-stoppered flask and 100 ml of methanol was added to the paste. The content was dissolved well using a magnetic stirrer for 2 hrs. The mixture was filtered using Whatman filter paper and the filtrate was collected for use in the next step. About 2 ml of the filtrate was passed through an LC-18 solid-phase extraction tube and eluted with 4 × 2 ml portions of 90% aqueous methanol, and the content was made up to 10 ml in a volumetric flask for HPLC analysis. The HPLC experiment was performed with a reverse-phase Purospher STAR RP-18 column (Merck), using commercial azadirachtin A and azadirachtin B standards. The rest of the filtrate was stored at 4°C for future use. The pre-cleaning procedure was performed using methanol solution (90:10) three times.
Processing of Leaves
Healthy leaves (green and free from any external infection and deformities) were selected and air-dried under similar conditions as the fruits for 3 days, but turned once during drying. The moisture of the room was maintained between 8-10% during the drying process. The dried leaves were ground using a mixer-grinder into a fine powder. About 12 g of the fine leaf powder was taken in a 250 ml glass flask with a stopper, and 100 ml of solution (65:35 ratio of H2O:acetonitrile) was added. The mixture was dissolved using a magnetic stirrer for 2 hrs, the solution was filtered using Whatman filter paper, and the filtrate was centrifuged at 5000 × g to collect the supernatant. Half of the supernatant was stored at 4°C for further use and the rest was taken for metabolite analysis using HPLC. The HPLC experiment was performed with a reverse-phase Purospher STAR RP-18 column (Merck), using commercial azadirachtin A and azadirachtin B standards.
DEEP LEARNING
Learning and Test Set Classification
Appended Table 1 provides sampled trees with tree codes, dimensions of collected fruits (length and diameter in mm), and azadirachtin contents (%) from fruits pooled from each tree. It also highlights the division of the dataset into ten random learning and test sets, each for the low and high azadirachtin categories respectively, for training and prediction using the TensorFlow machine-learning library.
The minimum and maximum values were determined for the azadirachtin content. The average (avg) was determined as the mid-point of the min-max extremes. An azadirachtin value below avg − 0.1 × avg was considered 'low', and a value above avg + 0.1 × avg was considered 'high'. Accordingly, an azadirachtin value above 0.65% was determined 'high', and a value below 0.53% was determined 'low'. Out of 209 trees, 65 belonged to the azadirachtin-high category, 99 belonged to the azadirachtin-low category, and the remaining 45 had intermediate azadirachtin values. The sample numbers for the 'low' and 'high' azadirachtin categories in the ten learning and test sets generated for 10-fold cross-validation are summarised in appended Table 2.
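Expressed as code, the class-assignment rule above amounts to the following. This is a minimal sketch in Python, not the patent's actual implementation; the example minimum and maximum values are hypothetical, chosen only so the derived thresholds land near the reported 0.53% and 0.65%.

```python
def thresholds(contents):
    """Mid-point of the min-max extremes, with a +/-10% exclusion band
    around it (avg - 0.1*avg and avg + 0.1*avg), as described above."""
    avg = (min(contents) + max(contents)) / 2.0
    return avg - 0.1 * avg, avg + 0.1 * avg

def assign_class(value, low_thr, high_thr):
    """Assign an azadirachtin content class; values between the two
    thresholds are intermediate and were left out of training."""
    if value < low_thr:
        return "low"
    if value > high_thr:
        return "high"
    return "intermediate"

# Hypothetical min/max of 0.20% and 0.98% give avg = 0.59%, hence
# thresholds of about 0.53% and 0.65%, matching the reported cut-offs.
low_thr, high_thr = thresholds([0.20, 0.98])
print(assign_class(0.70, low_thr, high_thr))  # -> "high"
```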
Algorithm Development Using Tensor flow Library and CNN Architecture
The tool in this invention is named "AzaClassifier" after the test case of classifying azadirachtin. AzaClassifier (desktop version) was developed using a Java Swing GUI (Horstmann & Cornell 2013) for its native look and is backed by Linux shell scripting and Python 3 in its backend. TensorFlow 1.6.0 packages (Abadi et al. 2016) in Python 3 were used to build the models and classify leaves and fruits under the above-said categories. TensorFlow is an open-source framework for building deep learning neural networks (Schmidhuber 2015) and is very useful for developing applications in biology (Rampasek & Goldenberg 2016). However, there are quite a few challenges when it comes to visual recognition. In particular, for an on-device or embedded application, models must be able to run effectively in an environment where resources like computation, power and space are restricted. The MobileNet architecture (Howard et al. 2017) provides a solution to circumvent this. It is a family of mobile-first computer vision models for TensorFlow, designed specifically for an effective increase in accuracy under the limited resources of an on-device or embedded application. MobileNets are small, low-power, low-latency models designed to deal with the limited resources of a number of use cases. The latest release (1.11) consists of the model definition for MobileNets in TensorFlow using TF-Slim, in addition to 16 pre-trained ImageNet classification checkpoints to be used in mobile projects of any size. Using TensorFlow, azadirachtin prediction models were built for 17 types of configurations: 16 MobileNet configurations (pre-trained deep CNN versions for width multipliers of 0.25, 0.50, 0.75 and 1.0, and image input sizes of 128, 160, 192 and 224) and Inception v3 (a pre-trained deep CNN) (Szegedy et al. 2015). Each model was trained for 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500 and 5000 training steps. AzaClassifier (Android version) was built using Android Studio 3.0 (Smyth 2017), and compiled and run based on TensorFlow Android version 1.6.0.
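Models of this kind are conventionally produced with TensorFlow's stock image-retraining script. The sketch below shows one such invocation for a single configuration; the availability of retrain.py, the images/ directory layout (one subfolder per class) and the output file names are our assumptions for illustration, not details from the patent.

```python
import subprocess

# One MobileNet configuration: width multiplier 0.50, input size 128,
# retrained for 2500 steps on images sorted into low/ and high/ folders.
subprocess.run([
    "python", "retrain.py",                 # TensorFlow image-retraining script (assumed present)
    "--image_dir", "images/",               # hypothetical layout: images/low, images/high
    "--architecture", "mobilenet_0.50_128", # one of the 16 MobileNet variants listed above
    "--how_many_training_steps", "2500",
    "--output_graph", "aza_graph.pb",       # frozen classification graph (name is ours)
    "--output_labels", "aza_labels.txt",
], check=True)
```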
Desktop App
The desktop App was built to classify the azadirachtin content category (low or high) for a single leaf or fruit image, or for multiple leaf or fruit images processed at once (Figure 3). The results were bulk-exported into an Excel sheet, storing the image identity and the predicted azadirachtin category along with a prediction confidence score (a probability ranging from 0 to 1).
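A classification backend of this kind can be sketched with the TensorFlow 1.x frozen-graph API. The graph and label file names, the "input:0"/"final_result:0" tensor names (the convention of TensorFlow's retraining script), the 128-pixel input size, the MobileNet-style pixel scaling and the label order are all assumptions here, not details disclosed in the patent.

```python
import numpy as np
from PIL import Image
import tensorflow as tf  # TensorFlow 1.x API, matching the 1.6.0 release cited above

def load_graph(path="aza_graph.pb"):
    """Load a frozen, retrained classification graph (assumed file name)."""
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, "rb") as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name="")
    return graph

def classify(image_path, graph, labels, input_size=128):
    """Return (category, confidence) for one leaf or fruit image."""
    img = Image.open(image_path).convert("RGB").resize((input_size, input_size))
    # MobileNet-style scaling to [-1, 1] (assumed preprocessing).
    batch = np.expand_dims((np.asarray(img, dtype=np.float32) - 127.5) / 127.5, 0)
    with tf.Session(graph=graph) as sess:
        probs = sess.run("final_result:0", {"input:0": batch})[0]
    best = int(np.argmax(probs))
    return labels[best], float(probs[best])

# Hypothetical bulk use, mirroring the desktop App's spreadsheet export:
# for path in image_paths:
#     rows.append((path,) + classify(path, graph, ["low", "high"]))
```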
Mobile App
The mobile Android version of the App was built to allow live or real-time classification of the azadirachtin content class by pointing the mobile device's camera at a leaf or a fruit (Figure 4). A 'Save' option in the App allows the user to capture the image along with the azadirachtin classification while assigning an identifier to the leaf or fruit. In cases where there is a lot of fluctuation, the same sample may be captured multiple times and the readout consistent across a majority of captures may be taken.
RESULTS
Leaf and Fruit Azadirachtin Models
The model parameters in our case yielded a prediction sensitivity of at least 80% (i.e., an error of at most 20%) in the test set, as highlighted in appended Table 3. The highlighted values (yellow for fruit and green for leaf) reflect the best combination of sensitivity values for both the azadirachtin low and high categories, and the corresponding model parameters were chosen for the leaf and fruit models. Results from all combinations of model parameters are provided in appended Table 4.
A picture is worth a thousand words. Indeed, an image compresses a lot of information that cannot be delineated in a simple manner. In such cases, it is sensible to analyse the image data rather than textual data, the former being more intact in its complexity and the latter relatively lossy. With advances in machine learning and artificial intelligence algorithms, image-based learning has become a routine machine-learning approach to address a wide array of problems, including biological ones (Minervini et al. 2014; Rostam et al. 2017). The performance of any machine-learning approach depends on both the adequacy of the sampled data and the complexity of the underlying model. In cases where the training set data is limited, the chances of the model over-fitting to the training data and not generalizing to other 'unseen' data are greater, while in cases where the model is over-simplified or over-regularized, it under-fits the data. The prediction accuracy of the classifier App that we built was not 100%: we observed prediction errors of 12.12% and 6.67% using the leaf model, and 13.33% each using the fruit model, for the azadirachtin low and high categories, respectively.
In an independent validation set using fruits from 27 trees collected in a different year from an area covering 130,000 sq km, we found the prediction error using the chosen fruit model to be 18.52%. The prediction accuracy is expected to increase with further imaging and metabolite content analyses of neem leaves and fruits, utilizing these first to validate the underlying model parameters and then to enhance the training set, in an iterative manner that minimizes both cross-validation and true validation errors. TensorFlow is an open-source framework for building deep learning neural networks (Schmidhuber 2015) and has found applications in biology (Rampasek & Goldenberg 2016).
In another embodiment of the invention, images of fruits and leaves were augmented by distortions. The methods followed and the results obtained are discussed below.
METHODS
Augmentation of Images by Distortions
The original images were first subjected to background removal using an online editing tool named PhotoScissors. This step was performed to base the classification on the foreground, and not the variable background (white in the case of the training set). A Python module, Augmentor, was used to augment the background-removed images with 7 types of distortions, namely: Rotate [rotate (probability=1, max_left_rotation=1..25, max_right_rotation=1..25), rotate90 and rotate270]; Zoom [percentage_area=0.8, or min_factor=1.1 and max_factor=1.5]; Flip [left-right or top-bottom]; Distortion [random (grid_width=4, grid_height=4, magnitude=8) or gaussian (grid_width=5, grid_height=7, magnitude=2, corner="bell", method="in" or "out", mex=0.5, mey=0.5, sdx=0.05, sdy=0.05)]; Crop [random (percentage_area=0.5), or by_size (width=5, 10, 15, 20, 25, 30, 35 or 40; height=6.25, 12.5, 18.75, 25, 31.25, 37.5, 43.75 or 50; centre=True), or crop_centre (percentage_area=0.8, randomise_percentage_area=False)]; Brightness [random (min_factor=0.5, max_factor=1.6)]; and Resize [width=60..220, height=100..375].
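The augmentation recipe above maps onto the Augmentor API roughly as follows. This is a minimal sketch covering a subset of the listed distortions; the source directory and the sample count are our own placeholders, not values from the patent.

```python
import Augmentor

# Build a pipeline over the background-removed images (placeholder path).
p = Augmentor.Pipeline("images/background_removed/")

# Rotation family, with the parameters quoted in the text.
p.rotate(probability=1.0, max_left_rotation=25, max_right_rotation=25)
p.rotate90(probability=0.5)
p.rotate270(probability=0.5)

# Zoom, flip, elastic distortion, crop and brightness distortions.
p.zoom(probability=0.5, min_factor=1.1, max_factor=1.5)
p.flip_left_right(probability=0.5)
p.random_distortion(probability=0.5, grid_width=4, grid_height=4, magnitude=8)
p.crop_random(probability=0.5, percentage_area=0.5)
p.random_brightness(probability=0.5, min_factor=0.5, max_factor=1.6)

# Emit a fixed number of augmented samples (count is illustrative).
p.sample(1000)
```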
Model Testing
A hundred and sixty MobileNet models, with 4 image resolution options (224, 192, 160 and 128), 4 width multiplier options (0.25, 0.50, 0.75 and 1.0) and 10 training-step options (500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500 and 5000), were trained with the background-removed fruit and leaf images and the 7 distortions, individually, and tested with images not used during training from the respective 8 image types. For each model, the azaL and azaH classification accuracies were recorded. For each distortion type, the presence of at least one model with >58% classification accuracy for both the azaL (low azadirachtin) and azaH (high azadirachtin) categories was ascertained. Following this ascertainment, a hundred and sixty MobileNet models were constructed as above, and ten Inception models (with the above-mentioned ten training-step options) were trained with all eight types of images together and tested with images not used during training.
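The 4 × 4 × 10 grid of configurations described above can be enumerated programmatically. The sketch below is illustrative: train_and_evaluate is a hypothetical helper standing in for one retraining-plus-testing run (for example, wrapping the retrain.py invocation sketched earlier); it is not a function from the patent or from TensorFlow.

```python
from itertools import product

def train_and_evaluate(arch, n_steps):
    """Hypothetical helper: retrain 'arch' for n_steps, evaluate on
    held-out images, and return (azaL_accuracy, azaH_accuracy)."""
    raise NotImplementedError  # placeholder for one retrain-plus-test run

resolutions = [224, 192, 160, 128]
widths = ["0.25", "0.50", "0.75", "1.0"]
steps = list(range(500, 5001, 500))

# e.g. ("mobilenet_0.50_128", 2500); 4 x 4 x 10 = 160 configurations.
configs = [("mobilenet_%s_%d" % (w, r), s)
           for r, w, s in product(resolutions, widths, steps)]
assert len(configs) == 160

results = {cfg: train_and_evaluate(*cfg) for cfg in configs}
```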
For the leaf model, the background-removed images were distorted internally (Crop: 30%, Scale: 30%, and Flip: left and right) using the Python module 'retrain' of the TensorFlow library, followed by training and classification using a hundred and sixty MobileNet models.
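In TensorFlow's retraining script, these built-in distortions correspond to command-line flags. A sketch of such an invocation, under the same assumptions as the earlier training example (retrain.py availability and placeholder paths):

```python
import subprocess

# Built-in distortions during retraining: 30% random crop, 30% random
# scale, and left-right flipping (flag names follow TensorFlow's
# retrain.py; directory and architecture choices are placeholders).
subprocess.run([
    "python", "retrain.py",
    "--image_dir", "images/background_removed/",
    "--architecture", "mobilenet_0.50_128",
    "--how_many_training_steps", "2500",
    "--random_crop", "30",
    "--random_scale", "30",
    "--flip_left_right",
], check=True)
```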
Short-Listing and Choosing the Final Fruit and Leaf Models
Fruit models from the MobileNet and Inception architectures with at least 80% classification accuracy for both the azaL and azaH categories, and leaf models from TensorFlow's in-built distortions while retraining using MobileNet with at least 70% classification accuracy for both categories, were shortlisted. These thresholds were set as the highest possible common classification accuracies for the azaL and azaH categories. The final fruit and leaf models were chosen to be the ones with the highest average classification accuracy across the azaL and azaH categories.
Independent Validation Set
Fruits sampled from 29 trees (five fruits per tree) in Karnataka during May-July 2016 served as an independent validation set. Images of these fruits were subjected to background removal using the online PhotoScissors tool (link referenced above). Since these were images of multiple fruits photographed in a tray, five randomly chosen individual fruits from each image were selected for classification. The same could not be performed for leaves, since there was only a single image of a bunch of leaves, from which it was possible to choose only two representative individual leaf images. The selection of an individual fruit from amongst a group was performed using the Fuzzy Select tool in GIMP (http://gimp.org; v 2.8.16). The per-tree azadirachtin classification was taken to be the classification repeated in a majority of the five individual classifications.
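The per-tree majority rule lends itself to a one-line implementation. A minimal sketch; the function name and the example labels are ours, not from the patent.

```python
from collections import Counter

def per_tree_class(fruit_calls):
    """Majority vote over the five per-fruit classifications of one tree."""
    return Counter(fruit_calls).most_common(1)[0][0]

print(per_tree_class(["high", "low", "high", "high", "low"]))  # -> "high"
```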
RESULTS
Augmentation of images by distortions
The number of images used in training and testing for each distortion type is listed in Table N1. The image numbers are highest for the Rotation distortion type, augmented ~60-fold over the original (i.e., background-removed) images, followed by ~22-fold for Crop and ~17-fold for Resize.
Suitable fruit and leaf models for individual distortion types using the MobileNet architecture
Fruit and leaf models for individual distortion types were deemed suitable if they resulted in at least 58% classification accuracy for the azaL and azaH content categories (Table N2). Among the fruit models, there were 34 suitable ones for Brightness, 35 for Resize, 21 for Crop, 47 for Distortion, 49 for Rotate, 40 for background-removed, 49 for Zoom and 53 for Flip; among the leaf models, there were 13 suitable ones for Brightness, 15 for Resize, 4 for Crop, 15 for Random/Gaussian blur, 1 for Rotate, 21 for background-removed, 24 for Zoom, and 16 for Flip distortion types. This criterion informed the decision to pool the images from all distortion types, for training and classification, to construct near-final models.
Suitable fruit and leaf models for all images, factoring in all distortions post background-removal, using MobileNet and Inception architectures
Fruit and leaf models for all distortion types were deemed suitable if they resulted in at least 70% classification accuracy for the azaL and azaH content categories (Tables N3 and N4). As per these criteria, 15 fruit models were deemed suitable with the MobileNet architecture, and 7 with the Inception architecture. No leaf model passed these criteria with either architecture.
Suitable leaf models for background-removed images, factoring in three distortions built in using the TensorFlow library, using the MobileNet architecture
Six leaf models were deemed suitable, each resulting in at least 70% classification accuracy for the azaL and azaH content categories within the original background-removed test set of leaf images (Table N5).
Final fruit and leaf models of azadirachtin classification
The thresholds for the fruit and leaf models were individually set to the highest possible common classification accuracies for the azaL and azaH categories, at 80% and 70%, respectively. Two fruit models and six leaf models passed these criteria. The final models were chosen to be the one resulting from the Inception v3 architecture at 2500 training steps for the fruit, with an average classification accuracy of 82.88%, and the one from the MobileNet architecture, with built-in distortions while retraining, an image resolution of 128 and a width multiplier of 0.50, at 2500 training steps, for the leaf, with an average classification accuracy of 82.67% (Table N6). Since the leaf model was derived from a training set augmented with only three distortion types, it was less sensitive in classifying aza content class across all seven distortion types. The aza classification accuracy was >70% for both the azaL and azaH categories for the Brightness, Flip, Gaussian/Random blur and Zoom distortion types. For the Resize, Rotate and Crop distortion types, while the azaL classification accuracy was 90.67%, 77.26% and 89.27%, respectively, the azaH classification accuracy remained <50%, at 25.74%, 42.08% and 42.58%, respectively (Table N6).
Independent validation of the fruit model of azadirachtin classification
The final fruit model was validated in ~70% of an independent data set (Table N7) of 28 trees.
Although azadirachtin provides a cheaper, non-toxic and environment-friendly alternative to chemical pesticides, the large variation in azadirachtin content among trees, depending on factors like geography, seed quality, tree age, soil parameters and climatic factors, among others (Chary 2011; Kaushik et al. 2007; Sidhu et al. 2003; Singh et al. 1999; Srivastava et al. 2010), makes a compelling case for predetermining the azadirachtin content before collection of seeds. Interestingly, as described in the above references, there is variation in azadirachtin content between individual trees within a single provenance, and this trend was observed in all of the provenances selected from five agro-climatic regions of India. Such variations among individual trees of a provenance suggest that climatic factors (rainfall, humidity, or temperature) do not solely influence azadirachtin content, and that there are individual genetic differences among the trees sampled for the study. In one study (Singh et al. 1999), the authors noted that the neem germplasms within India constitute a broad genetic base, with genetic similarity coefficients ranging from 0.74 to 0.93. They found several accession-specific regions in the genome, which could be used to distinguish one accession from another. Additionally, total organic synthesis of azadirachtin takes time and provides a low yield of the metabolite (Veitch et al. 2008). Therefore, it is important to identify trees that yield high azadirachtin-bearing fruits. Keeping this in mind, we explored the possibility of estimating azadirachtin content a priori from leaf and/or fruit images using deep learning. Our long-term interest is to understand the complete pathway of metabolite production in neem and to produce it synthetically by metabolic engineering. As we show in the current invention, it is possible to identify fruits and leaves of trees that yield high levels of azadirachtin. Although the pre-trained models developed for metabolite prediction in the desktop and mobile versions of the App are specific to azadirachtin, using a different set of training data along with the developer's version of the App, one can build models to estimate the intracellular concentration class of other analytes.
The above disclosure is non-limiting and modifications and variations are possible without departing from the spirit and scope of the invention. Since other modifications and changes, varied to fit particular operating requirements and environments, are apparent to those skilled in the art, the invention is not considered limited to the example chosen for purposes of disclosure, and covers all changes and modifications which do not constitute departures from the true spirit and scope of this invention.
Having thus described the invention, what is desired to be protected by Letters Patent is presented in the appended claims.
REFERENCES
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, and Zheng X. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467v2.
Chary P. 2011. A comprehensive study on characterization of elite Neem chemotypes through mycofloral, tissue-cultural, ecomorphological and molecular analyses using azadirachtin-A as a biomarker. Physiol Mol Biol Plants 17:49-64. 10.1007/s12298-010-0047-1
Horstmann CS, and Cornell G. 2013. Core Java. Prentice Hall.
Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, and Adam H. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861.
Kaushik N, Singh BJ, Tomar UK, Naik SN, Vir S, Bisla SS, Sharma KK, Banerjee SK, and Thakkar P. 2007. Regional and habitat variability in azadirachtin content of Indian neem. Curr Sci 92:7.
Ma C, Zhang HH, and Wang X. 2014. Machine learning for Big Data analytics in plants. Trends Plant Sci 19:798-808. 10.1016/j.tplants.2014.08.004
McKinney BA, Reif DM, Ritchie MD, and Moore JH. 2006. Machine learning for detecting gene-gene interactions: a review. Appl Bioinformatics 5:77-88.
Minervini M, Abdelsamea MM, and Tsaftaris SA. 2014. Image-based plant phenotyping with incremental learning and active contours. Ecological Informatics 23:14.
Nesbeth DN, Zaikin A, Saka Y, Romano MC, Giuraniuc CV, Kanakov O, and Laptyeva T. 2016. Synthetic biology routes to bio-artificial intelligence. Essays Biochem 60:381-391. 10.1042/EBC20160014
Rampasek L, and Goldenberg A. 2016. TensorFlow: Biology's Gateway to Deep Learning? Cell Syst 2:12-14. 10.1016/j.cels.2016.01.009
Rostam HM, Reynolds PM, Alexander MR, Gadegaard N, and Ghaemmaghami AM. 2017. Image based Machine Learning for identification of macrophage subsets. Sci Rep 7:3521. 10.1038/s41598-017-03780-z
Schmidhuber J. 2015. Deep learning in neural networks: an overview. Neural Netw 61:85-117. 10.1016/j.neunet.2014.09.003
Sidhu OP, Kumar V, and Behl HM. 2003. Variability in Neem (Azadirachta indica) with respect to azadirachtin content. J Agric Food Chem 51:910-915. 10.1021/jf025994m
Singh A, Negi MS, Rajagopal J, Bhatia S, Tomar UK, Srivastava PS, and Lakshmikumaran M. 1999. Assessment of genetic diversity in Azadirachta indica using AFLP markers. Theor Appl Genet 99:8.
Smyth N. 2017. Android Studio 3.0 Development Essentials - Android 8 Edition. CreateSpace Independent Publishing Platform.
Srivastava P, Hazarika RR, Singh M, and Chaturvedi R. 2010. Assessment of Age and Morphometric Parameters of Seeds on Azadirachtin Production in Neem Seed Kernels collected from various Ecotypes. Research J Chemistry and Environment 14:5.
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, and Wojna Z. 2015. Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567.
Veitch GE, Boyer A, and Ley SV. 2008. The azadirachtin story. Angew Chem Int Ed Engl 47:9402-9429. 10.1002/anie.200802675
Table 1. Description of data used for training and prediction of AZA content based on leaf and fruit image models, using the TensorFlow API. Azadirachtin (A+B) category: H: high, M: medium and W: low. The dataset was divided into ten random learning (L) and test (T) sets each for the low and high azadirachtin categories, respectively.
[Table 1 body appears only as page images in the source publication and is not reproduced here.]
Table 2. Division of trees into ten random pairs of learning and test sets for the azadirachtin low and high categories.
[Table 2 body appears only as a page image in the source publication and is not reproduced here.]
Table 3. Combinations of model parameters used in the Aza classifier for both azadirachtin low and azadirachtin high categories.
[Table 3 body appears only as a page image in the source publication and is not reproduced here.]
Table 4. Combinations of all model parameters for both azadirachtin low and azadirachtin high categories. Samples: Fruit: F and Leaf: L; Augmentation and Crop: Y: yes and N: no.
[Table 4 body appears only as page images in the source publication (spanning panels labelled N1-N7) and is not reproduced here.]

Claims

I / WE CLAIM:
1. A novel method for identification of high or low analyte content in a plant part, comprising the steps of:
a) collecting plant parts;
b) taking pictures of the collected plant parts at a central location, or taking pictures of plant parts while they are still attached to the plant;
c) estimating the analyte content class from images of the collected plant parts, and tabulating the predicted results;
d) classifying the images into learning and test sets for the low and high analyte classes;
e) developing a training algorithm to build predictive models, analyse new incoming images of plant parts, and identify high and low analyte content in plant parts; and
f) developing a desktop app and a mobile app to provide, respectively, bulk or individual classification from saved image(s) of collected plant parts, and real-time classification of the analyte content by pointing the camera at the plant part.
2. A novel method for identification of high or low analyte content in a plant part as claimed in claim 1, wherein a large batch of plant parts is collected from a broad geographic area.
3. A novel method for identification of high or low analyte content in a plant part as claimed in claim 2, wherein a total of 209 neem trees were sampled across an area of roughly 560,000 sq km.
4. A novel method for identification of high or low analyte content in a plant part as claimed in claim 1, wherein images of the plant parts were captured under controlled light conditions to minimize variability due to image-capture conditions.
5. A novel method for identification of high or low analyte content in a plant part as claimed in claim 4, wherein sample images were optionally subjected to background removal and to seven types of distortions, namely: Rotate [rotate (probability=1, max left rotation=1..25, max right rotation=1..25), rotate90 and rotate270]; Zoom [percentage area=0.8, or min factor=1.1 and max factor=1.5]; Flip [left-right or top-bottom]; Distortion [random (grid width=4, grid height=4, magnitude=8) or gaussian (grid width=5, grid height=7, magnitude=2, corner="bell", method="in" or "out", mex=0.5, mey=0.5, sdx=0.05, sdy=0.05)]; Crop [random (percentage area=0.5), or by_size (width=5, 10, 15, 20, 25, 30, 35 or 40; height=6.25, 12.5, 18.75, 25, 31.25, 37.5, 43.75 or 50; centre=True), or crop centre (percentage area=0.8, randomise percentage area=False)]; Brightness [random (min factor=0.5, max factor=1.6)]; and Resize [width=60..220, height=100..375] (an illustrative augmentation sketch follows the claims).
6. A novel method for identification of high or low analyte content in a plant part as claimed in claim 1, wherein the actual measurement of analyte content in the sampled plant material is done by a process comprising:
i) air-drying the collected samples for 1-3 days under well-ventilated conditions to 8-10% moisture content;
ii) grinding the dried samples to a fine paste using a mixer-grinder;
iii) dissolving the fine paste in 100 ml of methanol for 2 hours using a magnetic stirrer and filtering the extract through Whatman filter paper;
iv) passing 2 ml of the filtrate through an LC18 solid-phase extraction tube and eluting the analyte with 4 x 2 ml portions of 90% aqueous methanol; and
v) analysing the eluate using High Performance Liquid Chromatography (HPLC) with a reverse-phase Purospher STAR RP-18 column (Merck) and a commercial analyte standard.
7. A novel method for identification of high or low analyte content in a plant part as claimed in claim 1, wherein dimensions of the collected plant parts and analyte contents (%) from plant parts pooled from each tree were divided into ten random learning and test sets each for the low and high analyte categories respectively, for training and prediction using the TensorFlow machine-learning library (an illustrative splitting sketch follows the claims).
8. A novel method for identification of high or low analyte content in a plant part as claimed in claim 1, wherein the algorithm to train on and predict high and low analyte content in plant material is developed using the TensorFlow library and a CNN architecture.
9. A novel method for identification of high or low analyte content in a plant part as claimed in claim 1, wherein a desktop app and a mobile app are developed to provide, respectively, bulk or individual classification of saved images of plant material, and real-time classification of the analyte content by pointing the camera at the plant material.
10. A novel method for identification of high or low analyte content in a plant part as claimed in claim 9, wherein the desktop App is used to classify the analyte content category (low or high) for a single plant part image, or for multiple plant part images processed at once.
11. A novel method for identification of high or low analyte content in a plant part as claimed in claim 9, wherein the mobile Android version of the App is used for live, real-time classification of the analyte content class (low or high) by pointing the mobile device's camera at a plant part (an illustrative inference sketch follows the claims).
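The distortion names and parameter spellings recited in claim 5 correspond closely to the Python Augmentor library's API, so a pipeline covering a subset of those operations can be sketched as follows. This is a non-authoritative illustration; the input directory and sample count are assumptions.

# Hedged sketch of claim-5-style augmentation using the Augmentor library
# (pip install Augmentor); "images/low" is a hypothetical class directory.
import Augmentor

p = Augmentor.Pipeline("images/low")

# Rotate: small random rotations plus fixed 90/270-degree turns.
p.rotate(probability=1.0, max_left_rotation=25, max_right_rotation=25)
p.rotate90(probability=0.5)
p.rotate270(probability=0.5)

# Zoom: into a random sub-area, or by a random factor.
p.zoom_random(probability=0.5, percentage_area=0.8)
p.zoom(probability=0.5, min_factor=1.1, max_factor=1.5)

# Flip left-right or top-bottom.
p.flip_left_right(probability=0.5)
p.flip_top_bottom(probability=0.5)

# Elastic grid distortion.
p.random_distortion(probability=0.5, grid_width=4, grid_height=4, magnitude=8)

# Crop and brightness, with the values recited in the claim.
p.crop_random(probability=0.5, percentage_area=0.5)
p.random_brightness(probability=0.5, min_factor=0.5, max_factor=1.6)

# Resize to one fixed working resolution (endpoints of the claimed ranges).
p.resize(probability=1.0, width=220, height=375)

p.sample(1000)  # writes 1,000 augmented images to an output subdirectory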
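Similarly, the ten-way random division into learning and test sets recited in claim 7 can be illustrated with a short sketch. The tree identifiers and the 80/20 split ratio below are assumptions, as the claim does not fix them.

# Hedged sketch: ten random learning (L) / test (T) splits per category.
import random

def ten_random_splits(tree_ids, test_fraction=0.2, seed=0):
    rng = random.Random(seed)
    n_test = max(1, int(len(tree_ids) * test_fraction))
    splits = []
    for _ in range(10):
        shuffled = list(tree_ids)
        rng.shuffle(shuffled)
        splits.append({"learning": shuffled[n_test:], "test": shuffled[:n_test]})
    return splits

low_trees = ["low_%d" % i for i in range(1, 41)]    # hypothetical tree IDs
high_trees = ["high_%d" % i for i in range(1, 41)]  # hypothetical tree IDs
splits = {"low": ten_random_splits(low_trees),
          "high": ten_random_splits(high_trees)}
print(len(splits["low"]), "splits per category")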
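Finally, for the real-time, on-device classification recited in claim 11, a converted TensorFlow Lite model is one plausible mechanism; the source does not specify the mobile runtime. The sketch below shows single-image inference in Python (the Android App would use the equivalent Java/Kotlin interpreter); the model filename, the 224x224 input size and the [low, high] output order are assumptions.

# Hedged sketch: single-frame inference with a TensorFlow Lite model.
import numpy as np
import tensorflow as tf
from PIL import Image

interpreter = tf.lite.Interpreter(model_path="aza_classifier.tflite")  # assumed file
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify(image_path):
    # Resize and scale the frame the same way the training pipeline did.
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    x = np.expand_dims(np.asarray(img, dtype=np.float32) / 255.0, axis=0)
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    probs = interpreter.get_tensor(out["index"])[0]
    return "high" if probs[-1] >= 0.5 else "low"  # assumed [low, high] order

print(classify("fruit.jpg"))  # hypothetical camera frame saved to disk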
PCT/IN2019/050412 2018-05-27 2019-05-25 Method for prediction of intracellular analyte content class WO2019229766A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN201841019772 2018-05-27
IN201841019772 2018-05-27

Publications (1)

Publication Number Publication Date
WO2019229766A1 true WO2019229766A1 (en) 2019-12-05

Family

ID=68697887

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2019/050412 WO2019229766A1 (en) 2018-05-27 2019-05-25 Method for prediction of intracellular analyte content class

Country Status (1)

Country Link
WO (1) WO2019229766A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3013749A1 (en) * 2016-02-04 2017-08-10 Gemmacert Ltd. System and method for qualifying plant material

Similar Documents

Publication Publication Date Title
Xie et al. A deep-learning-based real-time detector for grape leaf diseases using improved convolutional neural networks
Zhang et al. A review of advanced technologies and development for hyperspectral-based plant disease detection in the past three decades
Niazian et al. Machine learning for plant breeding and biotechnology
Chen et al. A smartphone-based application for scale pest detection using multiple-object detection methods
Aquino et al. vitisFlower®: development and testing of a novel android-smartphone application for assessing the number of grapevine flowers per inflorescence using artificial vision techniques
Niazian et al. Image processing and artificial neural network-based models to measure and predict physical properties of embryogenic callus and number of somatic embryos in ajowan (Trachyspermum ammi (L.) Sprague)
Alkhudaydi et al. An exploration of deep-learning based phenotypic analysis to detect spike regions in field conditions for UK bread wheat
Wang et al. Convolutional neural networks in computer vision for grain crop phenotyping: A review
Garg et al. CROPCARE: an intelligent real-time sustainable IoT system for crop disease detection using mobile vision
Rico-Chávez et al. Machine learning for plant stress modeling: A perspective towards hormesis management
Kurup et al. Capsule network for plant disease and plant species classification
Ma et al. Stress distribution analysis on hyperspectral corn leaf images for improved phenotyping quality
Bruno et al. Improving plant disease classification by adaptive minimal ensembling
Cho et al. Potential of snapshot-type hyperspectral imagery using support vector classifier for the classification of tomatoes maturity
Liu et al. Development of a mobile application for identification of grapevine (Vitis vinifera L.) cultivars via deep learning
Singh et al. Semantic segmentation of in-field cotton bolls from the sky using deep convolutional neural networks
Chen et al. CropQuant-Air: an AI-powered system to enable phenotypic analysis of yield-and performance-related traits using wheat canopy imagery collected by low-cost drones
Mohammed et al. Development and validation of innovative machine learning models for predicting date palm mite infestation on fruits
Miart et al. MuSeeQ, a novel supervised image analysis tool for the simultaneous phenotyping of the soluble mucilage and seed morphometric parameters
Narla et al. Multiple feature-based tomato plant leaf disease classification using SVM classifier
Teng et al. Panicle-cloud: An open and ai-powered cloud computing platform for quantifying rice panicles from drone-collected imagery to enable the classification of yield production in rice
Bhowmik et al. RiceCloud: a cloud integrated ensemble learning based rice leaf diseases prediction system
Márquez et al. Cannabis varieties can be distinguished by achene shape using geometric morphometrics
WO2019229766A1 (en) Method for prediction of intracellular analyte content class
Hu et al. Class-attention-based lesion proposal convolutional neural network for strawberry diseases identification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19812500

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19812500

Country of ref document: EP

Kind code of ref document: A1
