CN114842266A - Food image classification method and system, storage medium and terminal - Google Patents

Info

Publication number
CN114842266A
CN114842266A (application CN202210563101.4A)
Authority
CN
China
Prior art keywords
image
food
model
image classification
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210563101.4A
Other languages
Chinese (zh)
Inventor
凌旭峰
樊江玲
梁景新
赵艳妮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI NORMAL UNIVERSITY TIANHUA COLLEGE
Original Assignee
SHANGHAI NORMAL UNIVERSITY TIANHUA COLLEGE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI NORMAL UNIVERSITY TIANHUA COLLEGE filed Critical SHANGHAI NORMAL UNIVERSITY TIANHUA COLLEGE
Priority to CN202210563101.4A priority Critical patent/CN114842266A/en
Publication of CN114842266A publication Critical patent/CN114842266A/en
Pending legal-status Critical Current

Classifications

    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/778 Active pattern-learning, e.g. online learning of image or video features
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The invention provides a food image classification method and system, a storage medium and a terminal, comprising the following steps: training an image feature recognition model based on a self-supervised learning method; extracting image features of the food image based on the image feature recognition model; training a food image classification model from the image features based on a supervised learning method; and sequentially inputting the food images into the image feature recognition model and the food image classification model to obtain food image classification results. The food image classification method and system, storage medium and terminal can extract food image features with a generative self-supervised learning method to recognize and classify food images.

Description

Food image classification method and system, storage medium and terminal
Technical Field
The invention relates to the technical field of image classification, in particular to a food image classification method and system, a storage medium and a terminal.
Background
Food is closely related to daily human life: food production and transportation, food safety, and the food supply chain concern the national economy and people's livelihood, and have grown into enormous industries reaching every corner of human society. In recent years, with the development of e-commerce and intelligent technology, demand for automatic identification of food types has grown in fields such as food processing, vending machines, cafeterias, food-delivery robots, AI dieticians, and automatic checkout, and automatic food recognition technology has drawn researchers' attention. Food image recognition with computer vision has become a popular research area: image processing and machine vision techniques automatically extract image features such as food shape, color, texture, and semantics, then automatically analyze and match them to classify and recognize the food type. However, food image recognition faces the following problems:
(1) food images have complex shapes, large color variation, and complex composition;
(2) food images vary greatly with shooting angle, distance, placement, lighting, and other factors;
(3) the same food can look completely different across regions, flavors, and preparation methods;
(4) recognition accuracy also drops when the same food appears in different recipes.
Food pictures have variable structure, heavy background interference, small inter-class differences, and large intra-class differences, so recognizing food images is harder than recognizing ordinary fine-grained pictures. At present, recognition and classification of food pictures still suffer from low accuracy, poor generalization, and similar problems.
To improve recognition and classification accuracy by fully exploiting the global and local detail information of food pictures, Liang Huagang et al. proposed a fine-grained food picture recognition model built on a multi-stage convolutional feature pyramid. The model consists of a food feature extraction network, an attention-region localization network, and a feature fusion network; it discards heavily interfering background information and extracts features only from the food target region, and, because food pictures vary greatly in scale, adds a feature pyramid structure to each stage of the feature extraction network to improve robustness to target size. Experiments show the model achieves Top-1 accuracies of 91.4%, 82.8% and 90.3% on the Food-101, ChineseFoodNet and Food-172 food picture datasets, respectively.
Zhang Gang et al. proposed a food image recognition method based on a DCNN (Deep Convolutional Neural Network) and transfer learning: network parameters are initialized with a DCNN model pre-trained on the ImageNet image dataset, fine-tuning is used for transfer learning on a self-built small-scale food image dataset to obtain high-level attribute features of the food image, and these DCNN-learned features are fed into a linear support vector machine to classify the food image. Experiments show the method reaches a food image recognition performance of 94.20%, outperforming hand-crafted features such as histograms of oriented gradients and Gabor wavelet transforms.
To address the inability of traditional neural networks to effectively classify highly similar Chinese dishes, Deng Chun et al. proposed a Chinese dish recognition model based on an improved residual network. The algorithm first fuses multi-scale features to extract deep semantic information from the image, then adds an attention layer so that important parts of the image receive more attention, and finally computes inter-class similarity with a triplet loss and feeds the result into a support vector machine for classification. Experiments show the model performs markedly better on a public Chinese dish dataset and on a dataset collected by the research group.
To address the low recognition accuracy of dish images in the food image recognition field, a Chinese dish and ingredient recognition method combining transfer learning with the Inception-V3 model has also been proposed. The method trains the bottleneck layer on top of a pre-trained model and adds a new fully connected layer to recognize and classify dish images from the Chinese dish database VIREO Food-172, which contains 172 food categories. Experiments show that at 50,000 iterations the dish-name recognition accuracy reaches 70.85% and the ingredient recognition rate reaches 56.26%.
Guo Xinyue et al. proposed a deep learning dish image recognition method combining transfer learning with batch normalization. The method uses a pre-trained VGG-16 network as the basis for transfer learning and applies batch normalization to the outputs of some convolutional layers and the fully connected layers, yielding a feature set after scaling and translation. Transfer learning alleviates the overfitting that deep learning would otherwise suffer and yields implicit features more discriminative than hand-crafted ones; batch normalization alleviates the vanishing-gradient problem in deep learning. Experiments show the method's accuracy improves greatly on the Vireo Food-172 and UEC-Food-100 datasets.
A food category recognition algorithm based on color features has also been proposed, addressing the inefficiency of manual price calculation at self-service cafeteria checkouts. The algorithm extracts the target region by edge projection, clusters and partitions the food image based on the Lab color model, obtains color features of the sub-regions with the HSV color model, and identifies the food type from the region colors. Simulation experiments and statistical analysis were performed on 30 images each of class-1 and class-3 foods. The results show the algorithm's recognition accuracy can reach 95.6%, with a recognition speed as fast as 0.119 seconds.
Li Jie et al. proposed a real-time food recognition system on Android devices that recognizes food targets with a deep learning method and uses Android network communication to query and display the food's nutritional parameters. They also proposed a skip convolutional neural network that reduces network complexity while preserving recognition accuracy, sending and receiving information over HTTP. Experimental comparison verifies the system's good food recognition performance.
Yao Weisheng et al. proposed a food image classification model based on self-supervised pre-training, learning food image features more thoroughly through self-supervision. The model builds on DenseFood, a densely connected food image classification network: a semantic-recovery self-supervised strategy is adopted, the trained network weights initialize the DenseFood model, and the classification task is then trained and fine-tuned. Both the semantic-recovery self-supervised strategy and the densely connected network focus on extracting image features, and combining the two learns food image features fully and achieves better classification accuracy. Experiments on the VIREO-172 food dataset show the method outperforms other strategies for the food image classification model.
Lv Yongqiang et al. proposed a few-shot food recognition method that introduces a learnable relation network as the nonlinear metric function of a triplet convolutional neural network. The method uses a triplet network to learn feature embeddings of images, adopts a relation network with stronger discrimination as the nonlinear metric, learns finer-grained intra-class and inter-class distinctions through end-to-end training, and uses an online triplet sampling scheme that stabilizes training; tests were run on the Food-101, VIREO-172 and ChineseFood datasets. Experiments show the method improves on a Siamese-network few-shot learning method by 3.0% on average, and on a triplet neural network with a linear metric function by 1.0% on average.
However, existing food image classification methods based on supervised learning have the following disadvantages:
1) Supervised methods require large manually labelled datasets for support and extensive training on them before a converged model is obtained. Manual labelling is time-consuming and expensive; in special domains such as medicine, labelling samples is extremely costly, and labelling has increasingly become a bottleneck for the development of artificial intelligence.
2) Data labelling discards information. An image is rich in information: besides the labelled object it contains background information, secondary targets, and more, yet a single training task extracts only the labelled information and ignores the other useful information. This information loss distorts the feature extraction of supervised models and weakens their generalization.
Since 2020, self-supervised learning, a special case of unsupervised learning, has gained increasing attention and become one of the most promising research directions. Rather than relying on supervision data, self-supervised learning automatically generates labels from the inter-relations within the data through carefully designed heuristic tasks. The field can be broadly divided into two categories. The first is generative self-supervised learning, e.g. scene de-occlusion, depth estimation, optical flow estimation, and image correspondence matching; the second is discriminative self-supervised learning, with typical methods including jigsaw puzzle solving, motion propagation, rotation prediction, and contrastive learning. Self-supervised learning has two advantages: first, features extracted by self-supervised methods carry image segmentation capability, covering scene, layout, and object boundaries; second, those features need no fine-tuning, linear classifier, or data augmentation, and a basic KNN classifier alone achieves a good classification effect. Self-supervised learning can therefore overcome the shortcomings of supervised learning.
Self-supervised learning divides into discriminative and generative variants. The typical discriminative method is contrastive learning, whose main idea is to learn a feature representation model by automatically constructing similar and dissimilar instances, so that similar instances are drawn together and dissimilar instances pushed apart in a projection space. Generative self-supervised learning mainly trains a deep learning model to reconstruct the covered part of an image. A reconstruction model trained to convergence resembles an image encoder and possesses an image encoder's feature extraction capability. Image reconstruction is a relatively hard task, generally requiring pixel-level reconstruction, so the extracted image features must contain both detail information and global information.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention aims to provide a food image classification method and system, a storage medium and a terminal that extract food image features based on a generative self-supervised learning method so as to recognize and classify food images.
To achieve the above and other related objects, the present invention provides a food image classification method comprising the following steps: training an image feature recognition model based on a self-supervised learning method; extracting image features of the food image based on the image feature recognition model; training a food image classification model from the image features based on a supervised learning method; and sequentially inputting the food images into the image feature recognition model and the food image classification model to obtain food image classification results.
In an embodiment of the present invention, training the image feature recognition model based on the self-supervised learning method includes the following steps:
equally dividing an input image into a plurality of image blocks;
coding each image block to generate a coded value corresponding to the image block;
randomly covering the plurality of image blocks according to a preset proportion, inputting an image block sequence obtained after covering into a BEiT model, and obtaining a feature vector of the covered image block;
inputting the feature vector into an MIM model to obtain an encoded value of the feature vector;
acquiring the error between the encoded values of the covered image blocks and the encoded values of the feature vectors; and if the error is not within a preset range, adjusting the parameter values of the BEiT model and the MIM model until the error is within the preset range, and taking the BEiT model as the image feature recognition model.
In an embodiment of the present invention, a discrete variational encoder is used for encoding each image block, and the discrete variational encoder is configured to convert the image block into a discrete encoded value.
In an embodiment of the present invention, the parameter values of the BEiT model include a batch size of 32, an Adam optimizer, a cosine learning-rate schedule, a warm-up period of 5 epochs, a layer decay of 0.75, a drop path of 0.2, a weight decay of 0.05, and a training period of 200 epochs.
In an embodiment of the present invention, the preset proportion is 40%.
In an embodiment of the present invention, based on the supervised learning method, the training of the food image classification model according to the image features includes the following steps:
acquiring the image characteristics of the food image extracted by the image characteristic identification model;
inputting the image features into the food image classification model;
obtaining a food image classification result output by the food image classification model;
and if the accuracy of the food image classification result does not reach a preset threshold value, adjusting the parameters of the food image classification model until the accuracy of the food image classification result reaches the preset threshold value.
In an embodiment of the present invention, the food image classification model uses a fully connected MLP network.
The invention provides a food image classification system, which comprises an image feature training module, an extraction module, an image classification training module and a detection module;
the image feature training module is used for training an image feature recognition model based on a self-supervised learning method;
the extraction module is used for extracting the image characteristics of the food image based on the image characteristic recognition model;
the image classification training module is used for training a food image classification model according to the image characteristics based on a supervised learning method;
the detection module is used for sequentially inputting food images into the image feature recognition model and the food image classification model to obtain food image classification results.
The present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the food image classification method described above.
The invention provides a food image classification terminal, comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is used for executing the computer program stored in the memory so as to enable the food image classification terminal to execute the food image classification method.
As described above, the food image classification method and system, the storage medium, and the terminal according to the present invention have the following advantages:
(1) food image features can be extracted with a generative self-supervised learning method to recognize and classify food images, applicable to a wide range of scenarios such as supermarkets and cafeterias;
(2) the self-supervised training mode pre-trains the model without a large manually labelled sample set, greatly reducing the labelling workload and breaking through the bottleneck of manually labelled datasets; on top of the pre-trained model, only a second round of training on a labelled dataset is needed, and the converged classifier achieves a good classification effect;
(3) self-supervised training of the integrated self-attention backbone network gives robust feature extraction and strong generalization; compared with a discriminative self-supervised method, the generative method extracts both the detail features and the global semantic features of the image;
(4) for newly added food samples, a knowledge distillation model is adopted: only a small number of new samples and 200 epochs of tuning training are needed to train the new model, greatly improving practical applicability;
(5) the method markedly reduces the computation of pixel-level representation while extracting the global and detail features of the image well, achieving good food image classification and recognition, good robustness, very strong generalization, and practical application value.
Drawings
FIG. 1 is a flowchart illustrating a food image classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a two-channel based generative self-supervised learning algorithm of the present invention in one embodiment;
FIG. 3 is a schematic view of a food product image according to an embodiment of the invention;
FIG. 4(a) is a schematic diagram of the training error and validation error of the present invention in one embodiment;
FIG. 4(b) is a schematic diagram of the training accuracy and validation accuracy of the present invention in one embodiment;
FIG. 5 is a schematic diagram of a food image classification system according to an embodiment of the invention;
fig. 6 is a schematic structural diagram of a food image classification terminal according to an embodiment of the invention.
Description of the element reference numerals
51 image feature training module
52 extraction module
53 image classification training module
54 detection module
61 processor
62 memory
Detailed Description
The following embodiments of the present invention are provided by way of specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
According to the food image classification method and system, storage medium, and terminal of the present invention, the approach borrows from natural language processing. First, the input image is discretely encoded with a discrete variational encoder, forming an encoded representation of the input image's features. Then, in a self-attention channel, a preset proportion of the image blocks is randomly covered and fed into a self-attention-based masking model, and a fully connected classification network maps the self-attention representation onto the discrete image codes. The model is trained until it acquires the ability to reconstruct the covered image blocks, and new food categories can be added flexibly by training the model with knowledge distillation. Food images can thus be recognized and classified accurately from their image features, with high accuracy, low computation, and wide application scenarios.
As shown in fig. 1, in an embodiment, the food image classification method of the present invention includes the following steps:
and step S1, training an image feature recognition model based on an automatic supervision learning method.
Specifically, the method adopts a generating type self-supervision learning algorithm based on double-pass to train an image feature recognition model based on a self-supervision learning method. As shown in fig. 1, training an image feature recognition model based on an unsupervised learning method includes the following steps:
11) an input image is equally divided into a plurality of image blocks.
In particular, the input image X ∈ R^(H×W×C) is equally divided into N = HW/P^2 image blocks x_p ∈ R^(N×(P^2·C)), where (H, W) is the resolution of the input image, C is the number of image channels, and (P, P) is the resolution of each image block. When the input image resolution is 224×224 and each image block is 16×16, the input image is divided into 14×14 = 196 image blocks. All image blocks are stacked to form an image block sequence of length 196.
12) Each image block is encoded to generate the encoded value corresponding to the image block.
Specifically, in the first channel, the image is encoded and decoded with a Discrete Variational Auto-Encoder (dVAE encoder) and decoder, both composed of deep residual networks built from BottleNeck blocks and both obtained by training. Compared with an ordinary VAE, the dVAE differs in two respects. First, the dVAE encoder maps each image block to a dictionary of size 8192, whose distribution is set as a uniform categorical distribution over the vocabulary vectors; because this distribution is discrete, the reparameterization trick cannot be applied, and DALL-E solves the resulting non-differentiability with the Gumbel-Softmax trick. Second, although discrete, the dVAE is essentially identical to a VAE: the picture passes through an encoder to obtain latent variables, which pass through a decoder to reconstruct the original picture. The Tokenizer in the dVAE corresponds to the mean-variance fitting network of a VAE, and the Decoder in the dVAE corresponds to the generator inside a VAE, so the dVAE can be trained in substantially the same way as a VAE. A comparison of the dVAE and the VAE is shown in Table 1.
TABLE 1 Comparison of dVAE and VAE
Model | Encoder | Decoder
VAE | Mean-variance fitting neural network | Generator
dVAE | Tokenizer | Decoder
Specifically, the invention uses a pre-trained dVAE encoder to create a vocabulary for the image; the vocabulary dictionary contains 8192 words in a uniform discrete distribution. That is, each image block is encoded with the dVAE and mapped to a codeword in the 8192-entry dictionary. The dVAE encoder therefore compresses the image greatly while extracting image features, and the decoder reconstructs the original image from the 196 codewords; the quality of the reconstructed image marks the strength of the encoder's feature extraction capability and the decoder's generation capability. The 196 16×16 image blocks obtained by dividing the input image X are fed into the encoder to obtain 196 discrete codes, each taking a value from 0 to 8191. The input image is thus converted into an array of length 196 with values in 0-8191.
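A hedged sketch of this tokenizer channel follows. DVAEEncoder is a toy stand-in for a pre-trained dVAE encoder (nearest-codeword assignment replaces the trained Gumbel-Softmax posterior of the real dVAE); it reuses patchify and x from the previous sketch:

```python
# Map each 16x16 patch to one of 8192 discrete codes, illustrating the encoding step.
import torch

VOCAB_SIZE = 8192  # visual vocabulary size described above

class DVAEEncoder(torch.nn.Module):
    def __init__(self, patch_dim: int = 16 * 16 * 3, vocab_size: int = VOCAB_SIZE):
        super().__init__()
        self.codebook = torch.nn.Parameter(torch.randn(vocab_size, patch_dim))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, patch_dim) -> (B, N) integer codes in [0, 8191]
        cb = self.codebook.unsqueeze(0).expand(patches.size(0), -1, -1)
        return torch.cdist(patches, cb).argmin(dim=-1)

tokens = DVAEEncoder()(patchify(x))  # (1, 196) codes, one per image block
print(tokens.shape, int(tokens.min()), int(tokens.max()))
```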
13) The plurality of image blocks are randomly covered according to a preset proportion, and the image block sequence obtained after covering is input into a BEiT (BERT Pre-training of Image Transformers) model to obtain the feature vectors of the covered image blocks.
Specifically, in the second channel, the image blocks of the input image X are randomly covered at the preset proportion of 40%; in the simplified 14-block example of fig. 2, for instance, 5 image blocks are covered. The covered image blocks and their corresponding position information form an image block sequence, which is input into the BEiT model.
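A minimal sketch of this masking step follows; with the full 196-block input, the 40% ratio covers 78 blocks:

```python
# Randomly cover a preset proportion (40% here) of the image-block sequence.
import torch

def random_mask(num_patches: int, mask_ratio: float = 0.4) -> torch.Tensor:
    """Boolean mask of shape (num_patches,) with ~mask_ratio entries set True."""
    num_masked = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)       # random ordering of block indices
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_masked]] = True           # first num_masked indices are covered
    return mask

mask = random_mask(196)                      # the 196-block input described above
print(int(mask.sum()))                       # 78 blocks covered at a 40% ratio
```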
BEiT is open-sourced on GitHub and provides pre-trained models for two tasks, image classification and image segmentation. All models are pre-trained on ImageNet-22K; the image classification models are fine-tuned on ImageNet-1K and the image segmentation models on ADE20K. The image classification pre-trained models are listed in Table 2 and the image segmentation pre-trained models in Table 3.
TABLE 2 Image classification pre-trained models
Model | Image resolution | acc@1 | acc@5 | Parameters
BEiT-base | 224×224 | 83.7 | 96.6 | 87M
BEiT-base | 224×224 | 85.2 | 97.6 | 87M
BEiT-base | 384×384 | 86.8 | 98.1 | 87M
BEiT-large | 224×224 | 86.0 | 97.6 | 304M
BEiT-large | 224×224 | 87.4 | 98.3 | 304M
BEiT-large | 384×384 | 88.4 | 98.6 | 305M
BEiT-large | 512×512 | 88.6 | 98.66 | 306M
TABLE 3 Image segmentation pre-trained models
(Table 3 appears only as an image in the source publication; its entries are not recoverable here.)
As shown in fig. 2, the black squares represent the covered image blocks; after the sequence is input into the BEiT model, feature vectors are produced for all the image blocks. The feature vectors of the uncovered image blocks are discarded, and the feature vectors h2, h4, h7, h10, h14 of the covered image blocks are retained.
14) The feature vectors are input into an MIM (Masked Image Modeling) model to obtain their encoded values.
Specifically, the feature vectors of the covered image blocks produced by the BEiT model are input into the MIM (Masked Image Modeling) head to obtain the encoded value corresponding to each feature vector.
15) The error between the encoded values of the covered image blocks and the encoded values of the feature vectors is acquired; if the error is not within a preset range, the parameter values of the BEiT model and the MIM model are adjusted until the error is within the preset range, and the BEiT model is taken as the image feature recognition model.
In particular, the goal of the image feature recognition model of the present invention is to reduce, as far as possible, the error between the encoded values of the covered image blocks and the encoded values predicted from the feature vectors. The currently obtained error is therefore checked against a threshold range. When the error is within the preset range, the training target is met and training can end; when it is not, the parameter values of the BEiT model and the MIM model are adjusted and training continues until the error falls within the preset range. After training, only the BEiT model is retained and the MIM model is discarded, yielding the image feature recognition model. In an embodiment of the present invention, the parameter values of the BEiT model include a Batch Size of 32, an Adam optimizer, a cosine learning-rate schedule, a Warmup period of 5 epochs, a Layer Decay of 0.75, a Drop Path of 0.2, a Weight Decay of 0.05, and a training period of 200 epochs.
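A hedged sketch of one pre-training step is given below. The backbone and MIM head are generic stand-ins (a stock TransformerEncoder and a linear layer, not the actual BEiT architecture), AdamW stands in for the optimizer named above so that the stated weight decay of 0.05 decouples correctly, and the learning-rate value is an assumption since the text specifies only a cosine schedule:

```python
# One pre-training step: cross-entropy between the MIM head's predicted codes
# and the dVAE codes at the covered positions drives both models.
import torch

backbone = torch.nn.TransformerEncoder(          # generic stand-in for BEiT
    torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
mim_head = torch.nn.Linear(768, 8192)            # MIM head over the visual vocabulary
params = list(backbone.parameters()) + list(mim_head.parameters())
optimizer = torch.optim.AdamW(params, lr=1.5e-3, weight_decay=0.05)  # lr assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
# (a 5-epoch warm-up would precede the cosine schedule; scheduler.step() is
# called once per epoch outside this function)

def train_step(embeddings, tokens, mask):
    # embeddings: (B, 196, 768) patch embeddings, covered positions replaced by
    # a learned [MASK] embedding; tokens: (B, 196) dVAE codes; mask: (B, 196) bool
    feats = backbone(embeddings)
    logits = mim_head(feats[mask])               # predict codes at covered positions only
    loss = torch.nn.functional.cross_entropy(logits, tokens[mask])
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```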
Step S2: extracting the image features of the food image based on the image feature recognition model.
Specifically, food images are input into a trained image feature recognition model, so that image features of the food images can be extracted.
Step S3: training a food image classification model from the image features based on a supervised learning method.
In particular, the food image classification model is essentially a classification module that gives a food image classification result according to the input image features.
In an embodiment of the present invention, based on the supervised learning method, the training of the food image classification model according to the image features includes the following steps:
31) The image features of the food image extracted by the image feature recognition model are acquired.
32) The image features are input into the food image classification model.
Preferably, the food image classification model employs a fully connected MLP network.
33) The food image classification result output by the food image classification model is acquired.
34) If the accuracy of the food image classification result does not reach a preset threshold, the parameters of the food image classification model are adjusted until the accuracy reaches the preset threshold.
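A minimal sketch of this supervised stage follows; the hidden size (512) and the mean pooling over patch features are assumptions, since the text specifies only a fully connected MLP:

```python
# Train an MLP head on features from the frozen image feature recognition model.
import torch

num_classes = 101                                # e.g. Food-101
classifier = torch.nn.Sequential(
    torch.nn.Linear(768, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, num_classes),
)
opt = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

def classifier_step(feats, labels):
    # feats: (B, 196, 768) from the frozen encoder; labels: (B,) class indices
    logits = classifier(feats.mean(dim=1))       # pool patch features to one vector
    loss = torch.nn.functional.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```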
Step S4: sequentially inputting the food images into the image feature recognition model and the food image classification model to obtain food image classification results.
Specifically, a food image to be classified is input sequentially into the image feature recognition model and then the food image classification model, and the food image classification result output by the food image classification model is obtained.
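Putting the two models together, step S4 might be sketched as follows; it reuses patchify, backbone and classifier from the earlier sketches, and the linear patch-embedding layer embed is an assumed component:

```python
# Two-stage inference: feature recognition model, then classification model.
import torch

embed = torch.nn.Linear(768, 768)  # assumed patch-embedding layer (16*16*3 = 768)

@torch.no_grad()
def classify_food(image: torch.Tensor) -> int:
    """image: (1, 3, 224, 224) tensor; returns the predicted food class index."""
    backbone.eval(); classifier.eval()
    feats = backbone(embed(patchify(image)))     # image feature recognition model
    logits = classifier(feats.mean(dim=1))       # food image classification model
    return int(logits.argmax())
```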
To test the actual effect of the food image classification method of the present invention, experimental verification was performed. The experiments ran on a Linux system (Ubuntu 20.04) on a server with 128 GB of memory and four RTX 2080 Ti GPUs; the code ran on the PyTorch 1.7.1 deep learning platform, and Miniconda3 was used to create an isolated experimental environment. The experimental work included food image dataset selection, image data preprocessing, pre-trained model selection, model tuning, model testing, and comparison experiments.
After comparison, the public dataset Food-101, shown in fig. 3, was selected as the experimental dataset. Food-101, released in 2014 by ETH Zurich, contains 101 food categories with 1,000 images each, 101,000 images in total, and a dataset size of 5.41 GB. The training images were not data-cleaned; they contain noisy data, and some labels are erroneous. The dataset has clear advantages: first, the classes are balanced, which helps model training; second, the intra-class variation is large, which places high demands on the model's feature extraction capability and generalization.
The 1,000 images of each food category were divided into a training set, a validation set and a test set at a 7:2:1 ratio, i.e. 700 training images, 200 validation images and 100 test images per category, used for training, validation, and testing and for evaluating and comparing models. Preferably, the training image data are augmented with the random cropping, horizontal flipping, geometric transformation, and normalization provided by Torchvision, producing 224×224 training, validation, and test datasets of 256-level images. The aim is to give the trained network a richer characterization capability, reduce the performance gap among the validation, training, and test sets, and let the network better learn the data distribution of the dataset.
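A sketch of such a Torchvision preprocessing pipeline follows; the exact crop scale and the ImageNet normalization statistics are assumptions, not taken from the text:

```python
# Training-time augmentation and evaluation-time preprocessing for 224x224 input.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),               # random cropping to 224x224
    transforms.RandomHorizontalFlip(),               # horizontal flipping
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```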
First, comparison experiments were run against the supervised-learning CNN baseline ResNet and against Swin Transformer, the best supervised model of 2021; the comparison results are shown in Table 4.
TABLE 4 Comparison of different algorithms
(Table 4 appears only as an image in the source publication; its key figures are quoted in the paragraph below.)
As the table shows, Dino adopts contrastive self-supervised learning with a ViT backbone network, so its results are good: relying on its attention mechanism, the self-supervised attention model implicitly boosts self-attention over key regions and thus achieves a good feature extraction effect. The Dino_KNN method reaches 76.67% recognition accuracy and the Dino_Linear method 78.90%. However, because Dino's contrastive learning is discriminative self-supervision, its feature extraction capability falls short of generative self-supervision, and its classification accuracy is below the 81.75% of BEiT.
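For reference, the KNN evaluation of frozen features used by the Dino_KNN entry can be sketched as follows; k = 20 and cosine-similarity voting are assumptions:

```python
# Classify test images by majority vote among the k nearest training features.
import torch

def knn_predict(train_feats, train_labels, test_feats, k: int = 20):
    # cosine similarity between L2-normalised feature vectors
    train_feats = torch.nn.functional.normalize(train_feats, dim=-1)
    test_feats = torch.nn.functional.normalize(test_feats, dim=-1)
    sims = test_feats @ train_feats.T            # (num_test, num_train)
    _, idx = sims.topk(k, dim=-1)                # indices of the k nearest neighbours
    votes = train_labels[idx]                    # labels of those neighbours
    return votes.mode(dim=-1).values             # majority vote per test image
```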
When tuning on the training set, the training period was set to 200 epochs; the training-error and validation-error curves are shown in fig. 4(a). As training proceeds, both the training error and the validation error decrease. When the training period reaches about 80 epochs, the validation error reaches its minimum, after which the training error continues to fall; by epoch 150 the training error has converged to about 0.75 and essentially stops decreasing. Empirically, the intersection of the training-error and validation-error curves is the optimal stopping point for model training: training further continues to lower the training error but raises the validation error, putting the model at risk of overfitting. The training and validation accuracy curves are shown in fig. 4(b): accuracy on both sets rises as the number of training epochs increases, the validation accuracy reaches 81.74% at epoch 114, and thereafter it remains essentially stable with almost no further increase.
As shown in fig. 5, in an embodiment, the food image classification system of the present invention includes an image feature training module 51, an extraction module 52, an image classification training module 53 and a detection module 54.
The image feature training module 51 is configured to train an image feature recognition model based on an auto-supervised learning method.
The extraction module 52 is connected to the image feature training module 51, and is configured to extract image features of the food image based on the image feature recognition model.
The image classification training module 53 is connected to the extraction module 52, and is configured to train a food image classification model according to the image features based on a supervised learning method.
The detection module 54 is connected to the image feature training module 51 and the image classification training module 53, and is configured to sequentially input food images into the image feature recognition model and the food image classification model, so as to obtain food image classification results.
The structures and principles of the image feature training module 51, the extraction module 52, the image classification training module 53, and the detection module 54 correspond to the steps in the food image classification method one to one, and thus are not described herein again.
It should be noted that the division of the above apparatus into modules is only a logical division; in actual implementation, all or part of the modules may be integrated into one physical entity or kept physically separate. The modules may all be implemented as software invoked by a processing element, all as hardware, or partly as software invoked by a processing element and partly as hardware. For example, the x module may be a separately established processing element, or it may be integrated into a chip of the apparatus; it may also be stored in the memory of the apparatus as program code, invoked by a processing element of the apparatus to execute the x module's functions. The other modules are implemented similarly. All or part of the modules can be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability; in implementation, each step of the above method, or each module above, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software. The modules may be one or more integrated circuits configured to implement the above method, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). When a module is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of invoking program code. These modules may also be integrated together and implemented as a System-on-Chip (SoC).
The storage medium of the present invention stores a computer program that, when executed by a processor, implements the food image classification method described above. Preferably, the storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, USB flash drives, memory cards, or optical disks.
As shown in fig. 6, in an embodiment, the food image classification terminal of the present invention includes: a processor 61 and a memory 62.
The memory 62 is used for storing computer programs.
The memory 62 includes various media that can store program code, such as ROM, RAM, magnetic disks, USB flash drives, memory cards, or optical disks.
The processor 61 is connected to the memory 62 and configured to execute the computer program stored in the memory, so that the food image classification terminal executes the food image classification method.
Preferably, the processor 61 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the food image classification method and system, storage medium, and terminal of the present invention extract food image features with a generative self-supervised learning method to recognize and classify food images, and can be applied widely in scenarios such as supermarkets and cafeterias. The self-supervised training mode pre-trains the model without a large manually labelled sample set, greatly reducing the labelling workload and breaking through the bottleneck of manually labelled datasets; on top of the pre-trained model, only a second round of training on a labelled dataset is needed for the converged classifier to achieve a good classification effect. Self-supervised training of the integrated self-attention backbone network gives robust feature extraction and strong generalization; compared with a discriminative self-supervised method, the generative method extracts both the detail features and the global semantic features of the image. For newly added food samples, a knowledge distillation model is adopted: only a small number of new samples and 200 epochs of tuning training are needed to train the new model, greatly improving practical applicability. The method markedly reduces the computation of pixel-level representation while extracting the global and detail features of the image well, achieving good food image classification and recognition with good robustness, very strong generalization, and practical application value. The invention thus effectively overcomes various defects of the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall still be covered by the claims of the invention.

Claims (10)

1. A food image classification method is characterized by comprising the following steps:
training an image feature recognition model based on a self-supervised learning method;
extracting image features of the food image based on the image feature recognition model;
training a food image classification model according to the image features based on a supervised learning method;
and sequentially inputting the food images into the image feature recognition model and the food image classification model to obtain food image classification results.
2. The food image classification method according to claim 1, wherein training the image feature recognition model based on the self-supervised learning method comprises the following steps:
equally dividing an input image into a plurality of image blocks;
coding each image block to generate a coded value corresponding to the image block;
randomly covering the plurality of image blocks according to a preset proportion, inputting an image block sequence obtained after covering into a BEiT model, and obtaining a feature vector of the covered image block;
inputting the feature vector into an MIM model to obtain an encoded value of the feature vector;
acquiring the error between the encoded values of the covered image blocks and the encoded values of the feature vectors; and if the error is not within a preset range, adjusting the parameter values of the BEiT model and the MIM model until the error is within the preset range, and taking the BEiT model as the image feature recognition model.
3. The food image classification method according to claim 2, wherein each of the image blocks is encoded with a discrete variational encoder, the discrete variational encoder being configured to convert the image block into a discrete encoded value.
4. The food image classification method according to claim 2, wherein the parameter values of the BEiT model include a batch size of 32, an Adam optimizer, a cosine learning-rate schedule, a warm-up period of 5 epochs, a layer decay of 0.75, a drop path of 0.2, a weight decay of 0.05, and a training period of 200 epochs.
5. The food image classification method according to claim 2, wherein the preset proportion is 40%.
6. The food image classification method according to claim 1, wherein training a food image classification model according to the image features based on a supervised learning method comprises the steps of:
acquiring the image characteristics of the food image extracted by the image characteristic identification model;
inputting the image features into the food image classification model;
obtaining a food image classification result output by the food image classification model;
and if the accuracy of the food image classification result does not reach a preset threshold value, adjusting the parameters of the food image classification model until the accuracy of the food image classification result reaches the preset threshold value.
7. The food image classification method according to claim 6, wherein the food image classification model employs a fully connected MLP network.
8. A food image classification system is characterized by comprising an image feature training module, an extraction module, an image classification training module and a detection module;
the image feature training module is used for training an image feature recognition model based on a self-supervised learning method;
the extraction module is used for extracting the image characteristics of the food image based on the image characteristic recognition model;
the image classification training module is used for training a food image classification model according to the image characteristics based on a supervised learning method;
the detection module is used for sequentially inputting food images into the image feature recognition model and the food image classification model to obtain food image classification results.
9. A storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the food image classification method according to any one of claims 1 to 7.
10. A food image classification terminal, comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is used for executing the computer program stored in the memory to enable the food image classification terminal to execute the food image classification method of any one of claims 1 to 7.
CN202210563101.4A 2022-05-18 2022-05-18 Food image classification method and system, storage medium and terminal Pending CN114842266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210563101.4A CN114842266A (en) 2022-05-18 2022-05-18 Food image classification method and system, storage medium and terminal


Publications (1)

Publication Number Publication Date
CN114842266A true CN114842266A (en) 2022-08-02

Family

ID=82571431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210563101.4A Pending CN114842266A (en) 2022-05-18 2022-05-18 Food image classification method and system, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN114842266A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117665224A (en) * 2024-01-31 2024-03-08 深圳海关食品检验检疫技术中心 Intelligent laboratory management method for food detection
CN117665224B (en) * 2024-01-31 2024-05-28 深圳海关食品检验检疫技术中心 Intelligent laboratory management method for food detection
CN117911795A (en) * 2024-03-18 2024-04-19 杭州食方科技有限公司 Food image recognition method, apparatus, electronic device, and computer-readable medium
CN117911795B (en) * 2024-03-18 2024-06-11 杭州食方科技有限公司 Food image recognition method, apparatus, electronic device, and computer-readable medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination