CN114758167B

CN114758167B - Dish identification method based on self-adaptive contrast learning

Info

Publication number: CN114758167B
Application number: CN202210163470.4A
Authority: CN
Inventors: 胡海苗; 徐振博; 黄龚; 姜宏旭; 李明竹
Original assignee: Hangzhou Shifang Technology Co ltd; Hangzhou Innovation Research Institute of Beihang University
Current assignee: Hangzhou Shifang Technology Co ltd; Hangzhou Innovation Research Institute of Beihang University
Priority date: 2022-02-22
Filing date: 2022-02-22
Publication date: 2024-04-26
Anticipated expiration: 2042-02-22
Also published as: CN114758167A

Abstract

The invention relates to a dish identification method based on adaptive contrast learning. Unlike traditional dish identification methods, it relies on a neural network trained with adaptive contrast learning, requires no online training, and places lower demands on the inference environment. It provides a multi-scale triplet loss function that lets the neural network adaptively learn losses for differences at different scales, and thereby better distinguish fine differences among dishes. The multi-scale triplet loss function consists of triplet loss functions with three boundaries and a maximum-selection function, so the boundary value of the triplet loss is selected adaptively. Through adaptive contrast learning, the invention realizes offline inference for dish identification, which is not restricted by the set of dish categories, can cope with real-time changes in categories, and greatly reduces the computing power required in the deployment environment. The invention also introduces automatic deletion of low-similarity samples in the feedback process, so that the dish identification method can run stably over a long period.

Description

Dish identification method based on self-adaptive contrast learning

Technical Field

The invention relates to a dish identification method based on self-adaptive contrast learning.

Background

Existing classical dish identification methods typically classify dishes with a neural network and are updated by retraining the parameters of that network. This has the disadvantages of high computing-power demand and long training time, and it depends on cloud or edge servers. Because training the network parameters takes a long time, traditional methods cannot add new dishes in real time. Traditional schemes based on contrast learning usually ignore the degree of similarity between dishes and compute the loss function with a single fixed boundary value, so the features predicted by the feature extraction network are not very discriminative. In addition, dish identification schemes based on contrast learning tend to accumulate errors during identification, so identification accuracy deteriorates with use.

Disclosure of Invention

It is an object of the present invention to address at least one of the above problems and/or disadvantages and to provide at least the advantages described below.

It is still another object of the present invention to provide a dish identification method based on adaptive contrast learning that improves the discriminability of the features predicted by the feature extraction network by using a triplet loss function with adaptive boundaries, and thereby ensures high dish identification accuracy. By introducing a strategy of automatically deleting low-similarity samples, the problem of error accumulation during dish identification is effectively addressed.

To achieve these objects and other advantages and in accordance with the purpose of the invention, a dish identification method based on adaptive contrast learning is provided. In the training process, a feature extraction model is trained with an adaptive contrast learning loss function: for each triplet, triplet losses based on three different boundaries are calculated simultaneously, and the largest of the three loss values is selected for back propagation. After training, the neural network parameters are fixed and only inference is performed, so no training is needed to update the parameters. In the inference stage, to prevent error accumulation, automatic deletion of low-similarity samples is introduced in the feedback process, so that the dish identification method can run stably over a long period.

The input of the training process comprises a plurality of dish categories, with at least two images in each category. Two images of the same category and one image of a different category form a triplet. During training, the triplet losses based on multiple boundaries are calculated simultaneously for each triplet, and the largest of these loss values is selected for back propagation.

Preferably, consider a triplet (a, p, n), where a and p belong to the same dish category and n belongs to a different dish category. The larger boundary triplet loss function is $L_B = \max\{d(a,p) - d(a,n) + M_B,\ 0\}$, where $M_B$ is the larger boundary constant. The medium boundary triplet loss function is $L_I = g \cdot \max\{d(a,p) - d(a,n) + M_I,\ 0\}$, where $M_I$ is the medium boundary constant. The smaller boundary triplet loss function is $L_S = f \cdot \max\{d(a,p) - d(a,n) + M_S,\ 0\}$, where $M_S$ is the smaller boundary constant and $f$, $g$ are constants. The adaptive contrast learning loss function is $L = \max\{L_B, L_I, L_S\}$.
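As an illustration only, the loss for a single triplet might be computed as in the sketch below. The boundary values M_B = 1.0, M_I = 0.6 and M_S = 0.35 are assumptions chosen for the example (the text does not state them); g = 2 and f = 4 follow the preferred values given in the detailed description. The function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance d(x, y) between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def adaptive_triplet_loss(a, p, n, m_b=1.0, m_i=0.6, m_s=0.35, g=2.0, f=4.0):
    """Return L = max{L_B, L_I, L_S} for one triplet (a, p, n).

    m_b, m_i, m_s are assumed boundary values (not given in the source);
    g and f default to the preferred constants 2 and 4.
    """
    gap = euclidean(a, p) - euclidean(a, n)
    l_b = max(gap + m_b, 0.0)        # larger boundary, weight 1
    l_i = g * max(gap + m_i, 0.0)    # medium boundary, weight g
    l_s = f * max(gap + m_s, 0.0)    # smaller boundary, weight f
    return max(l_b, l_i, l_s)


anchor, positive = [0.0, 0.0], [0.3, 0.0]
# Negative far from the anchor: only the larger boundary is still violated.
easy = adaptive_triplet_loss(anchor, positive, [1.0, 0.0])
# Negative closer than the positive: the weighted smaller-boundary term dominates.
hard = adaptive_triplet_loss(anchor, positive, [0.1, 0.0])
# Negative very far away: all three terms are zero.
satisfied = adaptive_triplet_loss(anchor, positive, [2.0, 0.0])
```

With these assumed values, the larger-boundary term is the maximum for well-separated triplets, while the more heavily weighted smaller-boundary term takes over as the negative approaches the anchor, which is one way to read "selecting the boundary adaptively".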

The inference phase consists of three processes: a feature extraction process, a comparison process and a feedback process. First, in the feature extraction process, features are extracted from the input image by the feature extraction model optimized in the training stage, giving a feature M. Next, in the comparison process, all features cached in the feature cache region are retrieved and their distances to the current feature are calculated as a measure of similarity; the category of the cached feature at the minimum distance D from the current feature is taken as the recognition result. Finally, in the feedback process, if the minimum distance D is smaller than the threshold T, the currently identified feature is stored in the feature cache region; otherwise it is discarded. This completes the inference process.
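The comparison and feedback processes can be written compactly as follows. The notation for the cache contents, $(F_k, c_k)$ for the k-th cached feature and its category, and $\mathcal{C}$ for the cache itself, is an assumption introduced here for illustration; M, D, T and d are as in the text.

```latex
% Assumed notation: the cache holds K pairs (F_k, c_k) of feature and category.
k^{*} = \arg\min_{1 \le k \le K} d(M, F_k), \qquad
D = d(M, F_{k^{*}}), \qquad
\hat{c} = c_{k^{*}}

% Feedback: the cache grows only when the match is close enough.
\mathcal{C} \leftarrow
\begin{cases}
\mathcal{C} \cup \{(M, \hat{c})\}, & D < T \\
\mathcal{C}, & D \ge T
\end{cases}
```

Here $\hat{c}$ is the recognition result. A feature whose best match is not closer than T is still assigned a category, but it is not written back, so an uncertain match cannot influence later identifications.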

Preferably, the training process further comprises a data enhancement step that preprocesses the dish images: randomly flipping the input image horizontally and/or vertically, and adding random contrast, saturation or brightness noise to the input image.
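A minimal sketch of such a data enhancement step is given below, using plain Python lists so that it runs without an image library. The image representation (rows of RGB tuples with values in [0, 1]), the probabilities of 0.5, the noise strength of 0.2 and all function names are assumptions for illustration; the source gives only the symbols Q1, Q2 and Q3.

```python
import random


def hflip(img):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]


def vflip(img):
    """Reverse the row order (top-bottom flip)."""
    return img[::-1]


def clamp(v):
    """Keep a channel value inside [0, 1]."""
    return min(1.0, max(0.0, v))


def jitter_brightness(img, factor):
    """Scale every channel value by the same factor."""
    return [[tuple(clamp(c * factor) for c in px) for px in row] for row in img]


def jitter_contrast(img, factor):
    """Scale channel values around the mean intensity of the whole image."""
    vals = [c for row in img for px in row for c in px]
    mean = sum(vals) / len(vals)
    return [[tuple(clamp(mean + (c - mean) * factor) for c in px) for px in row]
            for row in img]


def jitter_saturation(img, factor):
    """Scale each pixel's channels around that pixel's own gray level."""
    out = []
    for row in img:
        new_row = []
        for px in row:
            gray = sum(px) / len(px)
            new_row.append(tuple(clamp(gray + (c - gray) * factor) for c in px))
        out.append(new_row)
    return out


def augment(img, rng, q1=0.5, q2=0.5, q3=0.5, strength=0.2):
    """Random flips, then contrast, saturation and brightness noise in sequence.

    q1, q2, q3 play the role of Q1, Q2, Q3; their values here are assumed.
    """
    if rng.random() < q1:
        img = hflip(img)
    if rng.random() < q2:
        img = vflip(img)
    for jitter in (jitter_contrast, jitter_saturation, jitter_brightness):
        if rng.random() < q3:
            img = jitter(img, 1.0 + rng.uniform(-strength, strength))
    return img


# A tiny 2x2 RGB image used only to exercise the functions.
sample = [[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)],
          [(0.7, 0.8, 0.9), (1.0, 0.0, 0.5)]]
augmented = augment(sample, random.Random(0))
```

In practice an image library would normally supply these transforms; the sketch only makes the order of operations concrete.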

The invention has at least the following beneficial effects. Because the adaptive contrast learning loss function is introduced in the training stage, loss functions with different boundaries are selected for different triplets, so that the neural network achieves a better contrast learning effect and the accuracy of dish identification is improved. The neural network parameters are fixed and only inference is performed, with no training to update the parameters, which greatly reduces the computing power required of the computing equipment. In the inference stage, to prevent error accumulation, automatic deletion of low-similarity samples is introduced in the feedback process, so that the dish identification method can run stably over a long period.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.

Drawings

FIG. 1 is a training flow chart of a dish identification method based on adaptive contrast learning in one embodiment of the invention;

FIG. 2 is a flowchart of an application of a dish identification method based on adaptive contrast learning according to an embodiment of the present invention;

FIG. 3 is a graph of a loss function calculation for adaptive contrast learning according to one embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings, so that those skilled in the art can practice it with reference to the description.

It will be understood that terms, such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or groups thereof.

Figs. 1 and 2 illustrate a dish identification method based on adaptive contrast learning according to an embodiment of the present invention. In the training process, the triplet losses for the three boundaries are calculated for each triplet, and the largest loss value is selected as the final loss value of that triplet. The final loss value is used to optimize the neural network parameters. In the inference process, features are first extracted from an input image by the feature extraction model optimized in the training stage; distances reflecting the degree of similarity are then calculated between the extracted feature and all features cached in the feature cache region, and the category of the cached feature at the minimum distance D from the current feature is taken as the recognition result. Then, if the minimum distance D is smaller than the threshold T, the currently identified feature is stored in the feature cache region; otherwise it is discarded. This completes the inference process.

The dish identification method based on self-adaptive contrast learning adopts the feature extraction network ResNet18, and is implemented as follows:

1. Training process

Randomly selecting 32 different dish categories from the training set, then randomly taking 8 pictures from each category, for 256 pictures in total, and carrying out data enhancement on them, comprising the following steps:

Step one, horizontally flipping the 256 pictures with probability Q1 to obtain 256 randomly horizontally flipped pictures;

Step two, vertically flipping the 256 pictures obtained in step one with probability Q2 to obtain 256 randomly vertically flipped pictures;

Step three, adding random contrast noise, saturation noise and brightness noise in sequence, with probability Q3, to the 256 pictures obtained in step two to obtain 256 pictures with random noise added;

Step four, resampling the images and normalizing the pixel values: resampling the 256 pictures obtained in step three to obtain 256 pictures 224 pixels in both width and height, and normalizing the pixel values of each picture to between 0 and 1;

Step five, inputting the 256 resampled and normalized pictures into the ResNet18 network to obtain features of size (256, 1000);

Step six, finding all triplets (a, p, n) present in the 256 pictures according to their dish IDs, where a is a feature extracted from a template picture, p is a feature extracted from any input picture of the same category as a, and n is a feature extracted from any input picture of a different category from a. For each triplet, we calculate the larger boundary triplet loss $L_B = \max\{d(a,p) - d(a,n) + M_B,\ 0\}$, the medium boundary triplet loss $L_I = g \cdot \max\{d(a,p) - d(a,n) + M_I,\ 0\}$, and the smaller boundary triplet loss $L_S = f \cdot \max\{d(a,p) - d(a,n) + M_S,\ 0\}$, where g and f are constants, preferably 2 and 4, respectively. $d(x, y)$ is the Euclidean distance between x and y. Subscripts B, I and S denote the larger, medium and smaller boundaries, respectively. Then, for each triplet (a, p, n), $L = \max\{L_B, L_I, L_S\}$ is kept as the final loss;

Step seven, calculating the gradients of the neural network parameters from the final loss and optimizing the parameters of the model with the AdamW optimizer.
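The batch construction and triplet search in the steps above can be sketched as follows. This is an illustrative reading, not the patented implementation: it assumes every picture in the batch may serve as the template (anchor) picture, counts (a, p) and (p, a) as different triplets, and uses hypothetical function names and a hypothetical `images_by_category` mapping from dish ID to a list of pictures.

```python
import random
from itertools import permutations


def sample_batch(images_by_category, rng, num_categories=32, per_category=8):
    """Pick num_categories dish categories, then per_category pictures from each."""
    categories = rng.sample(sorted(images_by_category), num_categories)
    batch = []
    for category in categories:
        for picture in rng.sample(images_by_category[category], per_category):
            batch.append((category, picture))
    return batch


def enumerate_triplets(labels):
    """Yield index triples (a, p, n): a and p share a label, n has another one."""
    indices = range(len(labels))
    for a, p in permutations(indices, 2):   # ordered pairs with a != p
        if labels[a] != labels[p]:
            continue
        for n in indices:
            if labels[n] != labels[a]:
                yield a, p, n


def count_triplets(num_categories, per_category):
    """Closed-form count of the triples produced by enumerate_triplets."""
    total = num_categories * per_category
    return total * (per_category - 1) * (total - per_category)


# Toy data: 40 hypothetical dish IDs with 10 placeholder pictures each.
toy_dataset = {f"dish{i}": list(range(10)) for i in range(40)}
toy_batch = sample_batch(toy_dataset, random.Random(0))
small_triplets = list(enumerate_triplets([0, 0, 1, 1]))
```

Under these assumptions a batch of 32 categories with 8 pictures each contains 256 × 7 × 248 = 444,416 triplets, so an implementation would typically compute the pairwise distance matrix once and evaluate the loss over all triplets in vectorized form rather than looping as above.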

2. Dish identification process

Step one, resampling an unknown dish image and normalizing its pixel values to obtain an image tensor P of size (1, 3, 224, 224), inputting P into the neural network optimized by adaptive contrast learning, and obtaining a feature vector M of size (1, 1000) from the network;

Step two, if the dish category appears for the first time or the feature cache region is empty, the dish is considered to belong to a new category. Otherwise, the Euclidean distances between M and all features in the feature cache region are calculated, and the dish category corresponding to the minimum distance D is taken as the final recognition result;

Step three, executing the automatic low-similarity sample deletion strategy: if the minimum distance D is smaller than a preset threshold T, preferably 0.1, the currently identified feature and the recognition result are stored in the feature cache region; otherwise, the feature and the recognition result are discarded. This completes the identification process.
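A minimal sketch of steps two and three is shown below, assuming the feature vector M has already been produced by the network. The class name `FeatureCache`, its methods and the two-dimensional toy features are hypothetical; only the Euclidean distance, the minimum distance D and the threshold T = 0.1 come from the text.

```python
import math


class FeatureCache:
    """Feature cache region with threshold-gated feedback (illustrative only)."""

    def __init__(self, threshold=0.1):
        self.threshold = threshold   # T; 0.1 is the preferred value in the text
        self.entries = []            # list of (feature, category) pairs

    def register(self, feature, category):
        """Add a feature for a new dish category without identification."""
        self.entries.append((list(feature), category))

    def identify(self, feature):
        """Return (category, D); store the feature only when D < T."""
        if not self.entries:
            # Empty cache: the caller should register the dish as a new category.
            return None, None
        dist, category = min(
            ((math.dist(feature, cached), cat) for cached, cat in self.entries),
            key=lambda pair: pair[0],
        )
        if dist < self.threshold:
            self.entries.append((list(feature), category))
        return category, dist


cache = FeatureCache(threshold=0.1)
empty_result = cache.identify([0.0, 0.0])
cache.register([0.0, 0.0], "dish_A")
cache.register([1.0, 0.0], "dish_B")
close_result = cache.identify([0.05, 0.0])   # D = 0.05 < T, so it is cached
far_result = cache.identify([0.4, 0.0])      # D = 0.35 >= T, so it is discarded
```

The far sample is still labeled with its nearest category, but because it is not written back it cannot pull later matches toward a possibly wrong label, which is how the strategy limits error accumulation.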

To further illustrate the effect of the invention, two examples are listed as follows:

Data from three randomly selected restaurants, A, B and C, located in different regions, were used for comparative dish identification experiments. To make it easier to verify the accuracy of the method, the data selected for each restaurant are its dish identification records over one month, and each restaurant's menu contains many dish categories with similar appearance. Restaurant A is used as the training set, with 600 dish categories and 80,000 pictures in total. Restaurant B and restaurant C serve as the validation set and the test set, respectively. The number of samples in each data set is shown in Table 1.

Table 1 Dataset composition

	Training set (restaurant A)	Validation set (restaurant B)	Test set (restaurant C)
Number of categories	600	200	300
Total number of pictures	80,000	21,000	24,000

To show the advantages of the dish identification method based on adaptive contrast learning, we select the commonly used reference network ResNet18 as the feature extraction network. In addition, to show that the method of the present invention also improves other models, both ResNet18 and ResNet50 networks were used to compare the scheme according to the present invention with the prior art scheme. The ResNet50 embodiment operates in the same way as the ResNet18 embodiment. The reference (baseline) method uses only the smaller boundary triplet loss as the final loss and adds all recognition results to the feature cache region during identification. The common recognition accuracy, that is, the average probability that each identification gives the correct result, is used as the metric. Under the same training/validation/test configuration, we compare the effect on the accuracy of the dish identification method, on the validation set and the test set, of adaptive contrast learning (abbreviated as +ada in Table 2) and of the automatic deletion of low-similarity samples introduced in the feedback process (abbreviated as +T in Table 2). The results of the comparative experiments are shown in Table 2.

Table 2 results of comparative experiments with or without adaptive convolution kernels for different data sets

As shown in Table 2, the baseline method, which combines a feature extraction network optimized with a fixed triplet loss and suffers from recognition error accumulation, achieves a dish identification accuracy below 75% on both the validation set (restaurant B) and the test set (restaurant C), whether with ResNet18 or ResNet50. With the training method based on adaptive contrast learning provided by the invention, the results on the test set (restaurant C) show that the dish identification accuracy of ResNet18 and ResNet50 improves by 7.4% and 7.0%, respectively. With the strategy of automatically deleting low-similarity samples in the feedback process, the accumulation of identification errors is effectively alleviated and the identification accuracy improves substantially: the accuracy of ResNet50 on the validation set (restaurant B) and the test set (restaurant C) improves by 11.5% and 10.8%, reaching 90.9% and 92.4%, respectively. This shows that the strategy of automatically deleting low-similarity samples markedly improves dish identification accuracy when identifying dishes across restaurants. The embodiments 'ResNet18+ada+T' and 'ResNet50+ada+T' are preferred embodiments of the invention.

As described above, by adopting a training scheme based on adaptive contrast learning and a strategy of automatically deleting low-similarity samples, the invention provides a high-accuracy dish identification method that needs no online training. The method not only supports adding dishes in real time but also identifies dishes stably over a long period, and can be widely applied in scenarios such as public catering and smart campuses that use dish identification to digitize dish information.

Although embodiments of the invention have been disclosed above, the invention is not limited to the uses listed in the specification and embodiments. It can be applied to various fields for which it is suitable, and additional modifications will readily occur to those skilled in the art. Therefore, without departing from the general concepts defined in the claims and their equivalents, the invention is not limited to the specific details and illustrations shown and described herein.

Claims

1. A dish identification method based on self-adaptive contrast learning is characterized by comprising the following steps:

A) a training step, comprising randomly selecting 32 different dish categories from a training set, then randomly taking 8 pictures from each dish category, for 256 pictures in total, and carrying out data enhancement on them, comprising the following steps:

A1) horizontally flipping the 256 pictures with probability Q1 to obtain 256 randomly horizontally flipped pictures;

A2) vertically flipping the 256 pictures obtained in step A1) with probability Q2 to obtain 256 randomly vertically flipped pictures;

A3) adding random contrast noise, saturation noise and brightness noise in sequence, with probability Q3, to the 256 pictures obtained in step A2) to obtain 256 pictures with random noise added;

A4) resampling the images and normalizing the pixel values, comprising resampling each of the 256 pictures obtained in step A3) to obtain 256 pictures 224 pixels in both width and height, and normalizing the pixel values of each picture to between 0 and 1;

A5) inputting the 256 resampled and normalized pictures into a feature extraction network, wherein the feature extraction network can be any neural network usable for image classification, the invention taking ResNet18 and ResNet50 as examples, to obtain feature vectors of size (256, V), wherein V can be any length, the invention taking the common value 1000 as an example;

A6) finding all triplets (a, p, n) present in the 256 pictures according to the dish IDs of the 256 pictures, wherein a is a feature extracted from a template picture, p is a feature extracted from any input picture of the same category as a, and n is a feature extracted from any input picture of a different category from a; calculating the multi-scale triplet losses of each triplet, namely the large boundary triplet loss $L_B = \max\{d(a,p) - d(a,n) + M_B,\ 0\}$, the medium boundary triplet loss $L_I = g \cdot \max\{d(a,p) - d(a,n) + M_I,\ 0\}$ and the small boundary triplet loss $L_S = f \cdot \max\{d(a,p) - d(a,n) + M_S,\ 0\}$, wherein g and f are constants, $d(x, y)$ is the Euclidean distance between x and y, and subscripts B, I and S denote the large, medium and small boundaries, respectively; and then keeping $L = \max\{L_B, L_I, L_S\}$ for each triplet (a, p, n) as the final loss;

A7) calculating the gradients of the neural network parameters based on the AdamW optimizer and the final loss, and optimizing the parameters of the model;

B) a dish identification step, comprising:

B1) resampling an unknown dish image and normalizing its pixel values to obtain an image tensor P of size (1, 3, 224, 224), inputting the image tensor P into the neural network optimized by adaptive contrast learning, and obtaining a feature vector M of size (1, 1000) from the neural network;

B2) if a dish of the dish category appears for the first time or the feature cache region is empty, considering the dish category to be a new dish category and adding the feature vector and the new category to the feature library without identification; otherwise, calculating the Euclidean distances between M and all features in the feature cache region, and taking the dish category corresponding to the minimum distance D as the final identification result;

B3) executing an automatic low-similarity sample deletion strategy: if the minimum distance D is smaller than a preset threshold T, storing the currently identified feature and the identification result in the feature cache region; otherwise, discarding the feature and the identification result, thereby completing the identification process.

2. The adaptive contrast learning-based dish identification method as claimed in claim 1, wherein:

The preset threshold is preferably 0.1.

3. A dish identification method based on adaptive contrast learning as claimed in claim 1 or 2, wherein:

The constants g and f are preferably 2 and 4, respectively.