CN111062424A - Small sample food image recognition model training method and food image recognition method - Google Patents

Small sample food image recognition model training method and food image recognition method

Info

Publication number
CN111062424A
CN111062424A
Authority
CN
China
Prior art keywords
image
positive
relationship
sample
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911232161.2A
Other languages
Chinese (zh)
Inventor
Min Weiqing
Lyu Yongqiang
Jiang Shuqiang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201911232161.2A priority Critical patent/CN111062424A/en
Publication of CN111062424A publication Critical patent/CN111062424A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a small sample food image recognition model training method and a food image recognition method. The model training method comprises the following steps: constructing triples each containing a positive sample image, a negative sample image and an anchor image from a training data set, and inputting them into a ternary convolutional neural network to extract the feature representation of each triple; fusing the feature maps to obtain the feature maps of the positive and negative sample image pairs; inputting the pair feature maps into a relation learning network to obtain relation value scores; and screening the positive and negative sample pairs based on the relation value scores, then training the feature embedding network and the relation learning network with the screened positive and negative sample pairs.

Description

Small sample food image recognition model training method and food image recognition method
Technical Field
The invention relates to the technical field of food recognition, and in particular to a small sample food image recognition model training method and a food image recognition method that fuse a ternary convolutional neural network and a relation network.
Background
Food recognition is an important research topic in computer vision, data mining, multimedia social interaction and related fields, and is widely applied to automated food detection, diet management, food trend and popularity analysis, smart homes and food safety. Food data sets collected from the real world follow a typical long-tail distribution: only a small number of samples can be collected for many uncommon food categories, so small sample food recognition is a problem to be solved urgently.
However, no existing work addresses the problem of small sample food image recognition. This is because small sample food image recognition faces many challenges, including: (1) fine-grained discrimination: distinguishing the fine-grained differences of food images within and between classes is essential for food recognition; (2) lack of rigid structure: many food images have no fixed spatial layout, many foods have no fixed structure, and structural information is not readily available; (3) training difficulty: because the training class space and the test class space are disjoint, it is harder to obtain guidance about the test classes from the training classes.
Existing small sample recognition methods focus only on the similarity information between image pairs and neglect the fine-grained distinctions of images within and between classes; they cannot make fine-grained distinctions between images.
Disclosure of Invention
Aiming at the problems and defects of the prior art, the invention provides a small sample food image recognition model training method and a corresponding recognition method that fuse a ternary convolutional neural network and a relation network.
In order to achieve the purpose, the invention provides the following technical scheme:
according to one aspect of the invention, a small sample food image recognition model training method is provided, and is characterized in that the method comprises the following steps:
constructing triples each including a positive sample image, a negative sample image and an anchor image from a training data set, inputting each triple into a ternary convolutional neural network to extract its feature representation, and acquiring the feature maps of a convolutional layer;
fusing the feature maps of the positive and negative sample images with the feature map of the anchor image, respectively, to obtain positive and negative sample image pair feature maps;
inputting the positive and negative sample image pair feature maps into a relation learning network to obtain the corresponding relation value scores, screening the positive and negative sample pairs based on the relation value scores, and training the ternary convolutional neural network and the relation learning network with the screened positive and negative sample pairs.
In a preferred implementation, when the relation learning network is trained, the hinge loss function used is:

L = \sum_{i=1}^{P} \sum_{a=1}^{K} \left[ m + \max_{j \neq i,\, n} g_\varphi\big(\tau(f_\theta(x_a^i), f_\theta(x_n^j))\big) - \min_{p \neq a} g_\varphi\big(\tau(f_\theta(x_a^i), f_\theta(x_p^i))\big) \right]_+

where P is the number of randomly selected sample classes, K is the number of samples randomly selected from each class, m is the relationship control threshold, g_φ(·) is the relation value score obtained by the relation learning network, φ denotes the parameters of the relation learning network, and τ(·,·) is the fused feature map.
In another preferred implementation, the screening is performed by selecting triples whose positive sample pair relation score is greater than β and whose negative sample pair relation score is less than or equal to α, where β and α are preset parameters.
In another preferred implementation, β and α take on values of 0.6 and 0.4, respectively.
In another preferred implementation, the relation scores of the positive and negative sample image pairs are calculated as:

r^{\pm} = g_\varphi\big(\tau(f_\theta(x), f_\theta(x^{\pm}))\big)

where r is the relation value score, g_φ is the relation learning network with parameters φ, τ(f_θ(x), f_θ(x-)) is the fused feature of the negative sample image pair, and τ(f_θ(x), f_θ(x+)) is the fused feature of the positive sample image pair.
In another preferred implementation, in the feature extraction step, the features of the last convolutional layer of the ternary convolutional neural network are extracted as the feature representation of the image.
In another preferred implementation manner, the relationship learning network includes two convolutional layers and two fully-connected layers, where the convolutional layers are used to perform convolutional learning on the input fusion features, and the fully-connected layers are used to perform learning and dimension reduction processing on the convolutional results.
According to another aspect of the invention, a method for small sample food image recognition by using a model trained by the method is provided, and the method comprises the following steps:
selecting one image from each of the C categories of images of the training set as a support set, taking a target image as a query set, pairing the image in the query set with each image in the support set to form image pairs, and inputting the image pairs into the trained ternary convolutional neural network to extract feature maps;
and fusing the feature maps of each image pair, inputting the fused features into the trained relation learning network to obtain the corresponding relation scores, and determining the category of the target image based on the image pair with the maximum relation score.
According to another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method.
According to another aspect of the present invention, there is provided a computer device comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, wherein the processor implements the method when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
the small sample Food image recognition model training method and the corresponding recognition method provided by the invention integrate the ternary convolution neural network and the relational network, the ternary neural network can be used for learning more fine-grained information, and meanwhile, the linear measurement method is replaced by organically integrating the nonlinear measurement method by means of the relational network, so that the classification performance is improved to the greatest extent, and the best classification performance is achieved on a plurality of public data sets (ETH Food-101, Vireofood-172 and ChinesFoodNet).
By using the proposed "limited batch hard" screening rule, triples unsuitable for training are filtered out and the model is trained only on qualifying triples, which further ensures the reliability of model training and improves classification accuracy.
Drawings
The invention is illustrated and described by way of example only, and not by way of limitation, in the following drawings:
fig. 1 is a schematic frame diagram of a small sample food identification method fusing a ternary convolutional neural network and a relational network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a relationship network structure adopted in the embodiment of the present invention.
Fig. 3 is a schematic flow chart of the "limited batch hard" method adopted in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Brief introduction to small sample learning
In the small sample learning problem, the samples are usually organized into a series of training sets and test sets. Assume there are C training classes with N labelled training samples in total, and define the training set as

D_{base} = \{(x_i, y_i)\}_{i=1}^{N}

where x_i is a sampled image and y_i is the label of x_i. For the test set, assume there are L new classes with M test samples in total, and define the test sample set as

D_{novel} = \{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^{M}

with label set \{\tilde{y}_j\}_{j=1}^{M}. It is worth noting that the class spaces of the training set and the test set are completely disjoint. The small sample problem is to identify unknown classes using known classes.
For small sample learning, a support set and a query set are first defined. Taking the training set as an example, C classes are randomly sampled from the training set D_{base}, and K samples are randomly sampled from each class to form the support set

S = \{(x_s, y_s)\}_{s=1}^{C \times K}.

The query set

Q = \{(x_q, y_q)\}_{q=1}^{n}

is formed by randomly selecting one of the C support-set categories and randomly sampling n samples from the selected category. If the support set contains C different classes and each class contains K samples, the task is called "C-way K-shot". In small sample learning settings K is usually small, e.g., K = 1 or K = 5. The aim of a C-way K-shot task is, given a query image x_q, to learn a classification mapping using the support set S and derive a probability distribution over the query classes, P(\hat{y}_q \mid x_q, S), where \hat{y}_q is the predicted label.
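As a concrete illustration of the C-way K-shot sampling described above, the following minimal Python sketch (not part of the patent; the images_by_class layout and function name are assumptions for illustration) draws one episode:

```python
import random

def sample_episode(images_by_class, C=5, K=1, n_query=1):
    """Sample one C-way K-shot episode.

    images_by_class: dict mapping class label -> list of image identifiers.
    """
    classes = random.sample(list(images_by_class), C)            # C support classes
    support = [(img, c) for c in classes
               for img in random.sample(images_by_class[c], K)]  # K shots per class
    query_class = random.choice(classes)                         # query comes from a support class
    used = {img for img, _ in support}
    pool = [img for img in images_by_class[query_class] if img not in used]
    query = [(img, query_class) for img in random.sample(pool, n_query)]
    return support, query
```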
According to one embodiment of the invention, the small sample food image identification method comprises the following steps:
1. Defining the triple input. Three food image inputs x-, x and x+ denote the negative ("negative"), anchor ("anchor") and positive ("positive") sample images, respectively, where x and x+ are samples belonging to the same class and x- is a sample belonging to a different class from x. Two categories A and B are selected from the training categories: two images are randomly selected from category A as the anchor and the positive sample, and one image is selected from category B as the negative sample, giving the triple (anchor, positive, negative). The data set is partitioned into a test set and a training set whose class spaces are disjoint from each other.
Aiming at the learning characteristics of small sample food images, in the training stage triples are constructed from the training set of the data set as the input of the ternary neural network model to train the model; in the testing stage, "C-way K-shot" image pairs (i.e., a subset containing C classes with K samples per class) are constructed from the test set as the input of the deep neural network model to test its performance. The invention mainly tests the 5-way 1-shot setting: for example, 5 categories are randomly selected and one image from each category forms the support set; one of the 5 categories is then chosen and another image from it is selected as the query image; finally, the category of the query image is determined from among the 5 categories. A sketch of the triple construction is given below.
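The following Python sketch illustrates the triple construction just described (illustrative only; images_by_class is the same assumed dict layout as above):

```python
import random

def sample_triplet(images_by_class):
    """Return one (anchor, positive, negative) triple of image identifiers."""
    cls_a, cls_b = random.sample(list(images_by_class), 2)       # categories A and B
    anchor, positive = random.sample(images_by_class[cls_a], 2)  # two images from A
    negative = random.choice(images_by_class[cls_b])             # one image from B
    return anchor, positive, negative
```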
2. The constructed triples are respectively input into the ternary convolutional neural network, i.e., the feature embedding sub-network of the deep neural network, to obtain the feature representation of each triple. Different deep neural networks can be adopted as the feature embedding sub-network.
To briefly describe its structure, the invention takes the commonly used VGG16 network as an example for the related experimental analysis: the feature embedding sub-network is a VGG16 network without the fully connected layers, consisting of 13 convolutional layers; the extracted features are those of the last convolutional layer, with feature dimensions 14 × 14 × 512.
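As a hedged illustration (not the patent's code), the following PyTorch sketch builds such an embedding sub-network; truncating VGG16 before its final max-pooling layer is an assumption made so that a 224 × 224 input yields the 14 × 14 × 512 feature map stated above:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Keep only the convolutional part of VGG16 and drop the final max-pooling
# layer, so the output is the conv5_3 feature map: 512 x 14 x 14 for a
# 224 x 224 input (truncation point assumed from the stated dimensions).
vgg = models.vgg16()
embed_net = nn.Sequential(*list(vgg.features.children())[:-1])

x = torch.randn(1, 3, 224, 224)   # one dummy food image
feat = embed_net(x)
print(feat.shape)                 # torch.Size([1, 512, 14, 14])
```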
Unlike prior-art ternary neural networks, the invention extracts the feature map of a convolutional layer rather than the features of a fully connected layer.
In the prior art, assuming the feature embedding of a sample is f_θ(x), where θ denotes the parameters of the feature embedding network, the fully connected layer before the classification layer is adopted as the embedded representation of an image, and the distance representation of a triple is then obtained with a fixed distance metric (e.g., the L2 distance):

d^{\pm} = \lVert f_\theta(x) - f_\theta(x^{\pm}) \rVert_2

In the invention, the feature maps obtained by inputting the triple into the image feature embedding network are denoted f_θ(x-), f_θ(x) and f_θ(x+), and these are taken as the feature representations of the images. The inventors note that, compared with the features of a fully connected layer, the features of a convolutional layer retain the spatial information of the image, and the features of the last convolutional layer also carry stronger semantic information, which is more beneficial for small sample food image recognition. Therefore, in a preferred embodiment of the invention, the features of the last convolutional layer are taken as the feature representation of the image.
3. The feature maps of the negative sample image and the positive sample image are each deeply fused with the feature map of the anchor using a fusion operator τ, giving the fused features of the negative sample image pair and the positive sample image pair, τ(f_θ(x), f_θ(x+)) and τ(f_θ(x), f_θ(x-)) respectively. Many feature fusion methods exist; this embodiment uses feature splicing (concatenation): for 14 × 14 × 512 feature maps, the features of the negative sample image pair and the positive sample image pair are depth-fused respectively, and the resulting feature map has dimensions 14 × 14 × 1024.
In this embodiment, the feature embedding is expressed as the feature map extracted from the convolutional layer, which not only fits the input of the relation learning network but also contains richer image information than a fully connected layer. Following the assumptions above, the feature embedding of a sample is denoted f_θ(x), where θ denotes the parameters of the feature embedding network, and the fusion operator τ (feature depth fusion) is used to obtain the fused feature pair:

\big(\tau(f_\theta(x), f_\theta(x^{+})),\ \tau(f_\theta(x), f_\theta(x^{-}))\big)
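The fusion operator can be illustrated with a short PyTorch sketch of the channel-wise splicing described above (an illustrative assumption, not the patent's code):

```python
import torch

def tau(feat_anchor, feat_other):
    """Feature depth fusion by channel-wise splicing (concatenation)."""
    return torch.cat([feat_anchor, feat_other], dim=1)  # two [B,512,14,14] maps -> [B,1024,14,14]

f_a = torch.randn(2, 512, 14, 14)   # anchor feature maps
f_p = torch.randn(2, 512, 14, 14)   # positive-sample feature maps
fused_pos = tau(f_a, f_p)
print(fused_pos.shape)              # torch.Size([2, 1024, 14, 14])
```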
4. The fused features of the positive and negative sample image pairs are respectively input into the relation learning network to obtain the relation scores of the positive and negative sample image pairs:

r^{\pm} = g_\varphi\big(\tau(f_\theta(x), f_\theta(x^{\pm}))\big)

where r is the relation value score and g_φ is the relation learning network, whose parameter values φ are obtained by training; τ(f_θ(x), f_θ(x-)) is the feature depth fusion of the negative sample image pair and τ(f_θ(x), f_θ(x+)) is the feature depth fusion of the positive sample image pair. The emphasis of combining the ternary network with the relation network is that the fused features of the positive and negative sample image pairs enter the relation learning network to obtain nonlinear relation scores.
As shown in fig. 2, the relation learning network proposed by the invention consists of two convolutional layers and two fully connected layers. The convolutional layers perform convolution on the input fused features to learn a fused representation; each uses the ReLU nonlinear activation function and consists of 64 convolution kernels of size 3 × 3. The two fully connected layers perform dimension reduction and relation score learning on the convolution result, use ReLU and Sigmoid nonlinear activation functions respectively, and have output dimensions of 8 and 1 respectively. The fused features of the positive and negative sample image pairs are input into the relation learning network separately, so each image pair obtains a 1-dimensional relation score. It should be noted that although fig. 2 shows the relation learning network, each of its layers is constructed in the conventional manner and is not described in detail here.
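A hedged PyTorch sketch of this relation learning network follows. The two pooling layers and the flattened dimension are assumptions (the patent does not specify how the 14 × 14 spatial map is reduced before the fully connected layers); the layer widths and activations follow the description above:

```python
import torch
import torch.nn as nn

class RelationNet(nn.Module):
    def __init__(self, in_channels=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 14x14 -> 7x7 (assumed pooling)
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 7x7 -> 3x3 (assumed pooling)
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 3 * 3, 8), nn.ReLU(),   # 8-dim fully connected layer, ReLU
            nn.Linear(8, 1), nn.Sigmoid(),         # 1-dim relation score in (0, 1)
        )

    def forward(self, fused):                      # fused: [B, 1024, 14, 14]
        h = self.conv(fused)
        return self.fc(h.flatten(1)).squeeze(1)    # [B] relation scores

scores = RelationNet()(torch.randn(4, 1024, 14, 14))  # four fused pairs -> four scores
```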
The relation network serves as a nonlinear metric function, replacing the linear metric function of the traditional ternary neural network. It can adapt through learning to different network models and different data, whereas a fixed linear distance cannot change; the discriminative power of the nonlinear metric is stronger than that of the linear metric (the experimental results demonstrate its superiority).
In the training phase, the fused feature maps τ(f_θ(x), f_θ(x-)) and τ(f_θ(x), f_θ(x+)) are respectively input into the relation learning network g_φ, and the hinge loss function is defined as:

L = \big[\, m + g_\varphi(\tau(f_\theta(x), f_\theta(x^{-}))) - g_\varphi(\tau(f_\theta(x), f_\theta(x^{+}))) \,\big]_+

where m is the relationship control threshold, g_φ(·) is the relation value score obtained by the relation network, and φ denotes the parameters of the relation network.
The relation learning network is trained with this hinge loss function.
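As an illustrative numeric check (numbers chosen for illustration, not from the patent): with m = 0.3, a positive pair scoring r+ = 0.8 and a negative pair scoring r- = 0.4 give [0.3 + 0.4 - 0.8]_+ = 0, so a well-separated triple incurs no penalty, whereas r+ = 0.5 and r- = 0.45 give [0.3 + 0.45 - 0.5]_+ = 0.25, pushing the network to enlarge the gap between the pair scores until it exceeds the threshold m.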
Since the last fully connected layer of the relation network uses the Sigmoid activation function, the final relation score r lies between 0 and 1, where 1 can be regarded as completely similar and 0 as completely dissimilar. However, with the traditional "batch hard" (batch hardest sample) screening rule, when the relation score is used as the criterion for screening triples, many triples are unsuitable for model training; for example, many positive sample image pairs have relation scores far smaller than those of negative sample image pairs, which produces excessive loss values and is very disadvantageous for model training.
Therefore, unlike the conventional method, the relation scores of the positive and negative samples are screened with the proposed "limited batch hard" screening rule: after all triples are input into the relation learning network, the positive and negative samples are screened according to their relation scores, and only triples whose positive sample pair score is greater than or equal to β and whose negative sample image pair score is less than or equal to α are retained (β and α are hyper-parameters, set before the learning process starts, used respectively to select trainable positive sample pairs and trainable negative sample pairs; experimental verification gives β = 0.6 and α = 0.4). This effectively avoids the triples in the batch data that are extremely difficult to train and reduces the difficulty of network training.
Based on the "limited batch hard" triple selection scheme, the original loss function is adjusted; under the new triple sampling scheme, the final loss function is:

L = \sum_{i=1}^{P} \sum_{a=1}^{K} \left[ m + \max_{j \neq i,\, n} g_\varphi\big(\tau(f_\theta(x_a^i), f_\theta(x_n^j))\big) - \min_{p \neq a} g_\varphi\big(\tau(f_\theta(x_a^i), f_\theta(x_p^i))\big) \right]_+

\text{s.t.} \quad g_\varphi\big(\tau(f_\theta(x_a^i), f_\theta(x_p^i))\big) \ge \beta, \qquad g_\varphi\big(\tau(f_\theta(x_a^i), f_\theta(x_n^j))\big) \le \alpha

where [·]_+ denotes the hinge max(·, 0), the subscripts of the min and max functions indicate the combinations of samples within the same class and across different classes, and "s.t." is the mathematical abbreviation of "subject to", introducing the constraint conditions. For each batch of data, P classes are randomly selected and K samples are randomly selected from each class, so each batch contains P × K images. The hinge loss function forces the relation value score of a positive sample image pair to be greater than that of a negative sample image pair; it not only guides the feature embedding model to generate the image embeddings but also guides the learning of the relation network at the same time.
By adopting this hinge loss function and simultaneously constraining the positive and negative sample relation values, the model can learn more discriminative information. A hedged sketch of this screening-plus-loss computation is given below.
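The following PyTorch sketch implements one reading of the "limited batch hard" rule above; the tensor layout, masking strategy and the default value of m are assumptions for illustration, not the patent's code:

```python
import torch

def limited_batch_hard_loss(pos_scores, neg_scores, m=0.3, beta=0.6, alpha=0.4):
    """Hinge loss over relation scores with 'limited batch hard' screening.

    pos_scores: [B, n_pos] relation scores of each anchor's positive pairs.
    neg_scores: [B, n_neg] relation scores of each anchor's negative pairs.
    """
    # Screening: keep positives scoring >= beta and negatives scoring <= alpha.
    pos = pos_scores.masked_fill(pos_scores < beta, float("inf"))
    neg = neg_scores.masked_fill(neg_scores > alpha, float("-inf"))
    hard_pos = pos.min(dim=1).values   # hardest retained positive per anchor
    hard_neg = neg.max(dim=1).values   # hardest retained negative per anchor
    # Anchors whose pairs were all screened out yield relu(-inf) = 0, i.e. no loss.
    return torch.relu(m + hard_neg - hard_pos).mean()
```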
In the testing stage or the application stage, images are selected in the "C-way K-shot" manner to generate the corresponding support and query sets, and the corresponding image pairs are generated from each support set and query set. For example, for a 5-way 1-shot group, 5 images from 5 categories form the support set; one of the 5 selected categories is then chosen and an additional image from it is taken as the query set (in the application stage, the image to be classified serves as the query set). Finally, the image in the query set is paired with each image in the support set, giving 4 negative sample image pairs and 1 positive sample image pair, which are then input into the trained feature embedding network and relation learning sub-network.
The constructed 5-way 1-shot image pairs are input into the feature embedding sub-network (a deep neural network) respectively, and the feature maps of the convolutional layer are extracted. The fusion operator τ is used to deeply fuse the feature maps of the negative sample image pairs and the positive sample image pair, yielding the fused features of the 4 negative sample image pairs and the 1 positive sample image pair.
The fused features of the positive and negative sample image pairs are respectively input into the relation learning network to obtain the relation scores of the image pairs:

r_s = g_\varphi\big(\tau(f_\theta(x_q), f_\theta(x_s))\big), \quad s = 1, \dots, 5

where x_q is the query image and x_s is the s-th support image.
Five relation scores are finally obtained. If the relation score of the positive sample image pair is higher than the scores of the other four negative sample pairs, the classification is correct; if the positive sample image pair's relation score is not the highest, a classification error is indicated. The category of the query set is the category of the positive sample in the support set, thereby realizing food image classification.
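The decision rule can be sketched as follows (illustrative only; embed_net and relation_net follow the earlier sketches, and the function name and argument layout are assumptions):

```python
import torch

def classify_query(query_img, support_imgs, support_labels, embed_net, relation_net):
    """Assign the query image the label of the highest-scoring support image."""
    with torch.no_grad():
        q = embed_net(query_img.unsqueeze(0))                    # [1, 512, 14, 14]
        scores = []
        for s_img in support_imgs:                               # 5 support images for 5-way 1-shot
            s = embed_net(s_img.unsqueeze(0))
            fused = torch.cat([q, s], dim=1)                     # the fusion operator tau
            scores.append(relation_net(fused))
        scores = torch.stack(scores).flatten()                   # 5 relation scores
    return support_labels[int(scores.argmax())]
```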
To verify the performance advantages of the food image recognition method of the invention over existing food image recognition methods, the inventors conducted experiments on several common data sets. The recognition method of the invention and the models and corresponding methods of the Siamese Network, Matching Network, Relation Network, Triplet Network, etc. were run on the common data sets ETH Food-101, VireoFood-172, ChineseFoodNet, etc., with 1000 groups of "5-way 1-shot" image pairs randomly generated from the test data of each data set.
Table 1 shows the results of comparing the performance of the method of the invention with the other reference methods.
As can be seen from Table 1, the ternary neural network based on a nonlinear distance (the relation learning network) proposed by the invention achieves the best classification performance on the three popular common data sets. Compared with the linear-distance ternary neural network, the accuracy is improved by 0.8%, 1.7% and 3.7% respectively, which proves the superiority of the relation learning network; the invention thus provides a brand-new technical concept for small sample food image recognition.
[Table 1, rendered as an image in the original publication: comparison of the classification accuracy of the method of the invention with the reference methods on the three data sets.]
Table 2 further illustrates the method of the invention: the parameter settings were explored to obtain the group with the best test performance, where the two values in parentheses in the table are α and β respectively.
[Table 2, rendered as an image in the original publication: classification accuracy under different (α, β) parameter settings.]
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A small sample food image recognition model training method is characterized by comprising the following steps:
constructing triples each including a positive sample image, a negative sample image and an anchor image from a training data set, inputting each triple into a ternary convolutional neural network to extract its feature representation, and acquiring the feature maps of a convolutional layer;
fusing the feature maps of the positive and negative sample images with the feature map of the anchor image, respectively, to obtain positive and negative sample image pair feature maps;
inputting the positive and negative sample image pair feature maps into a relation learning network to obtain the corresponding relation value scores, screening the positive and negative sample pairs based on the relation value scores, and training the ternary convolutional neural network and the relation learning network with the screened positive and negative sample pairs.
2. The small sample food image recognition model training method according to claim 1, wherein, when the relation learning network is trained, the hinge loss function adopted is:

L = \sum_{i=1}^{P} \sum_{a=1}^{K} \left[ m + \max_{j \neq i,\, n} g_\varphi\big(\tau(f_\theta(x_a^i), f_\theta(x_n^j))\big) - \min_{p \neq a} g_\varphi\big(\tau(f_\theta(x_a^i), f_\theta(x_p^i))\big) \right]_+

where P is the number of randomly selected sample classes, K is the number of samples randomly selected from each class, m is the relationship control threshold, g_φ(·) is the relation value score obtained by the relation learning network, φ denotes the parameters of the relation learning network, and τ(·,·) is the fused feature map.
3. The small sample food image recognition model training method according to claim 1, wherein the screening is performed by selecting triples whose positive sample pair relation score is greater than β and whose negative sample pair relation score is less than or equal to α, where β and α are preset parameters.
4. The small sample food image recognition model training method of claim 3, wherein the values of β and α are 0.6 and 0.4, respectively.
5. The small sample food image recognition model training method according to claim 3, wherein the relation scores of the positive and negative sample image pairs are calculated as:

r^{\pm} = g_\varphi\big(\tau(f_\theta(x), f_\theta(x^{\pm}))\big)

where r is the relation value score, g_φ is the relation learning network with parameters φ, τ(f_θ(x), f_θ(x-)) is the fused feature of the negative sample image pair, and τ(f_θ(x), f_θ(x+)) is the fused feature of the positive sample image pair.
6. The small sample food image recognition model training method according to claim 1, wherein, in the feature extraction step, the features of the last convolutional layer of the ternary convolutional neural network are extracted as the feature representation of the image.
7. The small sample food image recognition model training method according to claim 1, wherein the relation learning network comprises two convolutional layers and two fully connected layers, the convolutional layers being used to perform convolution on the input fused features and the fully connected layers being used to perform learning and dimension reduction on the convolution result.
8. A method for small sample food product image recognition using a model trained by the method of any one of claims 1-7, comprising:
selecting one image from each of the C categories of images of the training set as a support set, taking a target image as a query set, pairing the image in the query set with each image in the support set to form image pairs, and inputting the image pairs into the trained ternary convolutional neural network to extract feature maps;
and fusing the feature maps of each image pair, inputting the fused features into the trained relation learning network to obtain the corresponding relation scores, and determining the category of the target image based on the image pair with the maximum relation score.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 7 when executing the program.
CN201911232161.2A 2019-12-05 2019-12-05 Small sample food image recognition model training method and food image recognition method Pending CN111062424A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911232161.2A CN111062424A (en) 2019-12-05 2019-12-05 Small sample food image recognition model training method and food image recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911232161.2A CN111062424A (en) 2019-12-05 2019-12-05 Small sample food image recognition model training method and food image recognition method

Publications (1)

Publication Number Publication Date
CN111062424A true CN111062424A (en) 2020-04-24

Family

ID=70299777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911232161.2A Pending CN111062424A (en) 2019-12-05 2019-12-05 Small sample food image recognition model training method and food image recognition method

Country Status (1)

Country Link
CN (1) CN111062424A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626212A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Method and device for identifying object in picture, storage medium and electronic device
CN111797237A (en) * 2020-07-10 2020-10-20 第四范式(北京)技术有限公司 Text entity relation identification method, system and medium
CN111882000A (en) * 2020-08-04 2020-11-03 天津大学 Network structure and method applied to small sample fine-grained learning
CN112052762A (en) * 2020-08-27 2020-12-08 西安电子科技大学 Small sample ISAR image target identification method based on Gaussian prototype
CN113486202A (en) * 2021-07-01 2021-10-08 南京大学 Method for classifying small sample images
CN113716146A (en) * 2021-07-23 2021-11-30 武汉纺织大学 Paper towel product packaging detection method based on deep learning
CN115965817A (en) * 2023-01-05 2023-04-14 北京百度网讯科技有限公司 Training method and device of image classification model and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308497A (en) * 2018-10-27 2019-02-05 北京航空航天大学 A kind of multidirectional scale dendrography learning method based on multi-tag network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308497A (en) * 2018-10-27 2019-02-05 北京航空航天大学 A kind of multidirectional scale dendrography learning method based on multi-tag network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LYU YONGQIANG ET AL.: "Small Sample Food Image Recognition Fusing a Ternary Convolutional Neural Network and a Relation Network", COMPUTER SCIENCE *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626212A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Method and device for identifying object in picture, storage medium and electronic device
CN111626212B (en) * 2020-05-27 2023-09-26 腾讯科技(深圳)有限公司 Method and device for identifying object in picture, storage medium and electronic device
CN111797237A (en) * 2020-07-10 2020-10-20 第四范式(北京)技术有限公司 Text entity relation identification method, system and medium
CN111797237B (en) * 2020-07-10 2024-05-07 第四范式(北京)技术有限公司 Text entity relationship recognition method, system and medium
CN111882000A (en) * 2020-08-04 2020-11-03 天津大学 Network structure and method applied to small sample fine-grained learning
CN112052762A (en) * 2020-08-27 2020-12-08 西安电子科技大学 Small sample ISAR image target identification method based on Gaussian prototype
CN113486202A (en) * 2021-07-01 2021-10-08 南京大学 Method for classifying small sample images
CN113486202B (en) * 2021-07-01 2023-08-04 南京大学 Method for classifying small sample images
CN113716146A (en) * 2021-07-23 2021-11-30 武汉纺织大学 Paper towel product packaging detection method based on deep learning
CN115965817A (en) * 2023-01-05 2023-04-14 北京百度网讯科技有限公司 Training method and device of image classification model and electronic equipment

Similar Documents

Publication Publication Date Title
CN111062424A (en) Small sample food image recognition model training method and food image recognition method
TWI677852B (en) A method and apparatus, electronic equipment, computer readable storage medium for extracting image feature
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN109685135B (en) Few-sample image classification method based on improved metric learning
CN108171209B (en) Face age estimation method for metric learning based on convolutional neural network
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
Kuncheva et al. PCA feature extraction for change detection in multidimensional unlabeled data
CN111126482B (en) Remote sensing image automatic classification method based on multi-classifier cascade model
Saito et al. Robust active learning for the diagnosis of parasites
CN105303179A (en) Fingerprint identification method and fingerprint identification device
EP3311311A1 (en) Automatic entity resolution with rules detection and generation system
CN105389326B (en) Image labeling method based on weak matching probability typical relevancy models
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN107798351B (en) Deep learning neural network-based identity recognition method and system
CN111931505A (en) Cross-language entity alignment method based on subgraph embedding
CN112560710B (en) Method for constructing finger vein recognition system and finger vein recognition system
CN115712740B (en) Method and system for multi-modal implication enhanced image text retrieval
CN111564179A (en) Species biology classification method and system based on triple neural network
CN109145704A (en) A kind of human face portrait recognition methods based on face character
CN115309860A (en) False news detection method based on pseudo twin network
CN116385832A (en) Bimodal biological feature recognition network model training method
CN108229692B (en) Machine learning identification method based on dual contrast learning
CN111786999B (en) Intrusion behavior detection method, device, equipment and storage medium
Lim et al. A scene image is nonmutually exclusive—a fuzzy qualitative scene understanding
CN113516156A (en) Fine-grained image classification method based on multi-source information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200424)