CN112699822A - Restaurant dish identification method based on deep convolutional neural network - Google Patents

Restaurant dish identification method based on deep convolutional neural network

Info

Publication number
CN112699822A
CN112699822A
Authority
CN
China
Prior art keywords
feature map
channel
dish
convolution
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110006146.7A
Other languages
Chinese (zh)
Other versions
CN112699822B (en)
Inventor
翟盛龙
尹旭
王东伟
张金波
张睿智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd filed Critical Inspur Cloud Information Technology Co Ltd
Priority to CN202110006146.7A priority Critical patent/CN112699822B/en
Publication of CN112699822A publication Critical patent/CN112699822A/en
Application granted granted Critical
Publication of CN112699822B publication Critical patent/CN112699822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a restaurant dish identification method based on a deep convolutional neural network, relates to the technical field of deep learning, and aims to solve the problem that existing dish identification and classification methods struggle to finely distinguish dishes with high similarity. The technical scheme is as follows: a dish image is acquired and cut to obtain sample blocks of the image; a sample block first undergoes a 3D convolution operation and a downsampling operation, a feature map is then extracted through an attention module, a further 3D convolution operation produces a one-dimensional intermediate feature map, the one-dimensional intermediate feature map is input into a deep convolutional neural network with a softmax function, and the classification result of the original dish image is obtained from the probability values produced by the softmax mapping. The method can improve identification precision, reduce the workload of manual auxiliary operation, and overcome the difficulty that existing dish identification and classification methods have in finely distinguishing dishes with high similarity.

Description

Restaurant dish identification method based on deep convolutional neural network
Technical Field
The invention relates to the technical field of deep learning, in particular to a restaurant dish identification method based on a deep convolutional neural network.
Background
Food is an important component of human life and an important prerequisite for human survival and healthy development. As society develops, people's requirements on food quality keep rising, which has greatly promoted the development of the catering industry: enterprise restaurants prepare a wide variety of dishes and continually introduce new ones, so the demand for accurately identifying many kinds of dishes keeps growing. Meanwhile, because many enterprise restaurants offer numerous dish types, manual settlement in the dining hall is very inefficient. With the rapid development of the mobile internet, a large number of dish images can be quickly acquired from massive network image information and used as a data source for analysis and modeling to obtain a general model for dish image classification, segmentation and identification, which is of great significance for saving labor cost and improving restaurant settlement efficiency.
Image classification, in short, distinguishes already-acquired images by finding features common to a large number of images; only when the model can find such features can the images be correctly distinguished.
Dish image classification was already an active research direction before deep learning technology matured. Deep learning is a machine learning method based on representation learning of data; it aims to build and simulate the multilayer neural networks with which the human brain analyzes and learns, is used to interpret data such as images, sounds and text, and is widely applied in the field of image recognition. Because deep learning can extract more abstract and deeper features from an image, it has stronger classification capability than traditional classification methods. Convolutional neural networks have been applied to image classification with good results; however, the amount of input information and the classification effect of a convolutional neural network are not completely positively correlated, and for a given model an overly complex input not only lengthens training and classification time but may even cause accuracy to drop rather than rise. It is therefore necessary to study in depth the feature extraction process that precedes convolutional neural network classification, so that adaptive feature refinement can be achieved at low cost.
Disclosure of Invention
In view of existing dish identification and classification methods, and considering characteristics such as the high similarity of some dishes and their uneven ingredient composition, the invention studies in depth the feature extraction process that precedes convolutional neural network classification, achieving adaptive feature refinement at low cost. To further improve identification precision, reduce the workload of manual auxiliary operation, optimize a loss function suited to restaurant dish classification, enhance the robustness of the algorithm and reduce the risk of overfitting, the invention provides a restaurant dish identification method based on a deep convolutional neural network.
The invention discloses a restaurant dish identification method based on a deep convolutional neural network, which solves the technical problems by adopting the following technical scheme:
A restaurant dish identification method based on a deep convolutional neural network is characterized by comprising the following implementation contents:
Step S1, collecting a dish image R1, performing a cutting preprocessing operation on the dish image R1 to obtain a dish image R2, and taking sample blocks from the dish image R2 to obtain a dish sample block T1, the dish sample block T1 being the characteristic information of the dish sample;
Step S2, performing a 3D convolution operation on the dish sample block T1 to obtain an intermediate feature map T2 of the dish sample block T1;
Step S3, performing a pooling operation on the intermediate feature map T2 of the dish sample block T1 to obtain an intermediate feature map T3;
Step S4, performing a pooling operation on the intermediate feature map T3 in the spatial dimension to obtain a channel attention module A3, performing a pooling operation on the intermediate feature map T3 in the channel dimension to obtain a plane attention module A'3, and multiplying each channel vector of the intermediate feature map T3 by the channel attention module and each spatial feature of the intermediate feature map T3 position by position by the plane attention module, to obtain an intermediate feature map T4;
Step S5, performing a 3D convolution operation and a pooling operation in sequence on the intermediate feature map T4 to obtain an intermediate feature map T6, performing a pooling operation on the intermediate feature map T6 in the spatial dimension to obtain a channel attention module A6, performing a pooling operation on the intermediate feature map T6 in the channel dimension to obtain a plane attention module A'6, and multiplying each channel vector of the intermediate feature map T6 by the channel attention module and each spatial feature of the intermediate feature map T6 position by position by the plane attention module, to obtain an intermediate feature map T7;
Step S6, performing a 3D convolution operation on the intermediate feature map T7 to obtain a one-dimensional intermediate feature map T8;
Step S7, inputting the intermediate feature map T8 into a deep convolutional neural network to obtain the classification result of the dish image R1.
Optionally, when step S1 is executed, sample blocks are taken from the dish image R2, and the specific operation comprises the following steps:
Step S1.1, in the plane dimension, for each pixel point of the dish image R2, take the surrounding a × a pixel points as the neighborhood block of the sample, wherein a is the number of pixel points of the image block along the plane length and width directions;
Step S1.2, retain all channel information of the a × a pixel points, i.e. form a three-dimensional sample block of size P × a × a that represents the sample feature of the center pixel point, and perform the feature transformation of the sample-block taking process using formula (1):
(Formula (1) is shown only as an image in the original; it defines the sample-block taking transform Dsamp that maps the dish image R2 to the Q sample blocks T1 of size P × a × a.)
wherein Q is the number of pixel points in a single channel, which is also the number of block samples, Dsamp denotes the sample-block taking process, and L and H denote the preset plane length and width of the cutting operation;
in the sample-block taking operation, when an edge pixel point has no spatial neighborhood information, a zero-padding operation is performed.
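The following is a minimal Python/PyTorch sketch of the sample-block taking of step S1, written only to make the operation concrete; the function name extract_sample_blocks, the tensor shapes and the neighborhood size are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn.functional as F

def extract_sample_blocks(image: torch.Tensor, a: int) -> torch.Tensor:
    """Step S1 sketch: for every pixel of a P x L x H image, keep the surrounding
    a x a neighborhood in all P channels, zero-padding edge pixels that lack
    spatial neighbors, so that Q = L * H blocks of shape P x a x a result."""
    P, L, H = image.shape
    pad = a // 2
    padded = F.pad(image, (pad, pad, pad, pad), mode="constant", value=0.0)
    blocks = []
    for i in range(L):
        for j in range(H):
            blocks.append(padded[:, i:i + a, j:j + a])
    return torch.stack(blocks)               # shape (Q, P, a, a), Q = L * H

# Example: a 3-channel 32 x 32 crop of a dish image with a = 5
t1 = extract_sample_blocks(torch.rand(3, 32, 32), a=5)
print(t1.shape)                               # torch.Size([1024, 3, 5, 5])
```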
Further optionally, when step S2 is executed, a 3D convolution operation is performed on the dish sample block T1, and the specific operation comprises:
Step S2.1, based on the deep convolutional neural network, select h different convolution kernels in each layer of the convolutional neural network, and perform a convolution operation on the P channels of information contained in the dish sample block T1 using 3D convolution kernels of size e × f × f, wherein e is the number of operation layers in the channel dimension, i.e. e channels are selected for each group of convolution, and f is the number of pixel points of the image block along the length and width directions of the spatial dimension;
Step S2.2, after h different convolution kernels are selected in each layer of the convolutional neural network, the intermediate feature map T2 of the dish sample block T1 is obtained using formulas (2), (3) and (4):
P' = [(P - e) + 1] × h   formula (2),
m = [(a - e) + 1]   formula (3),
T2^(P'×m×m) = Con3D(T1)   formula (4),
wherein P denotes the number of channels contained in the dish sample block T1, e is the number of operation layers in the channel dimension, a is the number of pixel points of the image block along the plane length and width directions, m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions, and Con3D denotes performing a 3D convolution operation.
Further optionally, in the process of performing the 3D convolution operation on the dish sample block T1, the mapping of each feature in a convolutional layer is connected to several adjacent continuous channels of the previous layer, and a given position value of one convolution mapping is obtained by convolving the local receptive fields at the same position of three continuous channels of the previous layer; one convolutional layer has several convolution kernels, one convolution kernel can extract only one type of feature information from the three-dimensional data, and h types of feature information can be extracted using h convolution kernels, wherein h is a positive integer and h > 1.
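As a concrete illustration of the 3D convolution of step S2, the sketch below uses PyTorch's nn.Conv3d on a single sample block; the values of h, e and f are illustrative assumptions, and e = f is chosen so that the spatial shrinkage matches formulas (2) and (3).

```python
import torch
import torch.nn as nn

# Step S2 sketch: treat the P x a x a sample block T1 as one 3D volume and slide
# h different kernels of size e x f x f over it; the channel axis shrinks to
# (P - e) + 1 positions and the spatial axes to (a - f) + 1 positions per kernel.
h, e, f = 8, 3, 3
conv3d = nn.Conv3d(in_channels=1, out_channels=h, kernel_size=(e, f, f))

t1 = torch.rand(1, 1, 16, 9, 9)   # (batch, feature, P = 16, a = 9, a = 9)
t2 = conv3d(t1)
print(t2.shape)                    # torch.Size([1, 8, 14, 7, 7])
```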
Further optionally, step S3 is executed to obtain the intermediate feature map T3, and the specific operation comprises:
Step S3.1, perform a pooling operation, i.e. a downsampling or feature-discarding process, on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3; at this time, the number of channels of the intermediate feature map T3 is the same as the number of channels of the intermediate feature map T2, while the size of a single channel in the spatial dimension changes;
Step S3.2, after the pooling process, the intermediate feature map T3 is denoted by T3^(p×r×r), i.e. the number of pixel points of each channel of the intermediate feature map T3 along the spatial length and width directions is r, and the number of pixel points r is calculated using formula (5):
r = (m ÷ 2)   formula (5),
wherein m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions.
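A minimal sketch of the pooling of step S3, assuming 2 × 2 max pooling applied only to the spatial dimensions so that the channel count of T2 is preserved and r = m ÷ 2:

```python
import torch
import torch.nn as nn

# Step S3 sketch: pool over the two spatial dimensions only; the channel
# dimension of the 3D feature map is left untouched.
pool = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))

t2 = torch.rand(1, 8, 14, 14, 14)  # (batch, h, P', m, m)
t3 = pool(t2)
print(t3.shape)                     # torch.Size([1, 8, 14, 7, 7]), r = m // 2
```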
Further optionally, step S4 is executed to obtain the intermediate feature map T4, and the specific operation is:
transform the intermediate feature map T3 using formulas (6) and (7), i.e. the intermediate feature map T3 is first multiplied point by point with the channel attention module A3 in the channel direction, and then multiplied point by point with the plane attention module A'3 in the spatial direction, to obtain the intermediate feature map T4:
Atenspe(T3) = A3(T3) ⊗ T3   formula (6),
T4 = Atenspa[Atenspe(T3)] = A'3(T3) ⊗ Atenspe(T3)   formula (7),
wherein Atenspe denotes attention enhancement of the intermediate feature map T3 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T3 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T3, r is the number of pixel points of a single channel of the intermediate feature map T3 along the spatial length and width directions, p is the number of channels of the intermediate feature map T3, v is the v-th channel of the intermediate feature map T3, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
Further optionally, step S5 is executed to obtain the intermediate feature map T7, and the specific operation is:
Step S5.1, perform a 3D convolution operation on the intermediate feature map T4 using formula (8) to obtain the intermediate feature map T5, the intermediate feature map T5 being denoted by T5^(x×y×y):
T5^(x×y×y) = Con3D(T4^(p×r×r))   formula (8),
wherein Con3D denotes a 3D convolution operation, x denotes the number of pixel points of the intermediate feature map T5 along the spatial height direction, y denotes the number of pixel points of the intermediate feature map T5 along the spatial length and width directions, r is the number of pixel points of a single channel of the intermediate feature map T4 along the spatial length and width directions, and p is the number of channels of the intermediate feature map T4;
Step S5.2, perform a downsampling operation on the intermediate feature map T5 to obtain the intermediate feature map T6; at this time, the number of channels of the intermediate feature map T6 is the same as the number of channels of the intermediate feature map T5, while the size of a single channel in the spatial dimension changes, and the size of a single channel of the intermediate feature map T6 along the spatial length and width directions is:
z × z = [(y ÷ 2) × (y ÷ 2)],
wherein z is the number of pixel points of the intermediate feature map T6 along the spatial length and width directions, and y is the number of pixel points of the intermediate feature map T5 along the spatial length and width directions;
s5.3, utilizing the formulas (9) and (10) to perform intermediate feature map T6Carrying out feature transformation to obtain an intermediate feature map T7,T7Is that
Figure BDA0002883292150000054
Atenspe(T6) = A6(T6) ⊗ T6   formula (9),
T7 = Atenspa[Atenspe(T6)] = A'6(T6) ⊗ Atenspe(T6)   formula (10),
wherein Atenspe denotes attention enhancement of the intermediate feature map T6 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T6 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T6, z is the number of pixel points of a single channel of the intermediate feature map T6 along the spatial length and width directions, x is the number of channels of the intermediate feature map T6, v is the v-th channel of the intermediate feature map T6, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
Further optionally, when step S4 or S5 is executed, the specific operations of the channel attention module and the plane attention module are:
obtaining a channel attention module:
(1.1) first, the intermediate feature map T is centered in the spatial dimensioniPerforming maximum pooling and average pooling operations, respectively, to generate two pooling vectors, wherein i has a value of 3 or 6,
(1.2) inputting the two pooled vectors into a shared multilayer mapping neural network for training to respectively generate two new vectors,
(1.3) finally, carrying out bitwise addition on the two new vectors, and carrying out nonlinear mapping through a Sigmoid activation function, namely obtaining a channel attention module A by using the formulas (11) and (12)i(Ti),
(Formula (11) is shown only as an image in the original and is not reproduced here.)
Ai(Ti) = σ{MLP[AvePool(Ti)] + MLP[MaxPool(Ti)]}   formula (12),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes the nonlinear mapping performed through a multilayer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling;
(II) Obtaining the plane attention module:
(2.1) First, perform maximum pooling and average pooling operations on the intermediate feature map Ti in the channel dimension, respectively, to generate two pooling vectors;
(2.2) Then, map the two pooling vectors to a single-channel feature map of the same size through a convolution operation;
(2.3) Finally, perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the plane attention module A'i(Ti) using formulas (13) and (14):
(Formula (13) is shown only as an image in the original and is not reproduced here.)
A'i(Ti) = σ{f^(1×1)[AvePool(Ti); MaxPool(Ti)]}   formula (14),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f^(1×1) denotes the feature transformation performed by a 1 × 1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
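To make the two attention modules and their application in formulas (6)-(7) and (9)-(10) concrete, the sketch below gives a CBAM-style implementation in PyTorch; it treats each channel of the feature map as a 2D map, and the reduction ratio of the shared MLP is an illustrative assumption, not a value fixed by the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Formulas (11)-(12) sketch: pool over the spatial dimensions, pass both
    pooled vectors through a shared MLP, add bitwise and apply a Sigmoid."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (B, C, H, W)
        avg = self.mlp(t.mean(dim=(2, 3)))                 # average-pooling branch
        mx = self.mlp(t.amax(dim=(2, 3)))                  # max-pooling branch
        return torch.sigmoid(avg + mx).view(t.size(0), -1, 1, 1)

class PlaneAttention(nn.Module):
    """Formulas (13)-(14) sketch: pool over the channel dimension, concatenate
    the two maps, apply a 1 x 1 convolution and a Sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (B, C, H, W)
        avg = t.mean(dim=1, keepdim=True)
        mx = t.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

# Applying the modules as in formulas (6)-(7): weight the channels first,
# then weight the spatial positions.
t3 = torch.rand(2, 16, 7, 7)
t4 = t3 * ChannelAttention(16)(t3)
t4 = t4 * PlaneAttention()(t4)
print(t4.shape)                                            # torch.Size([2, 16, 7, 7])
```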
Further optionally, step S6 is executed to obtain the intermediate feature map T8, and the specific operation is:
Step S6.1, perform a 3D convolution operation on the intermediate dish feature map T7 with a convolution kernel of size ρ × z × z to obtain the one-dimensional intermediate feature map T8, i.e. each channel of the intermediate feature map T8 contains only one pixel point, wherein ρ is the side length of the convolution along the channel direction and z × z is the size of the convolution window;
Step S6.2, when the 3D convolution is performed on the intermediate dish feature map T7, the number of convolution kernels used is η, the vector length input to the convolution is α, and the vector length α' after the convolution is obtained using formula (15):
α' = [(α - ρ) + 1] × η   formula (15).
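A small sketch of the flattening convolution of step S6 under illustrative parameter values: a kernel covering the full z × z spatial extent leaves one value per output position, so η kernels yield a one-dimensional vector of length [(α − ρ) + 1] × η, matching formula (15).

```python
import torch
import torch.nn as nn

# Step S6 sketch: a (rho, z, z) kernel over an (alpha, z, z) volume keeps only
# (alpha - rho) + 1 positions along the channel axis and 1 x 1 spatially.
alpha, z, rho, eta = 16, 3, 4, 6
flatten_conv = nn.Conv3d(in_channels=1, out_channels=eta, kernel_size=(rho, z, z))

t7 = torch.rand(1, 1, alpha, z, z)
t8 = flatten_conv(t7).flatten(start_dim=1)
print(t8.shape)            # torch.Size([1, 78]) = [(16 - 4) + 1] * 6
```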
Further optionally, step S7 is executed to obtain the classification result of the dish image R1, and the specific operation is:
Step S7.1, select a deep convolutional neural network whose activation function is the softmax function shown in formula (16), the softmax function being preceded by one layer of neural network:
softmax(Yi) = e^(Yi) / Σj e^(Yj)   formula (16),
wherein Yi denotes the i-th element of the vector T;
Step S7.2, after the intermediate feature map T8 is input into the deep convolutional neural network, a vector T is obtained through one layer of neural network, the vector T enters the softmax function, and the softmax function maps the elements of the vector T into the (0, 1) interval to obtain the probability vector of the vector T; the name of the dish image R1 is then the name corresponding to the maximum probability value in the probability vector obtained from the softmax mapping.
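The classification of step S7 amounts to one fully connected layer followed by the softmax of formula (16); a minimal sketch, where the vector length 78 carries over from the previous sketch and the number of dish classes is an illustrative assumption:

```python
import torch
import torch.nn as nn

# Step S7 sketch: one layer of neural network maps T8 to the score vector T,
# softmax maps T into (0, 1) probabilities, and the predicted dish is the
# class with the largest probability value.
num_classes = 50
head = nn.Linear(78, num_classes)

t8 = torch.rand(1, 78)
probs = torch.softmax(head(t8), dim=1)
predicted_class = probs.argmax(dim=1)
print(round(probs.sum().item(), 4), predicted_class)   # probabilities sum to 1.0
```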
Compared with the prior art, the restaurant dish identification method based on the deep convolutional neural network has the following beneficial effects:
the method extracts the feature map of the dish sample block through the deep convolutional neural network and the attention modules and obtains the name of the original dish image through the softmax mapping; it has the advantage of high identification precision, reduces the workload of manual auxiliary operation, reduces the risk of overfitting, and overcomes the difficulty that existing dish identification and classification methods have in finely distinguishing dishes with high similarity.
Drawings
FIG. 1 is a simplified flow chart of a method according to a first embodiment of the present invention;
FIG. 2 is a simplified flow chart of obtaining the intermediate feature map T4 in the first embodiment of the present invention.
Detailed Description
In order to make the technical scheme, the technical problems to be solved and the technical effects of the present invention clearer, the technical scheme of the present invention is described clearly and completely below with reference to specific embodiments.
The first embodiment is as follows:
With reference to FIG. 1 and FIG. 2, the present embodiment provides a restaurant dish identification method based on a deep convolutional neural network, the implementation content of which includes:
Step S1, collect a dish image R1, perform a cutting preprocessing operation on the dish image R1 to obtain a dish image R2, and take sample blocks from the dish image R2 to obtain a dish sample block T1; the dish sample block T1 is the characteristic information of the dish sample.
In this step, the specific operation of taking sample blocks from the dish image R2 comprises the following steps:
Step S1.1, in the plane dimension, for each pixel point of the dish image R2, take the surrounding a × a pixel points as the neighborhood block of the sample, wherein a is the number of pixel points of the image block along the plane length and width directions;
Step S1.2, retain all channel information of the a × a pixel points, i.e. form a three-dimensional sample block of size P × a × a that represents the sample feature of the center pixel point, and perform the feature transformation of the sample-block taking process using formula (1):
(Formula (1) is shown only as an image in the original; it defines the sample-block taking transform Dsamp that maps the dish image R2 to the Q sample blocks T1 of size P × a × a.)
wherein Q is the number of pixel points in a single channel, which is also the number of block samples, Dsamp denotes the sample-block taking process, and L and H denote the preset plane length and width of the cutting operation;
in the sample-block taking operation, when an edge pixel point has no spatial neighborhood information, a zero-padding operation is performed.
Step S2, perform a 3D convolution operation on the dish sample block T1 to obtain the intermediate feature map T2 of the dish sample block T1, and the specific operation comprises the following steps:
Step S2.1, based on the deep convolutional neural network, select h different convolution kernels in each layer of the convolutional neural network, and perform a convolution operation on the P channels of information contained in the dish sample block T1 using 3D convolution kernels of size e × f × f, wherein e is the number of operation layers in the channel dimension, i.e. e channels are selected for each group of convolution, and f is the number of pixel points of the image block along the length and width directions of the spatial dimension;
Step S2.2, after h different convolution kernels are selected in each layer of the convolutional neural network, the intermediate feature map T2 of the dish sample block T1 is obtained using formulas (2), (3) and (4):
P' = [(P - e) + 1] × h   formula (2),
m = [(a - e) + 1]   formula (3),
T2^(P'×m×m) = Con3D(T1)   formula (4),
wherein P denotes the number of channels contained in the dish sample block T1, e is the number of operation layers in the channel dimension, a is the number of pixel points of the image block along the plane length and width directions, m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions, and Con3D denotes performing a 3D convolution operation.
In this step, in the process of performing the 3D convolution operation on the dish sample block T1, the mapping of each feature in a convolutional layer is connected to several adjacent continuous channels of the previous layer, and a given position value of one convolution mapping is obtained by convolving the local receptive fields at the same position of three continuous channels of the previous layer; one convolutional layer has several convolution kernels, one convolution kernel can extract only one type of feature information from the three-dimensional data, and h types of feature information can be extracted using h convolution kernels, wherein h is a positive integer and h > 1.
Step S3, perform a pooling operation on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3, and the specific operation comprises the following steps:
Step S3.1, perform a pooling operation, i.e. a downsampling or feature-discarding process, on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3; at this time, the number of channels of the intermediate feature map T3 is the same as the number of channels of the intermediate feature map T2, while the size of a single channel in the spatial dimension changes;
Step S3.2, after the pooling process, the intermediate feature map T3 is denoted by T3^(p×r×r), i.e. the number of pixel points of each channel of the intermediate feature map T3 along the spatial length and width directions is r, and the number of pixel points r is calculated using formula (5):
r = (m ÷ 2)   formula (5),
wherein m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions.
Step S4, perform a pooling operation on the intermediate feature map T3 in the spatial dimension to obtain the channel attention module A3, perform a pooling operation on the intermediate feature map T3 in the channel dimension to obtain the plane attention module A'3, and multiply each channel vector of the intermediate feature map T3 by the channel attention module and each spatial feature of the intermediate feature map T3 position by position by the plane attention module, to obtain the intermediate feature map T4.
Step S4 includes two aspects: on the one hand obtaining the channel attention module A3 and the plane attention module A'3, and on the other hand obtaining the intermediate feature map T4.
First, the specific process of obtaining the channel attention module A3 and the plane attention module A'3 is described.
(I) Obtaining the channel attention module:
(1.1) First, perform maximum pooling and average pooling operations on the intermediate feature map T3 in the spatial dimension, respectively, to generate two pooling vectors;
(1.2) Input the two pooling vectors into a shared multilayer mapping neural network for training, generating two new vectors respectively;
(1.3) Finally, add the two new vectors bitwise and perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the channel attention module A3(T3) using formulas (11) and (12):
(Formula (11) is shown only as an image in the original and is not reproduced here.)
A3(T3) = σ{MLP[AvePool(T3)] + MLP[MaxPool(T3)]}   formula (12),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes the nonlinear mapping performed through a multilayer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
(II) Obtaining the plane attention module:
(2.1) First, perform maximum pooling and average pooling operations on the intermediate feature map T3 in the channel dimension, respectively, to generate two pooling vectors;
(2.2) Then, map the two pooling vectors to a single-channel feature map of the same size through a convolution operation;
(2.3) Finally, perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the plane attention module A'3(T3) using formulas (13) and (14):
(Formula (13) is shown only as an image in the original and is not reproduced here.)
A'3(T3) = σ{f^(1×1)[AvePool(T3); MaxPool(T3)]}   formula (14),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f^(1×1) denotes the feature transformation performed by a 1 × 1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
Next, the specific operation of obtaining the intermediate feature map T4 is described:
transform the intermediate feature map T3 using formulas (6) and (7), i.e. the intermediate feature map T3 is first multiplied point by point with the channel attention module A3 in the channel direction, and then multiplied point by point with the plane attention module A'3 in the spatial direction, to obtain the intermediate feature map T4:
Atenspe(T3) = A3(T3) ⊗ T3   formula (6),
T4 = Atenspa[Atenspe(T3)] = A'3(T3) ⊗ Atenspe(T3)   formula (7),
wherein Atenspe denotes attention enhancement of the intermediate feature map T3 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T3 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T3, r is the number of pixel points of a single channel of the intermediate feature map T3 along the spatial length and width directions, p is the number of channels of the intermediate feature map T3, v is the v-th channel of the intermediate feature map T3, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
Step S5, perform a 3D convolution operation and a pooling operation in sequence on the intermediate feature map T4 to obtain the intermediate feature map T6, perform a pooling operation on the intermediate feature map T6 in the spatial dimension to obtain the channel attention module A6, perform a pooling operation on the intermediate feature map T6 in the channel dimension to obtain the plane attention module A'6, and multiply each channel vector of the intermediate feature map T6 by the channel attention module and each spatial feature of the intermediate feature map T6 position by position by the plane attention module, to obtain the intermediate feature map T7.
The specific operation of implementing step S5 is:
Step S5.1, perform a 3D convolution operation on the intermediate feature map T4 using formula (8) to obtain the intermediate feature map T5, the intermediate feature map T5 being denoted by T5^(x×y×y):
T5^(x×y×y) = Con3D(T4^(p×r×r))   formula (8),
wherein Con3D denotes a 3D convolution operation, x denotes the number of pixel points of the intermediate feature map T5 along the spatial height direction, y denotes the number of pixel points of the intermediate feature map T5 along the spatial length and width directions, r is the number of pixel points of a single channel of the intermediate feature map T4 along the spatial length and width directions, and p is the number of channels of the intermediate feature map T4;
Step S5.2, perform a downsampling operation on the intermediate feature map T5 to obtain the intermediate feature map T6; at this time, the number of channels of the intermediate feature map T6 is the same as the number of channels of the intermediate feature map T5, while the size of a single channel in the spatial dimension changes, and the size of a single channel of the intermediate feature map T6 along the spatial length and width directions is:
z × z = [(y ÷ 2) × (y ÷ 2)],
wherein z is the number of pixel points of the intermediate feature map T6 along the spatial length and width directions, and y is the number of pixel points of the intermediate feature map T5 along the spatial length and width directions;
Step S5.3, perform a feature transformation on the intermediate feature map T6 using formulas (9) and (10) to obtain the intermediate feature map T7, the intermediate feature map T7 being denoted by T7^(x×z×z):
Atenspe(T6) = A6(T6) ⊗ T6   formula (9),
T7 = Atenspa[Atenspe(T6)] = A'6(T6) ⊗ Atenspe(T6)   formula (10),
wherein Atenspe denotes attention enhancement of the intermediate feature map T6 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T6 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T6, z is the number of pixel points of a single channel of the intermediate feature map T6 along the spatial length and width directions, x is the number of channels of the intermediate feature map T6, v is the v-th channel of the intermediate feature map T6, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
The specific operations of obtaining the channel attention module A6 and the plane attention module A'6 in formula (10) are as follows:
(I) Obtaining the channel attention module A6:
(1.1) First, perform maximum pooling and average pooling operations on the intermediate feature map T6 in the spatial dimension, respectively, to generate two pooling vectors;
(1.2) Input the two pooling vectors into a shared multilayer mapping neural network for training, generating two new vectors respectively;
(1.3) Finally, add the two new vectors bitwise and perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the channel attention module A6(T6) using formulas (11') and (12'):
(Formula (11') is shown only as an image in the original and is not reproduced here.)
A6(T6) = σ{MLP[AvePool(T6)] + MLP[MaxPool(T6)]}   formula (12'),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes the nonlinear mapping performed through a multilayer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling;
(II) Obtaining the plane attention module:
(2.1) First, perform maximum pooling and average pooling operations on the intermediate feature map T6 in the channel dimension, respectively, to generate two pooling vectors;
(2.2) Then, map the two pooling vectors to a single-channel feature map of the same size through a convolution operation;
(2.3) Finally, perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the plane attention module A'6(T6) using formulas (13') and (14'):
(Formula (13') is shown only as an image in the original and is not reproduced here.)
A'6(T6) = σ{f^(1×1)[AvePool(T6); MaxPool(T6)]}   formula (14'),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f^(1×1) denotes the feature transformation performed by a 1 × 1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
Step S6, perform a 3D convolution operation on the intermediate feature map T7 to obtain the one-dimensional intermediate feature map T8, and the specific operations are as follows:
Step S6.1, perform a 3D convolution operation on the intermediate dish feature map T7 with a convolution kernel of size ρ × z × z to obtain the one-dimensional intermediate feature map T8, i.e. each channel of the intermediate feature map T8 contains only one pixel point, wherein ρ is the side length of the convolution along the channel direction and z × z is the size of the convolution window;
Step S6.2, when the 3D convolution is performed on the intermediate dish feature map T7, the number of convolution kernels used is η, the vector length input to the convolution is α, and the vector length α' after the convolution is obtained using formula (15):
α' = [(α - ρ) + 1] × η   formula (15).
Step S7, input the intermediate feature map T8 into the deep convolutional neural network to obtain the classification result of the dish image R1, and the specific operations are as follows:
Step S7.1, select a deep convolutional neural network whose activation function is the softmax function shown in formula (16), the softmax function being preceded by one layer of neural network:
softmax(Yi) = e^(Yi) / Σj e^(Yj)   formula (16),
wherein Yi denotes the i-th element of the vector T;
Step S7.2, after the intermediate feature map T8 is input into the deep convolutional neural network, a vector T is obtained through one layer of neural network, the vector T enters the softmax function, and the softmax function maps the elements of the vector T into the (0, 1) interval to obtain the probability vector of the vector T; the name of the dish image R1 is then the name corresponding to the maximum probability value in the probability vector obtained from the softmax mapping.
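To close the embodiment, the sketch below chains the illustrative pieces above into a single end-to-end module; the attention modules of steps S4 and S5 are omitted for brevity (they would be applied after each pooling stage as shown earlier), steps S6-S7 are approximated by flattening plus one linear layer, and all layer sizes are assumptions chosen only so that the shapes compose, not parameters disclosed by the patent.

```python
import torch
import torch.nn as nn

class DishNetSketch(nn.Module):
    """Illustrative end-to-end sketch of steps S2-S7 for a batch of sample blocks."""
    def __init__(self, num_classes: int = 50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(3, 3, 3)),    # step S2: 3D convolution
            nn.MaxPool3d((1, 2, 2)),                    # step S3: spatial pooling
            nn.Conv3d(8, 16, kernel_size=(3, 3, 3)),    # step S5: 3D convolution
            nn.MaxPool3d((1, 2, 2)),                    # step S5: spatial pooling
        )
        self.classifier = nn.LazyLinear(num_classes)    # steps S6-S7 approximation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x).flatten(start_dim=1)
        return torch.softmax(self.classifier(x), dim=1) # formula (16)

model = DishNetSketch()
probs = model(torch.rand(2, 1, 16, 33, 33))   # two sample blocks, P = 16, a = 33
print(probs.shape)                             # torch.Size([2, 50])
```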
In conclusion, the restaurant dish identification method based on the deep convolutional neural network can improve identification precision, reduce the workload of manual auxiliary operation, and overcome the difficulty that existing dish identification and classification methods have in finely distinguishing dishes with high similarity.
The principles and embodiments of the present invention have been described in detail using specific examples, which are provided only to aid in understanding the core technical content of the present invention. Any improvements and modifications made to the present invention by those skilled in the art on the basis of the above embodiments, without departing from the principle of the present invention, shall fall within the protection scope of the present invention.

Claims (10)

1. A restaurant dish identification method based on a deep convolutional neural network, characterized by comprising the following implementation contents:
Step S1, collecting a dish image R1, performing a cutting preprocessing operation on the dish image R1 to obtain a dish image R2, and taking sample blocks from the dish image R2 to obtain a dish sample block T1, the dish sample block T1 being the characteristic information of the dish sample;
Step S2, performing a 3D convolution operation on the dish sample block T1 to obtain an intermediate feature map T2 of the dish sample block T1;
Step S3, performing a pooling operation on the intermediate feature map T2 of the dish sample block T1 to obtain an intermediate feature map T3;
Step S4, performing a pooling operation on the intermediate feature map T3 in the spatial dimension to obtain a channel attention module A3, performing a pooling operation on the intermediate feature map T3 in the channel dimension to obtain a plane attention module A'3, and multiplying each channel vector of the intermediate feature map T3 by the channel attention module and each spatial feature of the intermediate feature map T3 position by position by the plane attention module, to obtain an intermediate feature map T4;
Step S5, performing a 3D convolution operation and a pooling operation in sequence on the intermediate feature map T4 to obtain an intermediate feature map T6, performing a pooling operation on the intermediate feature map T6 in the spatial dimension to obtain a channel attention module A6, performing a pooling operation on the intermediate feature map T6 in the channel dimension to obtain a plane attention module A'6, and multiplying each channel vector of the intermediate feature map T6 by the channel attention module and each spatial feature of the intermediate feature map T6 position by position by the plane attention module, to obtain an intermediate feature map T7;
Step S6, performing a 3D convolution operation on the intermediate feature map T7 to obtain a one-dimensional intermediate feature map T8;
Step S7, inputting the intermediate feature map T8 into a deep convolutional neural network to obtain the classification result of the dish image R1.
2. The restaurant dish identification method based on a deep convolutional neural network of claim 1, wherein in step S1, sample blocks are taken from the dish image R2, and the specific operation comprises the following steps:
Step S1.1, in the plane dimension, for each pixel point of the dish image R2, taking the surrounding a × a pixel points as the neighborhood block of the sample, wherein a is the number of pixel points of the image block along the plane length and width directions;
Step S1.2, retaining all channel information of the a × a pixel points, i.e. forming a three-dimensional sample block of size P × a × a that represents the sample feature of the center pixel point, and performing the feature transformation of the sample-block taking process using formula (1):
(Formula (1) is shown only as an image in the original; it defines the sample-block taking transform Dsamp that maps the dish image R2 to the Q sample blocks T1 of size P × a × a.)
wherein Q is the number of pixel points in a single channel, which is also the number of block samples, Dsamp denotes the sample-block taking process, and L and H denote the preset plane length and width of the cutting operation;
in the sample-block taking operation, when an edge pixel point has no spatial neighborhood information, a zero-padding operation is performed.
3. The restaurant dish identification method based on a deep convolutional neural network of claim 2, wherein in step S2, a 3D convolution operation is performed on the dish sample block T1, and the specific operation comprises:
Step S2.1, based on the deep convolutional neural network, selecting h different convolution kernels in each layer of the convolutional neural network, and performing a convolution operation on the P channels of information contained in the dish sample block T1 using 3D convolution kernels of size e × f × f, wherein e is the number of operation layers in the channel dimension, i.e. e channels are selected for each group of convolution, and f is the number of pixel points of the image block along the length and width directions of the spatial dimension;
Step S2.2, after h different convolution kernels are selected in each layer of the convolutional neural network, obtaining the intermediate feature map T2 of the dish sample block T1 using formulas (2), (3) and (4):
P' = [(P - e) + 1] × h   formula (2),
m = [(a - e) + 1]   formula (3),
T2^(P'×m×m) = Con3D(T1)   formula (4),
wherein P denotes the number of channels contained in the dish sample block T1, e is the number of operation layers in the channel dimension, a is the number of pixel points of the image block along the plane length and width directions, m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions, and Con3D denotes performing a 3D convolution operation.
4. The restaurant dish identification method based on a deep convolutional neural network of claim 3, wherein in the process of performing the 3D convolution operation on the dish sample block T1, the mapping of each feature in a convolutional layer is connected to several adjacent continuous channels of the previous layer, and a given position value of one convolution mapping is obtained by convolving the local receptive fields at the same position of three continuous channels of the previous layer; one convolutional layer has several convolution kernels, one convolution kernel can extract only one type of feature information from the three-dimensional data, and h types of feature information can be extracted using h convolution kernels, wherein h is a positive integer and h > 1.
5. The restaurant dish identification method based on a deep convolutional neural network of claim 3, wherein step S3 is executed to obtain the intermediate feature map T3, and the specific operation comprises:
Step S3.1, performing a pooling operation, i.e. a downsampling or feature-discarding process, on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3; at this time, the number of channels of the intermediate feature map T3 is the same as the number of channels of the intermediate feature map T2, while the size of a single channel in the spatial dimension changes;
Step S3.2, after the pooling process, the intermediate feature map T3 is denoted by T3^(p×r×r), i.e. the number of pixel points of each channel of the intermediate feature map T3 along the spatial length and width directions is r, and the number of pixel points r is calculated using formula (5):
r = (m ÷ 2)   formula (5),
wherein m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions.
6. The restaurant dish identification method based on a deep convolutional neural network of claim 5, wherein step S4 is executed to obtain the intermediate feature map T4, and the specific operation is:
transforming the intermediate feature map T3 using formulas (6) and (7), i.e. the intermediate feature map T3 is first multiplied point by point with the channel attention module A3 in the channel direction, and then multiplied point by point with the plane attention module A'3 in the spatial direction, to obtain the intermediate feature map T4:
Atenspe(T3) = A3(T3) ⊗ T3   formula (6),
T4 = Atenspa[Atenspe(T3)] = A'3(T3) ⊗ Atenspe(T3)   formula (7),
wherein Atenspe denotes attention enhancement of the intermediate feature map T3 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T3 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T3, r is the number of pixel points of a single channel of the intermediate feature map T3 along the spatial length and width directions, p is the number of channels of the intermediate feature map T3, v is the v-th channel of the intermediate feature map T3, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
7. The restaurant dish identification method based on a deep convolutional neural network of claim 6, wherein step S5 is executed to obtain the intermediate feature map T7, and the specific operation is:
Step S5.1, performing a 3D convolution operation on the intermediate feature map T4 using formula (8) to obtain the intermediate feature map T5, the intermediate feature map T5 being denoted by T5^(x×y×y):
T5^(x×y×y) = Con3D(T4^(p×r×r))   formula (8),
wherein Con3D denotes a 3D convolution operation, x denotes the number of pixel points of the intermediate feature map T5 along the spatial height direction, y denotes the number of pixel points of the intermediate feature map T5 along the spatial length and width directions, r is the number of pixel points of a single channel of the intermediate feature map T4 along the spatial length and width directions, and p is the number of channels of the intermediate feature map T4;
Step S5.2, performing a downsampling operation on the intermediate feature map T5 to obtain the intermediate feature map T6; at this time, the number of channels of the intermediate feature map T6 is the same as the number of channels of the intermediate feature map T5, while the size of a single channel in the spatial dimension changes, and the size of a single channel of the intermediate feature map T6 along the spatial length and width directions is:
z × z = [(y ÷ 2) × (y ÷ 2)],
wherein z is the number of pixel points of the intermediate feature map T6 along the spatial length and width directions, and y is the number of pixel points of the intermediate feature map T5 along the spatial length and width directions;
Step S5.3, performing a feature transformation on the intermediate feature map T6 using formulas (9) and (10) to obtain the intermediate feature map T7, the intermediate feature map T7 being denoted by T7^(x×z×z):
Atenspe(T6) = A6(T6) ⊗ T6   formula (9),
T7 = Atenspa[Atenspe(T6)] = A'6(T6) ⊗ Atenspe(T6)   formula (10),
wherein Atenspe denotes attention enhancement of the intermediate feature map T6 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T6 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T6, z is the number of pixel points of a single channel of the intermediate feature map T6 along the spatial length and width directions, x is the number of channels of the intermediate feature map T6, v is the v-th channel of the intermediate feature map T6, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
8. The restaurant dish identification method based on the deep convolutional neural network of claim 7, wherein when step S4 or S5 is executed, the specific operations of the channel attention module and the plane attention module are:
obtaining a channel attention module:
(1.1) first, the intermediate feature map T is centered in the spatial dimensioniPerforming maximum pooling and average pooling operations, respectively, to generate two pooling vectors, wherein i has a value of 3 or 6,
(1.2) inputting the two pooled vectors into a shared multilayer mapping neural network for training to respectively generate two new vectors,
(1.3) finally, carrying out bitwise addition on the two new vectors, and carrying out nonlinear mapping through a Sigmoid activation function, namely obtaining a channel attention module A by using the formulas (11) and (12)i(Ti),
(Formula (11) is shown only as an image in the original and is not reproduced here.)
Ai(Ti) = σ{MLP[AvePool(Ti)] + MLP[MaxPool(Ti)]}   formula (12),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes the nonlinear mapping performed through a multilayer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling;
(II) Obtaining the plane attention module:
(2.1) First, performing maximum pooling and average pooling operations on the intermediate feature map Ti in the channel dimension, respectively, to generate two pooling vectors;
(2.2) Then, mapping the two pooling vectors to a single-channel feature map of the same size through a convolution operation;
(2.3) Finally, performing a nonlinear mapping through a Sigmoid activation function, i.e. obtaining the plane attention module A'i(Ti) using formulas (13) and (14):
(Formula (13) is shown only as an image in the original and is not reproduced here.)
A'i(Ti) = σ{f^(1×1)[AvePool(Ti); MaxPool(Ti)]}   formula (14),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f^(1×1) denotes the feature transformation performed by a 1 × 1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
9. The restaurant dish identification method based on a deep convolutional neural network of claim 1, wherein step S6 is executed to obtain the intermediate feature map T8, and the specific operation is:
Step S6.1, performing a 3D convolution operation on the intermediate dish feature map T7 with a convolution kernel of size ρ × z × z to obtain the one-dimensional intermediate feature map T8, i.e. each channel of the intermediate feature map T8 contains only one pixel point, wherein ρ is the side length of the convolution along the channel direction and z × z is the size of the convolution window;
Step S6.2, when the 3D convolution is performed on the intermediate dish feature map T7, the number of convolution kernels used is η, the vector length input to the convolution is α, and the vector length α' after the convolution is obtained using formula (15):
α' = [(α - ρ) + 1] × η   formula (15).
10. The restaurant dish identification method based on a deep convolutional neural network of claim 1, wherein step S7 is executed to obtain the classification result of the dish image R1, and the specific operation is:
Step S7.1, selecting a deep convolutional neural network whose activation function is the softmax function shown in formula (16), the softmax function being preceded by one layer of neural network:
softmax(Yi) = e^(Yi) / Σj e^(Yj)   formula (16),
wherein Yi denotes the i-th element of the vector T;
Step S7.2, after the intermediate feature map T8 is input into the deep convolutional neural network, a vector T is obtained through one layer of neural network, the vector T enters the softmax function, and the softmax function maps the elements of the vector T into the (0, 1) interval to obtain the probability vector of the vector T; the name of the dish image R1 is then the name corresponding to the maximum probability value in the probability vector obtained from the softmax mapping.
CN202110006146.7A 2021-01-05 2021-01-05 Restaurant dish identification method based on deep convolutional neural network Active CN112699822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110006146.7A CN112699822B (en) 2021-01-05 2021-01-05 Restaurant dish identification method based on deep convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110006146.7A CN112699822B (en) 2021-01-05 2021-01-05 Restaurant dish identification method based on deep convolutional neural network

Publications (2)

Publication Number Publication Date
CN112699822A true CN112699822A (en) 2021-04-23
CN112699822B CN112699822B (en) 2023-05-30

Family

ID=75514577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110006146.7A Active CN112699822B (en) 2021-01-05 2021-01-05 Restaurant dish identification method based on deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN112699822B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845527A (en) * 2016-12-29 2017-06-13 南京江南博睿高新技术研究院有限公司 A kind of vegetable recognition methods
CN107578060A (en) * 2017-08-14 2018-01-12 电子科技大学 A kind of deep neural network based on discriminant region is used for the method for vegetable image classification
CN109377205A (en) * 2018-12-06 2019-02-22 深圳市淘米科技有限公司 A kind of cafeteria's intelligence settlement system based on depth convolutional network
CN110689056A (en) * 2019-09-10 2020-01-14 Oppo广东移动通信有限公司 Classification method and device, equipment and storage medium
CN111667489A (en) * 2020-04-30 2020-09-15 华东师范大学 Cancer hyperspectral image segmentation method and system based on double-branch attention deep learning

Also Published As

Publication number Publication date
CN112699822B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN112329800B (en) Salient object detection method based on global information guiding residual attention
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN109543502B (en) Semantic segmentation method based on deep multi-scale neural network
CN104834922B (en) Gesture identification method based on hybrid neural networks
CN109117703B (en) Hybrid cell type identification method based on fine-grained identification
CN109376611A (en) A kind of saliency detection method based on 3D convolutional neural networks
CN110163286B (en) Hybrid pooling-based domain adaptive image classification method
CN105701508A (en) Global-local optimization model based on multistage convolution neural network and significant detection algorithm
CN111709909A (en) General printing defect detection method based on deep learning and model thereof
CN113239954B (en) Attention mechanism-based image semantic segmentation feature fusion method
CN110992374B (en) Hair refinement segmentation method and system based on deep learning
CN110751072B (en) Double-person interactive identification method based on knowledge embedded graph convolution network
CN113657528B (en) Image feature point extraction method and device, computer terminal and storage medium
CN111160356A (en) Image segmentation and classification method and device
CN111401426A (en) Small sample hyperspectral image classification method based on pseudo label learning
CN112257741A (en) Method for detecting generative anti-false picture based on complex neural network
CN110991563A (en) Capsule network random routing algorithm based on feature fusion
CN107480471A (en) The method for the sequence similarity analysis being characterized based on wavelet transformation
CN110490210B (en) Color texture classification method based on t sampling difference between compact channels
CN112699822A (en) Restaurant dish identification method based on deep convolutional neural network
CN106446909A (en) Chinese food image feature extraction method
CN116229455A (en) Pinellia ternate origin identification method and system based on multi-scale feature deep neural network
CN114419341B (en) Convolutional neural network image recognition method based on transfer learning improvement
CN115775226A (en) Transformer-based medical image classification method
CN114998756A (en) Yolov 5-based remote sensing image detection method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant