CN112699822B - Restaurant dish identification method based on deep convolutional neural network - Google Patents
- Publication number
- CN112699822B (application CN202110006146.7A)
- Authority
- CN
- China
- Prior art keywords
- channel
- convolution
- characteristic spectrum
- dish
- neural network
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a restaurant dish identification method based on a deep convolutional neural network, relating to the technical field of deep learning. It addresses the problem that current dish identification and classification methods struggle to finely distinguish dishes when the dishes are highly similar, and adopts the following technical scheme: a dish image is captured and cropped, and sample blocks are extracted from it; the sample blocks first undergo a 3D convolution operation and a downsampling operation, a feature map is then refined through an attention module, and a further 3D convolution operation yields a one-dimensional intermediate feature map, which is input into a deep convolutional neural network ending in a softmax function; the classification result of the original dish image is obtained from the probability values produced by the softmax mapping. The invention improves identification accuracy, reduces the workload of manual auxiliary operation, and overcomes the difficulty current dish identification and classification methods have in finely distinguishing highly similar dishes.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to a restaurant dish identification method based on a deep convolutional neural network.
Background
Food is an important component of human life and an essential precondition for human survival and healthy development. As society develops, people's demands on food quality keep rising, which has greatly promoted the catering industry: enterprise canteens prepare a wide variety of dishes and new dishes appear continuously, so accurately identifying many kinds of dishes is an increasingly pressing requirement. At the same time, the dishes in many enterprise canteens are so varied that manual settlement is quite inefficient. With the rapid development of the mobile internet, large numbers of dish images can be quickly obtained from massive network image data and used as a data source for analysis and modeling, yielding a general model for classifying, segmenting and identifying dish images. This is of great significance for saving labor cost and improving canteen settlement efficiency.
Image classification, in short, distinguishes images by finding the common features shared within each class among a large number of images. A model can correctly distinguish images only if it can find such features.
Classifying dish images was already an active research direction before deep learning technology matured. Deep learning is a machine learning method based on representation learning of data; it builds multi-layer neural networks that imitate the analytical learning of the human brain to interpret data such as images, sounds and texts, and it has been widely applied in the field of image recognition. Because deep learning can extract more abstract and deeper features from an image, it has stronger classification ability than traditional methods. Convolutional neural networks perform well in image classification; however, the amount of input information and the classification effect are not perfectly positively correlated. For a given model, an overly complicated input not only prolongs training and classification time but may leave accuracy unchanged or even decrease it. It is therefore necessary to study the feature extraction process before convolutional-neural-network classification in depth, so that the features can be adaptively refined at low cost.
Disclosure of Invention
Aiming at characteristics of current dish identification and classification tasks, such as the high similarity of some dishes and their uneven ingredient combinations, the invention studies the feature extraction process before convolutional-neural-network classification in depth and achieves adaptive refinement of the features at low cost. It also optimizes the loss function for restaurant dish classification, enhancing the robustness of the algorithm and reducing the risk of overfitting. To further improve identification accuracy and reduce the workload of manual auxiliary operation, the invention provides a restaurant dish identification method based on a deep convolutional neural network.
To solve the above technical problems, the restaurant dish identification method based on a deep convolutional neural network disclosed by the invention adopts the following technical scheme:
The restaurant dish identification method based on the deep convolutional neural network is characterized by comprising the following steps:
Step S1: collect a dish image R1; perform a cropping preprocessing operation on R1 to obtain a dish image R2; extract sample blocks from R2 to obtain dish sample blocks T1, where each block T1 carries the feature information of a dish sample;
Step S2: perform a 3D convolution operation on the dish sample block T1 to obtain its intermediate feature map T2;
Step S3: perform a pooling operation on the intermediate feature map T2 to obtain an intermediate feature map T3;
Step S4: pool the intermediate feature map T3 in the spatial dimension to obtain a channel attention module A3, and pool T3 in the channel dimension to obtain a planar attention module A'3; multiply each channel vector of T3 element-wise with the channel attention module, and each spatial feature of T3 element-wise with the planar attention module, to obtain an intermediate feature map T4;
Step S5: perform a 3D convolution operation and a pooling operation on T4 in sequence to obtain an intermediate feature map T6; pool T6 in the spatial dimension to obtain a channel attention module A6, and pool T6 in the channel dimension to obtain a planar attention module A'6; multiply each channel vector of T6 element-wise with the channel attention module, and each spatial feature of T6 element-wise with the planar attention module, to obtain an intermediate feature map T7;
Step S6: perform a 3D convolution operation on T7 to obtain a one-dimensional intermediate feature map T8;
Step S7: input T8 into a deep convolutional neural network to obtain the classification result of the dish image R1.
Optionally, when performing step S1, extracting sample blocks from the dish image R2 specifically comprises:
Step S1.1: in the plane dimension, take the surrounding a×a pixel points of each pixel of R2 as the neighborhood block of the sample, where a is the number of pixels of the image block in the planar length and width directions;
Step S1.2: retain all channel information of the a×a pixel points, forming a p×a×a three-dimensional sample block that represents the sample features of the central pixel; the feature transformation of the sample-block extraction is given by formula (1):
T1^(p×a×a) = Dsamp(R2^(p×L×H)), Q = L×H, formula (1),
where Q is the number of pixels in a single channel, i.e. the number of extracted blocks, Dsamp denotes the sample-block extraction process, and L and H are the preset planar length and width of the cropping operation;
In the sample-block extraction, when an edge pixel lacks spatial neighborhood information, zero padding is applied.
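The sample-block extraction of steps S1.1 and S1.2 can be sketched as follows. This is a minimal NumPy illustration (not the patented implementation), assuming an a×a window centered on each pixel with zero padding at the borders, so that Q = L×H blocks are produced:

```python
import numpy as np

def sample_blocks(img, a):
    """Extract an a x a spatial neighborhood (all p channels) around every
    pixel of a p x L x H image, zero-padding at the borders (step S1).
    Returns Q = L*H blocks of shape (p, a, a)."""
    p, L, H = img.shape
    pad = a // 2
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)))
    blocks = [padded[:, i:i + a, j:j + a]
              for i in range(L) for j in range(H)]
    return np.stack(blocks)  # (Q, p, a, a)

img = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)  # p=3, L=H=4
blocks = sample_blocks(img, a=3)
print(blocks.shape)  # (16, 3, 3, 3): Q = L*H = 16
```

For the corner pixel the top-left entries of its block come entirely from the zero padding, which is the "0 supplementing" behavior the text describes.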
Further optionally, performing step S2, the 3D convolution operation on the dish sample block T1 specifically comprises:
Step S2.1: based on the deep convolutional neural network, select h different convolution kernels in each convolutional layer, and convolve the P channels of information contained in the dish sample block T1 with 3D convolution kernels of size e×f×f, where e is the number of operation layers in the channel dimension (e channels are selected for each group of convolutions) and f is the number of pixels of the image block in the spatial length and width directions;
Step S2.2: after selecting h different convolution kernels in each convolutional layer, obtain the intermediate feature map T2 of the dish sample block T1 using formulas (2), (3) and (4):
p = [(P − e) + 1] × h, formula (2),
m = [(a − e) + 1], formula (3),
T2^(p×m×m) = Con3D(T1^(P×a×a)), formula (4),
where p is the number of channels of the intermediate feature map T2 of the dish sample block T1, e is the number of operation layers in the channel dimension, a is the number of pixels of the image block in the planar length and width directions, m is the number of pixels of T2 in the spatial length and width directions, and Con3D denotes performing the 3D convolution operation.
Further optionally, in the 3D convolution operation on the dish sample block T1, each feature map in a convolutional layer is connected to several adjacent continuous channels in the previous layer, and each position value of one convolution map is obtained by convolving the local receptive fields at the same position of three continuous channels of the previous layer. One convolutional layer has several convolution kernels; a single kernel can extract only one type of feature information from the three-dimensional data, so h kernels are used to extract h types of feature information, where h is a positive integer and h > 1.
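As an illustration of the 3D convolution just described, the following naive NumPy sketch (a toy illustration under the stated assumptions, not the patent's network) slides h kernels of size e×f×f over the P channels of a sample block; each kernel yields P−e+1 output channels, so the output has p = (P−e+1)×h channels, matching formula (2):

```python
import numpy as np

def conv3d_valid(block, kernels):
    """Naive 'valid' 3D convolution: block is (P, a, a), kernels is
    (h, e, f, f). The channel axis is treated as a third sliding axis,
    so each kernel produces (P - e + 1) output channels and h kernels
    stack to p = (P - e + 1) * h channels of spatial size (a - f + 1)."""
    P, a, _ = block.shape
    h, e, f, _ = kernels.shape
    out = np.empty((h * (P - e + 1), a - f + 1, a - f + 1))
    idx = 0
    for w in kernels:                      # one kernel = one feature type
        for c in range(P - e + 1):         # slide along the channel axis
            for i in range(a - f + 1):     # slide along spatial length
                for j in range(a - f + 1): # slide along spatial width
                    out[idx, i, j] = np.sum(block[c:c + e, i:i + f, j:j + f] * w)
            idx += 1
    return out

block = np.ones((6, 5, 5))       # P=6 channels, a=5
kernels = np.ones((2, 3, 3, 3))  # h=2 kernels with e=f=3
T2 = conv3d_valid(block, kernels)
print(T2.shape)  # (8, 3, 3): p = (6-3+1)*2 = 8
```

With all-ones inputs and kernels every output value is 3×3×3 = 27, which makes the receptive-field arithmetic easy to check by hand.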
Further optionally, performing step S3 to obtain the intermediate feature map T3 specifically comprises:
Step S3.1: perform a pooling operation, i.e. downsampling or feature-discarding processing, on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3; at this point T3 has the same number of channels as T2, while the size of a single channel in the spatial dimension changes;
Step S3.2: after pooling, denote the intermediate feature map T3 as T3^(p×r×r), i.e. each channel of T3 has r pixels in the spatial length and width directions, where r is calculated by formula (5):
r = m ÷ 2, formula (5),
where m is the number of pixels of the intermediate feature map T2 in the spatial length and width directions.
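A 2×2 pooling realizing formula (5) (r = m/2) can be sketched in NumPy as below; max pooling is assumed here, since the text allows either downsampling or feature-discarding processing:

```python
import numpy as np

def max_pool_2x2(T):
    """2x2 max pooling per channel (step S3): spatial size m -> r = m/2,
    channel count unchanged."""
    p, m, _ = T.shape
    r = m // 2
    # group pixels into 2x2 tiles, then take the max within each tile
    return T[:, :2 * r, :2 * r].reshape(p, r, 2, r, 2).max(axis=(2, 4))

T2 = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)  # p=2, m=4
T3 = max_pool_2x2(T2)
print(T3.shape)  # (2, 2, 2): channels preserved, spatial size halved
```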
Further optionally, performing step S4 to obtain the intermediate feature map T4 specifically comprises:
transform the intermediate feature map T3 using formulas (6) and (7), so that T3 is first multiplied channel by channel with the channel attention module A3 in the channel direction, and then multiplied point by point with the planar attention module A'3 in the spatial direction, obtaining the intermediate feature map T4:
Atenspe(T3) = A3(T3) ⊗ T3, applied channel by channel for v = 1, …, p, formula (6),
T4 = Atenspa[Atenspe(T3)] = A'3(T3) ⊗ Atenspe(T3), applied pixel by pixel for u = 1, …, r×r, formula (7),
where Atenspe denotes attention enhancement of the intermediate feature map T3 in the channel direction, Atenspa denotes attention enhancement of T3 in the spatial direction, u is the u-th pixel contained in a single channel of T3, r is the number of pixels of a single channel of T3 in the spatial length and width directions, p is the number of channels of T3, v is the v-th channel of T3, and the symbol ⊗ denotes element-wise multiplication of matrices of the same type at corresponding positions.
Further optionally, performing step S5 to obtain the intermediate feature map T7 specifically comprises:
Step S5.1: perform a 3D convolution operation on the intermediate feature map T4 using formula (8) to obtain the intermediate feature map T5:
T5^(x×y×y) = Con3D(T4^(p×r×r)), formula (8),
where Con3D denotes the 3D convolution operation, x is the number of pixels of T5 in the spatial height (channel) direction, y is the number of pixels of T5 in the spatial length and width directions, r is the number of pixels of a single channel of T4 in the spatial length and width directions, and p is the number of channels of T4;
Step S5.2: downsample T5 to obtain the intermediate feature map T6; at this point T6 has the same number of channels as T5, while the size of a single channel in the spatial dimension changes; the size of a single channel of T6 in the spatial length and width directions is
z × z = [(y ÷ 2) × (y ÷ 2)],
where z is the number of pixels of T6 in the spatial length and width directions and y is the number of pixels of T5 in the spatial length and width directions;
Step S5.3: perform a feature transformation on T6 using formulas (9) and (10) to obtain the intermediate feature map T7:
Atenspe(T6) = A6(T6) ⊗ T6, applied channel by channel for v = 1, …, x, formula (9),
T7 = Atenspa[Atenspe(T6)] = A'6(T6) ⊗ Atenspe(T6), applied pixel by pixel for u = 1, …, z×z, formula (10),
where Atenspe denotes attention enhancement of T6 in the channel direction, Atenspa denotes attention enhancement of T6 in the spatial direction, u is the u-th pixel contained in a single channel of T6, z is the number of pixels of a single channel of T6 in the spatial length and width directions, x is the number of channels of T6, v is the v-th channel of T6, and the symbol ⊗ denotes element-wise multiplication of matrices of the same type at corresponding positions.
Further optionally, when performing step S4 or S5, the specific operations of the channel attention module and the planar attention module are:
(I) Obtaining the channel attention module:
(1.1) first, perform maximum pooling and average pooling on the intermediate feature map Ti in the spatial dimension, generating two pooled vectors, where i takes the value 3 or 6;
(1.2) then, input the two pooled vectors into a shared multi-layer mapping neural network for training, generating two new vectors;
(1.3) finally, add the two new vectors bit by bit and perform a nonlinear mapping with the Sigmoid activation function, obtaining the channel attention module Ai(Ti) by formulas (11) and (12):
Ai(Ti) = σ{MLP[AvePool(Ti)] + MLP[MaxPool(Ti)]}, formula (12),
where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes nonlinear mapping through a multi-layer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling;
(II) Obtaining the planar attention module:
(2.1) first, perform maximum pooling and average pooling on the intermediate feature map Ti in the channel dimension, generating two pooled maps;
(2.2) then, map the two pooled maps to a single-channel map of the same size by a convolution operation;
(2.3) finally, perform a nonlinear mapping through the Sigmoid activation function, obtaining the planar attention module A'i(Ti) by formulas (13) and (14):
A'i(Ti) = σ{f1×1[AvePool(Ti); MaxPool(Ti)]}, formula (14),
where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f1×1 denotes feature transformation with a 1×1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
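The two attention modules and their application can be sketched in NumPy as below. This is a hedged illustration of steps (1.1)-(2.3): the ReLU hidden activation in the shared MLP and the reduction of the 1×1 convolution to a two-weight combination of the pooled maps are simplifying assumptions for compactness, not details from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(T, W0, W1):
    """Steps (1.1)-(1.3): spatial average/max pooling of T (p, r, r),
    a shared two-layer MLP (weights W0, W1), bitwise sum, Sigmoid.
    Returns one weight in (0, 1) per channel."""
    ave, mx = T.mean(axis=(1, 2)), T.max(axis=(1, 2))
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)  # assumed ReLU hidden layer
    return sigmoid(mlp(ave) + mlp(mx))            # shape (p,)

def planar_attention(T, w_ave, w_max):
    """Steps (2.1)-(2.3): channel-wise average/max pooling, a 1x1 'conv'
    (here just a weighted sum of the two maps), Sigmoid.
    Returns one weight in (0, 1) per spatial position."""
    return sigmoid(w_ave * T.mean(axis=0) + w_max * T.max(axis=0))  # (r, r)

def apply_attention(T, a_ch, a_sp):
    """Formulas (6)/(7)-style rescaling: channel-by-channel, then
    pixel-by-pixel element-wise multiplication."""
    return T * a_ch[:, None, None] * a_sp[None, :, :]

rng = np.random.default_rng(0)
T3 = rng.standard_normal((4, 3, 3))   # p=4 channels, r=3
W0, W1 = rng.standard_normal((2, 4)), rng.standard_normal((4, 2))
A3 = channel_attention(T3, W0, W1)
A3p = planar_attention(T3, 0.5, 0.5)
T4 = apply_attention(T3, A3, A3p)
print(T4.shape)  # (4, 3, 3): same shape as T3, attention-rescaled
```

Note that both modules output values strictly inside (0, 1), so the rescaling can only attenuate features, never amplify them, which is the standard behavior of Sigmoid-gated attention.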
Further optionally, performing step S6 to obtain the intermediate feature map T8 specifically comprises:
Step S6.1: perform a 3D convolution operation on the intermediate feature map T7 using a convolution kernel of size ρ×z×z to obtain the one-dimensional intermediate feature map T8, i.e. each channel of T8 contains only one pixel, where ρ is the side length of the convolution in the channel direction and z×z is the size of the convolution window;
Step S6.2: when the 3D convolution operation is performed on T7, the number of convolution kernels is η and the vector length of the convolution input is α; the vector length α after convolution is obtained by formula (15):
α = [(α − ρ) + 1] × η, formula (15).
Further optionally, performing step S7 to obtain the classification result of the dish image R1 specifically comprises:
Step S7.1: select a deep convolutional neural network whose activation function is the softmax function shown in formula (16), the softmax function being preceded by one layer of neural network:
softmax(Yi) = e^(Yi) / Σj e^(Yj), formula (16),
where Yi denotes the i-th element of the vector T;
Step S7.2: after the intermediate feature map T8 is input into the deep convolutional neural network, a vector T is obtained through one layer of neural network and then enters the softmax function, which maps the elements of T into the interval (0, 1), yielding the probability vector of T; the name of the dish image R1 is the name corresponding to the maximum probability value in the probability vector mapped by the softmax function.
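Formula (16) together with step S7.2 amounts to the standard softmax decision rule; a minimal sketch follows, where the class names are hypothetical and only for illustration:

```python
import numpy as np

def softmax(t):
    """Formula (16): map a score vector into probabilities in (0, 1)."""
    e = np.exp(t - t.max())  # subtract the max for numerical stability
    return e / e.sum()

# hypothetical dish names, for illustration only
names = ["mapo tofu", "kung pao chicken", "braised pork"]
scores = np.array([1.2, 3.1, 0.4])   # vector T from the last layer
probs = softmax(scores)
print(names[int(np.argmax(probs))])  # prints "kung pao chicken"
```

The predicted name is simply the entry whose probability is largest, exactly as step S7.2 describes.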
Compared with the prior art, the restaurant dish identification method based on the deep convolutional neural network has the following beneficial effects:
The invention extracts the feature maps of the dish sample blocks through the deep convolutional neural network and the attention modules, and obtains the name of the original dish image through the softmax mapping. It therefore has the advantage of high recognition accuracy, can reduce the workload of manual auxiliary operation, reduces the risk of overfitting, and overcomes the difficulty current dish identification and classification methods have in finely distinguishing highly similar dishes.
Drawings
FIG. 1 is a simplified flow chart of the method according to the first embodiment of the invention;
FIG. 2 is a simplified flow chart of obtaining the intermediate feature map T4 in the first embodiment of the invention.
Detailed Description
To make the technical scheme, the technical problems to be solved and the technical effects of the invention clearer, the technical scheme of the invention is described clearly and completely below with reference to specific embodiments.
Embodiment one:
Referring to FIGS. 1 and 2, this embodiment provides a restaurant dish identification method based on a deep convolutional neural network, the implementation of which includes:
Step S1: collect a dish image R1; perform a cropping preprocessing operation on R1 to obtain a dish image R2; extract sample blocks from R2 to obtain dish sample blocks T1, where each block T1 carries the feature information of a dish sample.
In this step, extracting sample blocks from the dish image R2 specifically comprises:
Step S1.1: in the plane dimension, take the surrounding a×a pixel points of each pixel of R2 as the neighborhood block of the sample, where a is the number of pixels of the image block in the planar length and width directions;
Step S1.2: retain all channel information of the a×a pixel points, forming a p×a×a three-dimensional sample block that represents the sample features of the central pixel; the feature transformation of the sample-block extraction is given by formula (1):
T1^(p×a×a) = Dsamp(R2^(p×L×H)), Q = L×H, formula (1),
where Q is the number of pixels in a single channel, i.e. the number of extracted blocks, Dsamp denotes the sample-block extraction process, and L and H are the preset planar length and width of the cropping operation;
In the sample-block extraction, when an edge pixel lacks spatial neighborhood information, zero padding is applied.
Step S2: perform a 3D convolution operation on the dish sample block T1 to obtain its intermediate feature map T2; the specific operations comprise:
Step S2.1: based on the deep convolutional neural network, select h different convolution kernels in each convolutional layer, and convolve the P channels of information contained in the dish sample block T1 with 3D convolution kernels of size e×f×f, where e is the number of operation layers in the channel dimension (e channels are selected for each group of convolutions) and f is the number of pixels of the image block in the spatial length and width directions;
Step S2.2: after selecting h different convolution kernels in each convolutional layer, obtain the intermediate feature map T2 of the dish sample block T1 using formulas (2), (3) and (4):
p = [(P − e) + 1] × h, formula (2),
m = [(a − e) + 1], formula (3),
T2^(p×m×m) = Con3D(T1^(P×a×a)), formula (4),
where p is the number of channels of the intermediate feature map T2 of the dish sample block T1, e is the number of operation layers in the channel dimension, a is the number of pixels of the image block in the planar length and width directions, m is the number of pixels of T2 in the spatial length and width directions, and Con3D denotes performing the 3D convolution operation.
In the 3D convolution operation on the dish sample block T1, each feature map in a convolutional layer is connected to several adjacent continuous channels in the previous layer, and each position value of one convolution map is obtained by convolving the local receptive fields at the same position of three continuous channels of the previous layer. One convolutional layer has several convolution kernels; a single kernel can extract only one type of feature information from the three-dimensional data, so h kernels are used to extract h types of feature information, where h is a positive integer and h > 1.
Step S3: perform a pooling operation on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3; the specific operations comprise:
Step S3.1: perform a pooling operation, i.e. downsampling or feature-discarding processing, on T2 to obtain T3; at this point T3 has the same number of channels as T2, while the size of a single channel in the spatial dimension changes;
Step S3.2: after pooling, denote the intermediate feature map T3 as T3^(p×r×r), i.e. each channel of T3 has r pixels in the spatial length and width directions, where r is calculated by formula (5):
r = m ÷ 2, formula (5),
where m is the number of pixels of the intermediate feature map T2 in the spatial length and width directions.
Step S4: pool the intermediate feature map T3 in the spatial dimension to obtain a channel attention module A3, and pool T3 in the channel dimension to obtain a planar attention module A'3; multiply each channel vector of T3 element-wise with the channel attention module, and each spatial feature of T3 element-wise with the planar attention module, to obtain an intermediate feature map T4.
Step S4 includes two aspects: on one hand, obtaining the channel attention module A3 and the planar attention module A'3; on the other hand, obtaining the intermediate feature map T4.
First, the specific procedure for obtaining the channel attention module A3 and the planar attention module A'3 is described.
(I) Obtaining the channel attention module:
(1.1) first, perform maximum pooling and average pooling on the intermediate feature map T3 in the spatial dimension, generating two pooled vectors;
(1.2) then, input the two pooled vectors into a shared multi-layer mapping neural network for training, generating two new vectors;
(1.3) finally, add the two new vectors bit by bit and perform a nonlinear mapping with the Sigmoid activation function, obtaining the channel attention module A3(T3) by formulas (11) and (12):
A3(T3) = σ{MLP[AvePool(T3)] + MLP[MaxPool(T3)]}, formula (12),
where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes nonlinear mapping through a multi-layer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
(II) Obtaining the planar attention module:
(2.1) first, perform maximum pooling and average pooling on the intermediate feature map T3 in the channel dimension, generating two pooled maps;
(2.2) then, map the two pooled maps to a single-channel map of the same size by a convolution operation;
(2.3) finally, perform a nonlinear mapping through the Sigmoid activation function, obtaining the planar attention module A'3(T3) by formulas (13) and (14):
A'3(T3) = σ{f1×1[AvePool(T3); MaxPool(T3)]}, formula (14),
where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f1×1 denotes feature transformation with a 1×1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
Then, the intermediate feature map T4 is obtained as follows:
transform the intermediate feature map T3 using formulas (6) and (7), so that T3 is first multiplied channel by channel with the channel attention module A3 in the channel direction, and then multiplied point by point with the planar attention module A'3 in the spatial direction, obtaining the intermediate feature map T4:
Atenspe(T3) = A3(T3) ⊗ T3, applied channel by channel for v = 1, …, p, formula (6),
T4 = Atenspa[Atenspe(T3)] = A'3(T3) ⊗ Atenspe(T3), applied pixel by pixel for u = 1, …, r×r, formula (7),
where Atenspe denotes attention enhancement of T3 in the channel direction, Atenspa denotes attention enhancement of T3 in the spatial direction, u is the u-th pixel contained in a single channel of T3, r is the number of pixels of a single channel of T3 in the spatial length and width directions, p is the number of channels of T3, v is the v-th channel of T3, and the symbol ⊗ denotes element-wise multiplication of matrices of the same type at corresponding positions.
Step S5, intermediate characteristic spectrum T 4 Sequentially performing 3D convolution operation and pooling operation to obtain an intermediate characteristic spectrum T 6 Intermediate feature pattern T in the spatial dimension 6 And carrying out pooling operation to obtain a channel attention module A 6 Intermediate feature pattern T in the channel dimension 6 Carrying out pooling operation to obtain a plane attention module A' 6 Intermediate characteristic spectrum T 6 Each channel vector and channel attention module, intermediate feature pattern T 6 Each spatial feature and the plane attention module are respectively subjected to phase-based multiplication to obtain an intermediate feature map T 7 。
The specific operation of implementing step S5 is:
step S5.1, utilizing the formula (8) to obtain the intermediate characteristic spectrum T 4 Performing 3D convolution operation to obtain an intermediate characteristic spectrum T 5 Intermediate feature map T 5 Namely, is
Wherein Con 3D Representing the 3D convolution operation, x representing the intermediate feature map T 5 The number of pixel points in the space height direction, y represents the intermediate characteristic map T 5 The number of pixels in the space length and width directions, r is the middle characteristic spectrum T 4 The number of pixels of a single channel in the space length and width directions, p is the middle characteristic map T 4 The number of channels;
Step S5.2: downsample the intermediate feature map T5 to obtain an intermediate feature map T6. The number of channels of T6 is the same as that of T5; only the size of a single channel in the spatial dimension changes, and the size of a single channel of T6 in the spatial length and width directions is:

z×z = [(y÷2)×(y÷2)],

where z is the number of pixel points of T6 in the spatial length and width directions and y is the number of pixel points of T5 in the spatial length and width directions;
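As a quick illustration of the halving relation z×z = (y÷2)×(y÷2), a 2×2 average pooling over one channel behaves like this (the pooling type and the value of y are assumptions made for the sketch):

```python
import numpy as np

# Illustrative check of step S5.2: 2x2 average pooling halves the spatial size
# of a single channel, so z = y / 2 (y assumed even; names follow the text).
y = 8
T5 = np.arange(y * y, dtype=float).reshape(y, y)
z = y // 2
T6 = T5.reshape(z, 2, z, 2).mean(axis=(1, 3))  # non-overlapping 2x2 windows
assert T6.shape == (z, z)
```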
Step S5.3: perform a feature transformation on the intermediate feature map T6 using formulas (9) and (10) to obtain an intermediate feature map T7; that is, T6 is dot-multiplied channel by channel with the channel attention module A6 in the channel direction and dot-multiplied channel by channel with the planar attention module A'6 in the spatial direction,

where Aten_spe denotes the attention enhancement of the intermediate feature map T6 in the channel direction, Aten_spa denotes the attention enhancement of T6 in the spatial direction, u is the u-th pixel point contained in a single channel of T6, z is the number of pixel points of a single channel of T6 in the spatial length and width directions, x is the number of channels of T6, v is the v-th channel of T6, and the symbol ⊙ denotes element-wise multiplication of same-shaped matrices at corresponding positions.
The channel attention module A6 and the planar attention module A'6 of formula (10) are obtained as follows:

(I) Obtaining the channel attention module A6:
(1.1) First, perform max pooling and average pooling on the intermediate feature map T6 in the spatial dimension, generating two pooled vectors;

(1.2) Then, input the two pooled vectors into a shared multi-layer mapping neural network for training, generating two new vectors;

(1.3) Finally, add the two new vectors bit by bit and perform a nonlinear mapping through the Sigmoid activation function, i.e. obtain the channel attention module A6(T6) using formulas (11') and (12'):

A6(T6) = σ{MLP[AvePool(T6)] + MLP[MaxPool(T6)]}    formula (12'),
where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes a nonlinear mapping through a multi-layer neural network, AvePool denotes average pooling, and MaxPool denotes max pooling;
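Formula (12') can be sketched numerically as below; the hidden-layer width of the shared MLP and all weight values are hypothetical, since the patent does not fix them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal sketch of the channel attention module of formulas (11')/(12').
rng = np.random.default_rng(1)
p, z = 4, 8                              # channels and spatial size of T6
T6 = rng.standard_normal((p, z, z))

W1 = rng.standard_normal((p // 2, p))    # shared MLP weights (assumed shapes)
W2 = rng.standard_normal((p, p // 2))

def mlp(v):
    # Shared two-layer mapping with a ReLU hidden layer (an assumption).
    return W2 @ np.maximum(W1 @ v, 0.0)

ave = T6.mean(axis=(1, 2))               # spatial average pooling -> length-p vector
mx = T6.max(axis=(1, 2))                 # spatial max pooling -> length-p vector
A6 = sigmoid(mlp(ave) + mlp(mx))         # sigma{MLP[AvePool] + MLP[MaxPool]}

assert A6.shape == (p,)
```

The result is one attention weight in (0, 1) per channel, matching the channel-by-channel dot multiplication of step S5.3.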
(II) Obtaining the planar attention module A'6:
(2.1) First, perform max pooling and average pooling on the intermediate feature map T6 in the channel dimension, generating two pooled maps;

(2.2) Then, map the two pooled maps to a single-channel map of the same size by a convolution operation;

(2.3) Finally, perform a nonlinear mapping through the Sigmoid activation function, i.e. obtain the planar attention module A'6(T6) using formulas (13') and (14'),
where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, the convolution symbol denotes a feature transformation with a 1×1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes max pooling.
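A corresponding sketch of formulas (13') and (14') follows; the 1×1 convolution over the two pooled maps is reduced here to a weighted sum with hypothetical weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal sketch of the planar attention module: pooling along the channel
# dimension yields two z x z maps; a 1x1 convolution (here a weighted sum with
# assumed weights) reduces them to a single channel before the sigmoid.
rng = np.random.default_rng(2)
p, z = 4, 8
T6 = rng.standard_normal((p, z, z))

ave = T6.mean(axis=0)                 # channel-wise average pooling -> (z, z)
mx = T6.max(axis=0)                   # channel-wise max pooling -> (z, z)
w_ave, w_max, bias = 0.7, 0.3, 0.0    # 1x1-conv weights (assumed values)
A6p = sigmoid(w_ave * ave + w_max * mx + bias)

assert A6p.shape == (z, z)
```

The result is one attention weight in (0, 1) per spatial position, applied identically to every channel of T6.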
Step S6: perform a 3D convolution operation on the intermediate feature map T7 to obtain a one-dimensional intermediate feature map T8. The specific operations are as follows:

Step S6.1: perform a 3D convolution operation on the intermediate feature map T7 with convolution kernels of size ρ×z×z to obtain the one-dimensional intermediate feature map T8, i.e. each channel of T8 contains only one pixel point, where ρ is the side length of the convolution kernel in the channel direction and z×z is the size of the convolution window;

Step S6.2: when the 3D convolution operation is performed on the intermediate feature map T7, the number of convolution kernels is η and the length of the input vector is α; the vector length α′ after convolution is obtained using formula (15):

α′ = [(α − ρ) + 1] × η    formula (15).
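For instance, formula (15) gives the following output length under assumed values of α, ρ, and η:

```python
# Quick check of formula (15): with a convolution of channel-side length rho and
# eta kernels, an input vector of length alpha yields (alpha - rho + 1) * eta
# output elements. The numbers are arbitrary example values.
alpha, rho, eta = 16, 3, 2
alpha_out = ((alpha - rho) + 1) * eta
assert alpha_out == 28
```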
Step S7: input the intermediate feature map T8 into a deep convolutional neural network to obtain the classification result of the dish image R1. The specific operations are as follows:

Step S7.1: select a deep convolutional neural network whose activation function is the softmax function shown in formula (16), with one layer of neural network preceding the softmax function; the softmax function is

S_i = e^(Y_i) / Σ_j e^(Y_j)    formula (16),

where Y_i denotes the i-th element of the vector T;

Step S7.2: after the intermediate feature map T8 is input into the deep convolutional neural network, a vector T is obtained through one layer of the network and then passed to the softmax function, which maps the elements of T into the interval (0, 1), yielding the probability vector of T; the name of the dish image R1 is the name corresponding to the maximum probability value in the probability vector mapped by the softmax function.
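Step S7.2 can be sketched as below; the logit values and dish names are purely illustrative:

```python
import numpy as np

# Sketch of the classification step: map the final vector T through softmax
# (formula (16)) and take the class with the largest probability.
def softmax(t):
    e = np.exp(t - t.max())    # subtract the max for numerical stability
    return e / e.sum()

T = np.array([1.2, 0.3, 2.5])  # output of the last fully connected layer
names = ["mapo tofu", "kung pao chicken", "braised pork"]  # example classes
probs = softmax(T)

predicted = names[int(np.argmax(probs))]
assert abs(probs.sum() - 1.0) < 1e-9
```

Every probability lies in (0, 1) and the probabilities sum to 1, so the arg-max entry directly names the recognized dish.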
In summary, the restaurant dish identification method based on the deep convolutional neural network can improve identification accuracy, reduce the workload of manual auxiliary operations, and overcome the shortcoming of current dish identification and classification methods, which struggle to finely distinguish dishes of high similarity.
The foregoing has outlined rather broadly the principles and embodiments of the present invention in order that the detailed description of the invention may be better understood. Based on the above-mentioned embodiments of the present invention, any improvements and modifications made by those skilled in the art without departing from the principles of the present invention should fall within the scope of the present invention.
Claims (7)
1. The restaurant dish identification method based on the deep convolutional neural network is characterized by comprising the following steps of:
Step S1: collect a dish image R1, perform a cropping pre-treatment operation on R1 to obtain a dish image R2, and take sample blocks from R2 to obtain a dish sample block T1, where T1 is the feature information of the dish sample;

Step S2: perform a 3D convolution operation on the dish sample block T1 to obtain an intermediate feature map T2 of T1;

Step S3: perform a pooling operation on the intermediate feature map T2 of T1 to obtain an intermediate feature map T3;
Step S4: (1) perform a pooling operation on the intermediate feature map T3 in the spatial dimension to obtain a channel attention module A3; the specific operations are as follows:

Step S4.1.1: first, perform max pooling and average pooling on the intermediate feature map T3 in the spatial dimension, generating two pooled vectors;

Step S4.1.2: then, input the two pooled vectors into a shared multi-layer mapping neural network for training, generating two new vectors;

Step S4.1.3: finally, add the two new vectors bit by bit and perform a nonlinear mapping through the Sigmoid activation function, i.e. obtain the channel attention module A3(T3) using formulas (11) and (12):

A3(T3) = σ{MLP[AvePool(T3)] + MLP[MaxPool(T3)]}    formula (12),

where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes a nonlinear mapping through a multi-layer neural network, AvePool denotes average pooling, and MaxPool denotes max pooling;
(2) Perform a pooling operation on the intermediate feature map T3 in the channel dimension to obtain a planar attention module A'3; the specific operations are as follows:

Step S4.2.1: first, perform max pooling and average pooling on the intermediate feature map T3 in the channel dimension, generating two pooled maps;

Step S4.2.2: then, map the two pooled maps to a single-channel map of the same size by a convolution operation;

Step S4.2.3: finally, perform a nonlinear mapping through the Sigmoid activation function, i.e. obtain the planar attention module A'3(T3) using formulas (13) and (14),

where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, the convolution symbol denotes a feature transformation with a 1×1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes max pooling;
(3) Multiply each channel vector of the intermediate feature map T3 by the channel attention module and each spatial feature of T3 by the planar attention module, element by element, to obtain an intermediate feature map T4; the specific operation is: transform T3 using formulas (6) and (7), so that T3 is dot-multiplied channel by channel with the channel attention module A3 in the channel direction and dot-multiplied channel by channel with the planar attention module A'3 in the spatial direction, obtaining the intermediate feature map T4,

where Aten_spe denotes the attention enhancement of the intermediate feature map T3 in the channel direction, Aten_spa denotes the attention enhancement of T3 in the spatial direction, u is the u-th pixel point contained in a single channel of T3, r is the number of pixel points of a single channel of T3 in the spatial length and width directions, p is the number of channels of T3, v is the v-th channel of T3, and the symbol ⊙ denotes element-wise multiplication of same-shaped matrices at corresponding positions;
Step S5: (1) perform a 3D convolution operation and a pooling operation on the intermediate feature map T4 in sequence to obtain an intermediate feature map T6; the specific operations are as follows:

Step S5.1.1: perform a 3D convolution operation on the intermediate feature map T4 using formula (8) to obtain an intermediate feature map T5, i.e. T5 = Con3D(T4),

where Con3D denotes the 3D convolution operation, x denotes the number of pixel points of T5 in the spatial height direction, y denotes the number of pixel points of T5 in the spatial length and width directions, r is the number of pixel points of a single channel of T4 in the spatial length and width directions, and p is the number of channels of T4;

Step S5.1.2: downsample the intermediate feature map T5 to obtain an intermediate feature map T6; the number of channels of T6 is the same as that of T5, only the size of a single channel in the spatial dimension changes, and the size of a single channel of T6 in the spatial length and width directions is:

z×z = [(y÷2)×(y÷2)],

where z is the number of pixel points of T6 in the spatial length and width directions and y is the number of pixel points of T5 in the spatial length and width directions;
(2) Perform a pooling operation on the intermediate feature map T6 in the spatial dimension to obtain a channel attention module A6; the specific operations are as follows:

Step S5.2.1: first, perform max pooling and average pooling on the intermediate feature map T6 in the spatial dimension, generating two pooled vectors;

Step S5.2.2: then, input the two pooled vectors into a shared multi-layer mapping neural network for training, generating two new vectors;

Step S5.2.3: finally, add the two new vectors bit by bit and perform a nonlinear mapping through the Sigmoid activation function, i.e. obtain the channel attention module A6(T6) using formulas (11') and (12'):

A6(T6) = σ{MLP[AvePool(T6)] + MLP[MaxPool(T6)]}    formula (12'),

where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes a nonlinear mapping through a multi-layer neural network, AvePool denotes average pooling, and MaxPool denotes max pooling;
(3) Perform a pooling operation on the intermediate feature map T6 in the channel dimension to obtain a planar attention module A'6; the specific operations are as follows:

Step S5.3.1: first, perform max pooling and average pooling on the intermediate feature map T6 in the channel dimension, generating two pooled maps;

Step S5.3.2: then, map the two pooled maps to a single-channel map of the same size by a convolution operation;

Step S5.3.3: finally, perform a nonlinear mapping through the Sigmoid activation function, i.e. obtain the planar attention module A'6(T6) using formulas (13') and (14'),

where σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, the convolution symbol denotes a feature transformation with a 1×1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes max pooling;
(4) Multiply each channel vector of the intermediate feature map T6 by the channel attention module and each spatial feature of T6 by the planar attention module, element by element, to obtain an intermediate feature map T7, as shown in formulas (9) and (10),

where Aten_spe denotes the attention enhancement of the intermediate feature map T6 in the channel direction, Aten_spa denotes the attention enhancement of T6 in the spatial direction, u is the u-th pixel point contained in a single channel of T6, z is the number of pixel points of a single channel of T6 in the spatial length and width directions, x is the number of channels of T6, v is the v-th channel of T6, and the symbol ⊙ denotes element-wise multiplication of same-shaped matrices at corresponding positions;
Step S6: perform a 3D convolution operation on the intermediate feature map T7 to obtain a one-dimensional intermediate feature map T8;

Step S7: input the intermediate feature map T8 into a deep convolutional neural network to obtain the classification result of the dish image R1.
2. The restaurant dish identification method based on the deep convolutional neural network according to claim 1, characterized in that sample blocks are taken from the dish image R2 by the following specific operations:

Step S1.1: in the planar dimension of the dish image R2, take the surrounding a×a pixel points as the neighborhood block of a sample, where a is the number of pixel points of the image block in the length and width directions of the plane;

Step S1.2: retain all channel information of the a×a pixel points, forming a three-dimensional sample block of P×a×a that represents the sample features of the central pixel point, and perform the feature transformation of the sample block taking process using formula (1):

where Q is the number of pixel points in a single channel, i.e. the number of block samples, D_samp denotes the sample block taking process, and L and H denote the preset plane length and width of the cropping operation;

in the sample block taking operation, zero-padding is performed when an edge pixel point has no spatial neighborhood information.
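The neighborhood-block extraction with zero-padding described in this claim can be sketched as follows (the channel count P, image size, and a are example values, and `sample_block` is a hypothetical helper name):

```python
import numpy as np

# Illustrative sketch of steps S1.1-S1.2: for a pixel of the dish image R2, the
# surrounding a x a neighborhood is kept across all P channels, giving a
# P x a x a sample block; edge pixels without spatial neighbors are zero-padded.
P, height, width, a = 3, 6, 6, 5
R2 = np.arange(P * height * width, dtype=float).reshape(P, height, width)
pad = a // 2
R2_padded = np.pad(R2, ((0, 0), (pad, pad), (pad, pad)))  # "0 supplementing"

def sample_block(i, j):
    """Return the P x a x a sample block centred on pixel (i, j) of R2."""
    return R2_padded[:, i:i + a, j:j + a]

corner = sample_block(0, 0)    # corner pixel: zero-padded on two sides
centre = sample_block(2, 3)
assert corner.shape == (P, a, a)
assert centre[0, pad, pad] == R2[0, 2, 3]   # block is centred on the pixel
```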
3. The restaurant dish identification method based on the deep convolutional neural network according to claim 2, characterized in that when step S2 is performed, the 3D convolution operation on the dish sample block T1 comprises the following specific operations:

Step S2.1: based on the deep convolutional neural network, select h different convolution kernels in each convolution layer, and perform a convolution operation on the P channels of information contained in the dish sample block T1 using 3D convolution kernels of size e×f, where e is the number of operation layers in the channel dimension, i.e. e channels are selected for each group of convolutions, and f denotes the number of pixel points of the image block in the length and width directions of the spatial dimension;

Step S2.2: after selecting the h different convolution kernels in each convolution layer, obtain the intermediate feature map T2 of the dish sample block T1 using formulas (2), (3) and (4):

p = [(P − e) + 1] × h    formula (2),

m = [(a − e) + 1]    formula (3),

where P denotes the number of channels contained in the dish sample block T1, e is the number of operation layers in the channel dimension, a is the number of pixel points of the image block in the length and width directions of the plane, m is the number of pixel points of the intermediate feature map T2 in the spatial length and width directions, and Con3D denotes the 3D convolution operation.
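Formulas (2) and (3) can be checked with small assumed values of P, e, h, and a:

```python
# Quick check of formulas (2) and (3): with P input channels, e operation layers
# in the channel dimension, h convolution kernels per layer, and an a x a image
# block, the intermediate feature map T2 has p channels of m x m pixel points.
# The numbers are arbitrary example values.
P, e, h, a = 10, 3, 4, 9
p = ((P - e) + 1) * h   # formula (2)
m = (a - e) + 1         # formula (3)
assert (p, m) == (32, 7)
```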
4. The restaurant dish identification method based on the deep convolutional neural network according to claim 3, characterized in that during the 3D convolution operation on the dish sample block T1, each feature map in a convolution layer is connected to several adjacent continuous channels of the previous layer, and the value at a given position of one convolution map is obtained by convolving the local receptive fields at the same position of three continuous channels of the previous layer; one convolution layer has multiple convolution kernels, and one convolution kernel can extract only one type of feature information from the three-dimensional data, so using h convolution kernels can extract h types of feature information, where h is a positive integer and h > 1.
5. The restaurant dish identification method based on the deep convolutional neural network according to claim 3, characterized in that step S3 is performed to obtain the intermediate feature map T3 by the following specific operations:

Step S3.1: perform a pooling operation, i.e. downsampling or feature-discarding processing, on the intermediate feature map T2 of the dish sample block T1 to obtain an intermediate feature map T3; the number of channels of T3 is the same as that of T2, and only the size of a single channel in the spatial dimension changes;

Step S3.2: after the pooling treatment, the number of pixel points of each channel of the intermediate feature map T3 in the spatial length and width directions is r, which is calculated using formula (5):

r = (m ÷ 2)    formula (5),

where m is the number of pixel points of the intermediate feature map T2 in the spatial length and width directions.
6. The restaurant dish identification method based on the deep convolutional neural network according to claim 1, characterized in that step S6 is performed to obtain the intermediate feature map T8 by the following specific operations:

Step S6.1: perform a 3D convolution operation on the intermediate feature map T7 with convolution kernels of size ρ×z×z to obtain the one-dimensional intermediate feature map T8, i.e. each channel of T8 contains only one pixel point, where ρ is the side length of the convolution kernel in the channel direction and z×z is the size of the convolution window;

Step S6.2: when the 3D convolution operation is performed on the intermediate feature map T7, the number of convolution kernels is η and the length of the input vector is α; the vector length α′ after convolution is obtained using formula (15):

α′ = [(α − ρ) + 1] × η    formula (15).
7. The restaurant dish identification method based on the deep convolutional neural network according to claim 1, characterized in that step S7 is performed to obtain the classification result of the dish image R1 by the following specific operations:

Step S7.1: select a deep convolutional neural network whose activation function is the softmax function shown in formula (16), with one layer of neural network preceding the softmax function; the softmax function is

S_i = e^(Y_i) / Σ_j e^(Y_j)    formula (16),

where Y_i denotes the i-th element of the vector T;

Step S7.2: after the intermediate feature map T8 is input into the deep convolutional neural network, a vector T is obtained through one layer of the network and then passed to the softmax function, which maps the elements of T into the interval (0, 1), yielding the probability vector of T; the name of the dish image R1 is the name corresponding to the maximum probability value in the probability vector mapped by the softmax function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110006146.7A CN112699822B (en) | 2021-01-05 | 2021-01-05 | Restaurant dish identification method based on deep convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112699822A CN112699822A (en) | 2021-04-23 |
CN112699822B true CN112699822B (en) | 2023-05-30 |
Family
ID=75514577
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110006146.7A Active CN112699822B (en) | 2021-01-05 | 2021-01-05 | Restaurant dish identification method based on deep convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112699822B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106845527A (en) * | 2016-12-29 | 2017-06-13 | 南京江南博睿高新技术研究院有限公司 | A kind of vegetable recognition methods |
CN107578060A (en) * | 2017-08-14 | 2018-01-12 | 电子科技大学 | A kind of deep neural network based on discriminant region is used for the method for vegetable image classification |
CN109377205A (en) * | 2018-12-06 | 2019-02-22 | 深圳市淘米科技有限公司 | A kind of cafeteria's intelligence settlement system based on depth convolutional network |
CN110689056A (en) * | 2019-09-10 | 2020-01-14 | Oppo广东移动通信有限公司 | Classification method and device, equipment and storage medium |
CN111667489A (en) * | 2020-04-30 | 2020-09-15 | 华东师范大学 | Cancer hyperspectral image segmentation method and system based on double-branch attention deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||