CN112699822A - Restaurant dish identification method based on deep convolutional neural network - Google Patents

Restaurant dish identification method based on deep convolutional neural network

Info

Publication number
CN112699822A
CN112699822A
Authority
CN
China
Prior art keywords
feature map
channel
dish
convolution
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110006146.7A
Other languages
Chinese (zh)
Other versions
CN112699822B (en)
Inventor
翟盛龙
尹旭
王东伟
张金波
张睿智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd filed Critical Inspur Cloud Information Technology Co Ltd
Priority to CN202110006146.7A priority Critical patent/CN112699822B/en
Publication of CN112699822A publication Critical patent/CN112699822A/en
Application granted granted Critical
Publication of CN112699822B publication Critical patent/CN112699822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a restaurant dish identification method based on a deep convolutional neural network, relates to the technical field of deep learning, and aims to solve the problem that existing dish identification and classification methods struggle to finely distinguish dishes with high similarity. The technical scheme is as follows: a dish image is acquired and cut to obtain sample blocks of the image; a sample block first undergoes a 3D convolution operation and a downsampling operation, a feature map is then extracted through an attention module, a further 3D convolution operation produces a one-dimensional intermediate feature map, the one-dimensional intermediate feature map is input into a deep convolutional neural network with a softmax function, and the classification result of the original dish image is obtained from the probability values produced by the softmax mapping. The method can improve identification precision, reduce the workload of manual auxiliary operation, and overcome the difficulty that existing dish identification and classification methods have in finely distinguishing dishes with high similarity.

Description

Restaurant dish identification method based on deep convolutional neural network
Technical Field
The invention relates to the technical field of deep learning, in particular to a restaurant dish identification method based on a deep convolutional neural network.
Background
Food is an important component of human life and an important prerequisite for human survival and healthy development. As society develops, people's requirements on food quality keep rising, which has greatly promoted the development of the catering industry: enterprise restaurants prepare a wide variety of dishes and continually introduce new ones, so the demand for accurately identifying many kinds of dishes keeps growing. Meanwhile, because many enterprise restaurants offer numerous dish types, manual settlement in the dining hall is very inefficient. With the rapid development of the mobile internet, a large number of dish images can be quickly acquired from massive network image information and used as a data source for analysis and modeling to obtain a general model for dish image classification, segmentation and identification, which is of great significance for saving labor cost and improving restaurant settlement efficiency.
Image classification, in short, distinguishes already-acquired images by finding features common to a large number of images; only when the model can find such features can the images be correctly distinguished.
Dish image classification was already an active research direction before deep learning technology matured. Deep learning is a machine learning method based on representation learning of data; it aims to build and simulate the multilayer neural networks with which the human brain analyzes and learns, is used to interpret data such as images, sounds and text, and is widely applied in the field of image recognition. Because deep learning can extract more abstract and deeper features from an image, it has stronger classification capability than traditional classification methods. Convolutional neural networks have been applied to image classification with good results; however, the amount of input information and the classification effect of a convolutional neural network are not completely positively correlated, and for a given model an overly complex input not only lengthens training and classification time but may even cause accuracy to drop rather than rise. It is therefore necessary to study in depth the feature extraction process that precedes convolutional neural network classification, so that adaptive feature refinement can be achieved at low cost.
Disclosure of Invention
In view of existing dish identification and classification methods, and considering characteristics such as the high similarity of some dishes and their uneven ingredient composition, the invention studies in depth the feature extraction process that precedes convolutional neural network classification, achieving adaptive feature refinement at low cost. To further improve identification precision, reduce the workload of manual auxiliary operation, optimize a loss function suited to restaurant dish classification, enhance the robustness of the algorithm and reduce the risk of overfitting, the invention provides a restaurant dish identification method based on a deep convolutional neural network.
The invention discloses a restaurant dish identification method based on a deep convolutional neural network, which solves the technical problems by adopting the following technical scheme:
A restaurant dish identification method based on a deep convolutional neural network is characterized by comprising the following implementation contents:
Step S1, collecting a dish image R1, performing a cutting preprocessing operation on the dish image R1 to obtain a dish image R2, and taking sample blocks from the dish image R2 to obtain a dish sample block T1, the dish sample block T1 being the characteristic information of the dish sample;
Step S2, performing a 3D convolution operation on the dish sample block T1 to obtain an intermediate feature map T2 of the dish sample block T1;
Step S3, performing a pooling operation on the intermediate feature map T2 of the dish sample block T1 to obtain an intermediate feature map T3;
Step S4, performing a pooling operation on the intermediate feature map T3 in the spatial dimension to obtain a channel attention module A3, performing a pooling operation on the intermediate feature map T3 in the channel dimension to obtain a plane attention module A'3, and multiplying each channel vector of the intermediate feature map T3 by the channel attention module and each spatial feature of the intermediate feature map T3 position by position by the plane attention module, to obtain an intermediate feature map T4;
Step S5, performing a 3D convolution operation and a pooling operation in sequence on the intermediate feature map T4 to obtain an intermediate feature map T6, performing a pooling operation on the intermediate feature map T6 in the spatial dimension to obtain a channel attention module A6, performing a pooling operation on the intermediate feature map T6 in the channel dimension to obtain a plane attention module A'6, and multiplying each channel vector of the intermediate feature map T6 by the channel attention module and each spatial feature of the intermediate feature map T6 position by position by the plane attention module, to obtain an intermediate feature map T7;
Step S6, performing a 3D convolution operation on the intermediate feature map T7 to obtain a one-dimensional intermediate feature map T8;
Step S7, inputting the intermediate feature map T8 into a deep convolutional neural network to obtain the classification result of the dish image R1.
Optionally, when step S1 is executed, sample blocks are taken from the dish image R2, and the specific operation comprises the following steps:
Step S1.1, in the plane dimension, for each pixel point of the dish image R2, take the surrounding a × a pixel points as the neighborhood block of the sample, wherein a is the number of pixel points of the image block along the plane length and width directions;
Step S1.2, retain all channel information of the a × a pixel points, i.e. form a three-dimensional sample block of size P × a × a that represents the sample feature of the center pixel point, and perform the feature transformation of the sample-block taking process using formula (1):
(Formula (1) is shown only as an image in the original; it defines the sample-block taking transform Dsamp that maps the dish image R2 to the Q sample blocks T1 of size P × a × a.)
wherein Q is the number of pixel points in a single channel, which is also the number of block samples, Dsamp denotes the sample-block taking process, and L and H denote the preset plane length and width of the cutting operation;
in the sample-block taking operation, when an edge pixel point has no spatial neighborhood information, a zero-padding operation is performed.
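The following is a minimal Python/PyTorch sketch of the sample-block taking of step S1, written only to make the operation concrete; the function name extract_sample_blocks, the tensor shapes and the neighborhood size are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn.functional as F

def extract_sample_blocks(image: torch.Tensor, a: int) -> torch.Tensor:
    """Step S1 sketch: for every pixel of a P x L x H image, keep the surrounding
    a x a neighborhood in all P channels, zero-padding edge pixels that lack
    spatial neighbors, so that Q = L * H blocks of shape P x a x a result."""
    P, L, H = image.shape
    pad = a // 2
    padded = F.pad(image, (pad, pad, pad, pad), mode="constant", value=0.0)
    blocks = []
    for i in range(L):
        for j in range(H):
            blocks.append(padded[:, i:i + a, j:j + a])
    return torch.stack(blocks)               # shape (Q, P, a, a), Q = L * H

# Example: a 3-channel 32 x 32 crop of a dish image with a = 5
t1 = extract_sample_blocks(torch.rand(3, 32, 32), a=5)
print(t1.shape)                               # torch.Size([1024, 3, 5, 5])
```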
Further optionally, when step S2 is executed, a 3D convolution operation is performed on the dish sample block T1, and the specific operation comprises:
Step S2.1, based on the deep convolutional neural network, select h different convolution kernels in each layer of the convolutional neural network, and perform a convolution operation on the P channels of information contained in the dish sample block T1 using 3D convolution kernels of size e × f × f, wherein e is the number of operation layers in the channel dimension, i.e. e channels are selected for each group of convolution, and f is the number of pixel points of the image block along the length and width directions of the spatial dimension;
Step S2.2, after h different convolution kernels are selected in each layer of the convolutional neural network, the intermediate feature map T2 of the dish sample block T1 is obtained using formulas (2), (3) and (4):
P' = [(P - e) + 1] × h   formula (2),
m = [(a - e) + 1]   formula (3),
T2^(P'×m×m) = Con3D(T1)   formula (4),
wherein P denotes the number of channels contained in the dish sample block T1, e is the number of operation layers in the channel dimension, a is the number of pixel points of the image block along the plane length and width directions, m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions, and Con3D denotes performing a 3D convolution operation.
Further optionally, in the process of performing the 3D convolution operation on the dish sample block T1, the mapping of each feature in a convolutional layer is connected to several adjacent continuous channels of the previous layer, and a given position value of one convolution mapping is obtained by convolving the local receptive fields at the same position of three continuous channels of the previous layer; one convolutional layer has several convolution kernels, one convolution kernel can extract only one type of feature information from the three-dimensional data, and h types of feature information can be extracted using h convolution kernels, wherein h is a positive integer and h > 1.
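As a concrete illustration of the 3D convolution of step S2, the sketch below uses PyTorch's nn.Conv3d on a single sample block; the values of h, e and f are illustrative assumptions, and e = f is chosen so that the spatial shrinkage matches formulas (2) and (3).

```python
import torch
import torch.nn as nn

# Step S2 sketch: treat the P x a x a sample block T1 as one 3D volume and slide
# h different kernels of size e x f x f over it; the channel axis shrinks to
# (P - e) + 1 positions and the spatial axes to (a - f) + 1 positions per kernel.
h, e, f = 8, 3, 3
conv3d = nn.Conv3d(in_channels=1, out_channels=h, kernel_size=(e, f, f))

t1 = torch.rand(1, 1, 16, 9, 9)   # (batch, feature, P = 16, a = 9, a = 9)
t2 = conv3d(t1)
print(t2.shape)                    # torch.Size([1, 8, 14, 7, 7])
```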
Further optionally, step S3 is executed to obtain the intermediate feature map T3, and the specific operation comprises:
Step S3.1, perform a pooling operation, i.e. a downsampling or feature-discarding process, on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3; at this time, the number of channels of the intermediate feature map T3 is the same as the number of channels of the intermediate feature map T2, while the size of a single channel in the spatial dimension changes;
Step S3.2, after the pooling process, the intermediate feature map T3 is denoted by T3^(p×r×r), i.e. the number of pixel points of each channel of the intermediate feature map T3 along the spatial length and width directions is r, and the number of pixel points r is calculated using formula (5):
r = (m ÷ 2)   formula (5),
wherein m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions.
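A minimal sketch of the pooling of step S3, assuming 2 × 2 max pooling applied only to the spatial dimensions so that the channel count of T2 is preserved and r = m ÷ 2:

```python
import torch
import torch.nn as nn

# Step S3 sketch: pool over the two spatial dimensions only; the channel
# dimension of the 3D feature map is left untouched.
pool = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))

t2 = torch.rand(1, 8, 14, 14, 14)  # (batch, h, P', m, m)
t3 = pool(t2)
print(t3.shape)                     # torch.Size([1, 8, 14, 7, 7]), r = m // 2
```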
Further optionally, step S4 is executed to obtain the intermediate feature map T4, and the specific operation is:
transform the intermediate feature map T3 using formulas (6) and (7), i.e. the intermediate feature map T3 is first multiplied point by point with the channel attention module A3 in the channel direction, and then multiplied point by point with the plane attention module A'3 in the spatial direction, to obtain the intermediate feature map T4:
Atenspe(T3) = A3(T3) ⊗ T3   formula (6),
T4 = Atenspa[Atenspe(T3)] = A'3(T3) ⊗ Atenspe(T3)   formula (7),
wherein Atenspe denotes attention enhancement of the intermediate feature map T3 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T3 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T3, r is the number of pixel points of a single channel of the intermediate feature map T3 along the spatial length and width directions, p is the number of channels of the intermediate feature map T3, v is the v-th channel of the intermediate feature map T3, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
Further optionally, step S5 is executed to obtain the intermediate feature map T7, and the specific operation is:
Step S5.1, perform a 3D convolution operation on the intermediate feature map T4 using formula (8) to obtain the intermediate feature map T5, the intermediate feature map T5 being denoted by T5^(x×y×y):
T5^(x×y×y) = Con3D(T4^(p×r×r))   formula (8),
wherein Con3D denotes a 3D convolution operation, x denotes the number of pixel points of the intermediate feature map T5 along the spatial height direction, y denotes the number of pixel points of the intermediate feature map T5 along the spatial length and width directions, r is the number of pixel points of a single channel of the intermediate feature map T4 along the spatial length and width directions, and p is the number of channels of the intermediate feature map T4;
Step S5.2, perform a downsampling operation on the intermediate feature map T5 to obtain the intermediate feature map T6; at this time, the number of channels of the intermediate feature map T6 is the same as the number of channels of the intermediate feature map T5, while the size of a single channel in the spatial dimension changes, and the size of a single channel of the intermediate feature map T6 along the spatial length and width directions is:
z × z = [(y ÷ 2) × (y ÷ 2)],
wherein z is the number of pixel points of the intermediate feature map T6 along the spatial length and width directions, and y is the number of pixel points of the intermediate feature map T5 along the spatial length and width directions;
s5.3, utilizing the formulas (9) and (10) to perform intermediate feature map T6Carrying out feature transformation to obtain an intermediate feature map T7,T7Is that
Figure BDA0002883292150000054
Atenspe(T6) = A6(T6) ⊗ T6   formula (9),
T7 = Atenspa[Atenspe(T6)] = A'6(T6) ⊗ Atenspe(T6)   formula (10),
wherein Atenspe denotes attention enhancement of the intermediate feature map T6 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T6 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T6, z is the number of pixel points of a single channel of the intermediate feature map T6 along the spatial length and width directions, x is the number of channels of the intermediate feature map T6, v is the v-th channel of the intermediate feature map T6, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
Further optionally, when step S4 or S5 is executed, the specific operations of the channel attention module and the plane attention module are:
obtaining a channel attention module:
(1.1) first, the intermediate feature map T is centered in the spatial dimensioniPerforming maximum pooling and average pooling operations, respectively, to generate two pooling vectors, wherein i has a value of 3 or 6,
(1.2) inputting the two pooled vectors into a shared multilayer mapping neural network for training to respectively generate two new vectors,
(1.3) finally, carrying out bitwise addition on the two new vectors, and carrying out nonlinear mapping through a Sigmoid activation function, namely obtaining a channel attention module A by using the formulas (11) and (12)i(Ti),
(Formula (11) is shown only as an image in the original and is not reproduced here.)
Ai(Ti) = σ{MLP[AvePool(Ti)] + MLP[MaxPool(Ti)]}   formula (12),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes the nonlinear mapping performed through a multilayer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling;
(II) Obtaining the plane attention module:
(2.1) First, perform maximum pooling and average pooling operations on the intermediate feature map Ti in the channel dimension, respectively, to generate two pooling vectors;
(2.2) Then, map the two pooling vectors to a single-channel feature map of the same size through a convolution operation;
(2.3) Finally, perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the plane attention module A'i(Ti) using formulas (13) and (14):
(Formula (13) is shown only as an image in the original and is not reproduced here.)
A'i(Ti) = σ{f^(1×1)[AvePool(Ti); MaxPool(Ti)]}   formula (14),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f^(1×1) denotes the feature transformation performed by a 1 × 1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
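To make the two attention modules and their application in formulas (6)-(7) and (9)-(10) concrete, the sketch below gives a CBAM-style implementation in PyTorch; it treats each channel of the feature map as a 2D map, and the reduction ratio of the shared MLP is an illustrative assumption, not a value fixed by the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Formulas (11)-(12) sketch: pool over the spatial dimensions, pass both
    pooled vectors through a shared MLP, add bitwise and apply a Sigmoid."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (B, C, H, W)
        avg = self.mlp(t.mean(dim=(2, 3)))                 # average-pooling branch
        mx = self.mlp(t.amax(dim=(2, 3)))                  # max-pooling branch
        return torch.sigmoid(avg + mx).view(t.size(0), -1, 1, 1)

class PlaneAttention(nn.Module):
    """Formulas (13)-(14) sketch: pool over the channel dimension, concatenate
    the two maps, apply a 1 x 1 convolution and a Sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (B, C, H, W)
        avg = t.mean(dim=1, keepdim=True)
        mx = t.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

# Applying the modules as in formulas (6)-(7): weight the channels first,
# then weight the spatial positions.
t3 = torch.rand(2, 16, 7, 7)
t4 = t3 * ChannelAttention(16)(t3)
t4 = t4 * PlaneAttention()(t4)
print(t4.shape)                                            # torch.Size([2, 16, 7, 7])
```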
Further optionally, step S6 is executed to obtain the intermediate feature map T8, and the specific operation is:
Step S6.1, perform a 3D convolution operation on the intermediate dish feature map T7 with a convolution kernel of size ρ × z × z to obtain the one-dimensional intermediate feature map T8, i.e. each channel of the intermediate feature map T8 contains only one pixel point, wherein ρ is the side length of the convolution along the channel direction and z × z is the size of the convolution window;
Step S6.2, when the 3D convolution is performed on the intermediate dish feature map T7, the number of convolution kernels used is η, the vector length input to the convolution is α, and the vector length α' after the convolution is obtained using formula (15):
α' = [(α - ρ) + 1] × η   formula (15).
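A small sketch of the flattening convolution of step S6 under illustrative parameter values: a kernel covering the full z × z spatial extent leaves one value per output position, so η kernels yield a one-dimensional vector of length [(α − ρ) + 1] × η, matching formula (15).

```python
import torch
import torch.nn as nn

# Step S6 sketch: a (rho, z, z) kernel over an (alpha, z, z) volume keeps only
# (alpha - rho) + 1 positions along the channel axis and 1 x 1 spatially.
alpha, z, rho, eta = 16, 3, 4, 6
flatten_conv = nn.Conv3d(in_channels=1, out_channels=eta, kernel_size=(rho, z, z))

t7 = torch.rand(1, 1, alpha, z, z)
t8 = flatten_conv(t7).flatten(start_dim=1)
print(t8.shape)            # torch.Size([1, 78]) = [(16 - 4) + 1] * 6
```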
Further optionally, step S7 is executed to obtain the classification result of the dish image R1, and the specific operation is:
Step S7.1, select a deep convolutional neural network whose activation function is the softmax function shown in formula (16), the softmax function being preceded by one layer of neural network:
softmax(Yi) = e^(Yi) / Σj e^(Yj)   formula (16),
wherein Yi denotes the i-th element of the vector T;
Step S7.2, after the intermediate feature map T8 is input into the deep convolutional neural network, a vector T is obtained through one layer of neural network, the vector T enters the softmax function, and the softmax function maps the elements of the vector T into the (0, 1) interval to obtain the probability vector of the vector T; the name of the dish image R1 is then the name corresponding to the maximum probability value in the probability vector obtained from the softmax mapping.
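The classification of step S7 amounts to one fully connected layer followed by the softmax of formula (16); a minimal sketch, where the vector length 78 carries over from the previous sketch and the number of dish classes is an illustrative assumption:

```python
import torch
import torch.nn as nn

# Step S7 sketch: one layer of neural network maps T8 to the score vector T,
# softmax maps T into (0, 1) probabilities, and the predicted dish is the
# class with the largest probability value.
num_classes = 50
head = nn.Linear(78, num_classes)

t8 = torch.rand(1, 78)
probs = torch.softmax(head(t8), dim=1)
predicted_class = probs.argmax(dim=1)
print(round(probs.sum().item(), 4), predicted_class)   # probabilities sum to 1.0
```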
Compared with the prior art, the restaurant dish identification method based on the deep convolutional neural network has the following beneficial effects:
the method extracts the feature map of the dish sample block through the deep convolutional neural network and the attention modules and obtains the name of the original dish image through the softmax mapping; it has the advantage of high identification precision, reduces the workload of manual auxiliary operation, reduces the risk of overfitting, and overcomes the difficulty that existing dish identification and classification methods have in finely distinguishing dishes with high similarity.
Drawings
FIG. 1 is a simplified flow chart of a method according to a first embodiment of the present invention;
FIG. 2 is a simplified flow chart of obtaining the intermediate feature map T4 in the first embodiment of the present invention.
Detailed Description
In order to make the technical scheme, the technical problems to be solved and the technical effects of the present invention clearer, the technical scheme of the present invention is described clearly and completely below with reference to specific embodiments.
The first embodiment is as follows:
With reference to FIG. 1 and FIG. 2, the present embodiment provides a restaurant dish identification method based on a deep convolutional neural network, the implementation content of which includes:
Step S1, collect a dish image R1, perform a cutting preprocessing operation on the dish image R1 to obtain a dish image R2, and take sample blocks from the dish image R2 to obtain a dish sample block T1; the dish sample block T1 is the characteristic information of the dish sample.
In this step, the specific operation of taking sample blocks from the dish image R2 comprises the following steps:
Step S1.1, in the plane dimension, for each pixel point of the dish image R2, take the surrounding a × a pixel points as the neighborhood block of the sample, wherein a is the number of pixel points of the image block along the plane length and width directions;
Step S1.2, retain all channel information of the a × a pixel points, i.e. form a three-dimensional sample block of size P × a × a that represents the sample feature of the center pixel point, and perform the feature transformation of the sample-block taking process using formula (1):
(Formula (1) is shown only as an image in the original; it defines the sample-block taking transform Dsamp that maps the dish image R2 to the Q sample blocks T1 of size P × a × a.)
wherein Q is the number of pixel points in a single channel, which is also the number of block samples, Dsamp denotes the sample-block taking process, and L and H denote the preset plane length and width of the cutting operation;
in the sample-block taking operation, when an edge pixel point has no spatial neighborhood information, a zero-padding operation is performed.
Step S2, perform a 3D convolution operation on the dish sample block T1 to obtain the intermediate feature map T2 of the dish sample block T1, and the specific operation comprises the following steps:
Step S2.1, based on the deep convolutional neural network, select h different convolution kernels in each layer of the convolutional neural network, and perform a convolution operation on the P channels of information contained in the dish sample block T1 using 3D convolution kernels of size e × f × f, wherein e is the number of operation layers in the channel dimension, i.e. e channels are selected for each group of convolution, and f is the number of pixel points of the image block along the length and width directions of the spatial dimension;
Step S2.2, after h different convolution kernels are selected in each layer of the convolutional neural network, the intermediate feature map T2 of the dish sample block T1 is obtained using formulas (2), (3) and (4):
P' = [(P - e) + 1] × h   formula (2),
m = [(a - e) + 1]   formula (3),
T2^(P'×m×m) = Con3D(T1)   formula (4),
wherein P denotes the number of channels contained in the dish sample block T1, e is the number of operation layers in the channel dimension, a is the number of pixel points of the image block along the plane length and width directions, m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions, and Con3D denotes performing a 3D convolution operation.
In this step, in the process of performing the 3D convolution operation on the dish sample block T1, the mapping of each feature in a convolutional layer is connected to several adjacent continuous channels of the previous layer, and a given position value of one convolution mapping is obtained by convolving the local receptive fields at the same position of three continuous channels of the previous layer; one convolutional layer has several convolution kernels, one convolution kernel can extract only one type of feature information from the three-dimensional data, and h types of feature information can be extracted using h convolution kernels, wherein h is a positive integer and h > 1.
Step S3, perform a pooling operation on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3, and the specific operation comprises the following steps:
Step S3.1, perform a pooling operation, i.e. a downsampling or feature-discarding process, on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3; at this time, the number of channels of the intermediate feature map T3 is the same as the number of channels of the intermediate feature map T2, while the size of a single channel in the spatial dimension changes;
Step S3.2, after the pooling process, the intermediate feature map T3 is denoted by T3^(p×r×r), i.e. the number of pixel points of each channel of the intermediate feature map T3 along the spatial length and width directions is r, and the number of pixel points r is calculated using formula (5):
r = (m ÷ 2)   formula (5),
wherein m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions.
Step S4, perform a pooling operation on the intermediate feature map T3 in the spatial dimension to obtain the channel attention module A3, perform a pooling operation on the intermediate feature map T3 in the channel dimension to obtain the plane attention module A'3, and multiply each channel vector of the intermediate feature map T3 by the channel attention module and each spatial feature of the intermediate feature map T3 position by position by the plane attention module, to obtain the intermediate feature map T4.
Step S4 includes two aspects: on the one hand obtaining the channel attention module A3 and the plane attention module A'3, and on the other hand obtaining the intermediate feature map T4.
First, the specific process of obtaining the channel attention module A3 and the plane attention module A'3 is described.
(I) Obtaining the channel attention module:
(1.1) First, perform maximum pooling and average pooling operations on the intermediate feature map T3 in the spatial dimension, respectively, to generate two pooling vectors;
(1.2) Input the two pooling vectors into a shared multilayer mapping neural network for training, generating two new vectors respectively;
(1.3) Finally, add the two new vectors bitwise and perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the channel attention module A3(T3) using formulas (11) and (12):
(Formula (11) is shown only as an image in the original and is not reproduced here.)
A3(T3) = σ{MLP[AvePool(T3)] + MLP[MaxPool(T3)]}   formula (12),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes the nonlinear mapping performed through a multilayer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
(II) Obtaining the plane attention module:
(2.1) First, perform maximum pooling and average pooling operations on the intermediate feature map T3 in the channel dimension, respectively, to generate two pooling vectors;
(2.2) Then, map the two pooling vectors to a single-channel feature map of the same size through a convolution operation;
(2.3) Finally, perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the plane attention module A'3(T3) using formulas (13) and (14):
(Formula (13) is shown only as an image in the original and is not reproduced here.)
A'3(T3) = σ{f^(1×1)[AvePool(T3); MaxPool(T3)]}   formula (14),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f^(1×1) denotes the feature transformation performed by a 1 × 1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
Next, the specific operation of obtaining the intermediate feature map T4 is described:
transform the intermediate feature map T3 using formulas (6) and (7), i.e. the intermediate feature map T3 is first multiplied point by point with the channel attention module A3 in the channel direction, and then multiplied point by point with the plane attention module A'3 in the spatial direction, to obtain the intermediate feature map T4:
Atenspe(T3) = A3(T3) ⊗ T3   formula (6),
T4 = Atenspa[Atenspe(T3)] = A'3(T3) ⊗ Atenspe(T3)   formula (7),
wherein Atenspe denotes attention enhancement of the intermediate feature map T3 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T3 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T3, r is the number of pixel points of a single channel of the intermediate feature map T3 along the spatial length and width directions, p is the number of channels of the intermediate feature map T3, v is the v-th channel of the intermediate feature map T3, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
Step S5, perform a 3D convolution operation and a pooling operation in sequence on the intermediate feature map T4 to obtain the intermediate feature map T6, perform a pooling operation on the intermediate feature map T6 in the spatial dimension to obtain the channel attention module A6, perform a pooling operation on the intermediate feature map T6 in the channel dimension to obtain the plane attention module A'6, and multiply each channel vector of the intermediate feature map T6 by the channel attention module and each spatial feature of the intermediate feature map T6 position by position by the plane attention module, to obtain the intermediate feature map T7.
The specific operation of implementing step S5 is:
Step S5.1, perform a 3D convolution operation on the intermediate feature map T4 using formula (8) to obtain the intermediate feature map T5, the intermediate feature map T5 being denoted by T5^(x×y×y):
T5^(x×y×y) = Con3D(T4^(p×r×r))   formula (8),
wherein Con3D denotes a 3D convolution operation, x denotes the number of pixel points of the intermediate feature map T5 along the spatial height direction, y denotes the number of pixel points of the intermediate feature map T5 along the spatial length and width directions, r is the number of pixel points of a single channel of the intermediate feature map T4 along the spatial length and width directions, and p is the number of channels of the intermediate feature map T4;
Step S5.2, perform a downsampling operation on the intermediate feature map T5 to obtain the intermediate feature map T6; at this time, the number of channels of the intermediate feature map T6 is the same as the number of channels of the intermediate feature map T5, while the size of a single channel in the spatial dimension changes, and the size of a single channel of the intermediate feature map T6 along the spatial length and width directions is:
z × z = [(y ÷ 2) × (y ÷ 2)],
wherein z is the number of pixel points of the intermediate feature map T6 along the spatial length and width directions, and y is the number of pixel points of the intermediate feature map T5 along the spatial length and width directions;
Step S5.3, perform a feature transformation on the intermediate feature map T6 using formulas (9) and (10) to obtain the intermediate feature map T7, the intermediate feature map T7 being denoted by T7^(x×z×z):
Atenspe(T6) = A6(T6) ⊗ T6   formula (9),
T7 = Atenspa[Atenspe(T6)] = A'6(T6) ⊗ Atenspe(T6)   formula (10),
wherein Atenspe denotes attention enhancement of the intermediate feature map T6 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T6 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T6, z is the number of pixel points of a single channel of the intermediate feature map T6 along the spatial length and width directions, x is the number of channels of the intermediate feature map T6, v is the v-th channel of the intermediate feature map T6, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
The specific operations of obtaining the channel attention module A6 and the plane attention module A'6 in formula (10) are as follows:
(I) Obtaining the channel attention module A6:
(1.1) First, perform maximum pooling and average pooling operations on the intermediate feature map T6 in the spatial dimension, respectively, to generate two pooling vectors;
(1.2) Input the two pooling vectors into a shared multilayer mapping neural network for training, generating two new vectors respectively;
(1.3) Finally, add the two new vectors bitwise and perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the channel attention module A6(T6) using formulas (11') and (12'):
(Formula (11') is shown only as an image in the original and is not reproduced here.)
A6(T6) = σ{MLP[AvePool(T6)] + MLP[MaxPool(T6)]}   formula (12'),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes the nonlinear mapping performed through a multilayer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling;
(II) Obtaining the plane attention module:
(2.1) First, perform maximum pooling and average pooling operations on the intermediate feature map T6 in the channel dimension, respectively, to generate two pooling vectors;
(2.2) Then, map the two pooling vectors to a single-channel feature map of the same size through a convolution operation;
(2.3) Finally, perform a nonlinear mapping through a Sigmoid activation function, i.e. obtain the plane attention module A'6(T6) using formulas (13') and (14'):
(Formula (13') is shown only as an image in the original and is not reproduced here.)
A'6(T6) = σ{f^(1×1)[AvePool(T6); MaxPool(T6)]}   formula (14'),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f^(1×1) denotes the feature transformation performed by a 1 × 1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
Step S6, perform a 3D convolution operation on the intermediate feature map T7 to obtain the one-dimensional intermediate feature map T8, and the specific operations are as follows:
Step S6.1, perform a 3D convolution operation on the intermediate dish feature map T7 with a convolution kernel of size ρ × z × z to obtain the one-dimensional intermediate feature map T8, i.e. each channel of the intermediate feature map T8 contains only one pixel point, wherein ρ is the side length of the convolution along the channel direction and z × z is the size of the convolution window;
Step S6.2, when the 3D convolution is performed on the intermediate dish feature map T7, the number of convolution kernels used is η, the vector length input to the convolution is α, and the vector length α' after the convolution is obtained using formula (15):
α' = [(α - ρ) + 1] × η   formula (15).
Step S7, input the intermediate feature map T8 into the deep convolutional neural network to obtain the classification result of the dish image R1, and the specific operations are as follows:
Step S7.1, select a deep convolutional neural network whose activation function is the softmax function shown in formula (16), the softmax function being preceded by one layer of neural network:
softmax(Yi) = e^(Yi) / Σj e^(Yj)   formula (16),
wherein Yi denotes the i-th element of the vector T;
Step S7.2, after the intermediate feature map T8 is input into the deep convolutional neural network, a vector T is obtained through one layer of neural network, the vector T enters the softmax function, and the softmax function maps the elements of the vector T into the (0, 1) interval to obtain the probability vector of the vector T; the name of the dish image R1 is then the name corresponding to the maximum probability value in the probability vector obtained from the softmax mapping.
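To close the embodiment, the sketch below chains the illustrative pieces above into a single end-to-end module; the attention modules of steps S4 and S5 are omitted for brevity (they would be applied after each pooling stage as shown earlier), steps S6-S7 are approximated by flattening plus one linear layer, and all layer sizes are assumptions chosen only so that the shapes compose, not parameters disclosed by the patent.

```python
import torch
import torch.nn as nn

class DishNetSketch(nn.Module):
    """Illustrative end-to-end sketch of steps S2-S7 for a batch of sample blocks."""
    def __init__(self, num_classes: int = 50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(3, 3, 3)),    # step S2: 3D convolution
            nn.MaxPool3d((1, 2, 2)),                    # step S3: spatial pooling
            nn.Conv3d(8, 16, kernel_size=(3, 3, 3)),    # step S5: 3D convolution
            nn.MaxPool3d((1, 2, 2)),                    # step S5: spatial pooling
        )
        self.classifier = nn.LazyLinear(num_classes)    # steps S6-S7 approximation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x).flatten(start_dim=1)
        return torch.softmax(self.classifier(x), dim=1) # formula (16)

model = DishNetSketch()
probs = model(torch.rand(2, 1, 16, 33, 33))   # two sample blocks, P = 16, a = 33
print(probs.shape)                             # torch.Size([2, 50])
```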
In conclusion, the restaurant dish identification method based on the deep convolutional neural network can improve identification precision, reduce the workload of manual auxiliary operation, and overcome the difficulty that existing dish identification and classification methods have in finely distinguishing dishes with high similarity.
The principles and embodiments of the present invention have been described in detail using specific examples, which are provided only to aid in understanding the core technical content of the present invention. Any improvements and modifications made to the present invention by those skilled in the art on the basis of the above embodiments, without departing from the principle of the present invention, shall fall within the protection scope of the present invention.

Claims (10)

1. A restaurant dish identification method based on a deep convolutional neural network, characterized by comprising the following implementation contents:
Step S1, collecting a dish image R1, performing a cutting preprocessing operation on the dish image R1 to obtain a dish image R2, and taking sample blocks from the dish image R2 to obtain a dish sample block T1, the dish sample block T1 being the characteristic information of the dish sample;
Step S2, performing a 3D convolution operation on the dish sample block T1 to obtain an intermediate feature map T2 of the dish sample block T1;
Step S3, performing a pooling operation on the intermediate feature map T2 of the dish sample block T1 to obtain an intermediate feature map T3;
Step S4, performing a pooling operation on the intermediate feature map T3 in the spatial dimension to obtain a channel attention module A3, performing a pooling operation on the intermediate feature map T3 in the channel dimension to obtain a plane attention module A'3, and multiplying each channel vector of the intermediate feature map T3 by the channel attention module and each spatial feature of the intermediate feature map T3 position by position by the plane attention module, to obtain an intermediate feature map T4;
Step S5, performing a 3D convolution operation and a pooling operation in sequence on the intermediate feature map T4 to obtain an intermediate feature map T6, performing a pooling operation on the intermediate feature map T6 in the spatial dimension to obtain a channel attention module A6, performing a pooling operation on the intermediate feature map T6 in the channel dimension to obtain a plane attention module A'6, and multiplying each channel vector of the intermediate feature map T6 by the channel attention module and each spatial feature of the intermediate feature map T6 position by position by the plane attention module, to obtain an intermediate feature map T7;
Step S6, performing a 3D convolution operation on the intermediate feature map T7 to obtain a one-dimensional intermediate feature map T8;
Step S7, inputting the intermediate feature map T8 into a deep convolutional neural network to obtain the classification result of the dish image R1.
2. The restaurant dish identification method based on a deep convolutional neural network of claim 1, wherein in step S1, sample blocks are taken from the dish image R2, and the specific operation comprises the following steps:
Step S1.1, in the plane dimension, for each pixel point of the dish image R2, taking the surrounding a × a pixel points as the neighborhood block of the sample, wherein a is the number of pixel points of the image block along the plane length and width directions;
Step S1.2, retaining all channel information of the a × a pixel points, i.e. forming a three-dimensional sample block of size P × a × a that represents the sample feature of the center pixel point, and performing the feature transformation of the sample-block taking process using formula (1):
(Formula (1) is shown only as an image in the original; it defines the sample-block taking transform Dsamp that maps the dish image R2 to the Q sample blocks T1 of size P × a × a.)
wherein Q is the number of pixel points in a single channel, which is also the number of block samples, Dsamp denotes the sample-block taking process, and L and H denote the preset plane length and width of the cutting operation;
in the sample-block taking operation, when an edge pixel point has no spatial neighborhood information, a zero-padding operation is performed.
3. The restaurant dish identification method based on a deep convolutional neural network of claim 2, wherein in step S2, a 3D convolution operation is performed on the dish sample block T1, and the specific operation comprises:
Step S2.1, based on the deep convolutional neural network, selecting h different convolution kernels in each layer of the convolutional neural network, and performing a convolution operation on the P channels of information contained in the dish sample block T1 using 3D convolution kernels of size e × f × f, wherein e is the number of operation layers in the channel dimension, i.e. e channels are selected for each group of convolution, and f is the number of pixel points of the image block along the length and width directions of the spatial dimension;
Step S2.2, after h different convolution kernels are selected in each layer of the convolutional neural network, obtaining the intermediate feature map T2 of the dish sample block T1 using formulas (2), (3) and (4):
P' = [(P - e) + 1] × h   formula (2),
m = [(a - e) + 1]   formula (3),
T2^(P'×m×m) = Con3D(T1)   formula (4),
wherein P denotes the number of channels contained in the dish sample block T1, e is the number of operation layers in the channel dimension, a is the number of pixel points of the image block along the plane length and width directions, m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions, and Con3D denotes performing a 3D convolution operation.
4. The restaurant dish identification method based on a deep convolutional neural network of claim 3, wherein in the process of performing the 3D convolution operation on the dish sample block T1, the mapping of each feature in a convolutional layer is connected to several adjacent continuous channels of the previous layer, and a given position value of one convolution mapping is obtained by convolving the local receptive fields at the same position of three continuous channels of the previous layer; one convolutional layer has several convolution kernels, one convolution kernel can extract only one type of feature information from the three-dimensional data, and h types of feature information can be extracted using h convolution kernels, wherein h is a positive integer and h > 1.
5. The restaurant dish identification method based on a deep convolutional neural network of claim 3, wherein step S3 is executed to obtain the intermediate feature map T3, and the specific operation comprises:
Step S3.1, performing a pooling operation, i.e. a downsampling or feature-discarding process, on the intermediate feature map T2 of the dish sample block T1 to obtain the intermediate feature map T3; at this time, the number of channels of the intermediate feature map T3 is the same as the number of channels of the intermediate feature map T2, while the size of a single channel in the spatial dimension changes;
Step S3.2, after the pooling process, the intermediate feature map T3 is denoted by T3^(p×r×r), i.e. the number of pixel points of each channel of the intermediate feature map T3 along the spatial length and width directions is r, and the number of pixel points r is calculated using formula (5):
r = (m ÷ 2)   formula (5),
wherein m is the number of pixel points of the intermediate feature map T2 along the spatial length and width directions.
6. The restaurant dish identification method based on a deep convolutional neural network of claim 5, wherein step S4 is executed to obtain the intermediate feature map T4, and the specific operation is:
transforming the intermediate feature map T3 using formulas (6) and (7), i.e. the intermediate feature map T3 is first multiplied point by point with the channel attention module A3 in the channel direction, and then multiplied point by point with the plane attention module A'3 in the spatial direction, to obtain the intermediate feature map T4:
Atenspe(T3) = A3(T3) ⊗ T3   formula (6),
T4 = Atenspa[Atenspe(T3)] = A'3(T3) ⊗ Atenspe(T3)   formula (7),
wherein Atenspe denotes attention enhancement of the intermediate feature map T3 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T3 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T3, r is the number of pixel points of a single channel of the intermediate feature map T3 along the spatial length and width directions, p is the number of channels of the intermediate feature map T3, v is the v-th channel of the intermediate feature map T3, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
7. The restaurant dish identification method based on a deep convolutional neural network of claim 6, wherein step S5 is executed to obtain the intermediate feature map T7, and the specific operation is:
Step S5.1, performing a 3D convolution operation on the intermediate feature map T4 using formula (8) to obtain the intermediate feature map T5, the intermediate feature map T5 being denoted by T5^(x×y×y):
T5^(x×y×y) = Con3D(T4^(p×r×r))   formula (8),
wherein Con3D denotes a 3D convolution operation, x denotes the number of pixel points of the intermediate feature map T5 along the spatial height direction, y denotes the number of pixel points of the intermediate feature map T5 along the spatial length and width directions, r is the number of pixel points of a single channel of the intermediate feature map T4 along the spatial length and width directions, and p is the number of channels of the intermediate feature map T4;
Step S5.2, performing a downsampling operation on the intermediate feature map T5 to obtain the intermediate feature map T6; at this time, the number of channels of the intermediate feature map T6 is the same as the number of channels of the intermediate feature map T5, while the size of a single channel in the spatial dimension changes, and the size of a single channel of the intermediate feature map T6 along the spatial length and width directions is:
z × z = [(y ÷ 2) × (y ÷ 2)],
wherein z is the number of pixel points of the intermediate feature map T6 along the spatial length and width directions, and y is the number of pixel points of the intermediate feature map T5 along the spatial length and width directions;
Step S5.3, performing a feature transformation on the intermediate feature map T6 using formulas (9) and (10) to obtain the intermediate feature map T7, the intermediate feature map T7 being denoted by T7^(x×z×z):
Atenspe(T6) = A6(T6) ⊗ T6   formula (9),
T7 = Atenspa[Atenspe(T6)] = A'6(T6) ⊗ Atenspe(T6)   formula (10),
wherein Atenspe denotes attention enhancement of the intermediate feature map T6 in the channel direction, Atenspa denotes attention enhancement of the intermediate feature map T6 in the spatial direction, u is the u-th pixel point contained in a single channel of the intermediate feature map T6, z is the number of pixel points of a single channel of the intermediate feature map T6 along the spatial length and width directions, x is the number of channels of the intermediate feature map T6, v is the v-th channel of the intermediate feature map T6, and the symbol ⊗ denotes multiplication of elements at the same position in matrices of the same type.
8. The restaurant dish identification method based on the deep convolutional neural network of claim 7, wherein when step S4 or S5 is executed, the specific operations of the channel attention module and the plane attention module are:
obtaining a channel attention module:
(1.1) first, the intermediate feature map T is centered in the spatial dimensioniPerforming maximum pooling and average pooling operations, respectively, to generate two pooling vectors, wherein i has a value of 3 or 6,
(1.2) inputting the two pooled vectors into a shared multilayer mapping neural network for training to respectively generate two new vectors,
(1.3) finally, carrying out bitwise addition on the two new vectors, and carrying out nonlinear mapping through a Sigmoid activation function, namely obtaining a channel attention module A by using the formulas (11) and (12)i(Ti),
(Formula (11) is shown only as an image in the original and is not reproduced here.)
Ai(Ti) = σ{MLP[AvePool(Ti)] + MLP[MaxPool(Ti)]}   formula (12),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, MLP denotes the nonlinear mapping performed through a multilayer neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling;
(II) Obtaining the plane attention module:
(2.1) First, performing maximum pooling and average pooling operations on the intermediate feature map Ti in the channel dimension, respectively, to generate two pooling vectors;
(2.2) Then, mapping the two pooling vectors to a single-channel feature map of the same size through a convolution operation;
(2.3) Finally, performing a nonlinear mapping through a Sigmoid activation function, i.e. obtaining the plane attention module A'i(Ti) using formulas (13) and (14):
(Formula (13) is shown only as an image in the original and is not reproduced here.)
A'i(Ti) = σ{f^(1×1)[AvePool(Ti); MaxPool(Ti)]}   formula (14),
wherein σ denotes the Sigmoid activation function, e is the number of operation layers in the channel dimension, f^(1×1) denotes the feature transformation performed by a 1 × 1 convolutional neural network, AvePool denotes average pooling, and MaxPool denotes maximum pooling.
9. The restaurant dish identification method based on a deep convolutional neural network of claim 1, wherein step S6 is executed to obtain the intermediate feature map T8, and the specific operation is:
Step S6.1, performing a 3D convolution operation on the intermediate dish feature map T7 with a convolution kernel of size ρ × z × z to obtain the one-dimensional intermediate feature map T8, i.e. each channel of the intermediate feature map T8 contains only one pixel point, wherein ρ is the side length of the convolution along the channel direction and z × z is the size of the convolution window;
Step S6.2, when the 3D convolution is performed on the intermediate dish feature map T7, the number of convolution kernels used is η, the vector length input to the convolution is α, and the vector length α' after the convolution is obtained using formula (15):
α' = [(α - ρ) + 1] × η   formula (15).
10. The restaurant dish identification method based on a deep convolutional neural network of claim 1, wherein step S7 is executed to obtain the classification result of the dish image R1, and the specific operation is:
Step S7.1, selecting a deep convolutional neural network whose activation function is the softmax function shown in formula (16), the softmax function being preceded by one layer of neural network:
softmax(Yi) = e^(Yi) / Σj e^(Yj)   formula (16),
wherein Yi denotes the i-th element of the vector T;
Step S7.2, after the intermediate feature map T8 is input into the deep convolutional neural network, a vector T is obtained through one layer of neural network, the vector T enters the softmax function, and the softmax function maps the elements of the vector T into the (0, 1) interval to obtain the probability vector of the vector T; the name of the dish image R1 is then the name corresponding to the maximum probability value in the probability vector obtained from the softmax mapping.
CN202110006146.7A 2021-01-05 2021-01-05 Restaurant dish identification method based on deep convolutional neural network Active CN112699822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110006146.7A CN112699822B (en) 2021-01-05 2021-01-05 Restaurant dish identification method based on deep convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110006146.7A CN112699822B (en) 2021-01-05 2021-01-05 Restaurant dish identification method based on deep convolutional neural network

Publications (2)

Publication Number Publication Date
CN112699822A true CN112699822A (en) 2021-04-23
CN112699822B CN112699822B (en) 2023-05-30

Family

ID=75514577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110006146.7A Active CN112699822B (en) 2021-01-05 2021-01-05 Restaurant dish identification method based on deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN112699822B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845527A (en) * 2016-12-29 2017-06-13 南京江南博睿高新技术研究院有限公司 A kind of vegetable recognition methods
CN107578060A (en) * 2017-08-14 2018-01-12 电子科技大学 A kind of deep neural network based on discriminant region is used for the method for vegetable image classification
CN109377205A (en) * 2018-12-06 2019-02-22 深圳市淘米科技有限公司 A kind of cafeteria's intelligence settlement system based on depth convolutional network
CN110689056A (en) * 2019-09-10 2020-01-14 Oppo广东移动通信有限公司 Classification method and device, equipment and storage medium
CN111667489A (en) * 2020-04-30 2020-09-15 华东师范大学 Cancer hyperspectral image segmentation method and system based on double-branch attention deep learning

Also Published As

Publication number Publication date
CN112699822B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN112329800B (en) Salient object detection method based on global information guiding residual attention
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN109543502B (en) Semantic segmentation method based on deep multi-scale neural network
CN104834922B (en) Gesture identification method based on hybrid neural networks
CN109117703B (en) Hybrid cell type identification method based on fine-grained identification
CN109376611A (en) A kind of saliency detection method based on 3D convolutional neural networks
CN110163286B (en) Hybrid pooling-based domain adaptive image classification method
CN105701508A (en) Global-local optimization model based on multistage convolution neural network and significant detection algorithm
CN111709909A (en) General printing defect detection method based on deep learning and model thereof
CN113239954B (en) Attention mechanism-based image semantic segmentation feature fusion method
CN110992374B (en) Hair refinement segmentation method and system based on deep learning
CN110751072B (en) Double-person interactive identification method based on knowledge embedded graph convolution network
CN113657528B (en) Image feature point extraction method and device, computer terminal and storage medium
CN111160356A (en) Image segmentation and classification method and device
CN111401426A (en) Small sample hyperspectral image classification method based on pseudo label learning
CN112257741A (en) Method for detecting generative anti-false picture based on complex neural network
CN110991563A (en) Capsule network random routing algorithm based on feature fusion
CN107480471A (en) The method for the sequence similarity analysis being characterized based on wavelet transformation
CN110490210B (en) Color texture classification method based on t sampling difference between compact channels
CN112699822A (en) Restaurant dish identification method based on deep convolutional neural network
CN106446909A (en) Chinese food image feature extraction method
CN116229455A (en) Pinellia ternate origin identification method and system based on multi-scale feature deep neural network
CN114419341B (en) Convolutional neural network image recognition method based on transfer learning improvement
CN115775226A (en) Transformer-based medical image classification method
CN114998756A (en) Yolov 5-based remote sensing image detection method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant