CN112560918A - Dish identification method based on improved YOLO v3 - Google Patents

Dish identification method based on improved YOLO v3

Info

Publication number
CN112560918A
CN112560918A (application CN202011430608.XA)
Authority
CN
China
Prior art keywords
dish
iou
yolo
network
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011430608.XA
Other languages
Chinese (zh)
Other versions
CN112560918B (en)
Inventor
高明裕
石杰
董哲康
林辉品
陈超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202011430608.XA
Publication of CN112560918A
Application granted
Publication of CN112560918B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dish identification method based on improved YOLO v3. A deep-learning object detection algorithm is used to identify the category of each dish and regress the bounding box of the target, providing category and relative position information for the dish-serving operation of a robot. The method builds on the one-stage algorithm YOLO v3: the feature extraction network is redesigned on the basis of ResNet and SENet, the SEblock53 feature extraction network replaces Darknet53, and deformable convolution (DCN) is introduced to form a new Backbone. Recognition results on university canteen serving-window scenes show that the method locates and classifies dishes quickly and accurately and can realize the dish recognition function of a restaurant service robot.

Description

Dish identification method based on improved YOLO v3
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a dish identification method based on improved YOLO v3.
Background
At present, dishes are served over the counter in the restaurants and canteens of many fast-food outlets, schools, factories and other institutions in China; the work is tedious, demands much labor and carries a high labor cost. Using a robot instead of a human worker to complete the serving operation has therefore become one of the solutions. For a restaurant-oriented intelligent robot system to complete the serving operation, the robot must fully replace human eyes in recognizing dishes, that is, it must identify the category and region of every dish in an input image.
Chinese dishes are highly varied, different dishes can look similar, and the color and shape of the same dish differ each time it is cooked; all of this makes dish identification harder, so recognition rates are low and recognition is slow. Existing techniques identify dishes that have already been portioned into bowls or other tableware rather than identifying them directly; a worker must still place the food into bowls in advance, so the serving process still requires manual participation. If such a method is applied directly to the dishes in the serving basin, the image also changes as the amount of food decreases after each serving, and recognition accuracy drops.
In recent years, with the rapid development of neural networks and deep learning, many excellent object detection networks have been proposed; the two key tasks of object detection are classification and localization. Among these networks, the YOLO v3 algorithm achieves a good balance of speed and accuracy. However, the feature extraction part of YOLO v3 pays the same attention to every channel, even though the features in some channels are more valuable to the network and deserve more attention. In addition, the standard convolution in the feature extraction part can only sample the input feature map on a fixed geometric grid, so the convolutional network is limited in modeling geometric deformation and cannot adapt well to unknown deformations.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a dish identification method based on improved YOLO v3: the YOLO v3 algorithm is optimized by embedding an SE module into the residual modules of Darknet53 and introducing deformable convolution (DC), so that dishes in canteen serving basins are identified without manual participation.
A dish identification method based on improved YOLO v3 specifically comprises the following steps:
step one, data acquisition
Use a camera to capture images of the serving basins holding different dishes at the restaurant's serving window;
step two, image preprocessing
Analyze and process the images acquired in step one and perform data enhancement;
Preferably, the data enhancement method is to randomly add a small amount of Gaussian noise to the images to improve the inference and generalization capability of the network.
Step three, data set division
Build a data set from the images enhanced in step two, label the dish category and bounding box in each image (the labeled category and box are called the real category and real box), and divide the labeled data set into a training set, a validation set and a test set;
step four, analyzing and processing the data set
Perform a visual analysis of the labeled data set, count the number of real boxes of each category in the training, validation and test sets, and expand the data for any dish category whose real-box count is less than half of the average over all categories. The expansion method is to randomly crop the real-box regions, or select the whole image, of dish images from the under-represented categories, transform the cropped or selected images with pix2pix image style transfer, and add the results back into the data set.
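As an illustrative sketch of the counting and cropping step (assuming Pascal VOC style XML annotations and Python; the directory layout, field names and helper names are placeholders, not part of the filing), the per-category real-box counts can be gathered and the under-represented categories cropped for later pix2pix translation:

```python
import glob
import xml.etree.ElementTree as ET
from collections import Counter
from PIL import Image

def class_box_counts(annotation_dir):
    """Count the real (ground-truth) boxes of every dish category in a VOC-style split."""
    counts = Counter()
    for xml_path in glob.glob(f"{annotation_dir}/*.xml"):
        for obj in ET.parse(xml_path).getroot().iter("object"):
            counts[obj.findtext("name")] += 1
    return counts

def crop_rare_class_boxes(annotation_dir, image_dir, out_dir):
    """Crop the real-box regions of categories below half the per-category average."""
    counts = class_box_counts(annotation_dir)
    threshold = sum(counts.values()) / len(counts) / 2.0
    rare = {name for name, n in counts.items() if n < threshold}
    for xml_path in glob.glob(f"{annotation_dir}/*.xml"):
        root = ET.parse(xml_path).getroot()
        image = Image.open(f"{image_dir}/{root.findtext('filename')}")
        for k, obj in enumerate(root.iter("object")):
            name = obj.findtext("name")
            if name not in rare:
                continue
            b = obj.find("bndbox")
            box = tuple(int(float(b.findtext(t))) for t in ("xmin", "ymin", "xmax", "ymax"))
            # crops are later fed to a pix2pix style-transfer model and added back to the data set
            image.crop(box).save(f"{out_dir}/{name}_{k}.jpg")
```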
Step five, constructing a feature extraction and classification model
The YOLO v3 network comprises a feature extraction part and a prediction part; here the feature extraction part is optimized to build the neural network structure for feature extraction and classification. The feature extraction part of YOLO v3 is Darknet53, which contains 5 residual blocks, each composed of a 2x-downsampling convolutional layer CBL and a group of repeated residual units, where a CBL consists of a 2D convolution, Batch Normalization and Leaky ReLU. In this method, an SE module is embedded into the residual unit to form a new residual unit. The structure of the new residual unit is: the input feature map x passes through two CBLs of (1 × 1, 1) and (3 × 3, 1) to produce an output feature map f; f is passed through an SE module consisting of Global Pooling, FC + ReLU and FC + Sigmoid, the resulting channel weights are multiplied with f, and the weighted result is added to the input feature map x. The repetition counts of the residual units in the 5 residual blocks are 1, 2, 8, 8 and 4 in turn; the output of the last residual block is connected to a layer of deformable convolution DC to form the feature extraction part of the network.
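A minimal sketch of this SE-embedded residual unit, assuming a PyTorch implementation; the class names and the SE reduction ratio are illustrative choices, not values taken from the filing:

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv2d + Batch Normalization + Leaky ReLU, the basic Darknet53 building block."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class SEResidualUnit(nn.Module):
    """Residual unit whose residual branch is re-weighted by an SE block."""
    def __init__(self, channels, reduction=16):      # reduction ratio is an assumption
        super().__init__()
        self.cbl1 = CBL(channels, channels // 2, 1)   # the (1 x 1, 1) CBL
        self.cbl2 = CBL(channels // 2, channels, 3)   # the (3 x 3, 1) CBL
        self.se = nn.Sequential(                      # Global Pooling, FC+ReLU, FC+Sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.cbl2(self.cbl1(x))   # residual branch output f
        w = self.se(f)                # learned per-channel weights in (0, 1)
        return x + f * w              # weighted residual added to the input x
```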
The deformable convolution DC adds a learnable offset parameter Δp_n to the standard convolution operation. The offset Δp_n is obtained by convolving the input feature layer, and the output y(p_0) corresponding to the current point p_0 is calculated according to formula (1):

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)    (1)

where R = {(-1,-1), (0,-1), (1,-1), (-1,0), (0,0), (1,0), (-1,1), (0,1), (1,1)}, w(p_n) is the convolution weight, x(p_0 + p_n) is the feature map input, p_0 + p_n corresponds to the 9 positions of the 3 × 3 grid centered on p_0 that are sampled from the input features, and p_0 + p_n + Δp_n corresponds to the 9 shifted positions. Each point of the input feature map is taken in turn as the current point p_0 and its output y(p_0) is computed, completing the feature extraction.
Drawing on the idea of FPN, the prediction part fuses the features of the 3rd and 4th residual blocks of the feature extraction network with the output of the final DC layer to generate feature maps at three scales, 52 × 52, 26 × 26 and 13 × 13, and on each of the three feature maps it predicts the adjustment parameters of the bounding-box center coordinates (x, y), the adjustment parameters of the prior-box width and height (w, h), the category confidence and the prediction probability. The 13 × 13 feature map is used to predict large targets, the 26 × 26 map medium targets, and the 52 × 52 map small targets.
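The adjustment parameters are applied in the usual YOLO v3 fashion; the short sketch below shows the standard decoding convention (sigmoid on the center offsets, exponential scaling of the prior box), which is assumed here rather than spelled out in the filing:

```python
import torch

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Standard YOLO v3 decoding of one prediction cell.

    (cx, cy) is the grid-cell index, (pw, ph) the prior (anchor) size in pixels,
    and stride the downsampling factor of the feature map (8, 16 or 32).
    """
    bx = (torch.sigmoid(tx) + cx) * stride  # adjusted box center x on the input image
    by = (torch.sigmoid(ty) + cy) * stride  # adjusted box center y
    bw = pw * torch.exp(tw)                 # prior-box width scaled by the prediction
    bh = ph * torch.exp(th)                 # prior-box height scaled by the prediction
    return bx, by, bw, bh
```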
Step six, training and optimizing classification models
Transfer-learn from a model fully trained on a large data set, input the training set to train the model, cross-validate the model with the validation set after each iteration, and compute the training loss. Adopt a learning rate adjustment strategy together with label smoothing, and then fine-tune the network layer by layer;
the learning rate adjustment strategy is to use a WarmUp strategy at the initial stage of model optimization, and use a cosine attenuation strategy after the training loss is reduced to be gentle, so that the learning rate is smoother, and a periodically-changed learning rate is provided to enable the network to jump out of local optimum; wherein the loss comprises the loss of the center point coordinates (x, y), the (w, h) loss of the anchor, the confidence loss and the class prediction loss.
Fine-tuning the network layer by layer means dividing the whole network into the feature extraction network and the three YOLO prediction branches, freezing the feature extraction network, then freezing two of the three YOLO prediction branches in turn and fine-tuning the remaining one.
When the set number of iterations is reached, the training and optimization of the model are finished and the model parameters are saved.
Preferably, the number of iterations is set to less than 500.
Step seven, model test and result analysis
Input the test set into the classification model optimized in step six, and apply an improved NMS method to the output to address problems such as low classification confidence and redundant, repeated candidate boxes.
The improved NMS method has 4 parameters: the confidence threshold score_threshold, the maximum number of boxes selected by non-maximum suppression max_output_size, the IOU threshold iou_threshold1, and the nIOU threshold iou_threshold2. The confidence threshold score_threshold, the IOU threshold iou_threshold1 and the nIOU threshold iou_threshold2 range from 0.3 to 0.5, and max_output_size is set according to the per-image real-box counts obtained in the analysis of step four.
First, detection boxes whose confidence is lower than score_threshold are rejected. Then the detection boxes are screened a first time by the NMS method: boxes whose IOU value with a retained box is larger than iou_threshold1 are removed, and once the number of retained boxes reaches max_output_size all remaining boxes are removed and the next screening stage begins. The IOU value is calculated as:

\mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|}

where A denotes a remaining detection box and B denotes a detection box that has already been retained.
The boxes that survive the first screening are then screened again: the detection box with the highest confidence is taken as the first retained box, and the nIOU value between each remaining detection box and the retained boxes is calculated in turn according to the formula below (given as an image in the original publication):

[nIOU calculation formula, rendered as an image in the original publication]

A remaining detection box is rejected when its nIOU value is greater than iou_threshold2; otherwise it is retained.
Finally, the coordinates and confidences of the retained detection boxes are output.
Preferably, the confidence threshold score_threshold is 0.35.
Preferably, the IOU threshold iou_threshold1 and the nIOU threshold iou_threshold2 are 0.45 and 0.35, respectively.
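A sketch of this two-stage filtering is given below, assuming axis-aligned boxes in (x1, y1, x2, y2) form with one confidence score each; because the exact nIOU formula appears only as an image in the filing, it is left as a pluggable function, and the default thresholds are the preferred values stated above:

```python
import numpy as np

def iou(a, b):
    """Standard IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def improved_nms(boxes, scores, niou_fn, score_threshold=0.35,
                 iou_threshold1=0.45, iou_threshold2=0.35, max_output_size=12):
    """Two-stage screening: score filter + standard NMS, then an nIOU pass."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_threshold]

    kept = []                                        # stage 1: standard NMS with IOU
    for i in order:
        if len(kept) >= max_output_size:             # enough boxes kept, drop the rest
            break
        if all(iou(boxes[i], boxes[j]) <= iou_threshold1 for j in kept):
            kept.append(i)

    final = []                                       # stage 2: nIOU re-screening
    for i in kept:                                   # kept is already sorted by confidence
        if not final or all(niou_fn(boxes[i], boxes[j]) <= iou_threshold2 for j in final):
            final.append(i)
    return [(boxes[i], scores[i]) for i in final]    # coordinates and confidences
```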
Step eight, dish classification
Use a camera to record the restaurant's serving window in real time, feed the video into the saved model for classification and detection, store the dish category and position information from the model output, and use it to assist the serving robot in completing the serving operation.
The invention has the following beneficial effects:
the improved feature extraction network can adaptively learn the interrelationship among different channels of the feature maps, pay attention to different degrees, and transmit the global feature information of each feature map downwards to reduce the loss of information; the method can better adapt to the continuous geometric deformation of dishes in the dish serving process, and improves the identification accuracy of the network; improved NMS processing can better suppress redundant prediction blocks.
Drawings
FIG. 1 is an image of a dish collected in an embodiment;
FIG. 2 is the real-box distribution in the embodiment data set;
FIG. 3 is the real-box distribution of some dish categories in the embodiment;
FIG. 4 is a comparison between before and after transformation using the pix2pix method in examples;
FIG. 5 is a feature extraction and classification model constructed in an embodiment;
FIG. 6(a) illustrates standard convolution sampling and FIG. 6(b) illustrates deformable convolution sampling;
FIG. 7 is a loss curve of training optimization in the example;
fig. 8 shows the classification results obtained in the examples.
Detailed Description
The invention is further explained below with reference to the drawings;
step one, data acquisition
6219 images of college restaurant dishes are collected, as shown in FIG. 1.
Step two, image preprocessing
After analyzing the image data collected in step one, dish categories that appear too few times are removed and the 37 dish categories that meet the condition are kept as the detection target categories. A small amount of Gaussian noise is added to the retained dish images to improve the inference and generalization capability of the network, and the result serves as the classification data set.
Step three, data set division
A Pascal VOC format data set is made from the images enhanced in step two, the dish category and bounding box in each image are labeled (the labeled category and box are called the real category and real box), and the labeled data set is randomly divided into a training set, a validation set and a test set at a ratio of 45:5:6; the training set contains 4975 images, the validation set 552 images, and the test set 692 images.
Step four, analyzing and processing the data set
A visual analysis is performed on the labeled data set and the number of real boxes of each category is counted in the training, validation and test sets. The training set contains 23055 real boxes and the validation set 3017; each picture contains at most 8 and at least 1 real boxes, with an average of 4-5, as shown in fig. 2. The real-box counts of some dish categories are shown in fig. 3, and data expansion is carried out on the images of the bean buns, green soybean and winter melon strips, steamed meat with egg, fried pumpkin, and brined shrimp: real-box regions are randomly cropped, or the whole image is selected, from these dish images, the cropped or selected images are transformed with pix2pix image style transfer, and the results are added to the data set to complete the expansion, as shown in fig. 4.
The size ratio of the real boxes is also analyzed: the area of a real box is about 12%-30% of the original image. By the definition of relative size, a target whose real-box length and width are less than 10% of the length and width of the original image, i.e. whose area is less than 1% of the image, is called a small target. The analysis shows that medium-sized boxes are the majority and large ones second, so the detection task mainly involves medium and large targets.
Step five, constructing a feature extraction and classification model
The feature extraction part of the YOLO v3 network is optimized and improved to build the neural network structure for feature extraction and classification, as shown in fig. 5. The SE module is embedded into the 5 residual modules of Darknet53 in the YOLO v3 network; it mainly comprises two operations, Squeeze and Excitation, which model the correlations among feature channels and strengthen the important features. The Squeeze operation uses Global Average Pooling to encode the entire spatial feature of a channel into a global feature. The Excitation operation captures the relationships between channels with a sigmoid-style gating mechanism using two fully connected layers: the first fully connected layer reduces the dimensionality and is followed by a ReLU activation, the second restores the dimensionality and is followed by a sigmoid activation, and finally the learned activation value of each channel is multiplied with the original features. The whole SE operation can be viewed as learning the weight coefficient of each channel of the feature. The feature extraction part of YOLO v3 is Darknet53, which contains 5 residual blocks, each composed of a 2x-downsampling convolutional layer CBL and a group of repeated residual units, where a CBL consists of a 2D convolution, Batch Normalization and Leaky ReLU. In this method, an SE module is embedded into the residual unit to form a new residual unit whose structure is: the input feature map x passes through two CBLs of (1 × 1, 1) and (3 × 3, 1) to produce an output feature map f; f passes through an SE module consisting of Global Pooling, FC + ReLU and FC + Sigmoid, the resulting weights are multiplied with f, and the weighted result is added to the input feature map x. The repetition counts of the residual units in the 5 residual blocks are 1, 2, 8, 8 and 4 in turn; the output of the last residual block is connected to a layer of deformable convolution DC to form the feature extraction part of the network.
For a standard 3 × 3 convolution, 9 positions are sampled from the input features to compute the output at each point; these 9 positions lie on a 3 × 3 grid spreading out from the current point p_0, as shown in fig. 6(a). The output y(p_0) corresponding to the current point p_0 is:

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)

where R = {(-1,-1), (0,-1), (1,-1), (-1,0), (0,0), (1,0), (-1,1), (0,1), (1,1)}, p_0 + p_n corresponds to the 9 positions sampled from the input features around p_0, w(p_n) is the convolution weight, and x(p_0 + p_n) is the feature map input.
The deformable convolution DCN adds a learnable offset parameter Δp_n to the standard convolution operation. The offset parameter Δp_n is obtained by convolving the input features: for an input feature layer of size w × h × c, where w and h are the width and height and c is the number of channels, a 3 × 3 convolution with 2c kernels is applied to the input feature layer, and the output offset parameter has size w × h × 2c, with the first c channels being the offsets in the x direction and the last c channels the offsets in the y direction.
The output y(p_0) corresponding to the current point p_0 is calculated according to equation (5):

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)    (5)

where p_0 + p_n + Δp_n are the 9 positions after the offset; as shown in fig. 6(b), the 9 points with the offset parameter added no longer form a rectangular grid. Each point of the input feature map is taken in turn as the current point p_0 and its output y(p_0) is computed, completing the feature extraction.
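A minimal sketch of this deformable sampling, assuming torchvision's DeformConv2d; note that torchvision expects one 2D offset per sampling point (2 x 3 x 3 = 18 channels for a 3 x 3 kernel), which differs from the per-channel w × h × 2c layout described above:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    """3x3 deformable convolution whose offsets are predicted from the input."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # one 2D offset for each of the 3x3 = 9 sampling points (torchvision layout)
        self.offset_conv = nn.Conv2d(c_in, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(c_in, c_out, kernel_size=3, padding=1)

    def forward(self, x):
        offsets = self.offset_conv(x)        # learnable offsets, the Delta p_n of eq. (5)
        return self.deform_conv(x, offsets)  # y(p0) = sum_n w(p_n) * x(p0 + p_n + Delta p_n)

# usage sketch: dc = DeformableConvBlock(1024, 1024); y = dc(torch.randn(1, 1024, 13, 13))
```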
Drawing on the idea of FPN, the prediction part fuses the features of the 3rd and 4th residual blocks of the feature extraction network with the output of the final DC layer to generate feature maps at three scales, 52 × 52, 26 × 26 and 13 × 13, and predicts on them the adjustment parameters of the bounding-box center coordinates (x, y), the (w, h) adjustment parameters of the prior box, the category confidence and the prediction probability. The 13 × 13 feature map is used to predict large targets, the 26 × 26 map medium targets, and the 52 × 52 map small targets.
Step six, training and optimizing classification models
First, a pre-trained model fully trained on the COCO data set is used for transfer training; the training set is input to train the model, the validation set is used for cross-validation after each iteration, and the training loss is computed. A WarmUp strategy is used in the initial stage to stabilize training; once the training loss flattens, a cosine decay strategy is adopted so that the learning rate changes smoothly, and the periodically varying learning rate lets the network escape local optima. The loss comprises the center-coordinate (x, y) loss, the anchor (w, h) loss, the confidence loss and the class prediction loss. Because missing or erroneous labels can also occur while the dish data set is being built, label smoothing is used as a regularization strategy: it reduces the network's reliance on label confidence, adapts well to data with missing or wrong labels, and lessens the impact of labeling errors on classification accuracy.
the whole network is divided into a feature extraction network and three YOLO prediction branch layers, after the feature extraction network is frozen, two of the three YOLO prediction layers are respectively frozen, and the remaining YOLO prediction layer is subjected to fine adjustment. The loss curve of the training process is shown in fig. 7.
After 490 training iterations, the training and optimization of the model are completed and the model parameters are saved.
Step seven, model test and result analysis
The test set is input into the classification model optimized in step six, and the improved NMS method is applied to the output to address problems such as low classification confidence and redundant, repeated candidate boxes. The confidence threshold score_threshold is set to 0.35, the maximum number of boxes selected by non-maximum suppression max_output_size is 12, the IOU threshold iou_threshold1 is 0.45, and the nIOU threshold iou_threshold2 is 0.35.
First, detection boxes whose confidence is lower than score_threshold are rejected. Then the detection boxes are screened a first time by the NMS method: boxes whose IOU value with a retained box is larger than iou_threshold1 are removed, and once the number of retained boxes reaches 12 all remaining boxes are removed and the subsequent screening begins. The IOU value is calculated as:

\mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|}

where A denotes a remaining detection box and B denotes a detection box that has already been retained.
The boxes that survive this screening are screened again: the detection box with the highest confidence is taken as the first retained box, and the nIOU value between each remaining detection box and the retained boxes is calculated in turn according to the formula below (given as an image in the original publication):

[nIOU calculation formula, rendered as an image in the original publication]

A remaining detection box is rejected when its nIOU value is greater than iou_threshold2; otherwise it is retained.
The final recognition result is output as shown in fig. 8.
The recognition accuracy on the test-set images is calculated with the VOC-standard mean Average Precision (mAP) method; the accuracy of this embodiment is 91.16%.
The foregoing detailed description is intended to illustrate rather than limit the invention; any changes and modifications that fall within the spirit and scope of the invention are intended to be covered by the following claims.

Claims (8)

1. A dish identification method based on improved YOLO v3 is characterized in that: the method specifically comprises the following steps:
step one, data acquisition;
capturing, with a camera, images of the serving basins holding different dishes at the restaurant's serving window;
step two, preprocessing an image;
analyzing and processing the images acquired in step one and performing data enhancement;
step three, dividing a data set;
making a data set from the images enhanced in step two, labeling the dish category and bounding box in each image (the labeled category and box being called the real category and real box), and dividing the labeled data set into a training set, a validation set and a test set;
analyzing and processing the data set;
carrying out a visual analysis of the labeled data set, counting the number of real boxes of each category in the training, validation and test sets, and performing data expansion on the dish images of any category whose real-box count is less than half of the average over all categories;
constructing a feature extraction and classification model;
embedding an SE module into the residual units of the 5 residual blocks of the Darknet53 network in the YOLO v3 feature extraction part to form a new residual unit, the structure of the new residual unit being: an input feature map x passes through two convolutional layers CBL of (1 × 1, 1) and (3 × 3, 1) to give an output feature map f; f passes through an SE module consisting of Global Pooling, FC + ReLU and FC + Sigmoid, the resulting weights are multiplied with f, and the weighted result is added to the input feature map x; the repetition counts of the residual units in the 5 residual blocks are 1, 2, 8, 8 and 4 in turn; connecting the output of the last residual block with a layer of deformable convolution DC to form the feature extraction part of the network;
the prediction part fuses the features of the 3rd and 4th residual blocks in the feature extraction network with the output of the deformable convolution DC to generate feature maps at three scales, 52 × 52, 26 × 26 and 13 × 13, and predicts on the three feature maps the adjustment parameters of the bounding-box center coordinates (x, y), the adjustment parameters of the prior-box width and height (w, h), the category confidence and the prediction probability;
step six, training and optimizing a classification model;
performing transfer learning from a model fully trained on a large data set, inputting the training set to train the model, cross-validating the model with the validation set after each iteration, and computing the training loss; adopting a learning rate adjustment strategy together with label smoothing, and then fine-tuning the network layer by layer; when the set number of iterations is reached, finishing the training and optimization of the model and saving the model parameters;
step seven, model testing and result analysis;
inputting the test set into the classification model optimized in step six, and applying an improved NMS method to the output to address problems such as low classification confidence and redundant, repeated candidate boxes, the method comprising the following steps:
s7.1, setting the confidence threshold score_threshold, the IOU threshold iou_threshold1 and the nIOU threshold iou_threshold2 to values in the range 0.3-0.5, and setting the maximum number of boxes selected by non-maximum suppression max_output_size according to the per-image real-box counts obtained in the analysis of step four;
s7.2, eliminating the detection boxes whose confidence is lower than score_threshold;
s7.3, applying the NMS method to the detection boxes screened in step 7.2: eliminating the boxes whose IOU value with a retained box is larger than iou_threshold1, and once the number of retained boxes reaches the set value of max_output_size, eliminating all the remaining boxes and entering step 7.4, the IOU value being calculated as:
\mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|}

wherein A represents the remaining detection boxes and B represents the detection boxes that have been retained;
s7.4, screening again the detection boxes retained after the preceding steps, and eliminating the boxes whose nIOU value is greater than iou_threshold2, the nIOU value being calculated by the following formula:
[nIOU calculation formula, rendered as an image in the original publication]
s7.5, finally outputting the coordinates and confidences of the detection boxes retained in step 7.4;
step eight, dish classification;
recording the restaurant's serving window in real time with a camera, inputting the video into the saved model for classification and detection, storing the dish category and position information from the model output, and assisting the serving robot in completing the serving operation.
2. The dish identification method based on improved YOLO v3 of claim 1, wherein: the data enhancement method in step two is to randomly add a small amount of Gaussian noise to the images.
3. The dish identification method based on improved YOLO v3 of claim 1, wherein: the data expansion method in step four is to randomly crop the real-box regions of, or select the whole image from, the dish images that need expansion, then transform the cropped or selected images with pix2pix image style transfer, and add the results into the data set to complete the expansion.
4. The dish identification method based on improved YOLO v3 of claim 1, wherein: the deformable convolution DC adds a learnable offset parameter Δp_n to the standard convolution operation, the offset parameter Δp_n being obtained by convolving the input feature layer, and the output y(p_0) corresponding to the current point p_0 being calculated according to formula (1):

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)    (1)

wherein R = {(-1,-1), (0,-1), (1,-1), (-1,0), (0,0), (1,0), (-1,1), (0,1), (1,1)}, w(p_n) is the convolution weight, x(p_0 + p_n) is the feature map input, p_0 + p_n corresponds to the 9 positions of the 3 × 3 grid centered on p_0 sampled from the input features, and p_0 + p_n + Δp_n corresponds to the 9 shifted positions; each point of the input feature map is taken in turn as the current point p_0 and its output y(p_0) is computed, completing the feature extraction.
5. The dish identification method based on improved YOLO v3 of claim 1, wherein: in step six, the learning rate adjustment strategy uses a WarmUp strategy in the initial stage of model optimization and switches to cosine decay once the training loss flattens, keeping the learning rate smooth and providing a periodically varying learning rate so that the network escapes local optima; wherein the loss comprises the center-coordinate (x, y) loss, the anchor (w, h) loss, the confidence loss and the class prediction loss.
6. The dish identification method based on improved YOLO v3 of claim 1, wherein: in step six, fine-tuning the network layer by layer means dividing the whole network into the feature extraction network and three YOLO prediction branches, freezing the feature extraction network, then freezing two of the three prediction branches in turn, and fine-tuning the remaining one.
7. The dish identification method based on improved YOLO v3 of claim 1, wherein: the number of iterations set in step six is less than 500.
8. The dish identification method based on improved YOLO v3 of claim 1, wherein: the confidence threshold score_threshold, the IOU threshold iou_threshold1 and the nIOU threshold iou_threshold2 in step seven are set to 0.35, 0.45 and 0.35, respectively.
CN202011430608.XA 2020-12-07 2020-12-07 Dish identification method based on improved YOLO v3 Active CN112560918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011430608.XA CN112560918B (en) 2020-12-07 2020-12-07 Dish identification method based on improved YOLO v3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011430608.XA CN112560918B (en) 2020-12-07 2020-12-07 Dish identification method based on improved YOLO v3

Publications (2)

Publication Number Publication Date
CN112560918A true CN112560918A (en) 2021-03-26
CN112560918B CN112560918B (en) 2024-02-06

Family

ID=75059897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011430608.XA Active CN112560918B (en) 2020-12-07 2020-12-07 Dish identification method based on improved YOLO v3

Country Status (1)

Country Link
CN (1) CN112560918B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033706A (en) * 2021-04-23 2021-06-25 广西师范大学 Multi-source two-stage dish identification method based on visual target detection and re-identification
CN113269161A (en) * 2021-07-16 2021-08-17 四川九通智路科技有限公司 Traffic signboard detection method based on deep learning
CN113435337A (en) * 2021-06-28 2021-09-24 中国电信集团系统集成有限责任公司 Video target detection method and device based on deformable convolution and attention mechanism
CN113591575A (en) * 2021-06-29 2021-11-02 北京航天自动控制研究所 Target detection method based on improved YOLO v3 network
CN113977609A (en) * 2021-11-29 2022-01-28 杭州电子科技大学 Automatic dish-serving system based on double-arm mobile robot and control method thereof
CN114310872A (en) * 2021-11-29 2022-04-12 杭州电子科技大学 Mechanical arm automatic dish-serving method based on DGG point cloud segmentation network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325084A (en) * 2019-08-29 2020-06-23 西安铱食云餐饮管理有限公司 Dish information identification method and terminal based on YOLO neural network
CN111401148A (en) * 2020-02-27 2020-07-10 江苏大学 Road multi-target detection method based on improved multilevel YO L Ov3

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325084A (en) * 2019-08-29 2020-06-23 西安铱食云餐饮管理有限公司 Dish information identification method and terminal based on YOLO neural network
CN111401148A (en) * 2020-02-27 2020-07-10 江苏大学 Road multi-target detection method based on improved multilevel YO L Ov3

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033706A (en) * 2021-04-23 2021-06-25 广西师范大学 Multi-source two-stage dish identification method based on visual target detection and re-identification
CN113435337A (en) * 2021-06-28 2021-09-24 中国电信集团系统集成有限责任公司 Video target detection method and device based on deformable convolution and attention mechanism
CN113591575A (en) * 2021-06-29 2021-11-02 北京航天自动控制研究所 Target detection method based on improved YOLO v3 network
CN113269161A (en) * 2021-07-16 2021-08-17 四川九通智路科技有限公司 Traffic signboard detection method based on deep learning
CN113977609A (en) * 2021-11-29 2022-01-28 杭州电子科技大学 Automatic dish-serving system based on double-arm mobile robot and control method thereof
CN114310872A (en) * 2021-11-29 2022-04-12 杭州电子科技大学 Mechanical arm automatic dish-serving method based on DGG point cloud segmentation network
CN113977609B (en) * 2021-11-29 2022-12-23 杭州电子科技大学 Automatic dish serving system based on double-arm mobile robot and control method thereof
CN114310872B (en) * 2021-11-29 2023-08-22 杭州电子科技大学 Automatic vegetable-beating method for mechanical arm based on DGG point cloud segmentation network

Also Published As

Publication number Publication date
CN112560918B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN112560918A (en) Dish identification method based on improved YOLO v3
CN111126202B (en) Optical remote sensing image target detection method based on void feature pyramid network
CN109147254B (en) Video field fire smoke real-time detection method based on convolutional neural network
CN110363215B (en) Method for converting SAR image into optical image based on generating type countermeasure network
CN109522857B (en) People number estimation method based on generation type confrontation network model
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
CN110570363A (en) Image defogging method based on Cycle-GAN with pyramid pooling and multi-scale discriminator
CN108629370B (en) Classification recognition algorithm and device based on deep belief network
CN105657402A (en) Depth map recovery method
CN111340046A (en) Visual saliency detection method based on feature pyramid network and channel attention
CN111126278B (en) Method for optimizing and accelerating target detection model for few-class scene
CN111626993A (en) Image automatic detection counting method and system based on embedded FEFnet network
CN112347970B (en) Remote sensing image ground object identification method based on graph convolution neural network
CN114943963A (en) Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN111798469A (en) Digital image small data set semantic segmentation method based on deep convolutional neural network
CN111539314A (en) Cloud and fog shielding-oriented sea surface target significance detection method
CN113033687A (en) Target detection and identification method under rain and snow weather condition
CN115484410A (en) Event camera video reconstruction method based on deep learning
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN115810149A (en) High-resolution remote sensing image building extraction method based on superpixel and image convolution
CN116151319A (en) Method and device for searching neural network integration model and electronic equipment
CN117058079A (en) Thyroid imaging image automatic diagnosis method based on improved ResNet model
CN110826691A (en) Intelligent seismic velocity spectrum pickup method based on YOLO and LSTM
CN115439738A (en) Underwater target detection method based on self-supervision cooperative reconstruction
CN115497164A (en) Multi-view framework sequence fusion method based on graph convolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant