CN112560918A - Dish identification method based on improved YOLO v3 - Google Patents
Dish identification method based on improved YOLO v3
- Publication number: CN112560918A (application CN202011430608.XA)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a dish identification method based on improved YOLO v3, which uses a deep-learning target detection algorithm to identify the type of each dish and regress the bounding box of the target, providing category and relative position information for the dish-serving operation of a robot. The method builds on the one-stage YOLO v3 algorithm: drawing on ResNet and SENet, the feature extraction network is optimally redesigned, an SE-augmented network (SEblock53) replaces Darknet53, and deformable convolution DCN is introduced to form a new Backbone. Recognition results on college canteen serving-window scenes show that the method can quickly and accurately locate and classify dishes, and can realize the dish recognition function of a restaurant service robot.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a dish identification method based on improved YOLO v3.
Background
At present, dishes are served in the restaurants and canteens of many fast-food outlets, schools, factories and other organizations in China; the work is tedious, the demand for labor is large, and labor costs are high. Using a robot in place of a worker to complete the serving operation has become one of the solutions. For an intelligent robot system oriented to restaurant service to complete the serving operation, the robot must fully replace human eyes in identifying dishes, i.e., it must recognize the category and region of each dish in an input image.
Chinese dishes are highly varied, different dishes may look similar, and the color and shape of the same dish differ each time it is cooked; all of these conditions make dish identification harder, so recognition rates are low and speeds slow. Existing methods identify dishes that have already been portioned into bowls and other serving vessels rather than identifying dishes in the serving basin directly, so workers must still portion the dishes in advance and the serving process still requires manual participation. If such a method were applied directly to the dishes in a serving basin, the image would change with the amount of food remaining after each serving, and recognition accuracy would drop.
In recent years, with the rapid development of neural networks and deep learning, many excellent target detection networks have been proposed; the two key tasks of target detection are target classification and target localization. Among these, the YOLO v3 algorithm achieves a good balance of speed and performance on the target detection task. However, the feature extraction part of YOLO v3 attends to features in different channels equally, even though features in some channels have greater value for the network and deserve more attention. Meanwhile, the standard convolution in the feature extraction part can only sample the input feature map on a fixed geometric grid, so the convolutional neural network is limited in modeling geometric deformation and cannot adapt well to unknown deformations.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a dish identification method based on improved YOLO v3: the YOLO v3 algorithm is optimized and improved, an SE module is embedded into the residual modules of Darknet53, deformable convolution DC is introduced, and dish identification in canteen serving basins is completed without manual participation.
A dish identification method based on improved YOLO v3 specifically comprises the following steps:
step one, data acquisition
Shooting images of the dish containing basins containing different dishes in the dish placing window of the restaurant by using a camera;
step two, image preprocessing
Analyzing and processing the image acquired in the step one and enhancing data;
preferably, the data enhancement method is to add a small amount of gaussian noise to the image randomly to improve the reasoning and generalization capability of the network.
Step three, data set division
Making a data set by using the image subjected to the enhancement processing in the step two, labeling the dish type and the boundary box in the image, calling the labeled dish type and the boundary box as a real type and a real box, and dividing the labeled data set into a training set, a verification set and a test set;
step four, analyzing and processing the data set
Carry out visual analysis of the labeled data set, count the number of real frames of each category in the training, verification and test sets respectively, and perform data expansion for any dish category whose number of real frames is less than half the average over all categories. The expansion method is to randomly crop real-frame regions, or select the whole image, from the dish images with few real frames, transform the cropped or selected images using pix2pix image style transfer, and add the transformed results to the data set to complete the expansion.
Step five, constructing a feature extraction and classification model
The YOLO v3 network comprises a feature extraction part and a prediction part; the feature extraction part is optimized and improved here to construct a neural network structure for feature extraction and classification. The feature extraction part of the YOLO v3 network is Darknet53, which contains 5 residual blocks, each composed of a 2× downsampling convolutional layer CBL and a group of repeated residual units, where a CBL consists of a 2D convolution, Batch Normalization and Leaky ReLU. The method embeds the SE module into the residual unit to form a new residual unit. The structure of the new residual unit is: an input feature map x passes through two CBLs of (1 × 1, 1) and (3 × 3, 1) to give an output feature map f; f is passed through an SE module consisting of Global Pooling, FC + ReLU and FC + Sigmoid, the resulting channel weights are multiplied with f, and the weighted result is superposed with the input feature map x. The repetition counts of the residual units in the 5 residual blocks are 1, 2, 8, 8 and 4 in turn; the output of the last residual block is connected to a layer of deformable convolution DC to form the feature extraction part of the network.
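The channel-reweighting behavior of the new residual unit can be sketched numerically; the fully connected weights `w1`/`w2` and the stand-in `conv` callable below are illustrative placeholders for learned parameters, not the trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(f, w1, w2):
    """Squeeze-and-Excitation reweighting of a (C, H, W) feature map f.

    w1: (C//r, C) dimension-reduction weights; w2: (C, C//r) restoration
    weights. Both are illustrative stand-ins for learned FC parameters.
    """
    s = f.mean(axis=(1, 2))         # Squeeze: global average pooling -> (C,)
    z = np.maximum(w1 @ s, 0.0)     # FC + ReLU (dimension reduction)
    w = sigmoid(w2 @ z)             # FC + Sigmoid -> per-channel weights in (0, 1)
    return f * w[:, None, None]     # Excitation: rescale each channel

def se_residual_unit(x, conv, w1, w2):
    """New residual unit: the conv output f is SE-weighted, then added back to x."""
    f = conv(x)
    return x + se_block(f, w1, w2)
```

Because the SE weights lie in (0, 1), the block can only attenuate channels relative to each other, which is how the network learns to attend more to informative channels.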
The deformable convolution DC adds a learnable offset parameter Δpn on the basis of the standard convolution operation; the offset parameter Δpn is obtained by convolution over the input feature layer. The output y(p0) corresponding to the current point p0 is calculated according to equation (1):

y(p0) = Σ_{pn∈R} w(pn) · x(p0 + pn + Δpn)    (1)

where R = {(-1,-1), (0,-1), (1,-1), (-1,0), (0,0), (1,0), (-1,1), (0,1), (1,1)}, w(pn) is a convolution parameter and x(·) is the feature map input; p0 + pn corresponds to the 9 positions of a 3 × 3 grid centered on p0 and spreading to all sides, sampled from the input features, and p0 + pn + Δpn corresponds to the 9 positions after the offset. Taking each point on the input feature map in turn as the current point p0 and calculating its output y(p0) completes the feature extraction.
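A numerical sketch of equation (1), with the offsets supplied directly rather than predicted by an extra convolution, and fractional sampling positions resolved by bilinear interpolation as in the DCN formulation:

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly sample the 2-D map x at fractional (py, px); out-of-range taps read 0."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    dy, dx = py - y0, px - x0
    val = 0.0
    for yy, xx, wt in [(y0, x0, (1 - dy) * (1 - dx)), (y0, x0 + 1, (1 - dy) * dx),
                       (y0 + 1, x0, dy * (1 - dx)),   (y0 + 1, x0 + 1, dy * dx)]:
        if 0 <= yy < h and 0 <= xx < w:
            val += wt * x[yy, xx]
    return val

# the 3x3 sampling grid R from the text, as (dx, dy) pairs
R = [(-1, -1), (0, -1), (1, -1), (-1, 0), (0, 0), (1, 0), (-1, 1), (0, 1), (1, 1)]

def deformable_output(x, w, p0, offsets):
    """y(p0) = sum_n w(p_n) * x(p0 + p_n + dp_n).

    p0 is (row, col); w is a list of 9 kernel weights; offsets is a list of 9
    (dx, dy) learnable shifts, here given explicitly for illustration.
    """
    r0, c0 = p0
    out = 0.0
    for pn, wn, dpn in zip(R, w, offsets):
        out += wn * bilinear(x, r0 + pn[1] + dpn[1], c0 + pn[0] + dpn[0])
    return out
```

With all offsets zero the computation reduces exactly to the standard convolution of the preceding discussion; nonzero offsets deform the sampling grid away from the rectangle.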
The prediction part draws on the idea of FPN: it fuses the outputs of the 3rd and 4th residual blocks of the feature extraction network with the output of the last DC layer to generate feature maps at three scales, 52 × 52, 26 × 26 and 13 × 13, and on each of the three feature maps predicts the adjustment parameters of the target bounding box center coordinates (x, y), the adjustment parameters of the prior box width and height (w, h), the category confidence and the prediction probability. The 13 × 13 feature map is used to predict large targets, the 26 × 26 feature map medium targets, and the 52 × 52 feature map small targets.
Step six, training and optimizing classification models
Perform transfer learning from a model fully pre-trained on a large data set: input the training set to train the model, cross-validate the model on the verification set after each iteration, and compute the training loss. Adopt a learning rate adjustment strategy together with label smoothing, then fine-tune the network layer by layer;
the learning rate adjustment strategy is to use a WarmUp strategy at the initial stage of model optimization, and use a cosine attenuation strategy after the training loss is reduced to be gentle, so that the learning rate is smoother, and a periodically-changed learning rate is provided to enable the network to jump out of local optimum; wherein the loss comprises the loss of the center point coordinates (x, y), the (w, h) loss of the anchor, the confidence loss and the class prediction loss.
Fine-tuning the network layer by layer means dividing the whole network into the feature extraction network and three YOLO prediction branch layers, freezing the feature extraction network, then freezing two of the three YOLO prediction layers in turn and fine-tuning the remaining one.
And when the set iteration times are reached, finishing the training optimization of the model and storing the model parameters.
Preferably, the number of iterations is set to less than 500.
Step seven, model test and result analysis
Input the test set into the classification model optimized in step six, and address problems such as low classification confidence and redundant duplicate candidate frames in the output using an improved NMS method.
The improved NMS method has 4 parameters: the confidence threshold score_threshold, the maximum number of boxes selected by non-maximum suppression max_output_size, the IOU threshold iou_threshold1 and the nIOU threshold iou_threshold2. score_threshold, iou_threshold1 and iou_threshold2 range from 0.3 to 0.5, and max_output_size is set according to the per-image number of real frames counted in the analysis of step four.
First, detection frames with confidence below score_threshold are rejected; then the detection frames are screened a first time by the NMS method, removing frames whose IOU value exceeds iou_threshold1. When the number of retained frames reaches the set value of max_output_size, all remaining frames are removed and the next screening stage begins. The IOU value is calculated as:

IOU = Area(A ∩ B) / Area(A ∪ B)

where A denotes a remaining detection frame and B denotes a detection frame that has been retained.
The frames surviving the first screening are screened again: the frame with the highest confidence is taken as the first retained frame, and the nIOU values between each remaining frame and the retained frames are computed in turn. A remaining frame is rejected when its nIOU value exceeds iou_threshold2, and retained otherwise.
Finally, the coordinates and confidences of the retained detection frames are output.
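The two-stage screening above can be sketched as follows. The exact nIOU normalization of the second stage is not reproduced here, so standard IOU with the stricter iou_threshold2 is substituted for illustration; the default parameter values follow the preferred values given in this description:

```python
import numpy as np

def iou(a, b):
    """Standard IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def improved_nms(boxes, scores, score_threshold=0.35, max_output_size=12,
                 iou_threshold1=0.45, iou_threshold2=0.35):
    """Two-stage screening sketch of the improved NMS described above.

    Stage 1 is ordinary greedy NMS with iou_threshold1, capped at
    max_output_size boxes; stage 2 re-screens the survivors against the
    already-kept boxes with the stricter iou_threshold2 (standard IOU
    substituted for the patent's nIOU). Returns kept box indices.
    """
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_threshold]
    stage1 = []
    for i in order:                      # stage 1: greedy NMS
        if all(iou(boxes[i], boxes[j]) <= iou_threshold1 for j in stage1):
            stage1.append(i)
        if len(stage1) == max_output_size:
            break
    keep = []
    for i in stage1:                     # stage 2: stricter re-screening
        if all(iou(boxes[i], boxes[j]) <= iou_threshold2 for j in keep):
            keep.append(i)
    return keep
```

Because stage 1 already orders boxes by confidence, the first box kept in stage 2 is automatically the highest-confidence detection.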
Preferably, the confidence threshold score threshold has a value of 0.35.
Preferably, the IOU threshold iou_threshold1 and the nIOU threshold iou_threshold2 have values of 0.45 and 0.35, respectively.
Step eight, dish classification
Record images of the restaurant serving window in real time with a camera, input the video into the stored model for classification and detection, save the dish category and position information from the model output, and assist the serving robot in completing the serving operation.
The invention has the following beneficial effects:
the improved feature extraction network can adaptively learn the interrelationship among different channels of the feature maps, pay attention to different degrees, and transmit the global feature information of each feature map downwards to reduce the loss of information; the method can better adapt to the continuous geometric deformation of dishes in the dish serving process, and improves the identification accuracy of the network; improved NMS processing can better suppress redundant prediction blocks.
Drawings
FIG. 1 shows dish images collected in the embodiment;
FIG. 2 shows the real-frame distribution in the embodiment's data set;
FIG. 3 shows the real-frame counts of the dish categories in the embodiment;
FIG. 4 is a comparison before and after transformation with the pix2pix method in the embodiment;
FIG. 5 is the feature extraction and classification model constructed in the embodiment;
FIG. 6(a) is a diagram of standard convolution sampling, and FIG. 6(b) a diagram of deformable convolution sampling;
FIG. 7 is the loss curve of training optimization in the embodiment;
FIG. 8 shows the classification results obtained in the embodiment.
Detailed Description
The invention is further explained below with reference to the drawings;
step one, data acquisition
6219 images of college restaurant dishes are collected, as shown in FIG. 1.
Step two, image preprocessing
After the image data collected in step one are analyzed, dish categories appearing too few times are removed, and 37 dish categories meeting the conditions are retained as detection target categories. A small amount of Gaussian noise is added to the retained dish images, improving the inference and generalization capability of the network, to form the classification data set.
Step three, data set division
A data set in Pascal VOC format is made from the images enhanced in step two; the dish categories and bounding boxes in the images are labeled, called real categories and real frames, and the labeled data set is randomly divided into training, verification and test sets at a ratio of 45:5:6. The training set contains 4975 images, the verification set 552 images and the test set 692 images.
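A sketch of the random 45:5:6 split, assuming shuffling with a fixed seed; because of rounding, the part sizes only approximate the embodiment's 4975/552/692 counts:

```python
import random

def split_dataset(items, ratio=(45, 5, 6), seed=0):
    """Randomly split items into train/val/test parts by the given ratio.

    The seed and the rounding scheme are illustrative choices; the last
    part absorbs any rounding remainder so the three parts partition items.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratio)
    n_train = round(len(items) * ratio[0] / total)
    n_val = round(len(items) * ratio[1] / total)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```

Splitting by shuffled indices (rather than file order) avoids grouping images captured in the same session into a single part.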
Step four, analyzing and processing the data set
Visual analysis of the labeled data set counts the real frames of each category in the training, verification and test sets: the training set contains 23055 real frames and the verification set 3017; each image contains at most 8 and at least 1 real frames, averaging 4-5, as shown in FIG. 2. The real-frame counts of some dish categories are shown in FIG. 3, and data expansion is performed for the images of bean buns, green soybean and wax gourd strips, steamed meat with eggs, fried pumpkin and brined shrimp: real-frame regions are randomly cropped, or the whole image is selected, from these dish images; the selected images are transformed using pix2pix image style transfer, and the transformed results are added to the data set to complete the expansion, as shown in FIG. 4.
Analysis of the real-frame size proportions shows that the real frames occupy about 12%-30% of the original image area. By the definition of relative size, a target real frame is called a small target when its length and width are less than 10% of those of the original image, i.e., when its area is less than 1% of the original image. The analysis shows that medium proportions account for the majority and large proportions come second, i.e., the detection task is dominated by medium and large targets.
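The relative-size rule can be sketched as a small classifier; the 10% small-target cut follows the definition above, while the medium/large boundary at 30% of image area is an assumed illustrative cut:

```python
def target_scale(box_w, box_h, img_w, img_h):
    """Classify a ground-truth box as small/medium/large by relative size.

    Small: both width and height under 10% of the image's (area under 1%),
    per the definition in the text. The 30%-of-area medium/large boundary
    is an assumption for illustration.
    """
    if box_w < 0.1 * img_w and box_h < 0.1 * img_h:
        return "small"
    if box_w * box_h < 0.30 * img_w * img_h:
        return "medium"
    return "large"
```

Counting these labels over a data set gives the scale distribution used to decide which prediction branches matter most.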
Step five, constructing a feature extraction and classification model
The feature extraction part of the YOLO v3 network is optimized and improved to construct the neural network structure for feature extraction and classification, as shown in FIG. 5. The SE module is embedded into the 5 residual modules of Darknet53 in the YOLO v3 network; it mainly comprises the two operations Squeeze and Excitation, which can model the correlation among feature channels and strengthen important features. The Squeeze operation uses Global Average Pooling to encode the entire spatial feature on one channel into a global feature. The Excitation operation captures the relationship between channels with a sigmoid-form gating mechanism using two fully connected layers: the first fully connected layer reduces the dimensionality and is followed by ReLU activation, the second restores the dimensionality and is followed by sigmoid activation, and finally the learned activation value of each channel is multiplied with the original features. The entire SE operation can be viewed as learning the weight coefficient of each channel of the features. The feature extraction part of the YOLO v3 network is Darknet53, which contains 5 residual blocks, each composed of a 2× downsampling convolutional layer CBL and a group of repeated residual units, where a CBL consists of a 2D convolution, Batch Normalization and Leaky ReLU. The method embeds the SE module into the residual unit to form a new residual unit.
The structure of the new residual unit is: an input feature map x passes through two CBLs of (1 × 1, 1) and (3 × 3, 1) to give an output feature map f; f is passed through an SE module consisting of Global Pooling, FC + ReLU and FC + Sigmoid, the resulting channel weights are multiplied with f, and the weighted result is superposed with the input feature map x. The repetition counts of the residual units in the 5 residual blocks are 1, 2, 8, 8 and 4 in turn; the output of the last residual block is connected to a layer of deformable convolution DC to form the feature extraction part of the network.
For a standard 3 × 3 convolution, 9 positions are sampled from the input features to calculate the output of each point; these 9 positions lie on a 3 × 3 grid obtained by spreading from the current point p0 to all sides, as shown in FIG. 6(a). The output y(p0) corresponding to the current point p0 is then:

y(p0) = Σ_{pn∈R} w(pn) · x(p0 + pn)

where R = {(-1,-1), (0,-1), (1,-1), (-1,0), (0,0), (1,0), (-1,1), (0,1), (1,1)}, p0 + pn corresponds to the 9 positions sampled from the input features centered on p0, w(pn) is a convolution parameter and x(p0 + pn) is the feature map input.
The deformable convolution DCN adds a learnable offset parameter Δpn on the basis of the standard convolution operation. The offset parameter Δpn is obtained by convolution over the input features: for an input feature layer of size w × h × c, where w and h denote width and height and c the number of channels, a 3 × 3 convolution with 2c output channels is applied, producing an offset field of size w × h × 2c whose first c channels are the offsets in the x direction and last c channels the offsets in the y direction.
The output y(p0) corresponding to the current point p0 is calculated according to equation (5):

y(p0) = Σ_{pn∈R} w(pn) · x(p0 + pn + Δpn)    (5)

where p0 + pn + Δpn are the 9 positions after the offset; as shown in FIG. 6(b), the 9 points with the offset parameter added no longer form a rectangular grid. Taking each point on the input feature map in turn as the current point p0 and calculating its output y(p0) completes the feature extraction.
The prediction part draws on the idea of FPN: it fuses the outputs of the 3rd and 4th residual blocks of the feature extraction network with the output of the last DC layer to generate feature maps at three scales, 52 × 52, 26 × 26 and 13 × 13, and on each predicts the adjustment parameters of the target bounding box center coordinates (x, y), the (w, h) adjustment parameters of the prior box, the category confidence and the prediction probability. The 13 × 13 feature map is used to predict large targets, the 26 × 26 feature map medium targets, and the 52 × 52 feature map small targets.
Step six, training and optimizing classification models
First, a model fully pre-trained on the COCO data set is used for transfer training: the training set is input to train the model, the verification set is used for cross-verification after each iteration, and the training loss is computed. A WarmUp strategy is used in the initial stage to stabilize training; when the training loss flattens, a cosine decay strategy is adopted, making the learning rate smoother and providing a periodically varying learning rate that lets the network jump out of local optima. The loss comprises the loss of the center point coordinates (x, y), the (w, h) loss of the anchor, the confidence loss and the class prediction loss. Since missing and erroneous labels can also occur while making the dish data set, label smoothing is used as a regularization strategy; it reduces the network's trust in label confidences, adapts well to data with missing or erroneous labels, and reduces the influence of label errors on classification accuracy.
the whole network is divided into a feature extraction network and three YOLO prediction branch layers, after the feature extraction network is frozen, two of the three YOLO prediction layers are respectively frozen, and the remaining YOLO prediction layer is subjected to fine adjustment. The loss curve of the training process is shown in fig. 7.
After 490 iterations of training, the training optimization of the model is completed and the model parameters are saved.
Step seven, model test and result analysis
The test set is input into the classification model optimized in step six, and the improved NMS method is used to address problems such as low classification confidence and redundant duplicate candidate frames in the output. The confidence threshold score_threshold is set to 0.35, the maximum number of frames selected by non-maximum suppression max_output_size to 12, the IOU threshold iou_threshold1 to 0.45 and the nIOU threshold iou_threshold2 to 0.35.
First, detection frames with confidence below score_threshold are rejected; then the frames are screened a first time by the NMS method, removing frames whose IOU value exceeds iou_threshold1. When the number of retained frames reaches 12, all remaining frames are removed and the subsequent screening begins. The IOU value is calculated as:

IOU = Area(A ∩ B) / Area(A ∪ B)

where A denotes a remaining detection frame and B denotes a detection frame that has been retained.
The frames surviving the first screening are screened again: the frame with the highest confidence is taken as the first retained frame, and the nIOU values between each remaining frame and the retained frames are computed in turn; a remaining frame is rejected when its nIOU value exceeds iou_threshold2 and retained otherwise.
The final recognition result is output as shown in fig. 8.
The recognition accuracy of the images in the test set is calculated according to the mean Average Precision (mAP) method of the VOC standard; the recognition accuracy of this embodiment is 91.16%.
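The mAP figure above is the mean over classes of per-class average precision. One common form of the VOC evaluation, sketched here under the assumption that the VOC2007 11-point interpolation is used (the text does not say which VOC variant applies), given precision/recall arrays already computed for one class:

```python
def voc_11point_ap(recalls, precisions):
    """VOC2007-style 11-point interpolated average precision for one class."""
    ap = 0.0
    for t in [i / 10.0 for i in range(11)]:  # recall levels 0.0, 0.1, ..., 1.0
        # Highest precision achieved at any recall >= t (0 if none).
        p = max((p for r, p in zip(recalls, precisions) if r >= t), default=0.0)
        ap += p / 11.0
    return ap
```

The reported mAP is then the mean of this value over all dish categories.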
The foregoing detailed description is intended to illustrate and not to limit the invention; any changes and modifications that fall within the true spirit and scope of the invention are intended to be covered by the appended claims.
Claims (8)
1. A dish identification method based on improved YOLO v3, characterized in that the method comprises the following steps:
step one, data acquisition;
shooting, with a camera, images of the dish basins containing different dishes at the dish serving window of a restaurant;
step two, preprocessing an image;
analyzing and processing the images acquired in step one and performing data enhancement;
step three, dividing a data set;
making a data set from the images enhanced in step two, labeling the dish category and the bounding box in each image, calling the labeled dish category and bounding box the real category and the real box, and dividing the labeled data set into a training set, a verification set and a test set;
step four, analyzing and processing the data set;
performing visual analysis on the labeled data set, counting the number of real boxes of each category in the training, verification and test sets respectively, and performing data expansion on the dish images of any category whose number of real boxes is less than half of the average number of real boxes over all categories;
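The half-of-mean expansion rule in step four can be expressed directly. A minimal sketch, with hypothetical per-class box counts, returning the categories that would need expansion:

```python
def classes_needing_expansion(box_counts):
    """Return the classes whose real-box count is below half the mean."""
    mean = sum(box_counts.values()) / len(box_counts)
    return sorted(c for c, n in box_counts.items() if n < mean / 2)

# Hypothetical per-class real-box counts (not from the patent).
counts = {"tofu": 320, "rice": 500, "greens": 90, "fish": 410}
```

With these counts the mean is 330, so only "greens" (90 < 165) would be flagged for expansion.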
step five, constructing a feature extraction and classification model;
embedding an SE module into the residual units of the 5 residual blocks of the Darknet53 network in the feature extraction part of the YOLO v3 network to form a new residual unit; the structure of the new residual unit is: an input feature map x passes through two convolutional layers CBL, (1×1, stride 1) and (3×3, stride 1), to produce an output feature map f; the output feature map f passes through an SE module consisting of Global Pooling, FC+ReLU and FC+Sigmoid, the resulting channel weights are multiplied with the output feature map f, and the weighted result is added to the input feature map x; the numbers of repetitions of the residual units in the 5 residual blocks are 1, 2, 8, 8 and 4 in sequence; the output of the last residual block is connected to a layer of deformable convolution DC to form the feature extraction part of the network;
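A numpy stand-in for the SE chain inside the new residual unit (Global Pooling → FC+ReLU → FC+Sigmoid → channel-wise multiplication); the weight matrices `w1` and `w2` are placeholders for learned parameters, and the residual addition with the input x happens outside this function:

```python
import numpy as np

def se_reweight(x, w1, w2):
    """Squeeze-and-Excitation reweighting of a (C, H, W) feature map.
    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights."""
    squeeze = x.mean(axis=(1, 2))                  # Global Pooling -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)         # FC + ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # FC + Sigmoid -> (C,)
    return x * scale[:, None, None]                # channel-wise multiply
```

In the unit described above, the reweighted map is then added to the unit's input x to form the residual output.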
the prediction part fuses the outputs of the 3rd and 4th residual blocks of the feature extraction network with the output of the deformable convolution DC to generate feature maps at three scales, 52×52, 26×26 and 13×13, and predicts on each of the three feature maps the adjustment parameters of the target bounding-box center coordinates (x, y), the adjustment parameters of the prior-box width and height (w, h), the category confidence and the prediction probability;
step six, training and optimizing a classification model;
performing pre-training migration with a model fully trained on a large data set, inputting the training set to train the model, cross-verifying the model with the verification set after each iteration and computing the training loss; adopting a learning-rate adjustment strategy and label smoothing, then fine-tuning the network layer by layer; when the set number of iterations is reached, completing the training optimization of the model and saving the model parameters;
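The learning-rate strategy of step six (WarmUp followed by cosine decay, detailed in claim 5) can be sketched as a schedule function; `base_lr` and `warmup_steps` are illustrative values not given in the text, while 490 total iterations matches the embodiment:

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-3, warmup_steps=50):
    """WarmUp phase, then cosine decay of the learning rate."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The rate rises linearly during warm-up, then decays smoothly toward zero, which is one common way to realize the "smoother, periodically changing" learning rate described.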
step seven, model testing and result analysis;
inputting the test set into the classification model optimized in step six, and analyzing the output with an improved NMS method that addresses problems such as low classification confidence and redundant, repeated candidate boxes, the method comprising the following steps:
S7.1, setting the confidence threshold score_threshold, the IOU threshold iou_threshold1 and the nIOU threshold iou_threshold2 to values between 0.3 and 0.5, and setting the maximum number of boxes selected by non-maximum suppression, max_output_size, according to the number of real boxes per image counted in the analysis of step four;
S7.2, rejecting the detection boxes with confidence below score_threshold;
S7.3, applying the NMS method to the detection boxes screened in step S7.2, rejecting detection boxes whose IOU value exceeds iou_threshold1; when the number of retained detection boxes reaches the set value of max_output_size, rejecting all remaining detection boxes and proceeding to step S7.4, the IOU value being calculated as:

IOU = (A ∩ B) / (A ∪ B)

wherein A denotes a remaining detection box and B denotes a detection box that has already been retained;
S7.4, applying the NMS method to the detection boxes retained in step S7.3, and rejecting detection boxes whose nIOU value exceeds iou_threshold2, the nIOU value being calculated by the following formula:
S7.5, finally outputting the coordinates and confidences of the detection boxes retained in step S7.4;
step eight, dish classification;
recording video of the restaurant dish serving window in real time with a camera, inputting the video into the saved model for classification and detection, saving the dish category and position information from the model output, and assisting the serving robot in completing the serving of dishes.
2. The dish identification method based on improved YOLO v3 of claim 1, wherein: the data enhancement method in step two is to randomly add a small amount of Gaussian noise to the images.
3. The dish identification method based on improved YOLO v3 of claim 1, wherein: the data expansion method in step four is to randomly crop the real-box regions of the dish images to be expanded or select the whole image, transform the cropped or selected images using pix2pix image style migration, and add the transformed results to the data set to complete the data expansion.
4. The dish identification method based on improved YOLO v3 of claim 1, wherein: the deformable convolution DC adds a learnable offset parameter Δp_n to the standard convolution operation, the offset parameter Δp_n being obtained by a convolution over the input feature layer, and the output y(p_0) corresponding to the current point p_0 is calculated according to formula (1):

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)    (1)

wherein R = {(-1,-1), (0,-1), (1,-1), (-1,0), (0,0), (1,0), (-1,1), (0,1), (1,1)}, w(p_n) is the convolution parameter, x(p_0+p_n) is the feature-map input, p_0+p_n corresponds to the 9 positions of the 3×3 grid spreading out from the center p_0 on the input feature map, and p_0+p_n+Δp_n corresponds to the 9 shifted positions; each point of the input feature map is taken in turn as the current point p_0 and its output y(p_0) is calculated, completing the feature extraction.
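A single-output-point sketch of formula (1) above, assuming a 3×3 kernel and bilinear interpolation for the fractional sampling positions produced by the offsets (bilinear sampling is the standard choice in deformable convolution, though the claim does not spell it out):

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly sample x (H, W) at a fractional position; zero outside."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for yy, wy in ((y0, 1 - (py - y0)), (y0 + 1, py - y0)):
        for xx, wx in ((x0, 1 - (px - x0)), (x0 + 1, px - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy * wx * x[yy, xx]
    return val

def deformable_point(x, weights, offsets, p0):
    """Formula (1): y(p0) = sum over the 3x3 grid R of w(pn)*x(p0+pn+dpn)."""
    R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return sum(w * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)
               for (dy, dx), w, (oy, ox) in zip(R, weights, offsets))
```

With all offsets zero this reduces to a standard 3×3 convolution at p_0; learned non-zero offsets let the kernel sample off-grid positions.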
5. The dish identification method based on improved YOLO v3 of claim 1, wherein: the learning-rate adjustment strategy in step six uses a WarmUp strategy at the initial stage of model optimization and a cosine decay strategy after the training loss has flattened, making the learning rate smoother; the periodically changing learning rate helps the network escape local optima; the loss comprises the loss of the center-point coordinates (x, y), the (w, h) loss of the anchor, the confidence loss and the class prediction loss.
6. The dish identification method based on improved YOLO v3 of claim 1, wherein: fine-tuning the network layer by layer in step six means that the whole network is divided into a feature extraction network and three YOLO prediction branch layers; after the feature extraction network is frozen, two of the three YOLO prediction layers are frozen in turn and the remaining YOLO prediction layer is fine-tuned.
7. The dish identification method based on improved YOLO v3 of claim 1, wherein: the number of iterations set in step six is less than 500.
8. The dish identification method based on improved YOLO v3 of claim 1, wherein: the confidence threshold score_threshold, the IOU threshold iou_threshold1 and the nIOU threshold iou_threshold2 in step seven are set to 0.35, 0.45 and 0.35, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011430608.XA CN112560918B (en) | 2020-12-07 | 2020-12-07 | Dish identification method based on improved YOLO v3 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112560918A true CN112560918A (en) | 2021-03-26 |
CN112560918B CN112560918B (en) | 2024-02-06 |
Family
ID=75059897
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011430608.XA Active CN112560918B (en) | 2020-12-07 | 2020-12-07 | Dish identification method based on improved YOLO v3 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112560918B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325084A (en) * | 2019-08-29 | 2020-06-23 | 西安铱食云餐饮管理有限公司 | Dish information identification method and terminal based on YOLO neural network |
CN111401148A (en) * | 2020-02-27 | 2020-07-10 | 江苏大学 | Road multi-target detection method based on improved multilevel YO L Ov3 |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113033706A (en) * | 2021-04-23 | 2021-06-25 | 广西师范大学 | Multi-source two-stage dish identification method based on visual target detection and re-identification |
CN113435337A (en) * | 2021-06-28 | 2021-09-24 | 中国电信集团系统集成有限责任公司 | Video target detection method and device based on deformable convolution and attention mechanism |
CN113591575A (en) * | 2021-06-29 | 2021-11-02 | 北京航天自动控制研究所 | Target detection method based on improved YOLO v3 network |
CN113269161A (en) * | 2021-07-16 | 2021-08-17 | 四川九通智路科技有限公司 | Traffic signboard detection method based on deep learning |
CN113977609A (en) * | 2021-11-29 | 2022-01-28 | 杭州电子科技大学 | Automatic dish-serving system based on double-arm mobile robot and control method thereof |
CN114310872A (en) * | 2021-11-29 | 2022-04-12 | 杭州电子科技大学 | Mechanical arm automatic dish-serving method based on DGG point cloud segmentation network |
CN113977609B (en) * | 2021-11-29 | 2022-12-23 | 杭州电子科技大学 | Automatic dish serving system based on double-arm mobile robot and control method thereof |
CN114310872B (en) * | 2021-11-29 | 2023-08-22 | 杭州电子科技大学 | Automatic vegetable-beating method for mechanical arm based on DGG point cloud segmentation network |
Also Published As
Publication number | Publication date |
---|---|
CN112560918B (en) | 2024-02-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||