CN111639677A - Garbage image classification method based on multi-branch channel capacity expansion network - Google Patents


Info

Publication number
CN111639677A
CN111639677A (application CN202010379289.8A); granted as CN111639677B
Authority
CN
China
Prior art keywords
layer
output
depth
network
channel capacity
Prior art date
Legal status
Granted
Application number
CN202010379289.8A
Other languages
Chinese (zh)
Other versions
CN111639677B (en)
Inventor
石翠萍
夏瑞阳
刘超
刘文礼
Current Assignee
Qiqihar University
Original Assignee
Qiqihar University
Priority date
Filing date
Publication date
Application filed by Qiqihar University
Priority to CN202010379289.8A
Publication of CN111639677A
Application granted
Publication of CN111639677B
Legal status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/10: Services
    • G06Q 50/26: Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a garbage image classification method based on a multi-branch channel capacity expansion network. The invention aims to solve the problem that existing methods classify garbage images with low accuracy. The process is as follows: first, a multi-branch channel capacity expansion network model is established; second, the model is trained with a dataset to obtain a pre-trained multi-branch channel capacity expansion network model; third, the accuracy of the pre-trained model is verified with a dataset: if the accuracy meets the requirement, the trained multi-branch channel capacity expansion network model is obtained; if not, training continues with the dataset until the accuracy meets the requirement; fourth, the trained multi-branch channel capacity expansion network model is used to classify the garbage images to be recognized. The invention is used in the field of image classification.

Description

Garbage image classification method based on multi-branch channel capacity expansion network
Technical Field
The invention relates to a garbage image classification method based on a multi-branch channel capacity expansion network.
Background
With the development of human society, the problem of environmental pollution has gradually intensified[1](Rajamanickam R, Nagan S. Assessment of Comprehensive Environmental Pollution Index of Kurichi Industrial Cluster, Coimbatore District, Tamil Nadu, India – a Case Study[J]. Journal of Ecological Engineering, 2018, 19(1): 191-199.), and environmental pollution does great harm to the earth and all of its organisms[2](Lee S, Meier H H, Lee P J. Tackling environmental pollution in Seoul, South Korea through tax incentives and related strategies[J]. Innovation: Addressing Change, 2018, 20: 127.). A large portion of this pollution is caused by household waste, and the decomposition of some household waste can increase the concentration of harmful chemicals[3](AnvariFar H, Amirkolaie A K, Jalali A M, et al. Environmental pollution and toxic substances: Cellular apoptosis as a key parameter in a sensible model like fish[J]. Aquatic Toxicology, 2018, 204: 144-.). Other household waste is poorly biodegradable; plastics, for example, are ubiquitous pollutants in all marine environments throughout the world[4](Alimba C G, Faggio C. Microplastics in the marine environment: Current trends in environmental pollution and mechanisms of toxicological profile[J]. Environmental Toxicology and Pharmacology, 2019, 68: 61-74.). Thus, the first step in solving the problem of trash contamination is to sort the trash by its nature, and many countries mandate classified dumping of garbage[5](Huber J, Viscusi W K, Bell J. Dynamic relationships between social norms and pro-environmental behavior: evidence from household recycling[J]. Behavioural Public Policy, 2018: 1-25.). However, for people who lack specialized knowledge, it is difficult to accurately recognize the various types of domestic waste.
To address this problem, an intelligent garbage classification system can be created and applied in a smart garbage can or a smartphone to guide people to dispose of household garbage correctly. The present difficulty is that such intelligent garbage classification systems cannot classify garbage images accurately.
In the last decade, deep learning has entered a period of high-speed development driven by increases in computing power and improvements in its theoretical system[6](Seo J, Park H. Object Recognition in Very Low Resolution Images Using Deep Collaborative Learning[J]. IEEE Access, 2019: 134071-.). Deep learning now permeates all aspects of the computer vision field and has obtained exciting results in image classification, target detection, and image semantic segmentation tasks[7](Liu J, Pan Y, Li M, et al. Applications of deep learning to MRI images: A survey[J]. Big Data Mining and Analytics, 2018, 1(1): 1-18.). Compared with other machine learning algorithms, the main advantage of deep learning is its strong modeling capability[8](J. Enguehard, P. O'Halloran and A. Gholipour, "Semi-Supervised Learning With Deep Embedded Clustering for Image Classification and Segmentation," in IEEE Access, vol. 7, pp. 11093-11104, 2019.), and its end-to-end learning approach spares people from heavy manual feature engineering[9](Cheng G, Yang C, Yao X, et al. When Deep Learning Meets Metric Learning: Remote Sensing Image Scene Classification via Learning Discriminative CNNs[J]. IEEE Transactions on Geoscience & Remote Sensing, 2018: 1-11.). Of course, deep learning also has limitations, chief among them its dependence on data[10](Yan M. Adaptive Learning Knowledge Networks for Few-Shot Learning[J]. IEEE Access, 2019, 7: 119041-.). Garbage image classification is a relatively new field, and no standardized dataset for network learning existed until 2016, when Mindy Yang and Gary Thung manually established the garbage classification dataset TrashNet[11](Yang M, Thung G. Classification of trash for recyclability status[J]. CS229 Project Report, 2016.). Since then, deep learning work in the field of garbage image classification has gradually increased, and the TrashNet dataset has been widely applied.
However, the data set has a small number of images and insufficient feature information, and the work performed on the data set does not achieve good results.
In garbage image classification work based on deep learning, people increasingly favor deeper neural networks. In fact, almost the whole deep learning image processing field shows this trend[12](Han J, Zhang D, Cheng G, et al. Advanced Deep-Learning Techniques for Salient and Category-Specific Object Detection: A Survey[J]. IEEE Signal Processing Magazine, 2018, 35(1): 84-100.). Generally speaking, deepening a network makes the work of each layer simpler, thereby obtaining better nonlinear expression capability[13](Peng J, Kang S, Ning Z, et al. Residual convolutional neural network for predicting response of transarterial chemoembolization in hepatocellular carcinoma from CT imaging[J]. European Radiology, 2019(5): 1-12.) and a better fit to complex features. However, the benefit of deepening a network is not absolute; deeper neural networks usually obtain better results in target detection, image semantic segmentation, and complex scene classification[14](Wang H, Dai L, Cai Y, et al. Salient object detection based on multi-scale contrast[J]. Neural Networks, 2018, 101.). For the small, single-background TrashNet dataset, the difficulty lies in the small total amount of feature information, the small number of data samples, and the large similarity among classes; under these conditions, simply increasing the network depth can hardly achieve the expected effect. Therefore, a more effective method should be explored that considers the specificity of garbage classification.
Disclosure of Invention
The invention aims to solve the problem that existing methods classify garbage images with low accuracy, and provides a garbage image classification method based on a multi-branch channel capacity expansion network.
A garbage image classification method based on a multi-branch channel capacity expansion network comprises the following specific processes:
step one, establishing a multi-branch channel capacity-expanding network model;
training a multi-branch channel capacity expansion network model by adopting a TrashNet data set to obtain a pre-trained multi-branch channel capacity expansion network model;
the TrashNet data set comprises 2527 RGB images, wherein 501 images of glass, 594 images of paper, 403 images of paperboard, 482 images of plastic, 410 images of metal and 137 images of garbage;
the background of all images is white, and the size of all images is 512 × 384 pixels;
thirdly, verifying the accuracy of the pre-trained multi-branch channel capacity expansion network model by using a TrashNet data set, obtaining the trained multi-branch channel capacity expansion network model if the accuracy meets the requirement, and continuing to train the multi-branch channel capacity expansion network model by using the TrashNet data set until the accuracy meets the requirement if the accuracy does not meet the requirement;
and step four, classifying the garbage images to be recognized by adopting the trained multi-branch channel capacity expansion network model.
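Steps two and three together describe a train-validate-repeat loop. The following is a minimal sketch of that control flow, with hypothetical `train_fn` and `eval_fn` callables standing in for the actual training and validation procedures, which the patent does not specify at this level:

```python
def train_until_accurate(model, train_fn, eval_fn, target_acc, max_rounds=10):
    """Steps two and three as a loop: train the model, verify its
    validation accuracy, and keep training until the accuracy
    requirement is met (or a round budget runs out)."""
    acc = 0.0
    for rounds in range(1, max_rounds + 1):
        train_fn(model)        # one training pass (step two)
        acc = eval_fn(model)   # validation accuracy check (step three)
        if acc >= target_acc:
            break
    return model, acc, rounds
```

Step four then simply applies the returned model to the garbage images to be recognized.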
The invention has the beneficial effects that:
deep learning is not well applied in the task of garbage image classification due to insufficient data available for training. The widely used garbage classification data set TrashNet is selected, and tests show that the widely accepted method for deepening network and short-circuit connection in the deep learning field does not work on the TrashNet data set. Therefore, the invention provides a network improvement method based on multi-path characteristic information fusion aiming at the characteristics of few image samples, lack of characteristic information and large similarity among classes of the data set, and the characteristic information can be more fully utilized under the condition of less extra calculation amount. The method replaces the core structure of the fine-tuning Xception network, and obtains better classification performance. Finally, the network provided by the invention achieves 94.34% accuracy on the TrashNet data set, and has certain advantages on multiple indexes compared with some newer methods.
Aiming at the TrashNet dataset's characteristics of few samples and limited feature information, the invention provides a novel model optimization method based on a multi-branch channel capacity expansion network, which trades a small amount of computation for efficient utilization of the feature information. Where conventional network improvement methods fail to work, this method greatly improves network performance. Compared with the fine-tuned Xception network, the proposed learning network has higher precision and better anti-interference capability; compared with several newer works, it has higher classification precision, more balanced prediction capability, and stronger practical application value.
In the near future, end-to-end learning systems will continue to be studied to quantitatively analyze the impact of minor adjustments of the network structure on classification performance. In addition, the network size will be reduced while maintaining high accuracy. Finally, the network will be ported to mobile phones.
The invention analyzes the characteristics of the TrashNet data set, provides the reason that a deeper neural network is not suitable for the TrashNet data set, and proves that similar garbage is more difficult to distinguish by simply deepening the neural network through experiments.
The invention provides a network improvement method that expands branches for specific network layers. The method widens the core structure of the network so that it extracts feature information more fully, which benefits classification work on the TrashNet dataset.
The invention uses this network optimization method to partially replace the fine-tuned Xception network and obtains better classification results. Compared with other related newer methods, the proposed learning network has great advantages in overall accuracy, single-class identification accuracy, and F1-score. Its better classification performance and more balanced single-class prediction capability make it usable for actual garbage classification.
Drawings
Fig. 1 is a diagram of a network architecture according to the present invention;
FIG. 2a is a glass garbage image of the TrashNet dataset;
FIG. 2b is a paper garbage image of the TrashNet dataset;
FIG. 2c is a cardboard garbage image of the TrashNet dataset;
FIG. 2d is a plastic waste image of the TrashNet dataset;
FIG. 2e is a metallic garbage image of the TrashNet dataset;
FIG. 2f is a trash (other garbage) image of the TrashNet dataset;
FIG. 3 is a diagram of the fine-tuned Xception network architecture;
FIG. 4 is a diagram of the core structure of the fine-tuned Xception network;
FIG. 5 is a diagram of the output features of each residual connection layer of the fine-tuned Xception network;
FIG. 6 is a diagram of a conventional residual concatenation method;
FIG. 7 is a diagram of a method for expanding a channel according to the present invention;
FIG. 8a is a graph of a comparison of fine-tuned Xception network training and validation loss;
FIG. 8b is a graph comparing loss of training and validation for the network proposed by the present invention;
FIG. 8c is a comparison graph of the fine-tuned Xception network training and verification accuracy;
FIG. 8d is a comparison graph of network training and verification accuracy proposed by the present invention;
FIG. 9 is a graph of occlusion test results;
FIG. 10 is a graph of accuracy versus number of different core channels;
FIG. 11a is a comparison graph of confusion matrices before and after optimization of 728 core channel numbers;
FIG. 11b is a comparison graph of confusion matrices before and after 896 core channel number optimization;
FIG. 12 is a graph comparing a plurality of other newer jobs;
FIG. 13 is a graph showing the comparison result of F1-score.
Detailed Description
The first embodiment is as follows: the specific process of the garbage image classification method based on the multi-branch channel capacity expansion network in the embodiment is as follows:
During the past few years, researchers have done a great deal of work on garbage image classification, which can be divided into two categories: one based on traditional machine learning methods, the other based on end-to-end learning systems.
Conventional machine learning methods:
support Vector Machines (SVM). The SVM is a strong classification algorithm developed from the statistical learning theory of Vapnik[15](Yu S,Li X,Zhang X,et al.The OCS-SVM:An Objective-Cost-Sensitive SVMWith Sample-Based Misclassification Cost Invariance[J]IEEE Access,2019,7: 118931-and 118942.) that processes machine learning tasks based on optimization theory[16](Wu X,Zuo W,Lin L,etal.F-SVM:Combination offeature transformation and SVM learning via convexrelaxation[J]IEEE transactions on neural networks and learning systems,2018,29(11): 5185-. For the multi-classification problem, multiple SVM classifiers can be combined, or a one-time solution method can be used, namely, parameters of all classes are optimized through a formula. The SVM has corresponding processing schemes for linear divisible problems and linear inseparable problems[17](L.Lan, Z.Wang, S.ZHE, W.Cheng, J.Wang and K.Zhang, "Scaling Up Kernel SVM on Limited Resources: A Low-Rank Linear Approach," in IEEE Transactions on Neural Networks and learning Systems, vol.30, No.2, pp.369-378, Feb.2019.). In 2016, Mindy Yang et al worked on the TrashNet dataset using a support vector machine, and finally the accuracy of 6 classes was 63%[11]
Nearest neighbor classification (KNN). KNN is one of the simplest algorithms for data classification[18](S. Zhang, X. Li, M. Zong, X. Zhu and R. Wang, "Efficient kNN Classification With Different Numbers of Nearest Neighbors," in IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1774-1785, May 2018.). The KNN algorithm has no training phase; the class of a new sample is determined by the class of its nearest sample or samples[19](Xing W, Bei Y. Medical Health Big Data Classification Based on KNN Classification Algorithm[J]. IEEE Access, 2019, 8: 28808-.). KNN has been widely used in image classification because of its efficiency and simple implementation[20](Tu B, Wang J, Kang X, et al. KNN-based representation of superpixels for hyperspectral image classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2018, 11(11): 4032-4047.). In 2018, Bernardo S. Costa et al. used the KNN algorithm to classify garbage images on the TrashNet dataset with a final accuracy of 88%[21](Costa B S, Bernardes A C S, Pereira J V A, et al. Artificial Intelligence in Automated Sorting in Trash Recycling[C]//Anais do XV Encontro Nacional de Inteligência Artificial e Computacional. SBC, 2018: 198-205.).
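As the text notes, KNN has no training phase: a query is labeled by the classes of its nearest stored samples. The following is a minimal NumPy sketch of the idea, illustrative only and not the implementation used in [21]:

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    samples under Euclidean distance; there is no training phase."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, votes = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(votes)]
```

For image classification the feature vectors would be flattened pixels or extracted descriptors; the voting mechanism is unchanged.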
Random Forest (RF). The RF algorithm first generates several different datasets by sampling, then trains a classification tree on each dataset, and every tree participates in the decision of the final prediction result[22](Javeed A, Zhou S, Yongjian L, et al. An Intelligent Learning System Based on Random Search Algorithm and Optimized Random Forest Model for Improved Heart Disease Detection[J]. IEEE Access, 2019, 7: 180235-180243.). The RF algorithm is robust on missing and unbalanced data and reliable on tasks with many variables[23](A. Dapogny, K. Bailly and S. Dubuisson, "Dynamic Pose-Robust Facial Expression Recognition by Multi-View Pairwise Conditional Random Forests," in IEEE Transactions on Affective Computing, vol. 10, no. 2, pp. 167-181, April-June 2019.), and it trains quickly[24](Kim S, Kwak S, Ko B C. Fast pedestrian detection in surveillance video based on soft target training of shallow random forest[J]. IEEE Access, 2019, 7: 12415-.). In 2018, Mandar Satvilkar used an RF algorithm to classify the garbage images of the TrashNet dataset and achieved 62.61% accuracy[25](Satvilkar M. Image Based Trash Classification using Machine Learning Algorithms for Recyclability Status[J]. Image, 2018, 13: 08.).
Extreme gradient boosting (XGBoost). The XGBoost algorithm is a supervised algorithm based on an improvement of GBDT (Gradient Boosting Decision Tree)[26-27]([26] Jiang Y, Tong G, Yin H, et al. A pedestrian detection method based on genetic algorithm for optimize XGBoost training parameters[J]. IEEE Access, 2019, 7: 118310-118321. [27] Gu X, Han Y, Yu J. A Novel Lane-Changing Decision Model for Autonomous Vehicles Based on Deep Autoencoder Network and XGBoost[J]. IEEE Access, 2020, 8: 9846-.). Its idea is to build a certain number of classification and regression trees so that the predicted value of the tree ensemble is as close to the true value as possible while generalizing as well as possible[28](Zhang W, Zhao X, Li Z. A Comprehensive Study of Smartphone-Based Indoor Activity Recognition via Xgboost[J]. IEEE Access, 2019, 7: 80027-.). The advantages of the XGBoost algorithm are that it is not prone to overfitting and that it can specify a default split direction for missing values. In 2018, Mandar Satvilkar used the XGBoost algorithm to classify the garbage images of the TrashNet dataset with an accuracy of 70.1%.
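The boosting idea above, fitting each new tree to the residual of the current ensemble, can be shown in miniature with depth-1 regression stumps on 1-D data. This is an illustrative sketch of gradient boosting for squared loss, not the XGBoost library itself:

```python
import numpy as np

def fit_stump(x, residual):
    """Best depth-1 regression tree on 1-D x for the given residual:
    a threshold plus one constant prediction per side."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        rv = right.mean() if right.size else 0.0
        err = ((residual - np.where(x <= t, left.mean(), rv)) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), rv)
    return best[1:]

def boost(x, y, n_trees=20, lr=0.5):
    """Gradient boosting for squared loss: each stump is fitted to
    the residual of the current ensemble prediction."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_trees):
        t, lv, rv = fit_stump(x, y - pred)
        pred += lr * np.where(x <= t, lv, rv)
    return pred
```

Each added stump shrinks the residual, so the ensemble prediction approaches the targets geometrically; XGBoost adds regularization and missing-value handling on top of this core loop.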
End-to-end learning system:
the traditional machine learning method has achieved better performance in the field of computer image processing due to long development time and rich theoretical system[29](R.Tao et al., "magnetic-Based Ischemic Heart disease Detection and Localization Using Machine Learning Methods," in IEEETransactions on biological Engineering, vol.66, No.6, pp.1658-1667, June 2019.). However, these methods are composed of several independent steps, and therefore require a large memory space for storing the intermediate results[30](He N,Fang L,Li S,et al.Skip-Connected Covariance Network for Remote SensingScene Classification[J]IEEE transactions on neural networks and learning systems, 2019), the actual operation is cumbersome. The advent of end-to-end learning systems solved this problem. Only need give training data and test data in advance, the end-to-end learning system will automatically calculate the error result between prediction and real data, use the back propagation method to update the weight, solve the gradient, use the gradient descent method to search the minimum value of the loss function[31](Hussein S,Kandel P,Bolan C W,et al.Lung and pancreatic tumorcharacterization in the deep learning era:novel supervised and unsupervisedlearning approaches[J]IEEE transactions on medial imaging,2019,38(8): 1777-1787), the whole convergence process is independently and continuously completed. In recent years, there have been many end-to-end learning based spam image classification efforts that all use the trashent or trimmed trashent datasets.
In early 2018, Kennedy Tom achieved 88.42% accuracy using the OscarNet network (fine-tuned from VGG19)[32](Kennedy T. OscarNet: using transfer learning to classify disposable waste[J]. CS230 Report, Stanford University, CA, Winter 2018.). In October 2018, Bernardo S. Costa et al. achieved 91% accuracy using a fine-tuned AlexNet network and 93% accuracy using a fine-tuned VGG16 network[33](Costa B S, Bernardes A C S, Pereira J V A, et al. Artificial Intelligence in Automated Sorting in Trash Recycling[C]//Anais do XV Encontro Nacional de Inteligência Artificial e Computacional. SBC, 2018: 198-205.); Stephenn L. Rabano et al. achieved 87.2% accuracy with a fine-tuned MobileNet network[34](Rabano S L, Cabatuan M K, Sybingco E, et al. Common Garbage Classification Using MobileNet[C]//2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM).). In June 2019, Victoria Ruiz et al. achieved 87.71% accuracy using a fine-tuned Inception network, 88.34% accuracy using a fine-tuned Inception-ResNet network, and 88.66% accuracy using a ResNet network[35](Ruiz V, Sánchez Á, Vélez J F, et al. Automatic Image-Based Waste Classification[C]. Springer, Cham, 2019: 422-.). They all used classical networks that perform excellently in large computer vision competitions, without considering the specificity of the TrashNet dataset. The invention fills this gap.
TrashNet dataset
The dataset used in the invention is TrashNet, produced by Mindy Yang and Gary Thung in 2016. It contains 2527 RGB images: 501 glass images, 594 paper images, 403 cardboard images, 482 plastic images, 410 metal images, and 137 trash images. The background of all images is white, and all were taken under sufficient illumination[36](Rabano, Stephenn L., et al. "Common Garbage Classification Using MobileNet." 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM).). The size of all images is 512 × 384 pixels. Figs. 2a, 2b, 2c, 2d, 2e, and 2f show some garbage images of the TrashNet dataset.
Unlike other classification datasets, each image in the TrashNet dataset contains only one object, which may make the task easier for the human eye but is different for a computer. A convolutional neural network has feature extraction capability far exceeding that of the human eye, and a trained model can capture details at every position of an image that are not easy for a person to find. However, for a TrashNet image containing only a single object, the number of features that can be extracted is small and the fault tolerance is poor: no other objects in the image can provide additional feature information, so when a sample object is slightly atypical the prediction can ultimately show a large deviation. This is one of the difficulties of the garbage classification task. Another difficulty is the high similarity between different classes of the TrashNet dataset. Glass and plastic, for example, are mostly bottle-shaped and transparent; the human eye can distinguish them only by observing the trademark on the bottle and drawing on life experience, which a convolutional neural network does not have. Therefore, even though the feature extraction capability of a deep learning network is strong, erroneous judgments may still be produced.
Deep networks are not suitable for this dataset
With increasing computing power and the gradual resolution of the gradient vanishing problem, people increasingly tend to use deeper neural networks for image classification. The benefit is evident: a deeper network means better nonlinear representation capability, enabling it to learn more complex transformations and fit more complex feature inputs. However, experiments show that deep networks are not conducive to image classification on the TrashNet dataset.
First, the invention fine-tunes the Xception network used to test the TrashNet dataset; the fine-tuned Xception network structure is shown in Fig. 3. The invention refers to the part of the fine-tuned Xception network that receives 14 × 14 × 728 feature maps as the core structure; due to limited space it is not expanded in Fig. 3, and each of its repeated units contains 3 depth separable convolutional layers. The depth separable convolution here uses SeparableConv2D under the Keras framework rather than the other depth separable convolution, DepthwiseConv2D; SeparableConv2D must be used with an activation layer[37](Chollet F. Xception: Deep learning with depthwise separable convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1251-1258.).
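The parameter saving of depthwise separable convolution over standard convolution can be checked arithmetically. The following is a back-of-envelope sketch (bias terms ignored; 728 is the core-structure channel width mentioned above):

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    """Weights in a depthwise separable convolution: one k x k filter
    per input channel, then a 1 x 1 pointwise mix across channels."""
    return c_in * k * k + c_in * c_out
```

For a 3 × 3 convolution at width 728, the standard form needs 728 × 728 × 9 = 4,769,856 weights while the separable form needs 728 × 9 + 728 × 728 = 536,536, roughly a 9× saving, which is why Xception builds its core structure from separable convolutions.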
Then, to examine the influence of each network layer on images of the TrashNet dataset, two images were selected from the dataset to run feature map tests on part of the network layers. For intuition, the core structure is shown in Fig. 4 and the output feature maps of the corresponding layers are shown in Fig. 5. The two selected input images are glass and plastic respectively; each feature map is the output of the whole network layer rather than of a single channel, since the total feature map better reflects the influence of a specific network layer on the image. It can be seen that for the plastic image, the feature maps of the deeper residual connection layers are more sensitive to body contour information than to the trademark, which makes the plastic easier to confuse with glass. Clearly, judging from the output feature maps of the network layers, increasing the network depth does not play a positive role in this classification work.
Finally, to quantitatively examine the influence of network depth on performance on this dataset, the invention prunes the fine-tuned Xception network, reducing the structure that is repeated 8 times in the Xception core structure to a single repetition, and uses the pruned network as a comparison network. The fine-tuned Xception network and the comparison network were trained under the same conditions and tested with the same test set; the results show that the precision of the comparison network is 0.44% higher than that of the fine-tuned Xception network.
The method proposed by the invention
A channel capacity expansion method is provided;
the specific method is that branches are expanded for partial layers of the linear network, and then the branches are connected together by add layers at the end of each branch. The schematic diagram is shown in fig. 6, where the number of neurons and the number of layers shown is very limited due to the limited space. Each circle represents a neuron, each column is a network layer, and the number of neurons in the network layer is the number of channels in the layer, which is also called the width of the layer network. It can be seen that after the third network layer, the network is expanded to a two-way structure, the number of channels in each way is 4, and the number of layers is 3, so that the number of network channels is expanded to 8, and the network is widened. The two branches are finally connected together through an add layer, which performs information superposition of the corresponding channels, which increases the amount of information under each feature, but does not increase the number of features describing the image. Therefore, it can be seen that although the add layer is connected with two network branches with the width of 4, the width of the connected network layer is still 4, which effectively controls the network parameters and makes the network lighter.
Channel capacity expansion method used by the invention
Aiming at the characteristics of the data set, the invention proposes a channel capacity expansion method different from the conventional residual connection. A conventional residual connection typically adds a short-circuit mechanism to a linear network, as shown in fig. 6. The idea is to identity-map information from the lower layers of the network into its higher layers, which can be understood computationally as feeding the input x into the output as an initial result, so that the output becomes H(x) = F(x) + x [38] (He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]). If F(x) = 0 in a deep layer of the network, then H(x) = x. The advantage is that the gradient can propagate without loss: even if the gradient of a network layer is small, it will not be zero after 1 is added, which better alleviates the gradient vanishing problem.
The add layer used by the invention realizes corresponding-channel fusion in the horizontal direction, i.e., the add layer superimposes the outputs of multiple branches, as shown in fig. 7. There are three methods for increasing network width: it is common to directly expand the number of channels or to use a concatenate layer for channel merging, and in fact an add layer can achieve a similar widening effect. The difference is that expanding the number of channels, i.e., directly increasing the number of filters, increases the number of features extracted by each convolutional layer; the concatenate layer stacks channels transversely, which increases the number of features describing the image but not the information under each feature; and the add layer superimposes the information of corresponding channels, which increases the amount of information under each feature but not the number of features describing the image. The add layer is therefore better suited to the TrashNet dataset, whose total amount of feature information is small.
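The distinction between add and concatenate can be illustrated with a minimal plain-Python sketch (feature maps reduced to hypothetical 1-D channel lists; no deep learning framework is assumed):

```python
def add_fuse(branch_a, branch_b):
    # element-wise sum of corresponding channels -> channel count unchanged,
    # more information under each feature
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(branch_a, branch_b)]

def concat_fuse(branch_a, branch_b):
    # stack the channels -> channel count grows, more features, but the
    # information under each feature is unchanged
    return branch_a + branch_b

# two branches, each with 2 channels of 3 values
a = [[1, 2, 3], [4, 5, 6]]
b = [[10, 20, 30], [40, 50, 60]]

added = add_fuse(a, b)       # still 2 channels
stacked = concat_fuse(a, b)  # now 4 channels
print(len(added), len(stacked), added[0])  # 2 4 [11, 22, 33]
```

The sketch shows why the add layer keeps the layer width unchanged while still combining the two branches' information.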
The invention compares the three network widening methods from the perspective of the number of network parameters. K denotes the size of the convolution kernel, N denotes the number of channels of the current layer, and M denotes the number of feature maps of the previous layer. The parameter quantity of a Conv2D layer can be calculated as:

Param = K²·M·N
taking the structure of fig. 6 as an example, the input picture size is 256 × 256, the convolution type here is Conv2D, and the add layer and the concatenate layer add no extra parameters themselves, but concatenation expands the number of channels and thus increases the parameter quantity of the next network layer. When the method of directly expanding the number of channels is adopted, the total parameter quantity of the convolutional layers is 1962; when a concatenate layer joins the two-way structure, the total parameter quantity of the convolutional layers is 1386; when an add layer joins the two-way structure, the total parameter quantity of the convolutional layers is 1206.
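The three schemes can be compared with the formula Param = K²·M·N. The toy layer widths below are hypothetical, since fig. 6 is not reproduced here, so the totals differ from the 1962/1386/1206 quoted above; what the sketch illustrates is the ordering add < concatenate < direct widening, because the add layer does not inflate the next layer's input:

```python
K = 3  # assumed 3x3 kernels (an assumption; fig. 6 is not reproduced here)

def conv2d_params(m, n, k=K):
    # parameter count of one bias-free Conv2D layer: K^2 * M * N
    return k * k * m * n

def total_params(layers):
    # layers: list of (input_channels, output_channels) pairs
    return sum(conv2d_params(m, n) for m, n in layers)

stem = [(1, 4)]  # shared stem layer before the widened section

# scheme 1: directly widen the body layers from 4 to 8 channels
direct_total = total_params(stem + [(4, 8), (8, 8), (8, 4)])

# scheme 2: two parallel width-4 branches merged by concatenate
# (the head layer then sees 8 input channels, inflating its parameters)
concat_total = total_params(stem + [(4, 4), (4, 4)] * 2 + [(8, 4)])

# scheme 3: two parallel width-4 branches merged by add
# (the head layer still sees only 4 input channels)
add_total = total_params(stem + [(4, 4), (4, 4)] * 2 + [(4, 4)])

print(direct_total, concat_total, add_total)  # 1188 900 756
```

Under these hypothetical widths the ordering matches the patent's observation: the add-based widening is the cheapest of the three.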
Secondly, the invention compares the three network broadening methods from the view point of time complexity. The time complexity of the convolutional neural network is:
Time ~ O(∑_{l=1}^{D} S_l²·K_l²·N_{l-1}·N_l)

wherein D is the number of convolutional layers, S_l is the output feature map size of the l-th layer, K_l is the convolution kernel size of the l-th layer, and N_{l-1} and N_l are the channel numbers of the previous layer and the l-th layer, respectively.
the feature map size S is jointly determined by the size P of the input matrix, the size K of the convolution kernel, the padding and the stride; the correspondence can be expressed as:

S = ⌊(P − K + 2·padding)/stride⌋ + 1
obviously, the method of increasing the network width with an add layer is more advantageous in time complexity; lower time complexity means less training time, faster prediction, and a lower demand for computing power.
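The output-size rule above can be checked against part one of the model described in the embodiments. Assuming 3 × 3 kernels and no padding (an assumption, since kernel sizes are not stated explicitly), a 229 × 229 input yields 114 after the stride-2 first convolution and 112 after the stride-1 second convolution, matching the sizes quoted in the embodiments:

```python
def conv_out_size(p, k, stride=1, padding=0):
    # S = floor((P - K + 2*padding) / stride) + 1
    return (p - k + 2 * padding) // stride + 1

s1 = conv_out_size(229, 3, stride=2)  # convolutional layer 1
s2 = conv_out_size(s1, 3, stride=1)   # convolutional layer 2
print(s1, s2)  # 114 112
```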
Step one, establishing a multi-branch channel capacity-expanding network model;
training a multi-branch channel capacity expansion network model by adopting a TrashNet data set to obtain a pre-trained multi-branch channel capacity expansion network model;
the TrashNet data set comprises 2527 RGB images, wherein 501 images of glass, 594 images of paper, 403 images of paperboard, 482 images of plastic, 410 images of metal and 137 images of garbage;
the background of all images is white and they are all taken under sufficient illumination. All images are 512 × 384 pixels in size;
thirdly, verifying the accuracy of the pre-trained multi-branch channel capacity expansion network model by using a TrashNet data set, obtaining the trained multi-branch channel capacity expansion network model if the accuracy meets the requirement, and continuing to train the multi-branch channel capacity expansion network model by using the TrashNet data set until the accuracy meets the requirement if the accuracy does not meet the requirement;
and step four, classifying the garbage images to be recognized by adopting the trained multi-branch channel capacity expansion network model.
The second embodiment is as follows: this embodiment differs from the first embodiment in that a multi-branch channel capacity expansion network model is established in step one; the specific process is as follows:
the multi-branch channel capacity expansion network model is shown in fig. 1 and consists of four parts connected end to end. First, the functions of some network layers are briefly explained: the convolutional layers and the depth separable convolutional layers perform feature extraction on the image; the batch normalization layer optimizes the variance and mean positions so that the new distribution better matches the real distribution of the data; and the activation layer increases the nonlinear expressive power of the neural network model. Since all batch normalization layers in the network are identical, they are not distinguished in fig. 7, and the same applies to the activation layers.
Other steps and parameters are the same as those in the first embodiment.
The third concrete implementation mode: the present embodiment differs from the first or second embodiment in that: in the first part of the multi-branch channel capacity-expanding network model, an input image with the size of 229 × 229 × 3 enters an input layer first, and then enters the convolutional layer 1 for first feature extraction, the number of channels of the convolutional layer 1 is 32, the stride is 2, and the input image is subjected to first downsampling, so that the feature map size output by the convolutional layer 1 is 114 × 114 × 32;
then, the feature map output from convolutional layer 1 sequentially passes through a batch normalization layer and an activation layer; it then enters convolutional layer 2 for the second feature extraction. The number of channels of convolutional layer 2 is 64, and the feature map size output by convolutional layer 2 is 112 × 112 × 64; the feature map then sequentially passes through a batch normalization layer and an activation layer;
then, the network is divided into two paths. The feature map entering the first path passes through depth separable convolutional layer 1, a batch normalization layer, an activation layer, depth separable convolutional layer 2 and a batch normalization layer; the number of channels of both depth separable convolutional layer 1 and depth separable convolutional layer 2 is 128, so the output feature map size of these five layers is 112 × 112 × 128. The feature map then enters maximum pooling layer 1 for downsampling, and the size after downsampling is 56 × 56 × 128;
meanwhile, after the feature map of the second path entering the network is subjected to down sampling and activation for once, the feature map is directly connected with the output of the first path of network through a residual connecting layer 1, and the size of the feature map is unchanged after the residual connection; the feature map output by residual connection layer 1 enters the second part.
Other steps and parameters are the same as those in the first or second embodiment.
The fourth concrete implementation mode: the difference between this embodiment and the first to third embodiments is that, in the second part of the multi-branch channel capacity expansion network model, the feature map entering the first path of the network passes through the active layer, the depth separable convolutional layer 3, the batch normalization layer, the active layer, the depth separable convolutional layer 4, and the batch normalization layer;
the number of channels of depth-separable convolutional layer 3 and depth-separable convolutional layer 4 is 256, so the output feature map sizes of the six layers, active layer, depth-separable convolutional layer 3, batch normalization layer, active layer, depth-separable convolutional layer 4, batch normalization layer, are 56 × 56 × 256, and then down-sampling into maximum pooling layer 2, the down-sampled size being 28 × 28 × 256;
meanwhile, after the feature map of the second path entering the network is subjected to down sampling and activation for once, the feature map is directly connected with the output of the first path of network through the residual connecting layer 2, and the feature map output by the residual connecting layer 2 is continuously transmitted downwards;
then the network is divided into two paths again, and the feature graph entering the first path of the network passes through an activation layer, a depth separable convolution layer 5, a batch standardization layer, an activation layer, a depth separable convolution layer 6 and a batch standardization layer; the number of channels of both depth-separable convolutional layers 5 and 6 is 728, so the output feature map sizes of the six layers, active layer, depth-separable convolutional layer 5, batch normalization layer, active layer, depth-separable convolutional layer 6, batch normalization layer, are 28 × 28 × 728, and then down-sampled into maximum pooling layer 3 to a size of 14 × 14 × 728;
meanwhile, after the feature diagram of the second path entering the network is subjected to down sampling and activation for once, the feature diagram is directly connected with the output of the first path network through the residual connecting layer 3, and the feature diagram output by the residual connecting layer 3 enters the third part.
Other steps and parameters are the same as those in one of the first to third embodiments.
The fifth concrete implementation mode: this embodiment differs from one of the first to fourth embodiments in the third part of the multi-branch channel capacity expansion network model. The third part is special: a large number of tests show that the network layers after the fourth downsampling are the most effective at extracting the features of the garbage image data, so the number of network paths in the third part is set to 8, and the network structure of this part is called the core structure;
the feature diagram input from the second part is divided into 8 paths, each path passes through six activation layers, six depth separable convolution layers and six batches of normalization layers, and the specific sequence is shown in the figure; taking the first branch as an example, the feature map entering the 1 st branch sequentially passes through an active layer, a depth separable convolutional layer 7, a batch normalization layer, an active layer, a depth separable convolutional layer 8, a batch normalization layer, an active layer, a depth separable convolutional layer 9 and a batch normalization layer;
the structures of the 8 shunts are identical, the depth-separable convolutional layers 7-30 are also identical, the number of channels is 728, and no downsampling is involved, so that the feature map sizes of the outputs of the depth-separable convolutional layers 7-30, all active layers and all batch normalization layers in the third part are 14 × 14 × 728;
then, the output of the 1st branch and the output of the 2nd branch are superimposed by residual connecting layer 4; the output of residual connecting layer 4 and the output of the 3rd branch are superimposed by residual connecting layer 5; the output of residual connecting layer 5 and the output of the 4th branch are superimposed by residual connecting layer 6; the output of residual connecting layer 6 and the output of the 5th branch are superimposed by residual connecting layer 7; the output of residual connecting layer 7 and the output of the 6th branch are superimposed by residual connecting layer 8; the output of residual connecting layer 8 and the output of the 7th branch are superimposed by residual connecting layer 9; and the output of residual connecting layer 9 and the output of the 8th branch are superimposed by residual connecting layer 10. The branches are thus superimposed in sequence until the output of residual connecting layer 10 is obtained, which is the output of the core structure of the third part; this feature map is then input into the fourth part.
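The chained superposition described above can be sketched in plain Python (toy per-channel values stand in for the real 14 × 14 × 728 feature maps; the branch outputs are hypothetical numbers):

```python
def add_layer(x, y):
    # element-wise superposition of corresponding channels
    return [a + b for a, b in zip(x, y)]

# 8 hypothetical branch outputs, each reduced to 4 "channels" of one value
branches = [[float(i + j) for j in range(4)] for i in range(8)]

fused = branches[0]
for branch in branches[1:]:   # residual connecting layers 4..10 in sequence
    fused = add_layer(fused, branch)

# chaining 7 pairwise adds is equivalent to summing all 8 branch outputs,
# and the channel count (4 here) is unchanged by the fusion
expected = [sum(b[c] for b in branches) for c in range(4)]
print(fused == expected, len(fused))  # True 4
```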
Other steps and parameters are the same as in one of the first to fourth embodiments.
The sixth specific implementation mode: the difference between this embodiment and the first to fifth embodiments is that, in the fourth part of the multi-branch channel capacity-expanding network model, the feature map entering the first path of the network passes through the active layer, the depth separable convolutional layer 31, the batch normalization layer, the active layer, the depth separable convolutional layer 32, and the batch normalization layer;
the number of channels for depth-separable convolutional layers 31 and depth-separable convolutional layers 32 is 728 and 1024, respectively, so that the output feature map size of the first three active layers, depth-separable convolutional layers 31, and batch normalization layers is 14 × 14 × 728, and the output feature map size of the last three active layers, depth-separable convolutional layers 32, and batch normalization layers is 14 × 14 × 1024;
then, the data enters a maximum pooling layer 4 for down-sampling, and the size of the down-sampled data is 7 multiplied by 1024;
meanwhile, after the feature map of the second path entering the network is subjected to one-time downsampling and activation, the feature map is directly connected with the output of the first path of network through a residual connecting layer 11, and the feature map output by the residual connecting layer 11 passes through a depth separable convolutional layer 33, a batch normalization layer, an activation layer, a depth separable convolutional layer 34, a batch normalization layer and an activation layer;
the number of channels of depth-separable convolutional layer 33 and depth-separable convolutional layer 34 is 1536 and 2048, respectively, so that the output feature map size of the first three layers of depth-separable convolutional layer 33, batch normalization layer, and active layer is 7 × 7 × 1536, and the output feature map size of the last three layers of depth-separable convolutional layer 34, batch normalization layer, and active layer is 7 × 7 × 2048;
and finally, the feature map enters a global average pooling layer, is flattened into a vector of 1 multiplied by 2048, and finally enters a dense connection layer to obtain output.
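Since parts two to four rely on depth separable convolutions, it is worth noting why layers such as depth separable convolutional layer 34 (1536 → 2048 channels) stay tractable. The sketch below compares parameter counts under the usual depthwise-plus-pointwise factorization (bias terms omitted; the 3 × 3 kernel size is an assumption, as kernel sizes are not stated in the text):

```python
def standard_conv_params(k, m, n):
    # one K x K convolution mapping M channels to N channels
    return k * k * m * n

def separable_conv_params(k, m, n):
    # depthwise step (K^2 * M) followed by a 1x1 pointwise step (M * N)
    return k * k * m + m * n

# channel counts of depth separable convolutional layer 34: 1536 -> 2048
k, m, n = 3, 1536, 2048
std = standard_conv_params(k, m, n)
sep = separable_conv_params(k, m, n)
print(std, sep)  # 28311552 3159552: roughly a 9x reduction
```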
Learning network proposed by the invention
The invention improves the fine-tuned Xception network with the proposed channel capacity expansion method; the network structure is shown in figure 7. A large number of tests on the fine-tuned Xception network show that the network layers after the fourth downsampling have the most pronounced effect on feature extraction for the TrashNet dataset, so the network structure of this part is widened. The specific method is to change the original long linear structure into a structure of 8 parallel branches. Setting the number of branches to 8 better exploits the feature-extraction advantage of this structure while keeping the total parameters of the proposed network the same as those of the fine-tuned Xception network, which facilitates quantitative comparison and analysis.
Other steps and parameters are the same as those in one of the first to fifth embodiments.
The seventh embodiment: this embodiment differs from one of the first to sixth embodiments in that, in the forward propagation process of the multi-branch channel capacity expansion network, for the core structure of the network, the input of the first depth separable convolutional layer of the l-th branch (the core part has 8 branches, and their first depth separable convolutional layers are layers 7, 10, 13, 16, 19, 22, 25 and 28, respectively) is:
Z_n^(l) = ∑_m w_mn^(l)·a_m + b_n^(l)

wherein w_mn^(l) is the weight assigned by the m-th neuron of the input to the n-th neuron of the target convolutional layer, a_m is the output of the m-th neuron of the previous layer, and b_n^(l) is the corresponding offset; l = 1, 2, 3, ..., 8 (there are 8 branches in total; l = 1 denotes the first branch, l = 2 the second, and so on). The information sources of the 8 branches are the same, but the information allocated to each first convolutional layer differs because the weights and offsets differ, a process controlled by back propagation;
after the first depth separable convolutional layer of 8 branches receives respective information, respective cross-correlation operation is performed, and the output is:
a^(l) = g(Z^(l))

wherein a^(l) is the output of the first depth separable convolutional layer of the l-th branch, Z^(l) is the input of the first depth separable convolutional layer of the l-th branch, and g(·) denotes the cross-correlation operation.
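A toy numeric version of this per-branch forward step (the weights, biases and input values are hypothetical, and ReLU stands in for the actual operation g(·)):

```python
def branch_forward(weights, bias, inputs):
    # weights[m][n]: weight from input neuron m to target neuron n,
    # i.e. Z(l)_n = sum_m w(l)_mn * a_m + b(l)_n
    z = [sum(weights[m][n] * inputs[m] for m in range(len(inputs))) + bias[n]
         for n in range(len(bias))]
    return [max(0.0, v) for v in z]  # g(): ReLU as a stand-in

shared_input = [1.0, 2.0, 0.5]                 # same source for every branch
w1 = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]    # branch 1 weights (hypothetical)
w2 = [[-0.2, 0.1], [0.1, -0.3], [0.5, 0.2]]    # branch 2 weights (different)
b = [0.1, 0.0]

out1 = branch_forward(w1, b, shared_input)
out2 = branch_forward(w2, b, shared_input)
print(out1 != out2)  # True: same input, different per-branch activations
```

This mirrors the point above: the branches share one information source, yet different weights and offsets allocate different information to each first convolutional layer.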
The outputs of the 8 convolutional layers are each transmitted to the next network layer of their own branch until the operation of the last layer in the branch is completed; these processes proceed independently and do not interfere with each other.
Finally, the outputs of the branches are fused (by simple addition) through 7 residual connecting layers (add layers); the fusion method is the addition of corresponding channel feature information. Using several add layers makes it more convenient to observe the effect of fusing the branch outputs.
In the non-core structure of the multi-branch channel capacity expansion network of the invention (in fig. 1, the third part is called the core structure, and the remaining first, second and fourth parts are non-core structures), there are four short-circuit connections: one in the first part, two in the second part, and one in the fourth part of fig. 7. The network structure between two short-circuit connections is regarded as a short-circuit block (in fig. 1, each time a line is led out to make a cross-layer connection, the process is called a short-circuit connection, and the layers that are crossed, i.e., short-circuited, form the network structure between the short-circuit connections). According to the chain rule, the back propagation gradient from a deep short-circuit block D to a shallow short-circuit block S (fig. 1 has 4 short-circuit connections, and depth is relative: when part 2 is considered together with part 1, the short-circuit connection of part 2 is the deep short-circuit block, whereas when part 2 is considered together with part 4, it is the shallow one) is:
∂loss/∂x_S = ∂loss/∂x_D · ∂x_D/∂x_S = ∂loss/∂x_D · (1 + ∂F(x_S)/∂x_S)

wherein x_S and x_D are the inputs of the shallow short-circuit block S and the deep short-circuit block D, respectively, and F(·) is the mapping performed by the short-circuited layers;
wherein loss is a loss value;
because 1 is added into the formula, the condition that the gradient is 0 cannot occur in a non-core structure;
for the core structure of the network, the gradient vanishing problem does not occur because the number of layers is small; therefore, in order to fully extract features, the invention does not add shortcuts in the core structure. Although the back propagation training rule inevitably introduces learning rate differences, i.e., network layers near the input learn more slowly than those near the output, the learning processes of the branches are synchronized because the structures of the 8 branches are identical.
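The effect of the "+1" term in the non-core gradient can be checked numerically. In the sketch below, f is a hypothetical short-circuit block whose own gradient nearly vanishes; the residual form h(x) = f(x) + x keeps the gradient near 1:

```python
def f(x):
    return 0.001 * x * x   # hypothetical block: gradient ~0 near x = 0

def h(x):
    return f(x) + x        # short-circuit (residual) connection

def numeric_grad(fn, x, eps=1e-6):
    # central finite difference as a stand-in for back propagation
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

g_block = numeric_grad(f, 0.1)   # ~0.0002: nearly vanished on its own
g_resid = numeric_grad(h, 0.1)   # ~1.0002: the added 1 keeps it alive
print(g_block, g_resid)
```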
Other steps and parameters are the same as those in one of the first to sixth embodiments.
The first embodiment is as follows:
experimental procedures and results
Experimental configuration
Data enhancement
Since the TrashNet dataset is small, with 2527 images, the invention uses data enhancement to suppress the overfitting phenomenon during training [39] (Al-falluji R A, Youssif A A A, Guirguis S K. Single Image Super Resolution Model Using Learnable Weight Factor in Residual Skip Connection[J]. IEEE Access, 2019, 7: 58676-) [40] (Pham T D. Geostatistical Simulation of Medical Images for Data Augmentation in Deep Learning[J]. IEEE Access, 2019.), including cropping, zooming, rotation, etc. The specific data enhancement settings of the invention are shown in table 1.
Table 1 data enhancement settings
Optimizing and regularizing settings
Deep learning methods require an extensive training procedure to achieve high model performance; if the parameters are not initialized properly, training can take a long time and fall into local minima [41] (Zhong G, Zhang K, Wei H, et al. Marginal deep architecture: stacking feature learning modules to build deep learning models[J]. IEEE Access, 2019, 7: 30220-). After a large number of tests, the invention sets the more appropriate optimization parameters shown in table 2.
TABLE 2 optimization and regularization settings
Results of the experiment
Comparison of the network of the present invention with the finely tuned Xception network
The present invention trains the fine-tuned Xception network and the network proposed by the present invention under the same conditions, and table 3 shows the training results and the total network parameters.
Table 3: accuracy and total parameters of the proposed network and the fine-tuned Xception network
Because the number of layers and channels of the proposed network is consistent with that of the fine-tuned Xception network, the total model parameters are the same, and on this basis the accuracy comparison can accurately reflect the merits of the method. It can be seen that the method of the invention is effective on the TrashNet dataset, improving accuracy by 1.75% without adding extra computational load.
Fig. 8a, 8b, 8c and 8d show the training records of the proposed network and the fine-tuned Xception network. In terms of convergence speed, the fine-tuned Xception model converges after 105 iterations, whereas the proposed network converges after 82 iterations. The faster convergence saves computational resources during training. In terms of the fit between the training loss and validation loss curves, the proposed network is superior to the fine-tuned Xception network, indicating a better fitting capability on the TrashNet dataset; accordingly, the final validation loss is lower than that of the fine-tuned Xception network and the learning effect is improved.
Next, the present invention performs occlusion testing on both networks. According to the difference of the shielding positions, 9 new test sets are established, the trained model and the fine-tuned Xception model are tested, and the test results are shown in FIG. 9. In FIG. 9, the present invention selects a test picture sample to describe the occlusion position of the test set.
In the occlusion test, the network of the invention obtains better results, reflecting that it has a stronger anti-interference capability than the fine-tuned Xception network. Therefore, the network of the invention can provide more reliable prediction results even when the test sample is incomplete or occluded.
Since the core structure of the network can effectively extract features of the TrashNet dataset, the invention further optimizes the number of channels of the core structure. Fig. 10 shows typical experimental results: the network performs best when the number of channels of the core structure is changed to 896, improving accuracy by 1.09% over the previous configuration. It can be seen that, for the TrashNet dataset, network width is an important factor determining network performance, and properly widening the network yields better results.
Fig. 11a and 11b show confusion matrices before and after channel number optimization, the true category is shown on the vertical axis, the prediction category is shown on the horizontal axis, and the elements of the main diagonal are the image proportion for identifying correct images. It can be seen that after the number of channels of the core structure is changed to 896, the recognition accuracy rates of three types of glass, metal and other garbage which are difficult to predict correctly are respectively improved by 4%, 2% and 8%. The invention greatly improves the performance balance of the network and reduces the occurrence of unreliable detection of a certain class in practical application.
Fig. 12 compares the proposed network with a number of other recent works, including four conventional machine learning methods (SVM, XGB, RF, KNN) and seven image processing methods based on convolutional neural networks. The method closest in precision to the proposed network is the fine-tuned VGG16 network, at 0.93. However, the VGG16 network is very bulky, and the total parameters of the proposed network are only 20% of it.
The present invention converts the confusion matrix of a plurality of other methods into single-class identification accuracy, calculates the average identification accuracy of 6 classes, and compares the result with the method of the present invention to obtain the result shown in table 4.
Table 4: single-class and average recognition accuracy of the method of the invention and other recent works
In table 4, the other four methods are shown as bar graphs, and the method proposed by the invention as a line graph. The results show that the single-class recognition accuracy of the proposed network is higher than that of the other methods in every class, and the recognition accuracies across classes are more balanced. Compared with the other methods, the proposed network also has a clear advantage in average recognition accuracy.
F1-score is a comprehensive evaluation index for measuring accuracy and recall rate, and can better reflect the comprehensive performance of the model. We calculated F1-score for each of the above methods, and the results are shown in FIG. 13. The results show that the network of the invention has significant advantages in the cardboard, glass, metal, paper and plastic categories, further demonstrating the effectiveness of the process.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (7)

1. A garbage image classification method based on a multi-branch channel capacity expansion network is characterized by comprising the following steps: the method comprises the following specific processes:
step one, establishing a multi-branch channel capacity-expanding network model;
training a multi-branch channel capacity expansion network model by adopting a TrashNet data set to obtain a pre-trained multi-branch channel capacity expansion network model;
the TrashNet data set comprises 2527 RGB images, wherein 501 images of glass, 594 images of paper, 403 images of paperboard, 482 images of plastic, 410 images of metal and 137 images of garbage;
the background of all images is white, and the size of all images is 512 × 384 pixels;
thirdly, verifying the accuracy of the pre-trained multi-branch channel capacity expansion network model by using a TrashNet data set, obtaining the trained multi-branch channel capacity expansion network model if the accuracy meets the requirement, and continuing to train the multi-branch channel capacity expansion network model by using the TrashNet data set until the accuracy meets the requirement if the accuracy does not meet the requirement;
and step four, classifying the garbage images to be recognized by adopting the trained multi-branch channel capacity expansion network model.
2. The method according to claim 1, wherein the garbage image classification method based on the multi-branch channel capacity expansion network is as follows: establishing a multi-branch channel capacity-expanding network model in the first step; the specific process is as follows:
the multi-branch channel capacity-expanding network model is formed by connecting four parts end to end.
3. The method according to claim 2, wherein the garbage image classification method based on the multi-branch channel capacity expansion network is as follows: in the first part of the multi-branch channel capacity-expanding network model, an input image with the size of 229 × 229 × 3 enters an input layer first, and then enters the convolutional layer 1 for first feature extraction, the number of channels of the convolutional layer 1 is 32, the stride is 2, and the input image is subjected to first downsampling, so that the feature map size output by the convolutional layer 1 is 114 × 114 × 32;
then, the feature map output from convolutional layer 1 sequentially passes through a batch normalization layer and an activation layer; it then enters convolutional layer 2 for the second feature extraction, the number of channels of convolutional layer 2 is 64, the feature map size output by convolutional layer 2 is 112 × 112 × 64, and the feature map then sequentially passes through a batch normalization layer and an activation layer;
then, the network is divided into two paths, wherein the feature map of the first path passes through depth separable convolutional layer 1, a batch normalization layer, an activation layer, depth separable convolutional layer 2 and a batch normalization layer; the number of channels of both depth separable convolutional layer 1 and depth separable convolutional layer 2 is 128, so the output feature map size of these five layers is 112 × 112 × 128; the feature map then enters maximum pooling layer 1 for downsampling, and the size after downsampling is 56 × 56 × 128;
meanwhile, after the feature map of the second path is subjected to down sampling and activation for one time, the feature map is directly connected with the output of the first path of network through a residual connecting layer 1, and the size of the feature map is unchanged after the residual connection; the feature map output by residual connection layer 1 enters the second part.
4. The method according to claim 3, wherein in the second part of the multi-branch channel capacity expansion network model, the feature map entering the first path of the network passes through an activation layer, depth separable convolutional layer 3, a batch normalization layer, an activation layer, depth separable convolutional layer 4, and a batch normalization layer;
both depth separable convolutional layer 3 and depth separable convolutional layer 4 have 256 channels, so the feature map output by these six layers has size 56 × 56 × 256; the feature map then enters max pooling layer 2 for downsampling, giving a size of 28 × 28 × 256 after downsampling;
meanwhile, after one downsampling and one activation, the feature map of the second path is added directly to the output of the first path through residual connecting layer 2, and the feature map output by residual connecting layer 2 is propagated onward;
the network then splits into two paths again; the feature map entering the first path passes through an activation layer, depth separable convolutional layer 5, a batch normalization layer, an activation layer, depth separable convolutional layer 6, and a batch normalization layer; both depth separable convolutional layers 5 and 6 have 728 channels, so the feature map output by these six layers has size 28 × 28 × 728; the feature map then enters max pooling layer 3 for downsampling, giving a size of 14 × 14 × 728;
meanwhile, after one downsampling and one activation, the feature map of the second path is added directly to the output of the first path through residual connecting layer 3, and the feature map output by residual connecting layer 3 enters the third part.
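Claims 3 and 4 build mainly on depth separable convolutions. As a minimal illustration of the operation itself (not the patented layer configuration), a stride-1, 'valid'-padding depthwise 3 × 3 stage followed by a pointwise 1 × 1 stage can be sketched in numpy as:

```python
import numpy as np

def depthwise_separable_conv(x, dw, pw):
    """Depthwise stage followed by a pointwise (1x1) stage.

    x:  (H, W, C_in) input feature map
    dw: (kh, kw, C_in) one spatial kernel per input channel
    pw: (C_in, C_out) 1x1 weights mixing the channels
    Stride 1, 'valid' padding, no bias -- a sketch, not the patented layer.
    """
    H, W, C = x.shape
    kh, kw, _ = dw.shape
    oh, ow = H - kh + 1, W - kw + 1
    mid = np.zeros((oh, ow, C))
    for c in range(C):                       # depthwise: channels stay separate
        for i in range(oh):
            for j in range(ow):
                mid[i, j, c] = (x[i:i + kh, j:j + kw, c] * dw[:, :, c]).sum()
    return mid @ pw                           # pointwise: mix channels

y = depthwise_separable_conv(np.ones((5, 5, 2)), np.ones((3, 3, 2)), np.ones((2, 3)))
assert y.shape == (3, 3, 3)
assert np.allclose(y, 18.0)    # 2 channels x (3x3 window of ones) = 18 per output
```

Compared with a standard convolution, this factorization is what lets the network stack many 728-channel layers at modest parameter cost.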
5. The method according to claim 4, wherein in the third part of the multi-branch channel capacity expansion network model, the number of network paths is set to 8, and the network structure of the third part is called the core structure;
the feature map input from the second part is divided into 8 branches;
the feature map entering the 1st branch passes in sequence through an activation layer, depth separable convolutional layer 7, a batch normalization layer, an activation layer, depth separable convolutional layer 8, a batch normalization layer, an activation layer, depth separable convolutional layer 9, and a batch normalization layer;
the structures of the 8 branches are identical, and depth separable convolutional layers 7–30 are likewise identical, each with 728 channels; no downsampling is involved, so the feature maps output by depth separable convolutional layers 7–30 and by all activation and batch normalization layers in the third part have size 14 × 14 × 728;
the output of the 1st branch and the output of the 2nd branch are added through residual connecting layer 4; the output of residual connecting layer 4 and the output of the 3rd branch are added through residual connecting layer 5; the output of residual connecting layer 5 and the output of the 4th branch are added through residual connecting layer 6; the output of residual connecting layer 6 and the output of the 5th branch are added through residual connecting layer 7; the output of residual connecting layer 7 and the output of the 6th branch are added through residual connecting layer 8; the output of residual connecting layer 8 and the output of the 7th branch are added through residual connecting layer 9; and the output of residual connecting layer 9 and the output of the 8th branch are added through residual connecting layer 10, which produces the output of the core structure; the feature map output by the core structure of the third part is then input to the fourth part.
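The cascade of residual connecting layers 4–10 described in this claim amounts to cumulatively summing the outputs of all 8 branches. A toy sketch, with placeholder callables standing in for the depth separable convolution stacks of each branch:

```python
import numpy as np

def core_structure(x, branches):
    """Cascaded residual fusion of the 8 parallel branches (claim 5 wording).

    Residual connecting layer 4 adds branch 1 and branch 2; each subsequent
    residual connecting layer (5..10) adds the running sum and the next
    branch's output. 'branches' are placeholders for the actual
    depth separable convolution stacks.
    """
    outs = [f(x) for f in branches]
    acc = outs[0] + outs[1]              # residual connecting layer 4
    for out in outs[2:]:                 # residual connecting layers 5..10
        acc = acc + out
    return acc

# Toy check with identity branches: the cascade reduces to summing all 8 outputs.
x = np.ones((14, 14, 4))
y = core_structure(x, [lambda t: t] * 8)
assert y.shape == x.shape
assert np.allclose(y, 8 * x)
```

Because every branch and every residual layer preserves the 14 × 14 × 728 size, the fused output keeps that size as well.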
6. The method according to claim 5, wherein in the fourth part of the multi-branch channel capacity expansion network model, the feature map of the first path passes through an activation layer, depth separable convolutional layer 31, a batch normalization layer, an activation layer, depth separable convolutional layer 32, and a batch normalization layer;
depth separable convolutional layer 31 has 728 channels and depth separable convolutional layer 32 has 1024 channels, so the feature map output by the first three layers (activation layer, depth separable convolutional layer 31, batch normalization layer) has size 14 × 14 × 728, and the feature map output by the last three layers (activation layer, depth separable convolutional layer 32, batch normalization layer) has size 14 × 14 × 1024;
the feature map then enters max pooling layer 4 for downsampling, giving a size of 7 × 7 × 1024 after downsampling;
meanwhile, after one downsampling and one activation, the feature map of the second path is added directly to the output of the first path through residual connecting layer 11; the feature map output by residual connecting layer 11 then passes through depth separable convolutional layer 33, a batch normalization layer, an activation layer, depth separable convolutional layer 34, a batch normalization layer, and an activation layer;
depth separable convolutional layer 33 has 1536 channels and depth separable convolutional layer 34 has 2048 channels, so the feature map output by the first three layers (depth separable convolutional layer 33, batch normalization layer, activation layer) has size 7 × 7 × 1536, and the feature map output by the last three layers (depth separable convolutional layer 34, batch normalization layer, activation layer) has size 7 × 7 × 2048;
finally, the feature map enters a global average pooling layer, is flattened into a 1 × 2048 vector, and enters a densely connected layer to obtain the output.
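The final pooling-and-classification step can be illustrated with a minimal numpy sketch. The number of output classes is an assumption here (the claim does not state it), and the weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 6                       # assumed class count; not stated in the claim

feat = rng.random((7, 7, 2048))       # exit-flow feature map from claim 6
vec = feat.mean(axis=(0, 1))          # global average pooling -> length-2048 vector
W = rng.standard_normal((2048, num_classes))
b = np.zeros(num_classes)
logits = vec @ W + b                  # densely connected (fully connected) layer

assert vec.shape == (2048,)
assert logits.shape == (num_classes,)
```

Global average pooling collapses each of the 2048 channels to its spatial mean, which is what turns the 7 × 7 × 2048 map into the 1 × 2048 vector the claim describes.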
7. The method according to claim 6, wherein during forward propagation of the multi-branch channel capacity expansion network, for the network core structure, the input of the first depth separable convolutional layer of the l-th branch is:
Z^{(l)}_{n} = \sum_{m} w^{(l)}_{mn} a_{m} + b^{(l)}_{n}, \qquad l = 1, 2, \ldots, 8

where w^{(l)}_{mn} is the weight assigned by the m-th neuron of the input to the n-th neuron of the target convolutional layer, a_{m} is the output of the m-th neuron of the previous layer, and b^{(l)}_{n} is the corresponding bias;
after the first depth separable convolutional layer of each of the 8 branches receives its input, it performs its cross-correlation operation, and the output is:

a^{(l)} = g(Z^{(l)})

where a^{(l)} is the output of the first depth separable convolutional layer of the l-th branch, Z^{(l)} is the input of that layer, and g(·) denotes the cross-correlation operation.
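As an illustration of the operation the claim denotes g(), a minimal 'valid' 2-D cross-correlation in numpy (input and kernel values here are arbitrary):

```python
import numpy as np

def cross_correlate2d(x, k):
    """'Valid' 2-D cross-correlation -- the operation the claim denotes g()."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

x = np.arange(16.0).reshape(4, 4)      # arbitrary 4x4 input
k = np.ones((3, 3))                    # arbitrary 3x3 kernel
y = cross_correlate2d(x, k)
assert y.shape == (2, 2)
assert y[0, 0] == x[:3, :3].sum()      # each output is a windowed inner product
```

Unlike true convolution, cross-correlation slides the kernel without flipping it, which is how convolutional layers are implemented in most deep learning frameworks.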
CN202010379289.8A 2020-05-07 2020-05-07 Garbage image classification method based on multi-branch channel capacity expansion network Expired - Fee Related CN111639677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010379289.8A CN111639677B (en) 2020-05-07 2020-05-07 Garbage image classification method based on multi-branch channel capacity expansion network

Publications (2)

Publication Number Publication Date
CN111639677A true CN111639677A (en) 2020-09-08
CN111639677B CN111639677B (en) 2022-09-02

Family

ID=72331995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010379289.8A Expired - Fee Related CN111639677B (en) 2020-05-07 2020-05-07 Garbage image classification method based on multi-branch channel capacity expansion network

Country Status (1)

Country Link
CN (1) CN111639677B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183360A (en) * 2020-09-29 2021-01-05 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image
CN112867010A (en) * 2021-01-14 2021-05-28 中国科学院国家空间科学中心 Radio frequency fingerprint embedded real-time identification method and system based on convolutional neural network
TWI768841B (en) * 2021-04-22 2022-06-21 亞東學校財團法人亞東科技大學 Medical waste classification system and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764072A (en) * 2018-05-14 2018-11-06 浙江工业大学 A kind of blood cell subsets image classification method based on Multiscale Fusion
CN109606991A (en) * 2019-01-11 2019-04-12 郑州大学 Intelligent garbage bin and refuse classification method based on deep learning
CN110473130A (en) * 2019-07-30 2019-11-19 五邑大学 A kind of garbage classification evaluation method, device and storage medium based on deep learning
US20200082167A1 (en) * 2018-09-07 2020-03-12 Ben Shalom System and method for trash-detection and management
CN111056183A (en) * 2019-11-29 2020-04-24 华东师范大学 Real-time intelligent garbage self-classification system based on ZYNQ
CN111079639A (en) * 2019-12-13 2020-04-28 中国平安财产保险股份有限公司 Method, device and equipment for constructing garbage image classification model and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ANH H. VO等: "A Novel Framework for Trash Classification Using Deep Transfer Learning", 《IEEE ACCESS》 *
CUIPING SHI等: "A Novel Multi-Branch Channel Expansion Network for Garbage Image Classification", 《IEEE ACCESS》 *
FRANC¸OIS CHOLLET: "Xception: Deep Learning with Depthwise Separable Convolutions", 《PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
JIE-WEN FENG等: "Office Garbage Intelligent Classification Based on Inception-v3 Transfer Learning Model", 《JOURNAL OF PHYSICS: CONFERENCE SERIES》 *
RAHMI ARDA ARAL等: "Classification of TrashNet Dataset Based on Deep Learning Models", 《2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)》 *
WU Xiaoling et al.: "Research on Plastic Waste Classification Based on Deep Convolutional Neural Networks", Plastics Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183360A (en) * 2020-09-29 2021-01-05 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image
CN112183360B (en) * 2020-09-29 2022-11-08 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image
CN112867010A (en) * 2021-01-14 2021-05-28 中国科学院国家空间科学中心 Radio frequency fingerprint embedded real-time identification method and system based on convolutional neural network
CN112867010B (en) * 2021-01-14 2023-04-18 中国科学院国家空间科学中心 Radio frequency fingerprint embedded real-time identification method and system based on convolutional neural network
TWI768841B (en) * 2021-04-22 2022-06-21 亞東學校財團法人亞東科技大學 Medical waste classification system and method

Also Published As

Publication number Publication date
CN111639677B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
Shi et al. A novel multi-branch channel expansion network for garbage image classification
CN111639677B (en) Garbage image classification method based on multi-branch channel capacity expansion network
CN107209873B (en) Hyper-parameter selection for deep convolutional networks
Esmaeili et al. Fast-at: Fast automatic thumbnail generation using deep neural networks
CN111126202A (en) Optical remote sensing image target detection method based on void feature pyramid network
CN109598268A (en) A kind of RGB-D well-marked target detection method based on single flow depth degree network
Jiang et al. Hyperspectral image classification with spatial consistence using fully convolutional spatial propagation network
CN107871105A (en) Face authentication method and device
CN114758288A (en) Power distribution network engineering safety control detection method and device
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
Yang et al. Multi-scale bidirectional fcn for object skeleton extraction
CN110222718A (en) The method and device of image procossing
CN112528845B (en) Physical circuit diagram identification method based on deep learning and application thereof
CN115035418A (en) Remote sensing image semantic segmentation method and system based on improved deep LabV3+ network
CN111931953A (en) Multi-scale characteristic depth forest identification method for waste mobile phones
CN113052254A (en) Multi-attention ghost residual fusion classification model and classification method thereof
CN112883931A (en) Real-time true and false motion judgment method based on long and short term memory network
CN112669343A (en) Zhuang minority nationality clothing segmentation method based on deep learning
CN114861761A (en) Loop detection method based on twin network characteristics and geometric verification
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN113128564A (en) Typical target detection method and system based on deep learning under complex background
Fu et al. Complementarity-aware Local-global Feature Fusion Network for Building Extraction in Remote Sensing Images
Li et al. A new algorithm of vehicle license plate location based on convolutional neural network
Li et al. PaFPN-SOLO: A SOLO-based Image Instance Segmentation Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220902