CN111027605A - Fine-grained image recognition method and device based on deep learning - Google Patents


Info

Publication number
CN111027605A
CN111027605A (application CN201911193231.8A)
Authority
CN
China
Prior art keywords
image recognition
fine
target
neural network
deep convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911193231.8A
Other languages
Chinese (zh)
Inventor
宋波
Current Assignee
Beijing Moviebook Technology Corp Ltd
Original Assignee
Beijing Moviebook Technology Corp Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Moviebook Technology Corp Ltd filed Critical Beijing Moviebook Technology Corp Ltd
Priority to CN201911193231.8A
Publication of CN111027605A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The method loads a model pre-trained on the ImageNet database into a deep convolutional neural network. During training, the fully connected layer of the network is trained first; once its parameters are optimized, the remaining layers of the deep convolutional network are trained, yielding an optimized fine-grained image recognition model. Finally, a target image to be detected is input into the deep convolutional neural network via YOLO target detection to generate detection boxes corresponding to the target image; the boxes are thresholded according to the confidence of the fine-grained image recognition model, and the image recognition result is output. The method avoids background errors and the time-consuming candidate-region extraction step, improves detection speed, and learns generalizable features of the object that transfer to other domains, improving accuracy.

Description

Fine-grained image recognition method and device based on deep learning
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a fine-grained image recognition method and apparatus based on deep learning.
Background
In practice, the conventional target detection pipeline can be described as: target feature extraction, target recognition, and target localization. Common image features include SIFT, HOG, SURF, and LBP; these features are used to recognize the target and are then combined with corresponding strategies to localize it. Traditional target detection methods have achieved some success, but because hand-crafted features generalize poorly, they cannot cope well with complex natural scenes.
Today, target detection and recognition based on deep learning has become mainstream. The main pipeline of these methods is: extract deep features from the image, then recognize and localize the target with a deep neural network. According to their pipeline, they fall into two main categories:
(1) two-stage methods, such as R-CNN, SPP-Net, Fast R-CNN, and Faster R-CNN;
(2) one-stage methods, such as the YOLO series.
Girshick proposed R-CNN for target detection and recognition. Its idea can be summarized as follows: use deep learning to extract image features and thereby improve detection and recognition accuracy; adopt region proposals to reduce the number of candidate regions; and further improve detection accuracy with a bounding-box regression strategy. SPP-Net improves substantially on R-CNN: its spatial pyramid pooling removes the image-normalization step, avoiding the information loss and storage problems caused by image deformation, and by extracting features from the original image only once it eliminates repeated convolutional computation, running 24 to 102 times faster than R-CNN. Fast R-CNN makes further substantive improvements over R-CNN: it shares the convolutional layers, feeding the whole picture through the convolutional neural network once and extracting candidate regions from the feature map output by the convolutional layers, so features are shared, computation parameters are greatly reduced, and compute is saved; and it replaces the SVM classifier of R-CNN with a softmax function, with training data passed directly to the loss layer in GPU memory, so the early-layer features of candidate regions no longer need to be recomputed and large amounts of data no longer need to be stored on disk, improving training speed and saving storage space.
Furthermore, Faster R-CNN designs a Region Proposal Network (RPN) for extracting candidate regions and merges candidate-box extraction into the deep network, avoiding the time-consuming candidate-region extraction step and greatly improving detection speed; in addition, the network that generates proposal windows shares convolutional features with the target-detection network, so a single convolution pass over the picture both extracts features and generates candidate regions. The main performance indicators of a target detection model are detection accuracy and speed; two-stage methods have achieved good results in accuracy but are somewhat weaker in speed.
One-stage methods such as YOLO, by contrast, obtain the class probability and position coordinates of an object directly through regression, without generating candidate regions, and produce the final detection result in a single pass over the data, so they are inherently faster.
Conventional image recognition algorithms are generally designed for generic targets, where inter-class differences are large. For fine-grained classes, inter-class differences are small and easily swamped by external factors: objects of the same subclass vary in appearance with illumination, pose, background, and so on, so a traditional network used directly struggles to capture the fine-grained features that distinguish subclasses, and recognition performance is limited. In addition, some fine-grained image recognition methods require annotation at the object and part level to work well, and such labels are often costly to obtain in practice, so current methods aim to use only image-level labels. In existing fine-grained image recognition algorithms, whether or not extra annotation is used, the focus is on extracting important local information; the convolutional neural network serves only as a feature extractor, so feature extraction and classification are separate processes, losing the end-to-end training and optimization advantages that convolutional neural network models enjoy in coarse-grained image recognition.
Disclosure of Invention
It is an object of the present application to overcome the above problems or to at least partially solve or mitigate the above problems.
According to one aspect of the application, a fine-grained image recognition method based on deep learning is provided. A model pre-trained on the ImageNet database is loaded into a deep convolutional neural network and trained: the fully connected layer of the network is trained first, and once its parameters are optimized, the remaining layers of the deep convolutional network are trained, yielding an optimized fine-grained image recognition model. Finally, a target image to be detected is input into the deep convolutional neural network via YOLO target detection to generate detection boxes corresponding to the target image; the boxes are thresholded according to the confidence of the fine-grained image recognition model, and the image recognition result is output.
Preferably, when the fully connected layer of the deep convolutional neural network is trained first in the training process, the training data in the pre-training model are passed through the deep convolutional neural network, and the fully connected layer then outputs as many recognition results as there are classes of target image to be recognized.
Preferably, after the parameters of the fully connected layer are optimized, the remaining layers of the deep convolutional network are trained: the cross-entropy loss between the output of the fully connected layer and the class label of the target image is computed, and with the goal of minimizing this loss function, continuous iterative optimization drives the predicted class of the target image toward the true image class label, thereby yielding the optimized fine-grained image recognition model.
Preferably, the pre-training model is a 1000-class model trained on the ImageNet database, the data-loading batch size is 64, and a Momentum optimization method is used during training to optimize the parameters of the fully connected layer in the deep convolutional neural network, with a momentum parameter of 0.5 and 100 iterations when the fully connected layer is optimized.
Preferably, a gradient descent optimization method is used when training the remaining layers of the deep convolutional network to realize full-network training, with a learning rate of 0.0001 and 1000 iterations.
Preferably, inputting the target image to be detected into the deep convolutional neural network via YOLO target detection, generating detection boxes corresponding to the target image, thresholding the obtained boxes according to the confidence of the fine-grained image recognition model, and outputting the image recognition result comprises:
resizing the target image to be detected to a standard size and inputting it into the deep convolutional neural network;
running the deep convolutional neural network to obtain, for each detection box corresponding to the target image, the bounding-box coordinates, the confidence, and the class probability of the object contained in the bounding box;
and thresholding the obtained detection boxes according to the confidence of the fine-grained image recognition model to obtain the final image recognition result.
In particular, the present application also provides a fine-grained image recognition device based on deep learning, comprising:
the primary training module is configured to load and train a pre-training model on the ImageNet database through the deep convolutional neural network, and a full connection layer of the deep convolutional neural network is trained in the training process;
the secondary training module is configured to optimize parameters of the full-connection layer and train the rest layers of the deep convolutional network, so that an optimized fine-grained image recognition model is obtained;
and the target recognition module is configured to input a target image to be detected into the deep convolutional neural network through the YOLO target detection, generate a detection frame corresponding to the target image, perform threshold processing on the obtained detection frame according to the confidence of the fine-grained image recognition model, and output an image recognition result.
Preferably, in the primary training module, when the full-connection layer of the deep convolutional neural network is trained in the training process, the training data in the pre-training model is calculated by the deep convolutional neural network, and then the recognition results with the same number as the types of the target images to be generated are output by the full-connection layer.
Preferably, in the secondary training module, the parameters of the full connection layer are optimized, and then the remaining layers of the deep convolutional network are trained, cross entropy loss calculation is performed on the output of the full connection layer and the class label of the target image, so that the loss function is minimized, and the class of the output target image is close to the class label of the real image through continuous iterative optimization, thereby obtaining the optimized fine-grained image recognition model.
Preferably, the target recognition module is specifically configured to perform target recognition by:
resizing the target image to be detected to a standard size and inputting it into the deep convolutional neural network;
running the deep convolutional neural network to obtain, for each detection box corresponding to the target image, the bounding-box coordinates, the confidence, and the class probability of the object contained in the bounding box;
and thresholding the obtained detection boxes according to the confidence of the fine-grained image recognition model to obtain the final image recognition result.
According to yet another aspect of the application, there is provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method as described above when executing the computer program.
According to yet another aspect of the application, a computer-readable storage medium, preferably a non-volatile readable storage medium, is provided, having stored therein a computer program which, when executed by a processor, implements a method as described above.
According to yet another aspect of the application, there is provided a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method as described above.
According to the technical scheme, a fine-grained image recognition model is established through two-stage optimization and combined with the idea of YOLO target detection: the target detection task is treated as a regression problem, and the bounding box of the target, together with its confidence and class, is obtained directly from all pixels of the whole picture. This avoids background errors and the time-consuming candidate-region extraction step, improves detection speed, and learns generalizable features of the target, which are more easily transferred to other domains, improving accuracy to a certain extent.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a fine-grained image recognition method based on deep learning according to an embodiment of the present application;
fig. 2 is a block diagram of a fine-grained image recognition apparatus based on deep learning according to another embodiment of the present application;
FIG. 3 is a block diagram of a computing device according to another embodiment of the present application;
fig. 4 is a diagram of a computer-readable storage medium structure according to another embodiment of the present application.
Detailed Description
Fig. 1 is a flowchart of a fine-grained image recognition method based on deep learning according to an embodiment of the present application. Referring to fig. 1, the fine-grained image recognition method based on deep learning includes:
101: loading and training a pre-training model on an ImageNet database through a deep convolutional neural network, and training a full connection layer of the deep convolutional neural network in the training process;
102: optimizing parameters of the full-connection layer, and then training the rest layers of the deep convolutional network to obtain an optimized fine-grained image recognition model;
103: and finally, inputting the target image to be detected into a deep convolutional neural network through the YOLO target detection to generate a detection frame corresponding to the target image, carrying out threshold processing on the obtained detection frame according to the confidence coefficient of the fine-grained image recognition model, and outputting an image recognition result.
Preferably, when the fully connected layer of the deep convolutional neural network is trained first in the training process, the training data in the pre-training model are passed through the deep convolutional neural network, and the fully connected layer then outputs as many recognition results as there are classes of target image to be recognized.
Preferably, after the parameters of the fully connected layer are optimized, the remaining layers of the deep convolutional network are trained: the cross-entropy loss between the output of the fully connected layer and the class label of the target image is computed, and with the goal of minimizing this loss function, continuous iterative optimization drives the predicted class of the target image toward the true image class label, thereby yielding the optimized fine-grained image recognition model.
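As a minimal illustration of the cross-entropy step described above (a sketch only; the patent does not specify an implementation, and the fully-connected-layer scores and class count here are invented for the example), the loss between the layer's output and the class label can be computed as:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy_loss(logits, label):
    # logits: raw scores from the fully connected layer, one per class.
    # label: index of the true class of the target image.
    probs = softmax(logits)
    return -np.log(probs[label])

scores = np.array([2.0, 0.5, -1.0])   # hypothetical fully-connected-layer output for 3 classes
loss = cross_entropy_loss(scores, 0)  # true class label is 0
```

Minimizing this quantity over the training set is what drives the predicted class distribution toward the true label: the loss is small when the true class receives high probability and large otherwise.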
Preferably, the pre-training model is a 1000-class model trained on the ImageNet database, the data-loading batch size is 64, and a Momentum optimization method is used during training to optimize the parameters of the fully connected layer in the deep convolutional neural network, with a momentum parameter of 0.5 and 100 iterations when the fully connected layer is optimized.
Preferably, a gradient descent optimization method is used when training the remaining layers of the deep convolutional network to realize full-network training, with a learning rate of 0.0001 and 1000 iterations.
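The two-stage schedule described above (train the fully connected layer first with Momentum, then fine-tune the whole network with plain gradient descent) can be sketched numerically. This is an illustration only: the deep convolutional backbone is replaced by a single random linear layer with ReLU, the data are synthetic, and the stage-1 learning rate of 0.1 is an assumption; the document fixes only the batch size of 64, the momentum of 0.5 with 100 iterations, and the learning rate of 0.0001 with 1000 iterations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the pre-trained backbone and the fully connected head.
W_backbone = rng.normal(size=(4, 8)) * 0.1   # "pre-trained", frozen in stage 1
W_fc = np.zeros((8, 3))                      # new head, one output per class

X = rng.normal(size=(64, 4))                 # one batch of 64 samples
y = rng.integers(0, 3, size=64)              # synthetic class labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X):
    h = np.maximum(X @ W_backbone, 0.0)      # ReLU features from the "backbone"
    return h, softmax(h @ W_fc)

def ce_loss(probs, y):
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

loss_start = ce_loss(forward(X)[1], y)       # exactly ln(3): W_fc is all zeros

# Stage 1: train only the fully connected layer with Momentum
# (momentum 0.5, 100 iterations), backbone frozen.
velocity = np.zeros_like(W_fc)
for _ in range(100):
    h, probs = forward(X)
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0  # d(cross-entropy)/d(logits)
    grad_fc = h.T @ grad_logits / len(y)
    velocity = 0.5 * velocity - 0.1 * grad_fc  # momentum update (lr 0.1 assumed)
    W_fc += velocity

# Stage 2: unfreeze everything and fine-tune the whole network with plain
# gradient descent at learning rate 0.0001 for 1000 iterations.
for _ in range(1000):
    h, probs = forward(X)
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0
    grad_logits /= len(y)
    grad_fc = h.T @ grad_logits
    grad_h = (grad_logits @ W_fc.T) * (h > 0)  # backprop through ReLU
    grad_backbone = X.T @ grad_h
    W_fc -= 1e-4 * grad_fc
    W_backbone -= 1e-4 * grad_backbone

loss_end = ce_loss(forward(X)[1], y)
```

The design point is the ordering: warming up the new head on frozen features keeps its large early gradients from disturbing the pre-trained weights, after which a small learning rate lets the whole network adapt without forgetting the ImageNet features.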
Preferably, inputting the target image to be detected into the deep convolutional neural network via YOLO target detection, generating detection boxes corresponding to the target image, thresholding the obtained boxes according to the confidence of the fine-grained image recognition model, and outputting the image recognition result comprises:
resizing the target image to be detected to a standard size and inputting it into the deep convolutional neural network;
running the deep convolutional neural network to obtain, for each detection box corresponding to the target image, the bounding-box coordinates, the confidence, and the class probability of the object contained in the bounding box;
and thresholding the obtained detection boxes according to the confidence of the fine-grained image recognition model to obtain the final image recognition result.
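The threshold-processing step described above can be sketched as follows. The row layout (coordinates, then confidence, then per-class probabilities) is a common YOLO-style convention and the threshold value 0.5 is an assumption; the patent fixes neither, and the numbers below are invented for the example.

```python
import numpy as np

# Hypothetical network output: one row per detection box, laid out as
# [x, y, w, h, confidence, p_class0, p_class1, p_class2].
detections = np.array([
    [0.50, 0.50, 0.20, 0.30, 0.92, 0.05, 0.90, 0.05],
    [0.10, 0.80, 0.40, 0.10, 0.12, 0.60, 0.30, 0.10],  # low confidence: dropped
    [0.70, 0.20, 0.15, 0.25, 0.75, 0.80, 0.10, 0.10],
])

def threshold_detections(detections, conf_threshold=0.5):
    """Keep boxes whose confidence exceeds the threshold and report the
    most probable class for each surviving box."""
    keep = detections[detections[:, 4] > conf_threshold]
    boxes = keep[:, :4]
    confidences = keep[:, 4]
    classes = np.argmax(keep[:, 5:], axis=1)
    return boxes, confidences, classes

boxes, confs, classes = threshold_detections(detections)
```

Here the second box is discarded because its confidence (0.12) falls below the threshold; the remaining boxes, their confidences, and their argmax classes form the final recognition result.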
According to the method, a fine-grained image recognition model is established through two-stage optimization and combined with the idea of YOLO target detection: the target detection task is treated as a regression problem, and the bounding box of the target, together with its confidence and class, is obtained directly from all pixels of the whole picture. This avoids background errors and the time-consuming candidate-region extraction step, improves detection speed, and learns generalizable features of the target, which are more easily transferred to other domains, improving accuracy to a certain extent.
Fig. 2 is a block diagram of a fine-grained image recognition apparatus based on deep learning according to another embodiment of the present application. Referring to fig. 2, the fine-grained image recognition device based on deep learning includes:
a primary training module 201, configured to load and train a pre-training model on the ImageNet database through a deep convolutional neural network, wherein a full connection layer of the deep convolutional neural network is trained in a training process;
a secondary training module 202, configured to optimize parameters of a fully-connected layer and train the rest layers of the deep convolutional network, so as to obtain an optimized fine-grained image recognition model;
the target recognition module 203 is configured to input a target image to be detected into the deep convolutional neural network through YOLO target detection, generate a detection frame corresponding to the target image, perform threshold processing on the obtained detection frame according to the confidence of the fine-grained image recognition model, and output an image recognition result.
Preferably, in the primary training module 201, when the fully connected layer of the deep convolutional neural network is trained first in the training process, the training data in the pre-training model are passed through the deep convolutional neural network, and the fully connected layer then outputs as many recognition results as there are classes of target image to be recognized.
Preferably, in the secondary training module 202, after the parameters of the fully connected layer are optimized, the remaining layers of the deep convolutional network are trained: the cross-entropy loss between the output of the fully connected layer and the class label of the target image is computed, and with the goal of minimizing this loss function, continuous iterative optimization drives the predicted class of the target image toward the true image class label, thereby yielding the optimized fine-grained image recognition model.
Preferably, the target recognition module 203 is specifically configured to perform target recognition by:
resizing the target image to be detected to a standard size and inputting it into the deep convolutional neural network;
running the deep convolutional neural network to obtain, for each detection box corresponding to the target image, the bounding-box coordinates, the confidence, and the class probability of the object contained in the bounding box;
and thresholding the obtained detection boxes according to the confidence of the fine-grained image recognition model to obtain the final image recognition result.
An embodiment also provides a computing device, referring to fig. 3, comprising a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, implements the method steps 1131 of any of the methods described above.
An embodiment of the application also provides a computer-readable storage medium. Referring to fig. 4, the computer-readable storage medium comprises a storage unit for program code, provided with a program 1131' for performing the method steps described above, which program is executed by a processor.
An embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method described above.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A fine-grained image recognition method based on deep learning is characterized in that a pre-training model on an ImageNet database is loaded and trained through a deep convolutional neural network, a full-link layer of the deep convolutional neural network is trained in the training process, parameters of the full-link layer are optimized, then the rest layers of the deep convolutional network are trained, so that an optimized fine-grained image recognition model is obtained, finally a target image to be detected is input into the deep convolutional neural network through YOLO target detection, a detection frame corresponding to the target image is generated, threshold processing is carried out on the obtained detection frame through confidence of the fine-grained image recognition model, and an image recognition result is output.
2. The fine-grained image recognition method based on deep learning of claim 1, wherein when the fully connected layer of the deep convolutional neural network is trained first in the training process, the training data in the pre-training model are passed through the deep convolutional neural network, and the fully connected layer then outputs as many recognition results as there are classes of target image to be recognized.
3. The fine-grained image recognition method based on deep learning of claim 2, characterized in that parameters of the full-connection layer are optimized and then the remaining layers of the deep convolutional network are trained, cross entropy loss calculation is performed on the output of the full-connection layer and a class label of a target image, a loss function is minimized as a target, and the class of the output target image is close to a real image class label through continuous iterative optimization, so that an optimized fine-grained image recognition model is obtained.
4. The fine-grained image recognition method based on deep learning of claim 3, wherein the pre-training model is a pre-training model classified into 1000 classes trained on an ImageNet database, the data loading batch size is 64, a Momentum optimization method is used in the training process to optimize parameters of a full connection layer in a deep convolutional neural network, the Momentum parameter is 0.5 when the full connection layer is optimized, and the iteration number is 100.
5. The fine-grained image recognition method based on deep learning according to claim 3 or 4, wherein a gradient descent optimization method is used in the training process of training the rest layers of the deep convolutional network to realize full network training, the learning rate is 0.0001, and the iteration number is 1000.
6. The fine-grained image recognition method based on deep learning according to claim 1, wherein inputting the target image to be detected into the deep convolutional neural network through YOLO target detection to generate detection boxes corresponding to the target image, performing threshold processing on the obtained detection boxes according to the confidence of the fine-grained image recognition model, and outputting the image recognition result comprises the following steps:
resizing the target image to be detected to a standard size and then inputting it into the deep convolutional neural network;
running the deep convolutional neural network to obtain the bounding box coordinates of the detection boxes corresponding to the target image, the confidence, and the class probability of the object contained in each bounding box;
performing threshold processing on the obtained detection boxes according to the confidence of the fine-grained image recognition model to obtain the final image recognition result.
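The confidence-threshold step above can be sketched in NumPy as follows; the box coordinates, confidence values, class probabilities, and the 0.5 threshold are illustrative assumptions, not values from the patent:

```python
import numpy as np

def threshold_detections(boxes, confidences, class_probs, conf_threshold=0.5):
    """Keep only detection boxes whose confidence clears the threshold,
    and report each survivor's most probable class."""
    keep = confidences >= conf_threshold
    labels = class_probs[keep].argmax(axis=1)
    return boxes[keep], confidences[keep], labels

# Illustrative network outputs: three boxes (x, y, w, h), their
# confidences, and per-box probabilities over two classes.
boxes = np.array([[10, 10, 50, 80], [20, 30, 40, 60], [5, 5, 15, 15]], dtype=float)
confidences = np.array([0.92, 0.30, 0.75])
class_probs = np.array([[0.1, 0.9], [0.8, 0.2], [0.6, 0.4]])

kept_boxes, kept_conf, labels = threshold_detections(boxes, confidences, class_probs)
print(len(kept_boxes))  # boxes surviving the 0.5 threshold
print(labels)           # predicted class index for each surviving box
```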
7. A fine-grained image recognition device based on deep learning, comprising:
a primary training module configured to load and train a pre-trained model on the ImageNet database through a deep convolutional neural network, the fully-connected layer of the deep convolutional neural network being trained in the training process;
a secondary training module configured to optimize the parameters of the fully-connected layer and then train the remaining layers of the deep convolutional neural network, so that an optimized fine-grained image recognition model is obtained; and
a target recognition module configured to input a target image to be detected into the deep convolutional neural network through YOLO target detection, generate detection boxes corresponding to the target image, perform threshold processing on the obtained detection boxes according to the confidence of the fine-grained image recognition model, and output an image recognition result.
8. The fine-grained image recognition device based on deep learning according to claim 7, wherein the primary training module is configured such that, when the fully-connected layer of the deep convolutional neural network is trained in the training process, the training data in the pre-trained model is first computed by the deep convolutional neural network, and the fully-connected layer then outputs recognition results equal in number to the classes of target images.
9. The fine-grained image recognition device based on deep learning according to claim 8, wherein the secondary training module is configured to optimize the parameters of the fully-connected layer and then train the remaining layers of the deep convolutional neural network: a cross-entropy loss is computed between the output of the fully-connected layer and the class label of the target image, minimizing the loss function is taken as the objective, and through continuous iterative optimization the predicted class of the target image approaches the real image class label, so that an optimized fine-grained image recognition model is obtained.
10. The fine-grained image recognition device based on deep learning according to claim 7, wherein the target recognition module is specifically configured to implement target recognition steps comprising:
resizing the target image to be detected to a standard size and then inputting it into the deep convolutional neural network;
running the deep convolutional neural network to obtain the bounding box coordinates of the detection boxes corresponding to the target image, the confidence, and the class probability of the object contained in each bounding box;
performing threshold processing on the obtained detection boxes according to the confidence of the fine-grained image recognition model to obtain the final image recognition result.
CN201911193231.8A 2019-11-28 2019-11-28 Fine-grained image recognition method and device based on deep learning Pending CN111027605A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911193231.8A CN111027605A (en) 2019-11-28 2019-11-28 Fine-grained image recognition method and device based on deep learning

Publications (1)

Publication Number Publication Date
CN111027605A (en) 2020-04-17

Family

ID=70203052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911193231.8A Pending CN111027605A (en) 2019-11-28 2019-11-28 Fine-grained image recognition method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN111027605A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451602A (en) * 2017-07-06 2017-12-08 Zhejiang University of Technology A fruit and vegetable detection method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jia Junwei: "Research and Application of Fine-grained Image Recognition Algorithms Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797940A (en) * 2020-07-20 2020-10-20 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences Image identification method based on ocean search and rescue and related device
CN112598062A (en) * 2020-12-24 2021-04-02 Tsinghua University Image identification method and device
CN112766387A (en) * 2021-01-25 2021-05-07 Haier Digital Technology (Shanghai) Co., Ltd. Error correction method, device, equipment and storage medium for training data
CN112766387B (en) * 2021-01-25 2024-01-23 Kaos Digital Technology (Shanghai) Co., Ltd. Training data error correction method, device, equipment and storage medium
CN112966608A (en) * 2021-03-05 2021-06-15 Harbin Institute of Technology Target detection method, system and storage medium based on edge-side cooperation
CN113065446A (en) * 2021-03-29 2021-07-02 Qingdao Dongkun Weihua Digital Intelligence Energy Technology Co., Ltd. Depth inspection method for automatically identifying ship corrosion area
CN113177591A (en) * 2021-04-28 2021-07-27 Analysis and Testing Center, Sichuan Academy of Agricultural Sciences pH detection method
CN113111879A (en) * 2021-04-30 2021-07-13 Shanghai Ruiyu Biotechnology Co., Ltd. Cell detection method and system
CN113379858A (en) * 2021-05-31 2021-09-10 Super Vision Technology Co., Ltd. Image compression method and device based on deep learning
WO2022262141A1 (en) * 2021-06-16 2022-12-22 Shenzhen SenseTime Technology Co., Ltd. Human-in-the-loop method, apparatus and system, and electronic device and storage medium
CN113658119A (en) * 2021-08-02 2021-11-16 Shanghai Moviebook Technology Co., Ltd. Human brain injury detection method and device based on VAE

Similar Documents

Publication Publication Date Title
CN111027605A (en) Fine-grained image recognition method and device based on deep learning
JP7236545B2 (en) Video target tracking method and apparatus, computer apparatus, program
CN111476284B (en) Image recognition model training and image recognition method and device and electronic equipment
CN110084281B (en) Image generation method, neural network compression method, related device and equipment
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
US10242289B2 (en) Method for analysing media content
US10360703B2 (en) Automatic data extraction from a digital image
WO2019232853A1 (en) Chinese model training method, chinese image recognition method, device, apparatus and medium
US9014480B2 (en) Identifying a maximally stable extremal region (MSER) in an image by skipping comparison of pixels in the region
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN110033018B (en) Graph similarity judging method and device and computer readable storage medium
CN111488826A (en) Text recognition method and device, electronic equipment and storage medium
Liu et al. Real-time facial expression recognition based on cnn
CN114677565B (en) Training method and image processing method and device for feature extraction network
CN111753863A (en) Image classification method and device, electronic equipment and storage medium
JP2022161564A (en) System for training machine learning model recognizing character of text image
CN110490058B (en) Training method, device and system of pedestrian detection model and computer readable medium
CN110135428B (en) Image segmentation processing method and device
CN114495113A (en) Text classification method and training method and device of text classification model
CN113239883A (en) Method and device for training classification model, electronic equipment and storage medium
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN114707017B (en) Visual question-answering method, visual question-answering device, electronic equipment and storage medium
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN115661542A (en) Small sample target detection method based on feature relation migration
CN112507912A (en) Method and device for identifying illegal picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200417