CN116188916B - Fine granularity image recognition method, device, equipment and storage medium - Google Patents

Fine granularity image recognition method, device, equipment and storage medium

Info

Publication number
CN116188916B
CN116188916B CN202310404342.9A CN202310404342A CN116188916B
Authority
CN
China
Prior art keywords
fine
preset
training
layers
grained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310404342.9A
Other languages
Chinese (zh)
Other versions
CN116188916A (en)
Inventor
王金桥
郭子江
黄文俊
陈雄辉
朱贵波
张海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nexwise Intelligence China Ltd
Original Assignee
Nexwise Intelligence China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nexwise Intelligence China Ltd filed Critical Nexwise Intelligence China Ltd
Priority to CN202310404342.9A priority Critical patent/CN116188916B/en
Publication of CN116188916A publication Critical patent/CN116188916A/en
Application granted granted Critical
Publication of CN116188916B publication Critical patent/CN116188916B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

The invention provides a fine-grained image recognition method, apparatus, device, and storage medium, belonging to the technical field of image recognition. The fine-grained image recognition method comprises the following steps: acquiring a fine-grained image to be recognized; and recognizing the fine-grained image with a recognition model to obtain a recognition result. The parameters of the preset layers of the recognition model are preset values, the parameters of the layers of the recognition model other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the layers other than the preset layers. The scheme of the embodiments of the invention is less prone to overfitting and achieves higher recognition accuracy.

Description

Fine granularity image recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a fine-grained image recognition method, apparatus, device, and storage medium.
Background
Image classification is a classical research topic in the field of computer vision. It mainly includes coarse-grained image classification and fine-grained image classification. Fine-grained image classification, i.e., the problem of classifying sub-categories, subdivides a broad category into finer sub-categories, for example distinguishing the species of a bird or the brand and model of a car. Because image acquisition is affected by factors such as pose, viewing angle, illumination, occlusion, and background interference, fine-grained categories usually exhibit small inter-class differences and large intra-class differences, so fine-grained image classification is more difficult than ordinary image classification.
For supervised recognition-model training schemes, labeling fine-grained image data requires strong domain expertise, so annotating fine-grained images is more difficult than annotating ordinary images. As a result, the currently available labeled fine-grained image datasets contain relatively few images, and training on such a small image scale is prone to overfitting.
Disclosure of Invention
The invention provides a fine-grained image recognition method, apparatus, device, and storage medium, which are used to overcome the drawback of the prior art that training on a relatively small image scale is prone to overfitting, thereby realizing a fine-grained image recognition method that is not prone to overfitting.
The invention provides a fine-grained image recognition method, which comprises the following steps:
acquiring a fine-grained image to be identified;
identifying the fine-grained image by using an identification model to obtain an identification result; the parameters of the preset layers of the identification model are preset values, the parameters of the layers of the identification model other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the layers other than the preset layers.
According to the fine-grained image recognition method provided by the invention, the training process of the recognition model comprises a pre-training process and a fine-tuning process; the method further comprises the steps of:
training the layers of the identification model other than the preset layer in the pre-training process; in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the identification model; or,
training the identification model in the pre-training process; and, in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the identification model.
According to the fine-grained image recognition method provided by the invention, the fine-grained image is recognized by using the recognition model to obtain a recognition result, and the method comprises the following steps:
according to different target parameters, partitioning the fine-grained images according to the target parameters, and splicing the partitioned images according to an arrangement sequence different from the original arrangement sequence to obtain a recombined image; the target parameters include at least one of: the number of blocks in the abscissa direction, the number of blocks in the ordinate direction and the size of the blocks;
and inputting the recombined images corresponding to the different target parameters into the recognition model to obtain the recognition result.
According to the fine-grained image recognition method provided by the invention, the method for inputting the recombined images corresponding to the different target parameters into the recognition model to obtain the recognition result comprises the following steps:
and inputting the recombined images corresponding to the different target parameters into the identification model according to different stages according to a preset sequence to obtain the identification result.
According to the fine-grained image recognition method provided by the invention, the preset sequence is the same as the input sequence adopted in the training stage.
According to the fine-granularity image recognition method provided by the invention, a loss function used in a training process adopts the following formula (1):
(1)
wherein the value on the left-hand side of formula (1) is the loss function value, N is the number of fine-grained images, (a_i, p_i) denotes a positive sample pair derived from the i-th fine-grained image, i ∈ {1, 2, ..., N}, n_i and n_j denote negative samples derived from fine-grained images other than the i-th among the N fine-grained images, and z_ai, z_pi, z_ni, z_nj denote the features of the images a_i, p_i, n_i, n_j extracted by the recognition model.
According to the fine-grained image recognition method provided by the invention, the recognition model is a model established based on a residual network structure.
The invention also provides a fine-grained image recognition device, which comprises:
the acquisition module is used for acquiring the fine-grained image to be identified;
the processing module is used for identifying the fine-grained image by utilizing an identification model to obtain an identification result; the parameters of the preset layers of the identification model are preset values, the parameters of the layers of the identification model other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the layers other than the preset layers.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the fine-grained image recognition method as described above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a fine-grained image recognition method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a fine-grained image recognition method as described in any of the above.
According to the fine-grained image recognition method, apparatus, device, and storage medium provided by the invention, the fine-grained image is recognized with a recognition model to obtain a recognition result. The parameters of the preset layers of the recognition model are preset values, the parameters of the layers other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the other layers; in other words, the preset layers are the high layers of the recognition model. Because the amount of data in a fine-grained image dataset is small, while the high layers of the network generally contain more parameters and encode complex semantic information, those layers are the most prone to overfitting. In this scheme the preset layers are not trained; their preset values are used directly, i.e., the remaining layers of the recognition model are trained while the parameters of the preset layers are fixed to the preset values. As a result, overfitting is less likely to occur, and no labeled training data is required.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a fine-grained image recognition method provided by the invention;
FIG. 2 is a schematic diagram of a fine-grained image recognition method provided by the invention;
FIG. 3 is one of the image schematics of the fine-grained image recognition method provided by the present invention;
FIG. 4 is a second image diagram of the fine-grained image recognition method according to the present invention;
FIG. 5 is a third image diagram of the fine-grained image recognition method according to the present invention;
FIG. 6 is one of the progressive multi-granularity fusion image schematics of the fine granularity image recognition method provided by the invention;
FIG. 7 is a second schematic diagram of a progressive multi-granularity fusion image of the fine granularity image recognition method according to the present invention;
FIG. 8 is a sample pair generation schematic diagram of the fine-grained image recognition method provided by the invention;
fig. 9 is a schematic structural view of a fine-grained image recognition device provided by the invention;
fig. 10 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following describes the technical solution of the embodiment of the present invention in detail with reference to fig. 1 to 10. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 is a schematic flow chart of a fine-grained image recognition method provided by the invention. As shown in fig. 1, the method provided in this embodiment includes:
step 101, acquiring a fine-grained image to be identified;
Step 102, identifying the fine-grained image by using an identification model to obtain an identification result;
the parameters of the preset layers of the identification model are preset values, the parameters of other layers of the identification model except the preset layers are obtained based on label-free training, and the number of the preset layers is higher than that of the layers except the preset layers.
Specifically, because training a deep residual network on a small-scale dataset is prone to overfitting, the network is simplified in this embodiment: for example, the highest or second-highest layer of the recognition model's network structure, i.e., the preset layer, is removed during training, which means the preset layer of the recognition model is not trained. Experiments show that this effectively alleviates the overfitting caused by the small amount of data.
When the model is applied, the parameters of the preset layers of the recognition model are the preset values, and the parameters of the layers other than the preset layers are the values obtained through training.
Optionally, the recognition model may perform feature extraction on the input image, and classify and recognize the input image by a classifier.
Optionally, the identification model is a model built based on a residual network structure.
In the embodiment of the present invention, the backbone network adopted by the recognition model is a residual network. In the following description, the residual networks ResNet50 and ResNet18 are taken as examples; their network structures are shown in Table 2. Both residual networks can be divided into four stages (layer1, layer2, layer3, and layer4), each composed of several residual blocks. For example, using the MoCo self-supervised learning framework, ResNet50 and ResNet18 are each used as the encoder to perform self-supervised learning on a small-scale fine-grained image dataset, so as to verify the effectiveness of self-supervised learning for fine-grained image recognition; the results are shown in Table 1. As can be seen from Table 1, with the same number of pre-training and fine-tuning epochs, the recognition accuracy of ResNet50, which has more layers and stronger feature expression, is lower on these datasets than that of ResNet18, which has fewer layers and weaker feature expression.
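Tables 1 and 2 appear as images in the original publication and are not reproduced here. Purely by way of illustration, and assuming a standard PyTorch/torchvision implementation of the residual networks (an assumption not stated in the text), the four stages of a ResNet50 and their parameter counts can be inspected as follows:

```python
import torchvision

# Randomly initialised ResNet50; no ImageNet weights are loaded in this work.
enc = torchvision.models.resnet50(weights=None)

# The four residual stages referred to as layer1-layer4 in the text.
for name in ["layer1", "layer2", "layer3", "layer4"]:
    stage = getattr(enc, name)
    print(name, sum(p.numel() for p in stage.parameters()))
# The parameter count grows sharply toward layer4, which is one reason the higher
# stages overfit most easily on small fine-grained datasets.
```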
TABLE 1
TABLE 2
Further, to determine which part of the network model makes it difficult for ResNet50 to learn on fine-grained image datasets, the embodiment of the present invention discards the parameters of one layer of the residual network model at a time during training, thereby obtaining the influence of each layer's parameters on the fine-tuning process. Specifically, a control-variable method is adopted: the parameters of a given layer of the network are discarded one layer at a time, so that the influence of that layer on the result can be measured; the results are shown in Table 3. As can be seen from Table 3, for ResNet50, discarding the parameters of layer3 or layer4 of the pre-trained network before fine-tuning actually yields higher recognition accuracy, indicating that the higher-layer parameters of the network may be redundant for this task. This also explains why the deeper network structure is prone to overfitting: the amount of data in a fine-grained image dataset is small, whereas the higher layers of the residual structure have more parameters and contain complex semantic information, so overfitting is more likely to occur.
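Table 3 also appears as an image in the original publication. A minimal sketch of the control-variable study described above, assuming a torchvision ResNet50 encoder (the helper fine_tune_and_evaluate and the variable pretrained_encoder are hypothetical placeholders for the fine-tuning loop and the pre-trained model), might look as follows:

```python
import copy
import torchvision

def discard_layer(pretrained_encoder, layer_name):
    """Return a copy of the pre-trained encoder in which one stage's pre-trained
    parameters are discarded, i.e. replaced by a fresh random initialisation."""
    model = copy.deepcopy(pretrained_encoder)
    fresh = torchvision.models.resnet50(weights=None)  # randomly initialised reference
    getattr(model, layer_name).load_state_dict(getattr(fresh, layer_name).state_dict())
    return model

# Control-variable study: drop one stage at a time, fine-tune, and compare accuracies.
for name in ["layer1", "layer2", "layer3", "layer4"]:
    candidate = discard_layer(pretrained_encoder, name)
    # accuracy = fine_tune_and_evaluate(candidate, train_loader, test_loader)
```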
TABLE 3
In the method of this embodiment, the fine-grained image is recognized with a recognition model to obtain a recognition result. The parameters of the preset layers of the recognition model are preset values, the parameters of the layers other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the other layers; in other words, the preset layers are the high layers of the recognition model. Because the amount of data in a fine-grained image dataset is small, while the high layers of the network generally contain more parameters and encode complex semantic information, those layers are the most prone to overfitting. In this scheme the preset layers are not trained; their preset values are used directly, i.e., the remaining layers of the recognition model are trained while the parameters of the preset layers are fixed to the preset values. As a result, overfitting is less likely to occur, and no labeled training data is required.
Optionally, the training process of the recognition model comprises a pre-training process and a fine tuning process; the method further comprises the steps of:
training the layers of the recognition model other than the preset layer in the pre-training process; in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the recognition model; or,
training the recognition model in the pre-training process; and, in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the recognition model.
Specifically, because training a deep residual network on a small-scale dataset is prone to overfitting, this embodiment provides two methods for simplifying the residual network structure: Pre-Discard (BD) and Post-Discard (AD), which remove the higher layers of the deep network structure during the pre-training process and the fine-tuning process, respectively. Experiments show that both methods effectively alleviate the overfitting caused by the small amount of data. In addition, Pre-Discard (BD) reduces the computation of the pre-training process while maintaining accuracy, accelerating model pre-training, i.e., improving training efficiency.
Therefore, the following two schemes are adopted for training in the embodiment of the invention:
(1) Pre-Discard (BD): the preset layer of the residual network structure (e.g., the highest stage) is removed during pre-training; during fine-tuning, suitable parameters for the preset layer of the recognition model are first initialized through a warm-up operation, and the model is then fine-tuned through iteration. The reduced residual structure with the highest stage removed is referred to in the embodiment of the invention as SimResNet.
(2) Post-Discard (AD): the complete residual network structure is trained during pre-training, and the parameters of the preset layer (e.g., the highest stage) are discarded during fine-tuning; that is, suitable highest-layer parameters are initialized for the model through a warm-up operation, and the model is then fine-tuned through iteration. Both methods effectively prevent the model from overfitting on fine-grained image datasets and improve recognition accuracy.
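A minimal PyTorch sketch of the two strategies, assuming a torchvision ResNet50 backbone and treating layer4 as the highest (preset) layer — the patent leaves the exact preset layer configurable — could look as follows:

```python
import torch.nn as nn
import torchvision

def build_sim_resnet50():
    """Pre-Discard (BD): pre-train a reduced encoder ("SimResNet") with the highest
    stage removed. layer3 of ResNet50 outputs 1024 channels, so any projection or
    classification head attached later must expect 1024-dimensional features."""
    m = torchvision.models.resnet50(weights=None)
    m.layer4 = nn.Identity()   # drop the highest stage before pre-training
    m.fc = nn.Identity()       # head is supplied by the SSL framework / fine-tuning
    return m

def apply_after_drop(pretrained, num_classes):
    """Post-Discard (AD): pre-train the complete encoder, then discard the pre-trained
    highest-stage parameters before fine-tuning by re-initialising them."""
    fresh = torchvision.models.resnet50(weights=None)
    pretrained.layer4.load_state_dict(fresh.layer4.state_dict())  # reset layer4
    pretrained.fc = nn.Linear(2048, num_classes)                  # fresh classification head
    return pretrained
```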
In one embodiment, step 102 may be implemented as follows:
(1) According to different target parameters, partitioning the fine-grained images according to the target parameters, and splicing the partitioned images according to an arrangement sequence different from the original arrangement sequence to obtain a recombined image; the target parameters include at least one of: the number of blocks in the abscissa direction, the number of blocks in the ordinate direction and the size of the blocks;
(2) And inputting the recombined images corresponding to the different target parameters into the recognition model to obtain the recognition result.
Specifically, fine-grained images are characterized by large intra-class variance and small inter-class variance; that is, for a given class, factors such as illumination, occlusion, and pose make it difficult for the model to cluster samples of that class together, which easily leads to misclassification. Most previous approaches avoid this effect by locating discriminative regions explicitly or implicitly, but such approaches ignore the features of the other parts of the image; in other words, none of them considers how to fuse information at different granularities. To address this problem, the embodiment of the invention proposes a self-supervised fine-grained recognition framework based on multi-granularity information fusion, i.e., a method that combines multi-granularity information fusion with self-supervised contrastive learning; the overall framework is shown in Fig. 2. The core of the method is to make the network model fuse information at several different granularities, which alleviates the problem of large intra-class variation. Extensive experiments demonstrate the effectiveness of the method: using only small-scale unlabeled fine-grained images, it surpasses the recognition accuracy of ImageNet supervised learning on the Stanford Cars and FGVC-Aircraft datasets and achieves accuracy close to ImageNet supervised learning on the CUB200-2011 dataset.
The fine-grained images are first analyzed intuitively. As can be seen from Figs. 3, 4, and 5, for a specific fine-grained class, large intra-class variations (pose, angle, etc.) can be mitigated by reducing the semantic granularity, so learning finer-granularity representations can reduce the impact of intra-class variance. In addition, Figs. 3, 4, and 5 show that information at different granularities is complementary: for the birds in the figures, learning the features of the beak helps learn the features of the head, and learning the features of the head in turn helps learn the overall features of the bird.
For fine-grained images, the differences between categories usually lie in small, subtle regions, so these local regions must be distinguished accurately. Inspired by the way the human perception system identifies a disturbed picture by distinguishing its local regions, a simple and effective method is designed in this embodiment to force the network model to attend to local regions: the fine-grained image is cut into blocks carrying information at different granularities, the order of the blocks is shuffled in a jigsaw-like manner to obtain blocks with different granularity characteristics, and the shuffled blocks are recombined into a new image, as shown in Figs. 6 and 7.
Specifically, given an input image I, the image is first divided into sub-regions, denoted I_{h,v} with 1 ≤ h, v ≤ N, where h and v are the indices in the width and height directions, respectively; the sub-regions are then shuffled and merged into a new image P. The hyperparameter N controls the granularity of the information carried by the generated image blocks. In this way the global information of the image is broken while its local information is preserved, forcing the network model to attend to local image features during training and improving recognition accuracy.
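A minimal sketch of this block-and-shuffle operation, assuming a PyTorch image tensor whose height and width are divisible by the chosen block count, is shown below; the function name global_disruption is introduced here only for illustration:

```python
import torch

def global_disruption(img, n):
    """Split an image tensor [C, H, W] into an n x n grid of blocks, permute the
    blocks randomly, and stitch them back together into a recombined image P."""
    c, h, w = img.shape
    bh, bw = h // n, w // n
    blocks = [img[:, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
              for i in range(n) for j in range(n)]
    perm = torch.randperm(len(blocks))
    blocks = [blocks[k] for k in perm]                                  # shuffle local regions
    rows = [torch.cat(blocks[r * n:(r + 1) * n], dim=2) for r in range(n)]
    return torch.cat(rows, dim=1)                                       # recombined image
```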
In this embodiment, the process in step (1) may be referred to as global information disruption: the global information of the image is disturbed while local information at different granularities is retained, so that images carrying different granularity information are generated, and the fine-grained image is recognized based on information at different granularity levels. The training process is consistent with the implementation of the recognition process.
Alternatively, step (2) may be implemented by:
and inputting the recombined images corresponding to the different target parameters into the identification model according to different stages according to a preset sequence to obtain the identification result.
Specifically, pre-training on unlabeled fine-grained images can be performed with two mutually cooperating methods, and the fine-grained image is then recognized with the recognition model obtained from this training.
These two methods are global information disruption and progressive multi-granularity fusion learning. The progressive multi-granularity fusion learning method adopts a progressive learning scheme, learning information at different granularity levels in stages and forcing the network to fuse the multi-granularity information.
Because the intra-class variance of fine-grained images is easily affected by factors such as pose and illumination, information at a single granularity is insufficient to accurately locate the discriminative regions. Instead, in the embodiment of the invention the recognition model is trained with a multi-granularity information fusion learning strategy under the self-supervised learning paradigm, and fusing image information at different granularities avoids large intra-class differences. The feature learning process of progressive multi-granularity fusion is shown in Figs. 2, 6, and 7: the numbers 1-5 in Fig. 2 indicate the preset order in which the recombined images are input, and the arrows in Figs. 6 and 7 indicate the same preset order. Specifically, the training of the recognition model is divided into S stages; in each stage, images carrying different granularity information produced by the global information disruption step are fed into the recognition model, and the model parameters obtained in the previous stage serve as the initialization for the next stage. Because the granularity information differs between stages, the recognition model is forced to mine fine local features from information at different granularities and fuse them, which yields high recognition accuracy. The image recognition process is similar to the training process and is not described again here.
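The staged schedule can be sketched as follows; the per-stage granularities, the helper contrastive_training_step, and the variables model, optimizer, train_loader, and epochs_per_stage are illustrative assumptions, since the text only states that training is divided into S stages fed with different granularity information:

```python
import torch

granularities = [8, 4, 2, 1]   # hypothetical block counts per stage, coarse to fine

for stage, n in enumerate(granularities, start=1):
    for epoch in range(epochs_per_stage):
        for imgs in train_loader:                 # unlabeled fine-grained images [B, C, H, W]
            views = torch.stack([global_disruption(x, n) for x in imgs])
            loss = contrastive_training_step(model, views)   # hypothetical SSL step
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    # The parameters trained in this stage automatically initialise the next stage,
    # because the same model object keeps being trained.
```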
Optionally, in order to improve the recognition accuracy, the preset sequence in the image recognition process is the same as the input sequence adopted in the training stage.
Optionally, the loss function used by the training process employs the following equation (1):
(1)
wherein the value on the left-hand side of formula (1) is the loss function value, N is the number of fine-grained images, (a_i, p_i) denotes a positive sample pair derived from the i-th fine-grained image, i ∈ {1, 2, ..., N}, n_i and n_j denote negative samples derived from fine-grained images other than the i-th among the N fine-grained images, and z_ai, z_pi, z_ni, z_nj denote the features of the images a_i, p_i, n_i, n_j extracted by the recognition model.
Specifically, the training process in the embodiment of the invention aims to pull similar sample pairs together in the latent space and push dissimilar sample pairs apart, thereby training a good model. This objective is achieved by minimizing the loss function value; the loss function used in the training process is given by formula (1).
The optimization of formula (1) drives the numerator toward 1 and the denominator toward 0; that is, it pulls positive samples closer together, forcing positive sample pairs to have higher similarity, while pushing positive and negative samples apart, reducing the similarity between positive and negative samples, so that the whole model is optimized without using any labels. Therefore, the most critical point in introducing the progressive multi-granularity fusion method into self-supervised contrastive learning is the construction of positive and negative sample pairs, which is shown in Fig. 8.
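Formula (1) itself appears as an image in the original publication and is not reproduced in this text. Purely as an assumption consistent with the description above (a numerator driven toward 1 by the positive-pair similarity and negative-pair terms in the denominator driven toward 0) and with the variable definitions given for formula (1), an InfoNCE-style contrastive loss of the following form would match the described behaviour:

```latex
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}
\log\frac{\exp\!\left(\operatorname{sim}(z_{a_i}, z_{p_i})/\tau\right)}
{\exp\!\left(\operatorname{sim}(z_{a_i}, z_{p_i})/\tau\right)
 + \sum_{j \neq i}\exp\!\left(\operatorname{sim}(z_{a_i}, z_{n_j})/\tau\right)}
```

Here sim(·,·) denotes cosine similarity and τ a temperature hyperparameter; both symbols are introduced only for illustration and are not stated in the original text.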
Specifically, the progressive multi-granularity fusion method uses image granularity as the selection criterion for positive and negative sample pairs. The conventional way of constructing positive and negative pairs in contrastive learning is to directly take different crops of the same image as a positive pair and crops of different images as negative pairs. As shown in Fig. 8, the positive and negative pairs here are constructed as follows: different crops of the same image are taken as a positive pair after global disruption, and crops of different images are taken as negative pairs after global disruption. For a specific image, the image itself serves as the anchor sample (Anchor); two images generated by randomly cropping it are taken as positive samples, and the two images obtained by applying global disruption to them are denoted a and p, so that (a, p) is a positive sample pair. An image different from the anchor serves as a negative sample of the anchor, and the image generated from it after global disruption is denoted n, forming an (a, p, n) triplet. The image features extracted from these three images by the feature extraction model (e.g., a CNN) in the recognition model are denoted z_a, z_p, z_n.
Optionally, in the actual training process, since the progressive multi-granularity learning method is introduced, a simple loss calculation scheme is adopted in this embodiment to fuse information at different granularities: the training process is divided into several stages, the loss function value of each stage is computed and the parameters are updated, and the parameter values of the previous stage are used as the initialization of the subsequent training stage. In this way, multi-granularity information is obtained through interaction between different granularities, and because the training process is divided into stages, a mini-batch gradient descent algorithm can easily be used to compute the loss function value.
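A minimal sketch of the per-stage loss computation on the (a, p, n) features, again assuming an InfoNCE-style form of formula (1) with cosine similarity and a temperature hyperparameter (assumptions not stated in the text), is given below:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_p, z_n, tau=0.07):
    """z_a, z_p: [B, D] features of the anchor and positive crops after global disruption.
    z_n: [B, K, D] features of K negative samples per anchor. tau: assumed temperature."""
    z_a = F.normalize(z_a, dim=1)
    z_p = F.normalize(z_p, dim=1)
    z_n = F.normalize(z_n, dim=2)
    pos = torch.exp((z_a * z_p).sum(dim=1) / tau)                        # positive-pair term
    neg = torch.exp(torch.einsum("bd,bkd->bk", z_a, z_n) / tau).sum(1)   # negative-pair terms
    return -torch.log(pos / (pos + neg)).mean()                          # mini-batch average
```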
The effectiveness of the Pre-Discard and Post-Discard methods described above is verified in the following embodiments on several commonly used fine-grained image recognition datasets using the MoCo self-supervised contrastive learning method. Only small-scale fine-grained image datasets are used for pre-training; no additional large-scale dataset is used for pre-training, and no ImageNet-pre-trained weights are loaded. The experimental setup and datasets are described first.
(1) CUB200-2011 dataset
The CUB200-2011 bird dataset is the benchmark dataset for current fine-grained image recognition. It contains 11,788 images of 200 bird sub-categories, with roughly 60 images per sub-category, divided into a training set of 5,994 images and a test set of 5,794 images. Each image provides the category label, the bounding box of the bird object, key-part annotations, and attribute information of the bird.
(2) Stanford cards dataset
The Stanford Cars dataset is also a classical dataset for fine-grained image recognition. It contains a total of 16,185 car images, and the objects to be recognized come from 196 different vehicle models. The images are split roughly evenly into two subsets: a training set containing 8,144 images and a test set containing 8,041 images.
(3) FGVC-Aircraft data set
The FGVC-Aircraft dataset is also a classical fine-grained image recognition benchmark. The objects to be recognized belong to 100 different aircraft classes, most of which are passenger aircraft. The dataset contains 10,000 aircraft images in total, 100 per category. Each image has a bounding box and hierarchical labels; the images are stored in JPEG format, named with seven digits plus a .jpg suffix, and have resolutions of roughly 1-2 MP. The training set contains 6,667 images and the test set contains 3,333 images.
Alternatively, the following parameters may be set in the experiment in the embodiment of the present invention:
the self-supervision learning method is adopted, the small-scale data set is utilized for experiments, the parameters of the ImageNet pre-training are not loaded, the common practice of self-supervision learning is followed, and all the experiments consist of two parts: pretraining and fine tuning.
In the pre-training phase, the image resolution is set to 224 × 224 and the initial learning rate to 0.03. Data augmentation includes random cropping, resizing, horizontal flipping, Gaussian noise, color distortion, etc. The learning rate is decayed with a cosine schedule. Note that the pre-training phase uses only the training-set images of the three mainstream public datasets described above and does not require any labels.
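By way of illustration, the pre-training augmentations and cosine learning-rate schedule described above could be assembled as follows; the exact augmentation magnitudes, momentum, and weight decay are assumptions, and encoder and pretrain_epochs are placeholders:

```python
import torch
from torchvision import transforms

pretrain_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),            # random cropping and resizing to 224x224
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),   # color distortion
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),  # additive Gaussian noise
])

optimizer = torch.optim.SGD(encoder.parameters(), lr=0.03, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=pretrain_epochs)
```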
In the fine-tuning phase, the learning rate is set to 0.03 and decayed with a cosine schedule. During the experiments a warm-up operation can be used: the parameters of the other layers are first frozen while the highest-layer parameters are warmed up, i.e., initial values are set for the highest-layer parameters. This gives the highest layer a good initialization and accelerates the subsequent convergence of the model. Global fine-tuning is then performed, i.e., all frozen parameters are released for training; the fine-tuning process uses only two data augmentations, horizontal flipping and cropping. Unless otherwise stated, the image resolution during fine-tuning is first adjusted to 256 × 256 and then randomly cropped to 224 × 224.
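The warm-up followed by global fine-tuning can be sketched as follows, assuming a ResNet50 encoder whose highest layer is layer4 plus the classification head fc (model and finetune_epochs are placeholders):

```python
import torch
from torchvision import transforms

# 1) Warm-up: freeze all parameters except the (re-initialised) highest layer and head.
for p in model.parameters():
    p.requires_grad = False
for p in list(model.layer4.parameters()) + list(model.fc.parameters()):
    p.requires_grad = True
warmup_opt = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.03, momentum=0.9)
# ... run a few warm-up epochs with warmup_opt ...

# 2) Global fine-tuning: release all frozen parameters.
for p in model.parameters():
    p.requires_grad = True
finetune_opt = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
finetune_sched = torch.optim.lr_scheduler.CosineAnnealingLR(finetune_opt, T_max=finetune_epochs)

# Fine-tuning uses only horizontal flipping and cropping as augmentation.
finetune_aug = transforms.Compose([
    transforms.Resize((256, 256)),     # first adjust resolution to 256x256
    transforms.RandomCrop(224),        # then randomly crop to 224x224
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```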
During pre-training, only the training set of the downstream target dataset (i.e., the dataset of the images to be recognized) is used, without any labels. The trained recognition model is then transferred to the downstream task to complete fine-grained image recognition. For comparison with other self-supervised fine-grained image recognition results, the evaluation follows common practice in the field: model performance is evaluated with the standard Top-1 accuracy, i.e., the accuracy of the class with the highest confidence from the softmax classifier, computed as shown in formula (2):
(2)
wherein y denotes the ground-truth label, ŷ denotes the class label with the highest confidence from the softmax classifier, N denotes the total number of test-set images, and I is an indicator function whose expression is shown in formula (3):
(3)
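Formulas (2) and (3) likewise appear as images in the original publication. Based on the variable definitions given above, a standard Top-1 accuracy and indicator function consistent with that description would plausibly read:

```latex
\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} I\!\left(\hat{y}_i = y_i\right),
\qquad
I(x) =
\begin{cases}
1, & \text{if } x \text{ is true},\\
0, & \text{otherwise}.
\end{cases}
```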
alternatively, a hyper-parameter optimization experiment may be performed during the training process to set the appropriate hyper-parameters for the recognition model.
The fine-grained image recognition device provided by the invention is described below, and the fine-grained image recognition device described below and the fine-grained image recognition method described above can be referred to correspondingly.
Fig. 9 is a schematic structural diagram of a fine-grained image recognition device provided by the invention. As shown in fig. 9, the fine-grained image recognition apparatus provided in the present embodiment includes:
an acquisition module 110, configured to acquire a fine-grained image to be identified;
the processing module 120 is configured to identify the fine-grained image by using an identification model, so as to obtain an identification result; the parameters of the preset layers of the identification model are preset values, the parameters of the other layers of the identification model except the preset layers are obtained based on label-free training, and the number of the preset layers is higher than that of the layers except the preset layers.
Optionally, the training process of the recognition model comprises a pre-training process and a fine tuning process; the processing module 120 is further configured to:
training the layers of the identification model other than the preset layer in the pre-training process; in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the identification model; or,
training the identification model in the pre-training process; and, in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the identification model.
Optionally, the processing module 120 is specifically configured to:
according to different target parameters, partitioning the fine-grained images according to the target parameters, and splicing the partitioned images according to an arrangement sequence different from the original arrangement sequence to obtain a recombined image; the target parameters include at least one of: the number of blocks in the abscissa direction, the number of blocks in the ordinate direction and the size of the blocks;
and inputting the recombined images corresponding to the different target parameters into the recognition model to obtain the recognition result.
Optionally, the processing module 120 is specifically configured to:
and inputting the recombined images corresponding to the different target parameters into the identification model according to different stages according to a preset sequence to obtain the identification result.
Optionally, the preset sequence is the same as the input sequence used in the training phase.
Optionally, the loss function used by the training process uses the following formula:
(1)
wherein the value on the left-hand side of formula (1) is the loss function value, N is the number of fine-grained images, (a_i, p_i) denotes a positive sample pair derived from the i-th fine-grained image, i ∈ {1, 2, ..., N}, n_i and n_j denote negative samples derived from fine-grained images other than the i-th among the N fine-grained images, and z_ai, z_pi, z_ni, z_nj denote the features of the images a_i, p_i, n_i, n_j extracted by the recognition model.
Optionally, the identification model is a model built based on a residual network structure.
The device of the embodiment of the present invention is configured to perform the method of any of the foregoing method embodiments, and its implementation principle and technical effects are similar, and are not described in detail herein.
Fig. 10 illustrates a physical structure diagram of an electronic device, as shown in fig. 10, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a fine-grained image recognition method comprising: acquiring a fine-grained image to be identified;
identifying the fine-grained image by using an identification model to obtain an identification result; the parameters of the preset layers of the identification model are preset values, the parameters of the layers of the identification model other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the layers other than the preset layers.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the fine-grained image recognition method provided by the methods described above, the method comprising: acquiring a fine-grained image to be identified;
identifying the fine-grained image by using an identification model to obtain an identification result; the parameters of the preset layers of the identification model are preset values, the parameters of the layers of the identification model other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the layers other than the preset layers.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the fine-grained image recognition method provided by the above methods, the method comprising: acquiring a fine-grained image to be identified;
identifying the fine-grained image by using an identification model to obtain an identification result; the parameters of the preset layers of the identification model are preset values, the parameters of the layers of the identification model other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the layers other than the preset layers.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A fine-grained image recognition method, characterized by comprising:
acquiring a fine-grained image to be identified;
identifying the fine-grained image by using an identification model to obtain an identification result;
the parameters of the preset layers of the identification model are preset values, the parameters of the layers of the identification model other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the layers other than the preset layers;
the training process of the identification model comprises a pre-training process and a fine-tuning process; the method further comprises the steps of:
training the layers of the identification model other than the preset layer in the pre-training process; in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the identification model; or,
training the identification model in the pre-training process; and, in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the identification model.
2. The fine-grained image recognition method according to claim 1, wherein the recognizing the fine-grained image by using the recognition model to obtain the recognition result comprises:
according to different target parameters, partitioning the fine-grained images according to the target parameters, and splicing the partitioned images according to an arrangement sequence different from the original arrangement sequence to obtain a recombined image; the target parameters include at least one of: the number of blocks in the abscissa direction, the number of blocks in the ordinate direction and the size of the blocks;
and inputting the recombined images corresponding to the different target parameters into the recognition model to obtain the recognition result.
3. The fine-grained image recognition method according to claim 2, wherein the inputting the reorganized images corresponding to the different target parameters into the recognition model to obtain the recognition result includes:
and inputting the recombined images corresponding to the different target parameters into the identification model according to different stages according to a preset sequence to obtain the identification result.
4. The fine-grained image recognition method according to claim 3, wherein,
the preset sequence is the same as the input sequence adopted in the training stage.
5. The fine-grained image recognition method according to claim 1, wherein the loss function used in the training process employs the following formula (1):
(1)
wherein the value on the left-hand side of formula (1) is the loss function value, N is the number of fine-grained images, (a_i, p_i) denotes a positive sample pair derived from the i-th fine-grained image, i ∈ {1, 2, ..., N}, n_i and n_j denote negative samples derived from fine-grained images other than the i-th among the N fine-grained images, and z_ai, z_pi, z_ni, z_nj denote the features of the images a_i, p_i, n_i, n_j extracted by the recognition model.
6. The fine-grained image recognition method according to claim 1, wherein the recognition model is a model built based on a residual network structure.
7. A fine-grained image recognition device, characterized by comprising:
the acquisition module is used for acquiring the fine-grained image to be identified;
the processing module is used for identifying the fine-grained image by utilizing an identification model to obtain an identification result; the parameters of the preset layers of the identification model are preset values, the parameters of the layers of the identification model other than the preset layers are obtained based on label-free training, and the layer level of the preset layers is higher than that of the layers other than the preset layers;
the training process of the identification model comprises a pre-training process and a fine-tuning process; the processing module is further configured to:
training the layers of the identification model other than the preset layer in the pre-training process; in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the identification model; or,
training the identification model in the pre-training process; and, in the fine-tuning process, setting the parameters of the preset layer to the preset values and training the identification model.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the fine-grained image recognition method of any of claims 1-6 when the program is executed by the processor.
9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the fine-grained image recognition method according to any of the claims 1 to 6.
CN202310404342.9A 2023-04-17 2023-04-17 Fine granularity image recognition method, device, equipment and storage medium Active CN116188916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310404342.9A CN116188916B (en) 2023-04-17 2023-04-17 Fine granularity image recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310404342.9A CN116188916B (en) 2023-04-17 2023-04-17 Fine granularity image recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116188916A CN116188916A (en) 2023-05-30
CN116188916B true CN116188916B (en) 2023-07-28

Family

ID=86452332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310404342.9A Active CN116188916B (en) 2023-04-17 2023-04-17 Fine granularity image recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116188916B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471509A (en) * 2021-06-10 2022-12-13 腾讯科技(深圳)有限公司 Image processing method, device, equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163234B (en) * 2018-10-10 2023-04-18 腾讯科技(深圳)有限公司 Model training method and device and storage medium
CN110852227A (en) * 2019-11-04 2020-02-28 中国科学院遥感与数字地球研究所 Hyperspectral image deep learning classification method, device, equipment and storage medium
CN112560874B (en) * 2020-12-25 2024-04-16 北京百度网讯科技有限公司 Training method, device, equipment and medium for image recognition model
CN115331275A (en) * 2021-04-23 2022-11-11 伊姆西Ip控股有限责任公司 Image processing method, computer system, electronic device, and program product
CN114299343A (en) * 2021-12-31 2022-04-08 中山大学 Multi-granularity information fusion fine-granularity image classification method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471509A (en) * 2021-06-10 2022-12-13 腾讯科技(深圳)有限公司 Image processing method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN116188916A (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN112035669B (en) Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling
Dabkowski et al. Real time image saliency for black box classifiers
De Rezende et al. Exposing computer generated images by using deep convolutional neural networks
JP6557783B2 (en) Cascade neural network with scale-dependent pooling for object detection
CN110914831B (en) Method and apparatus for analyzing images
CN113272827A (en) Validation of classification decisions in convolutional neural networks
US11803971B2 (en) Generating improved panoptic segmented digital images based on panoptic segmentation neural networks that utilize exemplar unknown object classes
JP4567660B2 (en) A method for determining a segment of an object in an electronic image.
Douillard et al. Tackling catastrophic forgetting and background shift in continual semantic segmentation
CN113807237B (en) Training of in vivo detection model, in vivo detection method, computer device, and medium
Wu et al. A new multiple‐distribution GAN model to solve complexity in end‐to‐end chromosome karyotyping
CN116188916B (en) Fine granularity image recognition method, device, equipment and storage medium
CN112508966A (en) Interactive image segmentation method and system
Carloni et al. Causality-driven one-shot learning for prostate cancer grading from mri
Lee et al. Backbone alignment and cascade tiny object detecting techniques for dolphin detection and classification
CN112507912B (en) Method and device for identifying illegal pictures
Liang et al. Large-scale image classification using fast svm with deep quasi-linear kernel
CN114565751A (en) OCR recognition model training method, OCR recognition method and related device
Phan et al. Lspd: A large-scale pornographic dataset for detection and classification
EP2698693B1 (en) Local image translating method and terminal with touch screen
Antonio et al. A Survey on Scanned Receipts OCR and Information Extraction
Xu et al. One-class classification with deep adversarial learning
Quoc et al. Human ear-side detection based on YOLOv5 detector and deep neural networks
Hammadi et al. Face recognition using deep learning methods a review
Keddous et al. Inference Acceleration of Deep Learning Classifiers Based on RNN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant