CN112633276A - Training method, recognition method, device, equipment and medium - Google Patents

Training method, recognition method, device, equipment and medium

Info

Publication number
CN112633276A
Authority
CN
China
Prior art keywords
sample
image
dimensions
positive
loss
Prior art date
Legal status
Pending
Application number
CN202011566347.4A
Other languages
Chinese (zh)
Inventor
李辉
王洪志
董青
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011566347.4A
Publication of CN112633276A

Classifications

    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI] (image preprocessing)
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting (pattern recognition)
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks (neural network architectures)
    • G06N 3/08: Learning methods (neural networks)
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components

Abstract

The application discloses a training method for an image recognition model, together with an image recognition method, apparatus, device, medium and program product. It relates to the field of artificial intelligence, and in particular to computer vision and deep learning technologies. The scheme is implemented as follows: acquire a sample image and its annotation information, where the annotation information includes initial annotation information labeled along the positive/negative sample dimension and at least one piece of fine-grained annotation information obtained by dividing the positive samples in the sample image along different fine-grained dimensions; input the sample image into a pre-constructed image recognition model, where the model includes at least two independent convolution layers and different convolution layers extract feature vectors of the sample image's feature map from different dimensions; and supervise the training of the model with the loss function corresponding to each dimension, according to the annotation information of that dimension, with each loss propagated back to the convolution layer of the corresponding dimension. The scheme can improve image recognition accuracy.

Description

Training method, recognition method, device, equipment and medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to computer vision and deep learning techniques, and more particularly, to a training method for an image recognition model, an image recognition method, apparatus, device, medium, and program product.
Background
Currently, in the production of POI (Point of Interest) data, a large number of signboard images need to be produced through collection and recognition.
However, real-world data acquisition yields many images that are occluded, blurred or contain no signboard at all, which hampers recognition of the signboard images and in turn degrades the quality of the POI data.
Disclosure of Invention
The present application provides a training method for an image recognition model, an image recognition method, an apparatus, a device, a medium, and a program product, so as to improve image recognition accuracy.
In a first aspect, the present application provides a training method for an image recognition model, including:
acquiring a sample image and annotation information thereof, wherein the annotation information comprises initial annotation information labeled according to the positive/negative sample dimension and at least one piece of fine-grained annotation information obtained by dividing the positive samples in the sample image according to different fine-grained dimensions;
inputting the sample image into a pre-constructed image recognition model, wherein the image recognition model comprises at least two independent convolution layers, and different convolution layers are used for extracting feature vectors of the feature map of the sample image from different dimensions;
and performing supervised training on the image recognition model with the loss function corresponding to each dimension, according to the annotation information of the sample image in that dimension, wherein each loss is propagated back to the convolution layer of the corresponding dimension.
In a second aspect, the present application further provides an image recognition method, including:
extracting a feature map of an image to be recognized by using an image recognition model trained with the training method according to any embodiment of the present application;
extracting feature vectors of the feature map from different dimensions by using at least two independent convolution layers in the image recognition model, wherein the dimensions comprise the positive/negative sample dimension and fine-grained dimensions;
and classifying the image to be recognized according to the output result of the convolution layer corresponding to the positive/negative sample dimension.
In a third aspect, the present application further provides a training apparatus for an image recognition model, including:
a sample image information acquisition module, configured to acquire a sample image and annotation information thereof, wherein the annotation information comprises initial annotation information labeled according to the positive/negative sample dimension and at least one piece of fine-grained annotation information obtained by dividing the positive samples in the sample image according to different fine-grained dimensions;
a sample image input module, configured to input the sample image into a pre-constructed image recognition model, wherein the image recognition model comprises at least two independent convolution layers, and different convolution layers are used for extracting feature vectors of the feature map of the sample image from different dimensions;
and a supervised training module, configured to perform supervised training on the image recognition model with the loss function corresponding to each dimension, according to the annotation information of the sample image in that dimension, wherein each loss is propagated back to the convolution layer of the corresponding dimension.
In a fourth aspect, the present application further provides an image recognition apparatus, including:
a feature map extraction module, configured to extract a feature map of an image to be recognized by using an image recognition model trained with the training apparatus according to any embodiment of the present application;
a feature vector extraction module, configured to extract feature vectors of the feature map from different dimensions by using at least two independent convolution layers in the image recognition model, wherein the dimensions comprise the positive/negative sample dimension and fine-grained dimensions;
and an image classification module, configured to classify the image to be recognized according to the output result of the convolution layer corresponding to the positive/negative sample dimension.
In a fifth aspect, the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of training an image recognition model according to any embodiment of the present application.
In a sixth aspect, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for training an image recognition model according to any of the embodiments of the present application.
In a seventh aspect, the present application further provides a computer program product comprising a computer program, which when executed by a processor, implements the training method for an image recognition model according to any embodiment of the present application.
In an eighth aspect, the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image recognition method of any embodiment of the present application.
In a ninth aspect, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image recognition method according to any of the embodiments of the present application.
In a tenth aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the image recognition method according to any of the embodiments of the present application.
According to the technical solution of the present application, the positive samples in a sample image are given at least one fine-grained annotation along different fine-grained dimensions; at least two independent convolution layers in the model extract feature vectors of the sample image's feature map from different dimensions; and the model is supervised with the loss function of each dimension according to the multi-dimensional fine-grained annotations. The model thereby fully learns the depth features of the sample image, noise interference is reduced, image recognition accuracy is improved, annotation cost is reduced, and model training efficiency is improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the present application, nor to limit its scope. Other features of the present application will become readily apparent from the following description, and the effects of the above alternatives are described below in conjunction with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow chart diagram of a training method of an image recognition model according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram of a training method of an image recognition model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a training method of an image recognition model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of an image recognition method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for training an image recognition model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an image recognition device according to an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device for implementing a training method of an image recognition model according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, read in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
FIG. 1 is a schematic flowchart of a training method of an image recognition model according to an embodiment of the present application. The method is applicable to training an image recognition model, for example a model for recognizing signboard images, and relates to the field of artificial intelligence, in particular to technologies such as computer vision and deep learning. The method can be executed by a training apparatus for an image recognition model, which is implemented in software and/or hardware and is preferably configured in an electronic device, such as a computer device or a server. As shown in FIG. 1, the method specifically includes the following steps:
s101, obtaining a sample image and marking information thereof, wherein the marking information comprises initial marking information marked according to positive and negative sample dimensions and at least one fine-grained marking information divided by positive samples in the sample image according to different fine-grained dimensions.
After the model is initially built, a large number of sample images are needed to be used for supervised training of the model, and before training, the sample images are also needed to be labeled, for example, whether each sample image is a positive sample or a negative sample is labeled. In the field of image recognition, the labeling of sample images is a very large and difficult task, and the insufficient model training can be caused by too few labeled sample images or inaccurate labeling, so that the accuracy of model recognition is insufficient.
In the field of image recognition, if only marking positive and negative samples is carried out on a sample image in some scenes, the requirements for image recognition in the scenes cannot be met. For example, in the production process of POI data, a large number of signboard images are required to be identified from the captured images, and some of the signboard images may have occlusion or blur, but if the occlusion or blur does not affect the visibility of the name of the whole signboard, the POI data can still be used. Therefore, if only positive and negative samples are marked in the model training stage according to the binary classification method in the prior art, and the sample images are not divided into fine particles, but the signboard images are simply divided into good and bad samples, the model is difficult to accurately identify the content in the signboard images.
In the embodiment of the application, not only positive and negative samples are marked, but also the positive samples are further marked according to different fine-grained dimensions, for example, whether shielding or blurring exists in the positive samples is marked, and further, the blurring grade can be marked, so that the depth characteristics of the samples can be fully learned by the model in the subsequent supervision training process of the model, noise (shielding, blurring and the like) interference is reduced, more discriminative characteristics are learned, and the identification precision of the signboard images is finally improved.
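As an illustration only, an annotation record under this scheme might take the following shape; the field names and the blur scale are hypothetical, not prescribed by the patent:

```python
# Illustrative annotation records (field names and blur scale are hypothetical).
# A negative sample carries only the positive/negative label; a positive sample
# additionally carries the fine-grained labels: occlusion (binary) and blur grade.
negative_sample = {
    "image": "img_0001.jpg",
    "is_positive": 0,   # initial annotation: negative sample
}
positive_sample = {
    "image": "img_0002.jpg",
    "is_positive": 1,   # initial annotation: positive sample
    "occluded": 1,      # fine-grained dimension 1: occlusion present
    "blur_grade": 2,    # fine-grained dimension 2: blur level, e.g. 0 (sharp) to 3
}
```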
S102, inputting the sample image into a pre-constructed image recognition model, wherein the image recognition model comprises at least two independent convolution layers, and different convolution layers are used for extracting feature vectors of the feature map of the sample image from different dimensions.
The feature map can be extracted by a feature extraction network in the image recognition model; each independent convolution layer then extracts feature vectors of the feature map from its own dimension. For example, the model includes a convolution layer corresponding to the positive/negative sample dimension and convolution layers corresponding to the fine-grained dimensions, each fine-grained division corresponding to one convolution layer.
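A minimal sketch of such a model is shown below, assuming PyTorch, a ResNet-18 backbone, 1 x 1 convolution branches and the occlusion/blur dimensions of the signboard example; none of these concrete choices are mandated by the patent:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiGranularityRecognizer(nn.Module):
    """Shared feature-extraction backbone followed by independent 1x1
    convolution branches, one per dimension (positive/negative, occlusion,
    blur grade), each ending in a fully connected classifier head."""

    def __init__(self, num_blur_grades: int = 4):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Everything up to the last conv stage serves as the feature
        # extraction network; its output is the feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        c = 512  # channel count of the resnet18 feature map

        self.conv_posneg = nn.Conv2d(c, 256, kernel_size=1)  # pos/neg dimension
        self.conv_occ = nn.Conv2d(c, 256, kernel_size=1)     # occlusion dimension
        self.conv_blur = nn.Conv2d(c, 256, kernel_size=1)    # blur-level dimension
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc_posneg = nn.Linear(256, 1)                   # binary logit
        self.fc_occ = nn.Linear(256, 1)                      # binary logit
        self.fc_blur = nn.Linear(256, num_blur_grades)       # multi-class logits

    def forward(self, x):
        fmap = self.backbone(x)

        def head(conv, fc):
            # Independent branch: 1x1 conv, global pooling, FC head.
            return fc(self.pool(conv(fmap)).flatten(1))

        return {
            "posneg": head(self.conv_posneg, self.fc_posneg),
            "occ": head(self.conv_occ, self.fc_occ),
            "blur": head(self.conv_blur, self.fc_blur),
        }
```

Keeping the branches independent is what allows each dimension's loss to update only its own convolution layer while all dimensions share the feature extraction network.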
S103, according to the annotation information of the sample image in each dimension, performing supervised training on the image recognition model with the loss function corresponding to that dimension, wherein each loss is propagated back to the convolution layer of the corresponding dimension.
In the signboard example, each sample image is labeled positive or negative, and each positive sample is additionally labeled with whether occlusion exists and with a blur level. The model then contains three independent convolution layers in total, corresponding respectively to the positive/negative sample dimension, the occlusion dimension and the blur-level dimension, and extracting feature vectors of the feature map from those three dimensions. Each dimension has its own loss function; the loss arising from the annotation information of a dimension is computed with that dimension's loss function and propagated back to the corresponding convolution layer, updating that layer's network parameters and thereby achieving supervised training. These losses are also propagated back to the feature extraction network to update its network parameters.
For example, if the annotation information indicates that the current sample image is a negative sample, then, since negative samples carry no fine-grained annotation, only the loss from the initial annotation information is computed, using the loss function of the positive/negative sample dimension, and propagated back to the convolution layer of that dimension and to the feature extraction network. If the annotation information indicates that the current sample image is a positive sample, then, besides computing the loss from the initial annotation information with the loss function of the positive/negative sample dimension, the loss from each piece of fine-grained annotation information is computed with the corresponding fine-grained loss function, and each computed loss is propagated back to the convolution layer of its dimension and to the feature extraction network.
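The routing just described can be sketched as follows, handling one sample at a time for clarity; it builds on the model sketch above, and the concrete loss choices are assumptions consistent with the two-class and multi-class dimensions named later in this description:

```python
import torch
import torch.nn.functional as F

def sample_losses(outputs, labels):
    """Per-sample loss routing: negative samples yield only the
    positive/negative loss; positive samples yield one extra loss per
    fine-grained dimension. `outputs` is the dict from the model sketch;
    `labels` follows the illustrative annotation schema."""
    y_posneg = torch.as_tensor([labels["is_positive"]], dtype=torch.float32)
    loss_posneg = F.binary_cross_entropy_with_logits(
        outputs["posneg"].squeeze(1), y_posneg)

    if labels["is_positive"] == 0:
        # Negative sample: no fine-grained annotation, so only the
        # positive/negative loss is propagated back (through its conv
        # branch and the shared feature extraction network).
        return {"posneg": loss_posneg}

    # Positive sample: one additional loss per fine-grained dimension,
    # each propagated back through its own conv branch.
    y_occ = torch.as_tensor([labels["occluded"]], dtype=torch.float32)
    y_blur = torch.as_tensor([labels["blur_grade"]])
    loss_occ = F.binary_cross_entropy_with_logits(outputs["occ"].squeeze(1), y_occ)
    loss_blur = F.cross_entropy(outputs["blur"], y_blur)
    return {"posneg": loss_posneg, "occ": loss_occ, "blur": loss_blur}
```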
Therefore, through training with multi-dimensional fine-grained annotations, the model gains discriminative power over image content. For signboard images in particular, it can recognize images that are occluded or blurred but whose signboard name is unaffected, so its recognition accuracy on signboard images is higher. In the annotation stage, only the positive samples need fine-grained annotation, which reduces annotation cost and improves model training efficiency.
In short, in the technical solution of this embodiment, the positive samples in the sample image are given at least one fine-grained annotation along different fine-grained dimensions, at least two independent convolution layers in the model extract feature vectors of the sample image's feature map from different dimensions, and the model is supervised with the loss function of each dimension according to the multi-dimensional fine-grained annotations.
FIG. 2 is a schematic flowchart of a training method of an image recognition model according to an embodiment of the present application; this embodiment further optimizes the embodiment above. As shown in FIG. 2, the method specifically includes the following steps:
s201, obtaining a sample image and marking information thereof, wherein the marking information comprises initial marking information marked according to positive and negative sample dimensions and at least one fine-grained marking information divided according to different fine-grained dimensions of a positive sample in the sample image.
S202, inputting the sample image into a pre-constructed image recognition model, wherein the image recognition model comprises at least two independent convolution layers, and different convolution layers are used for extracting feature vectors of feature maps of the sample image from different dimensions.
Wherein, the feature map can be extracted by a feature extraction network in the image recognition model.
S203, if the initial annotation information indicates that the sample image is a negative sample, computing the loss from the negative sample's initial annotation information with the loss function of the positive/negative sample dimension, and propagating it back to the convolution layer of that dimension.
That is, since negative samples carry no fine-grained annotation, only the loss from the initial annotation information is computed, using the initial annotation information of the negative sample and the loss function of the positive/negative sample dimension, and it is propagated back to the convolution layer of that dimension and to the feature extraction network.
S204, if the initial annotation information indicates that the sample image is a positive sample, computing the loss from each piece of the positive sample's fine-grained annotation information with the loss function of the corresponding fine-grained dimension, and propagating each loss back to the convolution layer of its fine-grained dimension; computing the loss from the positive sample's initial annotation information with the loss function of the positive/negative sample dimension; fusing the losses from the fine-grained annotation information with the loss from the initial annotation information to obtain the total loss of the positive sample; and propagating the total loss back to the convolution layer of the positive/negative sample dimension.
That is, when the sample image is a positive sample, besides computing the loss from the initial annotation information with the loss function of the positive/negative sample dimension, the loss from each fine-grained annotation must be computed with its corresponding loss function. For example, if occlusion is annotated in the positive sample, the occlusion loss is computed from the loss function of the occlusion dimension and the output of the corresponding convolution layer, and propagated back to that convolution layer; if the positive sample is annotated with a blur level, the blur loss is computed from the loss function of that dimension and the output of the corresponding convolution layer, and propagated back to that layer. Each of these losses is also propagated back to the feature extraction network.
It should further be noted that, after the fine-grained losses are computed, the loss from each piece of fine-grained annotation information is fused with the loss from the positive sample's initial annotation information to obtain the total loss of the positive sample, which is propagated back to the convolution layer of the positive/negative sample dimension. Fusing the losses of the fine-grained dimensions with the loss of the positive/negative sample dimension takes the mutual influence of the dimensions into account: the loss of every fine-grained dimension is folded into the loss arising from the positive/negative labels, and the fusion result is returned to the convolution layer of the positive/negative sample dimension. In a model trained this way, the influence of each fine-grained dimension has already been absorbed when, at test time, the convolution layer of the positive/negative sample dimension extracts the feature vector of the feature map; the model is thus more discriminative toward fine-grained image content, and classifying the image by the output of this layer gives more accurate results.
Specifically, in an embodiment, fusing the losses from the fine-grained annotation information with the loss from the positive sample's initial annotation information may include: weighting the loss from the initial annotation information of the positive sample, with the loss from each piece of fine-grained annotation information serving as a weight. That is, each fine-grained loss is used as a weight that multiplies the loss from the initial annotation information of the positive sample, or a weighted summation is performed, yielding the final total loss.
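One way this fusion could look in code is sketched below; the description names both a multiplicative and a weighted-sum variant, and detaching the weights (so the fusion term does not add a second gradient path for the fine-grained branches) is an added assumption:

```python
import torch

def fuse_losses(loss_posneg, fine_losses):
    """Fusion sketch: each fine-grained loss serves as a weight on the
    positive/negative loss."""
    # Variant A: multiply the positive/negative loss by each fine-grained loss.
    total = loss_posneg
    for l in fine_losses:
        total = total * l.detach()
    # Variant B (alternative): scale the positive/negative loss by a weighted
    # sum of the fine-grained losses (uniform weights assumed here).
    # total = loss_posneg * sum(l.detach() for l in fine_losses)
    return total
```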
FIG. 3 is a schematic flow diagram of a training method of an image recognition model according to an embodiment of the present application. As shown in FIG. 3, an input sample image (Input), for example a signboard image, first passes through the feature extraction network (Backbone) of the model to obtain a feature map (Feature map); three independent convolution layers (1 x 1 conv) then extract feature vectors along the blur dimension, the occlusion dimension and the original positive/negative sample dimension, and each output passes through a fully connected layer (FC). From the annotation information of the input sample image, a blur loss, an occlusion loss and a positive/negative sample loss are computed respectively, and the blur loss and occlusion loss are both applied as weights to re-weight (Re-weight) the positive/negative sample loss. The blur loss and occlusion loss are each propagated back to their own convolution layers, the weighted total loss is propagated back to the convolution layer of the positive/negative sample dimension, the network parameters of the convolution layer of each dimension are updated respectively, and all losses are also propagated back to the feature extraction network to update its network parameters.
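Putting the pieces together, a single training step might look like the following sketch; it assumes the MultiGranularityRecognizer, sample_losses and fuse_losses sketches above, and the optimizer and its hyperparameters are likewise assumptions. One backward pass suffices: autograd routes each loss through the convolution branch that produced it, and every loss also reaches the shared backbone, matching the propagation described for FIG. 3.

```python
import torch

model = MultiGranularityRecognizer()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(image, labels):
    optimizer.zero_grad()
    outputs = model(image.unsqueeze(0))   # add a batch dimension
    losses = sample_losses(outputs, labels)
    if "occ" in losses:
        # Positive sample: fine-grained losses update their own branches,
        # and the fused (re-weighted) loss updates the pos/neg branch.
        fine = [losses["occ"], losses["blur"]]
        total = sum(fine) + fuse_losses(losses["posneg"], fine)
    else:
        # Negative sample: only the positive/negative loss.
        total = losses["posneg"]
    total.backward()   # gradients flow to each conv branch and the backbone
    optimizer.step()
    return total.item()
```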
In the signboard image recognition example, the loss function of the positive/negative sample dimension can be any prior-art loss function suited to binary classification, and so can the loss function of the occlusion fine-grained dimension. For the blur-level dimension, where blur is divided by grade, the corresponding loss function can be any prior-art loss function suited to multi-class classification, such as a cross-entropy loss function.
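By way of illustration, one possible instantiation of these loss functions (an assumption; any equivalent binary and multi-class losses would serve):

```python
import torch.nn as nn

criterion_posneg = nn.BCEWithLogitsLoss()  # two-class: positive vs. negative
criterion_occ = nn.BCEWithLogitsLoss()     # two-class: occlusion present or not
criterion_blur = nn.CrossEntropyLoss()     # multi-class: blur grade
```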
In sum, in the technical solution of this embodiment, the positive samples in the sample image are given at least one fine-grained annotation along different fine-grained dimensions, at least two independent convolution layers in the model extract feature vectors of the sample image's feature map from different dimensions, and the model is supervised with the loss function of each dimension according to the multi-dimensional fine-grained annotations, with the losses fused as described above.
FIG. 4 is a flowchart of an image recognition method according to an embodiment of the present application. The method is applicable to recognizing images with a trained image recognition model, for example recognizing signboard images, and relates to the field of artificial intelligence, in particular to technologies such as computer vision and deep learning. The method may be performed by an image recognition apparatus, which is implemented in software and/or hardware and is preferably configured in an electronic device, such as a computer device or a server. As shown in FIG. 4, the method specifically includes the following steps:
s401, extracting a feature map of an image to be recognized by using the image recognition model trained by the training method of the image recognition model according to any embodiment of the application.
The image recognition model comprises a feature extraction network used for extracting a feature map of an input image. The image to be recognized is, for example, a signboard image.
S402, extracting feature vectors of the feature map from different dimensions by using at least two independent convolution layers in the image recognition model, wherein the dimensions comprise the positive/negative sample dimension and fine-grained dimensions.
The image recognition model includes at least two independent convolution layers for extracting feature vectors of the feature map from different dimensions, as well as fully connected layers: each independent convolution layer is followed by a corresponding fully connected layer, which reduces the dimensionality of the feature vector extracted by that convolution layer.
S403, classifying the image to be recognized according to the output result of the convolution layer corresponding to the positive/negative sample dimension.
Because the image recognition model is trained with the training method described in any embodiment of the present application, the mutual influence of the different dimensions has already been taken into account when the convolution layer of the positive/negative sample dimension extracts the feature vector. Each dimension has a certain discriminative capability, the influence of noise such as occlusion or blur in the image can be excluded, and signboard images are extracted with higher accuracy.
In addition, in the embodiments of the present application, identifiers of the image to be recognized in the fine-grained dimensions can be obtained from the output results of the convolution layers corresponding to those dimensions. That is, once the model is trained, the convolution layer of each fine-grained dimension can extract features along that dimension, so for any image to be recognized, its scores in those dimensions, for example an occlusion score or a blur grade, can be determined from the output of the corresponding convolution layer. These scores can serve as identifiers of the image, be used in any scenario, and even serve as an image annotation method, improving annotation efficiency.
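An inference sketch covering both the classification of S403 and these fine-grained scores (it assumes the MultiGranularityRecognizer sketch above; the 0.5 decision threshold is an assumption):

```python
import torch

def recognize(model, image, threshold=0.5):
    """The pos/neg head classifies the image; the fine-grained heads yield
    the occlusion score and blur grade described above."""
    model.eval()
    with torch.no_grad():
        out = model(image.unsqueeze(0))
        is_positive = torch.sigmoid(out["posneg"]).item() > threshold
        occlusion_score = torch.sigmoid(out["occ"]).item()
        blur_grade = out["blur"].argmax(dim=1).item()
    return is_positive, occlusion_score, blur_grade
```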
Of course, an image recognition model trained with the training method of any embodiment of the present application may also classify the image according to a fusion of the output results of the convolution layers of the positive/negative sample dimension and each fine-grained dimension; the embodiments of the present application place no limit on this.
According to the technical solution of this embodiment, images are classified and recognized with an image recognition model trained by the training method of any embodiment of the present application, which improves image recognition accuracy. For signboard images in particular, usable images whose signboard name remains legible despite occlusion or blur can be recognized; recognition of signboard image content is not disturbed by noise, its accuracy is higher, and high-quality POI data can be produced.
FIG. 5 is a schematic structural diagram of a training apparatus for an image recognition model according to an embodiment of the present application. The apparatus is applicable to training an image recognition model, for example a model for recognizing signboard images, and relates to the field of artificial intelligence, in particular to technologies such as computer vision and deep learning. The apparatus can implement the training method of the image recognition model of any embodiment of the present application. As shown in FIG. 5, the apparatus 500 specifically includes:
a sample image information acquisition module 501, configured to acquire a sample image and annotation information thereof, wherein the annotation information comprises initial annotation information labeled according to the positive/negative sample dimension and at least one piece of fine-grained annotation information obtained by dividing the positive samples in the sample image according to different fine-grained dimensions;
a sample image input module 502, configured to input the sample image into a pre-constructed image recognition model, wherein the image recognition model comprises at least two independent convolution layers, and different convolution layers are used for extracting feature vectors of the feature map of the sample image from different dimensions;
and a supervised training module 503, configured to perform supervised training on the image recognition model with the loss function corresponding to each dimension, according to the annotation information of the sample image in that dimension, wherein each loss is propagated back to the convolution layer of the corresponding dimension.
Optionally, the supervised training module 503 comprises a first supervised training unit, configured to perform the following operations:
and according to the initial labeling information, if the sample image is a negative sample, calculating the loss brought by the initial labeling information of the negative sample according to the loss function corresponding to the positive and negative sample dimensions, and transmitting the loss to the convolution layer corresponding to the positive and negative sample dimensions.
Optionally, the supervised training module 503 includes a second supervised training unit, configured to perform the following operations:
if the initial annotation information indicates that the sample image is a positive sample, computing the loss from each piece of the positive sample's fine-grained annotation information with the loss function of the corresponding fine-grained dimension, and propagating each loss back to the convolution layer of its fine-grained dimension;
and computing the loss from the positive sample's initial annotation information with the loss function of the positive/negative sample dimension, fusing the losses from the fine-grained annotation information with the loss from the initial annotation information to obtain the total loss of the positive sample, and propagating the total loss back to the convolution layer of the positive/negative sample dimension.
Optionally, the second supervised training unit is specifically configured to:
and weighting the loss brought by the initial labeling information of the positive sample by taking the loss brought by each fine-grained labeling information as a weight.
Optionally, the image recognition model further includes a feature extraction network, configured to extract a feature map of the sample image.
Optionally, the losses are also propagated back to the feature extraction network.
Optionally, the sample image is a signboard image, and the fine-grained annotation information includes whether occlusion exists and the blur level.
The training apparatus 500 for an image recognition model provided by the embodiments of the present application can execute the training method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to that method. For details not explicitly described in this embodiment, reference may be made to the description of any method embodiment of the present application.
FIG. 6 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application. The apparatus is applicable to recognizing images with a trained image recognition model, for example recognizing signboard images, and relates to the field of artificial intelligence, in particular to technologies such as computer vision and deep learning. The apparatus can implement the image recognition method of any embodiment of the present application. As shown in FIG. 6, the apparatus 600 specifically includes:
a feature map extraction module 601, configured to extract a feature map of an image to be recognized by using an image recognition model trained with the training apparatus according to any embodiment of the present application;
a feature vector extraction module 602, configured to extract feature vectors of the feature map from different dimensions by using at least two independent convolution layers in the image recognition model, wherein the dimensions comprise the positive/negative sample dimension and fine-grained dimensions;
and an image classification module 603, configured to classify the image to be recognized according to the output result of the convolution layer corresponding to the positive/negative sample dimension.
Optionally, the apparatus further comprises:
and a fine-grained identifier acquisition module, configured to obtain identifiers of the image to be recognized in the fine-grained dimensions from the output results of the convolution layers corresponding to those dimensions.
The image recognition apparatus 600 provided by the embodiments of the present application can execute the image recognition method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to that method. For details not explicitly described in this embodiment, reference may be made to the description of any method embodiment of the present application.
There are also provided, in accordance with embodiments of the present application, an electronic device, a readable storage medium, and a computer program product.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to one another by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as the training method of an image recognition model. For example, in some embodiments, the training method of the image recognition model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the image recognition model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g. by means of firmware) to perform the training method of the image recognition model.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, a host product in the cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS (virtual private server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
Furthermore, according to an embodiment of the present application, another electronic device, another readable storage medium, and another computer program product are provided for executing one or more steps of the image recognition method according to any embodiment of the present application. For their specific structure and program code, reference may be made to the description of the embodiment shown in FIG. 7, which is not repeated here.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (24)

1. A training method of an image recognition model, comprising:
acquiring a sample image and annotation information thereof, wherein the annotation information comprises initial annotation information labeled according to the positive/negative sample dimension and at least one piece of fine-grained annotation information obtained by dividing the positive samples in the sample image according to different fine-grained dimensions;
inputting the sample image into a pre-constructed image recognition model, wherein the image recognition model comprises at least two independent convolution layers, and different convolution layers are used for extracting feature vectors of the feature map of the sample image from different dimensions;
and performing supervised training on the image recognition model with the loss function corresponding to each dimension, according to the annotation information of the sample image in that dimension, wherein each loss is propagated back to the convolution layer of the corresponding dimension.
2. The method of claim 1, wherein the performing supervised training on the image recognition model with the loss function corresponding to each dimension comprises:
if the initial annotation information indicates that the sample image is a negative sample, computing the loss from the negative sample's initial annotation information with the loss function of the positive/negative sample dimension, and propagating it back to the convolution layer of that dimension.
3. The method of claim 1, wherein the performing supervised training on the image recognition model with the loss function corresponding to each dimension comprises:
if the initial annotation information indicates that the sample image is a positive sample, computing the loss from each piece of the positive sample's fine-grained annotation information with the loss function of the corresponding fine-grained dimension, and propagating each loss back to the convolution layer of its fine-grained dimension;
and computing the loss from the positive sample's initial annotation information with the loss function of the positive/negative sample dimension, fusing the losses from the fine-grained annotation information with the loss from the initial annotation information to obtain the total loss of the positive sample, and propagating the total loss back to the convolution layer of the positive/negative sample dimension.
4. The method of claim 3, wherein the fusing of the losses from the fine-grained annotation information with the loss from the initial annotation information of the positive sample comprises:
weighting the loss from the initial annotation information of the positive sample, with the loss from each piece of fine-grained annotation information serving as a weight.
5. The method of claim 1, wherein the image recognition model further comprises a feature extraction network for extracting a feature map of the sample image.
6. The method of claim 5, wherein the losses are also propagated back to the feature extraction network.
7. The method of any of claims 1-6, wherein the sample image is a signboard image, and the fine-grained annotation information includes whether occlusion exists and the blur level.
8. An image recognition method, comprising:
extracting a feature map of an image to be recognized by using an image recognition model trained with the method according to any one of claims 1-7;
extracting feature vectors of the feature map from different dimensions by using at least two independent convolution layers in the image recognition model, wherein the dimensions comprise the positive/negative sample dimension and fine-grained dimensions;
and classifying the image to be recognized according to the output result of the convolution layer corresponding to the positive/negative sample dimension.
9. The method of claim 8, further comprising:
and obtaining identifiers of the image to be recognized in the fine-grained dimensions from the output results of the convolution layers corresponding to those dimensions.
10. An apparatus for training an image recognition model, comprising:
a sample image information acquisition module, configured to acquire a sample image and annotation information thereof, wherein the annotation information comprises initial annotation information labeled according to the positive/negative sample dimension and at least one piece of fine-grained annotation information obtained by dividing the positive samples in the sample image according to different fine-grained dimensions;
a sample image input module, configured to input the sample image into a pre-constructed image recognition model, wherein the image recognition model comprises at least two independent convolution layers, and different convolution layers are used for extracting feature vectors of the feature map of the sample image from different dimensions;
and a supervised training module, configured to perform supervised training on the image recognition model with the loss function corresponding to each dimension, according to the annotation information of the sample image in that dimension, wherein each loss is propagated back to the convolution layer of the corresponding dimension.
11. The apparatus of claim 10, wherein the supervised training module comprises a first supervised training unit configured to:
according to the initial labeling information, if the sample image is a negative sample, calculate the loss contributed by the initial labeling information of the negative sample using the loss function corresponding to the positive and negative sample dimensions, and propagate the loss back to the convolution layer corresponding to the positive and negative sample dimensions.
12. The apparatus of claim 10, wherein the supervised training module comprises a second supervised training unit configured to:
according to the initial labeling information, if the sample image is a positive sample, calculate the loss contributed by each piece of fine-grained labeling information of the positive sample using the loss function corresponding to each fine-grained dimension, and propagate each such loss back to the convolution layer corresponding to that fine-grained dimension;
and calculate the loss contributed by the initial labeling information of the positive sample using the loss function corresponding to the positive and negative sample dimensions, fuse the losses contributed by the fine-grained labeling information with the loss contributed by the initial labeling information of the positive sample to obtain a total loss for the positive sample, and propagate the total loss back to the convolution layer corresponding to the positive and negative sample dimensions.
13. The apparatus of claim 12, wherein the second supervised training unit is configured to:
weight the loss contributed by the initial labeling information of the positive sample, using the losses contributed by the fine-grained labeling information as weights.
14. The apparatus of claim 10, wherein the image recognition model further comprises a feature extraction network for extracting a feature map of the sample image.
15. The apparatus of claim 14, wherein the losses are also propagated back to the feature extraction network.
16. The apparatus of any of claims 10-15, wherein the sample image is a sign image, and the fine-grained labeling information comprises whether occlusion exists and the degree of blurring.
17. An image recognition apparatus comprising:
a feature map extraction module, configured to extract a feature map of an image to be recognized using the image recognition model trained by the apparatus according to any one of claims 10-16;
a feature vector extraction module configured to extract feature vectors of the feature map from different dimensions respectively using at least two independent convolution layers in the image recognition model, wherein the dimensions comprise positive and negative sample dimensions and fine-grained dimensions;
and an image classification module configured to classify the image to be recognized according to the output result of the convolution layer corresponding to the positive and negative sample dimensions.
18. The apparatus of claim 17, further comprising:
and a fine-grained identifier acquisition module configured to obtain the identifier of the image to be recognized in each fine-grained dimension according to the output result of the convolution layer corresponding to that fine-grained dimension.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of training an image recognition model of any one of claims 1-7.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of training an image recognition model of any one of claims 1-7.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method of training an image recognition model of any one of claims 1-7.
22. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image recognition method of claim 8 or 9.
23. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image recognition method of claim 8 or 9.
24. A computer program product comprising a computer program which, when executed by a processor, implements the image recognition method of claim 8 or 9.
CN202011566347.4A 2020-12-25 2020-12-25 Training method, recognition method, device, equipment and medium Pending CN112633276A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011566347.4A CN112633276A (en) 2020-12-25 2020-12-25 Training method, recognition method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112633276A 2021-04-09

Family

ID=75325351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011566347.4A Pending CN112633276A (en) 2020-12-25 2020-12-25 Training method, recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112633276A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100256969A1 (en) * 2009-04-07 2010-10-07 Microsoft Corporation Generating implicit labels and training a tagging model using such labels
US20200089966A1 (en) * 2018-09-13 2020-03-19 Nec Laboratories America, Inc. Recognizing fine-grained objects in surveillance camera images
WO2020207377A1 (en) * 2019-04-10 2020-10-15 腾讯科技(深圳)有限公司 Method, device, and system for image recognition model training and image recognition
CN110569721A (en) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 Recognition model training method, image recognition method, device, equipment and medium
CN110458233A (en) * 2019-08-13 2019-11-15 腾讯云计算(北京)有限责任公司 Combination grain object identification model training and recognition methods, device and storage medium
CN111368788A (en) * 2020-03-17 2020-07-03 北京迈格威科技有限公司 Training method and device of image recognition model and electronic equipment
CN111553372A (en) * 2020-04-24 2020-08-18 北京搜狗科技发展有限公司 Training image recognition network, image recognition searching method and related device

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065605A (en) * 2021-04-16 2021-07-02 平安国际智慧城市科技股份有限公司 Honeysuckle recognition model training method and device, computer equipment and medium
CN112926553A (en) * 2021-04-25 2021-06-08 北京芯盾时代科技有限公司 Training method and device for motion detection network
CN113076434A (en) * 2021-04-30 2021-07-06 国网山东省电力公司经济技术研究院 Method and system for reporting power quality supervision information
CN113128455A (en) * 2021-04-30 2021-07-16 上海睿钰生物科技有限公司 Cell image reconstruction model training method and system
CN113128455B (en) * 2021-04-30 2023-04-28 上海睿钰生物科技有限公司 Cell image reconstruction model training method and system
CN113378833B (en) * 2021-06-25 2023-09-01 北京百度网讯科技有限公司 Image recognition model training method, image recognition device and electronic equipment
CN113378833A (en) * 2021-06-25 2021-09-10 北京百度网讯科技有限公司 Image recognition model training method, image recognition device and electronic equipment
CN113361524A (en) * 2021-06-29 2021-09-07 北京百度网讯科技有限公司 Image processing method and device
CN113361524B (en) * 2021-06-29 2024-05-03 北京百度网讯科技有限公司 Image processing method and device
CN114120074A (en) * 2021-11-05 2022-03-01 北京百度网讯科技有限公司 Training method and training device of image recognition model based on semantic enhancement
CN114120074B (en) * 2021-11-05 2023-12-12 北京百度网讯科技有限公司 Training method and training device for image recognition model based on semantic enhancement
CN114549948B (en) * 2022-02-16 2023-06-30 北京百度网讯科技有限公司 Training method, image recognition method, device and equipment for deep learning model
CN114549948A (en) * 2022-02-16 2022-05-27 北京百度网讯科技有限公司 Deep learning model training method, image recognition method, device and equipment

Similar Documents

Publication Publication Date Title
CN113378833B (en) Image recognition model training method, image recognition device and electronic equipment
CN112633276A (en) Training method, recognition method, device, equipment and medium
CN112560874A (en) Training method, device, equipment and medium for image recognition model
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
CN113205041B (en) Structured information extraction method, device, equipment and storage medium
CN113780098B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN113239807B (en) Method and device for training bill identification model and bill identification
CN112862005A (en) Video classification method and device, electronic equipment and storage medium
CN115359308B (en) Model training method, device, equipment, storage medium and program for identifying difficult cases
CN115358392A (en) Deep learning network training method, text detection method and text detection device
CN113947188A (en) Training method of target detection network and vehicle detection method
CN113378712A (en) Training method of object detection model, image detection method and device thereof
CN113191261B (en) Image category identification method and device and electronic equipment
CN113537192A (en) Image detection method, image detection device, electronic equipment and storage medium
CN113378857A (en) Target detection method and device, electronic equipment and storage medium
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
CN112784102A (en) Video retrieval method and device and electronic equipment
CN115457329B (en) Training method of image classification model, image classification method and device
CN114724113B (en) Road sign recognition method, automatic driving method, device and equipment
CN113344121B (en) Method for training a sign classification model and sign classification
CN112818972B (en) Method and device for detecting interest point image, electronic equipment and storage medium
CN113936158A (en) Label matching method and device
CN113361522A (en) Method and device for determining character sequence and electronic equipment
CN112580620A (en) Sign picture processing method, device, equipment and medium
CN114724090B (en) Training method of pedestrian re-identification model, and pedestrian re-identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination