CN114581710A - Image recognition method, device, equipment, readable storage medium and program product - Google Patents

Image recognition method, device, equipment, readable storage medium and program product

Info

Publication number
CN114581710A
Authority
CN
China
Prior art keywords
image
feature
feature map
map
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210210792.XA
Other languages
Chinese (zh)
Inventor
刘俊
詹佳伟
汪铖杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210210792.XA priority Critical patent/CN114581710A/en
Publication of CN114581710A publication Critical patent/CN114581710A/en
Pending legal-status Critical Current

Classifications

    • G06F 18/254: Pattern recognition; Analysing; Fusion techniques of classification results, e.g. of results related to same input data
    • G06F 18/256: Pattern recognition; Analysing; Fusion techniques of classification results relating to different input data, e.g. multimodal recognition
    • G06F 18/2415: Pattern recognition; Analysing; Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/253: Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06N 3/045: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides an image recognition method, apparatus, device, storage medium and program product, relating to the field of artificial intelligence. The method comprises the following steps: determining a first feature map and a second feature map corresponding to an image to be recognized; determining, based on the first feature map and each object category in an object category set, a first probability value that the image to be recognized belongs to each object category and an activation map corresponding to each object category; performing feature fusion processing between the second feature map and the activation map of each object category to determine a feature map after feature fusion; determining at least one candidate region of interest of the image to be recognized based on the activation map of each object category; determining a second probability value that the image to be recognized belongs to each object category based on the feature map after feature fusion and the at least one candidate region of interest; and determining the object category to which the image to be recognized belongs based on the first probability values and the second probability values.

Description

Image recognition method, device, equipment, readable storage medium and program product
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image recognition method, apparatus, device, readable storage medium, and program product.
Background
In the prior art, multi-label image classification (multi-label classification) aims at identifying the multiple labels (categories) of an image; it is a basic task of computer vision and multimedia and is widely used in image retrieval, attribute recognition, automatic image annotation and other fields. Conventional image recognition methods generate a large number of candidate regions of interest (ROIs), which is inefficient for multi-label image classification; moreover, because of background, illumination, angle and similar variations in the image, many of these ROIs are inaccurate and do not actually focus on the target objects in the image, which reduces the final classification performance.
Disclosure of Invention
The application provides an image recognition method, an image recognition device, an image recognition apparatus, a computer-readable storage medium and a computer program product, which are intended to improve the accuracy of multi-label (multi-class) image recognition.
In a first aspect, the present application provides an image recognition method, including:
acquiring an image to be identified;
determining a first feature map and a second feature map corresponding to the image to be recognized, wherein the resolution of the first feature map is smaller than that of the second feature map;
determining a first probability value of the image to be recognized belonging to each object class and an activation map corresponding to each object class based on the first feature map and each object class in a preset object class set;
performing feature fusion processing between the second feature map and the activation map of each object type to determine a feature map after feature fusion; determining at least one interested candidate area of the image to be recognized based on the activation map of each object category;
determining a second probability value of the image to be recognized belonging to each object category based on the feature map after feature fusion and at least one candidate region of interest;
and determining the object category to which the image to be recognized belongs based on the first probability values and the second probability values.
In one embodiment, determining a first feature map and a second feature map corresponding to an image to be recognized includes:
inputting the image to be recognized into a feature extraction model of a first neural network, extracting the second feature map from a fourth convolutional block of the feature extraction model, and extracting the first feature map from a fifth convolutional block of the feature extraction model;
the feature extraction model comprises a first convolutional block, a second convolutional block, a third convolutional block, a fourth convolutional block and a fifth convolutional block, and the first to fifth convolutional blocks are cascaded.
In one embodiment, determining a first probability value that the image to be recognized belongs to each object class based on the first feature map and each object class in a preset object class set comprises:
and inputting the first feature map into a first fully-connected layer and a max pooling layer of the first neural network, and performing classification processing based on each object category in the object category set to obtain a first probability value that the image to be identified belongs to each object category.
In one embodiment, determining an activation map corresponding to each object class based on the first feature map and each object class in a preset object class set includes:
performing dimension reduction processing on the first feature map, and determining a feature map after dimension reduction, wherein the number of channels of the feature map after dimension reduction is the same as the number of object categories in the object category set;
and inputting the feature map subjected to dimension reduction into a batch normalization layer of the first neural network, performing batch normalization processing, and determining an activation map corresponding to each object type.
In one embodiment, performing feature fusion processing between the second feature map and the activation map of each object category, and determining the feature map after feature fusion, includes:
inputting the second feature map and the activation map of each object category into a feature fusion model of the first neural network, and performing up-sampling and linear interpolation processing on the activation map of each object category to obtain a third feature map;
and performing element-wise (position-wise) summation between the second feature map and the third feature map to obtain the feature map after feature fusion.
In one embodiment, determining at least one candidate region of interest of the image to be recognized based on the activation map for each object category comprises:
inputting the activation map of each object category into a candidate region of interest selection model of the first neural network, and filtering out the background in the activation map of each object category to obtain the background-filtered activation map of each object category;
and sorting the first probability values in descending order, and performing edge extraction processing on the background-filtered activation maps corresponding to at least one top-ranked first probability value to obtain at least one candidate region of interest of the image to be identified.
In one embodiment, determining a second probability value that the image to be recognized belongs to each object category based on the feature fused feature map and the at least one candidate region of interest includes:
inputting the feature map after feature fusion and the at least one candidate region of interest into a candidate region of interest pooling layer of the first neural network, and cropping the feature map after feature fusion to obtain a feature map of the candidate region of interest;
and determining a second probability value of the image to be recognized belonging to each object category based on the feature map of the interest candidate region.
In one embodiment, determining the object class to which the image to be recognized belongs based on the first probability values and the second probability values includes:
and for one object category in the object category set, if the average value of the first probability value corresponding to the object category and the second probability value corresponding to the object category is not smaller than a preset category threshold, determining that the object category exists in the image to be identified.
In one embodiment, before acquiring the image to be identified, the method further comprises:
inputting the training samples into a second neural network, and determining the value of a first loss function of the global branch prediction model and an activation map of each object class in the training samples; the second neural network comprises a global branch prediction model, a local branch prediction model and a weak supervision model, the global branch prediction model comprises a feature extraction model, a first fully-connected layer and a max pooling layer, and the local branch prediction model comprises a feature fusion model, a candidate region of interest selection model, a batch normalization layer, a candidate region of interest pooling layer and a second fully-connected layer;
inputting the activation map of each object class in the training sample into the weak supervision model, suppressing the noise of the activation map of each object class in the training sample, and determining the value of a second loss function of the weak supervision model;
respectively inputting the activation map of each object category in the training sample into the feature fusion model and the interested candidate region selection model, and determining the value of a third loss function of the local branch prediction model;
updating parameters of the second neural network based on the values of the first loss function, the second loss function, and the third loss function;
if the sum of the value of the first loss function, the value of the second loss function and the value of the third loss function is smaller than a preset loss threshold value, finishing the training of the second neural network; and obtaining a second neural network based on the training, and determining a first neural network, wherein the first neural network does not comprise the weak supervision model.
In a second aspect, the present application provides an image recognition apparatus comprising:
the first processing module is used for acquiring an image to be identified;
the second processing module is used for determining a first feature map and a second feature map corresponding to the image to be recognized, and the resolution of the first feature map is smaller than that of the second feature map;
the third processing module is used for determining a first probability value of the image to be recognized belonging to each object class and an activation map corresponding to each object class based on the first feature map and each object class in a preset object class set;
the fourth processing module is used for carrying out feature fusion processing between the second feature map and the activation map of each object type and determining a feature map after feature fusion; determining at least one interested candidate area of the image to be recognized based on the activation map of each object category;
the fifth processing module is used for determining a second probability value of the image to be recognized belonging to each object category based on the feature map after feature fusion and at least one candidate region of interest;
and the sixth processing module is used for determining the object category to which the image to be recognized belongs based on the first probability values and the second probability values.
In a third aspect, the present application provides an electronic device, comprising: a processor, a memory, and a bus;
a bus for connecting the processor and the memory;
a memory for storing operating instructions;
and the processor is used for executing the image recognition method of the first aspect of the application by calling the operation instruction.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program for executing the image recognition method of the first aspect of the present application.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the image recognition method of the first aspect of the present application.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
the feature map after feature fusion contains a plurality of scales, namely, the feature map has high-level semantic understanding features and also contains certain low-level image texture features; determining at least one interested candidate region of the image to be recognized based on the activation map of each object category, thereby realizing focusing on a key region in the image to be recognized and improving the recognition accuracy of the interested candidate region; and determining a second probability value of the image to be recognized belonging to each object category based on the feature map after feature fusion and at least one interested candidate region, and determining the object category of the image to be recognized belonging to each object category based on the first probability values and the second probability values, so that the recognition accuracy of multi-classification (multiple labels) of the image to be recognized is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram of an architecture of an image recognition system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an image recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of image recognition provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of image recognition provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of image recognition provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of image recognition provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of image recognition provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of image recognition provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of image recognition provided by an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a comparison of interest candidate regions according to an embodiment of the present application;
fig. 11 is a schematic flowchart of another image recognition method according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification in connection with embodiments of the present application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term; for example, "A and/or B" indicates an implementation as "A", or an implementation as "B", or an implementation as "A and B".
It is understood that the specific implementation of the present application involves data related to image recognition. When the above embodiments are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiment of the application relates to an image recognition method provided by an image recognition system, and relates to the fields of artificial intelligence, cloud technology and the like. For example, the image recognition referred to in the embodiments of the present application uses computer vision techniques from the field of artificial intelligence, and the artificial neural network in the embodiments of the present application is a machine learning technique from the field of artificial intelligence. Application scenarios of the image recognition method include, but are not limited to, mobile phone photo albums, news feed applications (APPs), short video APPs and other image recognition scenarios.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like.
Computer Vision (CV) technology is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to identify and measure targets and perform other machine vision tasks, and further performs image processing so that the processed image is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation and other technologies, and also includes common biometric identification technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how computers simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal teaching learning.
For better understanding and description of the embodiments of the present application, some technical terms used in the embodiments of the present application will be briefly described below.
Multi-label classification: the main task of multi-label classification is to classify and identify an image so as to accurately assign one or more labels to a sample. Compared with the traditional classification problem, the multi-label classification problem has the following characteristics: (1) the number of categories per sample is not fixed; some samples may have only one label, while others may have dozens; (2) there may be some degree of dependency between labels; for example, an image containing a table is very likely to also contain chairs. One label may represent one category, and multiple labels may represent multiple categories.
Candidate region of interest (ROI): given an input image, find all possible positions where an object may be located; the output of this stage is a list of bounding boxes of possible object locations, which may be represented as rectangles, circles, ellipses, irregular polygons, and the like. In image processing, the region to be processed is outlined on the image in the form of a rectangle, circle, ellipse, irregular polygon or the like; this region is called the ROI. The ROI is a region selected from the image that is the focus of interest for image analysis.
Activation map: the activation map may also be referred to as a saliency map. A learned activation map expresses, for each object class, the regions whose activation values are relatively large; it can also simply be regarded as an indication of whether an object is salient, and the more salient an object is, the whiter it appears on the activation map.
Ground truth value: the ground truth is the actual measured value, as opposed to a predicted or estimated value. In multi-label classification, the ground truth is 0 or 1, indicating whether an object with the corresponding label is present in the current sample. For example, if an image contains 2 cats and 1 person, and the data set has a total of 4 labels (categories), namely cat, dog, person and car, the label vector of the image would be 1010.
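For illustration only, the following sketch (not part of the patent; the category order is assumed) shows how such a multi-hot ground-truth vector can be built:

```python
# Hypothetical illustration of the multi-label ground-truth encoding described above.
# The category order [cat, dog, person, car] is an assumption for this example.
CATEGORIES = ["cat", "dog", "person", "car"]

def encode_ground_truth(objects_in_image):
    """Return a multi-hot vector: 1 if the category is present in the image, else 0."""
    present = set(objects_in_image)
    return [1 if c in present else 0 for c in CATEGORIES]

# An image containing 2 cats and 1 person -> label vector [1, 0, 1, 0], i.e. "1010".
print(encode_ground_truth(["cat", "cat", "person"]))
```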
The technical scheme provided by the embodiment of the application relates to an artificial intelligence technology, and the technical scheme of the application is explained in detail by specific embodiments. These several specific embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
In order to better understand the scheme provided by the embodiment of the present application, the scheme is described below with reference to a specific application scenario.
In an embodiment, fig. 1 shows an architecture diagram of an image recognition system to which the embodiment of the present application is applied, and it can be understood that the image recognition method provided by the embodiment of the present application can be applied to, but is not limited to, the application scenario shown in fig. 1.
In the present example, as shown in fig. 1, the architecture of the image recognition system in this example may include, but is not limited to, a terminal 10, a server 20, and a database 30. The terminal 10, the server 20 and the database 30 can interact with each other through a network, the terminal 10 sends an image to be identified to the server 20, and the server 20 can also obtain the image to be identified from the database 30; one server 20 of the plurality of servers 20 may be responsible for determining a first feature map and a second feature map corresponding to the image to be recognized, where the resolution of the first feature map is smaller than that of the second feature map; the server 20 determines a first probability value of the image to be recognized belonging to each object class and an activation map corresponding to each object class based on the first feature map and each object class in a preset object class set; the server 20 performs feature fusion processing between the second feature map and the activation map of each object type to determine a feature map after feature fusion; the server 20 determines at least one candidate region of interest of the image to be identified, based on the activation map for each object category; the server 20 determines a second probability value that the image to be recognized belongs to each object category based on the feature map after feature fusion and at least one candidate region of interest; the server 20 determines the object category to which the image to be recognized belongs based on the first probability values and the second probability values; the server 20 sends the object class to which the image to be recognized belongs to the terminal 10. Another server 20 of the plurality of servers 20 may be responsible for training the neural network in the image recognition method provided by the embodiments of the present application.
It is understood that the above is only an example, and the present embodiment is not limited thereto.
The terminal may be a smart phone (e.g., an Android phone, an iOS phone, etc.), a cell phone simulator, a tablet computer, a notebook computer, a digital broadcast receiver, an MID (Mobile Internet Devices), a PDA (personal digital assistant), etc. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server or a server cluster providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform.
Cloud computing (cloud computing) is a computing model that distributes computing tasks over a large pool of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand.
As a basic capability provider of cloud computing, a cloud computing resource pool (referred to as an IaaS (Infrastructure as a Service) platform for short) is established, and multiple types of virtual resources are deployed in the resource pool for external clients to select and use.
According to the logic function division, a PaaS (Platform as a Service) layer can be deployed on an IaaS (Infrastructure as a Service) layer, a SaaS (Software as a Service) layer is deployed on the PaaS layer, and the SaaS can be directly deployed on the IaaS. PaaS is a platform on which software runs, such as a database, a web container, etc. SaaS is a variety of business software, such as web portal, sms, and mass texting. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
The so-called artificial intelligence cloud service is also generally called AIaaS (AI as a Service). This is a mainstream service mode of an artificial intelligence platform; specifically, an AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed application store: all developers can access one or more artificial intelligence services provided by the platform through an API (application programming interface), and some qualified developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
Such networks may include, but are not limited to: a wired network, a wireless network, wherein the wired network comprises: a local area network, a metropolitan area network, and a wide area network, the wireless network comprising: bluetooth, Wi-Fi, and other networks that enable wireless communication. The specific determination may also be based on the requirements of the actual application scenario, and is not limited herein.
Referring to fig. 2, fig. 2 shows a flowchart of an image recognition method provided in an embodiment of the present application, where the method may be executed by any electronic device, such as a server, and as an alternative implementation, the method may be executed by the server. As shown in fig. 2, an image recognition method provided in an embodiment of the present application includes the following steps:
s201, acquiring an image to be identified.
Specifically, the image to be recognized may be a picture, an image in a segment video, or the like. The method can acquire the image to be identified from a mobile phone photo album, an information flow news APP, a short video APP and the like; or shooting pictures on site, and taking the shot pictures as images to be identified.
S202, determining a first feature map and a second feature map corresponding to the image to be recognized, wherein the resolution of the first feature map is smaller than that of the second feature map.
Specifically, for example, the image to be recognized is 448 × 448 × 3, the first feature map is 2048 × 14 × 14, and the second feature map is 1024 × 28 × 28, where 2048 denotes the number of channels (dimensions) of the first feature map and 14 × 14 denotes its spatial size, i.e., a length of 14 and a width of 14.
S203, determining a first probability value of the image to be recognized belonging to each object class and an activation map corresponding to each object class based on the first feature map and each object class in a preset object class set.
Specifically, the preset set of object classes may include a plurality of object classes, for example, the set of object classes is a Pascal VOC 2007 dataset, the Pascal VOC 2007 dataset includes 20 object classes, a first probability value that the image to be recognized belongs to each object class is determined based on the first feature map and each object class of the 20 object classes, and an activation map corresponding to each object class is determined, that is, a total of 20 first probability values and 20 activation maps may be determined. The value range of the first probability value is 0-1. The object categories may be people, cats, dogs, tables, etc.
S204, performing feature fusion processing between the second feature map and the activation map of each object type, and determining a feature map after feature fusion; and determining at least one candidate region of interest of the image to be recognized based on the activation map for each object category.
In particular, each object category corresponds to one activation map, for example, the set of object categories includes 20 object categories, the 20 object categories correspond to 20 activation maps, and based on the 20 activation maps, a plurality of interesting candidate regions ROI of the image to be recognized can be determined.
S205, determining a second probability value of the image to be recognized belonging to each object category based on the feature map after feature fusion and at least one candidate region of interest.
Specifically, the feature map after feature fusion includes multiple scales, that is, the feature map has high-level semantic understanding features and also includes certain low-level image texture features. For example, the object class set comprises 20 object classes, and a second probability value of the image to be recognized belonging to each object class is determined based on the feature map after feature fusion and a plurality of candidate regions of interest, that is, 20 second probability values can be determined in total, and the value range of the second probability value is 0-1.
S206, determining the object category to which the image to be recognized belongs based on the first probability values and the second probability values.
Specifically, for example, the object class set includes 20 object classes, each object class corresponds to a first probability value and a second probability value, that is, the object class to which the image to be recognized belongs is determined based on the first probability value and the second probability value corresponding to the 20 object classes, and the object class to which the image to be recognized belongs may be one or more of the 20 object classes.
In the embodiment of the application, the feature map after feature fusion combines multiple scales: it carries high-level semantic features while retaining certain low-level image texture features. At least one candidate region of interest of the image to be recognized is determined based on the activation map of each object category, so that the key regions in the image to be recognized are focused on and the accuracy of the candidate regions of interest is improved. A second probability value that the image to be recognized belongs to each object category is determined based on the feature map after feature fusion and the at least one candidate region of interest, and the object category to which the image to be recognized belongs is determined based on the first probability values and the second probability values, thereby improving the accuracy of multi-label (multi-class) recognition of the image to be recognized.
In one embodiment, determining a first feature map and a second feature map corresponding to an image to be recognized includes:
inputting the image to be recognized into a feature extraction model of a first neural network, extracting the second feature map from a fourth convolutional block of the feature extraction model, and extracting the first feature map from a fifth convolutional block of the feature extraction model;
the feature extraction model comprises a first convolutional block, a second convolutional block, a third convolutional block, a fourth convolutional block and a fifth convolutional block, and the first to fifth convolutional blocks are cascaded.
Specifically, as shown in fig. 3, the first neural network includes a feature extraction model 301, a fully-connected layer 302 (first fully-connected layer), a max pooling layer 303, a 1 × 1 convolutional layer 304, a batch normalization layer 305, a feature fusion model 306, a ROI selection model 307 (candidate region of interest selection model), a ROI pooling layer 308 (candidate region of interest pooling layer), a convolution block 309, and a fully-connected layer 310 (second fully-connected layer), wherein the feature fusion model 306 includes an upsampling layer and a linear interpolation layer.
The feature extraction model 301 may be a ResNet101 model, an AlexNet model, or a VGGNet model. If the feature extraction model 301 is the ResNet101 model, as shown in fig. 3, the feature extraction model 301 includes conv1 (first convolutional block), conv2 (second convolutional block), conv3 (third convolutional block), conv4 (fourth convolutional block), and conv5 (fifth convolutional block). For example, the picture (image to be recognized) is 448 × 448 × 3; a second feature map of 1024 × 28 × 28 is extracted from conv4 of the feature extraction model 301, and a first feature map of 2048 × 14 × 14 is extracted from conv5 of the feature extraction model 301. The structure of the convolution block 309 may be the same as conv5. As shown in fig. 4, the ResNet101 model takes the picture (the image to be recognized) as input and performs high-level semantic feature extraction on it: the second feature map of 1024 × 28 × 28 is extracted from conv4 of the ResNet101 model, and the first feature map of 2048 × 14 × 14 is extracted from conv5 of the ResNet101 model.
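A minimal sketch of this two-stage feature extraction is given below; it assumes a torchvision ResNet-101 backbone whose layer3/layer4 stages play the role of conv4/conv5, and it is an illustration rather than the patent's exact network:

```python
import torch
import torchvision

# Backbone: torchvision's ResNet-101, an assumed stand-in for the feature extraction model 301.
backbone = torchvision.models.resnet101(weights=None)

def extract_features(image_batch):
    """Return (first_feature_map, second_feature_map) from conv5-like and conv4-like stages."""
    x = backbone.conv1(image_batch)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)          # conv1 stage
    x = backbone.layer1(x)           # conv2
    x = backbone.layer2(x)           # conv3
    second = backbone.layer3(x)      # conv4 -> 1024 x 28 x 28 for a 448 x 448 input
    first = backbone.layer4(second)  # conv5 -> 2048 x 14 x 14
    return first, second

images = torch.randn(2, 3, 448, 448)      # batch of images to be recognized
first_map, second_map = extract_features(images)
print(first_map.shape, second_map.shape)  # [2, 2048, 14, 14] and [2, 1024, 28, 28]
```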
In one embodiment, determining a first probability value that the image to be recognized belongs to each object class based on the first feature map and each object class in a preset object class set comprises:
and inputting the first feature map into a first fully-connected layer and a max pooling layer of the first neural network, and performing classification processing based on each object class in the object class set to obtain a first probability value that the image to be identified belongs to each object class.
Specifically, as shown in fig. 3, the first feature map is input to the fully-connected layer 302 (the first fully-connected layer) and the max pooling layer 303, and classification processing is performed based on each object class in the object class set to obtain a global branch prediction value (first probability value) for each object class of the picture (the image to be identified); for example, the picture is 448 × 448 × 3 and the first feature map is 2048 × 14 × 14.
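One possible reading of the "first fully-connected layer + max pooling layer" head is sketched below as an assumption: a per-location 1 × 1 classifier followed by global max pooling and a sigmoid, with the 20-class setting following the Pascal VOC example above. This is an illustration, not a verified reproduction of the patent's network.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 20  # e.g. the Pascal VOC 2007 label set

class GlobalBranchHead(nn.Module):
    """Assumed global-branch head: per-location classifier + global max pooling + sigmoid."""
    def __init__(self, in_channels=2048, num_classes=NUM_CLASSES):
        super().__init__()
        # A 1x1 convolution acts as the fully-connected layer applied at every location.
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, first_feature_map):
        class_maps = self.classifier(first_feature_map)   # B x 20 x 14 x 14
        pooled = class_maps.amax(dim=(2, 3))              # global max pooling -> B x 20
        return torch.sigmoid(pooled)                      # first probability values in [0, 1]

head = GlobalBranchHead()
probs = head(torch.randn(2, 2048, 14, 14))
print(probs.shape)  # torch.Size([2, 20])
```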
In one embodiment, determining an activation map corresponding to each object class based on the first feature map and each object class in a preset object class set includes:
performing dimension reduction processing on the first feature map, and determining a feature map after dimension reduction, wherein the number of channels of the feature map after dimension reduction is the same as the number of object categories in the object category set;
and inputting the feature map subjected to dimension reduction into a batch normalization layer of the first neural network, performing batch normalization processing, and determining an activation map corresponding to each object type.
Specifically, as shown in fig. 3, for example, the first feature map of 2048 × 14 × 14 is input to the 1 × 1 convolution layer 304 and dimension reduction is performed, yielding a dimension-reduced feature map of 20 × 14 × 14; the dimension-reduced feature map of 20 × 14 × 14 is input to the batch normalization layer 305 and batch normalization is performed, yielding 20 feature maps (activation maps) of 1 × 14 × 14, where the 20 feature maps of 1 × 14 × 14 may together be represented as one 20 × 14 × 14 feature map, and each 1 × 14 × 14 feature map is the activation map corresponding to one object class.
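A short sketch of this 1 × 1 dimension reduction followed by batch normalization, under the 20-class assumption, might look as follows:

```python
import torch
import torch.nn as nn

reduce_1x1 = nn.Conv2d(2048, 20, kernel_size=1)  # 1 x 1 convolution: 2048 channels -> 20 channels
batch_norm = nn.BatchNorm2d(20)                  # batch normalization over the 20 class channels

first_feature_map = torch.randn(2, 2048, 14, 14)
reduced = reduce_1x1(first_feature_map)          # B x 20 x 14 x 14
activation_maps = batch_norm(reduced)            # one 14 x 14 activation map per object class
print(activation_maps.shape)                     # torch.Size([2, 20, 14, 14])
```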
In one embodiment, the feature fusion processing is performed between the second feature map and the activation map of each object class, and determining the feature map after feature fusion includes:
inputting the second feature map and the activation map of each object category into a feature fusion model of the first neural network, and performing up-sampling and linear interpolation processing on the activation map of each object category to obtain a third feature map;
and performing element-wise (position-wise) summation between the second feature map and the third feature map to obtain the feature map after feature fusion.
Specifically, as shown in fig. 3, for example, the 20 feature maps (activation maps) of 1 × 14 × 14 are input to the upsampling layer included in the feature fusion model 306, and 1024 feature maps of 14 × 14 are obtained by up-sampling the 20 feature maps of 14 × 14; the 1024 feature maps of 14 × 14 may be represented as 1024 × 14 × 14. The 1024 feature maps of 14 × 14 are then input to the linear interpolation layer included in the feature fusion model 306, and all 1024 feature maps of size 14 × 14 are interpolated to size 28 × 28, i.e., 1024 feature maps of 28 × 28 are obtained, which may be represented as 1024 × 28 × 28 (the third feature map). Element-wise summation is then performed between the second feature map of 1024 × 28 × 28 and the third feature map of 1024 × 28 × 28 to obtain the feature map after feature fusion.
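A hedged sketch of this fusion step is shown below. The channel "up-sampling" from 20 to 1024 maps is realized here with a 1 × 1 convolution and the interpolation with bilinear resizing; both choices are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed channel "up-sampling": 20 activation maps -> 1024 feature maps.
channel_upsample = nn.Conv2d(20, 1024, kernel_size=1)

def fuse_features(activation_maps, second_feature_map):
    x = channel_upsample(activation_maps)                 # B x 1024 x 14 x 14
    third_feature_map = F.interpolate(x, size=(28, 28),
                                      mode="bilinear", align_corners=False)
    return second_feature_map + third_feature_map         # element-wise (position-wise) sum

fused = fuse_features(torch.randn(2, 20, 14, 14), torch.randn(2, 1024, 28, 28))
print(fused.shape)  # torch.Size([2, 1024, 28, 28])
```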
In one embodiment, determining at least one candidate region of interest of the image to be recognized based on the activation map for each object category comprises:
inputting the activation map of each object category into a candidate region of interest selection model of the first neural network, and filtering out the background in the activation map of each object category to obtain the background-filtered activation map of each object category;
and sorting the first probability values in descending order, and performing edge extraction processing on the background-filtered activation maps corresponding to at least one top-ranked first probability value to obtain at least one candidate region of interest of the image to be identified.
Specifically, as shown in fig. 3, for example, the 20 feature maps (activation maps) of 1 × 14 × 14 are input to the ROI selection model 307 (candidate region of interest selection model), and the background in the 20 activation maps of 1 × 14 × 14 is filtered out to obtain 20 background-filtered activation maps; the first probability values are sorted in descending order, and edge extraction is performed on the background-filtered activation maps corresponding to the top 4 first probability values to obtain 4 candidate regions of interest (ROIs) of the picture (the image to be identified).
In one embodiment, as shown in fig. 5, for example, the 20 feature maps (activation maps, i.e., saliency activation maps) of 1 × 14 × 14 are input to the nonlinear activation layer ReLU in the ROI selection model 307, and the 20 background-filtered activation maps are obtained by filtering the 20 activation maps of 1 × 14 × 14 through the ReLU. If the activation value of a pixel in the activation map is negative, the pixel corresponds to the background; if the activation value of a pixel in the activation map is non-negative, the pixel corresponds to an object, such as a person. Therefore, activation values smaller than 0 are removed by the ReLU, i.e., the background of the activation map is filtered out.
After the activation maps pass through the nonlinear activation layer ReLU, the regions with negative activation values are filtered out and only feature maps with positive activation values remain, for example N maps of 14 × 14, where N may be 20, i.e., 20 background-filtered activation maps. The k_s feature maps with the top-ranked confidence probabilities (first probability values) are chosen, e.g. k_s = 4, i.e., the background-filtered activation maps corresponding to the top 4 first probability values; edges are extracted from each of these 4 background-filtered activation maps, and an ROI is selected for each background-filtered activation map based on the total energy value of the region rather than on the area size, yielding 4 ROIs of the picture.
In one embodiment, as shown in FIG. 6, in order to select the key ROI regions, the ROIs may be compared by the energy value within the region instead of by the region area. A strategy that selects the ROI by area may pick a region with a large area but a small overall activation value, whereas a better region can be obtained by integrating the energy value, i.e., the sum of the activation values within the ROI. All contours in each feature map (activation map) are extracted, and the contours with larger energy are taken as candidate ROIs. FIG. 6 shows, in sequence, the input picture, the activation map, the edge extraction, the suboptimal region selected according to area, and the optimal region selected according to energy. The input picture has two regions under the category activation map of "person": the energy value of the upper region is higher, but the area of the lower region is larger. If an area-first strategy were adopted, the recognition of the key core feature, the human face, would be weakened; by adopting the energy-first strategy, the influence of large-area, low-activation regions can be balanced out and only the most critical region is focused on.
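An illustrative (assumed) implementation of this energy-first ROI selection is sketched below: negative activations are removed as background, contours are extracted with OpenCV, and the bounding box with the largest total activation energy, rather than the largest area, is kept.

```python
import cv2
import numpy as np

def select_roi_by_energy(activation_map, image_size=448):
    """Pick the bounding box with the largest summed activation ("energy"), not the largest area."""
    act = np.maximum(activation_map, 0.0)                  # ReLU: negative values are background
    act_resized = cv2.resize(act, (image_size, image_size))
    mask = (act_resized > 0).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    best_box, best_energy = None, -1.0
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        energy = float(act_resized[y:y + h, x:x + w].sum())  # total activation inside the box
        if energy > best_energy:
            best_box, best_energy = (x, y, w, h), energy
    return best_box

roi = select_roi_by_energy(np.random.randn(14, 14).astype(np.float32))
print(roi)
```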
In one embodiment, as shown in FIG. 7, two additional branches are added between the batch normalization layer 305 and the ROI selection model 307 for floating the activation map (saliency activation map) up or down to some degree, so as to obtain k_r ROIs. Meanwhile, for one activation map, not only the ROI with the highest energy but also the ROI with the second-highest energy, the ROI with the third-highest energy, and so on may be selected, where the highest energy is greater than the second-highest energy and the second-highest energy is greater than the third-highest energy; in this way, k_e ROIs can be obtained. Finally, k_e × k_s × k_r candidate regions of interest (ROIs) are obtained. After expanding the range by k_e × k_s × k_r, the original 4 ROIs become more regions, such as 16 ROIs, 24 ROIs, and so on.
For example, the upward float is +0.1σ and the downward float is −0.1σ, where σ denotes the variance. Likewise, k_r = 1 + the number of floats (e.g., one float of −0.1σ, or two floats of −0.1σ and −0.2σ); k_s = 4; in FIG. 7, if the second-highest-energy ROI is added, k_e = 2, and if the third-highest-energy ROI is also added, k_e = 3. If −0.1σ is floated once and the second-highest-energy ROI is added, then k_r = 2 and k_e = 2, i.e., k = k_e × k_s × k_r = 2 × 4 × 2 = 16 ROIs.
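The ROI-count bookkeeping described above can be illustrated with a tiny, purely hypothetical calculation:

```python
# Hypothetical ROI-count calculation for the expansion described above.
k_s = 4   # number of top-ranked classes whose activation maps are used
k_r = 2   # 1 original map + 1 floated variant (e.g. -0.1 sigma)
k_e = 2   # highest-energy ROI plus the second-highest-energy ROI per map
print(k_e * k_s * k_r)  # 16 candidate ROIs in total
```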
As shown in fig. 8, the figure shows, in sequence, the input picture, the activation map, the edge extraction after the activation map is floated up and down, and the different candidate regions of interest (ROIs) corresponding to the different edges.
In one embodiment, the determining a second probability value that the image to be recognized belongs to each object category based on the feature map after feature fusion and the at least one candidate region of interest includes:
inputting the feature map after feature fusion and the at least one candidate region of interest into a candidate region of interest pooling layer of the first neural network, and cropping the feature map after feature fusion to obtain a feature map of the candidate region of interest;
and determining a second probability value of the image to be recognized belonging to each object category based on the feature map of the interest candidate region.
Specifically, as shown in fig. 3, for example, the feature map after feature fusion and the 4 candidate regions of interest (ROIs) are input to the candidate region of interest pooling layer 308, and the feature map after feature fusion is cropped to obtain the feature maps of the candidate regions of interest, which may be represented as 4 × 1024 × 7 × 7; these feature maps are input to the convolution block 309 to obtain a 4 × 2048 × 1 × 1 feature map, which is input to the fully-connected layer 310 (the second fully-connected layer), and classification processing is performed to obtain the second probability value that the picture (the image to be identified) belongs to each object category.
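A hedged PyTorch sketch of this local branch is given below. torchvision's roi_align stands in for the candidate region of interest pooling layer, the small conv block and the per-ROI max aggregation are assumptions, and the shapes follow the example figures.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

conv5_like = nn.Sequential(                    # assumed stand-in for convolution block 309
    nn.Conv2d(1024, 2048, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(2048),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
)
second_fc = nn.Linear(2048, 20)                # second fully-connected layer, 20 object classes

def local_branch(fused_feature_map, rois_per_image):
    # rois_per_image: list of (K, 4) boxes (x1, y1, x2, y2) in image coordinates of a 448 x 448 input
    pooled = roi_align(fused_feature_map, rois_per_image, output_size=(7, 7),
                       spatial_scale=28.0 / 448.0)          # ROI features: K x 1024 x 7 x 7
    feats = conv5_like(pooled).flatten(1)                   # K x 2048
    logits = second_fc(feats)                               # per-ROI class scores
    return torch.sigmoid(logits).amax(dim=0)                # assumed aggregation: max over ROIs

fused = torch.randn(1, 1024, 28, 28)
boxes = [torch.tensor([[32.0, 32.0, 256.0, 256.0], [100.0, 50.0, 300.0, 400.0]])]
print(local_branch(fused, boxes).shape)  # torch.Size([20]) second probability values
```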
In one embodiment, determining the object class to which the image to be recognized belongs based on the first probability values and the second probability values comprises:
and for one object category in the object category set, if the average value of the first probability value corresponding to the object category and the second probability value corresponding to the object category is not smaller than a preset category threshold, determining that the object category exists in the image to be identified.
Specifically, for example, if the object category set includes 20 object categories, each of the 20 object categories corresponds to 1 first probability value and 1 second probability value, and each object category corresponds to an average value of its first probability value and second probability value. If this average value is not smaller than the preset category threshold, it may be determined that the object category exists in the picture (the image to be recognized); if the average value is smaller than the preset category threshold, it may be determined that the object category does not exist in the picture. In this way the object categories present in the picture can be determined, for example 3 object categories: person, cat and table.
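The final decision rule can be sketched as follows, assuming (as discussed above) that a category is kept when its averaged score reaches the threshold:

```python
import torch

def predict_labels(first_probs, second_probs, categories, threshold=0.5):
    """Average the two probability vectors and keep categories whose score reaches the threshold."""
    avg = (first_probs + second_probs) / 2.0
    return [c for c, score in zip(categories, avg.tolist()) if score >= threshold]

categories = ["person", "cat", "dog", "table"]
first = torch.tensor([0.9, 0.7, 0.1, 0.8])   # first probability values (global branch)
second = torch.tensor([0.8, 0.6, 0.2, 0.7])  # second probability values (local branch)
print(predict_labels(first, second, categories))  # ['person', 'cat', 'table']
```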
In one embodiment, before acquiring the image to be recognized, the method further comprises:
inputting the training samples into a second neural network, and determining the value of a first loss function of the global branch prediction model and an activation map of each object class in the training samples; the second neural network comprises a global branch prediction model, a local branch prediction model and a weak supervision model, wherein the global branch prediction model comprises a feature extraction model, a first fully-connected layer and a max pooling layer, and the local branch prediction model comprises a feature fusion model, a candidate region of interest selection model, a batch normalization layer, a candidate region of interest pooling layer and a second fully-connected layer;
inputting the activation map of each object class in the training sample into the weak supervision model, suppressing the noise of the activation map of each object class in the training sample, and determining the value of a second loss function of the weak supervision model;
respectively inputting the activation map of each object category in the training sample into the feature fusion model and the interested candidate region selection model, and determining the value of a third loss function of the local branch prediction model;
updating parameters of the second neural network based on the values of the first loss function, the second loss function, and the third loss function;
if the sum of the value of the first loss function, the value of the second loss function and the value of the third loss function is smaller than a preset loss threshold value, finishing the training of the second neural network; and obtaining a second neural network based on the training, and determining a first neural network, wherein the first neural network does not comprise the weak supervision model.
Specifically, as shown in fig. 3, the second neural network includes a feature extraction model 301, a fully-connected layer 302 (the first fully-connected layer), a max pooling layer 303, a 1 × 1 convolutional layer 304, a batch normalization layer 305, a feature fusion model 306, a ROI selection model 307 (the candidate region of interest selection model), a ROI pooling layer 308 (the candidate region of interest pooling layer), a convolution block 309, a fully-connected layer 310 (the second fully-connected layer), and a weakly supervised model 311, wherein the feature fusion model 306 includes an upsampling layer and a linear interpolation layer. The weakly supervised model 311 is used to improve the feature representation of local regions and to suppress noise ROIs generated by non-existing classes, i.e. to suppress the noise of the activation map of each object class, where a noise ROI refers to an inaccurate ROI that does not frame the corresponding object but instead frames a background region or another target.
As shown in fig. 9, the activation map of each object category in the training sample is input to the weak supervision model 311; for an activation map whose object category is absent (an activation map whose ground truth is 0), the activation map is constrained by the sigmoid function in the weak supervision model 311 so that the activation value of each pixel point in the activation map tends to 0 as far as possible, i.e. the activation map is constrained to very low values. In this way, the weak supervision model 311 processes the activation maps whose ground truth is 0 and suppresses the activation value of each pixel point in these activation maps, obtaining activation maps corresponding to the background.
In one embodiment, the first loss function is determined by binary cross entropy (BCE) to train the global branch prediction model. The first loss function $\mathcal{L}_{\mathrm{global}}$ is shown in equation (1):

$$\mathcal{L}_{\mathrm{global}} = -\sum_{i}\left[\, y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right) \right] \qquad (1)$$

wherein i refers to the object class, $y_i$ is the ground truth of the input training sample under the object class (presence or absence; $y_i$ takes a value of 0 or 1, binary classification), and $\hat{y}_i$ is the prediction value output after passing through the global branch prediction model, with a value range of 0 to 1. Through the first loss function, under continuous iteration and optimization of the global branch prediction model, the output prediction of the global branch prediction model gradually approaches the ground truth of the training sample.
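As a side note, equation (1) is the standard multi-label binary cross entropy, which could for instance be computed with PyTorch as sketched below; the function name, tensor shapes and the summation over the batch are assumptions made for the illustration:

```python
import torch.nn.functional as F

def global_bce_loss(logits, targets):
    """Equation (1): multi-label BCE for the global branch prediction model.

    logits  -- raw scores of shape (batch, num_classes) from the global branch
    targets -- 0/1 ground-truth labels of the same shape (float tensor)
    The sigmoid is folded into the loss for numerical stability; the result is
    summed over object categories (and over the batch when a batch is given).
    """
    return F.binary_cross_entropy_with_logits(logits, targets, reduction="sum")
```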
The second loss function $\mathcal{L}_{\mathrm{ws}}$ of the weakly supervised model 311 is shown in equation (2):

$$\mathcal{L}_{\mathrm{ws}} = -\frac{1}{N}\sum_{i:\, gt_i = 0}\; \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} \log\left(1 - \sigma\!\left(A^{i}_{h,w}\right) + \delta\right) \qquad (2)$$

where H and W are the length and width of the activation map; $A^{i}_{h,w}$ represents the activation value of the pixel point at coordinates (h, w) under the selected object category; gt represents the ground truth value, and the summation over $i:\, gt_i = 0$ means that all non-existing object categories are summed; N represents the number of object categories; $\sigma(\cdot)$ is the sigmoid function described above; and δ is a small value that prevents mathematical domain errors.
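A minimal sketch of one way to realize the suppression behaviour behind equation (2) is given below; since the published formula is rendered as an image in the original, the exact averaging scheme, the variable names and the value 1e-6 for δ are assumptions made for illustration:

```python
import torch

def weak_supervision_loss(activation_maps, targets, delta=1e-6):
    """Push the activations of absent object categories toward zero (cf. equation (2)).

    activation_maps -- tensor of shape (num_classes, H, W), one activation map per category
    targets         -- 0/1 tensor of length num_classes (ground truth per category)
    delta           -- small constant preventing mathematical domain errors in the log
    Only activation maps whose ground truth is 0 contribute; their sigmoid-squashed
    activations are driven toward 0 so that they come to represent background.
    """
    absent = targets == 0                              # object categories not present in the sample
    if absent.sum() == 0:
        return activation_maps.new_zeros(())
    probs = torch.sigmoid(activation_maps[absent])     # constrain with the sigmoid function
    return -torch.log(1.0 - probs + delta).mean()
```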
The global branch prediction model includes a feature extraction model 301, a fully-connected layer 302 (the first fully-connected layer), and a max-pooling layer 303.
The third loss function $\mathcal{L}_{\mathrm{local}}$ of the local branch prediction model is shown in equation (3):

$$\mathcal{L}_{\mathrm{local}} = -\sum_{i}\left[\, y_i \log y'_i + \left(1 - y_i\right)\log\left(1 - y'_i\right) \right] \qquad (3)$$

where i refers to the current category, $y_i$ is the ground truth of the input training sample under the object class (presence or absence, binary classification), and $y'_i$ is the prediction value output after passing through the local branch prediction model, with a value range of 0 to 1. Through the third loss function, under continuous iteration and optimization of the local branch prediction model, the output prediction of the local branch prediction model gradually approaches the ground truth of the sample.
The local branch prediction model includes a 1 x 1 convolutional layer 304, a batch normalization layer 305, a feature fusion model 306, a ROI selection model 307 (candidate region of interest selection model), a ROI pooling layer 308 (candidate region of interest pooling layer), a convolution block 309, and a fully-connected layer 310 (second fully-connected layer).
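To make the interplay of the three loss values concrete, a schematic training step is sketched below. The `second_network` object, the names of its outputs, the optimizer handling and the re-derivation of equations (1)–(3) inside the function are assumptions for the sketch, not the exact training code of the embodiment:

```python
import torch
import torch.nn.functional as F

def train_step(second_network, optimizer, images, labels, loss_threshold, delta=1e-6):
    """One schematic parameter update of the second neural network.

    labels is assumed to be a 0/1 float tensor of shape (batch, num_classes).
    second_network is assumed to return, for a batch of training samples:
      global_logits   -- (batch, num_classes) scores of the global branch prediction model
      local_logits    -- (batch, num_classes) scores of the local branch prediction model
      activation_maps -- (batch, num_classes, H, W) activation maps per object category
    Returns the summed loss and a flag that tells the caller to stop training once
    the sum of the three loss values is smaller than the preset loss threshold.
    """
    global_logits, local_logits, activation_maps = second_network(images)

    loss_global = F.binary_cross_entropy_with_logits(global_logits, labels)   # equation (1)
    loss_local = F.binary_cross_entropy_with_logits(local_logits, labels)     # equation (3)

    # Equation (2), sketched: suppress activations of categories whose ground truth is 0.
    absent = (labels == 0).float().unsqueeze(-1).unsqueeze(-1)
    probs = torch.sigmoid(activation_maps)
    pixels = activation_maps.shape[-2] * activation_maps.shape[-1]
    loss_ws = -(torch.log(1.0 - probs + delta) * absent).sum() / (absent.sum() * pixels + 1e-8)

    total = loss_global + loss_ws + loss_local
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total, total.item() < loss_threshold
```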
The application of the embodiment of the application has at least the following beneficial effects:
the neural network structure is clear, each model in the neural network has good generalization capability, and verification on popular large-scale multi-label image classification data sets shows that the image recognition method provided by the embodiment of the application performs excellently. The weak supervision model can effectively generate high-quality region proposals for the multi-label image classification task, thereby improving the efficiency of ROI generation; the weak supervision realized by the weak supervision model requires neither heavy labeling cost nor additional annotation of candidate boxes for the images. The ROI selection model can automatically learn the semantic boundary threshold of the ROI, and multi-scale feature fusion improves the quality of the generated ROIs, thereby improving the accuracy of ROI generation.
In order to better understand the method provided by the embodiment of the present application, the following further describes the scheme of the embodiment of the present application with reference to an example of a specific application scenario.
The application scenarios of the method provided by the embodiments of the present application include, but are not limited to, image recognition scenarios such as mobile phone photo albums, information-flow news apps, short-video apps, and photo-based object recognition functions.
Specifically, a mobile phone photo album can classify thousands of pictures in the user's mobile phone into landscape, party, group photo, building, portrait and the like, so that the user can find a desired picture more conveniently. Information-flow news apps, short-video apps and the like can identify multiple corresponding labels as features according to the information browsed and the content collected by the user, to assist in recommending content that the user likes to watch. A photo-based object recognition function can recognize plants, animals, commodities and the like in a picture from a single photograph, and further display science popularization information or recommend a link to an e-commerce website for purchase.
In one embodiment, the generated candidate regions of interest are shown in fig. 10, where the left side of fig. 10 is the result of Selective Search and the right side of fig. 10 is the result of the method provided in the embodiment of the present application; the result of Selective Search includes a large number of inaccurate ROIs, whereas the result of the method provided by the embodiment of the application contains essentially accurate ROIs, so the method provided by the embodiment of the application improves the recognition accuracy of multi-label image classification.
The models, model training and inference processes in the method provided by the embodiment of the application are all implemented on servers equipped with Intel Xeon 8255C CPUs and NVIDIA Tesla V100 graphics cards; distributed parallel training is performed on 8 V100 graphics cards, and the inference results are generated on them. The code uses Python 3.6.8, and the deep learning framework and libraries used are PyTorch 1.4.0, torchvision 0.5.0, opencv-python 4.5.1, numpy 1.16.1, scikit-learn 0.23.0 and the like.
Referring to fig. 11, fig. 11 shows a flowchart of an image recognition method provided in an embodiment of the present application. The method may be executed by any electronic device, such as a server; for convenience of description, in the following description of some alternative embodiments, a server is taken as the execution subject of the method. As shown in fig. 11, the image recognition method provided in the embodiment of the present application includes the following steps:
S801, inputting the image to be recognized into the feature extraction model, extracting the second feature map from the fourth convolution block of the feature extraction model, and extracting the first feature map from the fifth convolution block of the feature extraction model.
S802, inputting the first feature map into the first fully-connected layer and the maximum pooling layer, and performing classification processing based on each object category in the object category set to obtain the first probability value corresponding to each object category in the image to be recognized.
S803, inputting the first feature map into the 1 × 1 convolution layer for dimension reduction processing and determining the feature map after dimension reduction; inputting the feature map after dimension reduction into the batch normalization layer for batch normalization processing, and determining the activation map corresponding to each object category.
S804, inputting the second feature map and the activation map of each object category into the feature fusion model, and performing up-sampling and linear interpolation processing on the activation map of each object category to obtain the third feature map; and performing position-wise summation between the second feature map and the third feature map to obtain the feature map after feature fusion.
S805, inputting the activation map of each object category into the candidate region of interest selection model, and screening out the background in the activation map of each object category to obtain the activation map of each object category after the background is screened out; and performing edge extraction processing on the background-screened activation maps corresponding to the several highest-ranked first probability values to obtain a plurality of candidate regions of interest of the image to be recognized.
S806, inputting the feature map after feature fusion and the plurality of candidate regions of interest into the candidate region of interest pooling layer, and cropping the feature map after feature fusion to obtain the feature map of the candidate regions of interest.
S807, inputting the feature map of the candidate regions of interest into the second fully-connected layer for classification processing, and determining the second probability value that the image to be recognized belongs to each object category.
In one embodiment, for example, as shown in fig. 3, the feature maps of the candidate region of interest are input to the convolution block 309, so as to obtain the feature map 4 × 2048 × 1, and the feature map 4 × 2048 × 1 is input to the fully-connected layer 310 (the second fully-connected layer), and then the classification process is performed, so as to obtain a second probability value that the picture (the image to be recognized) belongs to each object category.
S808, determining the object category to which the image to be recognized belongs based on the first probability values and the second probability values.
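For readers who prefer code, steps S801 to S808 can be summarised in the following Python sketch. The sub-model attributes (`feature_extractor`, `global_head`, `reduce_1x1`, `batch_norm`, `fuse`, `select_rois`, `local_head`), their call signatures and the use of torchvision's roi_pool for the candidate region of interest pooling layer are assumptions chosen for illustration, not the exact implementation of the embodiment:

```python
import torch
from torchvision.ops import roi_pool

def recognize(image, model, category_threshold):
    """Schematic inference pass following steps S801-S808.

    `model` is assumed to bundle the trained sub-models of the first neural network;
    `local_head` is assumed to aggregate the ROI features into one score per category.
    Returns a boolean mask over the object category set.
    """
    # S801: feature maps from the fourth and fifth convolution blocks
    feat4, feat5 = model.feature_extractor(image)

    # S802: first probability value per category (first fully-connected layer + max pooling)
    p_global = torch.sigmoid(model.global_head(feat5))

    # S803: 1x1 convolution for dimension reduction, then batch normalization -> activation maps
    activation_maps = model.batch_norm(model.reduce_1x1(feat5))

    # S804: upsample/interpolate the activation maps and sum them position-wise with feat4
    fused = model.fuse(feat4, activation_maps)

    # S805: screen out the background and extract edges to obtain candidate regions of interest
    rois = model.select_rois(activation_maps, p_global)   # boxes of shape (K, 5): batch index + x1, y1, x2, y2

    # S806: crop the fused feature map with the candidate region of interest pooling layer
    roi_feats = roi_pool(fused, rois, output_size=(7, 7))

    # S807: second probability value per category from the local branch
    p_local = torch.sigmoid(model.local_head(roi_feats))

    # S808: average the two probability values and compare with the preset category threshold
    mean_scores = (p_global + p_local) / 2.0
    return mean_scores < category_threshold
```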
The application of the embodiment of the application has at least the following beneficial effects:
the feature map after feature fusion contains a plurality of scales, namely, the feature map has high-level semantic understanding features and also contains certain low-level image texture features; determining a plurality of interested candidate regions of the image to be recognized based on the activation map of each object category, thereby realizing focusing on key regions in the image to be recognized and improving the recognition accuracy of the interested candidate regions; and determining a second probability value of the image to be recognized belonging to each object category based on the feature map after feature fusion and the plurality of interested candidate regions, and determining the object category of the image to be recognized belonging to each object category based on the first probability values and the second probability values, so that the multi-classification recognition accuracy of the image to be recognized is improved.
An image recognition apparatus is further provided in the embodiment of the present application, and a schematic structural diagram of the image recognition apparatus is shown in fig. 12, where the image recognition apparatus 90 includes a first processing module 901, a second processing module 902, a third processing module 903, a fourth processing module 904, a fifth processing module 905, and a sixth processing module 906.
A first processing module 901, configured to obtain an image to be identified;
the second processing module 902 is configured to determine a first feature map and a second feature map corresponding to an image to be identified, where a resolution of the first feature map is smaller than a resolution of the second feature map;
a third processing module 903, configured to determine, based on the first feature map and each object class in a preset set of object classes, a first probability value that the image to be recognized belongs to each object class, and an activation map corresponding to each object class;
a fourth processing module 904, configured to perform feature fusion processing between the second feature map and the activation map of each object category, and determine a feature map after feature fusion; determining at least one interested candidate area of the image to be recognized based on the activation map of each object category;
a fifth processing module 905, configured to determine, based on the feature map after feature fusion and the at least one candidate region of interest, a second probability value that the image to be recognized belongs to each object category;
a sixth processing module 906, configured to determine, based on the first probability values and the second probability values, an object class to which the image to be recognized belongs.
In an embodiment, the second processing module 902 is specifically configured to:
inputting the image to be recognized into a feature extraction model of a first neural network, extracting the second feature map from a fourth convolution block of the feature extraction model, and extracting the first feature map from a fifth convolution block of the feature extraction model;
the feature extraction model comprises a first convolution block, a second convolution block, a third convolution block, the fourth convolution block and the fifth convolution block, and a cascade relation exists among the first convolution block, the second convolution block, the third convolution block, the fourth convolution block and the fifth convolution block.
In an embodiment, the third processing module 903 is specifically configured to:
and inputting the first feature map into a first full-connection layer and a maximum pooling layer of the first neural network, and performing classification processing based on each object class in the object class set to obtain a first probability value corresponding to each object class in the image to be identified.
In an embodiment, the third processing module 903 is specifically configured to:
performing dimension reduction processing on the first feature map, and determining a feature map after dimension reduction, wherein the dimension of the feature map after dimension reduction is the same as the number of object types in the object type set;
and inputting the feature map subjected to dimension reduction into a batch normalization layer of the first neural network, performing batch normalization processing, and determining an activation map corresponding to each object type.
In an embodiment, the fourth processing module 904 is specifically configured to:
inputting the second feature map and the activation map of each object category into a feature fusion model of the first neural network, and performing up-sampling and linear interpolation processing on the activation map of each object category to obtain a third feature map;
and carrying out summation processing between the second characteristic diagram and the third characteristic diagram according to the position to obtain the characteristic diagram after the characteristic fusion.
In an embodiment, the fourth processing module 904 is specifically configured to:
inputting the activation map of each object category into an interested candidate region selection model of the first neural network, and screening out the background in the activation map of each object category to obtain the activation map of each object category after the background is screened out;
and sorting the first probability values from large to small, and performing edge extraction processing on the background-screened activation map corresponding to at least one highest-ranked first probability value, so as to obtain at least one candidate region of interest of the image to be identified.
In an embodiment, the fifth processing module 905 is specifically configured to:
inputting the feature map after feature fusion and at least one candidate region of interest into a candidate region of interest pooling layer of a first neural network, and cutting the feature map after feature fusion to obtain a feature map of the candidate region of interest;
and determining a second probability value of the image to be recognized belonging to each object category based on the feature map of the interest candidate region.
In an embodiment, the sixth processing module 906 is specifically configured to:
for one object category in the object category set, if the average value between a first probability value corresponding to the object category and a second probability value corresponding to the object category is smaller than a preset category threshold, determining that the object category exists in the image to be identified.
In one embodiment, the first processing module 901 is further configured to:
inputting the training samples into a second neural network, and determining the value of a first loss function of the global branch prediction model and an activation map of each object class in the training samples; the second neural network comprises a global branch prediction model, a local branch prediction model and a weak supervision model, wherein the global branch prediction model comprises a feature extraction model, a first full connection layer and a maximum pooling layer, and the local branch prediction model comprises a feature fusion model, an interested candidate region selection model, a batch normalization layer, an interested candidate region pooling layer and a second full connection layer;
inputting the activation map of each object class in the training sample into the weak supervision model, suppressing the noise of the activation map of each object class in the training sample, and determining the value of a second loss function of the weak supervision model;
respectively inputting the activation map of each object category in the training sample into the feature fusion model and the interested candidate region selection model, and determining the value of a third loss function of the local branch prediction model;
updating parameters of the second neural network based on the values of the first loss function, the second loss function, and the third loss function;
if the sum of the value of the first loss function, the value of the second loss function and the value of the third loss function is smaller than a preset loss threshold value, finishing the training of the second neural network; and obtaining a second neural network based on the training, and determining a first neural network, wherein the first neural network does not comprise the weak supervision model.
The application of the embodiment of the application has at least the following beneficial effects:
the feature map after feature fusion contains a plurality of scales, namely, the feature map has high-level semantic understanding features and also contains certain low-level image texture features; determining at least one interested candidate region of the image to be recognized based on the activation map of each object category, thereby realizing focusing on a key region in the image to be recognized and improving the recognition accuracy of the interested candidate region; and determining a second probability value of the image to be recognized belonging to each object category based on the feature map after feature fusion and at least one interested candidate region, and determining the object category of the image to be recognized belonging to each object category based on the first probability values and the second probability values, so that the recognition accuracy of multi-classification (multiple labels) of the image to be recognized is improved.
An embodiment of the present application further provides an electronic device, a schematic structural diagram of the electronic device is shown in fig. 13, and an electronic device 4000 shown in fig. 13 includes: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, and the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 13, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be Read by a computer, without limitation.
The memory 4003 is used for storing computer programs for executing the embodiments of the present application, and is controlled by the processor 4001 to execute. The processor 4001 is used to execute computer programs stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
Wherein, the electronic device includes but is not limited to: a server, etc.
The application of the embodiment of the application has at least the following beneficial effects:
the feature map after feature fusion contains a plurality of scales, namely, the feature map has high-level semantic understanding features and also contains certain low-level image texture features; determining at least one interested candidate region of the image to be recognized based on the activation map of each object category, thereby realizing focusing on a key region in the image to be recognized and improving the recognition accuracy of the interested candidate region; and determining a second probability value of the image to be recognized belonging to each object category based on the feature map after feature fusion and at least one interested candidate region, and determining the object category of the image to be recognized belonging to each object category based on the first probability values and the second probability values, so that the recognition accuracy of multi-classification (multiple labels) of the image to be recognized is improved.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and when being executed by a processor, the computer program may implement the steps and corresponding contents of the foregoing method embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the steps and corresponding contents of the foregoing method embodiments can be implemented.
Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application also provides a computer program product or a computer program, which includes computer instructions, and the computer instructions are stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in any of the alternative embodiments of the present application.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as desired, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times, respectively. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.
The foregoing is only an optional implementation manner of a part of implementation scenarios in this application, and it should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of this application are also within the protection scope of the embodiments of this application without departing from the technical idea of this application.

Claims (13)

1. An image recognition method, comprising:
acquiring an image to be identified;
determining a first feature map and a second feature map corresponding to the image to be recognized, wherein the resolution of the first feature map is smaller than that of the second feature map;
determining a first probability value of the image to be recognized belonging to each object class and an activation map corresponding to each object class based on the first feature map and each object class in a preset object class set;
performing feature fusion processing between the second feature map and the activation map of each object type, and determining a feature map after feature fusion; and determining at least one interest candidate region of the image to be recognized based on the activation map of each object category;
determining a second probability value of the image to be recognized belonging to each object category based on the feature fused feature map and the at least one candidate region of interest;
and determining the object class to which the image to be recognized belongs based on the first probability values and the second probability values.
2. The method according to claim 1, wherein the determining the first feature map and the second feature map corresponding to the image to be recognized comprises:
inputting the image to be recognized into a feature extraction model of a first neural network, extracting a second feature map from a fourth convolution block of the feature extraction model, and extracting a first feature map from a fifth convolution block of the feature extraction model;
the feature extraction model comprises a first convolution block, a second convolution block, a third convolution block, the fourth convolution block and the fifth convolution block, and a cascade relation exists among the first convolution block, the second convolution block, the third convolution block, the fourth convolution block and the fifth convolution block.
3. The method according to claim 1, wherein the determining a first probability value that the image to be recognized belongs to each object class in a preset object class set based on the first feature map comprises:
and inputting the first feature map into a first full-connection layer and a maximum pooling layer of the first neural network, and performing classification processing on the basis of each object class in the object class set to obtain a first probability value corresponding to each object class in the image to be recognized.
4. The method according to claim 1, wherein the determining an activation map corresponding to each object class based on the first feature map and each object class in a preset set of object classes comprises:
performing dimension reduction processing on the first feature map, and determining a feature map after dimension reduction, wherein the dimension of the feature map after dimension reduction is the same as the number of object classes in the object class set;
and inputting the feature map subjected to dimension reduction into a batch normalization layer of the first neural network, performing batch normalization processing, and determining an activation map corresponding to each object type.
5. The method according to claim 1, wherein the performing a feature fusion process between the second feature map and the activation map of each object class to determine a feature map after feature fusion comprises:
inputting the second feature map and the activation map of each object class into a feature fusion model of the first neural network, and performing up-sampling and linear interpolation processing on the activation map of each object class to obtain a third feature map;
and carrying out bitwise summation processing between the second characteristic diagram and the third characteristic diagram to obtain a characteristic diagram after characteristic fusion.
6. The method according to claim 1, wherein the determining at least one candidate region of interest of the image to be recognized based on the activation map for each object category comprises:
inputting the activation map of each object category into a candidate region of interest selection model of the first neural network, and screening out the background in the activation map of each object category to obtain the activation map of each object category after the background is screened out;
and sorting the first probability values from large to small, and performing edge extraction processing on the background-screened activation map corresponding to at least one highest-ranked first probability value, so as to obtain at least one candidate region of interest of the image to be recognized.
7. The method according to claim 1, wherein the determining a second probability value that the image to be recognized belongs to the each object category based on the feature-fused feature map and the at least one candidate region of interest comprises:
inputting the feature map after feature fusion and the at least one candidate region of interest into a candidate region of interest pooling layer of the first neural network, and performing clipping processing on the feature map after feature fusion to obtain a feature map of the candidate region of interest;
and determining a second probability value of the image to be recognized belonging to each object category based on the feature map of the interest candidate region.
8. The method of claim 1, wherein determining the object class to which the image to be recognized belongs based on the first probability values and the second probability values comprises:
and for one object class in the object class set, if an average value between a first probability value corresponding to the one object class and a second probability value corresponding to the one object class is smaller than a preset class threshold, determining that the one object class exists in the image to be identified.
9. The method of claim 1, prior to the obtaining the image to be identified, further comprising:
inputting training samples into a second neural network, and determining the value of a first loss function of a global branch prediction model and an activation map of each object class in the training samples; the second neural network comprises a global branch prediction model, a local branch prediction model and a weak supervision model, wherein the global branch prediction model comprises a feature extraction model, a first full-link layer and a maximum pooling layer, and the local branch prediction model comprises a feature fusion model, an interested candidate region selection model, a batch normalization layer, an interested candidate region pooling layer and a second full-link layer;
determining a value of a second loss function of the weakly supervised model based on inputting the activation map of each object class in the training sample to the weakly supervised model, suppressing noise of the activation map of each object class in the training sample;
inputting the activation map of each object category in the training sample into the feature fusion model and the interested candidate region selection model respectively, and determining the value of a third loss function of the local branch prediction model;
updating parameters of the second neural network based on the values of the first, second, and third loss functions;
if the sum of the value of the first loss function, the value of the second loss function and the value of the third loss function is smaller than a preset loss threshold value, ending the training of the second neural network; and determining the first neural network based on the second neural network obtained through training, wherein the first neural network does not comprise the weak supervision model.
10. An image recognition apparatus, comprising:
the first processing module is used for acquiring an image to be identified;
the second processing module is used for determining a first feature map and a second feature map corresponding to the image to be identified, and the resolution of the first feature map is smaller than that of the second feature map;
a third processing module, configured to determine, based on the first feature map and each object class in a preset set of object classes, a first probability value that the image to be recognized belongs to each object class, and an activation map corresponding to each object class;
the fourth processing module is used for performing feature fusion processing between the second feature map and the activation map of each object class to determine a feature map after feature fusion; and determining at least one interest candidate region of the image to be recognized based on the activation map of each object category;
a fifth processing module, configured to determine, based on the feature map after feature fusion and the at least one candidate region of interest, a second probability value that the image to be recognized belongs to each object category;
and the sixth processing module is used for determining the object category to which the image to be recognized belongs based on the first probability values and the second probability values.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method of any of claims 1-9.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
13. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1-9 when executed by a processor.
CN202210210792.XA 2022-03-04 2022-03-04 Image recognition method, device, equipment, readable storage medium and program product Pending CN114581710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210210792.XA CN114581710A (en) 2022-03-04 2022-03-04 Image recognition method, device, equipment, readable storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210210792.XA CN114581710A (en) 2022-03-04 2022-03-04 Image recognition method, device, equipment, readable storage medium and program product

Publications (1)

Publication Number Publication Date
CN114581710A true CN114581710A (en) 2022-06-03

Family

ID=81774406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210210792.XA Pending CN114581710A (en) 2022-03-04 2022-03-04 Image recognition method, device, equipment, readable storage medium and program product

Country Status (1)

Country Link
CN (1) CN114581710A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023246641A1 (en) * 2022-06-24 2023-12-28 华为云计算技术有限公司 Method and apparatus for identifying object, and storage medium
CN116071628A (en) * 2023-02-06 2023-05-05 北京百度网讯科技有限公司 Image processing method, device, electronic equipment and storage medium
CN116071628B (en) * 2023-02-06 2024-04-05 北京百度网讯科技有限公司 Image processing method, device, electronic equipment and storage medium
CN116665020A (en) * 2023-07-31 2023-08-29 国网浙江省电力有限公司 Image recognition method, device, equipment and storage medium based on operator fusion
CN116665020B (en) * 2023-07-31 2024-04-12 国网浙江省电力有限公司 Image recognition method, device, equipment and storage medium based on operator fusion

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN114581710A (en) Image recognition method, device, equipment, readable storage medium and program product
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111932577B (en) Text detection method, electronic device and computer readable medium
CN111191654A (en) Road data generation method and device, electronic equipment and storage medium
Chen et al. Vectorization of historical maps using deep edge filtering and closed shape extraction
Xia et al. Weakly supervised multimodal kernel for categorizing aerial photographs
CN114596566B (en) Text recognition method and related device
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114283350A (en) Visual model training and video processing method, device, equipment and storage medium
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
CN113762039A (en) Information matching method and related device for traffic sign board
Qu et al. Improved YOLOv5-based for small traffic sign detection under complex weather
CN113822134A (en) Instance tracking method, device, equipment and storage medium based on video
Li et al. Alpha-SGANet: A multi-attention-scale feature pyramid network combined with lightweight network based on Alpha-IoU loss
Arulananth et al. Semantic segmentation of urban environments: Leveraging U-Net deep learning model for cityscape image analysis
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN111915703B (en) Image generation method and device
Yu et al. Construction of garden landscape design system based on multimodal intelligent computing and deep neural network
Ding Scene parsing with deep neural networks
CN116049660B (en) Data processing method, apparatus, device, storage medium, and program product
CN117095244B (en) Infrared target identification method, device, equipment and medium
CN116612466B (en) Content identification method, device, equipment and medium based on artificial intelligence
CN117974988B (en) Lightweight target detection method, lightweight target detection device and computer program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination