CN111931859A - Multi-label image identification method and device - Google Patents

Info

Publication number
CN111931859A
CN111931859A (application CN202010883534.9A)
Authority
CN
China
Prior art keywords
category
network
graph
dynamic
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010883534.9A
Other languages
Chinese (zh)
Other versions
CN111931859B (en)
Inventor
乔宇
彭小江
叶锦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010883534.9A priority Critical patent/CN111931859B/en
Publication of CN111931859A publication Critical patent/CN111931859A/en
Application granted granted Critical
Publication of CN111931859B publication Critical patent/CN111931859B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Abstract

The invention discloses a multi-label image identification method and device. The device includes: a semantic attention module for separating the feature map output by the backbone network into a plurality of category features; and a dynamic graph convolution network module for modeling the relationships among the plurality of category features using a dynamic graph convolution network, the dynamic graph convolution network comprising a static graph and a dynamic graph, wherein the static graph is used for acquiring the global correlation of the image and the dynamic graph is used for acquiring the local correlation of the image. The invention can improve the accuracy of image recognition, has strong independence and robustness, and can be applied to image recognition in various scenes.

Description

Multi-label image identification method and device
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-label image identification method and device.
Background
In recent years, Graph Neural Networks (GNNs) have been widely used in computer vision and natural language processing (NLP). A graph neural network models the pairwise relationships between node features to obtain the correlations among different nodes, which improves the expressive power of the node features and in turn the accuracy of the target task. The basic unit of a graph neural network consists of a relation-modelling layer and a state-update layer, and a graph neural network is generally composed of n (n ≥ 1) such basic units. In the relation-modelling layer, the common practice is to use a graph to model the relationships between nodes. The name of the network depends on the state-update layer: if the state-update layer consists of convolutional layers, the network is generally called a Graph Convolutional Network (GCN); if it consists of a recurrent neural network (RNN) or other mechanisms, it is simply called a graph neural network. In the multi-label classification task, the idea of graph neural networks was first applied to multi-label classification with good results; subsequently, graph neural networks were used to model image features and achieved excellent results on public data sets.
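As an illustration of such a basic unit, the following sketch (a hypothetical NumPy implementation written for this description, not code from the patent or the cited papers) chains a relation-modelling step (multiplication by a normalised adjacency matrix) with a state-update step (a learned linear transform and a non-linearity):

```python
import numpy as np

def gcn_layer(node_feats, adj, weight):
    """One basic graph-convolution unit: relation modelling via the
    normalised adjacency matrix, then a state update via a learned
    linear transform and a LeakyReLU non-linearity."""
    # Row-normalise the adjacency so each node averages its neighbours.
    deg = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / np.maximum(deg, 1e-12)
    # Relation modelling: aggregate neighbour features.
    aggregated = norm_adj @ node_feats           # shape (c, d)
    # State update: linear transform + LeakyReLU.
    updated = aggregated @ weight                # shape (c, d_out)
    return np.where(updated > 0, updated, 0.2 * updated)

c, d = 4, 8                                      # 4 nodes, 8-dim features
rng = np.random.default_rng(0)
feats = rng.normal(size=(c, d))
adj = np.ones((c, c))                            # fully connected toy graph
w = rng.normal(size=(d, d))
out = gcn_layer(feats, adj, w)
print(out.shape)                                 # (4, 8)
```

Stacking n such units (each with its own weight matrix) yields the n-unit graph neural network described above.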
The paper "Learning semantic-specific graph representation for multi-label image recognition", published at the ICCV 2019 conference, designs a method that first separates the image features into a plurality of nodes suitable for relational modelling, then models the relationships between the nodes with a graph neural network to improve the expressive power of the features, and finally improves the recognition accuracy of multi-label classification on public data sets. Its specific steps are: (1) extract image features with a CNN backbone network (backbone, e.g. ResNet101); the extracted features are the feature maps of the last convolutional layer; (2) separate the feature map: if the number of target categories is c, the feature map is divided into c n-dimensional features. The separation relies on the textual information of the labels: the text feature of each label is encoded (embedding) and then interacted with the feature map to obtain the feature of the corresponding category; (3) after the feature of each category is obtained, the features are modelled with a graph neural network. For the relation-modelling layer, the method uses the co-occurrence frequency of every pair of categories, counted on the training data, as the weights of the graph, so the inter-category relationships are fixed for every input image. For the state-update layer, the method uses a GRU (Gated Recurrent Unit); (4) classify the features output by the graph neural network with a classifier (specifically, a fully-connected layer).
Through analysis, the prior art mainly has the following defects:
1) In general, the content of each input image is different and the categories it contains differ considerably, yet the graphs constructed by the existing method are static, i.e. every input image shares one relation graph. Such a method suppresses categories with low co-occurrence frequency, so it is difficult for a static-graph method to further improve the recognition accuracy of multi-label classification.
2) The existing static-graph construction requires probability statistics over the data set in advance, which makes the model more complex and less robust.
3) The feature-map separation method used by the existing multi-label classification approach is overly complex, which significantly affects the memory footprint and speed of the model.
Disclosure of Invention
The present invention is directed to overcoming the above drawbacks of the prior art by providing a multi-label image recognition method and apparatus that can be connected to any type of backbone network to achieve more accurate image classification results.
According to a first aspect of the present invention, a multi-label image recognition apparatus is provided. The device includes:
semantic attention module: the system comprises a main network, a plurality of backbone networks and a plurality of network interfaces, wherein the main network is used for outputting a feature map;
dynamic graph convolution network module: for modeling the relationship between the plurality of class features using a dynamic graph convolution network comprising a static graph for obtaining a global correlation of the image and a dynamic graph for obtaining a local correlation of the image.
According to a second aspect of the present invention, a multi-label image recognition method is provided. The method comprises the following steps:
separating the feature map output by the backbone network into a plurality of categories of features;
the relationship between the plurality of class features is modeled using a dynamic graph convolutional neural network comprising a static graph for obtaining global correlations of the image and a dynamic graph for obtaining local correlations of the image.
Compared with the prior art, the invention has the advantages that any backbone network can be connected, giving strong independence, and the accuracy of the final classification result is improved; the accuracy is improved without degrading the computation speed or the video-memory footprint; in addition, the way of constructing the graph is dynamic and requires no prior statistics, which gives good robustness.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a static diagram illustration of the prior art;
FIG. 2 is a dynamic diagram illustration in accordance with one embodiment of the present invention;
FIG. 3 is a process diagram of multi-label image recognition according to one embodiment of the invention;
FIG. 4 is a schematic diagram of the structure of a semantic attention module and a dynamic graph convolution module according to one embodiment of the present invention;
FIG. 5 is a process diagram of a dynamic graph convolutional network according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of the structure of a dynamic graph convolution network in accordance with one embodiment of the present invention;
FIG. 7 is a schematic diagram of an application scenario according to one embodiment of the present invention;
FIG. 8 is a schematic diagram of an application scenario according to another embodiment of the present invention;
fig. 9 is a schematic application diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In the prior art, when a graph neural network is used to model the relationships among image features, a relation matrix (i.e. the pairwise relation matrix between categories; if the number of categories is c, the matrix has size c × c, and it represents the strength of the correlation between every pair of categories) must be computed in advance, and it is computed by counting the frequency with which categories co-occur in the training set. Once computed, the relation matrix is fixed during both training and testing. Because it is computed and fixed in advance, the relation matrix is the same for every input image. As shown in fig. 1, the pre-constructed relation matrix between truck, car, toilet, and person is fixed. This approach can cause serious problems. Suppose, for example, that "car" and "truck" co-occur frequently in the data set while "car" and "toilet" co-occur rarely. Using a fixed relation matrix constructed from such a data set for subsequent image recognition has the following problems: 1) if a "truck" is not present in an image but a "car" is, the "car" may fail to be identified; 2) the "truck" category may be mistakenly identified in a scene containing only a "car"; 3) in a picture where "car" and "toilet" appear together, only "car" may be recognised and "toilet" missed.
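The static relation matrix criticised above is built purely from label statistics. A minimal sketch of that prior-art construction (illustrative NumPy code with a made-up toy label set, not the patent's or any cited paper's implementation):

```python
import numpy as np

def cooccurrence_matrix(label_sets, num_classes):
    """Count how often every pair of classes co-occurs in the training
    labels, then convert the counts to conditional frequencies
    P(j | i) = count(i and j) / count(i)."""
    counts = np.zeros((num_classes, num_classes))
    occur = np.zeros(num_classes)
    for labels in label_sets:
        for i in labels:
            occur[i] += 1
            for j in labels:
                if i != j:
                    counts[i, j] += 1
    return counts / np.maximum(occur[:, None], 1)

# Toy training labels: 0=car, 1=truck, 2=toilet, 3=person
train_labels = [{0, 1}, {0, 1}, {0, 3}, {2, 3}]
A = cooccurrence_matrix(train_labels, 4)
print(A[0, 1])  # P(truck | car): 2 of the 3 "car" images also contain "truck"
print(A[0, 2])  # P(toilet | car) = 0: the pair never co-occurs
```

Because A is frozen after this pass over the training set, every test image is scored against the same matrix, which is exactly the failure mode the fixed-graph examples above describe.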
To address these problems of the prior art, the invention proposes a dynamic-graph idea. As shown in fig. 2, this embodiment proposes a simple and effective attention-based dynamic graph convolution network (ADD-GCN) to remove the constraint that every input image shares one relation matrix; that is, the relation matrix differs for different input images.
Specifically, referring to fig. 3, in the multi-label image recognition apparatus provided in this embodiment, the ADD-GCN comprises a Semantic Attention Module (SAM) and a Dynamic Graph Convolutional Network (D-GCN). The SAM (which may also be called a feature-separation module) separates the feature map output by the backbone network, for example into c category features (c being the number of categories). The D-GCN obtains the global correlation of the image through a static graph and the individual/local correlation of the image through a dynamic graph, enhancing the expressive power of the features. In the description herein, the D-GCN is also referred to as a static-dynamic graph convolution module; it computes the relationships between nodes by constructing the dynamic graph itself, which greatly improves the classification precision.
In fig. 3, two branch structures connecting a backbone network are illustrated, wherein one branch structure globally pools a feature map output by the backbone network and transmits a pooling result to a classifier to obtain a first classification result; the other branch structure comprises a semantic attention module and a dynamic graph convolution network designed by the invention, and obtains a second classification result through a classifier, and the average of the first classification result and the second classification result is taken as a final classification result, so that the classification precision can be further improved.
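The two-branch fusion just described can be sketched as follows (hypothetical scores and a hypothetical decision threshold for illustration; the patent specifies only that the two branch results are averaged):

```python
import numpy as np

def fuse_branches(global_scores, gcn_scores, threshold=0.5):
    """Average the per-class scores of the global-pooling branch and the
    SAM + D-GCN branch, then threshold to get the predicted label set.
    The 0.5 threshold is an illustrative choice, not from the patent."""
    final = (np.asarray(global_scores) + np.asarray(gcn_scores)) / 2.0
    return final, final >= threshold

# Per-class scores for 3 classes from the two branches of fig. 3
scores, preds = fuse_branches([0.9, 0.2, 0.6], [0.7, 0.4, 0.2])
print(scores)   # [0.8 0.3 0.4]
print(preds)    # [ True False False]
```

Averaging lets the plain global-pooling branch act as a regulariser on the graph branch, which is how the embodiment squeezes out the extra precision.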
The semantic attention module and the dynamic graph convolution network that are characteristic of the present invention will be described in detail below, and the validation results on the public data set will be analyzed.
1. The semantic attention module
As shown in fig. 3 and 4, the semantic attention module performs feature separation on the feature map output by the backbone network: the feature map is separated into c category feature vectors V1_1, V1_2, …, V1_c, one per category, where c is the number of categories.
In fig. 4, the semantic attention module performs feature separation with an attention mechanism: it obtains a heat map for each category and the weights corresponding to the attention regions of the heat maps, and uses them to determine the separated feature vectors.
In addition, it should be noted that the semantic attention module performs feature separation with a modified CAM (Class Activation Mapping) technique and needs no auxiliary text-embedding features. Specifically, the conventional CAM globally pools the feature map and then classifies it with a fully-connected (FC) layer, whereas in this embodiment each point on the feature map is classified directly: a 1×1 convolution classifies every point of the feature map to obtain a classified score map, top-k max pooling (e.g. 1 ≤ k ≤ 5) is then applied to the score map, and the k retained responses are averaged to obtain the classification result.
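The top-k pooling step of this modified CAM can be sketched as follows (a NumPy toy; the per-class score map is assumed to come from the 1×1 convolution described above, and the k value is an illustrative choice within the stated range):

```python
import numpy as np

def topk_average_score(class_score_map, k=4):
    """class_score_map: an (h, w) per-class score map produced by the
    1x1 convolution over the backbone feature map. Keep the k highest
    responses and average them to get the per-class score."""
    flat = np.sort(class_score_map.ravel())[::-1]  # descending
    return flat[:k].mean()

# 2x2 toy score map for one class
score_map = np.array([[0.9, 0.1],
                      [0.8, 0.2]])
print(topk_average_score(score_map, k=2))  # averages 0.9 and 0.8
```

Compared with global average pooling, averaging only the top responses keeps the score sensitive to small objects that activate few locations.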
2. The dynamic graph convolution network
Referring to fig. 4 and 5, the dynamic graph convolution network (or static-dynamic graph convolution module) comprises a static-graph network and a dynamic-graph network, whose parameters are learned automatically by the network. The static graph is the same for every input image and computes the global relationships between features; the relation matrix of the dynamic graph differs for each input image and computes the individual/local relationships between features.
Specifically, referring to fig. 6, the operation process of the dynamic graph convolution network includes:
the static graph uses its fixed relation matrix to convert the category feature vectors V1_1, V1_2, …, V1_c output by the semantic attention module into feature vectors V2_1, V2_2, …, V2_c;
the global average pooling (GAP) value Vg of the static-graph output is obtained;
V2_1, V2_2, …, V2_c and Vg are fused and then passed through a convolution to obtain the dynamic relation matrix;
the feature vectors V2_1, V2_2, …, V2_c are fused with the dynamic relation matrix and converted into feature vectors V3_1, V3_2, …, V3_c, i.e. the output of the dynamic graph convolution network.
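The steps above can be sketched end to end (a hypothetical NumPy forward pass; random matrices stand in for the learned static relation matrix, the state-update transforms, and the 1×1 convolution, so the shapes follow the description but none of the numbers come from the patent):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def dynamic_gcn(V1, A_static, W_s, W_d, conv_w):
    """V1: (c, d) category features from the semantic attention module.
    Static step: the fixed (learned) relation matrix produces V2.
    Dynamic step: V2 is fused with its global average pool, mapped to an
    input-dependent relation matrix, and used to produce V3."""
    c, d = V1.shape
    # 1) static graph: V2 = relu(A_static @ V1 @ W_s)
    V2 = relu(A_static @ V1 @ W_s)                  # (c, d)
    # 2) global average pooling over the c nodes
    Vg = V2.mean(axis=0, keepdims=True)             # (1, d)
    # 3) fuse V2 with Vg, map to a dynamic relation matrix (c, c)
    fused = np.concatenate([V2, np.repeat(Vg, c, axis=0)], axis=1)  # (c, 2d)
    A_dyn = fused @ conv_w                          # stand-in for the conv
    A_dyn = 1 / (1 + np.exp(-A_dyn))                # sigmoid-normalised
    # 4) dynamic graph: V3 = relu(A_dyn @ V2 @ W_d)
    return relu(A_dyn @ V2 @ W_d)                   # (c, d)

rng = np.random.default_rng(1)
c, d = 5, 16
V1 = rng.normal(size=(c, d))
out = dynamic_gcn(V1,
                  A_static=rng.normal(size=(c, c)),
                  W_s=rng.normal(size=(d, d)),
                  W_d=rng.normal(size=(d, d)),
                  conv_w=rng.normal(size=(2 * d, c)))
print(out.shape)                                    # (5, 16)
```

Note that A_dyn is recomputed from the current image's own features, which is what makes the relation matrix differ per input.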
In the invention, the dynamic graph convolution network computes both static and dynamic relationships, which suits accurate recognition of different images, and the introduction of the dynamic graph removes the need for prior knowledge. Compared with the existing method, which builds a static graph from data-set statistics as the relation matrix of the relation-modelling layer of a graph neural network, the invention improves image recognition accuracy and has a wider range of application.
The invention can be used for various types of electronic equipment and realizes the classification and identification of the input images. Electronic devices include, but are not limited to: smart phones, tablet electronic devices, portable computers, desktop computers, Personal Digital Assistants (PDAs), in-vehicle devices, smart wearable devices, and the like.
To further verify the effect of the present invention, comparative experiments were performed against the prior art; see tables 1 and 2, where the last row is the result of the present invention (the individual prior-art techniques are not described one by one). It can be seen that the method substantially improves recognition accuracy on the VOC 2007 and MS-COCO 2014 data sets, while its computation speed and video-memory occupation are basically equal to, or even better than, the prior art.
TABLE 1 Experimental results on MS-COCO 2014 data set
[table reproduced only as an image in the original publication]
TABLE 2 Experimental results on VOC 2007 data set
[table reproduced only as an image in the original publication]
The method can be applied to various scenes. For example, as shown in fig. 7, it can identify clothing images to check whether an existing classification is correct, classify clothing attributes, or automatically recommend the clothing categories a user needs according to the recognition result. As shown in fig. 8, the invention can perform multi-label classification directly on uploaded images, or on images published by bloggers, uploaded from a user's mobile phone, or collected by a social app and stored on a cloud server, in order to recognise whether the images contain illegal content, automatically sort a user's images, or automatically recommend image content the user may like. In addition, the invention can be applied to various types of terminals such as a mobile phone or an iPad, as shown in fig. 9.
In summary, the present invention is applicable to the following aspects:
1) Intelligent image auditing
Auditing and screening massive image data: for example, detection of pornographic images, intelligent review of image data uploaded by users, review of images in shopping-platform users' comments, and so on.
2) Intelligent auxiliary labeling
Since the semantic attention module provides a heat map for each category, the invention can locate the relevant regions of a target category using only image-level supervision. This facilitates work that would otherwise require fine annotation: for example, when detection boxes or segmentation regions need to be labelled, roughly locating the region of each category first can save a great deal of later annotation effort.
3) Intelligent clothing goods classification
Manually classifying clothing commodities is time-consuming and labour-intensive. The invention can classify the different attributes of clothing, and commodities can then be filed under the corresponding attributes according to the classification result. For example, a skirt may have the following attributes: short collar, long sleeves, short skirt, red, and so on. The method accurately identifies the relevant attributes of the skirt and files it under the corresponding attribute categories, which saves considerable labour costs.
In summary, the multi-label image identification method provided by the invention has the following advantages:
1) Strong independence. The semantic attention module and the D-GCN module can be connected behind any backbone network, and the backbone can be replaced at will without further modification. With a more powerful backbone network (such as SENet or EfficientNet), the accuracy of the final classification result can be improved further.
2) Improved precision without degrading the computation speed or video-memory footprint; the method compares favourably in every respect.
3) Good robustness. Existing graph-structure methods must perform probability statistics on the training data set in advance. Such statistics can produce large bias errors if the data set is few-shot (small-sample) or the training and test distributions differ. The invention's way of constructing the graph is dynamic and needs no prior statistics, a clear advantage over the prior art.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of the computer-readable program instructions, so that the electronic circuit can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A multi-label image recognition device comprising:
semantic attention module: the system comprises a main network, a plurality of backbone networks and a plurality of network interfaces, wherein the main network is used for outputting a feature map;
dynamic graph convolution network module: for modeling the relationship between the plurality of class features using a dynamic graph convolution network comprising a static graph for obtaining a global correlation of the image and a dynamic graph for obtaining a local correlation of the image.
2. The multi-label image recognition device of claim 1, wherein the semantic attention module is configured to perform:
extracting a heatmap of each category from the feature map output by the backbone network;
obtaining a weight corresponding to the attention region of each heatmap;
multiplying the heatmap of each category by the corresponding weight to obtain the separated plurality of category features.
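By way of illustration, the three steps above can be sketched as follows; the 1x1-convolution weights, the tensor shapes, and the use of a spatial softmax to derive the attention weights are assumptions for exposition, not the patented implementation:

```python
import numpy as np

def semantic_attention(feature_map, class_conv_w):
    """Illustrative sketch of the claimed semantic attention module.

    feature_map:  (C, H, W) feature map from the backbone network
    class_conv_w: (K, C) hypothetical 1x1-convolution weights, one row per class
    Returns K category feature vectors of dimension C.
    """
    C, H, W = feature_map.shape
    flat = feature_map.reshape(C, H * W)              # (C, HW)
    # Step 1: one class-specific heatmap (spatial activation map) per category.
    heatmaps = class_conv_w @ flat                    # (K, HW)
    # Step 2: a softmax over spatial positions assigns a weight to each
    # attention region of the heatmap.
    w = np.exp(heatmaps - heatmaps.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # (K, HW)
    # Step 3: multiply the weights with the features and sum over space,
    # yielding one separated feature vector per category.
    class_features = w @ flat.T                       # (K, C)
    return class_features
```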
3. The multi-label image recognition device of claim 1, wherein the semantic attention module directly classifies each point on the input feature map by a 1x1 convolution to obtain a classification feature map, then performs top-k max pooling on the classification feature map, and averages the k results retained after the max pooling as the classification result, where k is a preset number.
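The top-k pooling of claim 3 can be sketched as below; the weight matrix and shapes are hypothetical stand-ins for the learned 1x1 convolution:

```python
import numpy as np

def topk_classify(feature_map, conv_w, k=4):
    """Sketch of claim 3: a 1x1 convolution scores every spatial point for
    every class, then the k largest scores per class are averaged.

    feature_map: (C, H, W) input feature map
    conv_w:      (K, C) hypothetical 1x1-convolution weights
    """
    C, H, W = feature_map.shape
    # Classify each of the H*W points: one score per (class, position).
    scores = conv_w @ feature_map.reshape(C, H * W)   # (K, HW)
    # Top-k max pooling: keep the k largest scores for each class.
    topk = np.sort(scores, axis=1)[:, -k:]            # (K, k)
    # Average the retained k results as the classification result.
    return topk.mean(axis=1)                          # (K,)
```

Averaging the top k responses rather than taking a single maximum makes the score less sensitive to one spurious activation.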
4. The multi-label image recognition device of claim 1, wherein modeling relationships between the plurality of class features using a dynamic graph convolution network comprises:
calculating the global relationship among the category feature vectors output by the semantic attention module based on the fixed relation matrix of the static graph to obtain a first category feature vector;
obtaining a corresponding global average pooling value Vg for the first category feature vector;
fusing the first category feature vector with Vg, and then obtaining a dynamic relation matrix through convolution processing;
obtaining a second category feature vector based on the first category feature vector and the dynamic relation matrix, the second category feature vector serving as the output of the dynamic graph convolution network module.
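The four steps of claim 4 admit a minimal single-layer sketch; the concatenation-based fusion, the linear map standing in for the convolution, and the row-softmax normalization of the dynamic matrix are assumptions, not the claimed design:

```python
import numpy as np

def dynamic_gcn_layer(class_features, static_adj, conv_w):
    """Sketch of one dynamic graph convolution step (claim 4).

    class_features: (K, D) category feature vectors from the attention module
    static_adj:     (K, K) fixed relation matrix of the static graph
    conv_w:         (2D, K) hypothetical weights producing the dynamic matrix
    """
    K, D = class_features.shape
    # Step 1: propagate over the static graph -> first category feature vectors.
    h1 = static_adj @ class_features                      # (K, D)
    # Step 2: global average pooling over the K nodes -> Vg.
    v_g = h1.mean(axis=0, keepdims=True)                  # (1, D)
    # Step 3: fuse each node with Vg, then a convolution (here a linear map)
    # produces the input-dependent dynamic relation matrix.
    fused = np.concatenate([h1, np.repeat(v_g, K, axis=0)], axis=1)  # (K, 2D)
    dyn = fused @ conv_w                                  # (K, K)
    dyn = np.exp(dyn) / np.exp(dyn).sum(axis=1, keepdims=True)
    # Step 4: propagate over the dynamic graph -> second category feature vectors.
    h2 = dyn @ h1                                         # (K, D)
    return h2
```

Because the dynamic matrix is computed from the current image's features, it captures image-specific (local) label correlations, while the static matrix encodes dataset-level (global) ones.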
5. The multi-label image recognition device of claim 1, further comprising a first classifier, a second classifier and a pooling layer, wherein the pooling layer globally pools the feature map output by the backbone network and passes the pooled result to the first classifier to obtain a first classification result, the second classifier is connected to the dynamic graph convolution network module and is used for obtaining a second classification result, and the average of the first classification result and the second classification result is used as the final classification result.
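The two-branch fusion of claim 5 can be sketched as a simple average of the two classifier outputs; all weights and shapes below are hypothetical:

```python
import numpy as np

def fused_scores(feature_map, fc1_w, gcn_features, fc2_w):
    """Sketch of claim 5: average a global-pooling classifier and a
    classifier on the dynamic graph convolution network output.

    feature_map:  (C, H, W) backbone feature map
    fc1_w:        (K, C) hypothetical first-classifier weights
    gcn_features: (K, D) second category feature vectors from the GCN module
    fc2_w:        (D,) hypothetical per-class second-classifier weights
    """
    pooled = feature_map.mean(axis=(1, 2))   # global pooling of the feature map
    s1 = fc1_w @ pooled                      # first classification result  (K,)
    s2 = gcn_features @ fc2_w                # second classification result (K,)
    return (s1 + s2) / 2.0                   # average as the final result
```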
6. A multi-label image identification method comprises the following steps:
separating the feature map output by a backbone network into a plurality of category features;
modeling the relationship between the plurality of category features using a dynamic graph convolution network, wherein the dynamic graph convolution network comprises a static graph for obtaining a global correlation of the image and a dynamic graph for obtaining a local correlation of the image.
7. The multi-label image recognition method of claim 6, wherein separating the feature map output by the backbone network into a plurality of class features comprises:
extracting a heatmap of each category from the feature map output by the backbone network;
obtaining a weight corresponding to the attention region of each heatmap;
multiplying the heatmap of each category by the corresponding weight to obtain the separated plurality of category features.
8. The multi-label image recognition method of claim 6 wherein modeling relationships between the plurality of class features using a dynamic graph convolution network comprises:
calculating the global relationship among the plurality of category feature vectors based on the fixed relation matrix of the static graph to obtain a first category feature vector;
obtaining a corresponding global average pooling value Vg for the first category feature vector;
fusing the first category feature vector with Vg, and then obtaining a dynamic relation matrix through convolution processing;
obtaining a second category feature vector by using the first category feature vector and the dynamic relation matrix.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 6-8.
10. An electronic device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the steps of the method of any of claims 6 to 8 are implemented when the processor executes the program.
CN202010883534.9A 2020-08-28 2020-08-28 Multi-label image recognition method and device Active CN111931859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010883534.9A CN111931859B (en) 2020-08-28 2020-08-28 Multi-label image recognition method and device


Publications (2)

Publication Number Publication Date
CN111931859A true CN111931859A (en) 2020-11-13
CN111931859B CN111931859B (en) 2023-10-24

Family

ID=73309801


Country Status (1)

Country Link
CN (1) CN111931859B (en)


Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679491A (en) * 2017-09-29 2018-02-09 华中师范大学 A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN110909942A (en) * 2019-11-27 2020-03-24 第四范式(北京)技术有限公司 Method and system for training model and method and system for predicting sequence data
CN111143509A (en) * 2019-12-09 2020-05-12 天津大学 Dialog generation method based on static-dynamic attention variation network
CN111259836A (en) * 2020-01-20 2020-06-09 浙江大学 Video pedestrian re-identification method based on dynamic graph convolution representation
CN111291819A (en) * 2020-02-19 2020-06-16 腾讯科技(深圳)有限公司 Image recognition method and device, electronic equipment and storage medium
CN111325237A (en) * 2020-01-21 2020-06-23 中国科学院深圳先进技术研究院 Image identification method based on attention interaction mechanism
CN111353988A (en) * 2020-03-03 2020-06-30 成都大成均图科技有限公司 KNN dynamic self-adaptive double-image convolution image segmentation method and system
CN111414963A (en) * 2020-03-19 2020-07-14 北京市商汤科技开发有限公司 Image processing method, device, equipment and storage medium
CN111444367A (en) * 2020-03-24 2020-07-24 哈尔滨工程大学 Image title generation method based on global and local attention mechanism
CN111444889A (en) * 2020-04-30 2020-07-24 南京大学 Fine-grained action detection method of convolutional neural network based on multi-stage condition influence
CN111583263A (en) * 2020-04-30 2020-08-25 北京工业大学 Point cloud segmentation method based on joint dynamic graph convolution


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BINGZHI CHEN et al.: "Label Co-Occurrence Learning With Graph Convolutional Networks for Multi-Label Chest X-Ray Image Classification", IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 8, pages 2292-2302, XP011802596, DOI: 10.1109/JBHI.2020.2967084 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884732A (en) * 2021-02-07 2021-06-01 常州大学 Unsupervised defect extraction method and unsupervised defect extraction device based on graph network attention mechanism
CN112884732B (en) * 2021-02-07 2023-09-26 常州大学 Unsupervised defect extraction method and device based on graph network attention mechanism
CN112836134A (en) * 2021-03-17 2021-05-25 苏州帕普云数科技有限公司 Personalized push management method based on behavior data analysis
CN112836134B (en) * 2021-03-17 2023-08-22 苏州帕普云数科技有限公司 Personalized push management method based on behavior data analysis
CN113449775A (en) * 2021-06-04 2021-09-28 广州大学 Multi-label image classification method and system based on class activation mapping mechanism
CN113553905B (en) * 2021-06-16 2024-04-26 北京百度网讯科技有限公司 Image recognition method, device and system
CN116543221A (en) * 2023-05-12 2023-08-04 北京长木谷医疗科技股份有限公司 Intelligent detection method, device and equipment for joint pathology and readable storage medium
CN116543221B (en) * 2023-05-12 2024-03-19 北京长木谷医疗科技股份有限公司 Intelligent detection method, device and equipment for joint pathology and readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant