CN115131604A - Multi-label image classification method and device, electronic equipment and storage medium - Google Patents

Multi-label image classification method and device, electronic equipment and storage medium

Info

Publication number
CN115131604A
CN115131604A (application number CN202210593162.5A)
Authority
CN
China
Prior art keywords
target
global
feature
sample
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210593162.5A
Other languages
Chinese (zh)
Inventor
詹佳伟
刘俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210593162.5A
Publication of CN115131604A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a multi-label image classification method and apparatus, an electronic device, and a storage medium, relating to the field of computer technology. The embodiments of the application can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and assisted driving. The method comprises: performing feature extraction on an image to be classified to obtain corresponding global features; dividing the global features by at least one bounding box obtained by identifying candidate objects in the image to be classified, to obtain at least one corresponding local feature; applying a self-attention mechanism to the global features and the at least one local feature to obtain a corresponding target global feature and at least one target local feature; and obtaining target classification labels corresponding to the image to be classified according to the target global feature and the at least one target local feature. Compared with the related art, this can effectively improve the accuracy of multi-label image classification.

Description

Multi-label image classification method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a multi-label image classification method and device, electronic equipment and a storage medium.
Background
With the rapid development of computer vision and multimedia technology, Multi-label Image Classification has been widely applied in fields such as image retrieval, attribute recognition, and automatic image annotation. Multi-label image classification classifies and identifies images so as to assign one or more labels to each image; for example, performing multi-label classification on an image can identify the pedestrians, animals, plants, and so on that it contains.
In the related art, a region-based multi-label classification method is usually adopted to complete image classification and identification tasks. This method performs region extraction on the image to be classified based on selective search to determine at least one candidate region, uses a convolutional neural network to extract features from each candidate region to obtain the corresponding candidate features, and then inputs each candidate feature into a classification model to obtain the multi-label classification result corresponding to the image to be classified.
In this scheme, selective search can produce a large number of candidate regions, and because the candidate regions bear no relation to one another, the candidate features corresponding to each candidate region are classified separately during multi-label classification, so the accuracy of the resulting classification is low. For example, after region extraction on an image containing a "table" and a "chair", a candidate region containing the "table" and a candidate region containing the "chair" may be obtained; since the candidate region containing the "chair" may have features similar to those of a "sofa", classifying that region's candidate features in isolation may mistake the "chair" for a "sofa" and conclude that the image contains a "sofa".
Disclosure of Invention
In order to solve technical problems in the related art, embodiments of the present application provide a multi-label image classification method, apparatus, electronic device, and storage medium, which can improve the accuracy of multi-label classification on an image.
In order to achieve the above purpose, the technical solution of the embodiment of the present application is implemented as follows:
in one aspect, an embodiment of the present application provides a multi-label image classification method, including:
carrying out feature extraction on the image to be classified to obtain corresponding global features; the global features comprise attribute features of all pixel points in the image to be classified;
identifying a bounding box corresponding to at least one candidate object in the image to be classified, and dividing the global features based on the obtained at least one bounding box to obtain local features corresponding to the at least one bounding box;
acquiring corresponding global attention characteristics based on the attribute characteristics of each pixel point and the first association degree between the attribute characteristics;
obtaining a corresponding target global feature and at least one target local feature based on the global attention feature and at least one local feature and a second degree of association between the global attention feature and the at least one local feature;
and obtaining a target classification label corresponding to the image to be classified based on the target global feature and the at least one target local feature.
In one aspect, an embodiment of the present application further provides a multi-label image classification device, including:
the global feature extraction unit is used for extracting features of the image to be classified to obtain corresponding global features; the global features comprise attribute features of all pixel points in the image to be classified;
the local feature determining unit is configured to identify a bounding box corresponding to each of at least one candidate object in the image to be classified, and divide the global feature based on the obtained at least one bounding box to obtain a local feature corresponding to each of the at least one bounding box;
the global attention processing unit is used for obtaining corresponding global attention characteristics based on the attribute characteristics of each pixel point and the first association degree between the attribute characteristics;
a target feature determination unit, configured to obtain a corresponding target global feature and at least one target local feature based on the global attention feature and at least one local feature and a second association between the global attention feature and the at least one local feature;
and the multi-label classification unit is used for obtaining a target classification label corresponding to the image to be classified based on the target global feature and the at least one target local feature.
Optionally, the global attention processing unit is specifically configured to:
aiming at each pixel point, the following operations are respectively executed: taking one pixel point as a query pixel point, and taking other pixel points in each pixel point as key pixel points; based on a self-attention mechanism, respectively carrying out association degree matching on the attribute characteristics of the query pixel points and the attribute characteristics of each key pixel point to obtain a first association degree between the attribute characteristics;
and according to each first association degree, weighting and combining the attribute characteristics of each pixel point to obtain corresponding global attention characteristics.
Optionally, the target feature determining unit is specifically configured to:
for at least one local feature, the following operations are performed, respectively: based on a self-attention mechanism, respectively carrying out association degree matching on attribute characteristics of each pixel point in one local characteristic and attention characteristics of each pixel point in the global attention characteristic to obtain a second association degree between the local characteristic and the global attention characteristic;
and according to each second relevance, respectively carrying out weighted combination on the global attention feature and the at least one local feature to obtain a corresponding target global feature and at least one target local feature.
Optionally, the multi-label classification unit is specifically configured to:
respectively determining a global classification label result corresponding to the target global feature and a local classification label result corresponding to the at least one target local feature based on a multi-label classification model;
and obtaining a target classification label corresponding to the image to be classified according to the global classification label result and each local classification label result.
Optionally, the multi-label classification unit is further configured to:
inputting the target global features into a multi-label classification model, obtaining first probability values of the target global features belonging to each candidate classification respectively, and taking each first probability value as a global classification label result;
and respectively inputting the at least one target local feature into the multi-label classification model to obtain second probability values of the at least one target local feature belonging to each candidate classification, and taking the second probability values as local classification label results.
Optionally, the multi-label classification unit is further configured to:
for each of the candidate classifications, performing the following operations, respectively: according to a second probability value that each target local feature belongs to a candidate classification, averaging the second probability value meeting a set value condition with a first probability value corresponding to the candidate classification to determine a target probability value corresponding to the candidate classification;
and according to the target probability value corresponding to each candidate classification, taking the candidate classification with the target probability value larger than a set threshold value as the target classification label corresponding to the image to be classified.
Optionally, the apparatus further includes a model training unit, configured to:
acquiring a training data set; the training data set comprises a plurality of image samples, and the image samples are marked with set classification labels;
performing iterative training on the multi-label classification model based on the training data set until a set convergence condition is met, wherein one iterative training process comprises the following steps:
obtaining a target global sample characteristic and at least one target local sample characteristic corresponding to the image sample based on the image sample extracted from the training data set;
determining a target sample classification label corresponding to the image sample according to the target global sample feature and the at least one target local sample feature through the multi-label classification model, and performing parameter adjustment on the multi-label classification model based on a loss value determined by the target sample classification label and the set classification label.
Optionally, the model training unit is further configured to:
performing feature extraction on the image sample to obtain corresponding global sample features, and dividing the global sample features based on at least one bounding box obtained by identifying at least one candidate sample object in the image sample to obtain local sample features corresponding to the at least one sample bounding box;
and obtaining a corresponding target global sample characteristic and at least one target local sample characteristic according to the global sample characteristic and the at least one local sample characteristic.
Optionally, the model training unit is further configured to:
determining, by the multi-label classification model, first sample probability values that the target global sample features respectively belong to respective candidate sample classifications, and second sample probability values that the at least one target local sample feature respectively belong to respective candidate sample classifications;
and determining a target sample classification label corresponding to the image sample according to the first sample probability value and the second sample probability value corresponding to each candidate sample classification.
In one aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the multi-label image classification method when executing the program.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, where the program is executed on the computer device, so as to cause the computer device to execute the steps of the above multi-label image classification method.
In one aspect, the present application provides a computer program product, which includes a computer program stored on a computer-readable storage medium, the computer program including program instructions, when the program instructions are executed by a computer, the computer executes the steps of the multi-label image classification method.
The beneficial effect of this application is as follows:
the multi-label image classification method, device, electronic equipment and storage medium provided by the embodiment of the application, extracting the features of the image to be classified to obtain corresponding global features, identifying at least one candidate object in the image to be classified based on at least one bounding box, dividing the global features to obtain local features corresponding to at least one bounding box, based on the attribute features of each pixel, and a first degree of association between the attribute features, obtaining a corresponding global attention feature, and based on the global attention feature and the at least one local feature, and a second degree of association between the global attention feature and the at least one local feature, obtaining a corresponding target global feature and the at least one target local feature, and obtaining a target classification label corresponding to the image to be classified according to the target global characteristic and the at least one target local characteristic. After the global features and the local features of the images are obtained, the global features and the local features are subjected to self-attention processing, so that the global and local features of the images have an incidence relation, more accurate classification label results can be obtained when the images to be classified are classified based on the obtained global features and local features of the targets, and the accuracy of multi-label classification of the images is improved.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of a candidate region generated by a selective search method in the related art;
fig. 2 is an application scene diagram of a multi-label image classification method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a multi-label image classification method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of obtaining a global attention feature according to an embodiment of the present application;
fig. 5 is a schematic diagram of obtaining a target global feature and a target local feature according to an embodiment of the present application;
fig. 6 is a schematic diagram of an image to be classified according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating a training process of a multi-label classification model according to an embodiment of the present application;
fig. 8 is a schematic flowchart of another multi-label image classification method according to an embodiment of the present disclosure;
fig. 9 is a network structure diagram of a ResNet101 model according to an embodiment of the present application;
fig. 10 is a schematic diagram of a process of extracting a local region by an RPN according to an embodiment of the present application;
fig. 11 is a network structure diagram of a multi-label classification model according to an embodiment of the present application;
FIG. 12a is a schematic diagram illustrating a t-SNE visualization result provided by an embodiment of the present application;
fig. 12b is a schematic view of a specific scene of a multi-label image classification method according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a multi-label image classification apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of another multi-label image classification apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
Multi-label image classification: classifying and identifying images so as to accurately assign one or more labels to each image. In multi-label classification, there is no constraint on the number of categories to which an image can be assigned, and the number of labels for an image may be more than one, i.e., an image can have multiple labels.
Region Proposal: given an input image, find all possible positions where an object may be located; the output is a list of bounding boxes of possible object positions. The regions enclosed by these bounding boxes are usually called candidate Regions of Interest (ROI).
Attention mechanism (Attention): derived from research into human vision. In cognitive science, owing to bottlenecks in information processing, humans selectively focus on a portion of all available information while ignoring the rest.
Self-Attention mechanism (Self-Attention) is a variant of Attention mechanism that reduces reliance on external information and is more adept at capturing internal correlations of data or features.
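For reference, the standard scaled dot-product formulation of attention commonly used in the literature (a general formulation, not quoted from this patent) is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension; in self-attention, all three are projections of the same input.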
The word "exemplary" is used hereinafter to mean "serving as an example, embodiment, or illustration. Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The terms "first", "second" and "first" are used herein for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature, and in the description of embodiments of the application, unless stated otherwise, "plurality" means two or more.
The embodiments of the present application relate to Artificial Intelligence (AI), Machine Learning (ML), and Natural Language Processing (NLP), and are designed based on the machine learning and natural language processing technologies within AI.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, spanning both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or realize human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
According to the method and the device, a multi-label classification model based on machine learning is adopted, and the target classification label corresponding to the image to be classified is obtained according to the target global feature and at least one target local feature corresponding to the image to be classified.
The following briefly introduces the design concept of the embodiments of the present application:
the multi-label image classification is mainly used for classifying and identifying images, so that the images are classified into one or more labels. In the related technology, a multi-label classification method based on regions is usually adopted to complete the classification and identification tasks of images, and the method firstly extracts the regions of the images to be classified based on a selective search method to generate at least one candidate region, then respectively extracts the features of each candidate region to obtain the candidate features corresponding to each candidate region, and finally respectively classifies each candidate feature to obtain the multi-label classification result corresponding to the images to be classified.
However, the above scheme does not consider the association between candidate regions, nor the association between local and global, and cannot well utilize the information of the whole and local regions, thereby resulting in poor classification results.
The above scheme has the following specific drawbacks. First, to achieve a high recall rate, local region extraction produces a large number of candidate regions, typically thousands, as shown in FIG. 1. This is not only inefficient for multi-label learning, but also degrades performance due to background interference and inaccurate proposal boundaries. Second, semantic dependencies between categories are ignored, which is particularly important for multi-label classification (e.g., a "cat" is more likely to be mistaken for the "dog" category than to be mistakenly associated with "umbrella"; a "toothbrush" is more likely to appear with "toothpaste" than with "plane"). Some previous work has attempted to address this shortcoming by explicitly capturing class dependencies, appending an RNN or LSTM structure after the CNN-based model. However, these models only consider local-to-local correlations and do not consider higher-order local-to-global correlations. Therefore, lacking a thorough understanding of the global image information, they cannot effectively use multi-label information to learn semantic regions. Furthermore, these models need to assist the learning process in a complex iterative manner, which makes model training inefficient.
Therefore, when the image is subjected to multi-label classification, the candidate features corresponding to each candidate region are separately classified, so that the accuracy of the finally obtained classification result is low.
In view of the above, embodiments of the present application provide a multi-label image classification method, apparatus, electronic device, and storage medium. Feature extraction is performed on the image to be classified to obtain corresponding global features; the global features are divided based on at least one bounding box obtained by identifying at least one candidate object in the image to be classified, yielding local features corresponding to the at least one bounding box; a corresponding global attention feature is obtained based on the attribute features of each pixel point and the first degrees of association between the attribute features; a corresponding target global feature and at least one target local feature are obtained based on the global attention feature, the at least one local feature, and the second degrees of association between the global attention feature and the at least one local feature; and a target classification label corresponding to the image to be classified is obtained according to the target global feature and the at least one target local feature. Thereby, the accuracy of multi-label image classification can be improved.
The preferred embodiments of the present application will be described in conjunction with the drawings of the specification, it should be understood that the preferred embodiments described herein are for purposes of illustration and explanation only and are not intended to limit the present application, and features of the embodiments and examples of the present application may be combined with each other without conflict.
Fig. 2 is a schematic view of an application scenario in the embodiment of the present application. The application scenario at least includes the terminal device 110 and the server 130, and the application operation interface 120 can be accessed through the terminal device 110. There may be one or more terminal devices 110 and one or more servers 130; the present application does not specifically limit their numbers. The terminal device 110 and the server 130 can communicate with each other through a communication network.
In the embodiment of the present application, the terminal device 110 may be a portable device (e.g., a mobile phone, a tablet Computer, a notebook Computer, etc.), or may be a Computer, a smart screen, a Personal Computer (PC), etc. The terminal device 110 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, an aircraft, and the like.
The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminal device 110 and the server 130 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The multi-label image classification method in the embodiment of the present application may be executed by the terminal device 110, may also be executed by the server 130, and may also be executed by the terminal device 110 and the server 130 interactively.
For example, the method for classifying multi-label images executed by the server 130 in the embodiment of the present application includes the following steps:
the target object sends the image to be classified to the server 130 through the terminal device 110, and the server 130 can perform feature extraction on the image to be classified to obtain corresponding global features, and based on at least one bounding box obtained by identifying at least one candidate object in the image to be classified, dividing the global features to obtain local features corresponding to at least one bounding box, based on the attribute features of each pixel, and a first degree of association between the attribute features, obtaining a corresponding global attention feature, and based on the global attention feature and the at least one local feature, and obtaining a corresponding target global feature and at least one target local feature according to a second correlation degree between the global attention feature and the at least one local feature, and finally obtaining a target classification label corresponding to the image to be classified according to the target global feature and the at least one target local feature. After obtaining the target classification tag corresponding to the image to be classified, the server may send the target classification tag to the terminal device 110, so that the terminal device 110 displays the target classification tag corresponding to the image to be classified to the target object.
It should be noted that fig. 2 is an exemplary description of an application scenario of the multi-label image classification method of the present application, and an application scenario to which the method in the embodiment of the present application may be applied is not limited to this. Moreover, the embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent traffic, driving assistance and the like.
To further illustrate the technical solutions provided by the embodiments of the present application, a detailed description is given below with reference to the accompanying drawings and specific embodiments. Although the embodiments of the present application provide the method operation steps shown in the following embodiments or figures, more or fewer operation steps may be included in the method through conventional or non-inventive effort. In steps where no necessary causal relationship exists logically, the order of execution of the steps is not limited to that provided by the embodiments of the present application. In an actual processing procedure or in a device, the steps may be executed sequentially or in parallel according to the methods shown in the embodiments or figures.
Fig. 3 shows a flowchart of a multi-label image classification method provided by an embodiment of the present application, which may be executed by an electronic device, which may be the terminal device 110 and/or the server 130 in fig. 2. As shown in fig. 3, the method comprises the following steps:
step S301, extracting the features of the image to be classified to obtain corresponding global features.
The global features comprise attribute features of all pixel points in the image to be classified.
After the image to be classified is obtained, a feature extraction model can be adopted to perform feature extraction on the image to be classified, so as to obtain corresponding global features.
Step S302, a bounding box corresponding to at least one candidate object in the image to be classified is identified, the global features are divided based on the obtained at least one bounding box, and local features corresponding to the at least one bounding box are obtained.
After obtaining global features corresponding to the image to be classified, a local region proposing operation may be performed on the global features to identify a bounding box corresponding to each of at least one candidate object in the image to be classified, and the global features are divided into at least one local feature based on the obtained at least one bounding box, where each local feature corresponds to one bounding box.
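As an illustration of this division step, the following is a minimal sketch assuming a PyTorch feature map and torchvision's roi_align as the cropping operator; the function name, tensor shapes, and output size are assumptions for illustration, not the patent's implementation.

```python
import torch
from torchvision.ops import roi_align

def split_into_local_features(global_feat, boxes, spatial_scale, output_size=7):
    """Crop one local feature per bounding box out of the global feature map.

    global_feat: (1, C, H, W) global features of the image to be classified
    boxes:       (N, 4) bounding boxes (x1, y1, x2, y2) in input-image coordinates
    """
    # roi_align expects rows of the form (batch_idx, x1, y1, x2, y2)
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
    # Each row of the output is the local feature for one bounding box
    return roi_align(global_feat, rois, output_size=output_size,
                     spatial_scale=spatial_scale)  # (N, C, 7, 7)
```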
Step S303, based on the attribute features of each pixel point and the first association degree between the attribute features, obtain a corresponding global attention feature.
After obtaining the global features corresponding to the images to be classified, as shown in fig. 4, for each pixel point in the global features, the following operations may be respectively performed:
taking one pixel point in the global features as a query pixel point, taking other pixel points except the query pixel point in the global features as key pixel points, and respectively matching the attribute features of the query pixel points with the respective attribute features of the key pixel points based on a self-attention mechanism to obtain a first association degree between the attribute features; and according to the first relevance, weighting and combining the attribute features of the pixels to obtain the corresponding global attention feature.
Through the self-attention mechanism, the association relations among the attribute features of the pixel points in the global features are determined, and the attribute features of the pixel points are weighted and combined according to these relations to obtain the global attention feature. Through this self-attention processing, the features corresponding to each candidate object in the image to be classified become more prominent in the global features; that is, the feature information of the region where each candidate object is located receives more attention within the global features, which improves the discriminability of the global features.
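A minimal sketch of this per-pixel self-attention, treating each pixel's attribute feature as a token (the identity projections and all names are simplifying assumptions, not the patent's implementation):

```python
import torch
import torch.nn.functional as F

def global_self_attention(feat):
    """Self-attention over pixel attribute features (hypothetical sketch).

    feat: (HW, C) - one C-dimensional attribute feature per pixel point.
    Each pixel acts in turn as the query; all pixels act as keys/values.
    """
    q, k, v = feat, feat, feat                    # identity projections for brevity
    # First degrees of association between the attribute features
    assoc = q @ k.t() / (feat.shape[-1] ** 0.5)   # (HW, HW)
    weights = F.softmax(assoc, dim=-1)
    # Weighted combination of attribute features -> global attention feature
    return weights @ v                            # (HW, C)
```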
And step S304, obtaining a corresponding target global feature and at least one target local feature based on the global attention feature and the at least one local feature and a second association degree between the global attention feature and the at least one local feature.
After dividing the global features to obtain at least one local feature and performing the self-attention mechanism on the global features to obtain the global attention feature, the following operations may be performed for each of the obtained local features, as shown in fig. 5:
based on the self-attention mechanism, the attribute characteristics of each pixel point in one local characteristic are respectively subjected to relevance degree matching with the attention characteristics of each pixel point in the global attention characteristic, second relevance degrees between one local characteristic and the global attention characteristic are obtained, and the global attention characteristic and at least one local characteristic are respectively subjected to weighted combination according to the second relevance degrees, so that a corresponding target global characteristic and at least one target local characteristic are obtained.
By processing the obtained global attention feature and the at least one local feature with the self-attention mechanism, more information can be propagated from the global view of the image to each local region, complementary information between the global and local views is effectively explored, and the association relation between them is obtained. As a result, the target global feature obtained after self-attention processing contains local information of the image, and each obtained target local feature contains global information of the image. Meanwhile, the correlations among the candidate objects in the image to be classified are also captured in the obtained target global feature and target local features, so that label dependencies can be established implicitly.
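A hypothetical sketch of this cross-granularity step, with local features querying the global attention feature and vice versa (the residual connections and all names are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def cross_granularity_attention(global_attn, local_feats):
    """Second degrees of association between global and local features (sketch).

    global_attn: (HW, C)  global attention feature, one row per pixel point
    local_feats: list of (hw_i, C) local features, one per bounding box
    Returns a target global feature and one target local feature per box.
    """
    scale = global_attn.shape[-1] ** 0.5
    target_locals = []
    for loc in local_feats:
        # Match each pixel of the local feature against the global attention feature
        assoc = loc @ global_attn.t() / scale         # (hw_i, HW)
        w = F.softmax(assoc, dim=-1)
        # Weighted combination: local feature enriched with global information
        target_locals.append(loc + w @ global_attn)   # residual keeps local detail
    # Symmetrically, enrich the global feature with local information
    all_local = torch.cat(local_feats, dim=0)         # (sum hw_i, C)
    w_g = F.softmax(global_attn @ all_local.t() / scale, dim=-1)
    target_global = global_attn + w_g @ all_local
    return target_global, target_locals
```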
Step S305, based on the target global feature and at least one target local feature, obtaining a target classification label corresponding to the image to be classified.
And inputting the target global features into the multi-label classification model, obtaining first probability values of the target global features belonging to each candidate classification respectively, and taking each first probability value as a global classification label result corresponding to the target global features.
And respectively inputting the at least one target local feature into the multi-label classification model to obtain second probability values of the at least one target local feature belonging to each candidate classification, and taking the second probability values as local classification label results corresponding to the at least one target local feature.
And obtaining a target classification label corresponding to the image to be classified according to the global classification label result and each local classification label result. Specifically, for each candidate classification, the following operations are performed, respectively: and averaging the second probability value which accords with the set value condition with the first probability value corresponding to the candidate classification according to the second probability value of at least one target local feature belonging to the candidate classification respectively, and determining the target probability value corresponding to the candidate classification.
After the target probability values corresponding to the candidate classifications are determined, the candidate classifications with the target probability values larger than the set threshold value can be used as target classification labels corresponding to the images to be classified according to the target probability values corresponding to the candidate classifications.
For example, as shown in fig. 6, it can be seen from fig. 6 that the image to be classified includes one bee and two flowers, and the image to be classified may have 3 target local features, where the target local feature 1 includes a feature of the bee, the target local feature 2 includes a feature of one of the two flowers, and the target local feature 3 includes a feature of the other of the two flowers.
Assuming that the number of the candidate classifications is 4, namely flowers, grasses, bees and birds, inputting the target global features into the multi-label classification model, and obtaining that the first probability value of the target global features belonging to the flowers is 0.9, the first probability value of the target global features belonging to the grasses is 0.1, the first probability value of the target global features belonging to the bees is 0.88, and the first probability value of the target global features belonging to the birds is 0.12; inputting the target local feature 1 into the multi-label classification model, and obtaining that the second probability value that the target local feature 1 belongs to flowers is 0.01, the second probability value that the target local feature 1 belongs to grass is 0.01, the second probability value that the target local feature 1 belongs to bees is 0.96, and the second probability value that the target local feature 1 belongs to birds is 0.02; inputting the target local feature 2 into the multi-label classification model, and obtaining that the second probability value that the target local feature 2 belongs to flowers is 0.94, the second probability value that the target local feature 2 belongs to grass is 0.04, the second probability value that the target local feature 2 belongs to bees is 0.01, and the second probability value that the target local feature 2 belongs to birds is 0.01; inputting the target local feature 3 into the multi-label classification model, the second probability value that the target local feature 3 belongs to flowers is 0.96, the second probability value that the target local feature 3 belongs to grass is 0.02, the second probability value that the target local feature 3 belongs to bees is 0.01, and the second probability value that the target local feature 3 belongs to birds is 0.01.
First, a maximum is taken over the second probability values of the target local features for each candidate classification; that is, the maximum second probability value across the target local features is determined for each candidate classification: the maximum second probability value for flowers is 0.96, for grass 0.04, for bees 0.96, and for birds 0.02.
Then, the maximum second probability values of the target local features belonging to the candidate classifications are averaged with the first probability values of the target global features belonging to the candidate classifications respectively, so that the target probability value of the image to be classified belonging to flowers is 0.93, the target probability value of the image to be classified belonging to grass is 0.07, the target probability value of the image to be classified belonging to bees is 0.92, and the target probability value of the image to be classified belonging to birds is 0.07. Assuming that the threshold is set to be 0.9, the object classification labels corresponding to the images to be classified can be determined to be bees and flowers.
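The fusion rule in this example (take the maximum second probability value per candidate classification over the local features, average it with the global first probability value, then threshold) can be sketched as follows; the variable names and the use of NumPy are assumptions for illustration.

```python
import numpy as np

candidates = ["flower", "grass", "bee", "bird"]
# First probability values from the target global feature
p_global = np.array([0.90, 0.10, 0.88, 0.12])
# Second probability values from the three target local features
p_local = np.array([[0.01, 0.01, 0.96, 0.02],   # target local feature 1 (bee)
                    [0.94, 0.04, 0.01, 0.01],   # target local feature 2 (flower)
                    [0.96, 0.02, 0.01, 0.01]])  # target local feature 3 (flower)

# Max over local features per candidate, then average with the global value
p_target = (p_local.max(axis=0) + p_global) / 2   # [0.93, 0.07, 0.92, 0.07]
labels = [c for c, p in zip(candidates, p_target) if p > 0.9]
print(labels)  # ['flower', 'bee']
```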
The training process of the multi-label classification model used in step S305 can be illustrated in fig. 7, and is described in detail below with reference to fig. 7.
Step S701, a training data set is acquired.
The acquired training data set may include a plurality of image samples, each image sample being labeled with a set classification label.
Step S702, extracting image samples from the training data set, and performing feature extraction on the extracted image samples to obtain corresponding global sample features.
When the multi-label classification model is trained, image samples can be extracted from the training data set, and feature extraction is performed on the extracted image samples to obtain global sample features.
Step S703 is to divide the global sample features based on at least one boundary box obtained by identifying at least one candidate sample object in the image sample, so as to obtain local sample features corresponding to the at least one sample boundary box.
Boundary identification is carried out on at least one candidate sample object in the global sample characteristics, a boundary box corresponding to each candidate sample object is obtained, and the global sample characteristics are divided according to the at least one boundary box to obtain at least one local sample characteristic.
Step S704, obtaining a corresponding target global sample feature and at least one target local sample feature according to the global sample feature and the at least one local sample feature.
After the global sample feature and the at least one local sample feature corresponding to the extracted image sample are obtained, an attention mechanism may be performed on the global sample feature to obtain a corresponding global sample attention feature, and then the attention mechanism may be performed on the global sample attention feature and the at least one local sample feature to obtain a corresponding target global sample feature and the at least one target local sample feature.
Step S705, respectively inputting the target global sample feature and the at least one target local sample feature into the multi-label classification model, and determining a first sample probability value that the target global sample feature respectively belongs to each candidate sample classification, and a second sample probability value that the at least one target local sample feature respectively belongs to each candidate sample classification.
And inputting the target global sample characteristics into the multi-label classification model to obtain first sample probability values of the target global sample characteristics belonging to each candidate sample classification respectively.
And respectively inputting the at least one target local sample feature into the multi-label classification model, so as to obtain second sample probability values of the at least one target local sample feature belonging to each candidate sample classification.
Step S706, determining a target sample classification label corresponding to the image sample according to the first sample probability value and the second sample probability value corresponding to each candidate sample classification.
After determining the first sample probability values that the target global sample feature belongs to each candidate sample classification and the second sample probability values that the at least one target local sample feature belongs to each candidate sample classification, the following operations are respectively performed for each candidate sample classification: according to the second sample probability values that each target local sample feature belongs to the candidate sample classification, the maximum second sample probability value is determined; this maximum is then averaged with the first sample probability value corresponding to the candidate sample classification to determine the target sample probability value corresponding to that classification.
After the target sample probability values corresponding to the respective candidate sample classifications are determined, the candidate sample classifications whose target sample probability values are larger than the set sample threshold can be used as the target sample classification labels corresponding to the image sample.
In step S707, a corresponding loss value is determined based on the target sample classification label and the set classification label.
In calculating the loss value, a Binary Cross-Entropy (BCE) loss function may be employed. Specifically, the BCE loss function can be represented by the following formula:
$$\mathcal{L}_{BCE} = -\frac{1}{C}\sum_{c=1}^{C}\Big[y_c \log p_c + (1 - y_c)\log(1 - p_c)\Big]$$

where $C$ is the number of candidate classifications, $y_c \in \{0, 1\}$ indicates whether the set classification label contains classification $c$, and $p_c$ is the predicted probability that the image sample belongs to classification $c$.
in general, the loss value is a measure of how close the actual output is to the desired output. The smaller the loss value, the closer the actual output is to the desired output.
Step S708, determining whether the loss value converges to a preset target value; if not, go to step S709; if so, go to step S710.
Whether the loss value has converged to the preset target value is judged as follows: if the loss value is less than or equal to the preset target value, or if the variation amplitude of the loss values obtained in N consecutive rounds of training is less than or equal to the preset target value, the loss value is considered to have converged to the preset target value, indicating convergence; otherwise, the loss value has not converged.
And step S709, adjusting parameters of the multi-label classification model according to the determined loss value.
And if the loss value is not converged, adjusting the model parameters, and after adjusting the model parameters, returning to execute the step S702 to continue the next round of training process.
And step S710, finishing the training to obtain the trained multi-label classification model.
And if the loss value is converged, taking the currently obtained multi-label classification model as a trained multi-label classification model.
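A compact sketch of the training loop in steps S702 to S710, assuming a PyTorch model that wraps feature extraction, attention processing, and classification; all names and the convergence parameters are hypothetical.

```python
def train_multilabel(model, loader, optimizer, criterion,
                     target_value=1e-3, n_rounds=3, max_epochs=100):
    """Iterate steps S702-S710 until the loss converges (hypothetical sketch)."""
    epoch_losses = []
    for _ in range(max_epochs):
        total = 0.0
        for images, set_labels in loader:              # step S702: extract samples
            probs = model(images)                      # steps S703-S706 inside model
            loss = criterion(probs, set_labels)        # step S707: loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                           # step S709: adjust parameters
            total += loss.item()
        epoch_losses.append(total / len(loader))
        # Step S708: converged if the loss change over N rounds is small enough
        last = epoch_losses[-n_rounds:]
        if len(epoch_losses) >= n_rounds and max(last) - min(last) <= target_value:
            break
    return model                                       # step S710: trained model
```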
In some embodiments, the multi-label image classification method proposed in this application may also be implemented according to the process illustrated in fig. 8, and the process may be executed by an electronic device, which may be the terminal device 110 and/or the server 130 in fig. 2. As shown in fig. 8, the method comprises the following steps:
step S801, extracting the features of the image to be classified to obtain corresponding global features.
After the image to be classified is acquired, it can be input into a feature extraction model such as a Residual Network 101 (ResNet101) model to extract high-level semantic features of the image to be classified and obtain the corresponding global features.
When the feature extraction model is the ResNet101 model, its network structure may be as shown in fig. 9. In fig. 9, the input to the ResNet101 model first passes through a convolutional layer and a max pooling layer, and then passes through three different residual modules in sequence.
It should be noted that, to improve the universality and generalization of the features, when the ResNet101 model is used to extract features from the image to be classified and obtain the global features, only the first three stages of the ResNet101 model are selected as the feature extraction module; the fourth stage can be applied in the multi-label classification process of the image to be classified to help the extracted features be better projected and to learn semantic information.
Optionally, the feature extraction model in the embodiment of the present application may also be AlexNet, Visual Geometry Group Net (VGGNet), ReNet, or the like.
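Assuming the torchvision implementation of ResNet101, using only the first three residual stages as the feature extraction module can be sketched as follows; treating torchvision's layer1 to layer3 as the "first three stages" is an assumption based on the description above.

import torch.nn as nn
from torchvision.models import resnet101

# Sketch: keep stages 1-3 of ResNet101 for feature extraction; layer4 is set
# aside for the later classification stage.
backbone = resnet101(pretrained=True)
feature_extractor = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3,   # first three stages
)
# For a (B, 3, H, W) input, the global features have shape (B, 1024, H/16, W/16).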
Step S802, the global feature is divided based on at least one bounding box obtained by performing region extraction on the image to be classified, and at least one local feature is obtained.
The image to be classified is input into a local region proposal network, and bounding boxes that may contain key targets are extracted from it. The local region proposal network used in this embodiment may be a pre-trained Region Proposal Network (RPN).
Each pixel in the convolution feature map is assigned k anchors as initial detection boxes. The network then judges whether each anchor belongs to an object or to the background (that is, whether the anchor covers a target at all), and performs a first coordinate correction on the anchors belonging to objects. Since distinguishing object from background is a binary classification, the classification branch yields 2k scores; the coordinate correction consists of four values (x, y, w, h), so the regression branch yields 4k coordinate values.
Specifically, fig. 10 is a schematic diagram of the process by which the RPN extracts local regions. In fig. 10, a 3 × 3 sliding window traverses the entire convolution feature map. During the traversal, the center of each window generates 9 kinds of target boxes (anchors) according to the aspect ratios 1:1, 1:2 and 2:1 and the areas 128 × 128, 256 × 256 and 512 × 512; a window classification layer (cls layer) then performs a binary classification for each anchor (distinguishing foreground from background), and a position refinement layer (reg layer) determines the coordinate position of each anchor. Finally, 2k scores and 4k coordinates are obtained.
After at least one bounding box in the image to be classified is extracted, the global features can be divided based on the at least one bounding box to obtain at least one local feature.
Optionally, the local region proposal network used in the embodiment of the present application may be any method capable of generating relevant target bounding boxes, such as the conventional binarized normed gradients (BING) or EdgeBoxes methods, or a deep learning method such as a weakly supervised region proposal method.
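The division of the global feature map by the predicted bounding boxes can be sketched as follows; the use of roi_align here is an assumption, since the text only states that the global features are divided based on the boxes.

import torch
from torchvision.ops import roi_align

# Sketch of step S802: cut local features out of the global feature map
# using bounding boxes produced by the region proposal network.
global_feat = torch.randn(1, 1024, 14, 14)         # (B, C, H, W) from the backbone
boxes = [torch.tensor([[10., 20., 120., 180.],     # (x1, y1, x2, y2) in image pixels
                       [40., 30., 200., 160.]])]
local_feats = roi_align(global_feat, boxes,
                        output_size=(7, 7),
                        spatial_scale=14 / 224)    # image coords -> feature coords
# local_feats: (k_o, 1024, 7, 7), one local feature per bounding box.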
Step S803, a self-attention mechanism is performed on the global feature to obtain a corresponding global attention feature.
Since information propagation between the global and local feature branches has often been ignored by previous work, a cross-granularity attention module is proposed in this application to remedy this. Because the global feature map and the local feature maps are misaligned in the spatial dimension, simple fusion cannot bring a performance improvement. Local-to-local interactions, and even global-to-local higher-order dependencies, can instead be achieved by means of a self-attention mechanism.
First, a self-attention mechanism (Self-Attention) is performed on the global feature branch to capture non-local dependencies, resulting in higher-level semantic features. In the self-attention mechanism of the global feature branch, let

$F \in \mathbb{R}^{H \times W \times C}$

be the global feature, where $H$, $W$ and $C$ are the height, width and number of channels of the global feature, respectively. Then, with $F$ reshaped to $\mathbb{R}^{HW \times C}$,

$Q_g = \phi_q(F), \quad K_g = \phi_k(F), \quad V_g = \phi_v(F)$

where $\phi(\cdot)$ represents a linear projection that maps the input global feature $F$ to an output of the same dimension through learnable weights.

Thus, the global attention map can be obtained by the following formula:

$A_g = \operatorname{softmax}\!\left(\frac{Q_g K_g^{\top}}{\sqrt{d}}\right)$

where the scaling factor $\sqrt{d}$, $d$ being the feature dimension, avoids overflow caused by an excessively large dot-product result. For the optimized feature $A_g V_g$, performing a maximum pooling (max-pooling) operation along the spatial dimension yields the global attention feature

$F_g \in \mathbb{R}^{1 \times C}$
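A minimal single-head PyTorch sketch of this computation (linear projections, softmax attention, spatial max-pooling) might look as follows; the module name and the single-head design are assumptions.

import torch
import torch.nn as nn

class GlobalSelfAttention(nn.Module):
    # Sketch of step S803: self-attention over the spatial positions of the
    # global feature map, followed by max-pooling along the spatial dimension.
    def __init__(self, c):
        super().__init__()
        self.q = nn.Linear(c, c)   # phi_q
        self.k = nn.Linear(c, c)   # phi_k
        self.v = nn.Linear(c, c)   # phi_v

    def forward(self, feat):                         # feat: (B, C, H, W)
        x = feat.flatten(2).transpose(1, 2)          # (B, HW, C)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)  # A_g
        out = attn @ v                               # A_g V_g: (B, HW, C)
        return out.max(dim=1).values                 # global attention feature: (B, C)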
As more information is passed from the global branch to the local branch in this way, the feature information in the local branch can be better correlated, implicitly establishing the relevant associations.
Step S804, a self-attention mechanism is performed on the global attention feature and the at least one local feature, and a corresponding target global feature and at least one target local feature are obtained.
In the self-attention mechanism of the cross-granularity attention module, let

$F_l \in \mathbb{R}^{k_o \times C}$

be the at least one local feature, where $F_l$ is obtained through a series of learnable projections and $k_o$ is the number of local features. By adding an additional dimension, the global attention feature $F_g$ and $F_l$ can be concatenated to obtain

$F_{gl} = [F_g; F_l] \in \mathbb{R}^{(1 + k_o) \times C}$

Performing a self-attention mechanism on the concatenated feature $F_{gl}$ yields cross-granularity attention features that facilitate information propagation between the global and local features. Similar to the formulas in the self-attention mechanism of the global feature branch described above,

$Q_{gl} = \phi_q(F_{gl}), \quad K_{gl} = \phi_k(F_{gl}), \quad V_{gl} = \phi_v(F_{gl})$

Thus, the cross-granularity attention map can be obtained by the following formula:

$A_{gl} = \operatorname{softmax}\!\left(\frac{Q_{gl} K_{gl}^{\top}}{\sqrt{d}}\right)$
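Under the same assumptions, the cross-granularity attention can be sketched by concatenating the pooled global attention feature with the local features and applying the same single-head self-attention; sharing one set of projections over the 1 + k_o concatenated tokens is an assumption.

import torch
import torch.nn as nn

class CrossGranularityAttention(nn.Module):
    # Sketch of step S804: self-attention over the concatenation of the global
    # attention feature and the k_o local features.
    def __init__(self, c):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(c, c), nn.Linear(c, c), nn.Linear(c, c)

    def forward(self, f_g, f_l):
        # f_g: (B, C) global attention feature; f_l: (B, k_o, C) local features
        f_gl = torch.cat([f_g.unsqueeze(1), f_l], dim=1)   # (B, 1 + k_o, C)
        q, k, v = self.q(f_gl), self.k(f_gl), self.v(f_gl)
        attn = torch.softmax(q @ k.transpose(1, 2) / f_gl.shape[-1] ** 0.5, dim=-1)
        out = attn @ v                                     # A_gl V_gl
        return out[:, 0], out[:, 1:]   # target global feature, target local features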
Thus, from $A_{gl} V_{gl}$, a cross-granularity attention feature comprising the target global feature and at least one target local feature can be obtained. On the one hand, with the help of the local region proposal network, local features can be obtained that contain detailed information the global feature branch cannot provide. On the other hand, local features lack the macroscopic view of global features and struggle in special cases: knowing the overall environment of the image supplies a degree of prior knowledge (for example, in a bedroom, a bed is far more likely to appear than a car).
In summary, by concatenating features of different granularities, the multi-label image classification problem is reduced to a form that a self-attention mechanism can handle, achieving the intended purpose and effect. Through the global attention feature $F_g$ and the local features $F_l$, the cross-granularity attention module can efficiently explore the complementary information between the global and the local. At the same time, the correlations between candidate objects contained in the image are also captured in the extracted features, which means that label dependencies can be established in an implicit way. Together, these two salient advantages ensure an improvement in overall network performance.
Step S805, based on the multi-label classification model, a first probability value that the target global feature belongs to each candidate classification and a second probability value that the at least one target local feature belongs to each candidate classification are obtained.
Respectively inputting the target global feature and the at least one target local feature into a multi-label classification model, and respectively obtaining a first probability value that the target global feature respectively belongs to each candidate classification and a second probability value that the at least one target local feature respectively belongs to each candidate classification based on the multi-label classification model.
The network structure of the multi-label classification model can be as shown in fig. 11. Features input into the multi-label classification model pass through three residual modules, a global maximum pooling layer and a fully connected layer. The 1000 outputs of the fully connected layer in fig. 11 are illustrative and can be changed to a different number of classes for a different dataset; for example, the number of classes may be 80 on the Microsoft COCO dataset and 20 on the Pascal VOC 2007 dataset.
Optionally, since the maximum pooling layer and the average pooling layer have similar actions and effects, the maximum pooling layer in the model may be replaced by the average pooling layer.
Alternatively, since the 1 × 1 convolutional layer and the fully-connected layer have substantially the same functions and effects, the fully-connected layer in the model may be replaced with the 1 × 1 convolutional layer.
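The two substitutions above can be illustrated with the following sketch of the classification head; the 2048 input channels and the class counts are illustrative assumptions.

import torch
import torch.nn as nn

num_classes = 80                       # e.g. Microsoft COCO; 20 for Pascal VOC 2007

head_fc = nn.Sequential(               # global max pooling + fully connected layer
    nn.AdaptiveMaxPool2d(1),           # AdaptiveAvgPool2d(1) is the average variant
    nn.Flatten(),
    nn.Linear(2048, num_classes),
)

head_conv = nn.Sequential(             # equivalent head with a 1x1 convolution
    nn.AdaptiveMaxPool2d(1),
    nn.Conv2d(2048, num_classes, kernel_size=1),
    nn.Flatten(),
)

x = torch.randn(2, 2048, 7, 7)
logits = head_fc(x)                    # (2, num_classes); sigmoid gives probabilities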
Step S806, determining a target probability value corresponding to each candidate classification according to the first probability value and the second probability value corresponding to each candidate classification, and determining a target classification label corresponding to the image to be classified based on the target probability value.
After the first probability values that the target global feature belongs to the candidate classifications and the second probability values that the at least one target local feature belongs to the candidate classifications are obtained, the maximum second probability value corresponding to each candidate classification is selected across the at least one target local feature; this maximum second probability value is then averaged with the first probability value corresponding to the same candidate classification, giving the target probability value corresponding to each candidate classification.
According to the target probability values corresponding to the candidate classifications, each candidate classification whose target probability value is greater than the set threshold is taken as a target classification label corresponding to the image to be classified.
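The selection and averaging just described can be sketched as follows; the threshold value of 0.5 is an illustrative assumption.

import torch

def fuse_and_label(global_probs, local_probs, threshold=0.5):
    # Sketch of step S806. global_probs: (C,) first probability values;
    # local_probs: (k_o, C) second probability values, both after sigmoid.
    best_local = local_probs.max(dim=0).values       # max over the local features
    target_probs = (global_probs + best_local) / 2   # per-class target probability
    labels = (target_probs > threshold).nonzero(as_tuple=True)[0]
    return labels, target_probs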
In one embodiment, each implementation step and the overall model training process of the multi-label image classification method provided by this application were run on a server with an Intel Xeon 8255C CPU and NVIDIA Tesla V100 graphics cards, with 8 V100 cards used for distributed parallel training and for generating inference results. The code uses Python 3.6.8; the deep learning framework is PyTorch 1.4.0 with torchvision 0.5.0, together with opencv-python 4.5.1, numpy 1.16.1 and scikit-learn 0.23.0.
In this embodiment, the multi-label image classification method with cross-granularity attention provided by the present application may be compared with a multi-label image classification method without cross-granularity attention in the related art, and the comparison result may be visualized by t-distributed stochastic neighbor embedding (t-SNE), as shown in fig. 12a. In fig. 12a, each dot represents a label-level feature for a particular label, and each color represents a category. As fig. 12a shows, with cross-granularity attention the features appear more concentrated and more discriminative in the t-SNE visualization, while the features without cross-granularity attention appear more scattered. Therefore, compared with the related art, the multi-label image classification method provided by the application achieves higher accuracy in the multi-label classification of images.
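A t-SNE visualization of this kind can be produced with scikit-learn, for example as in the following sketch, where the features and their dimensionality are placeholders.

import numpy as np
from sklearn.manifold import TSNE

# Sketch: embed label-level features into 2-D for a fig. 12a-style comparison.
features = np.random.randn(500, 2048)                    # placeholder features
embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
# Plotting embedded[:, 0] against embedded[:, 1], colored by category,
# gives a scatter plot like fig. 12a.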
In the multi-label image classification method described above, corresponding global features are obtained by performing feature extraction on the image to be classified; at least one bounding box is obtained by identifying at least one candidate object in the image, and the global features are divided to obtain the local features corresponding to the at least one bounding box; a corresponding global attention feature is obtained based on the attribute features of the pixels and the first association degrees between the attribute features; a corresponding target global feature and at least one target local feature are obtained based on the global attention feature, the at least one local feature and the second association degrees between them; and the target classification label corresponding to the image to be classified is obtained from the target global feature and the at least one target local feature. The overall framework of the method is clearly structured, each module generalizes well, and experiments on popular large-scale multi-label image classification datasets confirm excellent performance. The cross-granularity attention module provided by the method can establish associations between the global and the local, so that the features combine associated local information with a thorough understanding of the global image information, improving the discriminability and quality of both the global and local features. Meanwhile, the training of the multi-label classification model is end-to-end: no complex iterative scheme and no extra supervision information are needed to assist learning, which makes the method convenient to deploy.
Fig. 12b is a schematic diagram of a specific scenario of the multi-label image classification method provided in an embodiment of the present application. A target object can send an image to be classified to the server through a terminal device. After receiving the image, the server performs feature extraction on it to obtain global features, and divides the global features using two bounding boxes obtained by processing the image with a local region extraction network, obtaining local feature 1 and local feature 2. A self-attention mechanism is performed on the global features to obtain a global attention feature, and a further self-attention mechanism is performed on the global attention feature together with local feature 1 and local feature 2 to obtain the target global feature, target local feature 1 and target local feature 2.
Assuming that the candidate classification comprises birds, animals, wood and plants, according to the target global characteristics, the probability that the image to be classified belongs to the birds is 0.86, the probability that the image to be classified belongs to the animals is 0.88, the probability that the image to be classified belongs to the wood is 0.82 and the probability that the image to be classified belongs to the plants is 0.32; according to the target local feature 1, the probability that the image to be classified belongs to birds is 0.98, the probability that the image to be classified belongs to animals is 0.92, the probability that the image to be classified belongs to wood is 0.12, and the probability that the image to be classified belongs to plants is 0.08; according to the target local feature 2, the probability that the image to be classified belongs to birds is 0.16, the probability that the image to be classified belongs to animals is 0.96, the probability that the image to be classified belongs to wood is 0.96, and the probability that the image to be classified belongs to plants is 0.04. And finally determining that the target classification labels corresponding to the images to be classified are birds, animals and wood according to the probability that the images to be classified belong to various candidate classifications.
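Plugging these numbers into the fusion rule of step S806 (the maximum over the local features averaged with the global score) reproduces the stated result; the 0.5 threshold is an assumption.

import torch

classes = ["bird", "animal", "wood", "plant"]
global_probs = torch.tensor([0.86, 0.88, 0.82, 0.32])    # from the target global feature
local_probs = torch.tensor([[0.98, 0.92, 0.12, 0.08],    # from target local feature 1
                            [0.16, 0.96, 0.96, 0.04]])   # from target local feature 2
fused = (global_probs + local_probs.max(dim=0).values) / 2
# fused = [0.92, 0.92, 0.89, 0.20]; bird, animal and wood exceed 0.5
print([c for c, p in zip(classes, fused.tolist()) if p > 0.5])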
Based on the same inventive concept as the multi-label image classification method shown in fig. 3, the embodiment of the present application further provides a multi-label image classification apparatus, and the multi-label image classification apparatus may be disposed in a server or a terminal device. Because the device is a device corresponding to the multi-label image classification method and the principle of solving the problems of the device is similar to that of the method, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again.
Fig. 13 shows a schematic structural diagram of a multi-label image classification apparatus provided in an embodiment of the present application, and as shown in fig. 13, the multi-label image classification apparatus includes a global feature extraction unit 1301, a local feature determination unit 1302, a global attention processing unit 1303, a target feature determination unit 1304, and a multi-label classification unit 1305.
The global feature extraction unit 1301 is configured to perform feature extraction on an image to be classified to obtain a corresponding global feature; the global characteristics comprise attribute characteristics of each pixel point in the image to be classified;
a local feature determining unit 1302, configured to identify a bounding box corresponding to each of at least one candidate object in the image to be classified, and divide the global feature based on the obtained at least one bounding box to obtain a local feature corresponding to each of the at least one bounding box;
the global attention processing unit 1303 is configured to obtain corresponding global attention features based on the attribute features of the pixels and the first association degrees between the attribute features;
a target feature determination unit 1304, configured to obtain a corresponding target global feature and at least one target local feature based on the global attention feature and the at least one local feature and a second association between the global attention feature and the at least one local feature;
the multi-label classification unit 1305 obtains a target classification label corresponding to an image to be classified based on the target global feature and the at least one target local feature.
Optionally, the global attention processing unit 1303 is specifically configured to:
aiming at each pixel point, the following operations are respectively executed: taking one pixel point as a query pixel point, and taking other pixel points in all the pixel points as key pixel points; based on a self-attention mechanism, the attribute characteristics of the query pixel points are respectively subjected to association degree matching with the respective attribute characteristics of the key pixel points, and a first association degree between the attribute characteristics is obtained;
and according to the first relevance, weighting and combining the attribute features of the pixels to obtain the corresponding global attention feature.
Optionally, the target feature determining unit 1304 is specifically configured to:
for at least one local feature, the following operations are performed, respectively: based on an attention mechanism, respectively carrying out association degree matching on the attribute characteristics of each pixel point in a local characteristic and the attention characteristics of each pixel point in a global attention characteristic to obtain a second association degree between the local characteristic and the global attention characteristic;
and according to each second relevance, respectively carrying out weighted combination on the global attention feature and the at least one local feature to obtain a corresponding target global feature and at least one target local feature.
Optionally, the multi-label classification unit 1305 is specifically configured to:
respectively determining a global classification label result corresponding to the target global characteristic and a local classification label result corresponding to at least one target local characteristic based on the multi-label classification model;
and obtaining a target classification label corresponding to the image to be classified according to the global classification label result and each local classification label result.
Optionally, the multi-label classification unit 1305 is further configured to:
inputting the target global features into a multi-label classification model, obtaining first probability values of the target global features belonging to each candidate classification respectively, and taking each first probability value as a global classification label result;
and respectively inputting the at least one target local feature into the multi-label classification model to obtain second probability values of the at least one target local feature belonging to each candidate classification, and taking the second probability values as local classification label results.
Optionally, the multi-label classification unit 1305 is further configured to:
for each candidate classification, the following operations are performed: averaging a first probability value corresponding to a candidate classification with a second probability value meeting a set value condition according to a second probability value of at least one target local feature belonging to the candidate classification respectively, and determining a target probability value corresponding to the candidate classification;
and according to the target probability value corresponding to each candidate classification, taking the candidate classification with the target probability value larger than a set threshold value as a target classification label corresponding to the image to be classified.
Optionally, as shown in fig. 14, the apparatus may further include a model training unit 1401, configured to:
acquiring a training data set; the training data set comprises a plurality of image samples, and the image samples are marked with set classification labels;
based on a training data set, carrying out iterative training on the multi-label classification model until a set convergence condition is met, wherein the one-time iterative training process comprises the following steps:
obtaining a target global sample characteristic and at least one target local sample characteristic corresponding to an image sample based on the image sample extracted from the training data set;
determining a target sample classification label corresponding to the image sample according to the target global sample feature and at least one target local sample feature through a multi-label classification model, and performing parameter adjustment on the multi-label classification model based on the loss value determined by the target sample classification label and the set classification label.
Optionally, the model training unit 1401 is further configured to:
performing feature extraction on the image sample to obtain corresponding global sample features, and dividing the global sample features based on at least one boundary box obtained by identifying at least one candidate sample object in the image sample to obtain local sample features corresponding to the at least one sample boundary box;
and obtaining a corresponding target global sample characteristic and at least one target local sample characteristic according to the global sample characteristic and the at least one local sample characteristic.
Optionally, the model training unit 1401 is further configured to:
determining a first sample probability value that a target global sample feature belongs to each candidate sample class and a second sample probability value that at least one target local sample feature belongs to each candidate sample class through a multi-label classification model;
and determining a target sample classification label corresponding to the image sample according to the first sample probability value and the second sample probability value corresponding to each candidate sample classification.
The embodiment of the method and the embodiment of the device are based on the same inventive concept, and the embodiment of the application also provides electronic equipment.
In one embodiment, the electronic device may be a server, such as server 130 shown in FIG. 2. In this embodiment, the electronic device may be configured as shown in fig. 15, and may include a memory 1501, a communication module 1503, and one or more processors 1502.
A memory 1501 for storing computer programs executed by the processor 1502. The memory 1501 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, programs needed for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 1501 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 1501 may also be a non-volatile memory (non-volatile memory) such as, but not limited to, a read-only memory (ROM), a flash memory (flash memory), a hard disk (HDD) or a solid-state drive (SSD); or the memory 1501 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1501 may also be a combination of the above memories.
The processor 1502 may include one or more Central Processing Units (CPUs), or be a digital processing unit, etc. A processor 1502 is configured to implement the multi-label image classification method described above when a computer program stored in the memory 1501 is called.
The communication module 1503 is used for communicating with terminal devices and other electronic devices. If the electronic device is a server, the server may send a target classification tag corresponding to the image to be classified to the terminal device through the communication module 1503.
The embodiment of the present application does not limit the specific connection medium among the memory 1501, the communication module 1503 and the processor 1502. In fig. 15, the memory 1501 and the processor 1502 are connected by a bus 1504, the bus 1504 is indicated by a thick line in fig. 15, and the connection manner between other components is merely illustrative and not limited. The bus 1504 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 15, but that does not indicate only one bus or one type of bus.
In another embodiment, the electronic device may be any electronic device such as a mobile phone, a tablet computer, a POS (Point of Sale) terminal, a vehicle-mounted computer, an intelligent wearable device or a PC, and the electronic device may also be the terminal device 110 shown in fig. 2.
Fig. 16 shows a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 16, the electronic apparatus includes: radio Frequency (RF) circuitry 1610, memory 1620, input unit 1630, display unit 1640, sensor 1650, audio circuitry 1660, wireless fidelity (WiFi) module 1670, processor 1680, and so forth. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 16 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following specifically describes each constituent component of the electronic device with reference to fig. 16:
RF circuit 1610 is configured to receive and transmit signals during messaging or a call; in particular, it receives downlink messages from a base station and hands them to the processor 1680 for processing, and transmits uplink data to the base station.
The memory 1620 may be used to store software programs and modules, such as the program instructions/modules corresponding to the multi-label image classification method and apparatus in the embodiment of the present application; the processor 1680 executes the software programs and modules stored in the memory 1620 so as to perform the various functional applications and data processing of the electronic device, such as the multi-label image classification method provided in the embodiment of the present application. The memory 1620 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program of at least one application, and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 1620 may comprise high-speed random access memory, and may also comprise non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 1630 may be used to receive numeric or character information input by a target object and generate key signal input related to target object setting and function control of the terminal.
Optionally, the input unit 1630 may include a touch panel 1631 and other input devices 1632.
The touch panel 1631, also referred to as a touch screen, may collect touch operations of a target object on or near the touch panel 1631 (for example, operations of the target object on or near the touch panel 1631 using any suitable object or accessory such as a finger or a stylus), and implement corresponding operations according to a preset program, such as an operation of the target object clicking a shortcut identifier of a function module. Alternatively, the touch panel 1631 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a target object, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 1680, and can receive and execute commands sent by the processor 1680. In addition, the touch panel 1631 may be implemented by various types, such as resistive, capacitive, infrared, and surface acoustic wave.
Alternatively, other input devices 1632 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1640 may be used to display information input by or interface information presented to the target object and various menus of the electronic device. The display unit 1640 is a display system of the terminal device, and is used for presenting an interface, such as a display desktop, an operation interface of an application, or an operation interface of a live application.
The display unit 1640 may include a display panel 1641. Alternatively, the Display panel 1641 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
Further, the touch panel 1631 can cover the display panel 1641, and when the touch panel 1631 detects a touch operation on or near the touch panel, the touch panel is transmitted to the processor 1680 to determine the type of the touch event, and then the processor 1680 provides a corresponding interface output on the display panel 1641 according to the type of the touch event.
Although in fig. 16, the touch panel 1631 and the display panel 1641 are implemented as two independent components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 1631 and the display panel 1641 may be integrated to implement the input and output functions of the terminal.
The electronic device may also include at least one sensor 1650, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1641 according to the brightness of ambient light, and the proximity sensor may turn off the backlight of the display panel 1641 when the electronic device moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the electronic device, vibration recognition related functions (such as pedometer, tapping) and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may be further configured to the electronic device, detailed descriptions thereof are omitted.
Audio circuitry 1660, a speaker 1661 and a microphone 1662 may provide an audio interface between the target object and the electronic device. The audio circuit 1660 can convert received audio data into an electrical signal and transmit it to the speaker 1661, which converts it into an acoustic signal for output; conversely, the microphone 1662 converts collected sound signals into electrical signals, which the audio circuit 1660 receives and converts into audio data. After the audio data is processed by the processor 1680, it is transmitted through the RF circuit 1610 to, for example, another electronic device, or output to the memory 1620 for further processing.
WiFi belongs to a short-distance wireless transmission technology, and the electronic equipment can help a target object to receive and send emails, browse webpages, access streaming media and the like through a WiFi module 1670 and provides wireless broadband internet access for the object. Although a WiFi module 1670 is shown in fig. 16, it is understood that it does not belong to the essential constitution of the electronic device, and can be omitted entirely as needed within a scope not changing the essence of the invention.
The processor 1680 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 1620 and calling data stored in the memory 1620. Alternatively, processor 1680 may include one or more processing units; alternatively, the processor 1680 may integrate an application processor and a modem processor, wherein the application processor mainly processes software programs such as an operating system, applications, and functional modules inside the applications, for example, the multi-label image classification method provided in the embodiment of the present application. The modem processor handles primarily wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 1680.
It will be appreciated that the configuration shown in fig. 16 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 16 or have a different configuration than shown in fig. 16. The components shown in fig. 16 may be implemented in hardware, software, or a combination thereof.
According to an aspect of the application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the multi-label image classification method in the above-described embodiment.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (13)

1. A multi-label image classification method is characterized by comprising the following steps:
carrying out feature extraction on the image to be classified to obtain corresponding global features; the global features comprise attribute features of all pixel points in the image to be classified;
identifying a bounding box corresponding to at least one candidate object in the image to be classified, and dividing the global features based on the obtained at least one bounding box to obtain local features corresponding to the at least one bounding box;
acquiring corresponding global attention characteristics based on the attribute characteristics of each pixel point and the first association degree between the attribute characteristics;
obtaining a corresponding target global feature and at least one target local feature based on the global attention feature and at least one local feature and a second degree of association between the global attention feature and the at least one local feature;
and obtaining a target classification label corresponding to the image to be classified based on the target global feature and the at least one target local feature.
2. The method according to claim 1, wherein obtaining the corresponding global attention feature based on the attribute features of the pixels and the first association degree between the attribute features comprises:
aiming at each pixel point, the following operations are respectively executed: taking one pixel point as a query pixel point, and taking other pixel points in all the pixel points as key pixel points; based on a self-attention mechanism, respectively carrying out association degree matching on the attribute characteristics of the query pixel points and the attribute characteristics of each key pixel point to obtain a first association degree between the attribute characteristics;
and according to each first association degree, weighting and combining the attribute characteristics of each pixel point to obtain corresponding global attention characteristics.
3. The method of claim 1, wherein obtaining the respective target global feature and at least one target local feature based on the global attention feature and at least one local feature and a second degree of association between the global attention feature and the at least one local feature comprises:
for at least one local feature, the following operations are performed, respectively: based on a self-attention mechanism, respectively carrying out association degree matching on attribute characteristics of each pixel point in one local characteristic and attention characteristics of each pixel point in the global attention characteristic to obtain a second association degree between the local characteristic and the global attention characteristic;
and according to each second relevance, respectively carrying out weighted combination on the global attention feature and the at least one local feature to obtain a corresponding target global feature and at least one target local feature.
4. The method according to claim 1, 2 or 3, wherein the obtaining of the target classification label corresponding to the image to be classified based on the target global feature and the at least one target local feature comprises:
respectively determining a global classification label result corresponding to the target global feature and a local classification label result corresponding to the at least one target local feature based on a multi-label classification model;
and obtaining a target classification label corresponding to the image to be classified according to the global classification label result and each local classification label result.
5. The method of claim 4, wherein the determining a global classification label result corresponding to the target global feature and a local classification label result corresponding to each of the at least one target local feature based on the multi-label classification model comprises:
inputting the target global features into a multi-label classification model, obtaining first probability values of the target global features belonging to each candidate classification respectively, and taking each first probability value as a global classification label result;
and respectively inputting the at least one target local feature into the multi-label classification model to obtain second probability values of the at least one target local feature belonging to each candidate classification, and taking the second probability values as local classification label results.
6. The method according to claim 5, wherein obtaining the target classification label corresponding to the image to be classified according to the global classification label result and each local classification label result comprises:
for each of the candidate classifications, performing the following operations, respectively: averaging a second probability value meeting a set value condition with a first probability value corresponding to the candidate classification according to a second probability value that the at least one target local feature belongs to the candidate classification respectively, and determining a target probability value corresponding to the candidate classification;
and according to the target probability value corresponding to each candidate classification, taking the candidate classification with the target probability value larger than a set threshold value as the target classification label corresponding to the image to be classified.
7. The method of claim 4, wherein the training process of the multi-label classification model comprises:
acquiring a training data set; the training data set comprises a plurality of image samples, and the image samples are marked with set classification labels;
and performing iterative training on the multi-label classification model based on the training data set until a set convergence condition is met, wherein one iterative training process comprises the following steps:
obtaining a target global sample characteristic and at least one target local sample characteristic corresponding to an image sample based on the image sample extracted from the training data set;
determining a target sample classification label corresponding to the image sample according to the target global sample feature and the at least one target local sample feature through the multi-label classification model, and performing parameter adjustment on the multi-label classification model based on a loss value determined by the target sample classification label and the set classification label.
8. The method according to claim 7, wherein the obtaining of the target global sample feature and the at least one target local sample feature corresponding to the image sample comprises:
performing feature extraction on the image sample to obtain corresponding global sample features, and dividing the global sample features based on at least one bounding box obtained by identifying at least one candidate sample object in the image sample to obtain local sample features corresponding to the at least one sample bounding box;
and obtaining a corresponding target global sample characteristic and at least one target local sample characteristic according to the global sample characteristic and the at least one local sample characteristic.
9. The method of claim 7, wherein the determining, by the multi-label classification model, a target sample classification label corresponding to the image sample according to the target global sample feature and the at least one target local sample feature comprises:
determining, by the multi-label classification model, first sample probability values that the target global sample features respectively belong to respective candidate sample classifications, and second sample probability values that the at least one target local sample feature respectively belong to respective candidate sample classifications;
and determining a target sample classification label corresponding to the image sample according to the first sample probability value and the second sample probability value corresponding to each candidate sample classification.
10. A multi-label image classification apparatus, comprising:
the global feature extraction unit is used for extracting features of the image to be classified to obtain corresponding global features; the global features comprise attribute features of all pixel points in the image to be classified;
the local feature determining unit is configured to identify a bounding box corresponding to each of at least one candidate object in the image to be classified, and divide the global feature based on the obtained at least one bounding box to obtain a local feature corresponding to each of the at least one bounding box;
the global attention processing unit is used for obtaining corresponding global attention characteristics based on the attribute characteristics of the pixel points and the first association degree between the attribute characteristics;
a target feature determination unit, configured to obtain a corresponding target global feature and at least one target local feature based on the global attention feature and at least one local feature and a second association between the global attention feature and the at least one local feature;
and the multi-label classification unit is used for obtaining a target classification label corresponding to the image to be classified based on the target global feature and the at least one target local feature.
11. An electronic device, comprising a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 9.
12. A computer-readable storage medium, characterized in that it comprises program code for causing an electronic device to carry out the steps of the method according to any one of claims 1 to 9, when said program code is run on the electronic device.
13. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.
CN202210593162.5A 2022-05-27 2022-05-27 Multi-label image classification method and device, electronic equipment and storage medium Pending CN115131604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210593162.5A CN115131604A (en) 2022-05-27 2022-05-27 Multi-label image classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210593162.5A CN115131604A (en) 2022-05-27 2022-05-27 Multi-label image classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115131604A true CN115131604A (en) 2022-09-30

Family

ID=83378029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210593162.5A Pending CN115131604A (en) 2022-05-27 2022-05-27 Multi-label image classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115131604A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036788A (en) * 2023-07-21 2023-11-10 阿里巴巴达摩院(杭州)科技有限公司 Image classification method, method and device for training image classification model
CN117036788B (en) * 2023-07-21 2024-04-02 阿里巴巴达摩院(杭州)科技有限公司 Image classification method, method and device for training image classification model
CN116912924A (en) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Target image recognition method and device
CN116912924B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Target image recognition method and device
CN117540306A (en) * 2024-01-09 2024-02-09 腾讯科技(深圳)有限公司 Label classification method, device, equipment and medium for multimedia data
CN117540306B (en) * 2024-01-09 2024-04-09 腾讯科技(深圳)有限公司 Label classification method, device, equipment and medium for multimedia data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination