CN109117781B

CN109117781B - Multi-attribute identification model establishing method and device and multi-attribute identification method

Info

Publication number: CN109117781B
Application number: CN201810890761.7A
Authority: CN
Inventors: 李磊; 董远; 白洪亮; 熊风烨
Original assignee: Beijing Eway Dacheng Technology Co ltd
Current assignee: BEIJING EWAY DACHENG TECHNOLOGY Co.,Ltd.
Priority date: 2018-08-07
Filing date: 2018-08-07
Publication date: 2020-09-08
Anticipated expiration: 2038-08-07
Also published as: CN109117781A

Abstract

The embodiment of the invention provides a method and a device for establishing a multi-attribute identification model and a multi-attribute identification method, wherein the establishing method comprises the following steps: inputting the sample images subjected to multi-attribute labeling in advance into a first model for learning to obtain a feature matrix of each image in the sample images; inputting the feature matrix of each image into a second model for learning to obtain a semantic-space feature matrix of each image; obtaining a first predicted value of each attribute according to the feature matrix, obtaining a second predicted value of each attribute according to the semantic-space feature matrix, and performing weighted summation on the first predicted value and the second predicted value to obtain a comprehensive predicted value of each attribute of each image; and when the loss between the comprehensive predicted value and the label value obtained after learning is stabilized within a preset threshold range, determining to learn to obtain the multi-attribute identification model. The embodiment of the invention can effectively utilize the labeling information to obtain the correlation of the multiple attributes in space and semantics, and has high identification accuracy.

Description

Multi-attribute identification model establishing method and device and multi-attribute identification method

Technical Field

The embodiment of the invention relates to the technical field of machine learning, in particular to a method and a device for establishing a multi-attribute identification model and a multi-attribute identification method.

Background

The traditional pedestrian multi-attribute recognition mode comprises methods of multi-label SVM and Softmax classifier, but the accuracy of the methods is not as high as that of the convolutional neural network.

At present, the method for performing multi-attribute identification by using a convolutional neural network mainly comprises the following steps: 1) the method comprises the following steps of adopting a single-attribute multi-model form, identifying one attribute by one model in a targeted manner, and finally integrating output results of a plurality of models to complete multi-attribute identification; 2) the method adopts a multi-label form, such as using a deep learning framework such as MxNet, Pythroch and the like, directly inputs multiple labels for learning, shares the result of a convolutional layer by multiple attributes in the training process, and finally realizes the identification of different attributes through multiple different full connection layers; 3) the method comprises the steps of dividing the area of an image by using an area suggestion network, identifying specific areas, and converting multiple attributes into a plurality of single attributes for identification.

The efficiency of identifying multiple attributes is not high by adopting a single-attribute multi-model method; learning is carried out in a multi-label mode based on the existing deep learning framework, all attributes of the whole image are viewed at the same time, the relevance among the attributes is ignored, and the complex and various contents in the multi-label image make it difficult to learn to obtain effective feature representation and a classifier; the image is divided according to the regions, the relevance among the attributes is also ignored, and the region proposal is relatively complex to realize and has low practicability.

Disclosure of Invention

Embodiments of the present invention provide a method and an apparatus for building a multi-attribute recognition model, and a multi-attribute recognition method, which overcome the above problems or at least partially solve the above problems.

In a first aspect, an embodiment of the present invention provides a method for building a multiple attribute identification model, including:

inputting a sample image subjected to multi-attribute labeling in advance into a first model for learning to obtain a feature matrix of each image in the sample image, wherein the feature matrix is used for representing attribute feature information of the image;

inputting the feature matrix of each image into a second model for learning to obtain a semantic-space feature matrix of each image, wherein the semantic-space feature matrix is used for representing semantic relations and spatial relations among attributes in the image;

respectively inputting the feature matrix of each image to a first full-connection layer corresponding to each attribute to obtain a first predicted value of each attribute in each image, respectively inputting the semantic-space feature matrix to a second full-connection layer corresponding to each attribute to obtain a second predicted value of each attribute in each image, and performing weighted summation on the first predicted value and the second predicted value to obtain a comprehensive predicted value of each attribute of each image;

when the loss between the comprehensive predicted value of each attribute of each image and each attribute label value of each image obtained after learning is stabilized within a preset threshold range, determining to learn to obtain a multi-attribute identification model, wherein the multi-attribute identification model comprises: the first model, the second model, the first fully-connected layer corresponding to each attribute, and the second fully-connected layer corresponding to each attribute.

In a second aspect, an embodiment of the present invention provides a multiple attribute identification method, including:

inputting an image to be recognized into a first model in a pre-established multi-attribute recognition model to obtain a feature matrix of the image to be recognized;

inputting the feature matrix of the image to be recognized into a second model in the multi-attribute recognition model to obtain a semantic-space feature matrix of the image to be recognized;

respectively inputting the feature matrix of the image to be recognized into a first full-connection layer corresponding to each attribute of the multi-attribute recognition model to obtain a first predicted value of each attribute of the image to be recognized, respectively inputting the semantic-space feature matrix into a second full-connection layer corresponding to each attribute of the multi-attribute recognition model to obtain a second predicted value of each attribute of the image to be recognized, and performing weighted summation on the first predicted value and the second predicted value to obtain a recognition result of each attribute of the image to be recognized.

In a third aspect, an embodiment of the present invention provides an apparatus for building a multiple attribute identification model, including:

the first learning module is used for inputting a sample image subjected to multi-attribute labeling in advance into a first model for learning to obtain a feature matrix of each image in the sample image, wherein the feature matrix is used for representing attribute feature information of the image;

the second learning module is used for inputting the feature matrix of each image into a second model for learning to obtain a semantic-space feature matrix of each image, and the semantic-space feature matrix is used for representing semantic relations and spatial relations among attributes in the image;

the calculation module is used for respectively inputting the feature matrix of each image into a first full-connection layer corresponding to each attribute to obtain a first predicted value of each attribute in each image, respectively inputting the semantic-space feature matrix into a second full-connection layer corresponding to each attribute to obtain a second predicted value of each attribute in each image, and performing weighted summation on the first predicted value and the second predicted value to obtain a comprehensive predicted value of each attribute of each image;

a model determining module, configured to determine that a multi-attribute recognition model is obtained by learning when a loss between a comprehensive predicted value of each attribute of each image obtained after learning and each attribute tag value of each image is stable within a preset threshold range, where the multi-attribute recognition model includes: the first model, the second model, the first fully-connected layer corresponding to each attribute, and the second fully-connected layer corresponding to each attribute.

In a fourth aspect, an embodiment of the present invention provides an electronic device, including:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method provided by any of the various possible implementations of the first aspect.

According to a fifth aspect of the present invention there is provided a non-transitory computer readable storage medium storing computer instructions enabling the computer to perform a method as provided by any one of the various possible implementations of the first aspect described above.

According to the method and the device for establishing the multi-attribute recognition model and the multi-attribute recognition method, the new network structure is added, the correlation of the multi-attribute on space and semantics can be effectively obtained by utilizing the label information, the recognition accuracy is high, the multi-attribute simultaneous parallel training can be realized based on the multi-label input, the training cost is low, the efficiency is high, meanwhile, the preceding-stage network is easy to modify, the attribute addition and deletion are simple and convenient, and the flexibility is high.

Drawings

Fig. 1 is a schematic flow chart of a method for establishing a multi-attribute recognition model according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a multi-attribute recognition model according to an embodiment of the present invention;

fig. 3 is a schematic flowchart of a multi-attribute identification method according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of an apparatus for building a multi-attribute recognition model according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

The embodiment of the invention provides a method and a device for establishing a multi-attribute recognition model and a multi-attribute recognition method, which are different from the prior art in that a network structure is added to extract the potential spatial region relation and semantic relation between attributes, supervised learning is realized only by using the labeling information of pictures, the multi-attribute accurate recognition can be realized, the training cost is low, and the efficiency is high. The method for establishing the multi-attribute recognition model can also be understood as a training method or a learning method.

As shown in fig. 1, a schematic flow chart of a method for establishing a multi-attribute recognition model according to an embodiment of the present invention includes:

s101, inputting a sample image subjected to multi-attribute labeling in advance into a first model for learning, and obtaining a feature matrix of each image in the sample image, wherein the feature matrix is used for representing attribute feature information of the image.

Specifically, a certain number of images can be collected according to the requirement of model training, and multi-attribute labeling is performed on the images. In implementation, the attribute type to be labeled needs to be determined, taking two-classification labeling of a pedestrian image as an example, and determining the attribute of the pedestrian in the image includes: the image identification system comprises a coat length (2 types), a trouser length (2 types), a backpack (2 types), a gender (2 types), boots (2 types), packages (2 types), handbags (2 types) and hats (2 types), wherein after marking is finished, each pedestrian image corresponds to an attribute label which is a 1 x 8-dimensional vector. And then inputting the marked sample images into a first model for learning, randomly initializing the first model when training starts, and sequentially inputting the sample images into the first model for learning to obtain a characteristic matrix corresponding to each image. The attribute feature information refers to feature information related to an attribute in an image.

The first model may adopt an existing network model structure as long as the network model can learn about the image and obtain attribute feature information of the image.

Aiming at the specific implementation of the first model, as an optional embodiment, the first model adopts a convolution network model with a residual error structure, and directly bypasses the input information to the output by introducing the residual error network structure, so that the integrity of the information is protected, the network can be deepened, and the problems of error increase and gradient dispersion on a training set caused by the fact that the number of network layers is continuously deepened are solved. For example, an 18-layer residual error network (Res18) is selected as the first model, and in practical application, the 18-layer network is selected to have higher guarantee on speed and accuracy.

Other networks, such as Res50, Res101, Alexnet, etc., may also be used in the first model, and the embodiments of the present invention are not limited thereto.

When implemented, embodiments of the present invention may support images in any format, including but not limited to JPG, PNG, TIF, BMP, etc. Of course, in order to ensure the uniformity and processing rate of image processing, when receiving the sample image, the sample image may be converted into a uniform format supported by the system, and then processed accordingly. Of course, in order to adapt to the processing performance of the system, the sample images with different sizes may be cut into fixed-size images supported by the system, and then the images are processed accordingly. In addition, the embodiment of the invention also supports the input of sample images with different widths and heights.

S102, inputting the feature matrix of each image into a second model for learning to obtain a semantic-space feature matrix of each image, wherein the semantic-space feature matrix is used for representing semantic relations and spatial relations among attributes in the image.

Specifically, a feature matrix of each image obtained by inputting a sample image into a first model for learning is input into a second model for learning, and the purpose is to extract a spatial position relationship and a semantic relationship between attributes in each image, that is, to obtain a semantic-spatial feature matrix of each image. The semantic relation between the attributes means that one attribute has a prompting effect on the judgment of the other attribute between the two attributes, namely the two attributes are semantically related, for example, the long hair has a certain effect on the gender prediction, and the result of the identification as a female is larger. The spatial position of different attributes in each image has a certain area, and the spatial position relationship among the attributes refers to the association relationship among the areas of the attributes of each image.

The second model is a model structure capable of extracting spatial positional relationships and semantic relationships between the respective attributes in each image.

S103, inputting the feature matrix of each image into a first full-connection layer corresponding to each attribute to obtain a first predicted value of each attribute in each image, inputting the semantic-space feature matrix into a second full-connection layer corresponding to each attribute to obtain a second predicted value of each attribute in each image, and performing weighted summation on the first predicted value and the second predicted value to obtain a comprehensive predicted value of each attribute of each image.

Specifically, after a feature matrix of each image is obtained, two aspects of processing are performed on the feature matrix, on one hand, the subsequent process of the first model is performed, namely, dimension reduction is performed through pooling, then a full-link layer is input, normalization is performed through softmax, and probability values of all categories in one attribute are obtained and recorded as first predicted values. If there are N attributes, then N fully-connected layers are required, one for each attribute.

And on the other hand, after the feature matrix is input into the second model to obtain a semantic-space feature matrix, performing subsequent processes of the second model, namely, inputting the semantic-space feature matrix into the full-connection layer corresponding to each attribute, and after softmax normalization, obtaining probability values of all categories in each attribute, and recording the probability values as second predicted values. Likewise, one attribute corresponds to one fully connected layer.

And performing weighted summation calculation on the first predicted value of each attribute and the corresponding second predicted value of each attribute to obtain a comprehensive predicted value of each attribute. The recognition results of multiple attributes are obtained through a weighted summation mode, the spatial and semantic association among the attributes can be effectively utilized, and the recognition accuracy is improved.

Specifically, the formula of the weighted sum is:

in the above formula, the first and second carbon atoms are,

in order to be said first predicted value,

for the second predicted value, α is a weight distribution coefficient, and the value of α is not limited, and is generally 0.5.

And (3) taking softmax loss between the comprehensive predicted value of each attribute and the input tag value of each attribute as a loss function. By simultaneously training the loss functions of a plurality of attributes by the gradient descent method, and by repeating the above steps S101 to S103 a plurality of times, the comprehensive predicted value of each attribute of each image can be obtained by learning for the sample image. After learning, it is decided by step S104 whether to stop learning to obtain a well-learned multi-attribute recognition model.

S104, when the loss between the comprehensive predicted value of each attribute of each image obtained after learning and each attribute label value of each image is stable within a preset threshold range, determining to learn to obtain a multi-attribute identification model, wherein the multi-attribute identification model comprises: the first model, the second model, the first fully-connected layer corresponding to each attribute, and the second fully-connected layer corresponding to each attribute.

Specifically, in the embodiment of the present invention, the condition for ending the learning process is that the loss between the comprehensive predicted value of each attribute of each image and each attribute tag value of each image can be stabilized within a preset threshold range, where the stabilization within the preset threshold range means that the loss between the comprehensive predicted value of each attribute of each image and each attribute tag value of each image can be smaller than a preset threshold, which indicates that the learning process tends to be stable, and at this time, it may be determined that the learning process is completed, and the established multi-attribute identification model is obtained.

According to the method for establishing the multi-attribute recognition model, the potential spatial position relation and semantic relation extraction among the attributes are realized by adding a new network structure (a second model), supervised learning is realized by only using the marking information of the picture, the multi-attribute accurate recognition can be realized, the training cost is low, and the efficiency is high.

Based on the content of the foregoing embodiment, as an optional embodiment, the step of inputting the feature matrix of each image into a second model for learning to obtain the semantic-spatial feature matrix of each image specifically includes:

inputting the feature matrix of each image into an attention layer for learning to obtain an attention feature matrix of each image, wherein the attention feature matrix is used for representing the weight occupied by a channel corresponding to each attribute in the image;

inputting the feature matrix of each image into a confidence layer for learning to obtain a confidence matrix of each image, wherein the confidence matrix is used for representing the confidence degree of the attribute feature information of the image;

according to the attention feature matrix and the confidence matrix, calculating and obtaining a weighted attention feature matrix of each image, wherein the weighted attention feature matrix is used for representing the confidence degree of the weight occupied by the channel corresponding to each target attribute in the image;

and inputting the weighted attention feature matrix into a spatial regularization layer for learning to obtain a semantic-spatial feature matrix of each image.

Fig. 2 is a schematic structural diagram of a multi-attribute recognition model according to an embodiment of the present invention. The second model is described in detail in this embodiment. In order to be able to extract the spatial and semantic relationships between the individual attributes of each image, the second model consists of an attention layer, a confidence layer and a spatial regularization layer.

And inputting the feature matrix of each image into an attention layer for learning to obtain the attention feature matrix of each image, wherein the attention layer learns the attention value on the channel corresponding to each attribute by adopting an attention mechanism.

As an alternative embodiment, the attention layer consists of three convolutional layers with convolutional kernel sizes of 1 × 1, 3 × 3 and 1 × 1, respectively;

if the label corresponding to the attribute is 1 (i.e. the attribute exists on the graph), the attention value tends to be high in the learning process, and an attention feature matrix of each image is obtained. It should be noted that, in the embodiment of the present invention, one attribute corresponds to one channel. The attention feature matrix represents the weight occupied by the channel corresponding to each attribute of one image.

Inputting the feature matrix of each image into a confidence layer for learning to obtain the confidence matrix of each image, wherein the confidence layer is composed of a convolution layer with the convolution kernel size of 1 x 1;

and then multiplying the confidence matrix with the attention feature matrix after passing through an activation function to obtain the weighted attention feature matrix of each image. The weighted attention feature matrix is used for representing the confidence degree of the channel corresponding to each target attribute in the image in the weight, and the weighted attention feature matrix is obtained after confidence operation is carried out on the attention feature matrix, so that the accuracy of the network for extracting the attention feature can be improved.

The spatial regularization layer is formed by two convolution layers with convolution kernel size of 1 x 1 and a convolution layer with convolution kernel size of w x h in sequence; where w is the width of each image and h is the height of each image. The spatial regularization layer captures spatial information using an attention feature matrix learned by an attention layer. The input label of each image contains abundant spatial information among attributes, and in order to effectively and accurately utilize the information, a confidence matrix is multiplied by attention characteristics after activation of a function to obtain a weighted attention characteristic matrix, so that the weight of the attention characteristic matrix is normalized to [0,1 ]. The weighted attention feature matrix is input into a space regularization layer, two convolution layers with convolution kernels of 1 are firstly arranged in the space regularization layer, the convolution kernels of 1 x 1 can ensure that the length and the width of the special normal are unchanged, information of each channel obtained by convolution is added, the relationship between the channels is obtained without changing the space information, the relationship between the channels is semantic relationship, and one attribute corresponds to one channel. And then, extracting spatial information through a convolution layer with convolution kernel size w h, and finally outputting a feature layer (feature) of 1 x 1, namely a semantic-spatial feature matrix.

In another embodiment of the present invention, on the basis of the above embodiment, the step of inputting the feature matrix of each image into a second model for learning to obtain the semantic-spatial feature matrix of each image further includes:

multiplying the attention characteristic matrix and the confidence matrix, and inputting a result obtained after multiplication into a first pooling layer for dimensionality reduction to obtain an attribute confidence vector;

specifically, the above process of obtaining the attribute confidence vector can be illustrated by the following formula:

where l represents the l-th attribute, W^lAnd b^lIs a parameter of the confidence layer, X represents a feature matrix, X_i,jRepresents the value of the feature matrix X at the (i, j) position,

is the attention value of the ith attribute at position (i, j). It will be appreciated that the above-described,

is the output of the attention layer, and W^lx_i,j+b^lThat is to say the output of the confidence layer,

is the confidence vector for the ith attribute.

The cross entropy function between the attribute confidence vector and the input label is:

wherein y is the label information of the input label,

as attribute confidence vectors

Correspondingly, when the loss between the comprehensive predicted value of each attribute of each image obtained after learning and each attribute tag value of each image is the minimum value, the step of determining to obtain the multi-attribute identification model after learning is specifically as follows:

and determining to learn to obtain a multi-attribute recognition model when the sum of the two losses is stabilized within a preset threshold range, wherein the losses between the attribute confidence vector obtained after learning and the input label of each image, and the losses between the comprehensive predicted value of each attribute of each image and each attribute label value of each image are stable.

The parameters in the first model, the second model, the first fully-connected layer and the second fully-connected layer are learned through a cross entropy loss function between the attribute confidence vector and the input label and a softmax loss between the comprehensive predicted value of each attribute of each image and each attribute label value of each image.

And when the sum of the two losses is stabilized within a preset threshold value range, determining to learn to obtain a multi-attribute recognition model.

In the embodiment of the invention, the loss between the attribute confidence vector and the input label of each image is added into the training process of the model, so that the identification accuracy of the multi-attribute identification model can be improved.

Based on the content of the foregoing embodiments, an embodiment of the present invention further provides a multi-attribute identification method, where a pre-established multi-attribute identification model is used to perform multi-attribute identification on an image, and as shown in fig. 3, the multi-attribute identification method includes:

s301, inputting an image to be recognized into a first model in a pre-established multi-attribute recognition model to obtain a feature matrix of the image to be recognized;

the first model may be a trained convolutional network model with a residual structure, or may be other networks, such as Res50, Res101, Alexnet, and the like. The characteristic matrix of the image to be identified represents attribute characteristic information of the image.

S302, inputting the feature matrix of the image to be recognized into a second model in the multi-attribute recognition model to obtain a semantic-space feature matrix of the image to be recognized;

the second model is able to extract the semantic and spatial association of the various attributes of the image to be recognized.

S303, respectively inputting the feature matrix of the image to be recognized into a first full-connection layer corresponding to each attribute of the multi-attribute recognition model to obtain a first predicted value of each attribute of the image to be recognized, respectively inputting the semantic-space feature matrix into a second full-connection layer corresponding to each attribute of the multi-attribute recognition model to obtain a second predicted value of each attribute of the image to be recognized, and performing weighted summation on the first predicted value and the second predicted value to obtain a recognition result of each attribute of the image to be recognized.

The recognition results of multiple attributes are obtained through a weighted summation mode, the spatial and semantic association among the attributes can be effectively utilized, and the recognition accuracy is improved.

Based on the above embodiment, the step of inputting the feature matrix of the image to be recognized into the second model of the multi-attribute recognition model to obtain the semantic-spatial feature matrix of the image to be recognized specifically includes:

inputting the feature matrix of the image to be recognized to an attention layer to obtain the attention feature matrix of the image to be recognized;

inputting the characteristic matrix of the image to be recognized into a confidence layer to obtain the confidence matrix of the image to be recognized;

calculating to obtain a weighted attention feature matrix of the image to be recognized according to the attention feature matrix and the confidence matrix;

and inputting the weighted attention feature matrix into a spatial regularization layer for learning to obtain a semantic-spatial feature matrix of the image to be recognized.

Specifically, the second model consists of an attention layer, a confidence layer, and a spatial regularization layer. The attention layer is composed of three convolution layers, convolution kernels of the three convolution layers are 1 x 1, 3 x 3 and 1 x 1 respectively, and the feature matrix of the image to be recognized is input into the attention layer to obtain the attention feature matrix of the image to be recognized; the confidence layer is composed of a convolution layer with convolution kernel size of 1 x 1, and the feature matrix of the image to be recognized is input into the confidence layer to obtain the confidence matrix of the image to be recognized; the spatial regularization layer is composed of convolution layers with convolution kernels of 1 × 1 and convolution layers with convolution kernels of w × h; wherein w is the width of each image, h is the height of each image, the attention feature matrix is multiplied by the confidence matrix to obtain a weighted attention feature matrix, then the weighted attention feature matrix is input to a spatial regularization layer, and spatial information and channel information among attributes in the feature matrix are extracted to obtain a semantic-spatial feature matrix.

The multi-attribute identification method provided by the embodiment of the invention considers the semantic and spatial relevance among the attributes, and has higher identification accuracy.

On the other hand, an apparatus for building a multi-attribute identification model is further provided in the embodiments of the present invention, and referring to fig. 4, a schematic structural diagram of the apparatus for building a multi-attribute identification model provided in the embodiments of the present invention is shown, where the apparatus is used to implement the method for building a multi-attribute identification model described in the foregoing embodiments. Therefore, the description and definition of the method in the foregoing embodiments may be used for understanding the execution modules in the embodiments of the present invention.

As shown in fig. 4, the apparatus includes:

the first learning module 401 is configured to input a sample image subjected to multi-attribute labeling in advance to a first model for learning, so as to obtain a feature matrix of each image in the sample image, where the feature matrix is used to represent attribute feature information of the image;

a second learning module 402, configured to input the feature matrix of each image into a second model for learning, to obtain a semantic-spatial feature matrix of each image, where the semantic-spatial feature matrix is used to represent semantic relationships and spatial relationships between attributes in the image;

a calculating module 403, configured to input the feature matrix of each image to a first fully-connected layer corresponding to each attribute to obtain a first predicted value of each attribute in each image, input the semantic-spatial feature matrix to a second fully-connected layer corresponding to each attribute to obtain a second predicted value of each attribute in each image, and perform weighted summation on the first predicted value and the second predicted value to obtain a comprehensive predicted value of each attribute of each image;

a model determining module 404, configured to determine that a multi-attribute recognition model is obtained by learning when a loss between the learned comprehensive predicted value of each attribute of each image and each attribute tag value of each image is stable within a preset threshold range, where the multi-attribute recognition model includes: the first model, the second model, the first fully-connected layer corresponding to each attribute, and the second fully-connected layer corresponding to each attribute.

The device for establishing the multi-attribute recognition model provided by the embodiment of the invention extracts the potential spatial region relationship and semantic relationship utilization between the attributes by adding a new network structure (a second model), realizes supervised learning by only using the labeling information of the picture, can realize accurate recognition of the multi-attribute, and has low training cost and high efficiency.

Based on the content of the foregoing embodiment, as an optional embodiment, the second learning module 402 is specifically configured to:

Fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention, and as shown in fig. 5, a processor (processor)501, a memory (memory)502, and a bus 503;

the processor 501 and the memory 502 respectively complete communication with each other through a bus 503; the processor 501 is configured to call the program instructions in the memory 502 to execute the method for building the multi-attribute identification model provided by the above embodiments, for example, the method includes: inputting a sample image subjected to multi-attribute labeling in advance into a first model for learning to obtain a feature matrix of each image in the sample image, wherein the feature matrix is used for representing attribute feature information of the image; inputting the feature matrix of each image into a second model for learning to obtain a semantic-space feature matrix of each image, wherein the semantic-space feature matrix is used for representing semantic relations and spatial relations among attributes in the image; respectively inputting the feature matrix of each image to a first full-connection layer corresponding to each attribute to obtain a first predicted value of each attribute in each image, respectively inputting the semantic-space feature matrix to a second full-connection layer corresponding to each attribute to obtain a second predicted value of each attribute in each image, and performing weighted summation on the first predicted value and the second predicted value to obtain a comprehensive predicted value of each attribute of each image; when the loss between the comprehensive predicted value of each attribute of each image and each attribute label value of each image obtained after learning is stabilized within a preset threshold range, determining to learn to obtain a multi-attribute identification model, wherein the multi-attribute identification model comprises: the first model, the second model, the first fully-connected layer corresponding to each attribute, and the second fully-connected layer corresponding to each attribute.

An embodiment of the present invention provides a non-transitory computer-readable storage medium, where the non-transitory computer-readable storage medium stores computer instructions, where the computer instructions cause a computer to execute the method for building a multi-attribute recognition model provided in the foregoing embodiment, for example, the method includes: inputting a sample image subjected to multi-attribute labeling in advance into a first model for learning to obtain a feature matrix of each image in the sample image, wherein the feature matrix is used for representing attribute feature information of the image; inputting the feature matrix of each image into a second model for learning to obtain a semantic-space feature matrix of each image, wherein the semantic-space feature matrix is used for representing semantic relations and spatial relations among attributes in the image; respectively inputting the feature matrix of each image to a first full-connection layer corresponding to each attribute to obtain a first predicted value of each attribute in each image, respectively inputting the semantic-space feature matrix to a second full-connection layer corresponding to each attribute to obtain a second predicted value of each attribute in each image, and performing weighted summation on the first predicted value and the second predicted value to obtain a comprehensive predicted value of each attribute of each image; when the loss between the comprehensive predicted value of each attribute of each image and each attribute label value of each image obtained after learning is stabilized within a preset threshold range, determining to learn to obtain a multi-attribute identification model, wherein the multi-attribute identification model comprises: the first model, the second model, the first fully-connected layer corresponding to each attribute, and the second fully-connected layer corresponding to each attribute.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.

In the present disclosure, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for establishing a multi-attribute recognition model is characterized by comprising the following steps:

when the loss between the comprehensive predicted value of each attribute of each image and each attribute label value of each image obtained after learning is stabilized within a preset threshold range, determining to learn to obtain a multi-attribute identification model, wherein the multi-attribute identification model comprises: the first model, the second model, the first fully connected layer corresponding to each attribute, and the second fully connected layer corresponding to each attribute;

the step of inputting the feature matrix of each image into a second model for learning to obtain the semantic-spatial feature matrix of each image specifically comprises:

2. The method of claim 1, wherein the first model is a convolutional network model with residual structure.

3. The method according to claim 1, wherein the step of inputting the feature matrix of each image into a second model for learning to obtain the semantic-spatial feature matrix of each image further comprises:

correspondingly, when the loss between the comprehensive predicted value of each attribute of each image obtained after learning and each attribute tag value of each image is stabilized within a preset threshold range, the step of determining to learn to obtain the multi-attribute identification model specifically comprises the following steps:

4. The method of claim 1, wherein the attention layer is composed of three convolutional layers having convolutional kernel sizes of 1 x 1, 3 x 3, and 1 x 1, respectively;

the confidence layer consists of a convolution layer with a convolution kernel size of 1 x 1;

the spatial regularization layer is formed by two convolution layers with convolution kernel size of 1 x 1 and a convolution layer with convolution kernel size of w x h in sequence; where w is the width of each image and h is the height of each image.

5. A multi-attribute identification method is characterized by comprising the following steps:

pre-building a multi-attribute recognition model by applying the method according to any one of claims 1 to 4;

inputting an image to be recognized into a first model in the pre-established multi-attribute recognition model to obtain a feature matrix of the image to be recognized;

respectively inputting the feature matrix of the image to be recognized into a first full-connection layer corresponding to each attribute of the multi-attribute recognition model to obtain a first predicted value of each attribute of the image to be recognized, respectively inputting the semantic-space feature matrix into a second full-connection layer corresponding to each attribute of the multi-attribute recognition model to obtain a second predicted value of each attribute of the image to be recognized, and performing weighted summation on the first predicted value and the second predicted value to obtain a recognition result of each attribute of the image to be recognized;

the step of inputting the feature matrix of the image to be recognized into a second model of the multi-attribute recognition model to obtain a semantic-spatial feature matrix of the image to be recognized specifically includes:

and inputting the weighted attention feature matrix to a spatial regularization layer to obtain a semantic-spatial feature matrix of the image to be recognized.

6. An apparatus for building a multi-attribute recognition model, comprising:

a model determining module, configured to determine that a multi-attribute recognition model is obtained by learning when a loss between a comprehensive predicted value of each attribute of each image obtained after learning and each attribute tag value of each image is stable within a preset threshold range, where the multi-attribute recognition model includes: the first model, the second model, the first fully connected layer corresponding to each attribute, and the second fully connected layer corresponding to each attribute;

wherein the second learning module is specifically configured to:

7. An electronic device, comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 4.

8. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 4.