CN112487207A - Image multi-label classification method and device, computer equipment and storage medium - Google Patents

Image multi-label classification method and device, computer equipment and storage medium

Info

Publication number
CN112487207A
CN112487207A (application CN202011451978.1A)
Authority
CN
China
Prior art keywords
image
label
labels
matrix
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011451978.1A
Other languages
Chinese (zh)
Inventor
罗彤
郭彦东
李亚乾
杨林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co ltd, Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Shanghai Jinsheng Communication Technology Co ltd
Priority to CN202011451978.1A priority Critical patent/CN112487207A/en
Publication of CN112487207A publication Critical patent/CN112487207A/en
Priority to PCT/CN2021/122741 priority patent/WO2022121485A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this application disclose an image multi-label classification method and device, computer equipment, and a storage medium, belonging to the technical field of image processing. In the provided method, after an image to be processed is obtained, it is input into a label classification model to obtain the corresponding image features; a graph feature matrix is obtained from a knowledge graph; the image features and the graph feature matrix are combined to obtain data to be activated; and at least two second labels corresponding to the image to be processed are then obtained from the data to be activated. The knowledge graph indicates the relationships between labels as well as the attributes of the labels themselves. Because the label classification model uses the information provided by the knowledge graph when adding multiple labels to the image to be processed, the reliability of the multiple labels obtained for the image is improved and the complexity of obtaining them is reduced.

Description

Image multi-label classification method and device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a multi-label classification method and device for images, computer equipment and a storage medium.
Background
With the rapid development of artificial intelligence technology, the intelligent classification of images in a terminal's photo album has become increasingly capable.
In the related art, a machine learning model provided by artificial intelligence technology can intelligently determine the types of objects contained in an image, so that the image can be marked with the corresponding labels. However, when a single image needs to be marked with multiple labels, conventional models suffer from problems such as an increased error rate and a low operation speed.
Disclosure of Invention
The embodiment of the application provides a multi-label classification method and device for images, computer equipment and a storage medium. The technical scheme is as follows:
according to an aspect of the present application, there is provided a multi-label classification method for an image, the method including:
extracting image features of an image to be processed through a feature extraction layer in a label classification model, wherein the label classification model is a neural network model used for adding at least two labels to the image to be processed;
processing the image features through a map feature matrix to obtain data to be activated, wherein the map feature matrix is obtained by processing a knowledge map through a graph convolution neural network, and the knowledge map is used for indicating the attributes of the first labels and the relationship between at least two first labels;
processing the data to be activated through an activation layer in the label classification model to obtain at least two second labels;
determining at least two second labels as labels of the image to be processed, wherein the second labels belong to the first labels.
According to another aspect of the present application, there is provided an apparatus for multi-label classification of images, the apparatus including:
the first acquisition module is used for acquiring an image to be processed;
the characteristic extraction module is used for extracting the image characteristics of the image to be processed;
the second acquisition module is used for acquiring data to be activated according to the image characteristics and a map characteristic matrix, wherein the map characteristic matrix is obtained after a knowledge map is processed by a map convolutional neural network, and the knowledge map is used for indicating the attributes of the first labels and the relationship between at least two first labels;
and the label determining module is used for obtaining at least two second labels according to the data to be activated, and determining the at least two second labels as the labels of the image to be processed, wherein the second labels belong to the first labels.
According to another aspect of the present application, there is provided a terminal comprising a processor and a memory, the memory having stored therein at least one instruction, the instruction being loaded and executed by the processor to implement the multi-label classification method for images as provided in the various aspects of the present application.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor to implement a multi-label classification method for images as provided in various aspects of the present application.
According to one aspect of the present application, a computer program product is provided that includes computer instructions stored in a computer readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the methods provided in the various alternative implementations of the multi-label classification aspect of the images described above.
The present application provides an image multi-label classification method in which, after the image to be processed is obtained, it is input into the label classification model to obtain the corresponding image features; a graph feature matrix is obtained from the knowledge graph; the image features and the graph feature matrix are combined to obtain the data to be activated; and at least two second labels corresponding to the image to be processed are then obtained from the data to be activated. The knowledge graph indicates the relationships between labels as well as the attributes of the labels themselves. Because the label classification model uses the information provided by the knowledge graph when adding multiple labels to the image to be processed, the reliability of the multiple labels obtained for the image is improved and the complexity of obtaining them is reduced.
Drawings
In order to more clearly describe the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is an architecture diagram of a tag classification model provided by an embodiment of the present application;
FIG. 2 is an architecture diagram of a tag classification model provided based on the embodiment shown in FIG. 1;
FIG. 3 is a flowchart of a multi-label classification method for images according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a global feature provided based on the embodiment shown in FIG. 3;
FIG. 5 is a schematic diagram of local features provided based on the embodiment shown in FIG. 3;
FIG. 6 is a visual interface after image processing provided based on the embodiment shown in FIG. 3;
FIG. 7 is a flowchart of a method for multi-label classification of images according to another exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of an image post-processing provided based on the embodiment shown in FIG. 7;
FIG. 9 is a schematic diagram of a process for automatically creating an album according to an embodiment of the present application;
FIG. 10 is a block diagram illustrating an exemplary embodiment of an apparatus for multi-label classification of images;
fig. 11 is a block diagram of a terminal according to an exemplary embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
In the description of the present application, it is to be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. It is also to be noted that, unless otherwise explicitly specified or limited, the terms "connected" and "coupled" are to be interpreted broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediate element. The specific meaning of these terms in the present application can be understood by those of ordinary skill in the art according to the specific situation. Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "And/or" describes an association relationship between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
In order to make the solution shown in the embodiments of the present application easy to understand, several terms appearing in the embodiments of the present application will be described below.
Image to be processed: an image to which labels are to be added. In one possible approach, the image to be processed is an image captured by the terminal. In another possible approach, the image to be processed is an image captured by another computer device, and the terminal adds labels to it. In yet another possible approach, the image to be processed may also be a virtual image generated by other computer equipment according to a specified algorithm or other image tools.
Optionally, when the image to be processed is a real image captured by a device, there are two ways of acquiring it. Illustratively, in the first way, the device applying the image multi-label classification method provided by this application captures the image itself through an image acquisition component with which it is equipped. In the second way, the device applying the method obtains the image from another device by means of image transmission and then performs multi-label classification on the transmitted image.
For example, when the embodiments of this application are applied to a terminal such as a mobile phone, the image to be processed may be an image captured by the terminal through its camera. When the embodiments of this application are applied to equipment such as a server, the image to be processed may be an image that the server acquires from the terminal over the network; such an image is typically still one captured by the terminal through its camera.
Neural network model: a complex network system formed by a large number of simple, interconnected processing units; the processing units may also be referred to as neurons. A neural network model can reflect many basic characteristics of human brain function and is essentially a nonlinear dynamical learning system. In the present application, the neural network model is a mathematical model to which a neural network structure is applied. In one possible implementation, one part of the neural network model adopts a neural network structure while other parts adopt other data structures, and the parts cooperate to process data and produce the result expected by the designer. For example, the label classification model used in this application can attach at least two second labels to the image to be processed, thereby realizing the multi-label classification capability for images.
Convolutional Neural Network (CNN): a class of feed-forward neural networks that contain convolution computations and have a deep structure; it is one of the most widely applied algorithms in deep learning.
Schematically, a typical structure of a CNN includes an input layer, a hidden layer, and an output layer.
First, the input layer is used to receive the data input to the CNN. Generally, the input layer can process multi-dimensional data; when the input data is an image, the input layer receives three-dimensional input data indicating the coordinates of each pixel and its RGB (Red Green Blue) channels. Optionally, before the pixel data enters the input layer, normalization may be performed to scale the RGB channel values of the pixels from [0, 255] to [0, 1], so as to improve the learning efficiency and inference capability of the CNN.
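The normalization described above can be sketched as follows; this is only a minimal illustration in Python, and the exact preprocessing used by the label classification model is not specified here.

```python
import numpy as np

def normalize_image(image_uint8: np.ndarray) -> np.ndarray:
    """Scale RGB channel values from [0, 255] to [0, 1] before they enter the input layer."""
    return image_uint8.astype(np.float32) / 255.0

# Example: a 2 x 2 RGB image whose channel values originally lie in [0, 255]
pixels = np.array([[[0, 128, 255], [64, 64, 64]],
                   [[255, 255, 255], [0, 0, 0]]], dtype=np.uint8)
print(normalize_image(pixels))  # all values now lie in [0, 1]
```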
Second, the hidden layer can include 3 common structures of convolutional layers, pooling layers, and fully-connected layers.
(1) Convolutional layer: its function is to extract features from the input data.
A convolutional layer can be described from three perspectives: the convolution kernel, the convolutional layer parameters, and the excitation function.
A. Convolution kernel. A convolutional layer contains multiple convolution kernels. Each kernel comprises multiple elements, and each element corresponds to a weight coefficient and a bias vector; these elements are similar to the neurons of a feed-forward neural network.
B. Convolutional layer parameters. The convolutional layer parameters comprise the kernel size, the stride, and the padding; together these three parameters determine the size of the feature map output by the convolutional layer, and they are hyper-parameters of the convolutional neural network. The kernel size may be any value smaller than the input image size, and the larger the kernel, the more complex the input features that can be extracted. The stride defines the distance between two consecutive positions at which the kernel is applied to the feature map: with a stride of 1 the kernel sweeps over the elements of the feature map one by one, while with a stride of n it skips n-1 elements between successive positions. The purpose of padding is to maintain the feature-map size after the convolution.
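The effect of these three parameters on the output feature-map size can be illustrated with the small PyTorch sketch below; the channel counts and input size are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # one RGB image of size 224 x 224

# Kernel 3x3, stride 1, padding 1: the spatial size of the feature map is preserved.
same_size = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
print(same_size(x).shape)        # torch.Size([1, 16, 224, 224])

# Kernel 3x3, stride 2, no padding: one element is skipped between positions,
# so the output shrinks to (224 - 3) // 2 + 1 = 111.
downsampled = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=0)
print(downsampled(x).shape)      # torch.Size([1, 16, 111, 111])
```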
C. Excitation function (activation function). The role of the excitation function is to help express more complex features.
(2) Pooling layer (pooling layer)
After feature extraction is performed on the convolutional layer, the output feature map is transmitted to the pooling layer for feature selection and information filtering. The pooling layer is provided with a preset pooling function and is used for replacing the result of a single point in the feature map with feature map statistics of adjacent areas.
(3) Full connecting layer (full-connected layer)
The full connection layer is used for carrying out nonlinear combination on the characteristics extracted by the layers to obtain output, and outputting data to the output layer.
Again, upstream of the output layer in CNN is typically a fully connected layer. For the image classification problem, the output layer outputs a classification label using a logistic function or a normalized exponential function (softmax function). In the object detection problem, the output layer may be designed to output the center coordinates, size, and classification of the object. In semantic segmentation, the output layer outputs a classification result for each pixel.
Graph Convolutional Network (GCN): a convolutional neural network used for feature extraction on graph data.
Knowledge graph: graph data indicating the respective attributes of a plurality of nodes and the relationships of those nodes to one another. In one possible approach, the knowledge graph comprises a label relation matrix and a node information matrix; together, these two matrices may be referred to as the knowledge graph.
The present application provides an image multi-label classification method that improves on the high error rate and low operation speed encountered in the related art when a single image is classified with multiple labels. It should be noted that in the related art only feature extraction is performed on the image and the specific labels are determined from the features in the image, so a set of accurate labels can be determined only when the features corresponding to those labels are prominent or obvious. If the features of an object that should be tagged are not obvious in the image, the related art has difficulty identifying the corresponding label. The solution provided in this application, however, is able to identify such labels, as described in the following embodiments.
In this application, the label classification model is constructed by combining neural network structures so as to realize the image multi-label classification method. It should be noted that the label classification model needs to be trained before it is applied, i.e. before the inference phase, as described below.
Referring to fig. 1, fig. 1 is an architecture diagram of a tag classification model according to an embodiment of the present application. In fig. 1, the tag classification model 100 includes a convolutional neural network 110, a matrix multiplication module 120, and an activation layer 130.
In the label classification model 100 shown in FIG. 1, the convolutional neural network 110 is configured to receive the image to be processed 1a, which is processed by the convolutional neural network 110 to obtain the corresponding image feature matrix 1b. Subsequently, the image feature matrix 1b and the graph feature matrix 1c are multiplied in the matrix multiplication module 120 to obtain the data to be activated 1d; the data to be activated 1d is input into the activation layer 130, which processes it to obtain the second label group 1e. In FIG. 1, the second label group 1e includes 3 second labels.
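A minimal sketch of this inference flow is given below. The backbone network, the sigmoid used in the activation layer, and the class and variable names are assumptions made only for illustration; they are not the exact configuration disclosed in this application.

```python
import torch
import torch.nn as nn

class LabelClassificationModel(nn.Module):
    """Sketch of FIG. 1: convolutional network 110 -> matrix multiplication 120 -> activation layer 130."""

    def __init__(self, backbone: nn.Module, graph_feature_matrix: torch.Tensor):
        super().__init__()
        self.backbone = backbone
        # Graph feature matrix 1c, shape (C, N): C first labels, N feature dimensions.
        # Registered as a buffer because it changes only when the knowledge graph changes.
        self.register_buffer("graph_features", graph_feature_matrix)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Image feature matrix 1b: a length-N feature vector for the image to be processed.
        image_features = self.backbone(image)
        # Data to be activated 1d: (C, N) x (N,) -> (C,), one value per first label.
        to_activate = self.graph_features @ image_features
        # Activation layer 130; a sigmoid is assumed here to turn each value into a probability.
        return torch.sigmoid(to_activate)
```

The resulting per-label probabilities are then compared with per-label thresholds to select the second labels, as described later in this specification.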
Note that the graph feature matrix 1c in FIG. 1 is data that is updated whenever the knowledge graph is updated. Taking the way in which the graph feature matrix 1c is updated into account gives another view of the label classification model's structure. Referring to FIG. 2, FIG. 2 is an architecture diagram of the label classification model provided based on the embodiment shown in FIG. 1.
In FIG. 2, the knowledge graph comprises a label relation matrix 2a and a node information matrix 2b. The knowledge graph may be input into the graph convolutional neural network 200 to obtain the graph feature matrix 1c. When the knowledge graph is updated, the computer device may retrieve the label relation matrix 2a and the node information matrix 2b from the updated knowledge graph and input them into the graph convolutional neural network 200 to obtain the graph feature matrix 1c. Illustratively, the graph feature matrix 1c is updated only when the knowledge graph changes. When the knowledge graph does not change, the graph feature matrix 1c keeps the value obtained by the last computation and participates in the computation shown in FIG. 1.
Based on the framework of the classification shown in fig. 1 and 2, a computer device may train the framework shown in fig. 2 when constructing the label classification model. In another possible approach, the structure shown in fig. 2 is also referred to as a dual-branch architecture.
When training the model shown in fig. 2, a knowledge graph needs to be constructed first. The construction process of the knowledge graph can be divided into a keyword collection stage and a knowledge graph construction stage.
In the keyword collection stage, the cloud server collects large amounts of data about how users use the mobile phone album. It should be noted that the album usage data collected by the server is desensitized and does not involve any private user information. From the album usage data, the server can extract keywords that users frequently search for, which may be the top n keywords that appear most frequently among the keywords the server has collected. The keyword types may include categories such as entities, scenes, behaviors, or events. Entities may include objects such as cats, dogs, flowers, vehicles, cakes, balloons, dishes, drinks, shops, rivers, beaches, and the sea. Scenes may include information such as sunrise and sunset, banquets, playgrounds, or sports scenes. Behaviors include walking, running, eating, standing, and so on. Events include information such as travel, shopping, or dining.
After the server determines the keywords that users frequently search for, it may build a tag list containing those keywords. Note that the tags in this tag list serve as the first labels.
Next, the server will build a knowledge graph from the list of tags. In this stage, the server may implement the construction of the knowledge-graph by performing the following steps a) to h).
Step a), extracting text-class label relations from a text-class knowledge graph.
Illustratively, the text-class knowledge graph may be a knowledge graph such as ConceptNet or WordNet. Text-class label relations may include relations inherent in the labels' semantics, such as containment or predicate relations. In this step, the server selects the text-class knowledge graph in advance and extracts the text-class label relations from it. It should be noted that the server may select whichever knowledge graph gives the best text-class label relations in the current field; the specific knowledge graphs named here are only exemplary, and this application does not limit which text-class knowledge graph is used.
Step b), extracting the cross-correlations between labels in images from a specified image-class data set.
Illustratively, the cross-correlation may be a conditional probability. When the cross-correlation is a conditional probability, it is calculated with the following formula.
P(A|B) = P(AB) / P(B)
where P(A|B) is the conditional probability that label A occurs given that label B occurs, P(AB) is the probability that labels A and B occur simultaneously, and P(B) is the probability that label B occurs.
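Assuming the image-class data set is available as a list of per-image label sets (an assumption made only for this sketch), the conditional probabilities can be estimated from co-occurrence counts as follows.

```python
from collections import Counter
from itertools import permutations

def conditional_probabilities(image_label_sets):
    """Estimate P(A|B) = P(AB) / P(B) from per-image label sets."""
    single = Counter()   # number of images containing each label
    pair = Counter()     # number of images containing each ordered label pair
    for labels in image_label_sets:
        for label in labels:
            single[label] += 1
        for a, b in permutations(labels, 2):
            pair[(a, b)] += 1
    total = len(image_label_sets)
    return {(a, b): (pair[(a, b)] / total) / (single[b] / total) for (a, b) in pair}

# Example: "beach" appears in both images that contain "sea", so P(beach | sea) = 1.0
dataset = [{"sea", "beach", "dog"}, {"sea", "beach"}, {"dog", "grass"}]
print(conditional_probabilities(dataset)[("beach", "sea")])
```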
Step c), calculating the weights between labels.
In this step, the weights of the text-class label relations refer to the weights in the text-class knowledge graph used in step a), e.g. the weights in knowledge graphs such as ConceptNet and WordNet. If several text-class label relations are merged, the weighted average of the merged relations' weights is used as the weight of the merged relation. The image-class label relations use the conditional probabilities calculated in step b) and are generally not merged. If a text-class label relation carries no weight, the weight is filled with 0 or 1, where 0 indicates that there is no relation between the two nodes and 1 indicates that there is a relation; the values 0 and 1 are used to fill in the relation between two nodes that have a logical relationship. In the embodiments of this application, the nodes of the knowledge graph represent labels; for example, the name of a node in the knowledge graph is the name of a label mentioned in this application.
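A sketch of assembling the label relation matrix under these rules is shown below. The dictionary-based inputs and the use of the maximum when a text-class weight and an image-class weight exist for the same pair are assumptions for illustration; the application does not specify how the two kinds of weight are combined for a single edge.

```python
import numpy as np

def build_label_relation_matrix(labels, text_relations, image_relations):
    """Assemble the label relation (adjacency) matrix A of the knowledge graph.

    labels:          ordered list of the C first labels (graph nodes).
    text_relations:  {(a, b): [w1, w2, ...]} weights of text-class relations; an empty
                     list means the relation exists but has no weight and is filled with 1.
    image_relations: {(a, b): w} conditional probabilities from the image-class data set.
    """
    index = {label: i for i, label in enumerate(labels)}
    A = np.zeros((len(labels), len(labels)), dtype=np.float32)   # 0 = no relation
    for (a, b), weights in text_relations.items():
        # Merged text-class relations are averaged; a weightless relation is filled with 1.
        A[index[a], index[b]] = float(np.mean(weights)) if weights else 1.0
    for (a, b), w in image_relations.items():
        # Image-class relations (conditional probabilities) are not merged;
        # taking the maximum with any text-class weight is an assumption of this sketch.
        A[index[a], index[b]] = max(A[index[a], index[b]], w)
    return A
```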
Step d), combining the text-class label relations and the image-class label relations to form the edges of the knowledge graph.
Step e), manually curating the attributes of each label, such as its definition, keywords, and synonyms.
In this step, a technician reads and combs through the knowledge graph, judges whether it is logically close to real-life situations, and manually adjusts abnormal data. The purpose of this step is to improve the knowledge graph's ability to describe the associations between photos in real life.
Step f), embedding each label with a specified algorithm to obtain a word vector (word embedding).
The specified algorithm may be any algorithm capable of embedding, such as GloVe.
Step g), taking the definitions, keywords, synonyms, or word vectors obtained from the data above as the node attributes of the knowledge graph.
Step h), combining the edges and the nodes to obtain the completed knowledge graph.
In the knowledge graph established above, each node represents a label, and the edges represent relationships between labels, which include, but are not limited to, hypernym-hyponym relationships, correlation relationships, positional relationships in an image, predicate relationships, and so on. A hypernym-hyponym relationship indicates the relationship between a broader concept and a narrower concept.
For example, "pet" is a hypernym of "cat", and "cat" is a hyponym of "pet". A correlation relationship indicates the probability that two labels appear in the same image. A positional relationship in an image indicates the relative positions of two labels in the image; for example, "apple" is above "table" and "floor" is below "table". Predicate relationships indicate the definitions of certain labels; for example, an "apple" is "food".
The attributes of each node include, but are not limited to, the embedding, the node type, and synonyms. The embedding is a word vector obtained by embedding the label name as a word using an NLP (Natural Language Processing) algorithm. Node types may include object, scene, event, and so on.
In summary, the knowledge graph can be constructed through the above process, and after a knowledge graph is constructed, the related data in the knowledge graph is also fixed. Next, the server may train the label classification model applied in the present application according to the architecture shown in fig. 2, so as to obtain a label classification model that can be used for inference.
Taking the label classification model shown in FIG. 2 as an example, the whole training process of the label classification model is described below. The data that need to be updated in the training phase are the parameters in the convolutional neural network 110 and the parameters in the graph convolutional neural network 200. When the training process ends, both sets of parameters are fixed.
When the graph convolutional neural network 200 is trained, each graph convolution layer can be represented by the following formula.
H^(l+1) = σ( D̃^(−1/2) · Ã · D̃^(−1/2) · H^(l) · W^(l) ), where Ã = A + I
In the above formula, H^(l) is the input to the current graph convolution layer; H^(1), the input to the first graph convolution layer of the graph convolutional neural network 200, is the node information matrix. Ã = A + I is the label relation matrix A with self-connections added, D̃ is the degree matrix of Ã, and D̃^(−1/2) · Ã · D̃^(−1/2) is the normalization of the adjacency matrix Ã. W^(l) is a parameter to be learned in the training process, and σ(·) is an activation function.
In the training process, each graph convolution layer processes the node information output by the previous graph convolution layer to obtain new node information and outputs it to the next graph convolution layer; the graph structure A on which the graph convolution layers operate does not change throughout the graph convolutional neural network 200.
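A sketch of one such graph convolution layer is given below; the ReLU activation and the parameter initialization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One graph convolution layer: H^(l+1) = sigma( D~^(-1/2) A~ D~^(-1/2) H^(l) W^(l) )."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_features, out_features))  # W^(l), learned in training
        nn.init.xavier_uniform_(self.weight)

    def forward(self, node_features: torch.Tensor, label_relation: torch.Tensor) -> torch.Tensor:
        # A~ = A + I: the label relation matrix with self-connections added.
        a_tilde = label_relation + torch.eye(label_relation.size(0), device=label_relation.device)
        # D~^(-1/2): inverse square root of the degree matrix of A~.
        d_inv_sqrt = torch.diag(a_tilde.sum(dim=1).pow(-0.5))
        # Normalize the adjacency, then propagate and transform the node information.
        normalized = d_inv_sqrt @ a_tilde @ d_inv_sqrt
        return torch.relu(normalized @ node_features @ self.weight)

# Stacking such layers and feeding the node information matrix through them,
# with the label relation matrix A held fixed, yields the C x N graph feature matrix.
```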
After the label classification model is trained, the computer device can use the model to perform the multi-label classification method for images shown in the present application, for details, see the description of fig. 3.
Referring to FIG. 3, FIG. 3 is a flowchart of a multi-label classification method for an image according to an embodiment of the present disclosure. The method shown in FIG. 3 may be applied to a computer device, which in the embodiments of this application may be either a terminal or a server. The implementation of the method is described below.
In the present application, a computer device may acquire an image to be processed.
The manner of acquiring the image to be processed may be different according to the specific implementation manner of the computer device.
Illustratively, when the computer device is a terminal, the terminal can directly shoot an image through the image acquisition assembly, and the shot image is taken as an image to be processed. In another possible mode, the terminal may acquire an image from another computer device, and use the acquired image as an image to be processed. In yet another possible manner, the terminal may further synthesize a virtual image according to a specified instruction and data by an installed image synthesis application, and treat the virtual image as an image to be processed.
Illustratively, when the computer device is a server, the server may receive an image uploaded by the terminal as an image to be processed. Alternatively, the server may synthesize a virtual image according to a specified instruction and data by an installed image synthesis application, and use the virtual image as an image to be processed.
Regarding the number of images to be processed, the number of images to be processed may be one or more. When the images to be processed are multiple, the computer device can process the multiple images to be processed in a serial mode or a parallel mode.
In the serial mode, the computer device will process the next image after one image has been successfully tagged with at least two second tags.
In a parallel manner, the computer device will process several images simultaneously, which will obtain respective corresponding second labels simultaneously.
It should be noted that, the number of images to be processed and the serial manner or the parallel manner adopted by the computer device in the embodiments of the present application are different according to the actual application scenario, and the embodiments of the present application are not limited to this.
Step 310, extracting image features of the image to be processed through a feature extraction layer in a label classification model, wherein the label classification model is a neural network model used for adding at least two labels to the image to be processed.
Illustratively, after acquiring the image to be processed, the computer device will be able to extract image features from the image to be processed. In this example, the computer device extracts image features of the image to be processed through a feature extraction layer in the label classification model. In practical application, the label classification model provided by the application is used for providing at least two labels for one image to be processed.
Alternatively, the image features may include global features and local features according to application scenes.
In response to the image feature being a global feature, the computer device will extract the feature of the entire image as the image feature of the image to be processed, taking the entire image as the material.
In response to the image features being local features, the computer device will take one or more identified local regions in the whole image as material and extract the corresponding features as the image features of the image to be processed.
Referring to fig. 4, fig. 4 is a schematic diagram of a global feature provided based on the embodiment shown in fig. 3. In fig. 4, each pixel in the image 400 to be processed is used as a material, and after being processed by the computer device, a global feature 420 is extracted, where the global feature 420 is used to indicate a feature of the image 400 to be processed.
Referring to FIG. 5, FIG. 5 is a schematic diagram of local features provided based on the embodiment shown in FIG. 3. In FIG. 5, after the image 400 to be processed is processed by the computer device, 3 candidate boxes appear; the computer device then continues processing and obtains 3 groups of local features from the local images in the 3 candidate boxes, namely the local feature 510, the local feature 520, and the local feature 530. In the embodiment of the present application, the combination of the local features 510, 520, and 530 is referred to as the image features.
Step 320, processing the image features through a graph feature matrix to obtain data to be activated, wherein the graph feature matrix is obtained by processing a knowledge graph through a graph convolutional neural network, and the knowledge graph is used for indicating the attributes of the first labels and the relationship between at least two first labels.
After the computer equipment obtains the image characteristics, the computer equipment obtains the atlas characteristic matrix. It should be noted that the map feature matrix is a specific matrix obtained from the knowledge map. When the knowledge-graph is unchanged or not updated, the graph feature matrix will not change. That is, when the computer device updates an internally stored knowledge graph, the corresponding graph feature matrix is updated. When the knowledge graph stored in the computer device is not changed, the originally stored graph characteristic matrix is not updated.
When the computer device has obtained both the image features and the graph feature matrix, it processes the image features through the graph feature matrix to obtain the data to be activated. It should be noted that the calculation method may be adjusted according to the form of the image features. When the image features are in matrix form, the computer device multiplies the image features by the graph feature matrix and uses the result of the multiplication as the data to be activated.
It should be noted that the knowledge graph applied in the embodiment of the present application is used to indicate both the attribute of the first tag itself and the relationship between at least two first tags.
Step 330, processing the data to be activated through an activation layer in the label classification model to obtain at least two second labels.
In this application, the computer device processes the data to be activated through an activation layer in the label classification model to obtain at least two second labels. The second labels are used to indicate features in the image to be processed; each second label indicates that features matching that label exist in the image. Note that the second labels are labels selected from the first labels.
For example, if the first labels comprise the 9 labels shown in Table 1, the second labels may be the 4 labels shown in Table 2. The second labels belong to the first labels and are the first labels that best match the features of the image to be processed.
Serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Label | Beach | Ocean | Sunrise and sunset | Cat | Dog | Banquet | Shopping | Landscape | Automobile
Table 1
Table 1 shows one possible set of first labels. The labels shown in Table 1 are merely illustrative and do not limit the types of first labels used in the embodiments of this application. In one possible approach, the first labels may also include persons, and a person label may identify a specific person by name or may only represent characteristics such as the person's age, sex, or occupation.
Serial number | 1 | 2 | 3 | 4
Label | Ocean | Beach | Dog | Landscape
Table 2
Table 2 shows the second labels screened from the first labels of Table 1 in this embodiment; there are 4 second labels in total. In other words, for the image to be processed, the computer device concludes that the second labels Ocean, Beach, Dog, and Landscape all conform to the features of the image to be processed.
It should be noted that the image to be processed in this example contains a sea and a dog with obvious features, and also a beach whose features are not obvious. Under the related-art scheme, such an image would most likely be tagged with only the two labels sea and dog. Under the scheme provided by this application, the graph feature matrix is introduced into the decision process. Because this matrix is derived from the knowledge graph, which can indicate the relationship between two first labels, the graph feature matrix actually supplies the strong correlation between sea and beach, as well as the strong correlations between sea and landscape and between beach and landscape. As a result, the method provided by this application can recognize sea, beach, dog, and landscape simultaneously as second labels of the image to be processed.
In a practical implementation process, each first tag has a corresponding threshold, and when the probability value of the first tag obtained by the activation layer is greater than the corresponding threshold, the activation layer determines the first tag as a second tag. Schematically, the data shown in table one and table two are taken as an example for description.
Referring to Table 3, the data shown in Table 3 are, for each first label, the measured probability indicated in the data to be activated and the corresponding preset threshold.
[Table 3: the measured probability and the preset threshold for each of the first labels in Table 1]
In Table 3, the measured probability of each first label is obtained after the image to be processed has been processed by the label classification model; these measured probabilities are contained in the data to be activated that the activation layer processes. The activation layer may store in advance a preset threshold corresponding to each first label. The activation layer obtains the measured probability between the image to be processed and each label, compares it with the corresponding preset threshold, and determines the first labels whose measured probability is higher than the preset threshold as second labels. For example, according to the data shown in Table 3, the first labels corresponding to serial numbers 1, 2, 5, and 8 are determined as the second labels.
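Continuing the example of Table 3, the per-label comparison performed by the activation layer can be sketched as follows; the probability and threshold values below are placeholders, not measured data from this application.

```python
first_labels = ["Beach", "Ocean", "Sunrise and sunset", "Cat", "Dog",
                "Banquet", "Shopping", "Landscape", "Automobile"]

# Placeholder measured probabilities (from the data to be activated) and preset thresholds.
measured = [0.83, 0.91, 0.10, 0.05, 0.76, 0.02, 0.08, 0.64, 0.03]
thresholds = [0.50, 0.50, 0.60, 0.55, 0.50, 0.60, 0.60, 0.55, 0.55]

second_labels = [label for label, p, t in zip(first_labels, measured, thresholds) if p > t]
print(second_labels)  # ['Beach', 'Ocean', 'Dog', 'Landscape'] -> serial numbers 1, 2, 5, 8
```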
Step 340, determining at least two second labels as the labels of the image to be processed, wherein the second labels belong to the first labels.
In the embodiment of the application, after determining at least two second labels, the computer device uses them as the labels of the image to be processed.
In one possible approach, the second labels may be displayed as visual information on the processed image. Referring to FIG. 6, FIG. 6 is a visual interface after image processing provided based on the embodiment shown in FIG. 3. In the user interface 600, the processed image 610 displays three attached second labels below it: the first second label 620 ("beach"), the second second label 630 ("tree"), and the third second label 640 ("sea").
In another possible approach, the second label may not be used as visual information, but as a kind of attribute information of the image. Alternatively, the attribute information may be stored in an attribute frame of the image, or may be additionally stored in a file designated by the computer device. The attribute frame of the image is used as a part of the image, is copied along with the copying of the image, and is lost along with the deletion of the image.
In an actual application scenario, if multiple second labels have been added to multiple images, the computer device may intelligently generate an album from those labels. For example, when "beach", "sea", and "landscape" all appear in several images, those images are intelligently combined into an album named "seaside play". It should be noted that the operation of intelligently generating the album may be performed on the terminal side or on the server side.
When the operation is performed on the server, the terminal can upload the images it has captured to the server through cloud backup or in other forms, and the server then carries out the operation of intelligently generating the album from the multiple images.
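A minimal sketch of this album grouping is given below; the rule that an image must carry all of the listed labels, and the file and album names, are assumptions used only for illustration.

```python
def collect_album(images_with_labels, required_labels, album_name):
    """Group the images whose second labels contain all of the required labels into one album."""
    members = [name for name, labels in images_with_labels.items()
               if required_labels.issubset(labels)]
    return {album_name: members}

library = {
    "IMG_001.jpg": {"beach", "sea", "landscape", "dog"},
    "IMG_002.jpg": {"beach", "sea", "landscape"},
    "IMG_003.jpg": {"cat", "sofa"},
}
print(collect_album(library, {"beach", "sea", "landscape"}, "seaside play"))
# {'seaside play': ['IMG_001.jpg', 'IMG_002.jpg']}
```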
To sum up, the image multi-label classification method shown in this application extracts the image features of the image to be processed, combines them with the graph feature matrix to obtain the data to be activated, obtains at least two second labels from the data to be activated, and uses those second labels as the labels of the image to be processed. The knowledge graph indicates the attributes of the first labels and the relationships between at least two first labels. Because the knowledge graph reflecting the relationships between first labels is introduced into the process of determining second labels for the image to be processed, and the graph feature matrix obtained from the knowledge graph is used to assist that determination, the problem of omitting some second labels during label determination is avoided and the accuracy of determining multiple labels for an image is improved.
Based on the trained label classification model, the embodiment of the application provides a multi-label classification method for images based on the label classification model. Through the label classification model, a more accurate multi-label classification result can be obtained for one image to be processed. For details see the following description.
Referring to fig. 7, fig. 7 is a flowchart of a multi-label classification method for an image according to another exemplary embodiment of the present application. The multi-label classification method of the image can be applied to the terminal or the server. In fig. 7, the multi-label classification method for the image includes:
Step 711, acquiring the image to be processed.
In this embodiment of the application, the method of acquiring the image to be processed may differ depending on the execution subject of the method.
In a possible mode, when the image to be processed is acquired by the server, the server acquires the image to be processed from the data transmitted by the terminal. Optionally, the mode of transmitting data from the terminal to the server may include scenes such as cloud album synchronization, smart album creation, cloud backup, and the like.
In another possible mode, when the image to be processed is acquired by the terminal, the terminal extracts the image to be processed from a locally stored gallery, and the image can be shot by the terminal itself or can be an image shot by other terminals and then sent to the terminal.
In the following steps, taking the application of the method in a terminal as an example, the implementation process of the embodiment shown in fig. 7 is described.
Step 712, input the image to be processed into the convolutional neural network.
The image to be processed can be directly input into the convolutional neural network, and the image is processed through the convolutional neural network.
Step 713, processing the image to be processed through the convolutional neural network to obtain an image feature matrix.
In this example, the convolutional neural network includes a plurality of layer structures, and the image passes through these layer structures in sequence to produce the image feature matrix.
In one possible approach, the label classification model includes an input layer, a convolutional layer, and a pooling layer. The process of processing the image to be processed through the convolutional neural network may include inputting the image to be processed into an input layer, and finally obtaining an image feature matrix through the layer-by-layer processing.
Illustratively, the computer device may input the image to be processed into the input layer, to obtain first intermediate data; inputting the first intermediate data into the convolutional layer to obtain second intermediate data; and inputting the second intermediate data into the pooling layer to obtain an image feature matrix.
After the image to be processed is input to the input layer, the first intermediate data is obtained through the processing of the input layer. Subsequently, the input layer in the neural network is connected to the convolutional layer, and the convolutional layer processes the first intermediate data to obtain second intermediate data. And in the neural network, a pooling layer is connected with the convolutional layer, and the second intermediate data is processed by the pooling layer to obtain an image characteristic matrix.
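One way to realize the input layer, convolutional layer, and pooling layer chain described above is sketched below; the channel count, the global average pooling, and the class name are assumptions chosen for illustration and do not reflect the exact layers disclosed in this application.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Input layer -> convolutional layer -> pooling layer, producing the image feature matrix."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(3, feature_dim, kernel_size=3, padding=1)  # convolutional layer
        self.pool = nn.AdaptiveAvgPool2d(1)                              # pooling layer

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        first_intermediate = image                    # output of the input layer (the normalized image)
        second_intermediate = self.conv(first_intermediate)
        pooled = self.pool(second_intermediate)       # shape (batch, feature_dim, 1, 1)
        return pooled.flatten(1).squeeze(0)           # the length-N image feature vector (N x 1 matrix)

features = FeatureExtractor()(torch.randn(1, 3, 224, 224))
print(features.shape)  # torch.Size([256])
```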
It should be noted that the computer device executes steps 721 to 723 to obtain the graph feature matrix when no graph feature matrix is stored. When the graph feature matrix stored by the computer device already corresponds to the latest version of the knowledge graph, the computer device may directly use the stored graph feature matrix in the process of adding the multiple second labels to the image to be processed, without executing steps 721 to 723.
Step 721, inputting the label relation matrix into the graph convolution neural network, wherein the label relation matrix is used for indicating the relation between at least two first labels.
In this embodiment of the application, two matrices need to be input into the graph convolutional neural network, and the label relation matrix is one of these two inputs. The label relation matrix is used to indicate the relationship between at least two first labels.
Step 722, inputting a node information matrix into the graph convolutional neural network, wherein the node information matrix is used for indicating the attribute of the first label.
Optionally, when inputting the label relation matrix into the graph convolutional neural network, the computer device may also input the node information matrix into the graph convolutional neural network. Illustratively, the node information matrix and the label relation matrix together form the knowledge graph.
Step 723, processing the label relation matrix and the node information matrix through the graph convolutional neural network to obtain the graph feature matrix.
It should be noted that, in response to the update of the knowledge graph for generating the graph feature matrix, the computer device regenerates a new graph feature matrix according to the updated knowledge graph, and stores the new graph feature matrix to process the image features to obtain the data to be activated.
In actual operation, in response to the data in the knowledge graph having been updated, the computer device acquires the updated knowledge graph, processes the updated knowledge graph through the graph convolutional neural network to obtain an updated graph feature matrix, and replaces the graph feature matrix in the label classification model with the updated graph feature matrix.
It should be noted that the update of the knowledge graph may be performed on the server side; after the knowledge graph is updated, the server calculates the updated graph feature matrix and pushes it to the terminal as new information. The terminal then processes the image features according to the new graph feature matrix to obtain the data to be activated.
In one possible implementation, the size of the graph feature matrix is C x N, where C is the number of first labels, N is the feature dimension, and C and N are both positive integers.
Correspondingly, the size of the image feature matrix is N x 1, the size of the graph feature matrix is C x N, and the size of the data matrix to be activated is C x 1.
Based on the graph feature matrix being C x N and the image feature matrix being N x 1, the data matrix to be activated is obtained with size C x 1. In plain terms, each row of the data matrix to be activated corresponds to the data of one first label before activation.
Step 731, multiplying the image feature matrix by the graph feature matrix to obtain the data matrix to be activated.
Step 732, processing the data matrix to be activated through an activation layer in the label classification model to obtain at least two second labels.
In this application, the terminal may perform the step (3a), the step (3b) and the step (3c) instead of the step 732 to obtain at least two second tags.
Step (3a), inputting the data matrix to be activated into the activation layer.
Step (3b), processing the data to be activated through the activation layer to obtain a probability value corresponding to each first label, where the probability value indicates the probability that the first label conforms to the image to be processed.
Step (3c), in response to a probability value being higher than the corresponding first threshold, determining the corresponding first label as a second label, where the first threshold is used to judge whether the first label conforms to the image to be processed.
It should be noted that the first threshold may correspond to the first label one to one. When the number of the first tags used in the tag classification model is i, the number of the first threshold is also i.
In the present embodiment, when the computer device performs the completing step 732, the image to be processed has obtained at least two second tags. Each image to be processed can obtain at least two second labels to which the image belongs through the process. The computer device can achieve the effect of multi-label classification by performing steps 711 to 732 provided by the embodiments of the present application on a plurality of images to be processed that need to be processed. However, in order to provide a more accurate classification effect, the embodiment of the present application may further add an image post-processing procedure, and determine whether to add the specified second label to the image to be processed according to a feature other than the image content of the image to be processed.
Illustratively, in response to the first image and the second image each having acquired their corresponding second labels, the computer device acquires shooting time relationship information between the first image and the second image. The first image and the second image are both images to be processed to which second labels have already been added. See, for example, Table Four, which shows the second labels of the first image and the second image after processing.
Image           Second labels
First image     "sea", "beach", "dog", "landscape"
Second image    "sea", "dog", "landscape"
Table Four
As can be seen from the data shown in Table Four, after being labeled with a plurality of second labels according to the scheme shown in the present application, the second image includes 3 second labels, namely "sea", "dog", and "landscape". After the first image is labeled with a plurality of second labels according to the scheme shown in the present application, the first image includes 4 second labels, namely "sea", "beach", "dog", and "landscape". In this case, the computer device will acquire the shooting time relationship information between the first image and the second image.
The shooting time relation information is used for indicating the time sequence relation of the first image and the second image at the shooting time, or the shooting time relation information is used for indicating the time length between the shooting time of the first image and the shooting time of the second image.
In the first case, the shooting time relationship information indicates a time sequence relationship. The time sequence relationship covers two situations: in the first, the shooting time of the first image is earlier than that of the second image; in the second, the shooting time of the first image is later than that of the second image. It should be noted that, by default, the first image and the second image processed by the present application are images captured by the same terminal. Therefore, in terms of image capturing logic, there is no case in which the shooting time of the first image equals the shooting time of the second image.
Optionally, in this embodiment of the application, the first image and the second image are images captured by the same terminal through the same group of cameras. When the terminal is a smartphone, its cameras usually comprise two groups, namely a front camera group and a rear camera group, and the smartphone selects one group to shoot images. In a less common scenario, the smartphone includes only one group of cameras, but that group can be flipped to face either the front side or the back side. The first image and the second image shown in the embodiment of the present application are two images taken by a camera group facing the same side. The smartphone can determine the orientation of the current camera from its state information.
In the second case, the shooting time relationship information indicates a time length, namely the time length between the shooting time of the first image and the shooting time of the second image. The information may be a fixed numerical value, and its precision may be minutes, seconds, milliseconds, or the like. It should be noted that the precision may differ according to the application scenario. When the embodiment of the application is applied to everyday portrait and landscape shooting, the precision can be at the level of seconds. When the embodiment is applied to shooting objects moving at high speed, such as fast-moving people, vehicles, or particles in microscopic scenes, the precision can be milliseconds. When the embodiment is applied to monitoring a nature reserve, the precision can be minutes. The embodiment of the application only introduces the precision of the shooting time relationship information schematically and does not limit the actual scene.
Combining the two kinds of content indicated by the shooting time relationship information, the computer device can add a target second label to the second labels corresponding to the second image when the shooting time relationship information meets a preset condition, where the target second label is a label corresponding to the first image but not corresponding to the second image. The preset condition may be used to indicate that the shooting time of the first image is close to the shooting time of the second image. That is, when the shooting time relationship information indicates that the shooting times of the first image and the second image are close, the shooting time relationship information meets the preset condition.
The situations in which the shooting time relationship information meets the preset condition are introduced below for specific scenarios. When the shooting time relationship information indicates a time sequence relationship, the computer device adds a label to the second image through steps (4a) and (4b).
(4a) In response to the first image and the second image having acquired their corresponding second labels, acquiring the target duration.
The target duration is the time length between the shooting time of the first image and the shooting time of the second image.
(4b) In response to the target duration being less than the second threshold, adding the target second label to the second labels corresponding to the second image.
In the embodiment of the present application, when the target duration is less than the second threshold, the shooting time of the first image is close to the shooting time of the second image. In this case, the scene in the first image is, with high probability, similar to the scene in the second image. Thus, the computer device may also add to the second image, as a second label, a label that is present in the first image but absent from the second image.
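A sketch of steps (4a) and (4b) in Python follows; the dictionary layout, field names, and the 60-second value of the second threshold are assumptions made only for this example.

```python
from datetime import datetime

def propagate_label_by_duration(first_image: dict, second_image: dict,
                                second_threshold_s: float = 60.0) -> dict:
    """(4a)-(4b): if the target duration is below the second threshold, copy the
    labels present on the first image but missing from the second image."""
    target_duration = abs((first_image["shot_at"] - second_image["shot_at"]).total_seconds())
    if target_duration < second_threshold_s:
        target_second_labels = set(first_image["labels"]) - set(second_image["labels"])
        second_image["labels"].extend(sorted(target_second_labels))
    return second_image

first_image = {"shot_at": datetime(2020, 12, 9, 10, 24, 56),
               "labels": ["sea", "beach", "dog", "landscape"]}
second_image = {"shot_at": datetime(2020, 12, 9, 10, 25, 17),
                "labels": ["sea", "dog", "landscape"]}
print(propagate_label_by_duration(first_image, second_image))   # "beach" is added
```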
When the shooting time relationship information indicates a time length, the computer device adds a label to the second image through steps (5a) and (5b).
(5a) In response to the first image and the second image having acquired their corresponding second labels, the number of first images being 2k, with k first images shot before the second image and k first images shot after the second image, acquiring the shooting times of the first images.
(5b) In response to the length of the interval spanned by the shooting times of the 2k first images being less than a third threshold, adding the target second label to the second labels corresponding to the second image, where k is an integer greater than or equal to 1.
In the embodiment of the present application, the first images are images continuously taken by the terminal before and after the second image, and their number may be selected as 2k in total.
For example, please refer to fig. 8, which is a schematic diagram of image post-processing based on the embodiment shown in fig. 7. In fig. 8, k is 3, and the terminal continuously captures, in order of shooting time from earliest to latest, first images 811, 812, and 813, the second image 820, and first images 814, 815, and 816. Table Five shows the shooting time of each image.
Image          811        812        813        820        814        815        816
Shooting time  10:24:49   10:24:56   10:25:06   10:25:17   10:25:24   10:25:29   10:25:35
Table Five
Of the 7 images shown in Table Five, the 6 first images all have the second label "beach", while the second image 820 does not.
In the first processing stage 8A, the second labels of the second image 820 are "tree" and "sea", while the second labels of the 6 first images are all "tree", "sea", and "beach". In the second processing stage 8B, the computer device determines that 3 first images were captured before the second image and 3 first images were captured after it, and acquires the shooting time of each first image.
In this example, the third threshold is 60 seconds, and the time from the shooting time of first image 811 to the shooting time of the sixth first image 816 is 46 seconds. That is, the length of the interval spanned by the shooting times of the 6 first images is less than the 60-second third threshold, so in the second processing stage 8B the computer device copies the second label "beach", which all 6 first images carry, to the second labels of the second image as the target second label. In both processing stages 8A and 8B, the second labels of the first images are not changed; therefore, the first images are not shown again in fig. 8.
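The example of fig. 8 can be sketched in Python as follows; the data layout and the helper name are hypothetical, while the 60-second third threshold and k = 3 follow the example above.

```python
from datetime import datetime

def propagate_label_from_neighbors(images: list, idx: int, k: int,
                                   third_threshold_s: float = 60.0) -> dict:
    """(5a)-(5b): images is sorted by shooting time, images[idx] is the second image,
    and the k images on each side of it are the 2k first images."""
    neighbors = images[idx - k: idx] + images[idx + 1: idx + 1 + k]
    if idx < k or len(neighbors) != 2 * k:
        return images[idx]                                    # not enough surrounding images
    span = (neighbors[-1]["shot_at"] - neighbors[0]["shot_at"]).total_seconds()
    if span < third_threshold_s:
        shared = set.intersection(*(set(n["labels"]) for n in neighbors))
        target_second_labels = shared - set(images[idx]["labels"])
        images[idx]["labels"].extend(sorted(target_second_labels))
    return images[idx]
```

With the shooting times of Table Five, the interval spans 46 seconds, which is below the 60-second third threshold, so the label "beach" shared by all six first images is copied onto image 820, matching stage 8B above.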
Referring to fig. 9, fig. 9 is a schematic diagram of a process for automatically generating an album according to an embodiment of the present application. In the image acquisition stage 9A of fig. 9, the computer device acquires several images to be processed. If the computer device is a server, the image acquisition stage 9A may consist of receiving photos uploaded by the terminal, for example when the terminal performs cloud backup or album backup on the server. If the computer device is a terminal, the image acquisition stage 9A may be the process of taking pictures, and after the pictures are taken and stored, the computer device has obtained a plurality of images to be processed.
After acquiring the images to be processed, the computer device may, in the multi-label determination stage 9B, add at least two second labels to each image to be processed through the label classification model provided in the present application.
When at least two labels have been added to each of the images to be processed, the computer device can, in the image post-processing stage 9C, divide the images to be processed into first images and a second image and determine, according to whether the shooting time relationship information between the first images and the second image meets the preset condition, whether to supplement the second image with a target second label, which is a label corresponding to a first image but not to the second image.
After the images to be processed have passed through the image post-processing stage 9C, the computer device may generate a designated album containing images to be processed according to a preset strategy. In one possible strategy, the computer device selects m labels, creates a first album from the images to be processed that contain the m labels, and derives the album name from the selected m labels. In another possible strategy, the computer device defines a designated shooting location together with m labels and creates a second album of similar content shot at that location. In yet another possible strategy, the computer device defines a designated shooting time together with m labels and creates a third album of similar content shot at that time. Therefore, according to the scheme provided by the embodiment of the application, multiple labels can be added to the images to be processed with high accuracy, and the corresponding albums can then be generated intelligently, which improves the efficiency and accuracy of automatic album generation and reduces the chance that an image which actually meets an album's criteria is omitted when that album is generated. A sketch of the first strategy is given below.
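The following Python sketch covers the first strategy only; the file names, field names, and the rule for joining the selected labels into an album name are assumptions made for illustration.

```python
def build_album(images: list, selected_labels: list) -> dict:
    """First strategy: keep the images that carry all m selected labels and
    derive the album name from those labels."""
    members = [image for image in images if set(selected_labels).issubset(image["labels"])]
    return {"name": " & ".join(selected_labels), "images": members}

# Hypothetical usage on already-labelled photos
photos = [
    {"path": "IMG_0811.jpg", "labels": ["tree", "sea", "beach"]},
    {"path": "IMG_0820.jpg", "labels": ["tree", "sea", "beach"]},
    {"path": "IMG_0900.jpg", "labels": ["city", "night"]},
]
print(build_album(photos, ["sea", "beach"]))   # two of the three photos are selected
```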
In summary, the label classification model used in this embodiment includes a convolutional neural network structure, where the convolutional neural network is used to extract the image content of the image to be processed. After the convolutional neural network extracts an image feature matrix, the image feature matrix can be processed with a graph feature matrix derived from the knowledge graph to obtain the data to be activated, and after the data to be activated is processed by the activation layer, at least two second labels can be obtained, thereby achieving the effect of marking multiple labels on the image to be processed.
Optionally, the embodiment of the application can also introduce a graph convolutional neural network to process the knowledge graph, so as to obtain the graph feature matrix used for processing the image feature matrix. When multiple second labels are marked on the image to be processed, the data entering the activation layer is thus adjusted by the correlation relationships between the first label nodes in the knowledge graph, which avoids omitting inconspicuous labels in the image to be processed and improves the accuracy of marking multiple second labels on the image to be processed.
Optionally, in this embodiment, after an image to be processed has been labeled with at least two second labels, a data post-processing stage further detects whether there is a second label that has not yet been marked on the image. In the post-processing stage, the computer device detects whether an image adjacent to the image to be processed carries a second label that is missing from the image to be processed, and if the adjacent image is close to the image to be processed in shooting time, that missing second label is marked on the image to be processed, thereby improving the labeling accuracy of the second labels.
Optionally, when a second label exists in all of the k images before and the k images after the image to be processed, and the shooting times of those 2k images fall within a specified time length range, that second label is marked on the image to be processed, further improving the labeling accuracy of the second labels.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 10, fig. 10 is a block diagram illustrating the structure of an apparatus for multi-label classification of images according to an exemplary embodiment of the present application. The image multi-label classification apparatus can be implemented as all or part of the terminal by software, hardware, or a combination of the two. The apparatus includes a feature extraction module 1010, a first obtaining module 1020, a label obtaining module 1030, and a label determining module 1040. The specific functions of the modules are described below.
The feature extraction module 1010 is configured to extract image features of an image to be processed through a feature extraction layer in a tag classification model, where the tag classification model is a neural network model configured to add at least two tags to the image to be processed.
A first obtaining module 1020, configured to process the image features through a map feature matrix to obtain data to be activated, where the map feature matrix is obtained by processing a knowledge graph through a graph convolutional neural network, and the knowledge graph is used to indicate the attributes of the first labels themselves and the relationship between at least two first labels.
A tag obtaining module 1030, configured to process the data to be activated through an activation layer in the tag classification model to obtain at least two second tags.
A label determining module 1040, configured to determine at least two second labels as labels of the to-be-processed image, where the second labels belong to the first labels.
In an optional embodiment, the first obtaining module 1020 is configured to multiply the image feature matrix and the atlas feature matrix to obtain a data matrix to be activated. The tag obtaining module 1030 is configured to process the data matrix to be activated through the activation layer in the tag classification model to obtain at least two second tags.
In an optional embodiment, the knowledge graph involved in the apparatus comprises a label relation matrix and a node information matrix, and the apparatus further comprises a first input module, a second input module, and a second obtaining module. The first input module is configured to input the label relation matrix into the graph convolutional neural network, where the label relation matrix indicates the relationship between at least two first labels; the second input module is configured to input the node information matrix into the graph convolutional neural network, where the node information matrix indicates the attributes of the first labels themselves; and the second obtaining module is configured to process the label relation matrix and the node information matrix through the graph convolutional neural network to obtain the graph feature matrix.
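Purely for illustration, a minimal graph convolution over the label relation matrix A (C × C) and the node information matrix X (C × D0) could take the standard normalised-propagation form sketched below; the two-layer structure, the ReLU, and the weight shapes are assumptions, not the exact network disclosed by the application.

```python
import numpy as np

def gcn_graph_feature_matrix(label_relation: np.ndarray, node_info: np.ndarray,
                             w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Two-layer sketch: A is (C, C), X is (C, D0), w1 is (D0, D1), w2 is (D1, N);
    the result is the (C, N) graph feature matrix."""
    a_hat = label_relation + np.eye(label_relation.shape[0])      # add self-connections
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt                      # symmetric normalisation
    hidden = np.maximum(a_norm @ node_info @ w1, 0.0)             # ReLU
    return a_norm @ hidden @ w2                                   # shape (C, N)
```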
In an optional embodiment, the apparatus further comprises a third obtaining module, a fourth obtaining module, and a matrix updating module. The third obtaining module is configured to acquire the updated knowledge graph in response to the data in the knowledge graph having been updated; the fourth obtaining module is configured to process the updated knowledge graph through the graph convolutional neural network to obtain the updated graph feature matrix; and the matrix updating module is configured to update the graph feature matrix in the label classification model with the updated graph feature matrix.
In an optional embodiment, the graph feature matrix involved in the apparatus has a size of C × N, where C is the number of first labels, N is the feature dimension, and C and N are both positive integers.
In an optional embodiment, the image feature matrix involved in the apparatus has a size of N × 1, the graph feature matrix has a size of C × N, and the data matrix to be activated has a size of C × 1, where C is the number of first labels, N is the feature dimension, and C and N are both positive integers.
In an optional embodiment, the to-be-processed image related to the apparatus includes a first image and a second image, and the apparatus further includes a post-processing module, configured to, in response to that the first image and the second image have acquired the second tags corresponding to each other, acquire shooting time relationship information between the first image and the second image, where the shooting time relationship information is used to indicate a time sequence relationship of the first image and the second image at a shooting time, or the shooting time relationship information is used to indicate a time length between a shooting time of the first image and a shooting time of the second image; and in response to that the shooting time relation information meets a preset condition, adding a target second label to the second label corresponding to the second image, wherein the target second label is the second label corresponding to the first image and not corresponding to the second image.
In an optional embodiment, the post-processing module is configured to add the target second tag to the second tag corresponding to the second image in response to a target duration being less than a second threshold, where the target duration is a duration between a shooting time of the first image and a shooting time of the second image.
In an optional embodiment, the post-processing module is configured to acquire the shooting times of the first images in response to the number of first images being 2k, with k of the 2k first images captured before the second image and k captured after it; and, in response to the length of the interval spanned by the shooting times of the 2k first images being less than a third threshold, add a target second label to the second labels corresponding to the second image, where the target second label is a label corresponding to all 2k first images but not corresponding to the second image, and k is an integer greater than or equal to 1.
In summary, the label classification model used in this embodiment includes a convolutional neural network structure, where the convolutional neural network is used to extract the image content of the image to be processed. After the convolutional neural network extracts an image feature matrix, the image feature matrix can be processed with a graph feature matrix derived from the knowledge graph to obtain the data to be activated, and after the data to be activated is processed by the activation layer, at least two second labels can be obtained, thereby achieving the effect of marking multiple labels on the image to be processed.
Optionally, the embodiment of the application can also introduce a graph convolutional neural network to process the knowledge graph, so as to obtain the graph feature matrix used for processing the image feature matrix. When multiple second labels are marked on the image to be processed, the data entering the activation layer is thus adjusted by the correlation relationships between the first label nodes in the knowledge graph, which avoids omitting inconspicuous labels in the image to be processed and improves the accuracy of marking multiple second labels on the image to be processed.
Optionally, in this embodiment, after an image to be processed has been labeled with at least two second labels, a data post-processing stage further detects whether there is a second label that has not yet been marked on the image. In the post-processing stage, the computer device detects whether an image adjacent to the image to be processed carries a second label that is missing from the image to be processed, and if the adjacent image is close to the image to be processed in shooting time, that missing second label is marked on the image to be processed, thereby improving the labeling accuracy of the second labels.
Optionally, when a second label exists in all of the k images before and the k images after the image to be processed, and the shooting times of those 2k images fall within a specified time length range, that second label is marked on the image to be processed, further improving the labeling accuracy of the second labels.
For example, the multi-label classification method for images shown in the embodiment of the present application may be applied to a computer device, where the computer device may be a terminal, and the terminal has a display screen and a multi-label classification function for images. The terminal may include a mobile phone, a tablet computer, a laptop computer, a desktop computer, a computer all-in-one machine, a server, a workstation, a television, a set-top box, smart glasses, a smart watch, a digital camera, an MP4 player terminal, an MP5 player terminal, a learning machine, a point-and-read machine, an electronic book, an electronic dictionary, a vehicle-mounted terminal, a Virtual Reality (VR) player terminal, an Augmented Reality (AR) player terminal, or the like.
Referring to fig. 11, fig. 11 is a block diagram of a terminal according to an exemplary embodiment of the present application. As shown in fig. 11, the terminal includes a processor 1120 and a memory 1140, where the memory 1140 stores at least one instruction that is loaded and executed by the processor 1120 to implement the image multi-label classification method according to the method embodiments of the present application.
The processor 1120 may include one or more processing cores. The processor 1120 is coupled to the various parts of the terminal using various interfaces and lines, and performs the functions of the terminal and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 1140 and by invoking the data stored in the memory 1140. Optionally, the processor 1120 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1120 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and the like; the GPU is responsible for rendering and drawing the content to be displayed on the display screen; and the modem handles wireless communications. It is to be understood that the modem may instead not be integrated into the processor 1120 and may be implemented by a separate chip.
The memory 1140 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1140 includes a non-transitory computer-readable storage medium. The memory 1140 may be used to store instructions, programs, code sets, or instruction sets. The memory 1140 may include a stored program area and a stored data area, where the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the method embodiments described above, and the like; the stored data area may store the data and the like referred to in the method embodiments above.
In the embodiment of the present application, the computer device may also be a server, and the structure of the server may refer to the structure shown in fig. 12.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application. The server is configured to implement the image multi-label classification method provided in the foregoing embodiments. Specifically:
the server 1200 includes a Central Processing Unit (CPU) 1201, a system memory 1204 including a Random Access Memory (RAM) 1202 and a Read-Only Memory (ROM) 1203, and a system bus 1205 connecting the system memory 1204 and the central processing unit 1201. The server 1200 also includes a basic Input/Output (I/O) system 1206, which facilitates the transfer of information between devices within the computer, and a mass storage device 1207 for storing an operating system 1213, application programs 1214, and other program modules 1215.
The basic input/output system 1206 includes a display 1208 for displaying information and an input device 1209, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1208 and input device 1209 are connected to the central processing unit 1201 through an input-output controller 1210 coupled to the system bus 1205. The basic input/output system 1206 may also include an input/output controller 1210 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1210 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the server 1200. That is, the mass storage device 1207 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM (Electrically Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1204 and mass storage device 1207 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1200 may also operate through a remote computer connected to a network such as the Internet. That is, the server 1200 may be connected to the network 1212 through a network interface unit 1211 coupled to the system bus 1205, or the network interface unit 1211 may be used to connect to other types of networks or remote computer systems.
The embodiment of the present application further provides a computer-readable medium, which stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the multi-label classification method for images according to the above embodiments.
It should be noted that: in the multi-label classification apparatus for images provided in the above embodiments, when the multi-label classification method for images is executed, only the division of the above functional modules is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the multi-label classification device for images and the multi-label classification method for images provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the implementation of the present application and is not intended to limit the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method for multi-label classification of an image, the method comprising:
extracting image features of an image to be processed through a feature extraction layer in a label classification model, wherein the label classification model is a neural network model used for adding at least two labels to the image to be processed;
processing the image features through a map feature matrix to obtain data to be activated, wherein the map feature matrix is obtained by processing a knowledge map through a graph convolution neural network, and the knowledge map is used for indicating the attributes of the first labels and the relationship between at least two first labels;
processing the data to be activated through an activation layer in the label classification model to obtain at least two second labels;
determining at least two second labels as labels of the image to be processed, wherein the second labels belong to the first labels.
2. The method according to claim 1, wherein the processing the image features through a map feature matrix to obtain data to be activated comprises:
multiplying the image feature matrix and the map feature matrix to obtain a data matrix to be activated;
processing the data to be activated through an activation layer in the label classification model to obtain at least two second labels, including:
and processing the data matrix to be activated through the activation layer in the label classification model to obtain at least two second labels.
3. The method of claim 2, wherein the knowledge-graph comprises a label relationship matrix and a node information matrix, the method further comprising:
inputting the label relationship matrix into the graph convolution neural network, the label relationship matrix being used to indicate a relationship between at least two of the first labels;
inputting the node information matrix into the graph convolution neural network, wherein the node information matrix is used for indicating the attribute of the first label;
and processing the label relation matrix and the node information matrix through the graph convolution neural network to obtain the map feature matrix.
4. The method of claim 3, further comprising:
in response to the data in the knowledge graph having been updated, acquiring the updated knowledge graph;
processing the updated knowledge graph through the graph convolution neural network to obtain an updated graph characteristic matrix;
and updating the map feature matrix in the label classification model through the updated map feature matrix.
5. The method of claim 3, wherein the map feature matrix is of size C × N, wherein C is the number of the first labels, N is the feature dimension, and both C and N are positive integers.
6. The method of claim 2, wherein the image feature matrix is of size N × 1, the map feature matrix is of size C × N, the data matrix to be activated is of size C × 1, C is the number of the first labels, N is the feature dimension, and C and N are both positive integers.
7. The method of claim 1, wherein the image to be processed comprises a first image and a second image, the method further comprising:
acquiring shooting time relation information between the first image and the second image in response to the first image and the second image acquiring the corresponding second labels respectively, wherein the shooting time relation information is used for indicating the time sequence relation of the first image and the second image at the shooting time, or the shooting time relation information is used for indicating the time length between the shooting time of the first image and the shooting time of the second image;
and in response to that the shooting time relation information meets a preset condition, adding a target second label to the second label corresponding to the second image, wherein the target second label is the second label corresponding to the first image and not corresponding to the second image.
8. The method according to claim 7, wherein the adding a target second label to the second label corresponding to the second image in response to the shooting time relationship information meeting a preset condition includes:
and in response to the target duration being less than a second threshold, adding the target second tag to the second tag corresponding to the second image, wherein the target duration is a duration between the shooting time of the first image and the shooting time of the second image.
9. The method according to claim 7, wherein the adding a target second label to the second label corresponding to the second image in response to the shooting time relationship information meeting a preset condition includes:
in response to the number of the first images being 2k, with k of the 2k first images taken before the second image and k taken after the second image, acquiring the shooting times of the first images;
and in response to the length of the interval spanned by the shooting times of the 2k first images being less than a third threshold, adding a target second label to the second label corresponding to the second image, wherein the target second label is a label corresponding to all 2k first images and not corresponding to the second image, and k is an integer greater than or equal to 1.
10. An apparatus for multi-label classification of images, the apparatus comprising:
the system comprises a feature extraction module, a label classification module and a processing module, wherein the feature extraction module is used for extracting image features of an image to be processed through a feature extraction layer in a label classification model, and the label classification model is a neural network model used for adding at least two labels to the image to be processed;
the first acquisition module is used for processing the image features through a map feature matrix to obtain data to be activated, wherein the map feature matrix is obtained by processing a knowledge graph through a graph convolutional neural network, and the knowledge graph is used for indicating the attributes of the first labels and the relationship between at least two first labels;
the label obtaining module is used for processing the data to be activated through an activation layer in the label classification model to obtain at least two second labels;
and the label determining module is used for determining at least two second labels as the labels of the image to be processed, wherein the second labels belong to the first labels.
11. A computer device, characterized in that the computer device comprises a processor, a memory connected to the processor, and program instructions stored on the memory, which, when executed by the processor, implement the method of multi-label classification of images according to any one of claims 1 to 9.
12. A computer-readable storage medium having program instructions stored thereon, which, when executed by a processor, implement the method of multi-label classification of images according to any one of claims 1 to 9.
CN202011451978.1A 2020-12-09 2020-12-09 Image multi-label classification method and device, computer equipment and storage medium Pending CN112487207A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011451978.1A CN112487207A (en) 2020-12-09 2020-12-09 Image multi-label classification method and device, computer equipment and storage medium
PCT/CN2021/122741 WO2022121485A1 (en) 2020-12-09 2021-10-09 Image multi-tag classification method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011451978.1A CN112487207A (en) 2020-12-09 2020-12-09 Image multi-label classification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112487207A true CN112487207A (en) 2021-03-12

Family

ID=74941444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011451978.1A Pending CN112487207A (en) 2020-12-09 2020-12-09 Image multi-label classification method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112487207A (en)
WO (1) WO2022121485A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883731A (en) * 2021-04-29 2021-06-01 腾讯科技(深圳)有限公司 Content classification method and device
CN114312236A (en) * 2021-12-29 2022-04-12 上海瑾盛通信科技有限公司 Motion sickness relief method and related products
WO2022121485A1 (en) * 2020-12-09 2022-06-16 Oppo广东移动通信有限公司 Image multi-tag classification method and apparatus, computer device, and storage medium
CN114707004A (en) * 2022-05-24 2022-07-05 国网浙江省电力有限公司信息通信分公司 Method and system for extracting and processing case-affair relation based on image model and language model
CN116842479A (en) * 2023-08-29 2023-10-03 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN117392470A (en) * 2023-12-11 2024-01-12 安徽中医药大学 Fundus image multi-label classification model generation method and system based on knowledge graph

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332282B (en) * 2023-11-29 2024-03-08 之江实验室 Knowledge graph-based event matching method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
US20190252074A1 (en) * 2016-10-25 2019-08-15 Koninklijke Philips N.V. Knowledge graph-based clinical diagnosis assistant
CN111291643A (en) * 2020-01-20 2020-06-16 北京百度网讯科技有限公司 Video multi-label classification method and device, electronic equipment and storage medium
CN111476315A (en) * 2020-04-27 2020-07-31 中国科学院合肥物质科学研究院 Image multi-label identification method based on statistical correlation and graph convolution technology

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108776808A (en) * 2018-05-25 2018-11-09 北京百度网讯科技有限公司 A kind of method and apparatus for detecting ladle corrosion defect
CN109816009B (en) * 2019-01-18 2021-08-10 南京旷云科技有限公司 Multi-label image classification method, device and equipment based on graph convolution
CN110807495B (en) * 2019-11-08 2023-09-12 腾讯科技(深圳)有限公司 Multi-label classification method, device, electronic equipment and storage medium
CN112487207A (en) * 2020-12-09 2021-03-12 Oppo广东移动通信有限公司 Image multi-label classification method and device, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190252074A1 (en) * 2016-10-25 2019-08-15 Koninklijke Philips N.V. Knowledge graph-based clinical diagnosis assistant
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN111291643A (en) * 2020-01-20 2020-06-16 北京百度网讯科技有限公司 Video multi-label classification method and device, electronic equipment and storage medium
CN111476315A (en) * 2020-04-27 2020-07-31 中国科学院合肥物质科学研究院 Image multi-label identification method based on statistical correlation and graph convolution technology

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121485A1 (en) * 2020-12-09 2022-06-16 Oppo广东移动通信有限公司 Image multi-tag classification method and apparatus, computer device, and storage medium
CN112883731A (en) * 2021-04-29 2021-06-01 腾讯科技(深圳)有限公司 Content classification method and device
CN112883731B (en) * 2021-04-29 2021-08-20 腾讯科技(深圳)有限公司 Content classification method and device
CN114312236A (en) * 2021-12-29 2022-04-12 上海瑾盛通信科技有限公司 Motion sickness relief method and related products
CN114312236B (en) * 2021-12-29 2024-02-09 上海瑾盛通信科技有限公司 Motion sickness relieving method and related products
CN114707004A (en) * 2022-05-24 2022-07-05 国网浙江省电力有限公司信息通信分公司 Method and system for extracting and processing case-affair relation based on image model and language model
CN114707004B (en) * 2022-05-24 2022-08-16 国网浙江省电力有限公司信息通信分公司 Method and system for extracting and processing case-affair relation based on image model and language model
CN116842479A (en) * 2023-08-29 2023-10-03 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN116842479B (en) * 2023-08-29 2023-12-12 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN117392470A (en) * 2023-12-11 2024-01-12 安徽中医药大学 Fundus image multi-label classification model generation method and system based on knowledge graph
CN117392470B (en) * 2023-12-11 2024-03-01 安徽中医药大学 Fundus image multi-label classification model generation method and system based on knowledge graph

Also Published As

Publication number Publication date
WO2022121485A1 (en) 2022-06-16

Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN111291809B (en) Processing device, method and storage medium
CN109117781B (en) Multi-attribute identification model establishing method and device and multi-attribute identification method
CN110276406B (en) Expression classification method, apparatus, computer device and storage medium
US11983903B2 (en) Processing images using self-attention based neural networks
CN110222718B (en) Image processing method and device
CN111783713B (en) Weak supervision time sequence behavior positioning method and device based on relation prototype network
CN113095346A (en) Data labeling method and data labeling device
CN111783712A (en) Video processing method, device, equipment and medium
CN113761359B (en) Data packet recommendation method, device, electronic equipment and storage medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN112069338A (en) Picture processing method and device, electronic equipment and storage medium
CN113590854A (en) Data processing method, data processing equipment and computer readable storage medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
US20220284236A1 (en) Blur classification and blur map estimation
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
CN111709945B (en) Video copy detection method based on depth local features
CN113837062A (en) Classification method and device, storage medium and electronic equipment
CN112669270A (en) Video quality prediction method and device and server
EP3401843A1 (en) A method, an apparatus and a computer program product for modifying media content
CN116842479B (en) Image processing method, device, computer equipment and storage medium
CN113743050B (en) Article layout evaluation method, apparatus, electronic device and storage medium
US11475668B2 (en) System and method for automatic video categorization
WO2022141092A1 (en) Model generation method and apparatus, image processing method and apparatus, and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination