WO2022121485A1

WO2022121485A1 - Image multi-tag classification method and apparatus, computer device, and storage medium

Info

Publication number: WO2022121485A1
Application number: PCT/CN2021/122741
Authority: WO
Inventors: 罗彤; 郭彦东; 李亚乾; 杨林
Original assignee: Oppo广东移动通信有限公司; 上海瑾盛通信科技有限公司
Priority date: 2020-12-09
Filing date: 2021-10-09
Publication date: 2022-06-16
Also published as: CN112487207A

Abstract

The embodiments of the present application disclose an image multi-tag classification method and apparatus, a computer device, and a storage medium, belonging to the technical field of image processing. The present application provides an image multi-tag classification method, and can, upon acquisition of an image to be processed, input said image to a tag classification model, so as to obtain image features corresponding to said image, can obtain a graph feature matrix according to a knowledge graph, and combine the image features and the graph feature matrix to obtain data to be activated, and then can obtain, according to said data, at least two second tags corresponding to said image. The knowledge graph is used to indicate the relationship between tags and the attributes of the tags themselves. When adding multiple tags to said image, the tag classification model uses information provided by the knowledge graph, and thus the present application improves the reliability of the multiple tags obtained for said image, and reduces the complexity of acquisition of the multiple tags.

Description

Image multi-label classification method, device, computer equipment and storage medium

This application claims the priority of the Chinese patent application with the application number 202011451978.1 and the invention titled "Multi-label classification method, apparatus, computer equipment and storage medium for images" filed on December 9, 2020, the entire contents of which are incorporated by reference in this application.

technical field

The embodiments of the present application relate to the technical field of image processing, and in particular, to a method, apparatus, computer equipment, and storage medium for multi-label classification of images.

Background technique

With the rapid development of artificial intelligence technology, the ability of terminals to intelligently classify images in albums is getting stronger and stronger.

In the related art, the machine learning model provided by the artificial intelligence technology can intelligently determine the type of the object contained in the image, so as to label the image accordingly. However, when faced with a scene where multiple labels need to be added to a single image, the existing model will have problems such as increased error rate or slow operation speed.

SUMMARY OF THE INVENTION

Embodiments of the present application provide a multi-label classification method, apparatus, computer device, and storage medium for images. The technical solution is as follows:

According to an aspect of the present application, there is provided a multi-label classification method for images, the method comprising:

Extract the image features of the image to be processed through the feature extraction layer in the label classification model, where the label classification model is a neural network model for adding at least two labels to the image to be processed;

The image features are processed by a graph feature matrix to obtain the data to be activated. The graph feature matrix is a matrix obtained after a knowledge graph is processed by a graph convolutional neural network, and the knowledge graph is used to indicate the attributes of the first label itself, and , the relationship between at least two of the first tags;

The data to be activated is processed by the activation layer in the label classification model to obtain at least two second labels;

At least two of the second tags are determined as tags of the image to be processed, and the second tags belong to the first tags.

According to another aspect of the present application, there is provided a multi-label classification device for images, the device comprising:

a first acquisition module, used for acquiring the image to be processed;

a feature extraction module for extracting image features of the to-be-processed image;

The second obtaining module is configured to obtain the data to be activated according to the image feature and the graph feature matrix, the graph feature matrix is a matrix obtained after the knowledge graph is processed by the graph convolutional neural network, and the knowledge graph is used to indicate the first An attribute of a tag itself, and a relationship between at least two of the first tags;

A label determination module, configured to obtain at least two second labels according to the data to be activated, and determine the at least two second labels as labels of the to-be-processed image, where the second labels belong to the first label Label.

According to another aspect of the present application, a terminal is provided, the terminal includes a processor and a memory, the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the method as described in the present application A multi-label classification method for images provided by various aspects.

According to another aspect of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, the instruction being loaded and executed by a processor to implement image processing as provided by various aspects of the present application Multi-label classification methods.

According to one aspect of the present application, there is provided a computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods provided in the various optional implementations of the above-described aspect of multi-label classification of images.

Description of drawings

In order to introduce the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments of the present application. Obviously, the drawings in the following description are only some embodiments of the present application. , for those of ordinary skill in the art, other drawings can also be obtained based on these drawings without any creative effort.

1 is an architecture diagram of a label classification model provided by an embodiment of the present application;

Fig. 2 is an architecture diagram of a label classification model provided based on the embodiment shown in Fig. 1;

3 is a flowchart of a multi-label classification method for images provided by an embodiment of the present application;

4 is a schematic diagram of a global feature provided based on the embodiment shown in FIG. 3;

Fig. 5 is a schematic diagram of another local feature provided based on the embodiment shown in Fig. 3;

Fig. 6 is a kind of visual interface after image processing provided based on the embodiment shown in Fig. 3;

7 is a flowchart of a multi-label classification method for images provided by another exemplary embodiment of the present application;

FIG. 8 is a schematic diagram of an image post-processing provided based on the embodiment shown in FIG. 7;

9 is a schematic diagram of a process for automatically generating an album provided by an embodiment of the present application;

10 is a structural block diagram of an apparatus for multi-label classification of images provided by an exemplary embodiment of the present application;

11 is a structural block diagram of a terminal provided by an exemplary embodiment of the present application;

FIG. 12 is a schematic structural diagram of a server provided by an embodiment of the present application.

Detailed ways

In order to make the objectives, technical solutions and advantages of the present application clearer, the embodiments of the present application will be further described in detail below with reference to the accompanying drawings.

Where the following description refers to the drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the illustrative examples below are not intended to represent all implementations consistent with this application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as recited in the appended claims.

In the description of the present application, it should be understood that the terms "first", "second" and the like are used for descriptive purposes only, and should not be construed as indicating or implying relative importance. In the description of this application, it should be noted that, unless otherwise expressly specified and limited, the terms "connected" and "connected" should be understood in a broad sense, for example, it may be a fixed connection, a detachable connection, or an integrated connection. Ground connection; it can be a mechanical connection or an electrical connection; it can be directly connected or indirectly connected through an intermediate medium. For those of ordinary skill in the art, the specific meanings of the above terms in this application can be understood in specific situations. Also, in the description of the present application, unless otherwise specified, "a plurality" means two or more. "And/or", which describes the association relationship of the associated objects, means that there can be three kinds of relationships, for example, A and/or B, which can mean that A exists alone, A and B exist at the same time, and B exists alone. The character "/" generally indicates that the associated objects are an "or" relationship.

The present application provides an image multi-label classification method, wherein the method includes: extracting image features of an image to be processed through a feature extraction layer in a label classification model, where the label classification model is used for the to-be-processed image. A neural network model that adds at least two labels to an image; the image features are processed through a graph feature matrix to obtain data to be activated, and the graph feature matrix is a matrix obtained by processing a knowledge graph through a graph convolutional neural network, and the knowledge graph used to indicate the attributes of the first label itself, and the relationship between at least two of the first labels; the data to be activated is processed by the activation layer in the label classification model to obtain at least two second labels; At least two of the second tags are determined as tags of the image to be processed, and the second tags belong to the first tags.

Optionally, the processing of the image features by the atlas feature matrix to obtain the data to be activated includes: multiplying the image feature matrix and the atlas feature matrix to obtain the data matrix to be activated; The activation layer in the classification model processes the to-be-activated data to obtain at least two second labels, including: processing the to-be-activated data matrix through the activation layer in the label classification model to obtain at least two of the second labels Two labels.

Optionally, the knowledge graph includes a label relationship matrix and a node information matrix, and the method further includes: inputting the label relationship matrix into the graph convolutional neural network, where the label relationship matrix is used to indicate at least two the relationship between the first labels; input the node information matrix into the graph convolutional neural network, the node information matrix is used to indicate the attributes of the first label itself; process through the graph convolutional neural network The label relationship matrix and the node information matrix are used to obtain the graph feature matrix.

Optionally, the method further includes: completing the update in response to the data in the knowledge graph, and obtaining an updated knowledge graph; processing the updated knowledge graph through the graph convolutional neural network, and obtaining an updated knowledge graph. The atlas feature matrix; the atlas feature matrix in the label classification model is updated through the updated atlas feature matrix.

Optionally, the scale of the graph feature matrix is C*N, where C is the number of the first labels, N is the feature dimension, and both C and N are positive integers.

Optionally, the scale of the image feature matrix is N*1, the scale of the graph feature matrix is C*N, the scale of the data matrix to be activated is C*1, and C is the number of the first label. number, N is the feature dimension, and C and N are both positive integers.

Optionally, the to-be-processed image includes a first image and a second image, and the method further includes: acquiring the corresponding second label in response to the first image and the second image having acquired the corresponding second label. shooting time relationship information between the first image and the second image, the shooting time relationship information is used to indicate the time sequence relationship between the first image and the second image at the shooting time, or the The shooting time relationship information is used to indicate the time period between the shooting time of the first image and the shooting time of the second image; in response to the shooting time relationship information meeting the preset condition, the target second tag is added to all the second label corresponding to the second image, the target second label is the second label corresponding to the first image and not corresponding to the second image.

Optionally, the adding the target second label to the second label corresponding to the second image in response to the photographing moment relationship information meeting a preset condition includes: in response to the target duration being less than a second threshold, The target second label is added as the second label corresponding to the second image, and the target duration is the duration between the shooting time of the first image and the shooting time of the second image.

Optionally, the adding the target second label to the second label corresponding to the second image in response to the photographing moment relationship information meeting a preset condition includes: responding to the number of the first images is 2k, among the 2k first images, k first images are images taken before the second image, k first images are images taken after the second image, Acquiring the shooting moment of the first image; in response to the length of the interval where the shooting moments of the 2k first images are located is less than a third threshold, adding a target second label to the second label corresponding to the second image , the target second label is the label corresponding to the 2k first images and the target second label is the second label not corresponding to the second image, and k is an integer greater than or equal to 1.

The present application provides a multi-label classification method for images, which can input the to-be-processed image into a label classification model after acquiring the to-be-processed image, obtain image features corresponding to the image, obtain a map feature matrix according to a knowledge map, and combine image features and map features The matrix obtains the data to be activated, and then at least two second labels corresponding to the image to be processed are obtained according to the data to be activated. Among them, the knowledge graph is used to indicate the relationship between tags and the attributes of the tags themselves. Since the label classification model uses the information provided by the knowledge graph when adding multiple labels to the image to be processed, the present application improves the reliability of the multiple labels obtained from the image to be processed, while reducing the complexity of acquiring multiple labels .

In order to facilitate the understanding of the solutions shown in the embodiments of the present application, several terms appearing in the embodiments of the present application are introduced below.

Pending Image: The image used to be tagged. In a possible manner, the image to be processed is an image captured by the terminal. In another possible manner, the image to be processed is an image captured by other computer equipment, and the terminal adds a tag to the image. In another possible manner, the image to be processed may also be a virtual image generated by other computer equipment according to a specified algorithm or other image tools.

Optionally, when the image to be processed is a real image captured by a device, the present application may divide the acquisition manner of the image to be processed into two approaches. Illustratively, the first approach is to use the device of the multi-label classification method for images provided by the present application, and to shoot through the image acquisition component equipped by the device itself. The second way is to obtain the device other than the device applying the multi-label classification method of images provided by the present application, obtain it by means such as image transmission, and then transmit it to the device for multi-label classification.

For example, when the embodiments of the present application are applied to a terminal such as a mobile phone, the image to be processed may be an image captured by the terminal through its own camera. When the embodiment of the present application is applied to a device such as a server, the image to be processed may be an image acquired by the server through a network and transmitted by the terminal, and the image is still an image captured by the terminal through a camera.

Neural network model: It is a complex network system formed by interconnecting a large number of simple processing units. Among them, the processing unit may also be referred to as a neuron. The neural network model can reflect many basic features of human brain function, and is essentially a nonlinear dynamic learning system. In this application, a neural network model is a mathematical model to which a neural network structure is applied. In a possible implementation manner, a part of the neural network model adopts a neural network structure, and the other part may adopt other data structures, and the above parts cooperate with each other to process the data and obtain the result desired by the designer. For example, the label classification model used in this application can add at least two second labels to the image to be processed, so as to realize the multi-label classification capability of the image.

Convolutional Neural Networks (CNN): It is a kind of Feedforward Neural Networks (Feedforward Neural Networks) that includes convolutional computation and has a deep structure. It is one of the most widely used algorithms in deep learning. .

Illustratively, the usual structure of CNN includes input layer, hidden layer and output layer.

First, the input layer is used to receive data that is fed into the CNN. Generally speaking, the input layer can process multi-dimensional data. When the input data is an image, the three-dimensional input data received by the input layer is used to indicate the coordinates of the pixels and the RGB (Red Green Blue, red, green and blue) channels. . Optionally, before the pixel data enters the input layer, normalization can be performed to normalize the value of the RGB channel of the pixel from [0, 255] to [0, 1], so as to improve the learning efficiency and reasoning of the CNN ability.

Secondly, the hidden layer can include three common structures: convolutional layer, pooling layer and fully connected layer.

(1) Convolutional layer, whose function is to extract the features of the input data.

For the introduction of the convolutional layer, it can be carried out from three perspectives: the convolutional kernel, the parameters of the convolutional layer and the excitation function.

A. Convolution kernel. The convolutional layer contains multiple convolution kernels. For a convolution kernel, it includes several elements, and each element corresponds to a weight coefficient and a bias vector. Among them, the elements are similar to neurons in a feedforward neural network.

B. Convolutional layer parameters. The parameters of the convolutional layer include the size of the convolution kernel, the stride and the padding. The above three parameters together determine the size of the output feature map of the convolutional layer, which is the hyperparameter of the convolutional neural network. where the kernel size is an arbitrary value smaller than the input image size. Correspondingly, the larger the convolution kernel, the more complex the extractable input features. The convolution stride defines the distance between the positions of the convolution kernel when it scans the feature map twice adjacently. In response to a convolution stride of 1, the convolution kernel will sweep through the elements of the feature map one by one, and a stride of n will skip n-1 elements in the next scan. It should be noted that the purpose of padding is to maintain the feature dimension processed by the convolution kernel.

C. The activation function. The role of the excitation function is to assist in expressing more complex features.

(2) Pooling layer.

After the convolutional layer performs feature extraction, the output feature map is passed to the pooling layer for feature selection and information filtering. Among them, the pooling layer is set with a preset pooling function, and the function of the pooling layer is to replace the result of a single point in the feature map with the feature map statistics of its adjacent areas.

(3) Fully-connected layer.

The fully connected layer is used to non-linearly combine the features extracted by the aforementioned layers to obtain the output, and output the data to the output layer.

Again, the upstream of the output layer in a CNN is usually a fully connected layer. For image classification problems, the output layer uses a logistic function or a normalized exponential function (softmax function) to output the classification labels. In the object detection problem, the output layer can be designed to output the center coordinates, size and classification of objects. In semantic segmentation, the output layer outputs the classification result of each pixel.

Graph Convolutional Neural Network (GCN): It is a convolutional neural network used for feature extraction from graph data.

Knowledge graph: a graph data used to indicate the respective attributes of multiple nodes and the relationship between multiple nodes. In a possible way, the knowledge graph includes a label relationship matrix and a node information matrix. The sum of the above two matrices can be called a knowledge graph.

The present application provides a multi-label classification method for images, which can effectively improve the problems of high error rate or slow operation speed caused by multi-label classification of a single image in the related art. It should be noted that, in the related art, since only the feature extraction is performed on the image, and which tags are closer to the specific tags are determined according to the features in the image, multiple accurate tags can only be determined when the corresponding features of the tags are more prominent or obvious. Label. If the feature of an object to be labeled in the image is not obvious, it is difficult for the related art to determine the corresponding label. However, the solution provided by the present application will be able to identify the above-mentioned tags that need to be added, please refer to the introduction of the following embodiments.

In the embodiments of the present application, a label classification model can be constructed in combination with the structure of a neural network, so as to realize the above-mentioned multi-label classification method for images. It should be noted that before the label classification model is applied, that is, before the inference stage, it needs to go through a training process, which is described as follows.

Please refer to FIG. 1. FIG. 1 is an architecture diagram of a label classification model provided by an embodiment of the present application. In FIG. 1 , the label classification model 100 includes a convolutional neural network 110 , a matrix multiplication module 120 and an activation layer 130 .

In the label classification model 100 shown in FIG. 1 , the convolutional neural network 110 is used to receive the to-be-processed image 1a, and after the to-be-processed image 1a is processed by the convolutional neural network 110, a corresponding image feature matrix 1b is obtained. Subsequently, the image feature matrix 1b and the map feature matrix 1c are multiplied in the matrix multiplication module 120 to obtain the data to be activated 1d, which is input into the activation layer 130, and the activation layer 130 processes the data to be activated to obtain a second label Group 1e. In FIG. 1, the second tag group 1e includes 3 second tags.

It should be noted that the graph feature matrix 1c in FIG. 1 is data updated with the update of the knowledge graph. Combined with the update approach of the graph feature matrix 1c, another architecture of the label classification model can be obtained. Please refer to FIG. 2 . FIG. 2 is an architecture diagram of a label classification model provided based on the embodiment shown in FIG. 1 .

In Figure 2, the knowledge graph includes a label relationship matrix 2a and a node information matrix 2b. The knowledge graph can be input into the graph convolutional neural network 200 to obtain the graph feature matrix 1c. After the knowledge graph is updated, the computer device can re-acquire the label relation matrix 2a and the node information matrix 2b from the updated knowledge graph, and input the label relation matrix 2a and the node information matrix 2b into the graph convolutional neural network 200, so that Obtain the map feature matrix 1c. Illustratively, the graph feature matrix 1c is only updated when the knowledge graph changes. When the knowledge graph does not change, the graph feature matrix 1c keeps the value obtained by the last calculation and participates in the calculation process shown in FIG. 1 .

In the architecture based on the classification shown in FIG. 1 and FIG. 2 , the computer device can train the architecture shown in FIG. 2 when constructing the label classification model. In another possible way, the structure shown in FIG. 2 is also referred to as a dual-branch architecture.

When training the model shown in Figure 2, the knowledge graph needs to be constructed first. For the construction process of knowledge graph, it can be divided into keyword collection stage and knowledge graph construction stage.

In the keyword collection stage, the server in the cloud can collect a large amount of data of users using mobile phone photo albums. It should be noted that the data collected by the server for the use of the album is desensitized data, and does not involve any user's private information. When the server obtains the data of the use album, the server can extract the keywords frequently searched by the user. The keywords frequently searched by the user may be the top n keywords that appear most frequently among the keywords collected by the server. For the types of keywords, it can include entities, scenes, behaviors, or events. Among them, entities can include entity objects such as cats, dogs, flowers, vehicles, cakes, balloons, dishes, drinks, shops, rivers, beaches, and oceans. The scene can include scene information such as sunrise and sunset, banquet, playground or sports scene. Behaviors include information such as walking, running, eating, and standing. Events include information such as travel, shopping, or eating.

After the server determines the keywords frequently searched by the user, the server may build a tag list including the above keywords. It should be noted that the tag in the tag list here may be the first tag.

Next, the server will build a knowledge graph from the list of tags. In this stage, the server can implement the construction of the knowledge graph by performing the following steps a) to h).

Step a), extract the textual label relationship from the textual knowledge graph.

Illustratively, the text-based knowledge graph may include a knowledge graph such as ConceptNet or WordNet. The textual label relationship can include the semantically own relationship of the label, such as the inclusion relationship or the predicate relationship. In this step, the server will preselect the text-based knowledge graph, and extract the text-based label relationship from the text-based knowledge graph. It should be noted that, in this step, the server can select a knowledge graph with a better use effect of the textual label relationship in the current field. The above-mentioned specific knowledge graph is only an exemplary introduction, and this application does not perform any specific use of the textual indication graph. limited.

Step b), extract the interrelationship of labels in the image from the specified image class dataset.

Illustratively, the correlation may be a conditional probability. In response to the mutual relationship being the conditional probability, please refer to the following formula for the calculation method of the conditional probability.

Among them, P(A|B) is the conditional probability that label A appears when label B appears, P(AB) is the probability that label A and label B appear at the same time, and P(B) is the probability that label B appears.

Step c), calculate the weight between the labels.

In this step, for the weight between the text class relationship labels, refer to the weight in the text class knowledge graph used in step a). For example, refer to the weights in knowledge graphs such as ConceptNet or WordNet. If multiple text label relationships are merged, the weight of the merged relationship will be weighted and averaged as the merged relationship weight; image labels refer to the conditional probability calculation method in step b), and generally do not merge; if there is no text label relationship weight, the weight will be filled with a value of 0 or 1, 0 means there is no relationship between the two nodes, and 1 means there is a relationship between the two nodes. It should be noted that 0 and 1 are used to fill the relationship between two nodes with a logical relationship. In the embodiments of the present application, the nodes in the knowledge graph are used to represent labels. For example, the name of a node in the knowledge graph represents the name of a label mentioned in this application.

Step d), merge the relationship between the text class label and the image class label relationship as the edge of the knowledge graph.

Step e), manually sorting attributes such as definitions, keywords, and synonyms of tags.

In this step, technicians read and sort out the knowledge map to logically determine whether the knowledge map is close to the situation in real life, and manually adjust the abnormal data. The purpose of this step is to improve the ability of the knowledge graph to describe the correlation between photos in real life.

Step f), use the specified algorithm to implement embedding (embedding) on the label, and obtain the word embedding (word embedding).

The specified algorithm may be an algorithm with embedding capability such as Glove.

In step g), definitions, keywords, synonyms or word vectors are obtained from the above data as node attributes of the knowledge graph.

Step h), merge edges and nodes to obtain the established knowledge graph.

In the above-established knowledge graph, each node represents a label, and edges represent the relationship between the labels, these relationships include but are not limited to the upper-lower relationship, the correlation relationship, the position relationship in the image and the predicate relationship, etc. Among them, the upper-lower relationship is used to indicate the relationship between the upper-level concept and the lower-level concept.

For example, "pet" is a superordinate concept of "cat", and "cat" is a subordinate concept of "pet". Correlation is used to indicate the probability of two labels appearing in an image at the same time. The positional relationship in the image is used to indicate the positional relationship of the two labels in the image, for example, "apple" is above "table" and "floor" is below "table". Predicate relationships are used to indicate the definition of some labels, etc. For example, "apple" is "food".

For the properties of each node, the above properties include but are not limited to the following content Embedding, node type and synonyms. Among them, Embedding is the word vector obtained by the tag embedding the tag name through NLP (Natural Language Processing) related algorithms. Node types can include objects, scenes, or events, etc.

To sum up, the knowledge graph can be constructed through the above process. When a knowledge graph is constructed, the relevant data in the knowledge graph is also fixed. Next, the server can train the label classification model applied in this application according to the architecture shown in FIG. 2 to obtain a label classification model that can be used for inference.

Taking the label classification model shown in Figure 2 as an example, the training process of the entire label classification model is introduced. The data in the label classification model that needs to be updated during the training phase are the parameters in the convolutional neural network 110 and the parameters in the graph convolutional neural network 200 . In response to the end of the training process, the parameters in the convolutional neural network 110 and the parameters in the graph convolutional neural network 200 are fixed.

When the graph convolutional neural network 200 is trained, each graph convolutional layer can be represented by the following formula.

In the above formula, H ^(l) is the input of the current graph convolutional layer. H ⁽¹⁾ is the input of the first graph convolutional layer in the graph convolutional neural network 200, that is, H ⁽¹⁾ is the node information matrix of the input graph convolutional neural network 200.

is the label relationship matrix A plus the matrix after self-connection, that is

for

The degree matrix of ,

Equivalent to the adjacency matrix

Do normalization. W ^(l) is the parameter to be learned during training and σ(.) is the activation function.

During the training process, each graph convolution layer will process the node information output by the previous graph convolution layer to obtain new node information and output it to the next graph convolution layer. There is no change in the convolutional neural network 200 .

After the training of the label classification model is completed, the computer device can use the model to perform the multi-label classification method for images shown in this application. For details, please refer to the introduction of FIG. 3 .

Please refer to FIG. 3 . FIG. 3 is a flowchart of a multi-label classification method for images provided by an embodiment of the present application. FIG. 3 can be applied to a computer device. In the embodiment of the application, the computer device can be either a terminal or a server. During the execution of this method, please refer to the following introduction.

In this application, a computer device can acquire images to be processed.

The manner of acquiring the image to be processed may be different according to the specific implementation manner of the computer device.

Illustratively, when the computer device is a terminal, the terminal can directly use the image acquisition component to capture images, and use the captured images as the images to be processed. In another possible manner, the terminal may acquire images from other computer devices, and use the acquired images as the images to be processed. In another possible manner, the terminal can also synthesize a virtual image according to a specified instruction and data through an installed image synthesizing application, and use the virtual image as the image to be processed.

Illustratively, when the computer device is a server, the server may receive an image uploaded by the terminal, and use the image as an image to be processed. Alternatively, the server can also synthesize a virtual image through an installed image synthesizing application according to specified instructions and data, and use the virtual image as the image to be processed.

Regarding the number of images to be processed, the number of images to be processed may be one or multiple. When there are multiple images to be processed, the computer device may choose to process the multiple images to be processed in a serial manner or in a parallel manner.

In serial mode, the computer device will process the next image after an image has been successfully tagged with at least two second labels.

In the parallel manner, the computer device will process several images simultaneously, and several images will simultaneously obtain their corresponding second labels.

It should be noted that the number of images to be processed and the serial mode or parallel mode adopted by the computer device in the embodiment of the present application will vary according to the actual application scenario, which is not limited in the embodiment of the present application.

Step 310: Extract the image features of the image to be processed through the feature extraction layer in the label classification model, where the label classification model is a neural network model for adding at least two labels to the image to be processed.

Illustratively, after acquiring the image to be processed, the computer device will be able to extract image features from the image to be processed. In this example, the computer device extracts the image features of the image to be processed through the feature extraction layer in the label classification model. In practical applications, the label classification model provided in this application is used to provide at least two labels for an image to be processed.

Optionally, the image features may include global features and local features according to different application scenarios.

In response to the image feature being the global feature, the computer device will use the entire image as a material, and extract the feature of the entire image as the image feature of the image to be processed.

In response to the image feature being a local feature, the computer device will use the identified one or more local regions in the entire image as materials, and extract the corresponding special diagnosis as the image feature of the image to be processed.

Please refer to FIG. 4 , which is a schematic diagram of a global feature provided based on the embodiment shown in FIG. 3 . In FIG. 4 , each pixel in the to-be-processed image 400 is used as a material, and after being processed by a computer device, a global feature 420 is extracted, and the global feature 420 is used to indicate the feature of the to-be-processed image 400 .

Please refer to FIG. 5 , which is a schematic diagram of another partial feature provided based on the embodiment shown in FIG. 3 . In FIG. 5 , after the image 400 to be processed is processed by the computer equipment, three candidate frames appear, and then the computer equipment continues processing, and obtains three sets of local features according to the local images in the three candidate frames, which are respectively the local feature 510 , the local feature 520 and local features 530. In this embodiment of the present application, the sum of the local features 510, the local features 520 and the local features 530 is referred to as an image feature.

In step 320, the image features are processed by the graph feature matrix to obtain the data to be activated. The graph feature matrix is a matrix obtained after the knowledge graph is processed by the graph convolutional neural network. The knowledge graph is used to indicate the attributes of the first label itself, and at least two The relationship between the first labels.

After the computer device obtains the image features, the computer device will obtain the atlas feature matrix. It should be noted that the graph feature matrix is a specified matrix obtained from the knowledge graph. When the knowledge graph does not change or is not updated, the graph feature matrix will not change. That is, the computer device will update the corresponding graph feature matrix when the internally stored knowledge graph is updated. When the knowledge graph stored in the computer device has not changed, the original stored graph feature matrix is not updated.

When the computer device obtains the image feature and the map feature matrix at the same time, the computer device will process the image feature through the map feature matrix to obtain the data to be activated. It should be noted that the calculation method can be adjusted according to the form of the image features. When the image features are in the form of matrices, the computer equipment performs matrix multiplication of the image features and the atlas feature matrix, and uses the result obtained after the multiplication as the data to be activated.

It should be noted that, the knowledge graph applied in the embodiment of the present application is used to indicate not only the attribute of the first tag itself, but also the relationship between at least two first tags.

Step 330: Process the data to be activated through the activation layer in the label classification model to obtain at least two second labels.

In the present application, the computer equipment processes the data to be activated through the activation layer in the label classification model to obtain at least two second labels. The second label is used to indicate a feature in the image to be processed, and each first label is used to indicate that there is a feature in the image to be processed that matches the label. It should be noted that the second label is a label filtered from the first label.

For example, if the first label includes 9 labels as shown in Table 1, the second label may be 3 labels as shown in Table 2. Wherein, the second label belongs to the first label, and the second label is the label in the first label that best matches the characteristics of the image to be processed.

序号serial number	11	22	33	44	55	66	77	88	99
标签Label	沙滩beach	海洋ocean	日出日落sunrise sunset	猫cat	狗dog	宴会banquet	逛街shop	风景landscape	汽车car

Table I

In Table 1, one possible kind of first label is shown. It should be noted that the categories shown in Table 1 are only illustrative, and do not limit the types of the first labels used in the embodiments of the present application. In a possible manner, the first tag may also include a person, and the tag of the person may be specific to the person's name, or may only be a tag that represents the person's age, gender, occupation, and other characteristics.

序号serial number	11	22	33	44
标签Label	海洋ocean	沙滩beach	狗dog	风景landscape

Table II

In Table 2, the shown tags are the second tags screened out from the first tags shown in Table 1 in the embodiment of the present application, including 4 second tags in total. In other words, for the image to be processed, the computer device obtains that the second tags ocean, beach, dog and landscape are all tags that conform to the characteristics of the image to be processed.

It should be noted that, for the image to be processed, if the image includes oceans and dogs with obvious features, it also includes beaches with less obvious features. According to the solution in the related art, only two labels of ocean and dog are marked on the image to be processed with a high probability. According to the solution provided by this application, since this application introduces a graph feature matrix in the process of judgment, the matrix is derived from the knowledge graph, and the knowledge graph can be used to indicate the relationship between the two first tags. Therefore, the graph feature matrix It can actually provide a strong correlation between the ocean and the beach, as well as the strong correlation between the ocean and the landscape, as well as the strong correlation between the beach and the landscape, that is, the method provided by this application is more likely to identify the ocean. , Beach, Dog, and Landscape are also used as second tags for the images to be processed.

In an actual implementation process, each first label has its own corresponding threshold. When the probability value of the first label obtained by the activation layer is greater than its corresponding threshold, the activation layer determines the first label as the first label. Two labels. Illustratively, the data shown in Table 1 and Table 2 are taken as examples for introduction.

Please refer to Table 3. The data shown in Table 3 are based on the probability value indicated by the first tag shown in Table 1 according to the activation data and the preset threshold.

序号serial number	11	22	33	44	55	66	77	88	99
标签Label	沙滩beach	海洋ocean	日出日落sunrise sunset	猫cat	狗dog	宴会banquet	逛街shop	风景landscape	汽车car
预设阈值Preset Threshold	0.80.8	0.950.95	0.960.96	0.980.98	0.980.98	0.880.88	0.750.75	0.850.85	0.950.95
实测概率Measured probability	0.830.83	0.980.98	0.650.65	0.320.32	0.990.99	0.110.11	0.320.32	0.930.93	0.260.26

Table 3

In Table 3, after the image to be processed is processed by the label classification model, the measured probability of each first label is obtained, and the measured probability is included in the data to be activated processed by the activation layer. The preset threshold corresponding to each first label may be pre-stored in the activation layer. The activation layer can obtain the measured probability between the image to be processed and each label and compare it with a preset threshold, and determine the first label whose measured probability is higher than the preset threshold as the second label. For example, according to the data shown in Table 3, the first tags corresponding to serial numbers 1, 2, 5 and 8 are determined as the second tags.

Step 340, at least two second tags are determined as tags of the image to be processed, and the second tags belong to the first tags.

In this embodiment of the present application, after the computer device determines at least two second tags, they are used as tags of the image to be processed.

In a possible manner, the second label may be displayed on the processed image to be processed as visual information. Referring to FIG. 6 , FIG. 6 is a visual interface after image processing provided based on the embodiment shown in FIG. 3 . In the user interface 600, after the image 610 is processed, three second labels attached to it can be displayed below, which are the first second label 620, the second second label 630 and the third second label 640, respectively. They are "Beach", "Trees" and "Ocean".

In another possible manner, the second label may not be used as visual information, but as a kind of attribute information of the image. Optionally, the attribute information may be stored in the attribute frame of the image, or may be additionally stored in a file designated by the computer device. Among them, the attribute frame of the image, as a part of the image, is copied with the copying of the image, and disappears with the deletion of the image.

In a practical application scenario, if several images are marked with multiple second tags, the computer device can intelligently generate albums according to the multiple tags. For example, when "beach", "ocean" and "landscape" appear in several images, these images are intelligently combined into a photo album named "Seaside Play". It should be noted that, the operation of intelligently generating an album can be completed on the terminal side or on the server side.

When the operation is completed in the server, the terminal can upload the image captured by the local end to the server through cloud backup or other forms. Thus, the server realizes the operation of intelligently generating a photo album for multiple images.

To sum up, the multi-label classification method for images provided by this application can obtain the data to be activated in combination with the atlas feature matrix after extracting the image features of the to-be-processed image, and obtain at least two second labels according to the data to be activated, Use the second label as the label of the image to be processed. Among them, the graph feature matrix is a matrix obtained after the knowledge graph is processed by the graph convolutional neural network, and the knowledge graph is used to indicate the attributes of the first label itself, and the relationship between at least two first labels. In the process of determining the second label of the image to be processed, a knowledge graph reflecting the relationship between each first label is introduced, and the graph feature matrix obtained from the knowledge graph is used to assist in determining the second label, avoiding the need for some second labels to be processed. The problem that the features in the image are not obvious, and thus is missed in the process of determining the label, improves the accuracy of determining the multiple labels of the image.

Based on the above trained label classification model, an embodiment of the present application provides a multi-label classification method for images based on the label classification model. Through the label classification model, the present application can obtain a more accurate multi-label classification result for an image to be processed. For details, see the introduction below.

Please refer to FIG. 7. FIG. 7 is a flowchart of a method for classifying images with multiple labels according to another exemplary embodiment of the present application. The multi-label classification method of the image can be applied to the terminal or server shown above. In Figure 7, the multi-label classification method for this image includes:

Step 711, acquiring the image to be processed.

In this embodiment of the present application, there may be different methods for acquiring images to be processed according to different execution subjects of the application.

In a possible manner, when the image to be processed is acquired by the server, the server acquires the image to be processed from the data transmitted from the terminal. Optionally, the manner in which the terminal transmits data to the server may include scenarios such as cloud album synchronization, smart album creation, or cloud backup.

In another possible way, when the image to be processed is acquired by the terminal, the terminal will extract the image to be processed from the locally stored gallery, and the image can be either shot by the terminal itself or sent to the terminal after being shot by other terminals. image.

In the subsequent steps, the implementation process of the embodiment shown in FIG. 7 is introduced by taking the method applied to the terminal as an example.

Step 712, input the image to be processed into the convolutional neural network.

The image to be processed can be directly input into the convolutional neural network, and the image is processed through the convolutional neural network.

Step 713: Process the image to be processed through a convolutional neural network to obtain an image feature matrix.

In this example, the convolutional neural network includes several layers of structures, and the convolutional neural network sequentially passes through the above-mentioned several layers of structures to obtain the image feature matrix.

In one possible way, the label classification model includes an input layer, a convolutional layer and a pooling layer. The process of processing the image to be processed through the convolutional neural network may include inputting the image to be processed into the input layer, and through the above layer-by-layer processing, an image feature matrix is finally obtained.

Illustratively, the computer equipment can input the image to be processed into the input layer to obtain the first intermediate data; input the first intermediate data into the convolution layer to obtain the second intermediate data; input the second intermediate data into the pooling layer to obtain the image features matrix.

After the image to be processed is input to the input layer, the first intermediate data is obtained after processing by the input layer. Then, the input layer in the neural network is connected with the convolution layer, and the convolution layer processes the first intermediate data to obtain the second intermediate data. In the neural network, the pooling layer is connected to the convolutional layer. After the pooling layer processes the second intermediate data, the image feature matrix is obtained.

It should be noted that the computer device may execute steps 721 to 723 when the atlas feature matrix has not been stored to obtain the atlas feature matrix. In response to the graph feature matrix stored by the computer device being the graph feature matrix corresponding to the latest version of the knowledge graph, the computer device can directly use the stored graph feature matrix in the process of marking the image to be processed with multiple second labels, without the need for Steps 721 to 723 are performed.

Step 721 , input the label relationship matrix into the graph convolutional neural network, where the label relationship matrix is used to indicate the relationship between at least two first labels.

In the graph convolutional neural network, two matrices need to be input, and the label relationship matrix is used as one of the inputs, which will be input into the graph convolutional neural network in this embodiment of the present application. Wherein, the label relationship matrix is used to indicate the relationship between at least two first labels.

Step 722: Input the node information matrix into the graph convolutional neural network, where the node information matrix is used to indicate the attributes of the first label itself.

Optionally, when inputting the label relationship matrix into the graph convolutional neural network, the computer device can also input the node information matrix into the graph convolutional neural network. Illustratively, the node information matrix and the label relationship matrix together form a knowledge graph.

Step 723: Process the label relationship matrix and the node information matrix through a graph convolutional neural network to obtain a graph feature matrix.

It should be noted that, in response to the update of the knowledge graph that generates the graph feature matrix, the computer device will regenerate a new graph feature matrix according to the updated knowledge graph, and store the new graph feature matrix to process the image features and obtain the to-be-activated data.

In actual operation, the computer equipment completes the update in response to the data in the knowledge graph, and obtains the updated knowledge graph; processes the updated knowledge graph through the graph convolutional neural network, and obtains the updated graph feature matrix; through the updated graph feature matrix, which updates the graph feature matrix in the label classification model.

It should be noted that the update of the knowledge graph can be performed on the server side, and the server calculates the updated graph feature matrix after the knowledge graph is updated, and pushes the updated graph feature matrix to the terminal as new information. The terminal processes the image features according to the new atlas feature matrix, and obtains the data to be activated.

In a possible implementation manner, the scale of the graph feature matrix is C*N, where C is the number of first labels, N is the feature dimension, and both C and N are positive integers.

Correspondingly, the size of the image feature matrix is N*1, the size of the map feature matrix is C*N, the size of the data matrix to be activated is C*1, C is the number of first labels, N is the feature dimension, C and N are positive integers.

Based on the above-mentioned size of the map feature matrix is C*N and the size of the image feature matrix is N*1, the size of the data matrix to be activated is C*1. Generally speaking, for the data matrix to be activated, each row of data corresponds to the data after a first label is activated.

Step 731: Multiply the image feature matrix and the atlas feature matrix to obtain a data matrix to be activated.

Step 732: Process the data matrix to be activated through the activation layer in the label classification model to obtain at least two second labels.

In this application, the terminal may perform step (3a), step (3b) and step (3c) instead of realizing the effect of obtaining at least two second tags shown in step 732.

Step (3a), input the data matrix to be activated into the activation layer.

In step (3b), the data to be activated is processed by the activation layer to obtain a probability value corresponding to the first label, and the probability value is used to indicate the probability that the first label conforms to the image to be processed.

Step (3c), in response to the probability value being higher than the corresponding first threshold, determine the corresponding first label as the second label, and the first threshold is a threshold for judging whether the first label conforms to the image to be processed.

It should be noted that, the first threshold may be in one-to-one correspondence with the first label. When the number of first labels used in the label classification model is i, the number of first thresholds is also i.

In this embodiment of the present application, when the computer device completes step 732, the image to be processed has obtained at least two second tags. Each image to be processed can obtain at least two second labels to which it belongs through the above process. The computer device can achieve the effect of multi-label classification by performing steps 711 to 732 provided in this embodiment of the present application on multiple images to be processed. However, in order to provide a solution with a more accurate classification effect, the embodiment of the present application may also add an image post-processing process, which determines whether to add a specified second label to the to-be-processed image based on features other than the image content of the to-be-processed image. .

Illustratively, the computer device acquires the shooting time relationship information between the first image and the second image in response to the first image and the second image having acquired their corresponding second labels. Both the first image and the second image are images to be processed to which the second label has been added. For example, see Table IV. Table 4 shows a situation of the second label after the first image and the second image are processed.

Table 4

From the data shown in Table 4, it can be seen that the second image includes three second labels, namely "ocean", "dog" and "landscape", after a plurality of second labels are applied through the scheme shown in this application. After the first image is marked with a plurality of second labels through the solution shown in this application, it includes four second labels, namely "ocean", "beach", "dog" and "landscape". In this case, the computer device will acquire the shooting time relationship information between the first image and the second image.

Wherein, the shooting time relationship information is used to indicate the time sequence relationship between the first image and the second image at the shooting time, or the shooting time relationship information is used to indicate the duration between the shooting time of the first image and the shooting time of the second image .

In the first case indicated by the shooting relationship information, the shooting time relationship information indicates a timing relationship. The time sequence relationship includes two cases. The first case is that the shooting time of the first image is earlier than the shooting time of the second image. The second case is that the shooting time of the first image is later than the shooting time of the second image. It should be noted that, because the first image and the second image processed by this application are images captured by the same terminal by default. Therefore, in terms of image capture logic, there is no capture moment of the first image equal to the capture moment of the second image.

Optionally, in this embodiment of the present application, the first image and the second image are images captured by the same terminal through the same set of cameras. When the terminal for capturing images is a smart phone, the camera of the smart phone usually includes two groups of a front camera group and a rear camera group, and the smart phone can select one group of cameras to capture images. In a rarer scenario, the smartphone includes only one set of cameras, but the set of cameras can be flipped to capture the front side or flip to capture the back side. The first image and the second image shown in the embodiment of the present application are two images captured by the camera group facing the same side. Among them, the smartphone can determine the current orientation of the camera through the status information.

In the second case indicated by the shooting time relationship information, the shooting time relationship information indicates a kind of duration information, and the duration information is the duration between the shooting time of the first image and the shooting time of the second image, and the information can be Is a fixed value, the precision can be minutes, seconds or milliseconds, etc. It should be noted that the precision value may vary according to the application scenario. When the embodiments of the present application are applied in scenes such as daily shooting of portraits and landscapes, the accuracy may be on the order of seconds. When the embodiments of the present application are applied to a scene of shooting a high-speed moving object, the accuracy may be milliseconds, such as a high-speed moving person, a vehicle, or particles in a microscopic scene. When the embodiment of the present application is applied in a scene of monitoring a nature reserve, the accuracy may be minutes. The embodiments of the present application only provide a schematic introduction to the accuracy of the shooting relationship information, and do not limit the actual scene.

Combining the contents indicated by the above two kinds of shooting time relationship information, the computer equipment can add the target second label to the second label corresponding to the second image when the shooting time relationship information meets the preset conditions, and the target second label is corresponding to the second label of the second image. a label that does not correspond to a second image. The preset condition may be used to indicate that the shooting time of the first image is close to the shooting time of the second image. That is, when the shooting time relationship information indicates that the shooting time of the first image and the second image are close, the corresponding shooting time relationship information meets the preset condition.

For the above-mentioned shooting time relationship information that meets the preset conditions, a specific scene is introduced here. When the photographing time relationship information indicates a time sequence relationship, the computer device implements the operation of adding a label to the second image through steps (4a) and (4b).

Step (4a), in response to the first image and the second image having acquired the respective corresponding second labels, acquire the target duration.

The target duration is the duration between the shooting time of the first image and the shooting time of the second image.

Step (4b), in response to the target duration being less than the second threshold, adding the target second label to the second label corresponding to the second image.

In the embodiment of the present application, when the target duration is less than the second threshold, it means that the shooting time of the first image is close to the shooting time of the second time. In this scenario, the scene in the first image is likely to be similar to the scene in the second image. Therefore, the computer device can use the label not in the second image and the label in the first image as the second label, and also print it on the second image.

When the shooting time relationship information indicates the duration, the computer device implements the operation of adding a label to the second image through steps (5a) and (5b).

Step (5a), in response to the first image and the second image having acquired their corresponding second labels, and the number of the first images is 2k, the k first images are images taken before the second image, and the k first images are the images taken before the second image. The image is an image captured after the second image, and the capturing moment of the first image is obtained.

In step (5b), in response to the length of the interval where the 2k first images are taken at the time of shooting is less than the third threshold, the target second label is added to the second label corresponding to the second image, where k is an integer greater than or equal to 1.

In this embodiment of the present application, the number of first images may be selected as a total of 2k images, and these first images are images continuously captured by the terminal before and after capturing the second image.

For example, please refer to FIG. 8 , which is a schematic diagram of an image post-processing provided based on the embodiment shown in FIG. 7 . In FIG. 8, the value of k is 3. According to the order of shooting time from morning to night, the terminal continuously shoots the first first image 811, the second first image 812, the third first image 813, Second image 820 , fourth first image 814 , fifth first image 815 , and sixth first image 816 . At the same time, you can refer to Table 5. Table 5 shows the shooting time of each image.

图像811image 811	图像812image 812	图像813image 813	图像820 image 820	图像814image 814	图像815image 815	图像816image 816
10:24:4910:24:49	10:24:5610:24:56	10:25:0610:25:06	10:25:1710:25:17	10:25:2410:25:24	10:25:2910:25:29	10:25:3510:25:35

Table 5

Among the 7 images shown in Table 5, the 6 first images all have the second label "beach", and the second image 820 does not correspond to the second label "beach".

In the first processing stage 8A, the second labels of the second image 820 are "tree" and "ocean". The second labels of the other 6 first images are "tree", "ocean" and "beach". In the second processing stage 8B, the computer device determines that the three first images are located before the shooting time of the second image, and the other three first images are located after the shooting time of the second image, and obtains the shooting time of each first image .

In this example, the third threshold is 60 seconds, and the duration between the shooting period of the first first image 811 and the shooting time of the sixth first image 816 is 46 seconds. That is, if the length of the interval in which the six first images are taken is less than the third threshold of 60 seconds, the computer device will use the second label "beach" in the six first images as the second processing stage 8B. The target second label, copied to the second label corresponding to the second image. It should be noted that, in the first processing stage 8A and the second processing stage 8B, the second label corresponding to the first image does not change. Therefore, the first image is not shown repeatedly in FIG. 8 .

Please refer to FIG. 9 , which is a schematic diagram of a process of automatically generating an album provided by an embodiment of the present application. In the image acquisition stage 9A of Figure 9, the computer device acquires several images to be processed that need to be processed. If the computer device is a server, the image acquisition stage 9A may be to receive photos uploaded by the terminal, and the process of landing may be that the terminal performs cloud backup or album backup in the server. If the computer device is a terminal, the image acquisition stage 9A may be a process of taking pictures. After the pictures are taken and stored, the computer device has obtained several pieces of devices to be processed.

After the computer device has collected the to-be-processed image, the computer device may add at least two second labels to the to-be-processed image through the label classification model provided in the present application in the multi-label determination stage 9B.

When at least two tags have been added to several images to be processed, the computer device can divide the images to be processed into the first image and the second image in the image post-processing stage 9C, and classify the images according to the first image and the second image. Whether the photographing time relationship information between the images meets a predetermined condition determines whether the second image is supplemented with a target second label, which is a label corresponding to the first image and not corresponding to the second image.

After the to-be-processed image is processed in the image post-processing stage 9C, the computer device can generate a designated album according to a preset strategy, and the album includes the to-be-processed image. In a possible strategy, the computer device selects m tags, generates the first album according to the images to be processed including the m tags, and generates the name of the album according to the selected m tags. In another possible strategy, the computer device will define a specified shooting location and m tags, and generate a second album of similar content shot at the specified shooting location. In another possible strategy, the computer device will define a specified shooting time and m tags to generate a third album of similar content shot at the specified shooting time. It can be seen that the solutions provided by the embodiments of the present application can add multiple tags to the images to be processed under the premise of high accuracy, and can intelligently generate corresponding albums based on this, which improves the efficiency and accuracy of automatically generating albums , reducing the occurrence of missing images that actually meet the standards of the album when generating the album.

To sum up, the label classification model used in this embodiment includes the structure of a convolutional neural network. The convolutional neural network is used to extract the image content in the image to be processed. After the convolutional neural network extracts the image feature matrix, , which can be processed by the map feature matrix derived from the knowledge map to obtain the data to be activated. After the data to be activated is processed by the activation layer, at least two second labels can be obtained, and the effect of marking multiple labels for the image to be processed has been completed.

Optionally, the embodiment of the present application can also introduce a graph convolutional neural network to process the knowledge graph, so as to obtain a graph feature matrix for processing the image feature matrix, so that when multiple second labels are added to the to-be-processed image, the entry can be activated. The data before the layer is checked and balanced by the mutual relationship between the first nodes in the knowledge graph, thereby avoiding the omission of inconspicuous labels in the image to be processed, and improving the accuracy of adding multiple second labels to the image to be processed.

Optionally, in this embodiment, after the image to be processed is marked with at least two second labels, it can be detected again whether there is a second label that has not been marked in the image to be processed through the data post-processing stage. In this post-processing stage, the computer equipment will detect whether there is a second label not marked on the image to be processed in the image adjacent to the image to be processed, if the image adjacent to the image to be processed is different from the image to be processed at the time of shooting If the shooting times of the images are relatively close, the present application marks the images adjacent to the images to be processed with second labels that are not marked on the to-be-processed images on the to-be-processed images to improve the accuracy of the second label labeling.

Optionally, when there is a second label in the first k images and the last k images of the image to be processed, and the time interval where the first k images and the last k images are located is within the specified duration, then the The second label will be marked on the image to be processed, thereby further improving the accuracy of the second label labeling.

The following are apparatus embodiments of the present application, which can be used to execute the method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.

Please refer to FIG. 10 . FIG. 10 is a structural block diagram of an apparatus for classifying images with multiple labels according to an exemplary embodiment of the present application. The image multi-label classification device can be implemented as all or a part of the terminal through software, hardware or a combination of the two. The apparatus includes a feature extraction module 1010 , a first acquisition module 1020 , a label acquisition module 1030 and a label determination module 1040 . The specific functions of the above modules are introduced.

The feature extraction module 1010 is configured to extract image features of the image to be processed through a feature extraction layer in a label classification model, where the label classification model is a neural network model for adding at least two labels to the image to be processed.

The first acquisition module 1020 is configured to process the image features through a graph feature matrix to obtain the data to be activated. The graph feature matrix is a matrix obtained after the knowledge graph is processed by a graph convolutional neural network, and the knowledge graph is used to indicate The properties of the first tag itself, and the relationship between at least two of the first tags.

The label obtaining module 1030 is configured to process the data to be activated through the activation layer in the label classification model to obtain at least two second labels.

The label determination module 1040 is configured to determine at least two of the second labels as labels of the image to be processed, and the second labels belong to the first labels.

In an optional embodiment, the first obtaining module 1020 is configured to multiply the image feature matrix and the atlas feature matrix to obtain a data matrix to be activated. The label obtaining module 1030 is configured to process the data matrix to be activated through the activation layer in the label classification model to obtain at least two second labels.

In an optional embodiment, the knowledge graph involved in the apparatus includes a label relationship matrix and a node information matrix, and the apparatus further includes a first input module, a second input module, and a second acquisition module. The first input module is configured to input the label relationship matrix into the graph convolutional neural network, where the label relationship matrix is used to indicate the relationship between at least two of the first labels; the second input a module for inputting the node information matrix into the graph convolutional neural network, where the node information matrix is used to indicate the attributes of the first label itself; the second acquisition module is used for passing the graph volume The product neural network processes the label relationship matrix and the node information matrix to obtain the graph feature matrix.

In an optional embodiment, the apparatus further includes a third acquiring module, a fourth acquiring module and a matrix updating module. The third acquisition module is used to complete the update in response to the data in the knowledge map, and acquire the updated knowledge map; the fourth acquisition module is used to process the updated knowledge map through the graph convolutional neural network The updated knowledge graph is obtained, and the updated graph feature matrix is obtained; the matrix update module is configured to update the graph feature matrix in the label classification model through the updated graph feature matrix.

In an optional embodiment, the scale of the graph feature matrix involved in the device is C*N, where C is the number of the first labels, N is the feature dimension, and both C and N are positive integer.

In an optional embodiment, the size of the image feature matrix involved in the apparatus is N*1, the size of the atlas feature matrix is C*N, and the size of the data matrix to be activated is C*1 , C is the number of the first label, N is the feature dimension, and both C and N are positive integers.

In an optional embodiment, the to-be-processed images involved in the apparatus include a first image and a second image, and the apparatus further includes a post-processing module, where the post-processing module is configured to respond to the first image An image and the second image have acquired the respective corresponding second labels, and acquired the shooting time relationship information between the first image and the second image, and the shooting time relationship information is used to indicate the the time sequence relationship between the first image and the second image at the shooting time, or the shooting time relationship information is used to indicate the duration between the shooting time of the first image and the shooting time of the second image; In response to the photographing moment relationship information meeting a preset condition, the target second label is added to the second label corresponding to the second image, and the target second label corresponds to the first image and does not correspond to the second label on the second image.

In an optional embodiment, the post-processing module is configured to, in response to the target duration being less than a second threshold, add the target second label to the second label corresponding to the second image, the The target duration is the duration between the capture moment of the first image and the capture moment of the second image.

In an optional embodiment, the post-processing module is configured to respond that the number of the first images is 2k, and among the 2k first images, the k first images are the The images taken before the second image, the k first images are images taken after the second image, and the shooting time of the first image is obtained; in response to the interval in which the shooting time of the 2k first images is located is less than the third threshold, the target second label is added to the second label corresponding to the second image, the target second label is the label corresponding to the 2k first images, and the target second label is the label corresponding to the first image. The second label is the second label not corresponding to the second image, and k is an integer greater than or equal to 1.

Exemplarily, the multi-label classification method for images shown in the embodiments of the present application may be applied to a computer device, and the computer device may be a terminal having a display screen and a multi-label image classification function. Terminals can include mobile phones, tablet computers, laptop computers, desktop computers, computer all-in-one computers, servers, workstations, TVs, set-top boxes, smart glasses, smart watches, digital cameras, MP4 playback terminals, MP5 playback terminals, learning machines, point-of-view computer, electronic paper book, electronic dictionary, vehicle terminal, virtual reality (Virtual Reality, VR) playback terminal or augmented reality (Augmented Reality, AR) playback terminal, etc.

Please refer to FIG. 11 , which is a structural block diagram of a terminal provided by an exemplary embodiment of the present application. As shown in FIG. 1 , the terminal includes a processor 1120 and a memory 1140 , and the memory 1140 stores at least one instruction , the instructions are loaded and executed by the processor 1120 to implement the multi-label classification method for images according to the various method embodiments of the present application.

The processor 1120 may include one or more processing cores. The processor 1120 uses various interfaces and lines to connect various parts in the entire terminal 110, and executes the terminal by running or executing the instructions, programs, code sets or instruction sets stored in the memory 1140, and calling the data stored in the memory 1140. 110 various functions and processing data. Optionally, the processor 1120 may use at least one of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). A hardware form is implemented. The processor 1120 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. Among them, the CPU mainly handles the operating system, user interface, and application programs; the GPU is used to render and draw the content that needs to be displayed on the display screen; the modem is used to handle wireless communication. It can be understood that, the above-mentioned modem may not be integrated into the processor 1120, and is implemented by a single chip.

The memory 1140 may include random access memory (Random Access Memory, RAM), or may include read-only memory (Read-Only Memory, ROM). Optionally, the memory 1140 includes a non-transitory computer-readable storage medium. Memory 1140 may be used to store instructions, programs, codes, sets of codes, or sets of instructions. The memory 1140 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playback function, an image playback function, etc.), Instructions, etc., used to implement the following method embodiments; the storage data area can store data and the like involved in the following method embodiments.

In this embodiment of the present application, the computer device may also be a server, and the structure of the server may refer to the structure shown in FIG. 12 .

Please refer to FIG. 12 , which is a schematic structural diagram of a server provided by an embodiment of the present application. The server is used to implement the application deployment method provided by the above embodiment. Specifically:

The server 1200 includes a central processing unit (CPU) 1201, a system memory 1204 including a random access memory (RAM) 1202 and a read only memory (ROM) 1203, and a system bus 1205 connecting the system memory 1204 and the central processing unit 1201. The server 1200 also includes a basic input/output system (Input/Output, I/O system) 1206 that helps to transfer information between various devices in the computer, and is used to store the operating system 1213, application programs 1214 and other program modules 1215 The mass storage device 1207.

The basic input/output system 1206 includes a display 1208 for displaying information and input devices 1209 such as a mouse, keyboard, etc., for user input of information. The display 1208 and the input device 1209 are both connected to the central processing unit 1201 through the input and output controller 1210 connected to the system bus 1205. The basic input/output system 1206 may also include an input output controller 1210 for receiving and processing input from a number of other devices such as a keyboard, mouse, or electronic stylus. Similarly, input output controller 1210 also provides output to a display screen, printer, or other type of output device.

The mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205 . The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the server 1200 . That is, the mass storage device 1207 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.

Without loss of generality, the computer-readable media can include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, EPROM (Electrical Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory, electrically erasable programmable read only memory), flash memory or other solid-state storage technologies, CD-ROM, DVD (Digital Video Disc, High Density Digital Video Disc) or other optical storage, cassette, magnetic tape, magnetic disk storage or other magnetic storage device. Of course, those skilled in the art know that the computer storage medium is not limited to the above-mentioned ones. The system memory 1204 and the mass storage device 1207 described above may be collectively referred to as memory.

According to various embodiments of the present application, the server 1200 may also be operated by connecting to a remote computer on the network through a network such as the Internet. That is, the server 1200 can be connected to the network 1212 through the network interface unit 1211 connected to the system bus 1205, or can also use the network interface unit 1211 to connect to other types of networks or remote computer systems.

Embodiments of the present application further provide a computer-readable medium, where the computer-readable medium stores at least one instruction, and the at least one instruction is loaded and executed by the processor to realize the multiplexing of images according to the above embodiments. Label classification method.

It should be noted that when the apparatus for multi-label classification of images provided in the above embodiments executes the method for multi-label classification of images, only the division of the above-mentioned functional modules is used as an example for illustration. In practical applications, the above-mentioned functions may be allocated as required. It is completed by different functional modules, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for multi-label classification of images provided by the above embodiments and the embodiments of the multi-label classification method for images belong to the same concept, and the specific implementation process is detailed in the method embodiments, which will not be repeated here.

The above-mentioned serial numbers of the embodiments of the present application are only for description, and do not represent the advantages or disadvantages of the embodiments.

Those of ordinary skill in the art can understand that all or part of the steps of implementing the above embodiments can be completed by hardware, or can be completed by instructing relevant hardware through a program, and the program can be stored in a computer-readable storage medium. The storage medium mentioned may be a read-only memory, a magnetic disk or an optical disk, etc.

The above descriptions are only exemplary embodiments that can be implemented in the present application, and are not intended to limit the present application. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present application shall be included in the within the scope of protection of this application.

Claims

A multi-label classification method for images, wherein the method comprises:

Extract the image features of the image to be processed through the feature extraction layer in the label classification model, where the label classification model is a neural network model for adding at least two labels to the image to be processed;

The image features are processed by a graph feature matrix to obtain the data to be activated. The graph feature matrix is a matrix obtained after a knowledge graph is processed by a graph convolutional neural network, and the knowledge graph is used to indicate the attributes of the first label itself, and , the relationship between at least two of the first tags;

The data to be activated is processed by the activation layer in the label classification model to obtain at least two second labels;

At least two of the second tags are determined as tags of the image to be processed, and the second tags belong to the first tags.
The method according to claim 1, wherein the processing of the image features by the atlas feature matrix to obtain the data to be activated comprises:

Multiplying the image feature matrix and the atlas feature matrix to obtain a data matrix to be activated;

The to-be-activated data is processed by the activation layer in the label classification model to obtain at least two second labels, including:

The data matrix to be activated is processed by the activation layer in the label classification model to obtain at least two second labels.
The method according to claim 2, wherein the knowledge graph comprises a label relationship matrix and a node information matrix, and the method further comprises:

inputting the label relationship matrix into the graph convolutional neural network, where the label relationship matrix is used to indicate a relationship between at least two of the first labels;

inputting the node information matrix into the graph convolutional neural network, where the node information matrix is used to indicate the attribute of the first label itself;

The graph feature matrix is obtained by processing the label relationship matrix and the node information matrix through the graph convolutional neural network.
The method of claim 3, further comprising:

Acquiring an updated knowledge graph in response to the data in the knowledge graph being updated;

Process the updated knowledge graph through the graph convolutional neural network to obtain the updated graph feature matrix;

The atlas feature matrix in the label classification model is updated through the updated atlas feature matrix.
The method according to claim 3, wherein the scale of the graph feature matrix is C*N, where C is the number of the first labels, N is the feature dimension, and both C and N are positive integers.
The method according to claim 2, wherein the size of the image feature matrix is N*1, the size of the graph feature matrix is C*N, the size of the data matrix to be activated is C*1, and C is the size of the The number of first labels, N is the feature dimension, and C and N are both positive integers.
The method of claim 1, wherein the images to be processed include a first image and a second image, the method further comprising:

In response to the first image and the second image having acquired the respective corresponding second labels, acquiring shooting time relationship information between the first image and the second image, the shooting time relationship information used to indicate the time sequence relationship between the first image and the second image at the shooting time, or the shooting time relationship information is used to indicate the shooting time of the first image and the shooting time of the second image the time between;

In response to the photographing moment relationship information meeting a preset condition, the target second label is added to the second label corresponding to the second image, and the target second label corresponds to the first image and does not correspond to the second label on the second image.
The method according to claim 7, wherein in response to the photographing moment relationship information meeting a preset condition, adding a target second label to the second label corresponding to the second image, comprising:

In response to the target duration being less than the second threshold, adding the target second label to the second label corresponding to the second image, where the target duration is the shooting moment of the first image and the second image of time between capture moments.
The method according to claim 7, wherein in response to the photographing moment relationship information meeting a preset condition, adding a target second label to the second label corresponding to the second image, comprising:

In response to the number of the first images being 2k, among the 2k first images, the k first images are images captured before the second image, and the k first images are the an image taken after the second image, and obtain the shooting moment of the first image;

In response to the length of the interval where the 2k first images are at the shooting time being less than the third threshold, the target second label is added to the second label corresponding to the second image, and the target second label is 2k Each of the first images corresponds to a label and the target second label is the second label not corresponding to the second image, and k is an integer greater than or equal to 1.
A multi-label classification device for images, wherein the device includes:

A feature extraction module for extracting image features of the image to be processed through a feature extraction layer in a label classification model, where the label classification model is a neural network model for adding at least two labels to the image to be processed;

The first acquisition module is used to process the image features through a graph feature matrix to obtain data to be activated. The graph feature matrix is a matrix obtained after the knowledge graph is processed by a graph convolutional neural network, and the knowledge graph is used to indicate the first An attribute of a tag itself, and a relationship between at least two of the first tags;

A label acquisition module, configured to process the data to be activated through the activation layer in the label classification model to obtain at least two second labels;

A label determination module, configured to determine at least two of the second labels as labels of the to-be-processed image, where the second labels belong to the first labels.
The device of claim 10,

The first acquisition module is used to multiply the image feature matrix and the atlas feature matrix to obtain a data matrix to be activated;

The label acquisition module is configured to process the data matrix to be activated through the activation layer in the label classification model to obtain at least two second labels.
The device according to claim 11, wherein the knowledge graph comprises a label relationship matrix and a node information matrix, the device further comprises a first input module, a second input module and a second acquisition module;

the first input module, configured to input the label relationship matrix into the graph convolutional neural network, where the label relationship matrix is used to indicate the relationship between at least two of the first labels;

The second input module is configured to input the node information matrix into the graph convolutional neural network, where the node information matrix is used to indicate the attribute of the first label itself;

The second obtaining module is configured to process the label relationship matrix and the node information matrix through the graph convolutional neural network to obtain the graph feature matrix.
The apparatus according to claim 12, further comprising a third acquisition module, a fourth acquisition module and a matrix update module;

the third obtaining module, configured to complete the update in response to the data in the knowledge graph, and obtain the updated knowledge graph;

The fourth acquisition module is configured to process the updated knowledge graph through the graph convolutional neural network to obtain the updated graph feature matrix;

The matrix updating module is configured to update the atlas feature matrix in the label classification model through the updated atlas feature matrix.
The apparatus according to claim 12, wherein the scale of the graph feature matrix is C*N, where C is the number of the first labels, N is the feature dimension, and both C and N are positive integers.
The apparatus according to claim 11, the scale of the image feature matrix is N*1, the scale of the graph feature matrix is C*N, the scale of the data matrix to be activated is C*1, and C is the scale of the The number of first labels, N is the feature dimension, and C and N are both positive integers.
The device according to claim 10, wherein the to-be-processed image comprises a first image and a second image, the device further comprises a post-processing module, the post-processing module is configured to:

In response to the first image and the second image having acquired the respective corresponding second labels, acquiring shooting time relationship information between the first image and the second image, the shooting time relationship information used to indicate the time sequence relationship between the first image and the second image at the shooting time, or the shooting time relationship information is used to indicate the shooting time of the first image and the shooting time of the second image the time between;

In response to the photographing moment relationship information meeting a preset condition, the target second label is added to the second label corresponding to the second image, and the target second label corresponds to the first image and does not correspond to the second label on the second image.
The apparatus according to claim 16, wherein the post-processing module is configured to, in response to the target duration being less than a second threshold, add the target second label to the second label corresponding to the second image, the The target duration is the duration between the capture moment of the first image and the capture moment of the second image.
The apparatus of claim 16, the post-processing module for:

In response to the number of the first images being 2k, among the 2k first images, the k first images are images captured before the second image, and the k first images are the an image taken after the second image, and obtain the shooting moment of the first image;

In response to the length of the interval where the 2k first images are at the shooting time being less than the third threshold, the target second label is added to the second label corresponding to the second image, and the target second label is 2k Each of the first images corresponds to a label and the target second label is the second label not corresponding to the second image, and k is an integer greater than or equal to 1.
A computer device, wherein the computer device comprises a processor, a memory connected to the processor, and program instructions stored on the memory, the processor implementing the program instructions as claimed in the claims The multi-label classification method of images according to any one of 1 to 9.
A computer-readable storage medium storing program instructions, wherein, when the program instructions are executed by the processor of claim 19, the image as claimed in any one of claims 1 to 9 is implemented. Multi-label classification methods.