CN113204659A - Label classification method and device for multimedia resources, electronic equipment and storage medium


Info

Publication number
CN113204659A
CN113204659A
Authority
CN
China
Prior art keywords
information
text
feature
image
label
Prior art date
Legal status
Granted
Application number
CN202110331593.XA
Other languages
Chinese (zh)
Other versions
CN113204659B (en)
Inventor
吴翔宇
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110331593.XA
Publication of CN113204659A
Application granted
Publication of CN113204659B
Current legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/40 - Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/45 - Clustering; classification
    • G06F 16/48 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 - Retrieval using metadata automatically derived from the content
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features


Abstract

The disclosure relates to a label classification method and device for multimedia resources, electronic equipment and a storage medium. The label classification method comprises the following steps: acquiring a target image and a target text corresponding to a multimedia resource to be processed, and label feature information corresponding to a preset label set; inputting the target image and the target text into a multi-modal feature extraction model and performing feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed; inputting the label feature information into a graph convolution network and performing label feature correlation processing to obtain target label feature description information; performing feature fusion processing on the target image-text feature information and the target label feature description information to obtain target feature information; and determining at least one label from the preset label set as the label information of the multimedia resource according to the target feature information. The technical scheme provided by the disclosure can improve the accuracy of label classification of multimedia resources.

Description

Label classification method and device for multimedia resources, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a method and an apparatus for classifying tags of multimedia resources, an electronic device, and a storage medium.
Background
Label classification is a basis of deep learning and data recommendation services. In the related art, label classification is generally performed based on single-modal features of the data, and the labels in the label set used for classification are organized in a single tree structure. Multimedia data, however, contains multi-modal features such as images, text and sound, so the existing single-modal label classification approaches cannot be applied to data with multi-modal features. In addition, multimedia data is rich in content and generally carries multiple labels, so when the tree-structured labels of the related art are used for label classification of multimedia data, the accuracy of the classification is poor.
Disclosure of Invention
The present disclosure provides a method and an apparatus for classifying tags of multimedia resources, an electronic device, and a storage medium, so as to at least solve the problem of poor accuracy of tag classification of multimedia resources in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a method for classifying tags of multimedia resources, including:
acquiring a target image and a target text corresponding to a multimedia resource to be processed and label characteristic information corresponding to a preset label set;
inputting the target image and the target text into a multi-modal feature extraction model, and performing feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed;
inputting the label characteristic information into a graph convolution network, and performing label characteristic correlation processing to obtain target label characteristic description information;
performing feature fusion processing on the target image-text feature information and the target label feature description information to obtain target feature information;
and determining at least one label from the preset label set as the label information of the multimedia resource according to the target characteristic information.
In one possible implementation manner, the multi-modal feature extraction model includes an image feature extraction module, a text feature extraction module and a feature fusion module; the step of inputting the target image and the target text into a multi-modal feature extraction model for feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed comprises the following steps:
inputting the target image into the image feature extraction module, and performing image feature extraction processing to obtain target image feature information;
inputting the target text into the text feature extraction module, and performing text feature extraction processing to obtain target text feature information;
and inputting the target image characteristic information and the target text characteristic information into the characteristic fusion module for characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation, the image feature extraction module includes a convolution module, a first down-sampling module, a first full-connection layer, a second down-sampling module, and a second full-connection layer; the step of inputting the target image into the image feature extraction module to perform image feature extraction processing to obtain target image feature information comprises:
inputting the target image into the convolution module, and performing feature extraction processing to obtain initial image feature information;
inputting the initial image characteristic information into the first down-sampling module, and performing down-sampling processing to obtain first image characteristic information of a first scale;
inputting the first image characteristic information into the second down-sampling module, and performing down-sampling processing to obtain second image characteristic information of a second scale;
inputting the first image feature information into the first full-connection layer, and performing feature length adjustment processing to obtain third image feature information with a preset length;
inputting the second image characteristic information into the second full-connection layer, and performing characteristic length adjustment processing to obtain fourth image characteristic information with a preset length;
and taking the third image characteristic information and the fourth image characteristic information as the target image characteristic information.
In one possible implementation manner, the text feature extraction module includes a first text feature extraction unit, a third full-connected layer, a second text feature extraction unit, and a fourth full-connected layer; the step of inputting the target text into the text feature extraction module to perform text feature extraction processing to obtain target text feature information comprises:
inputting the target text into the first text feature extraction unit, and performing text feature extraction processing to obtain first text feature information;
inputting the first text feature information into the second text feature extraction unit, and performing text feature extraction processing to obtain second text feature information;
inputting the first text characteristic information into the third full-connection layer, and performing characteristic length adjustment processing to obtain third text characteristic information with a preset length;
inputting the second text characteristic information into the fourth full-connection layer, and performing characteristic length adjustment processing to obtain fourth text characteristic information with a preset length;
and taking the third text characteristic information and the fourth text characteristic information as the target text characteristic information.
In one possible implementation, the feature fusion module includes a first feature fusion module and a second feature fusion module; the step of inputting the target image characteristic information and the target text characteristic information into the characteristic fusion module for characteristic fusion processing to obtain the target image-text characteristic information comprises the following steps:
inputting the third image characteristic information and the third text characteristic information into the first characteristic fusion module, and performing image-text characteristic fusion processing to obtain first image-text characteristic information;
and inputting the fourth image characteristic information, the fourth text characteristic information and the first image-text characteristic information into the second characteristic fusion module, and performing image-text characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation, the graph convolution network includes a first graph convolution module and a second graph convolution module; the step of inputting the label characteristic information into a graph convolution network to carry out label characteristic correlation processing to obtain target label characteristic description information comprises the following steps:
inputting the label characteristic information into the first graph convolution module, and performing label characteristic correlation processing to obtain label characteristic description information to be processed;
and inputting the label feature description information to be processed into the second graph convolution module, and performing label feature correlation processing to obtain the target label feature description information.
In a possible implementation manner, after the step of inputting the third image feature information and the third text feature information into the first feature fusion module to perform image-text feature fusion processing to obtain first image-text feature information, the tag classification method further includes:
performing feature fusion processing on the first image-text feature information and the tag feature description information to be processed to obtain second image-text feature information;
the step of inputting the fourth image feature information, the fourth text feature information and the first image-text feature information into the second feature fusion module to perform image-text feature fusion processing to obtain the target image-text feature information includes:
and inputting the fourth image characteristic information, the fourth text characteristic information and the second image-text characteristic information into the second characteristic fusion module, and performing image-text characteristic fusion processing to obtain the target image-text characteristic information.
In a possible implementation manner, the step of determining at least one tag from the preset tag set as the tag information of the multimedia resource according to the target feature information includes:
and inputting the target characteristic information into a target full-connection layer, and performing classification processing to obtain the label information.
In a possible implementation manner, the step of obtaining the tag feature information corresponding to the preset tag set includes:
acquiring label correlation information between every two labels in the preset label set and weight information of the target full-connection layer;
using the weight information as label feature description information;
and taking the label correlation information and the label feature description information as label feature information corresponding to the preset label set.
In one possible implementation manner, the tag classification method further includes:
obtaining a plurality of sample multimedia resources and corresponding sample labels; the plurality of sample multimedia resources comprise a corresponding plurality of sample images and a plurality of sample texts;
inputting the sample images and the sample texts into a preset feature extraction model, and performing feature extraction processing to obtain first sample image-text feature information;
inputting the first sample image-text characteristic information into a preset full-connection layer, and performing classification processing to obtain a first prediction label;
acquiring first loss information according to the sample label and the first prediction label;
training the preset feature extraction model and the preset full connection layer according to the first loss information until the first loss information meets a preset condition, and obtaining the multi-modal feature extraction model and a target full connection layer.
In a possible implementation manner, after the step of obtaining a plurality of sample multimedia resources and corresponding sample labels, the label classification method further includes:
inputting the plurality of sample images and the plurality of sample texts into the multi-modal feature extraction model, and performing feature extraction processing to obtain second sample image-text feature information;
inputting the label characteristic information into a preset graph convolution network, and performing label characteristic correlation processing to obtain sample label characteristic description information;
performing feature fusion processing on the second sample image-text feature information and the sample label feature description information to obtain sample feature information;
inputting the sample characteristic information into the target full-connection layer, and performing classification processing to obtain a second prediction label;
acquiring second loss information according to the sample label and the second prediction label;
and training the preset graph convolution network according to the second loss information until the second loss information meets a preset condition to obtain the graph convolution network.
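Read as a whole, this describes a two-stage procedure: first train the feature extraction model together with a plain fully-connected classifier, then freeze both and train the graph convolution network. The sketch below is a minimal PyTorch rendering of that schedule; the BCE loss, the Adam optimizer, and the module objects (extractor, fc, gcn, proj, loader, a_hat, h_l) are assumptions or placeholders, not specified by this disclosure.

```python
import torch
import torch.nn as nn

# Placeholders: extractor, fc, gcn, proj, loader, a_hat and h_l are assumed
# to be constructed elsewhere (see the sketches in the detailed description).
criterion = nn.BCEWithLogitsLoss()  # a common multi-label loss (an assumption)

# Stage 1: train the preset feature extraction model and preset fully-connected
# layer; once the first loss information satisfies the preset condition, they
# become the multi-modal feature extraction model and target fully-connected layer.
opt1 = torch.optim.Adam(list(extractor.parameters()) + list(fc.parameters()))
for images, texts, labels in loader:
    logits = fc(extractor(images, texts))      # first sample image-text features
    loss1 = criterion(logits, labels)          # first loss information
    opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: freeze the stage-1 weights and train the preset graph convolution
# network (and, in this sketch, a feature-length-adjustment projection proj).
for p in list(extractor.parameters()) + list(fc.parameters()):
    p.requires_grad_(False)
opt2 = torch.optim.Adam(list(gcn.parameters()) + list(proj.parameters()))
for images, texts, labels in loader:
    feats = extractor(images, texts)           # second sample image-text features
    label_desc = gcn(a_hat, h_l)               # sample label feature descriptions
    fused = feats @ label_desc                 # feature fusion -> sample features
    logits = fc(proj(fused))                   # classify with the target layer
    loss2 = criterion(logits, labels)          # second loss information
    opt2.zero_grad(); loss2.backward(); opt2.step()
```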
According to a second aspect of the embodiments of the present disclosure, there is provided a tag classification apparatus for a multimedia resource, including:
the model input information acquisition module is configured to execute acquisition of a target image and a target text corresponding to the multimedia resource to be processed and label characteristic information corresponding to a preset label set;
the target image-text characteristic information acquisition module is configured to input the target image and the target text into a multi-mode characteristic extraction model for characteristic extraction processing to obtain target image-text characteristic information of the multimedia resource to be processed;
the target label characteristic description information acquisition module is configured to input the label characteristic information into a graph convolution network and perform label characteristic correlation processing to obtain target label characteristic description information;
the target characteristic information acquisition module is configured to perform characteristic fusion processing on the target image-text characteristic information and the target label characteristic description information to obtain target characteristic information;
and the label information acquisition module is configured to determine at least one label from the preset label set as the label information of the multimedia resource according to the target characteristic information.
In one possible implementation manner, the multi-modal feature extraction model includes an image feature extraction module, a text feature extraction module and a feature fusion module; the target image-text characteristic information acquisition module comprises:
the target image characteristic information acquisition unit is configured to input the target image into the image characteristic extraction module, and perform image characteristic extraction processing to obtain target image characteristic information;
the target text characteristic information acquisition unit is configured to input the target text into the text characteristic extraction module, and perform text characteristic extraction processing to obtain target text characteristic information;
and the target image-text characteristic information acquisition unit is configured to input the target image characteristic information and the target text characteristic information into the characteristic fusion module for characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation, the image feature extraction module includes a convolution module, a first down-sampling module, a first full-connection layer, a second down-sampling module, and a second full-connection layer; the target image feature information acquisition unit includes:
the initial image characteristic information acquisition unit is configured to input the target image into the convolution module for characteristic extraction processing to obtain initial image characteristic information;
the first image characteristic information acquisition unit is configured to input the initial image characteristic information into the first down-sampling module for down-sampling processing to obtain first image characteristic information of a first scale;
the second image characteristic information acquisition unit is configured to input the first image characteristic information into the second down-sampling module for down-sampling processing to obtain second image characteristic information of a second scale;
a third image characteristic information obtaining unit, configured to perform input of the first image characteristic information into the first full connection layer, and perform characteristic length adjustment processing to obtain third image characteristic information of a preset length;
the fourth image characteristic information acquisition unit is configured to input the second image characteristic information into the second full connection layer, and perform characteristic length adjustment processing to obtain fourth image characteristic information with a preset length;
a target image feature information determination unit configured to take the third image feature information and the fourth image feature information as the target image feature information.
In one possible implementation manner, the text feature extraction module includes a first text feature extraction unit, a third full-connected layer, a second text feature extraction unit, and a fourth full-connected layer; the target text feature information acquiring unit includes:
the first text characteristic information acquisition unit is configured to input the target text into the first text characteristic extraction unit, and perform text characteristic extraction processing to obtain first text characteristic information;
the second text characteristic information acquisition unit is configured to input the first text characteristic information into the second text characteristic extraction unit, and perform text characteristic extraction processing to obtain second text characteristic information;
a third text characteristic information obtaining unit, configured to perform input of the first text characteristic information into the third full connection layer, and perform characteristic length adjustment processing to obtain third text characteristic information of a preset length;
the fourth text characteristic information acquisition unit is configured to input the second text characteristic information into the fourth full connection layer, and perform characteristic length adjustment processing to obtain fourth text characteristic information with a preset length;
a target text feature information determination unit configured to take the third text feature information and the fourth text feature information as the target text feature information.
In one possible implementation, the feature fusion module includes a first feature fusion module and a second feature fusion module; the target image-text characteristic information acquisition unit comprises:
the first image-text characteristic information acquisition unit is configured to input the third image characteristic information and the third text characteristic information into the first characteristic fusion module, and perform image-text characteristic fusion processing to obtain first image-text characteristic information;
and the first target image-text characteristic information acquisition unit is configured to input the fourth image characteristic information, the fourth text characteristic information and the first image-text characteristic information into the second characteristic fusion module for image-text characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation, the graph convolution network includes a first graph convolution module and a second graph convolution module; the target tag feature description information acquisition module comprises:
the tag feature description information to be processed acquiring unit is configured to input the tag feature information into the first graph convolution module for tag feature correlation processing to obtain tag feature description information to be processed;
and the target label feature description information acquisition unit is configured to input the to-be-processed label feature description information into the second graph convolution module, and perform label feature correlation processing to obtain the target label feature description information.
In a possible implementation manner, the tag classification apparatus further includes:
the second image-text characteristic information acquisition module is configured to perform characteristic fusion processing on the first image-text characteristic information and the to-be-processed label characteristic description information to obtain second image-text characteristic information;
the target image-text characteristic information acquisition unit further comprises:
and the second target image-text characteristic information acquisition unit is configured to input the fourth image characteristic information, the fourth text characteristic information and the second image-text characteristic information into the second characteristic fusion module for image-text characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation manner, the tag information obtaining module includes:
and the label information acquisition unit is configured to input the target feature information into a target full-connection layer and perform classification processing to obtain the label information.
In one possible implementation manner, the model input information obtaining module includes:
a tag correlation information and weight information acquiring unit configured to perform acquiring tag correlation information between every two tags in the preset tag set and weight information of the target full connection layer;
a tag feature description information acquisition unit configured to take the weight information as tag feature description information;
a tag feature information obtaining unit configured to take the tag correlation information and the tag feature description information as the tag feature information corresponding to the preset tag set.
In a possible implementation manner, the tag classification apparatus further includes:
a training data acquisition module configured to perform acquisition of a plurality of sample multimedia resources and corresponding sample labels; the plurality of sample multimedia resources comprise a corresponding plurality of sample images and a plurality of sample texts;
the first sample image-text characteristic information acquisition module is configured to input the plurality of sample images and the plurality of sample texts into a preset characteristic extraction model for characteristic extraction processing to obtain first sample image-text characteristic information;
the first prediction label acquisition module is configured to input the first sample image-text characteristic information into a preset full connection layer for classification processing to obtain a first prediction label;
a first loss information obtaining module configured to obtain first loss information according to the sample label and the first prediction label;
the first training module is configured to train the preset feature extraction model and the preset full connection layer according to the first loss information until the first loss information meets a preset condition, so that the multi-modal feature extraction model and the target full connection layer are obtained.
In a possible implementation manner, the tag classification apparatus further includes:
the second sample image-text characteristic information acquisition module is configured to input the plurality of sample images and the plurality of sample texts into the multi-modal characteristic extraction model for characteristic extraction processing to obtain second sample image-text characteristic information;
the sample label feature description information acquisition module is configured to input the label feature information into a preset graph convolution network, and perform label feature correlation processing to obtain sample label feature description information;
the sample characteristic information acquisition module is configured to perform characteristic fusion processing on the second sample image-text characteristic information and the sample label characteristic description information to obtain sample characteristic information;
the second prediction label acquisition module is configured to input the sample characteristic information into the target full-connection layer for classification processing to obtain a second prediction label;
a second loss information obtaining module configured to obtain second loss information according to the sample label and the second prediction label;
and the second training module is configured to train the preset graph convolution network according to the second loss information until the second loss information meets a preset condition, so as to obtain the graph convolution network.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of any of the first aspects above.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of the first aspect of the embodiments of the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, cause a computer to perform the method of any one of the first aspects of the embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the target image-text feature information of the multimedia resource to be processed is obtained through the multi-modal feature extraction model, so the scheme can be effectively applied to label classification of multimedia resources; moreover, by performing label feature correlation processing with the graph convolution network and performing feature fusion processing on the target image-text feature information and the target label feature description information, the correlated fusion of the multi-modal features of the multimedia resource and the label feature information is realized, so that the target feature information has richer semantic expression and characterizes the content of the multimedia resource more accurately; that is, the degree of understanding of the content of the multimedia resource can be improved, and the accuracy of multimedia resource label classification can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating an application environment in accordance with an exemplary embodiment.
Fig. 2 is a flowchart illustrating a method of tag classification for a multimedia asset according to an exemplary embodiment.
FIG. 3 is a block diagram illustrating a tag classification model according to an exemplary embodiment.
Fig. 4 is a flowchart illustrating a method for inputting a target image and a target text into a multi-modal feature extraction model to perform feature extraction processing, so as to obtain target image-text feature information of a multimedia resource to be processed according to an exemplary embodiment.
Fig. 5 is a flowchart illustrating a method for obtaining tag feature information corresponding to a preset tag set according to an exemplary embodiment.
FIG. 6 is a block diagram illustrating a tag classification model in accordance with an exemplary embodiment.
Fig. 7 is a flowchart illustrating a method for inputting a target image into an image feature extraction module to perform an image feature extraction process to obtain target image feature information according to an exemplary embodiment.
Fig. 8 is a flowchart illustrating a method for inputting a target text into a text feature extraction module to perform text feature extraction processing to obtain target text feature information according to an exemplary embodiment.
Fig. 9 is a flowchart of a method for inputting feature information of a target image and feature information of a target text into a feature fusion module to perform feature fusion processing to obtain feature information of the target image and text according to an exemplary embodiment.
Fig. 10 is a flowchart illustrating a method for inputting tag feature information into a graph convolution network to perform tag feature correlation processing to obtain target tag feature description information according to an exemplary embodiment.
Fig. 11 is a flowchart of a method for inputting feature information of a target image and feature information of a target text into a feature fusion module to perform feature fusion processing to obtain feature information of the target image and text according to an exemplary embodiment.
FIG. 12 is a block diagram illustrating a tag classification model in accordance with an exemplary embodiment.
FIG. 13 is a flowchart illustrating a method for training a multi-modal feature extraction model and a target fully connected layer, according to an example embodiment.
FIG. 14 is an architectural diagram illustrating a pre-defined feature extraction model and pre-defined fully connected layers according to an exemplary embodiment.
FIG. 15 is a flowchart illustrating training of a graph convolution network, according to an example embodiment.
FIG. 16 is an architecture diagram illustrating a pre-set label classification model in accordance with an exemplary embodiment.
Fig. 17 is a block diagram illustrating an apparatus for classifying tags of a multimedia asset according to an exemplary embodiment.
FIG. 18 is a block diagram illustrating an electronic device for tag classification of multimedia assets in accordance with an exemplary embodiment.
FIG. 19 is a block diagram illustrating an electronic device for tag classification of multimedia assets in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment according to an exemplary embodiment. As shown in fig. 1, the application environment may include a server 01 and a terminal 02.
In an alternative embodiment, the server 01 may be used for training of a multi-modal feature extraction model and a graph convolution network (GCN, Graph Convolutional Network), or for training of a label classification model, where the label classification model may include a multi-modal feature extraction model, a graph convolution network and a target full-connection layer. Specifically, the server 01 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), and a big data and artificial intelligence platform.
In an alternative embodiment, the terminal 02 may be used in conjunction with the server 01 to perform the label classification method for multimedia resources, where the multi-modal feature extraction model and the graph convolution network used by the terminal 02 may be trained by the server 01 and then sent to the terminal 02. Specifically, the terminal 02 may include, but is not limited to, a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an Augmented Reality (AR)/Virtual Reality (VR) device, a smart wearable device, and other types of electronic devices. Optionally, the operating system running on the electronic device may include, but is not limited to, Android, iOS, Linux, Windows, and the like.
In addition, it should be noted that fig. 1 illustrates only one application environment of the label classification method provided by the present disclosure; other divisions of work are possible. For example, the server 01 may perform the label classification method of the multimedia resource, and the terminal 02 may perform the training of the multi-modal feature extraction model and the graph convolution network. The present disclosure is not limited thereto.
In the embodiment of the present specification, the server 01 and the terminal 02 may be directly or indirectly connected by a wired or wireless communication method, and the present application is not limited herein.
Fig. 2 is a flowchart illustrating a method of tag classification for a multimedia asset according to an exemplary embodiment. As shown in fig. 2, the following steps may be included.
In step S201, a target image and a target text corresponding to a multimedia resource to be processed and tag feature information corresponding to a preset tag set are obtained.
In this embodiment of the present specification, the preset tag set may be a set including a preset number of tags, where the preset number of tags may be determined according to actual needs or statistics. Further, the tags in the preset tag set can be used for carrying out tag classification on the multimedia resource to be processed.
In practical applications, the label feature information may take the form of a directed graph over the labels in the preset label set: each vertex corresponds to a label in the preset label set and is described by a feature vector, each edge indicates that a correlation exists between two labels, and the weights of the edges may be obtained statistically. The feature vectors of the labels, the edges between the labels and the weights of the edges together form the directed graph, and the directed graph may be used as the label feature information. The directed graph may be represented in the form of a matrix or a vector, neither of which is limited by this disclosure.
In one example, the preset label set may be represented as [label 1, label 2, ..., label n], where n is the preset number, for example 100 or 284, which is not limited by the present disclosure.
In practical applications, a multimedia resource that needs label classification may be used as the multimedia resource to be processed; for example, a short video that needs label classification may be used as the multimedia resource to be processed. An image related to the multimedia resource to be processed may be obtained as the target image corresponding to it, and a text related to the multimedia resource to be processed may be obtained as the corresponding target text. The target image may be an image capable of characterizing the content of the multimedia resource to be processed. For example, when the multimedia resource to be processed is a short video, a cover image of the short video may be taken as the target image and text related to the short video may be taken as the target text. The text related to the short video may include text appearing in the short video, the title text of the short video, the description text of the short video, text corresponding to the audio in the short video, and the like, which is not limited by this disclosure.
Optionally, an image related to the multimedia resource to be processed may be compressed into an image of a first preset pixel size in a preset format, the compressed image may be randomly cropped to obtain an image of a second preset pixel size, and the cropped image may be used as the target image. In one example, the first preset pixel size may be 256 × 256, the second preset pixel size may be 224 × 224, and the preset format may be the RGB (red, green, blue) format. The first preset pixel size, the second preset pixel size and the preset format are not limited by the present disclosure, as long as the target image satisfies the input requirements of the multi-modal feature extraction model.
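A minimal sketch of this preprocessing, assuming torchvision; the file name, the interpolation and the absence of normalization are illustrative choices, not requirements of this disclosure.

```python
# A preprocessing sketch under the pixel sizes named above; the file name and
# the library choice (torchvision) are assumptions for illustration.
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),   # first preset pixel size
    transforms.RandomCrop(224),      # second preset pixel size
    transforms.ToTensor(),           # RGB image -> [3, 224, 224] tensor
])

target_image = preprocess(Image.open("cover.jpg").convert("RGB"))
```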
In step S203, the target image and the target text are input into the multi-modal feature extraction model, and feature extraction processing is performed to obtain target image-text feature information of the multimedia resource to be processed.
In the embodiment of the present specification, the target image and the target text may be input into the multi-modal feature extraction model, and feature extraction processing may be performed to obtain the target image-text feature information of the multimedia resource to be processed. For example, the image features and the text features may be extracted, and the image features and the text features may be fused to obtain the target image-text feature information. The target image-text feature information may be a feature vector or a feature matrix, which is not limited by this disclosure.
In step S205, the tag feature information is input into the graph convolution network, and tag feature correlation processing is performed to obtain target tag feature description information.
In this embodiment of the present specification, the tag feature information may be input into a graph convolution network, and tag feature correlation processing may be performed to obtain target tag feature description information. For example, a directed graph of tags in a preset tag set may be input into a graph convolution network, and tag feature correlation processing is performed to obtain target tag feature description information.
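The disclosure does not pin down the propagation rule inside the graph convolution network; the sketch below uses the common Kipf-and-Welling form H' = σ(Â·H·W), with two stacked layers mirroring the first and second graph convolution modules described later. The dimensions, the LeakyReLU activation and the placeholder inputs are assumptions.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution layer: H' = LeakyReLU(A_hat @ H @ W).

    A_hat is the (normalized) label correlation matrix over n labels and H
    holds one feature-description row per label. The propagation rule is an
    assumption; the disclosure only states that label feature correlation
    processing is performed.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, a_hat: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return self.act(a_hat @ h @ self.weight)

n = 100                        # illustrative label count
a_hat = torch.eye(n)           # placeholder correlation matrix A_l
h_l = torch.randn(n, 512)      # placeholder label descriptions (rows of H_l)

# Two stacked layers, mirroring the first and second graph convolution modules.
gcn1, gcn2 = GraphConv(512, 1024), GraphConv(1024, 512)
target_label_desc = gcn2(a_hat, gcn1(a_hat, h_l))  # target label feature descriptions
```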
In step S207, feature fusion processing is performed on the target image-text feature information and the target tag feature description information to obtain target feature information.
In this embodiment of the present specification, feature fusion processing may be performed on the target image-text feature information and the target tag feature description information to obtain target feature information. In one example, the target image-text feature information and the target label feature description information may be multiplied to implement the feature fusion process, so as to obtain the target feature information. The multiplication process may be a matrix multiplication process, which is not limited by this disclosure.
The target feature information represents the result of fusing the multi-modal target image-text feature information with the target label feature description information obtained through label feature correlation processing, and it can be used to characterize the label information of the multimedia resource to be processed.
In step S209, at least one tag is determined from the preset tag set as tag information of the multimedia resource according to the target feature information.
In this embodiment of the present specification, at least one tag may be determined from the preset tag set as the tag information of the multimedia resource according to the target feature information. In one example, when the at least one tag determined from the preset tag set according to the target feature information includes tag 1 and tag 2, the tag information may be represented as [1, 1, 0, ..., 0], that is, tag 1 and tag 2 in the preset tag set may be set to 1 and the other tags may be set to 0. Alternatively, the tag information may be determined to be tag 1 and tag 2. The form of the tag information is not limited by the present disclosure.
In one possible implementation, step S209 may include: inputting the target feature information into a target full-connection layer and performing classification processing to obtain the label information. Performing the label classification processing through the target full-connection layer can improve the precision of the label information and the efficiency of the label classification.
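To make steps S207 and S209 concrete, the sketch below chains the matrix-multiplication fusion, the feature-length adjustment described with fig. 3 below, and the target full-connection layer; the dimensions, the sigmoid activation and the 0.5 threshold are assumptions for illustration.

```python
import torch
import torch.nn as nn

d, n = 512, 100                     # preset feature length and label count (illustrative)
img_text_feat = torch.randn(1, d)   # target image-text feature information (step S203)
label_desc = torch.randn(d, n)      # target label feature description information (S205)

fused = img_text_feat @ label_desc                 # matrix-multiplication fusion -> [1, n]
length_adjust = nn.Conv1d(n, d, kernel_size=1)     # the d 1x1 convolutions of fig. 3
target_feat = length_adjust(fused.unsqueeze(2)).squeeze(2)  # target feature info, [1, d]

target_fc = nn.Linear(d, n)                        # target full-connection layer
probs = torch.sigmoid(target_fc(target_feat))
label_info = (probs > 0.5).int()                   # e.g. [1, 1, 0, ..., 0] as above
```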
Optionally, the tag information may be used to tag the multimedia resource to be processed, or based on the tag information, recommendation of the multimedia resource, search of the multimedia resource, and the like may be performed.
The target image-text feature information of the multimedia resource to be processed is obtained through the multi-modal feature extraction model, so the scheme can be effectively applied to label classification of multimedia resources; moreover, by performing label feature correlation processing with the graph convolution network and performing feature fusion processing on the target image-text feature information and the target label feature description information, the correlated fusion of the multi-modal features of the multimedia resource and the label feature information is realized, so that the target feature information has richer semantic expression and characterizes the content of the multimedia resource more accurately; that is, the degree of understanding of the content of the multimedia resource can be improved, and the accuracy of multimedia resource label classification can be improved.
FIG. 3 is a block diagram illustrating a tag classification model according to an exemplary embodiment. Fig. 4 is a flowchart illustrating a method for inputting a target image and a target text into a multi-modal feature extraction model to perform feature extraction processing, so as to obtain target image-text feature information of a multimedia resource to be processed according to an exemplary embodiment.
In one possible implementation, as shown in FIG. 3, the label classification model may include a multi-modal feature extraction model and a graph convolution network. The multi-modal feature extraction model can include an image feature extraction module, a text feature extraction module, and a feature fusion module.
On the basis of the multi-modal feature extraction model in fig. 3, as shown in fig. 4, in a possible implementation manner, the step S203 may include:
in step S401, inputting the target image into an image feature extraction module, and performing image feature extraction processing to obtain target image feature information;
in step S403, inputting the target text into a text feature extraction module, and performing text feature extraction processing to obtain target text feature information;
in step S405, the target image feature information and the target text feature information are input to the feature fusion module for feature fusion processing, so as to obtain the target image-text feature information.
In the embodiment of the present specification, a target image G may be input to an image feature extraction module, and image feature extraction processing is performed to obtain target image feature information; and the target text T can be input into a text feature extraction module to carry out text feature extraction processing to obtain target text feature information. Further, the feature information of the target image and the feature information of the target text can be input into the feature fusion module for feature fusion processing to obtain the feature information of the target image and text.
Alternatively, referring to fig. 3, a matrix multiplication module and a convolution network may be used as the image-text and label fusion module. Accordingly, step S207 may include: inputting the target image-text feature information and the target label feature description information into the matrix multiplication module and performing matrix multiplication to obtain image-text feature information to be processed; and inputting the image-text feature information to be processed into the convolution network for feature length adjustment to obtain target feature information of a preset length. The convolution network may consist of d 1 × 1 convolutions, where d may be the preset length, which is not limited by this disclosure.
In one example, the image feature extraction module may be an image feature extraction neural network whose backbone may be the residual network Resnet-50, where Resnet-50 may include 50 convolutional layers and 4 down-sampling modules (4 blocks); the text feature extraction module may be a text feature extraction neural network whose backbone may be a Bert (Bidirectional Encoder Representations from Transformers) network, and the Bert network may include 12 hidden layers; the feature fusion module may be a neural network for feature fusion, which may be, for example, a network comprising an attention layer. The present disclosure is not limited to these.
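As a compact, single-scale reading of such a model (the two-scale variant of fig. 6 is sketched further below), using torchvision's Resnet-50 and a Hugging Face Bert as the backbones named above; the pretrained checkpoint name, the pooling choices and the multi-head-attention fusion are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class MultiModalExtractor(nn.Module):
    """Single-scale sketch: Resnet-50 image branch, Bert text branch, and a
    multi-head-attention fusion of the two 512-d features."""
    def __init__(self, out_dim: int = 512):
        super().__init__()
        backbone = resnet50()
        self.image_net = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.image_fc = nn.Linear(2048, out_dim)
        self.text_net = BertModel.from_pretrained("bert-base-chinese")   # 12 hidden layers
        self.text_fc = nn.Linear(self.text_net.config.hidden_size, out_dim)
        self.fusion = nn.MultiheadAttention(out_dim, num_heads=8, batch_first=True)

    def forward(self, image, input_ids, attention_mask):
        img = self.image_fc(self.image_net(image).flatten(1))            # [B, out_dim]
        txt = self.text_fc(
            self.text_net(input_ids, attention_mask=attention_mask).pooler_output
        )                                                                # [B, out_dim]
        pair = torch.stack([img, txt], dim=1)                            # [B, 2, out_dim]
        fused, _ = self.fusion(pair, pair, pair)                         # attention fusion
        return fused.mean(dim=1)             # target image-text feature information
```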
By configuring the multi-modal feature extraction model to include the image feature extraction module, the text feature extraction module and the feature fusion module, the multi-modal features of the image and the text can be efficiently extracted and fused, which improves the efficiency and accuracy of obtaining the target image-text feature information.
Fig. 5 is a flowchart illustrating a method for obtaining tag feature information corresponding to a preset tag set according to an exemplary embodiment. As shown in fig. 5, in a possible implementation manner, the step S201 may include the following steps:
in step S501, label correlation information between every two labels in the preset label set and weight information of the target full link layer are obtained.
In the embodiment of the present specification, the label correlation information between every two labels in the preset label set may be obtained in a statistical manner. For example, the label correlation information may be an n × n square matrix of the conditional probabilities P between pairs of labels, where n may be the number of labels in the preset label set. As an example, n may be 3, and the labels in the preset label set may be basketball, boy and girl. The label correlation information A_l may then be

    A_l = [ 1.00  0.85  0.06 ]
          [ 0.07  1.00  0.20 ]
          [ 0.01  0.50  1.00 ]

where the entry in row i and column j is the conditional probability of label j given label i, that is: P(basketball | basketball) = 1, P(boy | basketball) = 0.85, P(girl | basketball) = 0.06; P(basketball | boy) = 0.07, P(boy | boy) = 1, P(girl | boy) = 0.2; P(basketball | girl) = 0.01, P(boy | girl) = 0.5, P(girl | girl) = 1.
In this embodiment, the weight information of the target fully-connected layer may be obtained; for example, the weight information of the target fully-connected layer in fig. 3 may be obtained. The weight information may be a d × n matrix, where d may be the dimension of the target feature information and n may be the number of labels in the preset label set.
In step S503, the weight information is taken as tag feature description information;
in step S505, the tag correlation information and the tag feature description information are used as tag feature information corresponding to a preset tag set.
In this embodiment, the weight information may be used as the label feature description information H_l, and the label correlation information and the label feature description information may be used as the label feature information corresponding to the preset label set. Accordingly, A_l may be taken as the edges of the directed graph together with their corresponding weights, and H_l may be taken as the vertices of the directed graph.
Using the conditional probabilities between pairs of labels as the label correlation information and the weight information of the target full-connection layer as the label feature description information realizes a directed-graph expression of the labels and yields the input of the graph convolution network. In addition, because the weight information of the target full-connection layer is learned from actual data, the label feature description information carries semantics consistent with the actual data distribution, which avoids the bias that a purely textual, single-modal description of the labels would introduce and makes the label feature description information more convenient and efficient to obtain.
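A sketch of how the two pieces of label feature information might be assembled, assuming a label co-occurrence count matrix gathered from the training annotations; the formula P(label_j | label_i) = count(i, j) / count(i) follows the statistical construction described above, and the target_fc object is a hypothetical trained classifier.

```python
import torch

def label_correlation(cooccur: torch.Tensor) -> torch.Tensor:
    """A_l[i, j] = P(label_j | label_i) from co-occurrence counts.

    cooccur[i, j] counts training samples annotated with both label_i and
    label_j; the diagonal holds per-label counts, so the diagonal of the
    result is 1, matching the basketball/boy/girl example above.
    """
    return cooccur / cooccur.diagonal().unsqueeze(1).clamp(min=1)

# Label feature description information H_l: the weight of the trained
# target full-connection layer. For an nn.Linear(d, n) the weight is
# stored as [n, d], so it is transposed to the d x n matrix described above:
# h_l = target_fc.weight.t().detach()
```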
FIG. 6 is a block diagram illustrating a tag classification model in accordance with an exemplary embodiment. As shown in fig. 6, in one possible implementation, the image feature extraction module may include a convolution module, a first down-sampling module, a first fully-connected layer, a second down-sampling module, and a second fully-connected layer; the text feature extraction module can comprise a first text feature extraction unit, a third full-connection layer, a second text feature extraction unit and a fourth full-connection layer; the feature fusion module may include a first feature fusion module and a second feature fusion module; the graph convolution network may include a first graph convolution module and a second graph convolution module. Optionally, as shown in fig. 6, the label classification model may further include a first convolutional network, a second convolutional network, and a target fully-connected layer. The structure of the label classification model is not limited by the present disclosure.
Fig. 7 is a flowchart illustrating a method for inputting a target image into an image feature extraction module to perform an image feature extraction process to obtain target image feature information according to an exemplary embodiment. As shown in fig. 7, in a possible implementation manner, the step S401 may include the following steps:
in step S701, the target image is input to the convolution module, and feature extraction processing is performed to obtain initial image feature information.
In this embodiment, the target image may be input to the convolution module, and feature extraction processing may be performed to obtain initial image feature information. In one example, the convolution module may be the convolutional layers of a ResNet-50 network. This is merely an example and is not intended to limit the present disclosure.
In step S703, inputting the initial image feature information into a first down-sampling module, and performing down-sampling processing to obtain first image feature information of a first scale;
in step S705, the first image feature information is input into the second down-sampling module, and down-sampling processing is performed to obtain second image feature information of a second scale.
In this embodiment of the present specification, the initial image feature information may be input into a first down-sampling module, and be subjected to down-sampling processing, for example, 1/2 down-sampling processing, to obtain first image feature information of a first scale; and the first image feature information may be input into a second down-sampling module, and down-sampled, for example, 1/2 down-sampled, to obtain second image feature information at a second scale.
The first scale may refer to the feature length of the first image feature information, the second scale may refer to the feature length of the second image feature information, and the second scale may be larger than the first scale.
In step S707, inputting the first image feature information into the first full connection layer, and performing feature length adjustment processing to obtain third image feature information with a preset length;
in step S709, inputting the second image feature information into the second full connection layer, and performing feature length adjustment processing to obtain fourth image feature information with a preset length;
in step S711, the third image feature information and the fourth image feature information are set as target image feature information.
In practical application, in order to allow fusion with the target text feature information in the feature fusion module, the feature length of the target image feature information may be set to be the same as the feature length of the target text feature information. The preset length may be 512, which is not limited by this disclosure.
In this embodiment of the present description, as shown in fig. 6, the first image feature information may be input into the first fully-connected layer, and feature length adjustment processing may be performed to obtain third image feature information with the preset length; likewise, the second image feature information may be input into the second fully-connected layer for feature length adjustment processing to obtain fourth image feature information with the preset length. The third image feature information and the fourth image feature information can then be taken as the target image feature information.
By configuring the image feature extraction module to include a convolution module, a first down-sampling module, a first fully-connected layer, a second down-sampling module and a second fully-connected layer, deep extraction of image features can be achieved, making the target image feature information more accurate.
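The following sketch illustrates one plausible arrangement of this two-scale image branch. Mapping the convolution and down-sampling modules onto ResNet-50 stages, and the average pooling before each fully-connected layer, are assumptions made only for illustration:

```python
import torch
import torchvision

class ImageBranch(torch.nn.Module):
    def __init__(self, preset_len: int = 512):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)
        # Convolution module: the ResNet-50 stem plus its first two stages.
        self.conv = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                        r.layer1, r.layer2)
        self.down1 = r.layer3                         # first down-sampling module
        self.down2 = r.layer4                         # second down-sampling module
        self.pool = torch.nn.AdaptiveAvgPool2d(1)
        self.fc1 = torch.nn.Linear(1024, preset_len)  # first fully-connected layer
        self.fc2 = torch.nn.Linear(2048, preset_len)  # second fully-connected layer

    def forward(self, image):
        x = self.conv(image)                        # initial image feature information
        f1 = self.down1(x)                          # first image feature information
        f2 = self.down2(f1)                         # second image feature information
        t3 = self.fc1(self.pool(f1).flatten(1))     # third image feature information
        t4 = self.fc2(self.pool(f2).flatten(1))     # fourth image feature information
        return t3, t4                               # target image feature information
```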
Fig. 8 is a flowchart illustrating a method for inputting a target text into a text feature extraction module to perform text feature extraction processing to obtain target text feature information according to an exemplary embodiment. As shown in fig. 8, in one possible implementation, the step S403 may include the following steps:
in step S801, inputting a target text into a first text feature extraction unit, and performing text feature extraction processing to obtain first text feature information;
in step S803, the first text feature information is input to the second text feature extraction unit, and text feature extraction processing is performed to obtain second text feature information.
In this embodiment of the present specification, the target text may be input to the first text feature extraction unit to obtain first text feature information, and the first text feature information may be input to the second text feature extraction unit to obtain second text feature information. In an example, the first text feature extraction unit and the second text feature extraction unit may each be a preset number of hidden layers of a BERT network, where the preset number may be 3, which is not limited by this disclosure. In that case, the first text feature information may be the feature output at the [CLS] token by the first text feature extraction unit, and the second text feature information may be the feature output at the [CLS] token by the second text feature extraction unit. In one example, the feature length of the first text feature information and of the second text feature information may be 768, which is not limited by this disclosure.
Alternatively, when the first text feature extraction unit consists of a preset number of hidden layers of the BERT network, the target text may be input into the input layer of the BERT network, so that the output of the input layer serves as the input of the first text feature extraction unit.
In step S805, the first text feature information is input into the third full connection layer, and feature length adjustment processing is performed to obtain third text feature information with a preset length;
in step S807, inputting the second text feature information into the fourth full connection layer, and performing feature length adjustment processing to obtain fourth text feature information with a preset length;
in step S809, the third text feature information and the fourth text feature information are taken as target text feature information.
In this embodiment of the present specification, feature length adjustment processing may be performed on the first text feature information and the second text feature information, respectively, to obtain third text feature information with a preset length and fourth text feature information with a preset length, so as to obtain target text feature information with a preset length, so as to be suitable for fusion processing in the feature fusion module.
By configuring the text feature extraction module to include a first text feature extraction unit, a third fully-connected layer, a second text feature extraction unit and a fourth fully-connected layer, deep extraction of text features is achieved, which improves the efficiency and accuracy of extracting the target text feature information.
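A hedged sketch of the text branch, splitting a BERT encoder into two three-layer units and reading the [CLS] feature after each. The checkpoint name and the use of the Hugging Face transformers API are assumptions, not part of this disclosure:

```python
import torch
from transformers import BertModel

class TextBranch(torch.nn.Module):
    def __init__(self, preset_len: int = 512, unit_layers: int = 3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint
        self.unit_layers = unit_layers
        self.fc3 = torch.nn.Linear(768, preset_len)   # third fully-connected layer
        self.fc4 = torch.nn.Linear(768, preset_len)   # fourth fully-connected layer

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids, attention_mask=attention_mask,
                        output_hidden_states=True)
        h = out.hidden_states                   # embedding output + one entry per layer
        cls1 = h[self.unit_layers][:, 0]        # [CLS] after the first 3 hidden layers
        cls2 = h[2 * self.unit_layers][:, 0]    # [CLS] after the next 3 hidden layers
        return self.fc3(cls1), self.fc4(cls2)   # third / fourth text feature information
```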
Fig. 9 is a flowchart of a method for inputting feature information of a target image and feature information of a target text into a feature fusion module to perform feature fusion processing to obtain feature information of the target image and text according to an exemplary embodiment. As shown in fig. 9, in one possible implementation, the step S405 may include the following steps:
in step S901, inputting the third image feature information and the third text feature information into a first feature fusion module, and performing image-text feature fusion processing to obtain first image-text feature information;
in step S903, the fourth image feature information, the fourth text feature information, and the first image-text feature information are input to the second feature fusion module, and image-text feature fusion processing is performed to obtain target image-text feature information.
In this embodiment of the present specification, the third image feature information and the third text feature information may be input to the first feature fusion module, and image-text feature fusion processing is performed to obtain first image-text feature information; and the fourth image feature information, the fourth text feature information and the first image-text feature information can be input into the second feature fusion module for image-text feature fusion processing to obtain the target image-text feature information. The first feature fusion module and the second feature fusion module may each be a multi-head attention layer, which is not limited by the present disclosure.
By configuring the feature fusion module to include a first feature fusion module and a second feature fusion module, the shallow and deep semantic features of the visual and text modalities can be fused layer by layer, fully realizing multi-modal feature fusion, ensuring effective expression of the image-text features at every scale, and making the target image-text feature information more accurate.
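A minimal sketch of such a fusion module built on a multi-head attention layer. Treating each modality feature as one token of a short sequence and mean-pooling the attended outputs is one plausible reading, not the definitive implementation:

```python
import torch

class FusionModule(torch.nn.Module):
    def __init__(self, preset_len: int = 512, heads: int = 8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(preset_len, heads, batch_first=True)

    def forward(self, *features):              # e.g. (img_feat, txt_feat[, prev_fused])
        seq = torch.stack(features, dim=1)     # (batch, num_tokens, preset_len)
        fused, _ = self.attn(seq, seq, seq)    # self-attention across the modalities
        return fused.mean(dim=1)               # (batch, preset_len) image-text feature

# Usage mirroring fig. 6:
#   first  = FusionModule()(t3_img, t3_txt)
#   target = FusionModule()(t4_img, t4_txt, first)
```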
Fig. 10 is a flowchart illustrating a method for inputting tag feature information into a graph convolution network to perform tag feature correlation processing to obtain target tag feature description information according to an exemplary embodiment. As shown in fig. 10, in a possible implementation manner, the step S205 may include the following steps:
in step S1001, the tag feature information is input into the first graph convolution module, and tag feature correlation processing is performed to obtain tag feature description information to be processed.
In this embodiment of the present specification, the tag feature information may be input to the first graph convolution module, and tag feature correlation processing is performed to obtain tag feature description information to be processed.
In one example, in the case that the tag feature information includes tag correlation information and tag feature description information, the tag feature correlation processing may be implemented by the following formula (1), obtaining the to-be-processed tag feature description information H_(l+1):

H_(l+1) = A_l * H_l * W_l        (1)

wherein A_l is the tag correlation information, H_l is the tag feature description information, and W_l is a parameter of the first graph convolution module; in one example, W_l ∈ R^(d×d), where d may be the above-mentioned preset length, for example 512.
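Formula (1) can be written directly as a small module; a sketch follows, in which the ReLU nonlinearity between stacked modules is an assumption, since the disclosure only specifies the matrix product:

```python
import torch

class GraphConvModule(torch.nn.Module):
    """One application of formula (1): H_(l+1) = A_l * H_l * W_l."""
    def __init__(self, d: int = 512):
        super().__init__()
        self.W = torch.nn.Parameter(torch.empty(d, d))
        torch.nn.init.xavier_uniform_(self.W)

    def forward(self, A, H):               # A: (n, n) correlations, H: (n, d) descriptions
        return torch.relu(A @ H @ self.W)  # (n, d) to-be-processed descriptions
```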
In step S1003, the tag feature description information to be processed is input to the second graph convolution module, and tag feature correlation processing is performed to obtain target tag feature description information.
In this embodiment of the present specification, the tag feature description information to be processed may be input to the second graph convolution module, and tag feature correlation processing is performed to obtain target tag feature description information.
By configuring the graph convolution network to include a first graph convolution module and a second graph convolution module, the tag correlation information and the tag feature description information can undergo tag feature correlation processing from shallow to deep, yielding more accurate target tag feature description information and improving both the efficiency and the accuracy of its extraction.
Fig. 11 is a flowchart of a method for inputting feature information of a target image and feature information of a target text into a feature fusion module to perform feature fusion processing to obtain feature information of the target image and text according to an exemplary embodiment. As shown in fig. 11, after step S901, the tag classification method may further include the steps of:
in step S1101, feature fusion processing is performed on the first image-text feature information and the tag feature description information to be processed, so as to obtain second image-text feature information.
In one example, as shown in fig. 6, the first image-text characteristic information and the tag characteristic description information to be processed may be matrix-multiplied to obtain the first image-text characteristic information to be processed. And the first image-text characteristic information to be processed can be input into the first convolution network for convolution processing to obtain second image-text characteristic information.
Accordingly, step S903 may include:
in step S1103, the fourth image feature information, the fourth text feature information, and the second image-text feature information are input to the second feature fusion module, and image-text feature fusion processing is performed to obtain target image-text feature information. The implementation manner of this step may refer to step S901, which is not described herein again.
Further, S207 may include: and carrying out matrix multiplication on the target image-text characteristic information and the target label characteristic description information to obtain fourth image-text characteristic information to be processed. And the fourth image-text characteristic information to be processed can be input into a second convolution network for convolution processing to obtain target characteristic information with preset length.
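A shape-level sketch of this matrix multiplication followed by a convolution network. The exact tensor layout and the structure of the convolution network are not specified by the disclosure, so the choices below are illustrative assumptions:

```python
import torch

def fuse_with_labels(fused, label_desc, conv_net):
    # fused: (batch, 512) target image-text feature information
    # label_desc: (n, 512) target label feature description information
    scores = fused @ label_desc.T           # (batch, n) to-be-processed feature information
    return conv_net(scores.unsqueeze(1))    # convolution restoring the preset length

conv_net = torch.nn.Sequential(             # hypothetical second convolution network
    torch.nn.Conv1d(1, 512, kernel_size=3, padding=1),
    torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(),                     # -> (batch, 512) target feature information
)
```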
Through multi-scale fusion of the multi-modal features with the graph convolution features, full fusion of the label multi-modal features and the correlations between labels is ensured, so that the high-level features carry richer semantic expression; that is, the obtained target image-text feature information can have richer semantic expression, can be used effectively for label classification of content-rich multimedia resources, and improves the accuracy of multimedia resource label classification.
In one possible implementation, to further achieve depth fusion of image features and text features, the depth of the multi-modal feature extraction model and the depth of the graph convolution network may be increased. For example, as shown in fig. 12, the depth of the multi-modal feature extraction model and the depth of the graph convolution network may be set to 4, i.e., 4 times of fusion of the image feature, the text feature and the tag feature information is performed. Through 4 times of fusion from shallow to deep, the accuracy of multimedia resource feature classification can be further improved. The present disclosure does not limit the depth of the multi-modal feature extraction model and the depth of the graph convolution network.
As an example, as shown in fig. 12, the image feature extraction module may include a convolution module, a first down-sampling module, a first full-link layer, a second down-sampling module, a second full-link layer, a third down-sampling module, a fifth full-link layer, a fourth down-sampling module, and a seventh full-link layer. The convolution module, the first down-sampling module, the second down-sampling module, the third down-sampling module and the fourth down-sampling module can be connected in sequence; the first down-sampling module can be connected with the first feature fusion module through a first full connection layer, the second down-sampling module can be connected with the second feature fusion module through a second full connection layer, the third down-sampling module can be connected with the third feature fusion module through a fifth full connection layer, and the fourth down-sampling module can be connected with the fourth feature fusion module through a seventh full connection layer.
As shown in fig. 12, the text feature extraction module may include a first text feature extraction unit, a second text feature extraction unit, a third text feature extraction unit, and a fourth text feature extraction unit, which are connected in sequence; the first text feature extraction unit can be connected with the first feature fusion module through a third full connection layer, the second text feature extraction unit can be connected with the second feature fusion module through a fourth full connection layer, the third text feature extraction unit can be connected with the third feature fusion module through a sixth full connection layer, and the fourth text feature extraction unit can be connected with the fourth feature fusion module through an eighth full connection layer.
As shown in fig. 12, the feature fusion module may include a first feature fusion module, a second feature fusion module, a third feature fusion module, and a fourth feature fusion module; the graph convolution network may include a first graph convolution module, a second graph convolution module, a third graph convolution module, and a fourth graph convolution module. Optionally, as shown in fig. 12, the label classification model may further include a first convolutional network, a second convolutional network, a third convolutional network, a fourth convolutional network, and a target fully-connected layer. The structure of the label classification model is not limited by the present disclosure.
Referring to fig. 12, the tag classification model further includes matrix multiplication modules.
The first graph convolution module is connected with the first feature fusion module through the first matrix multiplication module, and the first matrix multiplication module may also be connected with the first convolution network; that is, the first graph convolution module, the first feature fusion module and the first convolution network are each connected with the first matrix multiplication module. Similarly, the second graph convolution module, the second feature fusion module and the second convolution network are each connected with the second matrix multiplication module; the third graph convolution module, the third feature fusion module and the third convolution network are each connected with the third matrix multiplication module; and the fourth graph convolution module, the fourth feature fusion module and the fourth convolution network are each connected with the fourth matrix multiplication module. There are four matrix multiplication modules in total; from top to bottom, they are the first matrix multiplication module, the second matrix multiplication module, the third matrix multiplication module and the fourth matrix multiplication module.
As shown in fig. 12, the target image G may be input to the convolution module, and the outputs of the first down-sampling module, the second down-sampling module, the third down-sampling module, and the fourth down-sampling module may be obtained as first image feature information, second image feature information, fifth image feature information, and sixth image feature information, respectively. The outputs of the first full connection layer, the second full connection layer, the fifth full connection layer and the seventh full connection layer are respectively third image characteristic information, fourth image characteristic information, seventh image characteristic information and eighth image characteristic information.
The output of the first text feature extraction unit, the output of the second text feature extraction unit, the output of the third text feature extraction unit and the output of the fourth text feature extraction unit are respectively first text feature information, second text feature information, fifth text feature information and sixth text feature information; the output of the third full connection layer, the fourth full connection layer, the sixth full connection layer and the eighth full connection layer is respectively third text characteristic information, fourth text characteristic information, seventh text characteristic information and eighth text characteristic information.
In the examples of this specification, A_l and H_l are the inputs of the graph convolution network, and the outputs of the first graph convolution module, the second graph convolution module, the third graph convolution module and the fourth graph convolution module are respectively the first label feature description information, the second label feature description information, the third label feature description information and the fourth label feature description information.
The outputs of the first feature fusion module, the second feature fusion module, the third feature fusion module and the fourth feature fusion module are respectively first image-text feature information, third image-text feature information to be processed, sixth image-text feature information to be processed and ninth image-text feature information to be processed;
the outputs of the first matrix multiplication module, the second matrix multiplication module, the third matrix multiplication module and the fourth matrix multiplication module are respectively the first to-be-processed image-text characteristic information, the fourth to-be-processed image-text characteristic information, the seventh to-be-processed image-text characteristic information and the tenth to-be-processed image-text characteristic information;
the outputs of the first convolution network, the second convolution network, the third convolution network and the fourth convolution network are respectively second to-be-processed image-text characteristic information (the second image-text characteristic information), fifth to-be-processed image-text characteristic information, eighth to-be-processed image-text characteristic information and target characteristic information.
The processing procedure of fig. 12 can refer to the related contents of fig. 6 to 11, and is not described herein again.
The feature lengths (the first scale, the second scale, the third scale and the fourth scale) corresponding to the first image feature information, the second image feature information, the fifth image feature information and the sixth image feature information may be 256, 512, 1024 and 2048, respectively; accordingly, the weight matrices of the first fully-connected layer, the second fully-connected layer, the fifth fully-connected layer and the seventh fully-connected layer may have dimensions 256 × 512, 512 × 512, 1024 × 512 and 2048 × 512, respectively. The feature length of the outputs of the third fully-connected layer, the fourth fully-connected layer, the sixth fully-connected layer and the eighth fully-connected layer may also be 512. This ensures that the feature length of the target image feature information input into the feature fusion module is consistent with that of the target text feature information.
Optionally, the first down-sampling module may be connected to the first fully-connected layer through the first pooling layer; accordingly, the second downsampling module may connect the second fully connected layer through the second pooled layer; the third downsampling module may connect the fifth full connection layer through the third pooling layer; the fourth downsampling module may connect the seventh full connection layer through the fourth pooling layer. The first pooling layer, the second pooling layer, the third pooling layer and the fourth pooling layer may be subjected to an average pooling process, which is not limited by the present disclosure.
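A shape check of this four-scale pipeline (average pooling before each fully-connected layer); the spatial sizes and batch size are arbitrary illustration values:

```python
import torch

pool = torch.nn.AdaptiveAvgPool2d(1)                 # average pooling, as described above
fcs = [torch.nn.Linear(c, 512) for c in (256, 512, 1024, 2048)]

maps = [torch.randn(2, c, s, s)                      # dummy multi-scale feature maps
        for c, s in zip((256, 512, 1024, 2048), (56, 28, 14, 7))]

outs = [fc(pool(m).flatten(1)) for fc, m in zip(fcs, maps)]
assert all(o.shape == (2, 512) for o in outs)        # every scale aligned to preset length
```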
The above describes label classification using the trained multi-modal feature extraction model, target fully-connected layer and graph convolution network. The training of these components, that is, the training of the label classification model, is described below with reference to fig. 13 and fig. 15. When training the label classification model, the preset feature extraction model and the preset fully-connected layer may first be trained to obtain the multi-modal feature extraction model and the target fully-connected layer, for example based on the framework diagram shown in fig. 14. Further, a preset label classification model as shown in fig. 16 may be constructed from the trained multi-modal feature extraction model and target fully-connected layer, whose parameters are then fixed while the preset graph convolution network is trained to obtain the graph convolution network. Through these two training stages, the label classification model can be obtained.
FIG. 13 is a flowchart illustrating a method for training a multi-modal feature extraction model and a target fully connected layer, according to an example embodiment. As shown in fig. 13, in one possible implementation, the following steps may be included:
in step S1301, a plurality of sample multimedia resources and corresponding sample tags are obtained.
In this embodiment of the present description, a plurality of sample multimedia resources may be obtained, and a sample label may be annotated for each sample multimedia resource, so that each sample multimedia resource has a corresponding sample label. The sample label may be at least one label in the preset label set; for example, a sample label of [1, 0, 1, …, 0] may indicate that the sample is annotated with label 1 and label 3 of the preset label set.
Wherein the plurality of sample multimedia assets can include a corresponding plurality of sample images and a plurality of sample texts. In practical applications, an image capable of characterizing the content of each sample multimedia resource may be obtained as a sample image corresponding to each sample multimedia resource. For example, a cover image of the sample short video may be acquired as a sample image corresponding to the sample short video. And the text associated with each sample multimedia asset may be obtained as the sample text corresponding to each sample multimedia asset. Specifically, refer to step S201, which is not described herein again.
In step S1303, the plurality of sample images and the plurality of sample texts are input into a preset feature extraction model, and feature extraction processing is performed to obtain first sample image-text feature information.
In this embodiment of the present specification, the implementation manner of this step may refer to step S203 described above, and is not described herein again.
In step S1305, the first sample image-text feature information is input into a preset full link layer, and is subjected to classification processing, so as to obtain a first prediction tag.
In this embodiment of the present specification, the output dimension of the preset fully-connected layer may be n, where n may be the number of tags in the preset tag set, and the weight of the preset fully-connected layer may be a d × n matrix, d being the preset length, which is not limited by this disclosure. The first sample image-text feature information can be input into the preset fully-connected layer for classification processing to obtain the first prediction label.
In step S1307, first loss information is acquired from the sample label and the first prediction label.
In this embodiment, the difference information between the sample label and the first prediction label may be used as the first loss information, which is not limited in this disclosure.
In one example, the first Loss information Loss may be obtained according to the following formula (2):
Loss = -ln Σ_i y_i, summed over the indices i for which t_i = 1        (2)

wherein y_i may be the probability value of the i-th tag in the first prediction label, and t_i may refer to the i-th entry of the sample label. The sample label corresponds to a set of labels [label 1, label 2, …, label n], where n may be greater than 1 and i ranges over [1, n]; the sample label comprises the labels for which t_i = 1. For example, a sample label of [1, 0, …, 1] comprises label 1 and label n. That is, the first loss information of one sample multimedia resource may be obtained from the probability values, in the first prediction label of that resource, that correspond to the sample label.
For example, when n is 3, the tag set is [basketball, boy student, girl student], and the sample labels corresponding to one sample multimedia asset are basketball and boy student, the sample label corresponding to that multimedia asset can be represented as [1, 1, 0]. In this case, formula (2) sums the terms for i = 1 and i = 2, that is, Loss = -ln(y_1 + y_2).
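Formula (2) and the worked example above can be reproduced in a few lines; the function name is hypothetical:

```python
import torch

def first_loss(pred_probs, sample_label):
    # Formula (2): -ln of the summed predicted probabilities of the ground-truth tags.
    return -torch.log((pred_probs * sample_label).sum())

y = torch.tensor([0.6, 0.3, 0.1])   # predicted probabilities for [basketball, boy, girl]
t = torch.tensor([1.0, 1.0, 0.0])   # sample label: basketball and boy student
print(first_loss(y, t))             # -ln(0.6 + 0.3) ≈ 0.105
```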
In step S1309, the preset feature extraction model and the preset full connection layer are trained according to the first loss information until the first loss information meets the preset condition, so as to obtain a multi-modal feature extraction model and a target full connection layer.
In this embodiment of the present specification, the first gradient information may be obtained according to the first loss information, so that a gradient reverse transmission method may be used to adjust parameters of the preset feature extraction model and parameters of the preset full connection layer, so as to implement training of the preset feature extraction model and the preset full connection layer until the first loss information satisfies a preset condition, and obtain the multi-modal feature extraction model and the target full connection layer. The preset condition may be that the first loss information is smaller than the loss threshold, or the preset condition may be that the first loss information is not decreased any more. The present disclosure is not limited thereto.
In an example, the first gradient information may be obtained using SGD (Stochastic Gradient Descent), the initial learning rate may be 0.1, and after training for 12 epochs the first loss information may tend to be stable, that is, no longer decrease, at which point training may be terminated to obtain the multi-modal feature extraction model and the target fully-connected layer. This is merely an example and is not intended to limit the present disclosure.
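A hedged sketch of this first training stage; `model` (the preset feature extraction model plus the preset fully-connected layer), `loader`, and the reuse of `first_loss` from the sketch above are all assumptions:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # initial learning rate 0.1

for epoch in range(12):                      # 12 epochs, as in the example above
    for images, texts, labels in loader:
        optimizer.zero_grad()
        probs = model(images, texts)         # first prediction label (probabilities)
        loss = first_loss(probs, labels)     # first loss information, formula (2)
        loss.backward()                      # gradient reverse transmission
        optimizer.step()
```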
By taking the sample image and the sample text as inputs to the preset feature extraction model, the model is trained on multi-modal feature extraction, so that the trained multi-modal feature extraction model is suited to multi-modal feature extraction from multimedia resources; this improves the accuracy of feature extraction and, combined with the target fully-connected layer, improves the precision of label classification of multimedia resources.
FIG. 15 is a flowchart illustrating training of a graph convolution network, according to an example embodiment. As shown in fig. 15, in a possible implementation manner, after the step S1301, the following steps may be included:
in step S1501, the plurality of sample images and the plurality of sample texts are input into the multimodal feature extraction model, and feature extraction processing is performed to obtain second sample image-text feature information. The implementation manner of this step can be referred to the above step S203, and is not described herein again.
In step S1503, inputting the label feature information into a preset graph convolution network, and performing label feature correlation processing to obtain sample label feature description information;
in step S1505, performing feature fusion processing on the second sample image-text feature information and the sample label feature description information to obtain sample feature information;
in step S1507, the sample feature information is input to the target full-link layer, and classification processing is performed to obtain a second prediction tag.
In the embodiment of this specification, the implementation manner of steps S1503 to S1507 may refer to steps S205 to S209 described above, and is not described herein again.
In step S1509, second loss information is acquired based on the sample label and the second prediction label.
In this embodiment, a preset loss function may be used to calculate the loss between the sample label and the second prediction label as the second loss information. The preset loss function may be a multi-label classification loss such as the MultiLabelSoftMarginLoss function, which is not limited by the present disclosure.
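For reference, PyTorch provides this loss directly; a minimal usage sketch (the logits shown are arbitrary):

```python
import torch

criterion = torch.nn.MultiLabelSoftMarginLoss()

logits = torch.tensor([[2.1, 0.4, -1.3]])  # second prediction label (pre-sigmoid scores)
target = torch.tensor([[1.0, 1.0, 0.0]])   # sample label: basketball and boy student
second_loss = criterion(logits, target)    # second loss information
```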
In step S1511, a preset graph convolution network is trained according to the second loss information until the second loss information satisfies a preset condition, so as to obtain the graph convolution network. The implementation manner of this step can be referred to the above step S1309, and is not described herein again.
Optionally, the convolutional networks in the model may likewise be obtained by training preset convolutional networks according to the second loss information.
By including the training of the preset graph convolution network, the trained label classification model comprises the graph convolution network, which can further improve the accuracy of label classification of multimedia resources.
Fig. 17 is a block diagram illustrating an apparatus for classifying tags of a multimedia asset according to an exemplary embodiment. Referring to fig. 17, the apparatus may include:
a model input information obtaining module 1701 configured to perform obtaining of a target image and a target text corresponding to a multimedia resource to be processed and tag feature information corresponding to a preset tag set;
a target image-text characteristic information acquisition module 1703 configured to perform inputting the target image and the target text into a multi-modal characteristic extraction model, and perform characteristic extraction processing to obtain target image-text characteristic information of the multimedia resource to be processed;
a target tag feature description information obtaining module 1705 configured to input the tag feature information into the graph convolution network, and perform tag feature correlation processing to obtain target tag feature description information;
a target characteristic information obtaining module 1707 configured to perform characteristic fusion processing on the target image-text characteristic information and the target tag characteristic description information to obtain target characteristic information;
a tag information obtaining module 1709 configured to determine at least one tag from a preset tag set as tag information of the multimedia resource according to the target feature information.
Target image-text feature information of the multimedia resource to be processed is obtained through the multi-modal feature extraction model, which can be effectively applied to label classification of multimedia resources. Moreover, by combining the graph convolution network for label feature correlation processing and performing feature fusion processing on the target image-text feature information and the target label feature description information, correlated fusion of the multi-modal features of the multimedia resource with the label feature information is realized, so that the target feature information has richer semantic expression and characterizes the content of the multimedia resource more accurately; that is, the degree of comprehension of the multimedia resource content is improved, and the accuracy of multimedia resource label classification is improved.
In one possible implementation manner, the multi-modal feature extraction model comprises an image feature extraction module, a text feature extraction module and a feature fusion module; the target image-text characteristic information acquisition module 1703 includes:
the target image characteristic information acquisition unit is configured to input a target image into the image characteristic extraction module, and perform image characteristic extraction processing to obtain target image characteristic information;
the target text characteristic information acquisition unit is configured to input the target text into the text feature extraction module, and perform text feature extraction processing to obtain target text characteristic information;
and the target image-text characteristic information acquisition unit is configured to input the target image characteristic information and the target text characteristic information into the characteristic fusion module for characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation, the image feature extraction module includes a convolution module, a first down-sampling module, a first full-link layer, a second down-sampling module, and a second full-link layer; the target image feature information acquisition unit includes:
the initial image characteristic information acquisition unit is configured to input the target image into the convolution module for characteristic extraction processing to obtain initial image characteristic information;
the first image characteristic information acquisition unit is configured to input the initial image characteristic information into a first down-sampling module for down-sampling processing to obtain first image characteristic information of a first scale;
the second image characteristic information acquisition unit is configured to input the first image characteristic information into a second down-sampling module for down-sampling processing to obtain second image characteristic information of a second scale;
the third image characteristic information acquisition unit is configured to input the first image characteristic information into the first full-connection layer, and perform characteristic length adjustment processing to obtain third image characteristic information with a preset length;
the fourth image characteristic information acquisition unit is configured to input the second image characteristic information into the second full-connection layer, and perform characteristic length adjustment processing to obtain fourth image characteristic information with a preset length;
a target image feature information determination unit configured to perform the third image feature information and the fourth image feature information as target image feature information.
In one possible implementation manner, the text feature extraction module includes a first text feature extraction unit, a third full-connected layer, a second text feature extraction unit, and a fourth full-connected layer; the target text characteristic information acquiring unit includes:
the first text characteristic information acquisition unit is configured to input a target text into the first text characteristic extraction unit, and perform text characteristic extraction processing to obtain first text characteristic information;
the second text characteristic information acquisition unit is configured to input the first text characteristic information into the second text characteristic extraction unit, and perform text characteristic extraction processing to obtain second text characteristic information;
the third text characteristic information acquisition unit is configured to input the first text characteristic information into a third full connection layer, and perform characteristic length adjustment processing to obtain third text characteristic information with a preset length;
the fourth text characteristic information acquisition unit is configured to input the second text characteristic information into a fourth full connection layer, and perform characteristic length adjustment processing to obtain fourth text characteristic information with a preset length;
a target text feature information determination unit configured to perform the third text feature information and the fourth text feature information as target text feature information.
In one possible implementation, the feature fusion module includes a first feature fusion module and a second feature fusion module; the target image-text characteristic information acquisition unit comprises:
the first image-text characteristic information acquisition unit is configured to input the third image characteristic information and the third text characteristic information into the first characteristic fusion module, and perform image-text characteristic fusion processing to obtain first image-text characteristic information;
and the first target image-text characteristic information acquisition unit is configured to input the fourth image characteristic information, the fourth text characteristic information and the first image-text characteristic information into the second characteristic fusion module for image-text characteristic fusion processing to obtain target image-text characteristic information.
In one possible implementation, the graph convolution network includes a first graph convolution module and a second graph convolution module; the target tag feature description information acquisition module 1705 includes:
the tag feature description information to be processed acquiring unit is configured to input the tag feature information into the first graph convolution module for tag feature correlation processing to obtain tag feature description information to be processed;
and the target label feature description information acquisition unit is configured to input the to-be-processed label feature description information into the second graph convolution module, and perform label feature correlation processing to obtain the target label feature description information.
In one possible implementation manner, the tag classification apparatus further includes:
the second image-text characteristic information acquisition module is configured to perform characteristic fusion processing on the first image-text characteristic information and the to-be-processed label characteristic description information to obtain second image-text characteristic information;
the target image-text characteristic information acquisition unit further comprises:
and the second target image-text characteristic information acquisition unit is configured to input the fourth image characteristic information, the fourth text characteristic information and the second image-text characteristic information into the second characteristic fusion module for image-text characteristic fusion processing to obtain the target image-text characteristic information.
In a possible implementation manner, the tag information obtaining module 1709 includes:
and the label information acquisition unit is configured to input the target characteristic information into the target full-link layer, perform classification processing and obtain label information.
In one possible implementation, the model input information obtaining module 1701 includes:
the label correlation information and weight information acquiring unit is configured to acquire label correlation information between every two labels in a preset label set and weight information of a target full-connection layer;
a tag feature description information acquisition unit configured to take the weight information as the tag feature description information;
and the tag characteristic information acquisition unit is configured to take the tag correlation information and the tag feature description information as the tag feature information corresponding to the preset tag set.
In one possible implementation manner, the tag classification apparatus further includes:
a training data acquisition module configured to perform acquisition of a plurality of sample multimedia resources and corresponding sample labels; the plurality of sample multimedia assets comprise a plurality of corresponding sample images and a plurality of sample texts;
the first sample image-text characteristic information acquisition module is configured to input a plurality of sample images and a plurality of sample texts into a preset characteristic extraction model for characteristic extraction processing to obtain first sample image-text characteristic information;
the first prediction label acquisition module is configured to input the first sample image-text characteristic information into a preset full connection layer for classification processing to obtain a first prediction label;
a first loss information acquisition module configured to perform acquisition of first loss information according to the sample label and the first prediction label;
and the first training module is configured to train the preset feature extraction model and the preset full connection layer according to the first loss information until the first loss information meets a preset condition, so as to obtain a multi-mode feature extraction model and a target full connection layer.
In one possible implementation manner, the tag classification apparatus further includes:
the second sample image-text characteristic information acquisition module is configured to input a plurality of sample images and a plurality of sample texts into the multi-modal characteristic extraction model for characteristic extraction processing to obtain second sample image-text characteristic information;
the sample label feature description information acquisition module is configured to input the label feature information into a preset graph convolution network, and perform label feature correlation processing to obtain sample label feature description information;
the sample characteristic information acquisition module is configured to perform characteristic fusion processing on the second sample image-text characteristic information and the sample label characteristic description information to obtain sample characteristic information;
the second prediction label acquisition module is configured to input the sample characteristic information into the target full-link layer for classification processing to obtain a second prediction label;
a second loss information obtaining module configured to obtain second loss information according to the sample label and the second prediction label;
and the second training module is configured to train the preset graph convolution network according to the second loss information until the second loss information meets a preset condition to obtain the graph convolution network.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 18 is a block diagram illustrating an electronic device for tag classification of multimedia assets, which may be a terminal, according to an exemplary embodiment, and an internal structure thereof may be as shown in fig. 18. The electronic device comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a method of tag classification for a multimedia asset. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the configuration shown in fig. 18 is a block diagram of only a portion of the configuration associated with the disclosed aspects and does not constitute a limitation on the electronic device to which the disclosed aspects apply, and that a particular electronic device may include more or less components than those shown in the figures, or combine certain components, or have a different arrangement of components.
Fig. 19 is a block diagram illustrating an electronic device for tag classification of multimedia assets, which may be a server, according to an exemplary embodiment, and an internal structure thereof may be as shown in fig. 19. The electronic device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a method of tag classification for a multimedia asset.
Those skilled in the art will appreciate that the architecture shown in fig. 19 is merely a block diagram of some of the structures associated with the disclosed aspects and does not constitute a limitation on the electronic devices to which the disclosed aspects apply, as a particular electronic device may include more or less components than those shown, or combine certain components, or have a different arrangement of components.
In an exemplary embodiment, there is also provided an electronic device including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a method of tag classification for multimedia assets as in an embodiment of the disclosure.
In an exemplary embodiment, there is also provided a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a tag classification method of a multimedia asset in an embodiment of the present disclosure. The computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of tag classification of a multimedia asset in embodiments of the present disclosure.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for classifying labels of multimedia resources is characterized in that the method for classifying labels comprises the following steps:
acquiring a target image and a target text corresponding to a multimedia resource to be processed and label characteristic information corresponding to a preset label set;
inputting the target image and the target text into a multi-modal feature extraction model, and performing feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed;
inputting the label characteristic information into a graph convolution network, and performing label characteristic correlation processing to obtain target label characteristic description information;
performing feature fusion processing on the target image-text feature information and the target label feature description information to obtain target feature information;
and determining at least one label from the preset label set as the label information of the multimedia resource according to the target characteristic information.
2. The label classification method according to claim 1, wherein the multi-modal feature extraction model comprises an image feature extraction module, a text feature extraction module and a feature fusion module; the step of inputting the target image and the target text into a multi-modal feature extraction model for feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed comprises the following steps:
inputting the target image into the image feature extraction module, and performing image feature extraction processing to obtain target image feature information;
inputting the target text into the text feature extraction module, and performing text feature extraction processing to obtain target text feature information;
and inputting the target image characteristic information and the target text characteristic information into the characteristic fusion module for characteristic fusion processing to obtain the target image-text characteristic information.
3. The label classification method according to claim 2, wherein the image feature extraction module comprises a convolution module, a first down-sampling module, a first fully-connected layer, a second down-sampling module, and a second fully-connected layer; the step of inputting the target image into the image feature extraction module to perform image feature extraction processing to obtain target image feature information comprises:
inputting the target image into the convolution module, and performing feature extraction processing to obtain initial image feature information;
inputting the initial image feature information into the first down-sampling module, and performing down-sampling processing to obtain first image feature information of a first scale;
inputting the first image feature information into the second down-sampling module, and performing down-sampling processing to obtain second image feature information of a second scale;
inputting the first image feature information into the first fully-connected layer, and performing feature length adjustment processing to obtain third image feature information with a preset length;
inputting the second image feature information into the second fully-connected layer, and performing feature length adjustment processing to obtain fourth image feature information with a preset length;
and taking the third image feature information and the fourth image feature information as the target image feature information.
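For illustration only: one way to realize claim 3, assuming strided convolutions for the down-sampling modules and global average pooling before each fully-connected layer; every layer size is an assumption.

    import torch
    import torch.nn as nn

    class ImageFeatureModule(nn.Module):
        """Hypothetical image feature extraction module of claim 3."""
        def __init__(self, d=256):
            super().__init__()
            self.conv = nn.Sequential(  # convolution module
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
            self.down1 = nn.Sequential(  # first down-sampling module
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
            self.down2 = nn.Sequential(  # second down-sampling module
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc1 = nn.Linear(128, d)  # first fully-connected layer -> preset length d
            self.fc2 = nn.Linear(256, d)  # second fully-connected layer -> preset length d

        def forward(self, image):
            x = self.conv(image)                      # initial image feature information
            s1 = self.down1(x)                        # first image features, first scale
            s2 = self.down2(s1)                       # second image features, second scale
            f3 = self.fc1(self.pool(s1).flatten(1))  # third image feature information
            f4 = self.fc2(self.pool(s2).flatten(1))  # fourth image feature information
            return f3, f4                             # target image feature information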
4. The label classification method according to claim 3, wherein the text feature extraction module comprises a first text feature extraction unit, a third fully-connected layer, a second text feature extraction unit and a fourth fully-connected layer; the step of inputting the target text into the text feature extraction module to perform text feature extraction processing to obtain target text feature information comprises:
inputting the target text into the first text feature extraction unit, and performing text feature extraction processing to obtain first text feature information;
inputting the first text feature information into the second text feature extraction unit, and performing text feature extraction processing to obtain second text feature information;
inputting the first text feature information into the third fully-connected layer, and performing feature length adjustment processing to obtain third text feature information with a preset length;
inputting the second text feature information into the fourth fully-connected layer, and performing feature length adjustment processing to obtain fourth text feature information with a preset length;
and taking the third text feature information and the fourth text feature information as the target text feature information.
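For illustration only: a sketch of claim 4 in which GRU encoders stand in for the unspecified text feature extraction units; vocabulary and hidden sizes are assumptions.

    import torch
    import torch.nn as nn

    class TextFeatureModule(nn.Module):
        """Hypothetical text feature extraction module of claim 4."""
        def __init__(self, vocab=30000, emb=128, hid=256, d=256):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.unit1 = nn.GRU(emb, hid, batch_first=True)  # first text feature extraction unit
            self.unit2 = nn.GRU(hid, hid, batch_first=True)  # second unit, stacked on the first
            self.fc3 = nn.Linear(hid, d)  # third fully-connected layer -> preset length d
            self.fc4 = nn.Linear(hid, d)  # fourth fully-connected layer -> preset length d

        def forward(self, token_ids):
            e = self.embed(token_ids)     # (batch, seq, emb)
            t1, _ = self.unit1(e)         # first text feature information
            t2, _ = self.unit2(t1)        # second text feature information
            f3 = self.fc3(t1[:, -1])      # third text feature information (preset length)
            f4 = self.fc4(t2[:, -1])      # fourth text feature information (preset length)
            return f3, f4                 # target text feature information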
5. The label classification method according to claim 4, wherein the feature fusion module comprises a first feature fusion module and a second feature fusion module; the step of inputting the target image feature information and the target text feature information into the feature fusion module for feature fusion processing to obtain the target image-text feature information comprises the following steps:
inputting the third image feature information and the third text feature information into the first feature fusion module, and performing image-text feature fusion processing to obtain first image-text feature information;
and inputting the fourth image feature information, the fourth text feature information and the first image-text feature information into the second feature fusion module, and performing image-text feature fusion processing to obtain the target image-text feature information.
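For illustration only: a sketch of claim 5 that realizes each fusion module as concatenation followed by a linear projection; the claim itself does not fix the fusion operation.

    import torch
    import torch.nn as nn

    class FusionModule(nn.Module):
        """Hypothetical two-stage feature fusion module of claim 5."""
        def __init__(self, d=256):
            super().__init__()
            self.fuse1 = nn.Linear(2 * d, d)  # first feature fusion module
            self.fuse2 = nn.Linear(3 * d, d)  # second feature fusion module

        def forward(self, img_feats, txt_feats):
            f3_img, f4_img = img_feats
            f3_txt, f4_txt = txt_feats
            # First fusion: third image features + third text features.
            g1 = torch.relu(self.fuse1(torch.cat([f3_img, f3_txt], dim=-1)))
            # Second fusion: fourth image/text features + first image-text features.
            return self.fuse2(torch.cat([f4_img, f4_txt, g1], dim=-1))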
6. The label classification method according to claim 5, wherein the graph convolution network comprises a first graph convolution module and a second graph convolution module; the step of inputting the label feature information into the graph convolution network to perform label feature correlation processing to obtain target label feature description information comprises the following steps:
inputting the label feature information into the first graph convolution module, and performing label feature correlation processing to obtain label feature description information to be processed;
and inputting the label feature description information to be processed into the second graph convolution module, and performing label feature correlation processing to obtain the target label feature description information.
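For illustration only: a standard two-layer graph convolution (H' = A·H·W) over the label set, matching the two modules of claim 6; the normalized adjacency is typically built from label co-occurrence statistics, which is an assumption here.

    import torch
    import torch.nn as nn

    class LabelGCN(nn.Module):
        """Hypothetical two-module graph convolution network of claim 6."""
        def __init__(self, in_dim=300, hid=512, out_dim=256):
            super().__init__()
            self.w1 = nn.Linear(in_dim, hid, bias=False)   # first graph convolution module
            self.w2 = nn.Linear(hid, out_dim, bias=False)  # second graph convolution module

        def forward(self, label_features, adj):
            # adj: (num_labels, num_labels) normalized label correlation matrix.
            h = torch.relu(adj @ self.w1(label_features))  # label features to be processed
            return adj @ self.w2(h)                        # target label feature description

Propagating label features over a correlation graph lets related labels share evidence, which is the usual motivation for GCN-based multi-label heads.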
7. The label classification method according to claim 6, wherein after the step of inputting the third image feature information and the third text feature information into the first feature fusion module and performing image-text feature fusion processing to obtain first image-text feature information, the label classification method further comprises:
performing feature fusion processing on the first image-text feature information and the label feature description information to be processed to obtain second image-text feature information;
the step of inputting the fourth image feature information, the fourth text feature information and the first image-text feature information into the second feature fusion module to perform image-text feature fusion processing to obtain the target image-text feature information comprises:
and inputting the fourth image feature information, the fourth text feature information and the second image-text feature information into the second feature fusion module, and performing image-text feature fusion processing to obtain the target image-text feature information.
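For illustration only: a sketch of the claim 7 variant, in which the intermediate output of the first graph convolution module is pooled and fused into the image-text branch before the second fusion; the mean pooling and the extra linear layer are the editor's assumptions.

    import torch
    import torch.nn as nn

    class FusionWithLabels(nn.Module):
        """Hypothetical fusion path of claim 7 (label-aware second fusion)."""
        def __init__(self, d=256, gcn_hid=512):
            super().__init__()
            self.fuse1 = nn.Linear(2 * d, d)      # first feature fusion module
            self.mix = nn.Linear(d + gcn_hid, d)  # fuses g1 with label descriptions
            self.fuse2 = nn.Linear(3 * d, d)      # second feature fusion module

        def forward(self, f3_img, f4_img, f3_txt, f4_txt, labels_mid):
            # First image-text feature information.
            g1 = torch.relu(self.fuse1(torch.cat([f3_img, f3_txt], dim=-1)))
            # Second image-text features: first fusion output + pooled intermediate
            # label feature description information (mean pooling assumed).
            pooled = labels_mid.mean(dim=0, keepdim=True).expand(g1.size(0), -1)
            g2 = torch.relu(self.mix(torch.cat([g1, pooled], dim=-1)))
            # Target image-text feature information.
            return self.fuse2(torch.cat([f4_img, f4_txt, g2], dim=-1))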
8. A label classification apparatus for multimedia resources, comprising:
a model input information acquisition module configured to acquire a target image and a target text corresponding to a multimedia resource to be processed, and label feature information corresponding to a preset label set;
a target image-text feature information acquisition module configured to input the target image and the target text into a multi-modal feature extraction model and perform feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed;
a target label feature description information acquisition module configured to input the label feature information into a graph convolution network and perform label feature correlation processing to obtain target label feature description information;
a target feature information acquisition module configured to perform feature fusion processing on the target image-text feature information and the target label feature description information to obtain target feature information;
and a label information acquisition module configured to determine, according to the target feature information, at least one label from the preset label set as the label information of the multimedia resource.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the label classification method for multimedia resources as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the label classification method for multimedia resources as claimed in any one of claims 1 to 7.
CN202110331593.XA 2021-03-26 2021-03-26 Label classification method and device for multimedia resources, electronic equipment and storage medium Active CN113204659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110331593.XA CN113204659B (en) 2021-03-26 2021-03-26 Label classification method and device for multimedia resources, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110331593.XA CN113204659B (en) 2021-03-26 2021-03-26 Label classification method and device for multimedia resources, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113204659A true CN113204659A (en) 2021-08-03
CN113204659B CN113204659B (en) 2024-01-19

Family

ID=77025776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110331593.XA Active CN113204659B (en) 2021-03-26 2021-03-26 Label classification method and device for multimedia resources, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113204659B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017166586A1 (en) * 2016-03-30 2017-10-05 乐视控股(北京)有限公司 Image identification method and system based on convolutional neural network, and electronic device
US20200210773A1 (en) * 2019-01-02 2020-07-02 Boe Technology Group Co., Ltd. Neural network for image multi-label identification, related method, medium and device
CN110210515A (en) * 2019-04-25 2019-09-06 浙江大学 A kind of image data multi-tag classification method
CN110442707A (en) * 2019-06-21 2019-11-12 电子科技大学 A kind of multi-tag file classification method based on seq2seq
CN110472090A (en) * 2019-08-20 2019-11-19 腾讯科技(深圳)有限公司 Image search method and relevant apparatus, storage medium based on semantic label
CN110781751A (en) * 2019-09-27 2020-02-11 杭州电子科技大学 Emotional electroencephalogram signal classification method based on cross-connection convolutional neural network
CN110807495A (en) * 2019-11-08 2020-02-18 腾讯科技(深圳)有限公司 Multi-label classification method and device, electronic equipment and storage medium
CN111428619A (en) * 2020-03-20 2020-07-17 电子科技大学 Three-dimensional point cloud head attitude estimation system and method based on ordered regression and soft labels
CN111476315A (en) * 2020-04-27 2020-07-31 中国科学院合肥物质科学研究院 Image multi-label identification method based on statistical correlation and graph convolution technology
CN111914156A (en) * 2020-08-14 2020-11-10 中国科学院自动化研究所 Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN112001186A (en) * 2020-08-26 2020-11-27 重庆理工大学 Emotion classification method using graph convolution neural network and Chinese syntax
CN112001185A (en) * 2020-08-26 2020-11-27 重庆理工大学 Emotion classification method combining Chinese syntax and graph convolution neural network
CN112035669A (en) * 2020-09-09 2020-12-04 中国科学技术大学 Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling
CN112241481A (en) * 2020-10-09 2021-01-19 中国人民解放军国防科技大学 Cross-modal news event classification method and system based on graph neural network

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627447A (en) * 2021-10-13 2021-11-09 腾讯科技(深圳)有限公司 Label identification method, label identification device, computer equipment, storage medium and program product
CN114357204A (en) * 2021-11-25 2022-04-15 腾讯科技(深圳)有限公司 Media information processing method and related equipment
CN114357204B (en) * 2021-11-25 2024-03-26 腾讯科技(深圳)有限公司 Media information processing method and related equipment
CN114625897A (en) * 2022-03-21 2022-06-14 腾讯科技(深圳)有限公司 Multimedia resource processing method and device, electronic equipment and storage medium
CN118171149A (en) * 2024-05-15 2024-06-11 腾讯科技(深圳)有限公司 Label classification method, apparatus, device, storage medium and computer program product

Also Published As

Publication number Publication date
CN113204659B (en) 2024-01-19

Similar Documents

Publication Publication Date Title
CN110781347B (en) Video processing method, device and equipment and readable storage medium
CN113204659B (en) Label classification method and device for multimedia resources, electronic equipment and storage medium
CN110489582B (en) Method and device for generating personalized display image and electronic equipment
CN112685565A (en) Text classification method based on multi-mode information fusion and related equipment thereof
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN111581926B (en) Document generation method, device, equipment and computer readable storage medium
US20230017667A1 (en) Data recommendation method and apparatus, computer device, and storage medium
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN112215171B (en) Target detection method, device, equipment and computer readable storage medium
CN111611436A (en) Label data processing method and device and computer readable storage medium
CN113204660B (en) Multimedia data processing method, tag identification device and electronic equipment
CN112364204A (en) Video searching method and device, computer equipment and storage medium
CN112765387A (en) Image retrieval method, image retrieval device and electronic equipment
CN113657087B (en) Information matching method and device
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN113761253A (en) Video tag determination method, device, equipment and storage medium
CN114329028A (en) Data processing method, data processing equipment and computer readable storage medium
CN113641835B (en) Multimedia resource recommendation method and device, electronic equipment and medium
CN115222845A (en) Method and device for generating style font picture, electronic equipment and medium
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN117540221B (en) Image processing method and device, storage medium and electronic equipment
CN114330514A (en) Data reconstruction method and system based on depth features and gradient information
CN111814496B (en) Text processing method, device, equipment and storage medium
CN116030375A (en) Video feature extraction and model training method, device, equipment and storage medium
CN112883256B (en) Multitasking method, apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant