CN113204659B - Label classification method and device for multimedia resources, electronic equipment and storage medium


Info

Publication number: CN113204659B
Application number: CN202110331593.XA
Authority: CN (China)
Prior art keywords: information, text, tag, feature, characteristic
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN113204659A (English)
Inventor: 吴翔宇
Current assignee: Beijing Dajia Internet Information Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd; priority to CN202110331593.XA
Published as CN113204659A; granted and published as CN113204659B


Classifications

    • G06F 16/45 Information retrieval of multimedia data: Clustering; Classification
    • G06F 16/483 Information retrieval of multimedia data: Retrieval characterised by metadata automatically derived from the content
    • G06F 18/253 Pattern recognition: Fusion techniques applied to extracted features
    • G06N 3/045 Neural networks: Combinations of networks
    • G06N 3/08 Neural networks: Learning methods
    • G06V 10/40 Image or video recognition or understanding: Extraction of image or video features

Abstract

The disclosure relates to a tag classification method and device for multimedia resources, an electronic device and a storage medium. The tag classification method comprises the following steps: acquiring a target image and a target text corresponding to a multimedia resource to be processed, and tag feature information corresponding to a preset tag set; inputting the target image and the target text into a multi-modal feature extraction model and performing feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed; inputting the tag feature information into a graph convolution network and performing tag feature correlation processing to obtain target tag feature description information; performing feature fusion processing on the target image-text feature information and the target tag feature description information to obtain target feature information; and determining at least one tag from the preset tag set as the tag information of the multimedia resource according to the target feature information. The technical scheme provided by the disclosure can improve the accuracy of tag classification for multimedia resources.

Description

Label classification method and device for multimedia resources, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer vision technology, and in particular to a tag classification method and device for multimedia resources, an electronic device and a storage medium.
Background
Tag classification is the basis of deep learning and data recommendation services. In the related art, tag classification is generally performed based on single-modal features of the data, and the tags in a tag set are organized in a single tree structure for classification. Multimedia data, however, contains multi-modal features such as images, text and sound, so the existing single-modality-based tag classification methods are not suitable for data with multi-modal features. In addition, multimedia data is rich in content and generally carries a plurality of tags, and when tags organized in a tree structure as in the related art are used for classification, the accuracy of tag classification is poor.
Disclosure of Invention
The disclosure provides a tag classification method and device for multimedia resources, an electronic device and a storage medium, so as to at least solve the problem in the related art of how to improve the precision of tag classification for multimedia resources. The technical scheme of the present disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a tag classification method for a multimedia resource, including:
acquiring a target image and a target text corresponding to a multimedia resource to be processed and tag characteristic information corresponding to a preset tag set;
inputting the target image and the target text into a multi-mode feature extraction model, and performing feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed;
inputting the tag characteristic information into a graph convolution network, and performing tag characteristic correlation processing to obtain target tag characteristic description information;
performing feature fusion processing on the target image-text feature information and the target tag feature description information to obtain target feature information;
and determining at least one tag from the preset tag set as tag information of the multimedia resource according to the target feature information.
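For readers who prefer code, the five steps above map onto a small inference routine. The following is a minimal sketch, assuming PyTorch; extractor, gcn, fuse and classifier are hypothetical stand-ins for the modules described below, not the patented implementation:

```python
import torch

# Illustrative stand-in pipeline for the five claimed steps; every module
# passed in (extractor, gcn, fuse, classifier) is a hypothetical callable.
def classify_tags(image, text, tag_features, extractor, gcn, fuse,
                  classifier, threshold=0.5):
    joint = extractor(image, text)        # target image-text feature information
    tag_desc = gcn(tag_features)          # target tag feature description info
    target = fuse(joint, tag_desc)        # target feature information
    scores = torch.sigmoid(classifier(target))   # per-tag probabilities, (1, n)
    picked = (scores > threshold).nonzero(as_tuple=True)[1]
    # guarantee "at least one tag", as the method requires
    return picked if picked.numel() > 0 else scores.argmax(dim=1)
```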
In one possible implementation manner, the multi-modal feature extraction model comprises an image feature extraction module, a text feature extraction module and a feature fusion module; the step of inputting the target image and the target text into a multi-mode feature extraction model to perform feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed comprises the following steps:
inputting the target image into the image feature extraction module, and performing image feature extraction processing to obtain target image feature information;
inputting the target text into the text feature extraction module, and performing text feature extraction processing to obtain target text feature information;
and inputting the target image characteristic information and the target text characteristic information into the characteristic fusion module to perform characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation manner, the image feature extraction module includes a convolution module, a first downsampling module, a first full-connection layer, a second downsampling module, and a second full-connection layer; the step of inputting the target image into the image feature extraction module to perform image feature extraction processing to obtain target image feature information comprises the following steps:
inputting the target image into the convolution module, and performing feature extraction processing to obtain initial image feature information;
inputting the initial image characteristic information into the first downsampling module, and performing downsampling processing to obtain first image characteristic information of a first scale;
inputting the first image characteristic information into the second downsampling module, and performing downsampling processing to obtain second image characteristic information of a second scale;
inputting the first image characteristic information into the first full-connection layer, and performing characteristic length adjustment processing to obtain third image characteristic information with preset length;
inputting the second image characteristic information into the second full-connection layer, and performing characteristic length adjustment processing to obtain fourth image characteristic information with preset length;
and taking the third image characteristic information and the fourth image characteristic information as the target image characteristic information.
In one possible implementation manner, the text feature extraction module includes a first text feature extraction unit, a third fully-connected layer, a second text feature extraction unit, and a fourth fully-connected layer; the step of inputting the target text into the text feature extraction module to perform text feature extraction processing to obtain target text feature information comprises the following steps:
inputting the target text into the first text feature extraction unit, and performing text feature extraction processing to obtain first text feature information;
inputting the first text feature information into the second text feature extraction unit, and performing text feature extraction processing to obtain second text feature information;
inputting the first text characteristic information into the third full-connection layer, and performing characteristic length adjustment processing to obtain third text characteristic information with preset length;
inputting the second text characteristic information into the fourth full-connection layer, and performing characteristic length adjustment processing to obtain fourth text characteristic information with preset length;
and taking the third text characteristic information and the fourth text characteristic information as the target text characteristic information.
In one possible implementation manner, the feature fusion module includes a first feature fusion module and a second feature fusion module; the step of inputting the target image characteristic information and the target text characteristic information into the characteristic fusion module to perform characteristic fusion processing, and the step of obtaining the target image-text characteristic information comprises the following steps:
inputting the third image characteristic information and the third text characteristic information into the first characteristic fusion module to perform image-text characteristic fusion processing to obtain first image-text characteristic information;
and inputting the fourth image characteristic information, the fourth text characteristic information and the first image-text characteristic information into the second characteristic fusion module to perform image-text characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation, the graph convolution network includes a first graph convolution module and a second graph convolution module; the step of inputting the tag characteristic information into a graph convolution network to perform tag characteristic correlation processing to obtain target tag characteristic description information comprises the following steps:
inputting the tag characteristic information into the first graph convolution module, and performing tag characteristic correlation processing to obtain tag characteristic description information to be processed;
and inputting the label characteristic description information to be processed into the second graph convolution module, and performing label characteristic correlation processing to obtain the target label characteristic description information.
In one possible implementation manner, after the step of inputting the third image feature information and the third text feature information into the first feature fusion module and performing image-text feature fusion processing to obtain first image-text feature information, the tag classification method further includes:
performing feature fusion processing on the first image-text feature information and the tag feature description information to be processed to obtain second image-text feature information;
the step of inputting the fourth image feature information, the fourth text feature information and the first image-text feature information into the second feature fusion module to perform image-text feature fusion processing, and obtaining the target image-text feature information comprises the following steps:
and inputting the fourth image characteristic information, the fourth text characteristic information and the second image-text characteristic information into the second characteristic fusion module to perform image-text characteristic fusion processing to obtain the target image-text characteristic information.
In a possible implementation manner, the step of determining at least one tag from the preset tag set as tag information of the multimedia resource according to the target feature information includes:
and inputting the target characteristic information into a target full-connection layer, and performing classification processing to obtain the tag information.
In one possible implementation manner, the step of obtaining tag feature information corresponding to the preset tag set includes:
acquiring label correlation information between every two labels in the preset label set and weight information of the target full-connection layer;
taking the weight information as tag characteristic description information;
and taking the tag correlation information and the tag characteristic description information as tag characteristic information corresponding to the preset tag set.
In one possible implementation manner, the tag classification method further includes:
acquiring a plurality of sample multimedia resources and corresponding sample labels; the plurality of sample multimedia resources comprise a plurality of corresponding sample images and a plurality of sample texts;
inputting the plurality of sample images and the plurality of sample texts into a preset feature extraction model, and performing feature extraction processing to obtain first sample image-text characteristic information;
inputting the first sample image-text characteristic information into a preset full-connection layer, and performing classification processing to obtain a first prediction tag;
acquiring first loss information according to the sample tag and the first prediction tag;
training the preset feature extraction model and the preset full-connection layer according to the first loss information until the first loss information meets preset conditions, and obtaining the multi-mode feature extraction model and the target full-connection layer.
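As an illustration of this first training stage, the following is a hedged PyTorch sketch; the use of BCEWithLogitsLoss as the first loss information and the Adam optimizer are assumptions, since the disclosure does not name a specific loss or optimizer:

```python
import torch
from torch import nn

# Hypothetical first-stage loop: the preset feature extraction model and
# preset full-connection layer are optimized jointly on the sample data.
def train_stage_one(extractor, fc, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(list(extractor.parameters()) + list(fc.parameters()), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()                  # multi-label objective (assumed)
    for _ in range(epochs):
        for images, texts, sample_tags in loader:     # sample_tags: (B, n) multi-hot
            logits = fc(extractor(images, texts))     # first prediction tags
            loss = loss_fn(logits, sample_tags.float())   # first loss information
            opt.zero_grad()
            loss.backward()
            opt.step()
    return extractor, fc   # the multi-modal model and the target FC layer
```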
In one possible implementation manner, after the step of acquiring the plurality of sample multimedia resources and the corresponding sample tags, the tag classification method further includes:
inputting the plurality of sample images and the plurality of sample texts into the multi-mode feature extraction model, and performing feature extraction processing to obtain second sample image-text feature information;
inputting the tag characteristic information into a preset graph convolutional network, and performing tag characteristic correlation processing to obtain sample tag characteristic description information;
performing feature fusion processing on the second sample image-text feature information and the sample tag feature description information to obtain sample feature information;
inputting the sample characteristic information into the target full-connection layer, and performing classification processing to obtain a second prediction tag;
acquiring second loss information according to the sample tag and the second prediction tag;
training the preset graph convolution network according to the second loss information until the second loss information meets preset conditions, and obtaining the graph convolution network.
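Correspondingly, the second training stage can be sketched as follows, again under assumed choices of loss and optimizer; the extractor and target full-connection layer from the first stage are frozen and only the preset graph convolution network is updated:

```python
import torch
from torch import nn

# Hypothetical second-stage loop; fuse is a stand-in for the feature
# fusion step between sample features and sample tag descriptions.
def train_stage_two(extractor, target_fc, gcn, fuse, tag_features, loader,
                    epochs=10, lr=1e-4):
    for module in (extractor, target_fc):
        for p in module.parameters():
            p.requires_grad_(False)                   # keep stage-one weights fixed
    opt = torch.optim.Adam(gcn.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()                  # assumption, as above
    for _ in range(epochs):
        for images, texts, sample_tags in loader:
            feats = extractor(images, texts)          # second sample image-text feats
            tag_desc = gcn(tag_features)              # sample tag feature descriptions
            logits = target_fc(fuse(feats, tag_desc)) # second prediction tags
            loss = loss_fn(logits, sample_tags.float())   # second loss information
            opt.zero_grad()
            loss.backward()
            opt.step()
    return gcn
```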
According to a second aspect of embodiments of the present disclosure, there is provided a tag classification apparatus for a multimedia resource, including:
the model input information acquisition module is configured to acquire a target image and a target text corresponding to the multimedia resource to be processed and tag characteristic information corresponding to a preset tag set;
the target image-text characteristic information acquisition module is configured to input the target image and the target text into a multi-mode characteristic extraction model to perform characteristic extraction processing to obtain target image-text characteristic information of the multimedia resource to be processed;
the target tag feature description information acquisition module is configured to input the tag feature information into a graph convolution network to perform tag feature correlation processing to obtain target tag feature description information;
the target characteristic information acquisition module is configured to perform characteristic fusion processing on the target image-text characteristic information and the target tag characteristic description information to obtain target characteristic information;
and the tag information acquisition module is configured to determine at least one tag from the preset tag set as tag information of the multimedia resource according to the target feature information.
In one possible implementation manner, the multi-modal feature extraction model comprises an image feature extraction module, a text feature extraction module and a feature fusion module; the target image-text characteristic information acquisition module comprises:
a target image feature information obtaining unit configured to perform image feature extraction processing by inputting the target image into the image feature extraction module, to obtain target image feature information;
the target text feature information acquisition unit is configured to input the target text into the text feature extraction module, and perform text feature extraction processing to obtain target text feature information;
the target image-text characteristic information acquisition unit is configured to input the target image characteristic information and the target text characteristic information into the characteristic fusion module for characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation manner, the image feature extraction module includes a convolution module, a first downsampling module, a first full-connection layer, a second downsampling module, and a second full-connection layer; the target image feature information acquisition unit includes:
the initial image characteristic information acquisition unit is configured to input the target image into the convolution module for characteristic extraction processing to obtain initial image characteristic information;
the first image characteristic information acquisition unit is configured to input the initial image characteristic information into the first downsampling module to perform downsampling processing to obtain first image characteristic information of a first scale;
a second image feature information obtaining unit configured to perform downsampling processing by inputting the first image feature information into the second downsampling module to obtain second image feature information of a second scale;
a third image feature information obtaining unit configured to perform feature length adjustment processing to obtain third image feature information with a preset length by inputting the first image feature information into the first full-connection layer;
a fourth image feature information obtaining unit configured to perform feature length adjustment processing to obtain fourth image feature information with a preset length by inputting the second image feature information into the second full-connection layer;
a target image feature information determination unit configured to take the third image feature information and the fourth image feature information as the target image feature information.
In one possible implementation manner, the text feature extraction module includes a first text feature extraction unit, a third fully-connected layer, a second text feature extraction unit, and a fourth fully-connected layer; the target text feature information acquisition unit includes:
a first text feature information obtaining unit configured to perform text feature extraction processing by inputting the target text into the first text feature extracting unit, to obtain first text feature information;
a second text feature information obtaining unit configured to perform text feature extraction processing by inputting the first text feature information into the second text feature extraction unit, to obtain second text feature information;
a third text feature information obtaining unit configured to perform feature length adjustment processing to obtain third text feature information with a preset length by inputting the first text feature information into the third full connection layer;
a fourth text feature information obtaining unit configured to perform feature length adjustment processing to obtain fourth text feature information with a preset length by inputting the second text feature information into the fourth full connection layer;
and a target text feature information determination unit configured to take the third text feature information and the fourth text feature information as the target text feature information.
In one possible implementation manner, the feature fusion module includes a first feature fusion module and a second feature fusion module; the target image-text characteristic information acquisition unit comprises:
the first image-text characteristic information acquisition unit is configured to input the third image characteristic information and the third text characteristic information into the first characteristic fusion module for image-text characteristic fusion processing to obtain first image-text characteristic information;
the first target image-text characteristic information acquisition unit is configured to input the fourth image characteristic information, the fourth text characteristic information and the first image-text characteristic information into the second characteristic fusion module for image-text characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation, the graph convolution network includes a first graph convolution module and a second graph convolution module; the target tag characteristic description information acquisition module comprises:
the to-be-processed tag feature description information acquisition unit is configured to input the tag feature information into the first graph convolution module to perform tag feature correlation processing to obtain to-be-processed tag feature description information;
And the target tag characteristic description information acquisition unit is configured to input the tag characteristic description information to be processed into the second graph convolution module, and perform tag characteristic correlation processing to obtain the target tag characteristic description information.
In one possible implementation, the tag classification apparatus further includes:
the second image-text characteristic information acquisition module is configured to perform characteristic fusion processing on the first image-text characteristic information and the label characteristic description information to be processed to obtain second image-text characteristic information;
the target image-text characteristic information acquisition unit further comprises:
the second target image-text characteristic information acquisition unit is configured to input the fourth image characteristic information, the fourth text characteristic information and the second image-text characteristic information into the second characteristic fusion module for image-text characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation manner, the tag information obtaining module includes:
and the tag information acquisition unit is configured to input the target characteristic information into a target full-connection layer for classification processing to obtain the tag information.
In one possible implementation manner, the model input information acquisition module includes:
the tag correlation information and weight information acquisition unit is configured to acquire tag correlation information between every two tags in the preset tag set and weight information of the target full-connection layer;
a tag feature description information acquiring unit configured to perform taking the weight information as tag feature description information;
and the tag characteristic information obtaining unit is configured to take the tag correlation information and the tag characteristic description information as the tag characteristic information corresponding to the preset tag set.
In one possible implementation, the tag classification apparatus further includes:
a training data acquisition module configured to perform acquiring a plurality of sample multimedia resources and corresponding sample tags; the plurality of sample multimedia resources comprise a plurality of corresponding sample images and a plurality of sample texts;
the first sample image-text characteristic information acquisition module is configured to input the plurality of sample images and the plurality of sample texts into a preset characteristic extraction model, and perform characteristic extraction processing to obtain first sample image-text characteristic information;
the first prediction tag acquisition module is configured to input the first sample image-text characteristic information into a preset full-connection layer for classification processing to obtain a first prediction tag;
a first loss information acquisition module configured to perform acquisition of first loss information according to the sample tag and the first prediction tag;
and the first training module is configured to train the preset feature extraction model and the preset full-connection layer according to the first loss information until the first loss information meets preset conditions, so as to obtain the multi-mode feature extraction model and the target full-connection layer.
In one possible implementation, the tag classification apparatus further includes:
the second sample image-text characteristic information acquisition module is configured to input the plurality of sample images and the plurality of sample texts into the multi-mode characteristic extraction model, and perform characteristic extraction processing to obtain second sample image-text characteristic information;
the sample tag characteristic description information acquisition module is configured to input the tag characteristic information into a preset graph convolutional network to perform tag characteristic correlation processing to obtain sample tag characteristic description information;
the sample characteristic information acquisition module is configured to perform characteristic fusion processing on the second sample image-text characteristic information and the sample tag characteristic description information to obtain sample characteristic information;
the second prediction tag acquisition module is configured to input the sample characteristic information into the target full-connection layer for classification processing to obtain a second prediction tag;
a second loss information acquisition module configured to perform acquisition of second loss information according to the sample tag and the second prediction tag;
and the second training module is configured to train the preset graph convolution network according to the second loss information until the second loss information meets preset conditions, so as to obtain the graph convolution network.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of any of the first aspects above.
According to a fourth aspect of the embodiments of the disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of the first aspects of the embodiments of the disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, cause the computer to perform the method of any one of the first aspects of embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the target image-text feature information of the multimedia resource to be processed is obtained through the multi-modal feature extraction model, so the scheme can be effectively applied to tag classification of multimedia resources; and by performing tag feature correlation processing through the graph convolution network and fusing the target image-text feature information with the target tag feature description information, the correlated fusion of the multi-modal features of the multimedia resource and the tag feature information is realized, so that the target feature information carries richer semantic expression and represents the content of the multimedia resource more accurately. That is, the degree of understanding of the content of the multimedia resource is improved, and thus the accuracy of tag classification for multimedia resources is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of an application environment, shown in accordance with an exemplary embodiment.
Fig. 2 is a flowchart illustrating a tag classification method for a multimedia asset according to an exemplary embodiment.
Fig. 3 is a schematic diagram illustrating a label classification model according to an exemplary embodiment.
Fig. 4 is a flowchart of a method for inputting a target image and a target text into a multi-modal feature extraction model for feature extraction processing to obtain target image-text feature information of a multimedia resource to be processed according to an exemplary embodiment.
Fig. 5 is a flowchart illustrating a method for acquiring tag characteristic information corresponding to a preset tag set according to an exemplary embodiment.
Fig. 6 is a schematic diagram illustrating a label classification model according to an exemplary embodiment.
Fig. 7 is a flowchart illustrating a method for inputting a target image into an image feature extraction module for image feature extraction processing to obtain target image feature information according to an exemplary embodiment.
Fig. 8 is a flowchart illustrating a method for inputting a target text into a text feature extraction module for text feature extraction processing to obtain target text feature information according to an exemplary embodiment.
Fig. 9 is a flowchart of a method for inputting feature information of a target image and feature information of a target text into a feature fusion module to perform feature fusion processing to obtain feature information of the target image and text according to an exemplary embodiment.
FIG. 10 is a flowchart illustrating a method for inputting tag characteristic information into a graph convolution network for tag characteristic correlation processing to obtain target tag characteristic description information, according to an exemplary embodiment.
Fig. 11 is a flowchart of a method for inputting feature information of a target image and feature information of a target text into a feature fusion module to perform feature fusion processing to obtain feature information of the target image and text according to an exemplary embodiment.
Fig. 12 is a schematic diagram illustrating a label classification model according to an exemplary embodiment.
FIG. 13 is a flowchart illustrating a method for training a multi-modal feature extraction model and a target full-connection layer, according to an example embodiment.
Fig. 14 is a schematic diagram of a pre-set feature extraction model and pre-set fully connected layers, according to an example embodiment.
FIG. 15 is a training flow diagram of a graph convolution network, according to an example embodiment.
Fig. 16 is a diagram illustrating an architecture of a preset tag classification model according to an exemplary embodiment.
Fig. 17 is a block diagram illustrating a tag classification apparatus of a multimedia asset according to an exemplary embodiment.
Fig. 18 is a block diagram illustrating an electronic device for tag classification of multimedia assets, according to an example embodiment.
Fig. 19 is a block diagram illustrating an electronic device for tag classification of multimedia assets, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment according to an exemplary embodiment, and as shown in fig. 1, the application environment may include a server 01 and a terminal 02.
In an alternative embodiment, the server 01 may be used for training the multi-modal feature extraction model and the graph convolutional network GCN (Graph Convolutional Network), or for training a tag classification model that may include the multi-modal feature extraction model, the graph convolution network and a target full-connection layer. Specifically, the server 01 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), and basic cloud computing services such as big data and artificial intelligence platforms.
In an alternative embodiment, the terminal 02 may cooperate with the server 01 to perform the tag classification method for multimedia resources, where the multi-modal feature extraction model and the graph convolution network used by the terminal 02 may be trained by the server 01 and then sent to the terminal 02. Specifically, the terminal 02 may include, but is not limited to, a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an augmented reality (AR)/virtual reality (VR) device, a smart wearable device and other types of electronic devices. Optionally, the operating system running on the electronic device may include, but is not limited to, an Android system, an iOS system, Linux, Windows and the like.
In addition, it should be noted that fig. 1 shows only one application environment of the tag classification method provided by the present disclosure. For example, the server 01 may itself perform the tag classification method for multimedia resources, or the terminal 02 may perform the training of the multi-modal feature extraction model and the graph convolution network. The present disclosure is not limited in this regard.
In the embodiment of the present disclosure, the server 01 and the terminal 02 may be directly or indirectly connected through a wired or wireless communication method, which is not limited herein.
Fig. 2 is a flowchart illustrating a tag classification method for a multimedia asset according to an exemplary embodiment. As shown in fig. 2, the following steps may be included.
In step S201, a target image and a target text corresponding to a multimedia resource to be processed and tag feature information corresponding to a preset tag set are obtained.
In this embodiment of the present disclosure, the preset tag set may be a set including a preset number of tags, and the preset number of tags may be obtained according to actual needs or statistics, so that the preset number of tags may be used as the preset tag set. Further, labels in the preset label set can be used for classifying labels for the multimedia resources to be processed.
In practical applications, the tag feature information may take the form of a directed graph over the tags in the preset tag set. The edges of the directed graph represent the correlations between tags, and the vertices are the individual tags in the preset tag set, each described by a feature vector. The weights of the edges may be obtained statistically, so that the feature vectors of the tags together with the edges and their weights form the directed graph, and the directed graph may be used as the tag feature information. The directed graph may be expressed in the form of a matrix or a vector, neither of which is limited by the present disclosure.
In one example, the preset tag set may be represented as [tag 1, tag 2, …, tag n], where n is the preset number and may be, for example, 100 or 284, which is not limited by the present disclosure.
In practical applications, a multimedia resource awaiting tag classification may be taken as the multimedia resource to be processed; for example, a short video awaiting tag classification may be taken as the multimedia resource to be processed. An image related to the multimedia resource to be processed can be obtained as the corresponding target image, and a text related to the multimedia resource to be processed can be obtained as the corresponding target text. The target image may be an image capable of characterizing the content of the multimedia resource to be processed. For example, when the multimedia resource to be processed is a short video, a cover image of the short video may be taken as the target image and text related to the short video as the target text. The text related to the short video may include text appearing in the short video, the title text of the short video, the description text of the short video, text corresponding to audio in the short video, and the like, which is not limited in this disclosure.
Optionally, an image related to the multimedia resource to be processed may be compressed into an image of a first preset pixel size in a preset format, the compressed image may be randomly cropped to obtain an image of a second preset pixel size in the preset format, and this cropped image may be used as the target image. In one example, the first preset pixel size may be 256×256, the second preset pixel size may be 224×224, and the preset format may be the RGB (red, green, blue) format. The present disclosure does not limit the first preset pixel size, the second preset pixel size or the preset format, as long as the target image satisfies the input requirements of the multi-modal feature extraction model.
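By way of illustration, this optional preprocessing can be expressed with torchvision transforms; the file path and the use of torchvision are assumptions for the sketch:

```python
from PIL import Image
from torchvision import transforms

# Resize to the first preset pixel size (256x256) in RGB, then randomly
# crop to the second preset pixel size (224x224), as described above.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),    # compress to 256x256
    transforms.RandomCrop(224),       # random crop to 224x224
    transforms.ToTensor(),            # RGB tensor in [0, 1]
])

cover = Image.open("cover.jpg").convert("RGB")   # hypothetical cover image path
target_image = preprocess(cover).unsqueeze(0)    # (1, 3, 224, 224) model input
```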
In step S203, the target image and the target text are input into a multi-mode feature extraction model, and feature extraction processing is performed to obtain target image-text feature information of the multimedia resource to be processed.
In the embodiments of the specification, the target image and the target text can be input into the multi-modal feature extraction model for feature extraction processing to obtain the target image-text feature information of the multimedia resource to be processed. For example, image features and text features can be extracted and then fused to obtain the target image-text feature information. The target image-text feature information may be a feature vector or a feature matrix, which is not limited by the present disclosure.
In step S205, the tag feature information is input into the graph convolution network, and tag feature correlation processing is performed, so as to obtain target tag feature description information.
In the embodiment of the present disclosure, the tag feature information may be input into a graph convolution network, and the tag feature correlation processing may be performed to obtain the target tag feature description information. For example, a directed graph of the labels in the preset label set may be input into a graph convolution network, and label feature correlation processing may be performed to obtain target label feature description information.
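As an illustration of the tag feature correlation processing, the following is a minimal two-layer graph convolution sketch in PyTorch, assuming the common propagation rule H' = ReLU(A·H·W); the disclosure does not fix the exact propagation rule:

```python
import torch
from torch import nn

# A is the (n, n) tag correlation matrix (the edges of the directed graph);
# H is the (n, d) matrix of tag feature descriptions (the vertices).
class TagGCN(nn.Module):
    def __init__(self, d, hidden=1024):
        super().__init__()
        self.w1 = nn.Linear(d, hidden, bias=False)
        self.w2 = nn.Linear(hidden, d, bias=False)

    def forward(self, H, A):
        H = torch.relu(self.w1(A @ H))   # first correlation-propagation layer
        return self.w2(A @ H)            # target tag feature descriptions, (n, d)
```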
In step S207, feature fusion processing is performed on the target image-text feature information and the target tag feature description information, so as to obtain target feature information.
In the embodiment of the specification, feature fusion processing can be performed on the target image-text feature information and the target tag feature description information to obtain target feature information. In one example, the target image-text feature information and the target tag feature description information may be multiplied to implement the feature fusion process, so as to obtain target feature information. The multiplication process may be a matrix multiplication process, which is not limited by the present disclosure.
The target feature information represents the fusion of the multi-modal target image-text feature information with the target tag feature description information obtained after tag feature correlation processing, and it can be used to characterize the tag information of the multimedia resource to be processed.
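For concreteness, the multiplication-based fusion in the example above can be sketched as follows; the shapes d=512 and n=100 are illustrative assumptions:

```python
import torch

# Matrix-multiplication fusion: a joint image-text feature times the
# transposed tag descriptors yields one fused response per preset tag.
d, n = 512, 100
joint = torch.randn(1, d)          # target image-text feature information
tag_desc = torch.randn(n, d)       # target tag feature description information
target_feature = joint @ tag_desc.T    # (1, n) target feature information
```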
In step S209, at least one tag is determined from the preset tag set as tag information of the multimedia resource according to the target feature information.
In the embodiments of the present disclosure, at least one tag may be determined from the preset tag set according to the target feature information as the tag information of the multimedia resource. In one example, when the at least one tag determined from the preset tag set according to the target feature information includes tag 1 and tag 2, the tag information may be represented as [1, 1, 0, …, 0], i.e., tag 1 and tag 2 in the preset tag set may be set to 1 and the other tags to 0. Alternatively, the tag information may simply be given as tag 1 and tag 2. The present disclosure does not limit the form of the tag information.
In one possible implementation, step S209 may include: inputting the target feature information into a target full-connection layer and performing classification processing to obtain the tag information. Performing the tag classification through the target full-connection layer can improve both the precision of the tag information and the efficiency of tag classification.
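A hedged sketch of this classification step, with an assumed 0.5 threshold and illustrative dimensions:

```python
import torch

# Sigmoid scores per tag from the target full-connection layer,
# thresholded into multi-hot tag information such as [1, 1, 0, ..., 0].
target_fc = torch.nn.Linear(512, 100)   # d=512 features -> n=100 preset tags
scores = torch.sigmoid(target_fc(torch.randn(1, 512)))
tag_info = (scores > 0.5).int()         # multi-hot tag information
if tag_info.sum() == 0:                 # ensure at least one tag is returned
    tag_info[0, scores.argmax()] = 1
```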
Alternatively, the multimedia resource to be processed may be tagged with the tag information, or recommendation of the multimedia resource, search of the multimedia resource, and the like may be performed based on the tag information.
The target image-text feature information of the multimedia resource to be processed is obtained through the multi-modal feature extraction model, so the method can be effectively applied to tag classification of multimedia resources. By performing tag feature correlation processing through the graph convolution network and fusing the target image-text feature information with the target tag feature description information, the correlated fusion of the multi-modal features of the multimedia resource and the tag feature information is realized, so that the target feature information carries richer semantic expression and represents the content of the multimedia resource more accurately. That is, the degree of understanding of the content of the multimedia resource is improved, and thus the accuracy of tag classification for multimedia resources is improved.
Fig. 3 is a schematic diagram illustrating a label classification model according to an exemplary embodiment. Fig. 4 is a flowchart of a method for inputting a target image and a target text into a multi-modal feature extraction model for feature extraction processing to obtain target image-text feature information of a multimedia resource to be processed according to an exemplary embodiment.
In one possible implementation, as shown in FIG. 3, the tag classification model may include a multi-modal feature extraction model and a graph convolution network. The multi-modal feature extraction model may include an image feature extraction module, a text feature extraction module and a feature fusion module.
Based on the multi-modal feature extraction model of fig. 3, as shown in fig. 4, in a possible implementation manner, the step S203 may include:
in step S401, inputting the target image into an image feature extraction module, and performing image feature extraction processing to obtain target image feature information;
in step S403, inputting the target text into a text feature extraction module, and performing text feature extraction processing to obtain target text feature information;
in step S405, the target image feature information and the target text feature information are input into a feature fusion module, and feature fusion processing is performed to obtain target image-text feature information.
In the embodiment of the specification, the target image G may be input into an image feature extraction module, and image feature extraction processing may be performed to obtain target image feature information; and inputting the target text T into a text feature extraction module, and performing text feature extraction processing to obtain target text feature information. Further, the target image characteristic information and the target text characteristic information can be input into a characteristic fusion module to be subjected to characteristic fusion processing, so that the target image-text characteristic information is obtained.
Alternatively, referring to fig. 3, a matrix multiplication module and a convolutional network may together serve as an image-text and tag fusion module. Correspondingly, the step S207 may include: inputting the target image-text feature information and the target tag feature description information into the matrix multiplication module and performing matrix multiplication to obtain image-text feature information to be processed; and inputting the image-text feature information to be processed into the convolutional network to adjust the feature length, so as to obtain target feature information with a preset length. The convolutional network may be a convolutional network of dimension d×1, where d may be the preset length, which is not limited by the present disclosure.
In one example, the image feature extraction module may be an image feature extraction neural network whose backbone may be the residual network ResNet-50, where ResNet-50 may include 50 convolutional layers and 4 downsampling modules (4 blocks); the text feature extraction module may be a text feature extraction neural network whose backbone may be a BERT (Bidirectional Encoder Representations from Transformers) network, and the BERT network may include 12 hidden layers; and the feature fusion module may be a neural network for feature fusion, for example, a network including an attention layer. None of these limit the present disclosure.
By providing a multi-modal feature extraction model comprising an image feature extraction module, a text feature extraction module and a feature fusion module, the multi-modal features of images and texts can be extracted and fused efficiently, improving the efficiency and accuracy of acquiring the target image-text feature information.
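As a rough skeleton of such a model, the following PyTorch sketch wires an image backbone, a text encoder stub and an attention-based fusion layer together; the dimensions, the attention fusion and the text_encoder argument are assumptions rather than the patented network:

```python
import torch
from torch import nn
from torchvision.models import resnet50

# ResNet-50 image branch, a stand-in text encoder (e.g. a 12-layer BERT)
# returning (B, L, text_dim) token features, and attention-based fusion.
class MultiModalExtractor(nn.Module):
    def __init__(self, text_encoder, d=512, text_dim=768):
        super().__init__()
        backbone = resnet50(weights=None)
        self.image_net = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.img_fc = nn.Linear(2048, d)       # align image feature length to d
        self.text_encoder = text_encoder       # stand-in for the BERT network
        self.txt_fc = nn.Linear(text_dim, d)   # align text feature length to d
        self.fusion = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, image, tokens):
        img = self.img_fc(self.image_net(image).flatten(1)).unsqueeze(1)  # (B, 1, d)
        txt = self.txt_fc(self.text_encoder(tokens))                      # (B, L, d)
        fused, _ = self.fusion(img, txt, txt)   # image query attends over text tokens
        return fused.squeeze(1)                 # target image-text feature, (B, d)
```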
Fig. 5 is a flowchart illustrating a method for acquiring tag characteristic information corresponding to a preset tag set according to an exemplary embodiment. As shown in fig. 5, in one possible implementation, the step S201 may include the following steps:
in step S501, label correlation information between every two labels in a preset label set and weight information of a target full-connection layer are obtained.
In the embodiments of the present disclosure, the tag correlation information between every two tags in the preset tag set may be obtained statistically. For example, the tag correlation information may be a square matrix formed by the conditional probabilities P between pairs of tags, for example, an n×n square matrix, where n may be the number of tags in the preset tag set. As an example, n may be 3, and the tags in the preset tag set may be basketball, boy and girl. The tag correlation information A_l may then be the 3×3 matrix

[1.00 0.85 0.06]
[0.07 1.00 0.20]
[0.01 0.50 1.00]

where the entry in row i and column j is P(tag_j | tag_i): P(basketball|basketball)=1, P(boy|basketball)=0.85 and P(girl|basketball)=0.06; P(basketball|boy)=0.07, P(boy|boy)=1 and P(girl|boy)=0.2; P(basketball|girl)=0.01, P(boy|girl)=0.5 and P(girl|girl)=1.
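One statistical way to compute such a conditional-probability matrix from multi-hot annotations is sketched below; the counting scheme is an assumption, since the disclosure only states that the correlations are obtained statistically:

```python
import torch

# labels: (num_samples, n), labels[s, i] = 1 if sample s carries tag i.
def tag_correlation(labels: torch.Tensor) -> torch.Tensor:
    co = labels.float().T @ labels.float()    # (n, n) co-occurrence counts
    counts = co.diagonal().clamp(min=1.0)     # occurrences of each tag i
    return co / counts.unsqueeze(1)           # row i, column j: P(tag_j | tag_i)
```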
In the embodiment of the present disclosure, the weight information of the target fully-connected layer may be obtained, for example, the weight information of the target fully-connected layer in fig. 3 may be obtained. The weight information may be a matrix of d×n, d may be a dimension of the target feature information, and n may be a number of tags in the preset tag set.
In step S503, the weight information is taken as tag feature description information;
in step S505, the tag correlation information and the tag characteristic description information are used as tag characteristic information corresponding to a preset tag set.
In the embodiments of the present specification, the weight information may be used as the tag feature description information H_l, and the tag correlation information together with the tag feature description information may be used as the tag feature information corresponding to the preset tag set. Accordingly, A_l can serve as the edges of the directed graph together with their corresponding weights, and H_l can serve as the vertices of the directed graph.
Using the conditional probabilities between tags as the tag correlation information and the weight information of the target full-connection layer as the tag feature description information realizes the directed-graph expression of the tags and provides the input of the graph convolution network. Moreover, because the weight information of the target full-connection layer is used as the tag feature description information, the tag feature description information carries semantics consistent with the actual data distribution, which avoids the bias that a single text modality would otherwise impose on the tag feature description, and allows the tag feature description information to be acquired more conveniently and efficiently.
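Assembling the two parts of the tag feature information can then be sketched as follows; train_labels is a hypothetical annotation matrix and the dimensions are illustrative:

```python
import torch

# Vertices H_l from the target FC layer's weights; edges A_l from the
# conditional-probability statistics computed on the annotations.
target_fc = torch.nn.Linear(512, 100)       # d=512, n=100
H_l = target_fc.weight.detach().clone()     # (n, d): one descriptor per tag
train_labels = (torch.rand(10000, 100) > 0.95).int()   # placeholder annotations
A_l = train_labels.float().T @ train_labels.float()
A_l = A_l / A_l.diagonal().clamp(min=1.0).unsqueeze(1) # row i: P(tag_j | tag_i)
tag_feature_info = (H_l, A_l)               # vertices and weighted edges
```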
Fig. 6 is a schematic diagram illustrating a label classification model according to an exemplary embodiment. As shown in fig. 6, in one possible implementation, the image feature extraction module may include a convolution module, a first downsampling module, a first full-connection layer, a second downsampling module, and a second full-connection layer; the text feature extraction module may include a first text feature extraction unit, a third full-connection layer, a second text feature extraction unit, and a fourth full-connection layer; the feature fusion module may include a first feature fusion module and a second feature fusion module; and the graph convolution network can include a first graph convolution module and a second graph convolution module. Optionally, as shown in fig. 6, the tag classification model may further include a first convolutional network, a second convolutional network, and a target full-connection layer. The present disclosure does not limit the structure of the tag classification model.
Fig. 7 is a flowchart illustrating a method for inputting a target image into an image feature extraction module for image feature extraction processing to obtain target image feature information according to an exemplary embodiment. As shown in fig. 7, in one possible implementation, the step S401 may include the following steps:
In step S701, the target image is input to the convolution module, and feature extraction processing is performed to obtain initial image feature information.
In the embodiment of the present disclosure, the target image may be input to a convolution module, and feature extraction processing may be performed to obtain initial image feature information. In one example, the convolution module may be 50 convolution layers of Resnet-50. This is merely an example and is not intended to limit the present disclosure.
In step S703, the initial image feature information is input to a first downsampling module, and downsampling processing is performed to obtain first image feature information of a first scale;
in step S705, the first image feature information is input to a second downsampling module, and downsampling processing is performed to obtain second image feature information of a second scale.
In the embodiment of the present disclosure, the initial image feature information may be input to a first downsampling module to perform downsampling, for example, 1/2 downsampling, to obtain first image feature information of a first scale; and the first image characteristic information can be input into a second downsampling module to perform downsampling processing, for example, 1/2 downsampling processing, so as to obtain second image characteristic information of a second scale.
The first scale may refer to the feature length of the first image feature information, and the second scale may refer to the feature length of the second image feature information; the second scale may be larger than the first scale. The first scale and the second scale are not limited in this disclosure.
In step S707, inputting the first image feature information into the first full-connection layer, and performing feature length adjustment processing to obtain third image feature information with a preset length;
in step S709, inputting the second image feature information into the second full-connection layer, and performing feature length adjustment processing to obtain fourth image feature information with a preset length;
in step S711, the third image feature information and the fourth image feature information are set as target image feature information.
In practical applications, so that the feature fusion module can fuse the target image feature information with the target text feature information, the feature length of the target image feature information may be set to be the same as that of the target text feature information. The preset length may be 512, which is not limited by the present disclosure.
In this embodiment of the present disclosure, as shown in fig. 6, the first image feature information may be input into the first fully-connected layer and feature length adjustment may be performed to obtain third image feature information of the preset length; likewise, the second image feature information may be input into the second fully-connected layer for feature length adjustment processing to obtain fourth image feature information of the preset length. The third image feature information and the fourth image feature information can then be taken as the target image feature information.
By configuring the image feature extraction module to include a convolution module, a first downsampling module, a first fully-connected layer, a second downsampling module, and a second fully-connected layer, deep extraction of image features can be achieved, making the target image feature information more accurate.
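The two-scale image branch described above might be sketched as follows (a minimal stand-in: the small convolution stem replaces ResNet-50, and all channel sizes are assumptions):

```python
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    """Sketch of the image branch of fig. 6; channel sizes are assumed."""
    def __init__(self, preset_len=512):
        super().__init__()
        self.conv = nn.Sequential(               # stand-in for the ResNet-50 layers
            nn.Conv2d(3, 256, 7, stride=2, padding=3), nn.ReLU())
        self.down1 = nn.Conv2d(256, 256, 3, stride=2, padding=1)  # 1/2 downsampling
        self.down2 = nn.Conv2d(256, 512, 3, stride=2, padding=1)  # 1/2 downsampling
        self.pool = nn.AdaptiveAvgPool2d(1)      # mean pooling before the FC layers
        self.fc1 = nn.Linear(256, preset_len)    # first fully-connected layer
        self.fc2 = nn.Linear(512, preset_len)    # second fully-connected layer

    def forward(self, image):
        x = self.conv(image)                     # initial image feature information
        f1 = self.down1(x)                       # first image feature info (first scale)
        f2 = self.down2(f1)                      # second image feature info (second scale)
        third = self.fc1(self.pool(f1).flatten(1))   # third image feature information
        fourth = self.fc2(self.pool(f2).flatten(1))  # fourth image feature information
        return third, fourth

third, fourth = ImageFeatureExtractor()(torch.randn(2, 3, 224, 224))
```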
Fig. 8 is a flowchart illustrating a method for inputting a target text into a text feature extraction module for text feature extraction processing to obtain target text feature information according to an exemplary embodiment. As shown in fig. 8, in one possible implementation, the step S403 may include the following steps:
in step S801, a target text is input into a first text feature extraction unit, and text feature extraction processing is performed to obtain first text feature information;
in step S803, the first text feature information is input to the second text feature extraction unit, and text feature extraction processing is performed to obtain second text feature information.
In the embodiment of the present disclosure, the target text may be input to the first text feature extraction unit to obtain first text feature information, and the first text feature information may be input to the second text feature extraction unit to obtain second text feature information. In one example, the first and second text feature extraction units may each be a preset number of hidden layers of a BERT network, where the preset number may be 3; the disclosure is not limited thereto. In that case, the first text feature information may be the feature output at the [CLS] tag symbol of the first text feature extraction unit, and the second text feature information may be the feature output at the [CLS] tag symbol of the second text feature extraction unit. In one example, the feature length of the first and second text feature information may be 768, which is not limited by the present disclosure.
Alternatively, when the first text feature extraction unit consists of a preset number of hidden layers of a BERT network, the target text may first be input into the input layer of the BERT network, so that the output of the input layer serves as the input of the first text feature extraction unit.
In step S805, inputting the first text feature information into a third full connection layer, and performing feature length adjustment processing to obtain third text feature information with a preset length;
in step S807, the second text feature information is input into the fourth full connection layer, and feature length adjustment processing is performed to obtain fourth text feature information with a preset length;
in step S809, the third text feature information and the fourth text feature information are set as target text feature information.
In the embodiment of the specification, feature length adjustment processing may be performed on the first text feature information and the second text feature information respectively to obtain third text feature information of the preset length and fourth text feature information of the preset length, so that the target text feature information has the preset length and is suitable for the fusion processing in the feature fusion module.
By configuring the text feature extraction module to include a first text feature extraction unit, a third fully-connected layer, a second text feature extraction unit, and a fourth fully-connected layer, deep extraction of text features is achieved, improving the extraction efficiency and accuracy of the target text feature information.
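The text branch might be sketched as follows (a generic transformer encoder stands in for the BERT hidden layers, and position 0 stands in for the [CLS] symbol; all of this is an illustrative assumption):

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """Sketch of the text branch of fig. 6; layer counts and sizes assumed."""
    def __init__(self, hidden=768, preset_len=512, layers_per_unit=3):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.unit1 = nn.TransformerEncoder(make(), num_layers=layers_per_unit)
        self.unit2 = nn.TransformerEncoder(make(), num_layers=layers_per_unit)
        self.fc3 = nn.Linear(hidden, preset_len)   # third fully-connected layer
        self.fc4 = nn.Linear(hidden, preset_len)   # fourth fully-connected layer

    def forward(self, token_embeddings):           # (batch, seq_len, hidden)
        h1 = self.unit1(token_embeddings)
        h2 = self.unit2(h1)
        first = h1[:, 0]                            # [CLS]-position feature, length 768
        second = h2[:, 0]
        return self.fc3(first), self.fc4(second)   # third and fourth text feature info

third_t, fourth_t = TextFeatureExtractor()(torch.randn(2, 16, 768))
```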
Fig. 9 is a flowchart of a method for inputting feature information of a target image and feature information of a target text into a feature fusion module to perform feature fusion processing to obtain feature information of the target image and text according to an exemplary embodiment. As shown in fig. 9, in one possible implementation, the step S405 may include the following steps:
in step S901, inputting the third image feature information and the third text feature information into a first feature fusion module, and performing image-text feature fusion processing to obtain first image-text feature information;
in step S903, the fourth image feature information, the fourth text feature information, and the first image-text feature information are input into the second feature fusion module, and image-text feature fusion processing is performed to obtain target image-text feature information.
In the embodiment of the specification, the third image feature information and the third text feature information can be input into the first feature fusion module to perform image-text feature fusion processing to obtain first image-text feature information; and the fourth image characteristic information, the fourth text characteristic information and the first image-text characteristic information can be input into a second characteristic fusion module to be subjected to image-text characteristic fusion processing, so that target image-text characteristic information is obtained. The first feature fusion module and the second feature fusion module may be multi-head attention layers, which is not limited in this disclosure.
By configuring the feature fusion module to include a first feature fusion module and a second feature fusion module, visual and textual features can be fused layer by layer, from shallow to deep semantics. This fully realizes the fusion of multi-modal features and ensures the effective expression of the image and text features at each scale, making the target image-text feature information more accurate.
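One way such a fusion step could look with a multi-head attention layer is sketched below (how the per-modality vectors are stacked into a sequence and pooled back is an assumption):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of one image-text fusion module as a multi-head attention layer."""
    def __init__(self, preset_len=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(preset_len, heads, batch_first=True)

    def forward(self, *features):                 # each input: (batch, preset_len)
        seq = torch.stack(features, dim=1)        # (batch, k, preset_len)
        fused, _ = self.attn(seq, seq, seq)       # self-attention across modalities
        return fused.mean(dim=1)                  # pooled image-text feature

# First fusion: third image + third text feature information.
first_it = FeatureFusion()(torch.randn(2, 512), torch.randn(2, 512))
# Second fusion additionally receives the first image-text feature information.
target_it = FeatureFusion()(torch.randn(2, 512), torch.randn(2, 512), first_it)
```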
FIG. 10 is a flowchart illustrating a method for inputting tag characteristic information into a graph convolution network for tag characteristic correlation processing to obtain target tag characteristic description information, according to an exemplary embodiment. As shown in fig. 10, in one possible implementation, the step S205 may include the following steps:
in step S1001, the tag feature information is input to the first graph convolution module, and tag feature correlation processing is performed, so as to obtain tag feature description information to be processed.
In this embodiment of the present disclosure, the tag feature information may be input to the first graph convolution module, and the tag feature correlation processing may be performed to obtain tag feature description information to be processed.
In one example, in the case where the tag feature information includes the tag correlation information and the tag feature description information, the tag feature correlation processing may be implemented by the following formula (1) to obtain the to-be-processed tag feature description information H_{l+1}:

H_{l+1} = A_l · H_l · W_l    (1)

where A_l is the tag correlation information, H_l is the tag feature description information, and W_l is a parameter of the first graph convolution module. In one example, W_l ∈ R^{d×d}, where d may be the preset length described above, e.g., 512.
In step S1003, the tag feature description information to be processed is input to the second graph convolution module, and tag feature correlation processing is performed, so as to obtain target tag feature description information.
In this embodiment of the present disclosure, the to-be-processed tag feature description information may be input to the second graph convolution module, and the tag feature correlation processing may be performed to obtain the target tag feature description information.
By arranging the first graph convolution module and the second graph convolution module to perform tag feature correlation processing on the tag correlation information and the tag feature description information from shallow to deep, more accurate target tag feature description information can be obtained, improving the extraction efficiency and precision of the target tag feature description information.
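A minimal sketch of the two graph convolution steps, directly implementing formula (1) (the ReLU nonlinearity and the weight initialization are assumptions not stated in the text):

```python
import torch
import torch.nn as nn

class GraphConvModule(nn.Module):
    """One graph convolution step: H_{l+1} = A_l * H_l * W_l, with W_l in R^{d x d}."""
    def __init__(self, d=512):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d, d) * 0.01)

    def forward(self, A, H):                     # A: (n, n), H: (n, d)
        return torch.relu(A @ H @ self.W)        # nonlinearity is an assumption

n, d = 100, 512
A_l, H_l = torch.rand(n, n), torch.randn(n, d)
gcn1, gcn2 = GraphConvModule(d), GraphConvModule(d)
H_pending = gcn1(A_l, H_l)                       # to-be-processed tag feature description
H_target = gcn2(A_l, H_pending)                  # target tag feature description
```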
Fig. 11 is a flowchart of a method for inputting feature information of a target image and feature information of a target text into a feature fusion module to perform feature fusion processing to obtain feature information of the target image and text according to an exemplary embodiment. As shown in fig. 11, after step S901, the tag classification method may further include the steps of:
In step S1101, feature fusion processing is performed on the first graphic feature information and the tag feature description information to be processed, so as to obtain second graphic feature information.
In one example, as shown in fig. 6, the first image-text feature information and the to-be-processed tag feature description information may be matrix-multiplied to obtain the first to-be-processed image-text feature information, which can then be input into the first convolution network for convolution processing to obtain the second image-text feature information.
Accordingly, step S903 may include:
in step S1103, the fourth image feature information, the fourth text feature information, and the second image-text feature information are input into the second feature fusion module, and image-text feature fusion processing is performed to obtain target image-text feature information. The implementation of this step may refer to step S901, which is not described herein.
Further, step S207 may include: performing matrix multiplication on the target image-text feature information and the target tag feature description information to obtain fourth to-be-processed image-text feature information, and inputting the fourth to-be-processed image-text feature information into the second convolution network for convolution processing to obtain target feature information of the preset length.
Through the multi-scale fusion of the multi-modal features with the graph convolution features, full fusion of the multi-modal features with the inter-tag correlations is ensured, so that the high-level features have richer semantic expression; that is, the obtained target image-text feature information can carry richer semantics. The method can therefore be effectively used for label classification of content-rich multimedia resources, improving the accuracy of multimedia resource label classification.
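The final fusion of step S207 might look like the following sketch (the exact shape of the second convolution network is not specified, so a 1-D convolution that collapses the n tag scores back to the preset length is assumed):

```python
import torch
import torch.nn as nn

batch, n, d = 2, 100, 512
target_it = torch.randn(batch, d)        # target image-text feature information
H_target = torch.randn(n, d)             # target tag feature description information

# Matrix multiplication of the two, one score per tag.
pending = target_it @ H_target.T         # fourth to-be-processed info, (batch, n)

# Assumed stand-in for the second convolution network.
conv_net = nn.Sequential(nn.Conv1d(1, d, kernel_size=n), nn.Flatten())
target_feature = conv_net(pending.unsqueeze(1))   # target feature info, (batch, 512)
```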
In one possible implementation, to further deepen the fusion of image features and text features, the depth of the multi-modal feature extraction model and the depth of the graph convolution network may be increased. For example, as shown in fig. 12, both depths may be set to 4, i.e., the fusion of image features, text features, and tag feature information is performed 4 times. The accuracy of multimedia resource label classification can be further improved through these 4 rounds of shallow-to-deep fusion. The present disclosure does not limit the depth of the multi-modal feature extraction model or the depth of the graph convolution network.
As an example, as shown in fig. 12, the image feature extraction module may include a convolution module, a first downsampling module, a first full connection layer, a second downsampling module, a second full connection layer, a third downsampling module, a fifth full connection layer, a fourth downsampling module, and a seventh full connection layer. The convolution module, the first downsampling module, the second downsampling module, the third downsampling module and the fourth downsampling module can be sequentially connected; the first down-sampling module can be connected with the first feature fusion module through a first full-connection layer, the second down-sampling module can be connected with the second feature fusion module through a second full-connection layer, the third down-sampling module can be connected with the third feature fusion module through a fifth full-connection layer, and the fourth down-sampling module can be connected with the fourth feature fusion module through a seventh full-connection layer.
As shown in fig. 12, the text feature extraction module may include a first text feature extraction unit, a second text feature extraction unit, a third text feature extraction unit, and a fourth text feature extraction unit connected in order; the first text feature extraction unit can be connected with the first feature fusion module through a third full-connection layer, the second text feature extraction unit can be connected with the second feature fusion module through a fourth full-connection layer, the third text feature extraction unit can be connected with the third feature fusion module through a sixth full-connection layer, and the fourth text feature extraction unit can be connected with the fourth feature fusion module through an eighth full-connection layer.
As shown in fig. 12, the feature fusion modules may include a first feature fusion module, a second feature fusion module, a third feature fusion module, and a fourth feature fusion module; the graph convolution network may include a first graph convolution module, a second graph convolution module, a third graph convolution module, and a fourth graph convolution module. Optionally, as shown in fig. 12, the label classification model may further include a first convolution network, a second convolution network, a third convolution network, a fourth convolution network, and the target fully-connected layer. The present disclosure does not limit the structure of the label classification model.
Referring to fig. 12, the label classification model further includes matrix multiplication modules. The first graph convolution module is connected with the first feature fusion module through a first matrix multiplication module, and the first matrix multiplication module can be connected with the first convolution network; that is, the first graph convolution module, the first feature fusion module, and the first convolution network are each connected with the first matrix multiplication module. Similarly, the second graph convolution module, the second feature fusion module, and the second convolution network are each connected with the second matrix multiplication module; the third graph convolution module, the third feature fusion module, and the third convolution network are each connected with the third matrix multiplication module; and the fourth graph convolution module, the fourth feature fusion module, and the fourth convolution network are each connected with the fourth matrix multiplication module. There are 4 matrix multiplication modules in total, namely, from top to bottom, the first matrix multiplication module, the second matrix multiplication module, the third matrix multiplication module, and the fourth matrix multiplication module.
As shown in fig. 12, the target image G may be input to a convolution module, and outputs of the first downsampling module, the second downsampling module, the third downsampling module, and the fourth downsampling module may be respectively first image feature information, second image feature information, fifth image feature information, and sixth image feature information. The output of the first full connection layer, the second full connection layer, the fifth full connection layer and the seventh full connection layer is respectively third image characteristic information, fourth image characteristic information, seventh image characteristic information and eighth image characteristic information.
The output of the first text feature extraction unit, the second text feature extraction unit, the third text feature extraction unit and the fourth text feature extraction unit is respectively first text feature information, second text feature information, fifth text feature information and sixth text feature information; the output of the third full connection layer, the fourth full connection layer, the sixth full connection layer and the eighth full connection layer is respectively third text characteristic information, fourth text characteristic information, seventh text characteristic information and eighth text characteristic information.
In the examples of the present specification, A_l and H_l are input into the first graph convolution module, and the outputs of the first graph convolution module, the second graph convolution module, the third graph convolution module, and the fourth graph convolution module are respectively the first tag feature description information, the second tag feature description information, the third tag feature description information, and the fourth tag feature description information.
The output of the first feature fusion module, the second feature fusion module, the third feature fusion module and the fourth feature fusion module is respectively first image-text feature information, third image-text feature information to be processed, sixth image-text feature information to be processed and ninth image-text feature information to be processed;
The outputs of the first matrix multiplication module, the second matrix multiplication module, the third matrix multiplication module and the fourth matrix multiplication module are respectively first graphic characteristic information to be processed, fourth graphic characteristic information to be processed, seventh graphic characteristic information to be processed and tenth graphic characteristic information to be processed;
the output of the first convolution network, the second convolution network, the third convolution network and the fourth convolution network is respectively second graphic characteristic information to be processed (the second graphic characteristic information), fifth graphic characteristic information to be processed, eighth graphic characteristic information to be processed and target characteristic information.
The processing procedure of fig. 12 may be referred to in the relevant content of fig. 6-11, and will not be described here again.
The feature lengths (first scale, second scale, third scale and fourth scale) corresponding to the first image feature information, the second image feature information, the fifth image feature information and the sixth image feature information may be 256, 512, 1024 and 2048 respectively; correspondingly, the rows and columns of the weight information corresponding to the first full connection layer, the second full connection layer, the fifth full connection layer and the seventh full connection layer may be 256×512, 512×512, 1024×512 and 2048×512, respectively. The characteristic length of the output of the third full connection layer, the fourth full connection layer, the sixth full connection layer, and the eighth full connection layer may be 512. Therefore, the feature length of the target image feature information input into the feature fusion module is consistent with the length of the target text feature information.
Optionally, the first downsampling module may be connected to the first fully-connected layer through a first pooling layer; correspondingly, the second downsampling module may be connected to the second fully-connected layer through a second pooling layer, the third downsampling module to the fifth fully-connected layer through a third pooling layer, and the fourth downsampling module to the seventh fully-connected layer through a fourth pooling layer. The first, second, third, and fourth pooling layers may perform mean pooling, which is not limited in this disclosure.
The above describes label classification using the trained multi-modal feature extraction model, target fully-connected layer, and graph convolution network. The training of the multi-modal feature extraction model, the target fully-connected layer, and the graph convolution network, i.e., the training of the label classification model, is described below with reference to figs. 13 and 15. When training the label classification model, the preset feature extraction model and the preset fully-connected layer may first be trained, based on the architecture of the preset feature extraction model and preset fully-connected layer shown in fig. 14, to obtain the multi-modal feature extraction model and the target fully-connected layer. Then, a preset label classification model as shown in fig. 16 may be constructed from the trained multi-modal feature extraction model and target fully-connected layer; these may be fixed while the preset graph convolution network is trained, to obtain the graph convolution network. Through these two stages of training, the label classification model is obtained.
FIG. 13 is a flowchart illustrating a method for training the multi-modal feature extraction model and the target fully-connected layer, according to an exemplary embodiment. As shown in fig. 13, in one possible implementation, the following steps may be included:
in step S1301, a plurality of sample multimedia assets and corresponding sample tags are acquired.
In the embodiment of the present disclosure, a plurality of sample multimedia resources may be obtained, and a sample tag may be labeled for each sample multimedia resource, so that each sample multimedia resource has a corresponding sample tag. The sample tag may be at least one tag in the preset tag set. For example, a sample tag of [1, 0, 1, …, 0] may indicate that the sample tags are tag 1 and tag 3 in the preset tag set.
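A small sketch of this multi-hot encoding (the tag names are hypothetical):

```python
# Multi-hot encoding of a sample tag over a preset tag set of n tags.
tag_set = ["tag 1", "tag 2", "tag 3"]                         # illustrative names
sample_tags = {"tag 1", "tag 3"}
multi_hot = [1 if t in sample_tags else 0 for t in tag_set]   # -> [1, 0, 1]
```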
The plurality of sample multimedia resources may include a corresponding plurality of sample images and a plurality of sample texts. In practical applications, an image capable of characterizing the content of each sample multimedia resource may be obtained as its sample image. For example, the cover image of a sample short video may be acquired as the sample image corresponding to that short video. Likewise, text associated with each sample multimedia resource may be obtained as its sample text. The specific implementation may refer to step S201 above and is not repeated here.
In step S1303, a plurality of sample images and a plurality of sample texts are input into a preset feature extraction model, and feature extraction processing is performed to obtain first sample graphic feature information.
In the embodiment of the present disclosure, the implementation manner of this step may refer to the above step S203, which is not described herein.
In step S1305, the first sample graphic feature information is input into a preset full-connection layer, and classification processing is performed to obtain a first prediction tag.
In this embodiment of the present disclosure, the output dimension of the preset fully-connected layer may be n, where n may be the number of tags in the preset tag set, and its weights may be a d×n matrix, with d being the preset length; this is not limited by the disclosure. The first sample image-text feature information can be input into the preset fully-connected layer for classification processing to obtain the first prediction tag.
In step S1307, first loss information is acquired from the sample tag and the first prediction tag.
In the embodiment of the present disclosure, the difference information between the sample tag and the first prediction tag may be used as the first loss information, which is not limited in this disclosure.
In one example, the first loss information Loss may be obtained according to the following formula (2):

Loss = -ln Σ_i (y_i), where t_i = 1    (2)

where y_i is the probability value of the i-th tag in the first prediction tag, and t_i is the i-th tag in the sample tag. The sample tag corresponds to a tag set [tag 1, tag 2, …, tag n], where n may be greater than 1 and i ranges over [1, n]. The summation runs over the tags with t_i = 1; for example, if the sample tag is [1, 0, …, 1], the corresponding sample tags include tag 1 and tag n. That is, the first loss information of one sample multimedia resource may be obtained from the probability values, in the first prediction tag of that resource, that correspond to its sample tag.

For example, when n is 3 and the tag set is [basketball, boy, girl], if the sample labels corresponding to one sample multimedia resource are basketball and boy, the sample tag may be expressed as [1, 1, 0]. In this case, formula (2) sums the terms for i = 1 and i = 2, i.e., Loss = -ln(y_1 + y_2).
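The following sketch evaluates formula (2) on the worked example above (the probability values are made up for illustration):

```python
import torch

def first_loss(pred_probs, multi_hot):
    """Formula (2): Loss = -ln( sum of y_i over the tags with t_i = 1 )."""
    return -torch.log(pred_probs[multi_hot.bool()].sum())

# Tag set [basketball, boy, girl]; sample tag [1, 1, 0].
y = torch.tensor([0.6, 0.3, 0.1])        # hypothetical first prediction tag
t = torch.tensor([1, 1, 0])
loss = first_loss(y, t)                  # -ln(0.6 + 0.3) = -ln(0.9)
```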
In step S1309, training the preset feature extraction model and the preset full-connection layer according to the first loss information until the first loss information meets the preset condition, and obtaining the multi-mode feature extraction model and the target full-connection layer.
In the embodiment of the specification, first gradient information may be obtained according to the first loss information, and the parameters of the preset feature extraction model and of the preset fully-connected layer may be adjusted by gradient back-propagation, thereby training the preset feature extraction model and the preset fully-connected layer until the first loss information meets the preset condition, and obtaining the multi-modal feature extraction model and the target fully-connected layer. The preset condition may be that the first loss information is smaller than a loss threshold, or that the first loss information no longer decreases. The present disclosure is not limited in this regard.
In one example, SGD (Stochastic Gradient Descent) may be used to obtain the first gradient information, with an initial learning rate of 0.1. After the above training process has run for 12 epochs, the first loss information may plateau, i.e., no longer decrease, and training may be terminated to obtain the multi-modal feature extraction model and the target fully-connected layer. This is merely an example and is not intended to limit the present disclosure.
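A minimal sketch of this first training stage (the linear model stands in for the preset feature extraction model plus the preset fully-connected layer, and the data is random; only the optimizer settings come from the example above):

```python
import torch

model = torch.nn.Linear(512, 100)                          # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # initial learning rate 0.1

features = torch.randn(8, 512)                             # stand-in sample features
multi_hot = (torch.rand(8, 100) < 0.05).float()
multi_hot[:, 0] = 1.0                                      # ensure every sample has a tag

for epoch in range(12):                                    # 12 epochs, as in the example
    probs = torch.softmax(model(features), dim=-1)
    loss = -torch.log((probs * multi_hot).sum(dim=-1)).mean()  # formula (2), averaged
    optimizer.zero_grad()
    loss.backward()                                        # gradient back-propagation
    optimizer.step()
```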
By taking the sample images and sample texts as the input of the preset feature extraction model, the model is trained for multi-modal feature extraction, so that the trained multi-modal feature extraction model is suited to extracting multi-modal features of multimedia resources. This improves the accuracy of feature extraction and, combined with the target fully-connected layer, improves the accuracy of multimedia resource label classification.
FIG. 15 is a training flowchart of a graph convolution network, according to an exemplary embodiment. As shown in fig. 15, in one possible implementation, the following steps may be included after the above step S1301:
in step S1501, a plurality of sample images and a plurality of sample texts are input into a multi-modal feature extraction model, feature extraction processing is performed, and second sample graphic feature information is obtained. The implementation of this step may be referred to above in step S203, and will not be described here again.
In step S1503, the tag feature information is input into a preset graph convolutional network, and tag feature correlation processing is performed to obtain sample tag feature description information;
in step S1505, feature fusion processing is performed on the second sample image-text feature information and the sample tag feature description information to obtain sample feature information;
in step S1507, the sample feature information is input to the target full-connection layer, and classification processing is performed to obtain a second prediction tag.
In the embodiment of the present disclosure, the implementation manner of steps S1503 to S1507 may refer to steps S205 to S209, which are not described herein.
In step S1509, second loss information is acquired from the sample tag and the second prediction tag.
In this embodiment of the present disclosure, the loss between the sample tag and the second prediction tag may be calculated as the second loss information by using a preset loss function. The preset loss function here may be the multi-label classification loss MultiLabelSoftMarginLoss, which is not limited by the present disclosure.
In step S1511, the preset graph convolution network is trained according to the second loss information until the second loss information meets the preset condition, thereby obtaining the graph convolution network. The implementation of this step may refer to step S1309 above and is not repeated here.
Optionally, the convolution networks may likewise be obtained by training preset convolution networks according to the second loss information.
By additionally training the preset graph convolution network, the trained label classification model includes the graph convolution network, which can further improve the accuracy of label classification of multimedia resources.
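The second training stage might be sketched as follows (a linear layer stands in for the preset graph convolution network; in a faithful setup the multi-modal feature extraction model and target fully-connected layer would be frozen):

```python
import torch
import torch.nn as nn

gcn = nn.Linear(512, 100)                                 # stand-in trainable GCN
criterion = nn.MultiLabelSoftMarginLoss()
optimizer = torch.optim.SGD(gcn.parameters(), lr=0.01)    # learning rate assumed

sample_feature = torch.randn(8, 512)                      # stand-in sample feature info
targets = (torch.rand(8, 100) < 0.05).float()             # multi-hot sample tags
logits = gcn(sample_feature)                              # stand-in second prediction tag
loss = criterion(logits, targets)                         # second loss information
optimizer.zero_grad()
loss.backward()
optimizer.step()
```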
Fig. 17 is a block diagram illustrating a tag classification apparatus of a multimedia asset according to an exemplary embodiment. Referring to fig. 17, the apparatus may include:
the model input information acquiring module 1701 is configured to acquire a target image and a target text corresponding to the multimedia resource to be processed and tag characteristic information corresponding to a preset tag set;
The target image-text characteristic information obtaining module 1703 is configured to input a target image and a target text into the multi-mode characteristic extraction model for characteristic extraction processing to obtain target image-text characteristic information of the multimedia resource to be processed;
the target tag feature description information obtaining module 1705 is configured to perform tag feature information input into the graph convolution network, and perform tag feature correlation processing to obtain target tag feature description information;
the target feature information obtaining module 1707 is configured to perform feature fusion processing on the target image-text feature information and the target tag feature description information to obtain target feature information;
the tag information acquiring module 1709 is configured to determine at least one tag from a preset tag set as tag information of the multimedia resource according to the target feature information.
Obtaining the target image-text feature information of the multimedia resource to be processed through the multi-modal feature extraction model makes the apparatus effective for label classification of multimedia resources. By performing tag feature correlation processing with the graph convolution network and fusing the target image-text feature information with the target tag feature description information, the correlated fusion of the multi-modal features of the multimedia resource with the tag feature information is realized, so that the target feature information carries richer semantics and represents the content of the multimedia resource more accurately; that is, the degree of content understanding of the multimedia resource is improved, and the accuracy of multimedia resource label classification is improved.
In one possible implementation, the multi-modal feature extraction model includes an image feature extraction module, a text feature extraction module, and a feature fusion module; the target teletext feature information acquisition module 1703 comprises:
the target image feature information acquisition unit is configured to input a target image into the image feature extraction module to perform image feature extraction processing to obtain target image feature information;
the target text feature information acquisition unit is configured to execute the process of inputting the target text into the text feature extraction module and extracting the text feature to obtain target text feature information;
the target image-text characteristic information acquisition unit is configured to input the target image characteristic information and the target text characteristic information into the characteristic fusion module for characteristic fusion processing to obtain target image-text characteristic information.
In one possible implementation, the image feature extraction module includes a convolution module, a first downsampling module, a first fully-connected layer, a second downsampling module, and a second fully-connected layer; the target image feature information acquisition unit includes:
the initial image characteristic information acquisition unit is configured to input the target image into the convolution module for characteristic extraction processing to obtain initial image characteristic information;
The first image characteristic information acquisition unit is configured to input the initial image characteristic information into the first downsampling module for downsampling to obtain first image characteristic information of a first scale;
the second image characteristic information acquisition unit is configured to input the first image characteristic information into the second downsampling module for downsampling to obtain second image characteristic information of a second scale;
the third image characteristic information acquisition unit is configured to input the first image characteristic information into the first full-connection layer, perform characteristic length adjustment processing and obtain third image characteristic information with preset length;
a fourth image feature information obtaining unit configured to perform feature length adjustment processing by inputting the second image feature information into the second full-connection layer, to obtain fourth image feature information of a preset length;
and a target image feature information determination unit configured to perform taking the third image feature information and the fourth image feature information as target image feature information.
In one possible implementation, the text feature extraction module includes a first text feature extraction unit, a third fully-connected layer, a second text feature extraction unit, and a fourth fully-connected layer; the target text feature information acquisition unit includes:
The first text feature information acquisition unit is configured to input the target text into the first text feature extraction unit, and perform text feature extraction processing to obtain first text feature information;
the second text feature information acquisition unit is configured to input the first text feature information into the second text feature extraction unit for text feature extraction processing to obtain second text feature information;
the third text feature information acquisition unit is configured to input the first text feature information into a third full-connection layer, and perform feature length adjustment processing to obtain third text feature information with a preset length;
the fourth text feature information acquisition unit is configured to input the second text feature information into the fourth full-connection layer, and perform feature length adjustment processing to obtain fourth text feature information with preset length;
and a target text feature information determination unit configured to perform taking the third text feature information and the fourth text feature information as target text feature information.
In one possible implementation, the feature fusion module includes a first feature fusion module and a second feature fusion module; the target image-text characteristic information acquisition unit comprises:
The first image-text characteristic information acquisition unit is configured to input the third image characteristic information and the third text characteristic information into the first characteristic fusion module for image-text characteristic fusion processing to obtain first image-text characteristic information;
the first target image-text characteristic information acquisition unit is configured to input fourth image characteristic information, fourth text characteristic information and first image-text characteristic information into the second characteristic fusion module for image-text characteristic fusion processing to obtain target image-text characteristic information.
In one possible implementation, the graph convolution network includes a first graph convolution module and a second graph convolution module; the target tag feature description information acquiring module 1705 includes:
the to-be-processed tag feature description information acquisition unit is configured to input tag feature information into the first graph convolution module to perform tag feature correlation processing to obtain to-be-processed tag feature description information;
the target tag characteristic description information acquisition unit is configured to input the tag characteristic description information to be processed into the second graph convolution module to perform tag characteristic correlation processing to obtain the target tag characteristic description information.
In one possible implementation, the tag classification apparatus further includes:
The second image-text characteristic information acquisition module is configured to perform characteristic fusion processing on the first image-text characteristic information and the label characteristic description information to be processed to obtain second image-text characteristic information;
the target image-text characteristic information acquisition unit further comprises:
the second target image-text characteristic information acquisition unit is configured to input the fourth image characteristic information, the fourth text characteristic information and the second image-text characteristic information into the second characteristic fusion module for image-text characteristic fusion processing to obtain the target image-text characteristic information.
In one possible implementation, the tag information acquiring module 1709 includes:
and the tag information acquisition unit is configured to input the target characteristic information into the target full-connection layer for classification processing to obtain tag information.
In one possible implementation, the model input information acquisition module 1701 includes:
the tag correlation information and weight information acquisition unit is configured to acquire tag correlation information between every two tags in a preset tag set and weight information of a target full-connection layer;
a tag feature description information acquiring unit configured to execute the weighting information as tag feature description information;
And the label characteristic information acquisition unit is configured to execute the label characteristic information corresponding to the label correlation information and the label characteristic description information serving as a preset label set.
In one possible implementation, the tag classification apparatus further includes:
a training data acquisition module configured to perform acquiring a plurality of sample multimedia resources and corresponding sample tags; the plurality of sample multimedia resources comprise a plurality of corresponding sample images and a plurality of sample texts;
the first sample image-text characteristic information acquisition module is configured to input a plurality of sample images and a plurality of sample texts into a preset characteristic extraction model, and perform characteristic extraction processing to obtain first sample image-text characteristic information;
the first prediction tag acquisition module is configured to input the graphic characteristic information of the first sample into a preset full-connection layer for classification processing to obtain a first prediction tag;
a first loss information acquisition module configured to perform acquisition of first loss information according to the sample tag and the first prediction tag;
the first training module is configured to perform training of the preset feature extraction model and the preset full-connection layer according to the first loss information until the first loss information meets preset conditions, and a multi-mode feature extraction model and a target full-connection layer are obtained.
In one possible implementation, the tag classification apparatus further includes:
the second sample image-text characteristic information acquisition module is configured to input a plurality of sample images and a plurality of sample texts into the multi-mode characteristic extraction model for characteristic extraction processing to obtain second sample image-text characteristic information;
the sample tag feature description information acquisition module is configured to input tag feature information into a preset graph convolutional network, and perform tag feature correlation processing to obtain sample tag feature description information;
the sample characteristic information acquisition module is configured to perform characteristic fusion processing on the second sample image-text characteristic information and the sample label characteristic description information to obtain sample characteristic information;
the second prediction tag acquisition module is configured to input sample characteristic information into the target full-connection layer for classification processing to obtain a second prediction tag;
a second loss information acquisition module configured to perform acquisition of second loss information according to the sample tag and the second prediction tag;
and the second training module is configured to perform training of the preset graph convolution network according to the second loss information until the second loss information meets the preset condition, to obtain the graph convolution network.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method and will not be elaborated here.
Fig. 18 is a block diagram illustrating an electronic device for tag classification of multimedia resources, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 18, according to an exemplary embodiment. The electronic device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for tag classification of multimedia resources. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the electronic equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 18 is merely a block diagram of a portion of the structure associated with the disclosed aspects and is not limiting of the electronic device to which the disclosed aspects apply, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
Fig. 19 is a block diagram illustrating an electronic device for tag classification of multimedia resources, which may be a server, and an internal structure diagram thereof may be as shown in fig. 19, according to an exemplary embodiment. The electronic device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for tag classification of multimedia resources.
It will be appreciated by those skilled in the art that the structure shown in fig. 19 is merely a block diagram of a portion of the structure associated with the disclosed aspects and is not limiting of the electronic device to which the disclosed aspects apply, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an exemplary embodiment, there is also provided an electronic device including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a tag classification method for a multimedia resource as in an embodiment of the present disclosure.
In an exemplary embodiment, a computer-readable storage medium is also provided; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the tag classification method of a multimedia resource in the embodiments of the present disclosure. The computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
In an exemplary embodiment, a computer program product containing instructions that, when run on a computer, cause the computer to perform the tag classification method of a multimedia resource in an embodiment of the present disclosure is also provided.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (22)

1. A method for classifying tags of multimedia resources, the method comprising:
acquiring a target image and a target text corresponding to a multimedia resource to be processed and tag characteristic information corresponding to a preset tag set;
inputting the target image and the target text into a multi-mode feature extraction model, and performing feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed;
Inputting the tag characteristic information into a graph convolution network, and performing tag characteristic correlation processing to obtain target tag characteristic description information;
performing feature fusion processing on the target image-text feature information and the target tag feature description information to obtain target feature information;
determining at least one tag from the preset tag set as tag information of the multimedia resource according to the target feature information;
the label characteristic information corresponding to the preset label set is obtained through the following steps: acquiring label correlation information between every two labels in the preset label set and weight information of a target full-connection layer; taking the weight information as tag characteristic description information; and taking the tag correlation information and the tag characteristic description information as tag characteristic information corresponding to the preset tag set.
2. The tag classification method of claim 1, wherein the multi-modal feature extraction model comprises an image feature extraction module, a text feature extraction module, and a feature fusion module; the step of inputting the target image and the target text into a multi-mode feature extraction model to perform feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed comprises the following steps:
Inputting the target image into the image feature extraction module, and performing image feature extraction processing to obtain target image feature information;
inputting the target text into the text feature extraction module, and performing text feature extraction processing to obtain target text feature information;
and inputting the target image characteristic information and the target text characteristic information into the characteristic fusion module to perform characteristic fusion processing to obtain the target image-text characteristic information.
3. The label classification method of claim 2, wherein the image feature extraction module comprises a convolution module, a first downsampling module, a first fully-connected layer, a second downsampling module, and a second fully-connected layer; the step of inputting the target image into the image feature extraction module to perform image feature extraction processing to obtain target image feature information comprises the following steps:
inputting the target image into the convolution module, and performing feature extraction processing to obtain initial image feature information;
inputting the initial image characteristic information into the first downsampling module, and performing downsampling processing to obtain first image characteristic information of a first scale;
Inputting the first image characteristic information into the second downsampling module, and performing downsampling processing to obtain second image characteristic information of a second scale;
inputting the first image characteristic information into the first full-connection layer, and performing characteristic length adjustment processing to obtain third image characteristic information with preset length;
inputting the second image characteristic information into the second full-connection layer, and performing characteristic length adjustment processing to obtain fourth image characteristic information with preset length;
and taking the third image characteristic information and the fourth image characteristic information as the target image characteristic information.
4. The tag classification method of claim 3, wherein the text feature extraction module comprises a first text feature extraction unit, a third fully-connected layer, a second text feature extraction unit, and a fourth fully-connected layer; and the step of inputting the target text into the text feature extraction module for text feature extraction processing to obtain the target text feature information comprises:
inputting the target text into the first text feature extraction unit for text feature extraction processing to obtain first text feature information;
inputting the first text feature information into the second text feature extraction unit for text feature extraction processing to obtain second text feature information;
inputting the first text feature information into the third fully-connected layer for feature length adjustment processing to obtain third text feature information of a preset length;
inputting the second text feature information into the fourth fully-connected layer for feature length adjustment processing to obtain fourth text feature information of the preset length;
and taking the third text feature information and the fourth text feature information as the target text feature information.
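Likewise for claim 4: the claim does not pin down what the text feature extraction units are, so the sketch below uses stacked GRU layers as one plausible choice; the vocabulary size, embedding and hidden dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """Two stacked text feature extraction units, each projected to a preset length."""
    def __init__(self, vocab_size: int = 30000, emb: int = 128,
                 hidden: int = 256, preset_len: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.unit1 = nn.GRU(emb, hidden, batch_first=True)      # first text feature extraction unit
        self.unit2 = nn.GRU(hidden, hidden, batch_first=True)   # second text feature extraction unit
        self.fc3 = nn.Linear(hidden, preset_len)   # third fully-connected layer
        self.fc4 = nn.Linear(hidden, preset_len)   # fourth fully-connected layer

    def forward(self, tokens: torch.Tensor):
        seq1, _ = self.unit1(self.embed(tokens))   # first text feature information
        seq2, _ = self.unit2(seq1)                 # second text feature information
        third = self.fc3(seq1.mean(dim=1))         # third text feature information
        fourth = self.fc4(seq2.mean(dim=1))        # fourth text feature information
        return third, fourth
```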
5. The tag classification method of claim 4, wherein the feature fusion module comprises a first feature fusion module and a second feature fusion module; and the step of inputting the target image feature information and the target text feature information into the feature fusion module for feature fusion processing to obtain the target image-text feature information comprises:
inputting the third image feature information and the third text feature information into the first feature fusion module for image-text feature fusion processing to obtain first image-text feature information;
and inputting the fourth image feature information, the fourth text feature information, and the first image-text feature information into the second feature fusion module for image-text feature fusion processing to obtain the target image-text feature information.
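One plausible reading of claim 5, sketched with concatenation followed by a linear projection as the fusion operator (the claim does not fix the operator):

```python
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    """Two-stage image-text fusion; the fusion operator is a hypothetical choice."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse1 = nn.Linear(2 * dim, dim)   # first feature fusion module
        self.fuse2 = nn.Linear(3 * dim, dim)   # second feature fusion module

    def forward(self, third_img, third_txt, fourth_img, fourth_txt):
        # First image-text feature information, from the coarse-scale features.
        first_it = torch.relu(self.fuse1(torch.cat([third_img, third_txt], dim=1)))
        # Target image-text feature information, from the fine-scale features
        # together with the first-stage fusion result.
        return self.fuse2(torch.cat([fourth_img, fourth_txt, first_it], dim=1))
```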
6. The tag classification method of claim 5, wherein the graph convolution network comprises a first graph convolution module and a second graph convolution module; and the step of inputting the tag feature information into the graph convolution network for tag feature correlation processing to obtain the target tag feature description information comprises:
inputting the tag feature information into the first graph convolution module for tag feature correlation processing to obtain to-be-processed tag feature description information;
and inputting the to-be-processed tag feature description information into the second graph convolution module for tag feature correlation processing to obtain the target tag feature description information.
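A minimal sketch of the two-layer graph convolution of claim 6, using the standard propagation rule H' = sigma(A H W) with A the tag correlation matrix; dimensions and the activation function are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Two stacked graph convolutions over the tag graph (illustrative only)."""
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)    # first graph convolution module
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)   # second graph convolution module

    def forward(self, adj: torch.Tensor, h: torch.Tensor):
        # adj: (num_tags, num_tags) tag correlation; h: (num_tags, in_dim) tag features
        pending = F.leaky_relu(adj @ self.w1(h))   # to-be-processed tag feature descriptions
        return adj @ self.w2(pending)              # target tag feature description information
```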
7. The tag classification method of claim 6, wherein after the step of inputting the third image feature information and the third text feature information into the first feature fusion module for image-text feature fusion processing to obtain the first image-text feature information, the tag classification method further comprises:
performing feature fusion processing on the first image-text feature information and the to-be-processed tag feature description information to obtain second image-text feature information;
and the step of inputting the fourth image feature information, the fourth text feature information, and the first image-text feature information into the second feature fusion module for image-text feature fusion processing to obtain the target image-text feature information comprises:
inputting the fourth image feature information, the fourth text feature information, and the second image-text feature information into the second feature fusion module for image-text feature fusion processing to obtain the target image-text feature information.
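Claim 7 threads the intermediate graph convolution output back into the second fusion stage. A sketch of that cross-injection, with a hypothetical projection layer `proj` mapping the pooled tag descriptions into the image-text feature space; additive fusion is an assumption, since the claim leaves the operator open:

```python
import torch
import torch.nn as nn

def fuse_with_pending_tags(first_it: torch.Tensor, pending_tag_desc: torch.Tensor,
                           proj: nn.Linear) -> torch.Tensor:
    """first_it: (batch, dim) first image-text features;
    pending_tag_desc: (num_tags, tag_dim) intermediate GCN output."""
    # Pool the tag descriptions to one context vector and broadcast it per sample.
    tag_ctx = pending_tag_desc.mean(dim=0, keepdim=True).expand(first_it.size(0), -1)
    return first_it + proj(tag_ctx)   # second image-text feature information
```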
8. The tag classification method of any one of claims 1 to 7, wherein the step of determining at least one tag from the preset tag set as tag information of the multimedia resource according to the target feature information comprises:
inputting the target feature information into the target fully-connected layer for classification processing to obtain the tag information.
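A sketch of the claim-8 classification step in the multi-label setting: the target fully-connected layer scores every tag, a sigmoid converts scores to probabilities, and thresholding yields the tag information. The threshold value and the top-1 fallback (so at least one tag survives) are assumptions, not stated in the claim:

```python
import torch
import torch.nn as nn

def classify(target_feat: torch.Tensor, target_fc: nn.Linear,
             threshold: float = 0.5) -> torch.Tensor:
    probs = torch.sigmoid(target_fc(target_feat))   # (batch, num_tags)
    mask = probs > threshold
    # Fallback: keep the highest-scoring tag so at least one tag is returned.
    mask[torch.arange(mask.size(0)), probs.argmax(dim=1)] = True
    return mask
```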
9. The tag classification method of claim 1, further comprising:
acquiring a plurality of sample multimedia resources and corresponding sample tags, wherein the plurality of sample multimedia resources correspond to a plurality of sample images and a plurality of sample texts;
inputting the plurality of sample images and the plurality of sample texts into a preset feature extraction model for feature extraction processing to obtain first sample image-text feature information;
inputting the first sample image-text feature information into a preset fully-connected layer for classification processing to obtain a first prediction tag;
acquiring first loss information according to the sample tags and the first prediction tag;
and training the preset feature extraction model and the preset fully-connected layer according to the first loss information until the first loss information meets a preset condition, to obtain the multi-modal feature extraction model and the target fully-connected layer.
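The first training stage of claim 9 might look as follows; the optimiser, learning rate, loss function (binary cross-entropy is one common multi-label choice) and stopping threshold are all hypothetical:

```python
import torch
import torch.nn.functional as F

def train_stage1(model, fc, loader, epochs: int = 10, stop_loss: float = 0.05):
    # Feature extraction model and fully-connected layer are trained jointly.
    opt = torch.optim.Adam(list(model.parameters()) + list(fc.parameters()), lr=1e-4)
    for _ in range(epochs):
        for images, texts, labels in loader:
            logits = fc(model(images, texts))   # first sample features -> first prediction tag
            loss = F.binary_cross_entropy_with_logits(logits, labels.float())  # first loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() < stop_loss:   # first loss meets the preset condition
                return model, fc          # multi-modal extraction model + target FC layer
    return model, fc
```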
10. The tag classification method of claim 9, wherein after the step of acquiring the plurality of sample multimedia resources and the corresponding sample tags, the tag classification method further comprises:
inputting the plurality of sample images and the plurality of sample texts into the multi-modal feature extraction model for feature extraction processing to obtain second sample image-text feature information;
inputting the tag feature information into a preset graph convolution network for tag feature correlation processing to obtain sample tag feature description information;
performing feature fusion processing on the second sample image-text feature information and the sample tag feature description information to obtain sample feature information;
inputting the sample feature information into the target fully-connected layer for classification processing to obtain a second prediction tag;
acquiring second loss information according to the sample tags and the second prediction tag;
and training the preset graph convolution network according to the second loss information until the second loss information meets a preset condition, to obtain the graph convolution network.
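And the second stage of claim 10, where the feature extractor and the target fully-connected layer stay frozen and only the graph convolution network is updated; the additive fusion and the stopping rule are hypothetical choices:

```python
import torch
import torch.nn.functional as F

def train_stage2(model, fc, gcn, adj, tag_feats, loader,
                 epochs: int = 10, stop_loss: float = 0.05):
    opt = torch.optim.Adam(gcn.parameters(), lr=1e-4)   # only the GCN is trained
    for _ in range(epochs):
        for images, texts, labels in loader:
            with torch.no_grad():                    # extractor output is frozen
                it_feat = model(images, texts)       # second sample image-text features
            tag_desc = gcn(adj, tag_feats)           # sample tag feature descriptions
            fused = it_feat + tag_desc.mean(dim=0)   # sample feature information (additive fusion)
            logits = fc(fused)                       # second prediction, via target FC layer
            loss = F.binary_cross_entropy_with_logits(logits, labels.float())  # second loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() < stop_loss:   # second loss meets the preset condition
                return gcn
    return gcn
```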
11. A tag classification apparatus for multimedia resources, comprising:
a model input information acquisition module configured to acquire a target image and a target text corresponding to a multimedia resource to be processed, and tag feature information corresponding to a preset tag set;
a target image-text feature information acquisition module configured to input the target image and the target text into a multi-modal feature extraction model for feature extraction processing to obtain target image-text feature information of the multimedia resource to be processed;
a target tag feature description information acquisition module configured to input the tag feature information into a graph convolution network for tag feature correlation processing to obtain target tag feature description information;
a target feature information acquisition module configured to perform feature fusion processing on the target image-text feature information and the target tag feature description information to obtain target feature information;
and a tag information acquisition module configured to determine at least one tag from the preset tag set as tag information of the multimedia resource according to the target feature information;
wherein the model input information acquisition module comprises:
a tag correlation information and weight information acquisition unit configured to acquire tag correlation information between every two tags in the preset tag set and weight information of a target fully-connected layer;
a tag feature description information acquisition unit configured to take the weight information as tag feature description information;
and a tag feature information acquisition unit configured to take the tag correlation information and the tag feature description information as the tag feature information corresponding to the preset tag set.
12. The tag classification apparatus of claim 11, wherein the multi-modal feature extraction model comprises an image feature extraction module, a text feature extraction module, and a feature fusion module; and the target image-text feature information acquisition module comprises:
a target image feature information acquisition unit configured to input the target image into the image feature extraction module for image feature extraction processing to obtain target image feature information;
a target text feature information acquisition unit configured to input the target text into the text feature extraction module for text feature extraction processing to obtain target text feature information;
and a target image-text feature information acquisition unit configured to input the target image feature information and the target text feature information into the feature fusion module for feature fusion processing to obtain the target image-text feature information.
13. The tag classification apparatus of claim 12, wherein the image feature extraction module comprises a convolution module, a first downsampling module, a first fully-connected layer, a second downsampling module, and a second fully-connected layer; and the target image feature information acquisition unit comprises:
an initial image feature information acquisition unit configured to input the target image into the convolution module for feature extraction processing to obtain initial image feature information;
a first image feature information acquisition unit configured to input the initial image feature information into the first downsampling module for downsampling processing to obtain first image feature information of a first scale;
a second image feature information acquisition unit configured to input the first image feature information into the second downsampling module for downsampling processing to obtain second image feature information of a second scale;
a third image feature information acquisition unit configured to input the first image feature information into the first fully-connected layer for feature length adjustment processing to obtain third image feature information of a preset length;
a fourth image feature information acquisition unit configured to input the second image feature information into the second fully-connected layer for feature length adjustment processing to obtain fourth image feature information of the preset length;
and a target image feature information determination unit configured to take the third image feature information and the fourth image feature information as the target image feature information.
14. The tag classification apparatus of claim 13, wherein the text feature extraction module comprises a first text feature extraction unit, a third fully-connected layer, a second text feature extraction unit, and a fourth fully-connected layer; and the target text feature information acquisition unit comprises:
a first text feature information acquisition unit configured to input the target text into the first text feature extraction unit for text feature extraction processing to obtain first text feature information;
a second text feature information acquisition unit configured to input the first text feature information into the second text feature extraction unit for text feature extraction processing to obtain second text feature information;
a third text feature information acquisition unit configured to input the first text feature information into the third fully-connected layer for feature length adjustment processing to obtain third text feature information of a preset length;
a fourth text feature information acquisition unit configured to input the second text feature information into the fourth fully-connected layer for feature length adjustment processing to obtain fourth text feature information of the preset length;
and a target text feature information determination unit configured to take the third text feature information and the fourth text feature information as the target text feature information.
15. The tag classification apparatus of claim 14, wherein the feature fusion module comprises a first feature fusion module and a second feature fusion module; and the target image-text feature information acquisition unit comprises:
a first image-text feature information acquisition unit configured to input the third image feature information and the third text feature information into the first feature fusion module for image-text feature fusion processing to obtain first image-text feature information;
and a first target image-text feature information acquisition unit configured to input the fourth image feature information, the fourth text feature information, and the first image-text feature information into the second feature fusion module for image-text feature fusion processing to obtain the target image-text feature information.
16. The tag classification apparatus of claim 15, wherein the graph convolution network comprises a first graph convolution module and a second graph convolution module; and the target tag feature description information acquisition module comprises:
a to-be-processed tag feature description information acquisition unit configured to input the tag feature information into the first graph convolution module for tag feature correlation processing to obtain to-be-processed tag feature description information;
and a target tag feature description information acquisition unit configured to input the to-be-processed tag feature description information into the second graph convolution module for tag feature correlation processing to obtain the target tag feature description information.
17. The tag classification apparatus of claim 16, further comprising:
a second image-text feature information acquisition module configured to perform feature fusion processing on the first image-text feature information and the to-be-processed tag feature description information to obtain second image-text feature information;
wherein the target image-text feature information acquisition unit further comprises:
a second target image-text feature information acquisition unit configured to input the fourth image feature information, the fourth text feature information, and the second image-text feature information into the second feature fusion module for image-text feature fusion processing to obtain the target image-text feature information.
18. The tag classification apparatus of any one of claims 11 to 17, wherein the tag information acquisition module comprises:
a tag information acquisition unit configured to input the target feature information into the target fully-connected layer for classification processing to obtain the tag information.
19. The tag classification apparatus of claim 11, further comprising:
a training data acquisition module configured to acquire a plurality of sample multimedia resources and corresponding sample tags, wherein the plurality of sample multimedia resources correspond to a plurality of sample images and a plurality of sample texts;
a first sample image-text feature information acquisition module configured to input the plurality of sample images and the plurality of sample texts into a preset feature extraction model for feature extraction processing to obtain first sample image-text feature information;
a first prediction tag acquisition module configured to input the first sample image-text feature information into a preset fully-connected layer for classification processing to obtain a first prediction tag;
a first loss information acquisition module configured to acquire first loss information according to the sample tags and the first prediction tag;
and a first training module configured to train the preset feature extraction model and the preset fully-connected layer according to the first loss information until the first loss information meets a preset condition, to obtain the multi-modal feature extraction model and the target fully-connected layer.
20. The tag classification apparatus of claim 19, further comprising:
a second sample image-text feature information acquisition module configured to input the plurality of sample images and the plurality of sample texts into the multi-modal feature extraction model for feature extraction processing to obtain second sample image-text feature information;
a sample tag feature description information acquisition module configured to input the tag feature information into a preset graph convolution network for tag feature correlation processing to obtain sample tag feature description information;
a sample feature information acquisition module configured to perform feature fusion processing on the second sample image-text feature information and the sample tag feature description information to obtain sample feature information;
a second prediction tag acquisition module configured to input the sample feature information into the target fully-connected layer for classification processing to obtain a second prediction tag;
a second loss information acquisition module configured to acquire second loss information according to the sample tags and the second prediction tag;
and a second training module configured to train the preset graph convolution network according to the second loss information until the second loss information meets a preset condition, to obtain the graph convolution network.
21. An electronic device, comprising:
a processor;
and a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the tag classification method for multimedia resources according to any one of claims 1 to 10.
22. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the tag classification method for multimedia resources according to any one of claims 1 to 10.
CN202110331593.XA 2021-03-26 2021-03-26 Label classification method and device for multimedia resources, electronic equipment and storage medium Active CN113204659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110331593.XA CN113204659B (en) 2021-03-26 2021-03-26 Label classification method and device for multimedia resources, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113204659A (en) 2021-08-03
CN113204659B (en) 2024-01-19

Family

ID=77025776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110331593.XA Active CN113204659B (en) 2021-03-26 2021-03-26 Label classification method and device for multimedia resources, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113204659B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627447B (en) * 2021-10-13 2022-02-08 腾讯科技(深圳)有限公司 Label identification method, label identification device, computer equipment, storage medium and program product
CN114357204B (en) * 2021-11-25 2024-03-26 腾讯科技(深圳)有限公司 Media information processing method and related equipment
CN114625897A (en) * 2022-03-21 2022-06-14 腾讯科技(深圳)有限公司 Multimedia resource processing method and device, electronic equipment and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017166586A1 (en) * 2016-03-30 2017-10-05 乐视控股(北京)有限公司 Image identification method and system based on convolutional neural network, and electronic device
CN110210515A (en) * 2019-04-25 2019-09-06 浙江大学 A kind of image data multi-tag classification method
CN110442707A (en) * 2019-06-21 2019-11-12 电子科技大学 A kind of multi-tag file classification method based on seq2seq
CN110472090A (en) * 2019-08-20 2019-11-19 腾讯科技(深圳)有限公司 Image search method and relevant apparatus, storage medium based on semantic label
CN110781751A (en) * 2019-09-27 2020-02-11 杭州电子科技大学 Emotional electroencephalogram signal classification method based on cross-connection convolutional neural network
CN110807495A (en) * 2019-11-08 2020-02-18 腾讯科技(深圳)有限公司 Multi-label classification method and device, electronic equipment and storage medium
CN111428619A (en) * 2020-03-20 2020-07-17 电子科技大学 Three-dimensional point cloud head attitude estimation system and method based on ordered regression and soft labels
CN111476315A (en) * 2020-04-27 2020-07-31 中国科学院合肥物质科学研究院 Image multi-label identification method based on statistical correlation and graph convolution technology
CN111914156A (en) * 2020-08-14 2020-11-10 中国科学院自动化研究所 Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN112001185A (en) * 2020-08-26 2020-11-27 重庆理工大学 Emotion classification method combining Chinese syntax and graph convolution neural network
CN112001186A (en) * 2020-08-26 2020-11-27 重庆理工大学 Emotion classification method using graph convolution neural network and Chinese syntax
CN112035669A (en) * 2020-09-09 2020-12-04 中国科学技术大学 Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling
CN112241481A (en) * 2020-10-09 2021-01-19 中国人民解放军国防科技大学 Cross-modal news event classification method and system based on graph neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711481B (en) * 2019-01-02 2021-09-10 京东方艺云科技有限公司 Neural networks for drawing multi-label recognition, related methods, media and devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant