CN115129902A

CN115129902A - Media data processing method, device, equipment and storage medium

Info

Publication number: CN115129902A
Application number: CN202210765470.1A
Authority: CN
Inventors: 祁雷; 岑杰鹏; 杨伟东; 胡益珲; 何俊烽; 马锴; 陈宇
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-09-30
Anticipated expiration: 2042-06-30
Also published as: CN115129902B

Abstract

An embodiment of the present application discloses a media data processing method, apparatus, device and storage medium. The method comprises: using an initial media identification model to obtain, based on the media characteristic information corresponding to M sample multimedia data, a first predicted media label and a predicted media category corresponding to each of the M sample multimedia data; determining a media prediction error of the initial media identification model according to the first labeled media label, the labeled media category, the first predicted media label and the predicted media category corresponding to each of the M sample multimedia data; determining a feature extraction error of the initial media identification model according to the media characteristic information corresponding to each of the M sample multimedia data; and performing a first adjustment on the initial media identification model according to the media prediction error and the feature extraction error to obtain a target media identification model, thereby improving the prediction accuracy of the media identification model for multimedia data.

Description

Media data processing method, device, equipment and storage medium

Technical Field

The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing media data.

Background

With the development of multimedia platform technology, the quality of multimedia data keeps improving, and more and more users take part in producing and browsing multimedia data. To facilitate management of the multimedia data in a multimedia platform, or of multimedia data uploaded by users, information such as a corresponding media tag may be generated for each piece of multimedia data. Generally, a media recognition model can be used to generate information such as the tags corresponding to multimedia data. The media recognition models currently in use are usually obtained by training an initial media recognition model with sample multimedia data and the media tags corresponding to the sample multimedia data. However, the inventors have found that a media recognition model obtained in this way has limited prediction accuracy for multimedia data.

Disclosure of Invention

The embodiment of the application provides a media data processing method, a device, equipment and a storage medium, and improves the prediction accuracy of a media identification model for multimedia data.

An embodiment of the present application provides a media data processing method, including:

obtaining a first sample set, wherein the first sample set comprises M sample multimedia data, and first labeled media labels and labeled media categories corresponding to the M sample multimedia data respectively; M is a positive integer;

extracting media characteristic information corresponding to the M sample multimedia data respectively by using an initial media identification model;

respectively predicting labels of the M sample multimedia data by using the initial media identification model based on the media characteristic information respectively corresponding to the M sample multimedia data to obtain first predicted media labels respectively corresponding to the M sample multimedia data, and performing category prediction on the M sample multimedia data by using the initial media identification model to obtain predicted media categories respectively corresponding to the M sample multimedia data;

determining a media prediction error of the initial media identification model according to a first labeled media label, a labeled media category, a first predicted media label and a predicted media category which correspond to the M sample multimedia data respectively;

determining a feature extraction error of the initial media identification model according to the media characteristic information respectively corresponding to the M sample multimedia data;

according to the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model, performing first adjustment on the initial media identification model to obtain a target media identification model; the target media identification model is used for identifying at least one of a media tag and a media category of the target multimedia data.

In another aspect, an embodiment of the present invention provides a media data processing apparatus, including:

an acquisition module, configured to obtain a first sample set, wherein the first sample set comprises M sample multimedia data, and a first labeled media label and a labeled media category which correspond to the M sample multimedia data respectively; M is a positive integer;

the characteristic extraction module is used for extracting the media characteristic information corresponding to the M sample multimedia data by using an initial media identification model;

the prediction module is used for performing label prediction on the M sample multimedia data respectively based on the media characteristic information respectively corresponding to the M sample multimedia data by using the initial media identification model to obtain first predicted media labels respectively corresponding to the M sample multimedia data, and performing category prediction on the M sample multimedia data by using the initial media identification model to obtain predicted media categories respectively corresponding to the M sample multimedia data;

a determining module, configured to determine a media prediction error of the initial media identification model according to a first labeled media tag, a labeled media category, a first predicted media tag, and a predicted media category that correspond to the M sample multimedia data, respectively;

the determining module is further configured to determine a feature extraction error of the initial media identification model according to media feature information corresponding to the M sample multimedia data, respectively;

the adjusting module is used for carrying out first adjustment on the initial media identification model according to the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model to obtain a target media identification model; the target media identification model is used for identifying at least one of a media tag and a media category of the target multimedia data.

In another aspect, an embodiment of the present application provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.

Yet another aspect of embodiments of the present application provides a computer-readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of the method.

A further aspect of embodiments of the present application provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the method.

In summary, the computer device may obtain a first sample set, where the first sample set includes M sample multimedia data, and a first labeled media tag and a labeled media category corresponding to the M sample multimedia data, respectively. The computer device can use the initial media identification model to predict the first predicted media labels and the predicted media categories respectively corresponding to the M sample multimedia data. The computer device may then determine a media prediction error of the initial media identification model according to the first labeled media tag, labeled media category, first predicted media tag, and predicted media category corresponding to the M sample multimedia data, respectively, and may also determine a feature extraction error of the initial media identification model according to the media characteristic information respectively corresponding to the M sample multimedia data. Finally, the computer device can perform a first adjustment on the initial media recognition model according to the media prediction error and the feature extraction error to obtain a target media recognition model. In this process, the initial media recognition model is trained with the first sample set in a multi-task learning manner, which improves both the feature expression capability and the generalization capability of the model, so that the prediction accuracy of the media recognition model for multimedia data can be effectively improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1A is a schematic structural diagram of a media data processing system according to an embodiment of the present application;

Fig. 1B is a schematic diagram of a media data processing process according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a media data processing method according to an embodiment of the present application;

Fig. 3 is a schematic flow chart of determining a target media identification model according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a media data processing apparatus according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.

The present application relates to artificial intelligence, and mainly to machine learning techniques in artificial intelligence: an initial media identification model is trained to obtain a target media identification model, which improves the accuracy with which the target media identification model identifies media labels or media categories. Machine Learning (ML) is a multidisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize its existing knowledge structure to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

The present application also relates to artificial intelligence cloud services in cloud technology, generally referred to as AIaaS (AI as a Service). This is a service mode of an artificial intelligence platform: the AIaaS platform splits several types of common AI services and provides them in the cloud as independent or packaged services. This service model is similar to an AI-themed app store: all developers can access one or more artificial intelligence services provided by the platform through an API, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services. For example, in the present application, after an initial media recognition model is trained, a target media recognition model may be obtained and added to an artificial intelligence platform, so that multiple users or multiple organizations may share the target media recognition model.

To facilitate a clearer understanding of the present application, a media data processing system for implementing the media data processing method of the present application is introduced first. As shown in fig. 1A, the media data processing system includes a server 10 and a terminal cluster. The terminal cluster may include one or more terminals, and the number of terminals is not limited herein. As shown in fig. 1A, the terminal cluster may specifically include terminal 1, terminal 2, …, and terminal n. It is understood that terminal 1, terminal 2, …, and terminal n may each be connected to the server 10 via a network, so that each terminal can exchange data with the server 10 over the network connection.

Wherein, the terminal can be installed with a multimedia platform for providing multimedia data for users, and the multimedia platform can include but is not limited to: a game application download platform, a short video platform, an audio/video playing platform, a shopping platform, an information browsing platform, etc. In one embodiment, the terminal may identify a media tag or a media category of the multimedia data through the target media identification model.

The specific content of the multimedia data in different multimedia platforms may be different, for example, in a game application download platform, the multimedia data may refer to a game live video; in a short video platform, multimedia data may refer to a short video. In the audio/video playing platform, the multimedia data can refer to film and television works, television dramas, audio data and the like; in the shopping platform, the multimedia data can be shopping live video; in an information browsing platform, the multimedia data may refer to information including graphics and/or video.

The server 10 may refer to a device for providing a back-end service for a multimedia platform. In one embodiment, the server 10 may be configured to train the initial media recognition model to obtain the target media recognition model. In one embodiment, the server 10 may identify the media tag or media category of the multimedia data through the target media identification model.

The server may be an independent physical server, a server cluster or distributed system formed by at least two physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal may specifically be, but is not limited to, a vehicle-mounted terminal, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a speaker with a screen, a smart watch, and the like. Each terminal and each server may be connected directly or indirectly through wired or wireless communication, and the number of terminals and the number of servers may each be one or at least two, which is not limited herein.

The embodiment of the present application provides a media data processing scheme, which can be applied to the media data processing system shown in fig. 1A. Specifically, the media data processing scheme is as follows: obtaining a first sample set, wherein the first sample set comprises M sample multimedia data, and a first labeled media tag and a labeled media category which correspond to the M sample multimedia data respectively; extracting media characteristic information corresponding to the M sample multimedia data respectively by using an initial media identification model; performing label prediction on the M sample multimedia data by using the initial media identification model, based on the media characteristic information respectively corresponding to the M sample multimedia data, to obtain first predicted media labels respectively corresponding to the M sample multimedia data, and performing category prediction on the M sample multimedia data by using the initial media identification model to obtain predicted media categories respectively corresponding to the M sample multimedia data; determining a media prediction error of the initial media identification model according to the first labeled media label, the labeled media category, the first predicted media label and the predicted media category which correspond to the M sample multimedia data respectively; determining a feature extraction error of the initial media identification model according to the media characteristic information respectively corresponding to the M sample multimedia data; and performing a first adjustment on the initial media identification model according to the media prediction error and the feature extraction error of the initial media identification model to obtain a target media identification model. The target media identification model is used for identifying at least one of a media tag and a media category of target multimedia data.
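The way the two error terms might jointly drive the first adjustment can be sketched as follows. This is a minimal illustration only: the text above does not state how the errors are computed or combined, so the binary cross-entropy for labels, the cross-entropy for categories, the weighted sum, and the names `total_training_error` and `feature_weight` are all assumptions made for this sketch.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over per-label probabilities (assumed label loss)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

def cross_entropy(probs, target_index, eps=1e-12):
    """Cross-entropy for a single annotated category (assumed category loss)."""
    return -math.log(max(probs[target_index], eps))

def total_training_error(label_probs, label_targets, category_probs,
                         category_target, feature_error, feature_weight=1.0):
    """Hypothetical combination: media prediction error plus a weighted
    feature extraction error. The weighting scheme is an assumption."""
    media_prediction_error = (
        binary_cross_entropy(label_probs, label_targets)
        + cross_entropy(category_probs, category_target)
    )
    return media_prediction_error + feature_weight * feature_error
```

In an actual training loop the returned value would be minimized with respect to the model parameters; the sketch only shows how the two error terms could be brought together.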

In one embodiment, one application process of the media data processing scheme is shown in fig. 1B. On the one hand, the small sample tags (corresponding to the second annotated media tags) included in a small sample tag set (corresponding to the P second annotated media tags) may be obtained. Based on the small sample tag set, an active retrieval technique is used, optionally combined with manual review, to expand the training data corresponding to the small sample tags so that this training data is no longer lacking. Here, a small sample tag is a tag that originally lacked training data; such a tag may also be a media tag of multimedia data. Although these tags no longer lack training data after data expansion, they are still called small sample tags in later use to distinguish them from common tags. In other words, given a batch of small sample tags, this process obtains training data for them through the data expansion technique of active retrieval, optionally combined with manual review. On the other hand, after the base videos (corresponding to the M sample multimedia data) are obtained, they are used as input data to pre-train the initial video recognition model (corresponding to the initial media recognition model) in a multi-task learning manner. Then, based on the expanded training data of the small sample tags, a meta-learning-based model optimization method is used to optimize the video recognition model obtained from training on the base videos, yielding a small sample tag prediction model (corresponding to the target media identification model) for performing tag prediction on multimedia data such as video.

The media data processing scheme improves the media data processing process at both the data level and the model level. At the data level, the embodiment of the application introduces a data expansion step based on active retrieval. In some embodiments, this step may be combined with manual review to expand the sample multimedia data, in particular for annotated media tags that lack sample multimedia data. Compared with data expansion based on active learning, combining active retrieval with manual review balances the quantity and quality of the expanded samples, so that high-quality labeled samples can be expanded quickly before the initial media recognition model is trained. At the sample level, this alleviates the problems of insufficient discriminative information and easy overfitting that arise in the original tag recognition task when a tag has few training samples.

At the model level, the embodiment of the application provides a pre-training technique based on multi-task learning and a model optimization method based on meta-learning. The pre-training technique based on multi-task learning, on the one hand, improves the feature expression capability of the model for media tags through supervised training on annotation information of the multimedia data such as tags and categories and, on the other hand, improves the generalization of the model through self-supervised training across the multi-modal information of the multimedia data. Compared with the traditional approach of pre-training the initial media recognition model on a single task, pre-training based on multi-task learning learns the model parameters under the simultaneous constraint of several kinds of supervision information (the information the model needs for multi-task learning, including annotation information such as tags and categories), and thus obtains better generalization. Specifically, three tasks are trained jointly: multimedia data tagging (the label prediction task), multimedia data classification (the category prediction task), and multi-modal contrastive learning (the contrastive learning task between different modality information). The multimedia data tagging task uses the first annotated media tags and their corresponding sample multimedia data, for example common video tags and their corresponding sample videos. The multimedia data classification task uses the annotated media categories and their corresponding sample multimedia data, for example video categories and their corresponding sample videos. Both tasks are similar to the multimedia data tag identification task, which uses the second annotated media tags and their corresponding sample multimedia data (for example, other video tags), so they can provide prior knowledge for model optimization. Multi-modal contrastive learning optimizes the model by reducing the distance between different modality information within the same multimedia data. Because it does not depend on annotation information of the multimedia data, a large amount of unlabeled data can be used to train the model, which strengthens its generalization capability. The multi-task learning approach adopted by the application therefore takes both the specificity and the generalization of the model into account.

The model optimization method based on meta-learning is used to fine-tune the pre-trained model. An optimization approach based on random batch sampling easily causes the model to overfit, whereas the meta-learning-based method optimizes the model by constructing a plurality of small-sample learning subtasks. That is, it converts a single overall optimization task into a plurality of small-sample learning subtasks, so that a better solution for the model can be obtained with only a small number of training samples. This technique has two key points. (1) A training sample sampling scheme based on label division. Unlike random batch sampling, which randomly draws part of the multimedia data as training data in each iteration, training sample sampling based on label division uses only some of the second annotated media tags as optimization targets in each iteration, and uses the differences between different second annotated media tags to increase the diversity of optimization directions. (2) A loss design based on a distance metric. Unlike the conventional use of a fully connected network as a classification layer, the embodiment of the application relies on a designed distance metric to calculate the loss of each tag on each sample in the corresponding query set. Since no additional classification layer is introduced, the number of model parameters is reduced and overfitting is avoided.
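The two key points of the meta-learning step can be illustrated with a small sketch. It is not the distance metric or sampling procedure of the application, which this passage does not specify. The sketch substitutes a prototypical-network-style loss (squared Euclidean distance to per-tag mean features), and the support/query split, the function names, and all parameters are hypothetical.

```python
import math
import random

def sample_episode(samples_by_tag, n_tags, n_support, n_query, rng):
    """Label-partitioned sampling: pick only some tags for this iteration,
    then split each chosen tag's features into support and query samples."""
    tags = rng.sample(sorted(samples_by_tag), n_tags)
    support, query = {}, []
    for tag in tags:
        picked = rng.sample(samples_by_tag[tag], n_support + n_query)
        support[tag] = picked[:n_support]
        query.extend((feat, tag) for feat in picked[n_support:])
    return support, query

def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distance_loss(support, query):
    """Softmax cross-entropy over negative squared distances to tag prototypes.
    No classification layer is used, so no extra parameters are introduced."""
    prototypes = {tag: mean_vector(feats) for tag, feats in support.items()}
    total = 0.0
    for feat, tag in query:
        logits = {t: -squared_distance(feat, p) for t, p in prototypes.items()}
        peak = max(logits.values())  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += log_norm - logits[tag]
    return total / len(query)
```

Each call to `sample_episode` yields one subtask restricted to a subset of tags, so successive iterations optimize toward different tag groups rather than toward one random batch drawn from all data.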

In one embodiment, the media data processing scheme provided in the embodiments of the present application may be used in the following scenarios:

1) For published or uploaded videos, the videos are reviewed by identifying their tags.

2) For a video recommendation system, recall queues and ranking features are built by identifying the tags of videos.

3) For an information distribution system, multimedia data such as videos or graphics can be tagged by identifying the multimedia data.

Scenario 1) significantly reduces the manual review workload and speeds up the overall review process. Scenarios 2) and 3) apply video tags to every stage of a recommendation system, including recalling and ranking videos and scattering content according to policy; video tags can also be displayed explicitly on front-end pages.

Please refer to fig. 2, which is a schematic flowchart of a media data processing method according to an embodiment of the present application. The method may be applied to a computer device, which may be the aforementioned terminal or server. Specifically, the method may comprise the following steps:

S201, obtaining a first sample set, wherein the first sample set comprises M sample multimedia data, and a first labeled media label and a labeled media category corresponding to the M sample multimedia data respectively.

In an embodiment of the application, a computer device may obtain a first sample set to train an initial media recognition model using the first sample set. The first sample set may be randomly drawn from the target sample set, and in particular may be randomly drawn from the target sample set at each iteration of the initial media recognition model.
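As a minimal sketch of drawing a fresh first sample set at each iteration, the following uses Python's standard `random` module. The tuple layout of a sample, the identifiers such as `video_001`, and the function name `draw_first_sample_set` are hypothetical, and sampling without replacement within one iteration is an assumption.

```python
import random

def draw_first_sample_set(target_sample_set, m, rng):
    """Randomly draw M annotated samples (without replacement) for one iteration."""
    if m > len(target_sample_set):
        raise ValueError("M cannot exceed the size of the target sample set")
    return rng.sample(target_sample_set, m)

# Hypothetical target sample set: (media id, annotated tag, annotated category).
target_sample_set = [
    ("video_001", "tag_a", "category_x"),
    ("video_002", "tag_b", "category_x"),
    ("video_003", "tag_c", "category_y"),
    ("video_004", "tag_a", "category_x"),
    ("video_005", "tag_d", "category_y"),
]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
# One independently drawn first sample set per training iteration.
batches = [draw_first_sample_set(target_sample_set, 3, rng) for _ in range(4)]
```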

In the embodiment of the present application, the computer device may specifically train the initial media recognition model with the first sample set in a multi-task learning manner. The multiple tasks here include a label prediction task, a category prediction task, and a contrastive learning task between different modality information.
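The contrastive learning task between modalities can be illustrated with an InfoNCE-style loss, a commonly used formulation that is substituted here as an assumption; this passage does not give the formula used by the application. The loss pulls the image feature and text feature of the same sample together and pushes apart features of different samples in the batch. The function names and the `temperature` parameter are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_modal_contrastive_loss(image_feats, text_feats, temperature=0.1):
    """InfoNCE-style loss: for sample i, the text feature with the same index
    is the positive and the other text features in the batch are negatives."""
    n = len(image_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_feats[i], text_feats[j]) / temperature
                  for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_norm - logits[i]
    return total / n
```

Because the positives come from pairing modalities of the same item, a loss of this kind needs no annotated tags or categories, which is consistent with the use of unlabeled data for this task.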

It is to be understood that the sample multimedia data referred to herein is multimedia data used to train the initial media recognition model, and may be video, audio, graphics, and the like. In one embodiment, the first annotated media tag may be a non-small sample tag. Non-small sample tags may be a subset of the annotated media tags of the sample multimedia data, namely those annotated media tags for which the number of sample multimedia data is greater than a quantity threshold. The first annotated media tag and the annotated media category may be obtained after a plurality of users annotate and review the sample multimedia data. The first annotated media tag reflects fine-grained description information of the sample multimedia data, and the annotated media category reflects coarse-grained description information of the sample multimedia data; that is, the first annotated media tag is a sub-category of the annotated media category. For example, when the sample multimedia data is sample video data, the annotated media category of a sample video is one of TV series, movie, animation, documentary, and so on; when the annotated media category of the sample video is TV series, the annotated media tag of the sample video is a sub-category of TV series, such as short drama, costume history, or urban life. In one embodiment, the second annotated media tag referred to herein may be a small sample tag. Small sample tags may likewise be a subset of the annotated media tags of the sample multimedia data, namely those annotated media tags for which the number of sample multimedia data is less than or equal to the quantity threshold.
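The split between non-small sample tags and small sample tags by a quantity threshold can be sketched as follows. The input layout (one list of tags per sample), the tag names, and the function name `split_tags_by_sample_count` are hypothetical.

```python
from collections import Counter

def split_tags_by_sample_count(sample_tags, threshold):
    """Return (non_small, small): tags with more than `threshold` samples,
    and tags with at most `threshold` samples."""
    counts = Counter(tag for tags in sample_tags for tag in tags)
    non_small = {tag for tag, count in counts.items() if count > threshold}
    small = {tag for tag, count in counts.items() if count <= threshold}
    return non_small, small

# Hypothetical annotations: one list of tags per sample multimedia data item.
sample_tags = [["tag_a"], ["tag_a", "tag_b"], ["tag_a"], ["tag_c"]]
non_small, small = split_tags_by_sample_count(sample_tags, threshold=2)
```

With a threshold of 2, `tag_a` (three samples) falls in the non-small set, while `tag_b` and `tag_c` (one sample each) fall in the small set.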

S202, extracting media characteristic information corresponding to the M sample multimedia data respectively by using an initial media identification model.

In this embodiment, in the process of training the initial media recognition model by using the first sample set, the computer device may input the first sample set into the initial media recognition model, and extract media feature information corresponding to the M sample multimedia data through the initial media recognition model. The media characteristic information herein refers to characteristic information capable of reflecting media information of corresponding multimedia data.

In some embodiments, the computer device may first obtain media information corresponding to the M sample multimedia data through the initial media identification model, and then perform feature extraction on the media information corresponding to the M sample multimedia data, to obtain media feature information corresponding to the M sample multimedia data, respectively.

In some embodiments, the media information may be target modality information, and accordingly, the media feature information may be a target modality feature. After obtaining the target modal information corresponding to the M sample multimedia data, the computer device may perform feature extraction on the target modal information corresponding to the M sample multimedia data, to obtain target modal features corresponding to the M sample multimedia data.

In some embodiments, the target modality information may include a plurality of modality information, such as a first modality information and a second modality information, and correspondingly, the target modality characteristics may include a plurality of modality characteristics (modality characteristics acquired respectively according to the plurality of modality information), such as a first modality characteristic and a second modality characteristic. The computer device may perform feature extraction on the first modality information corresponding to the M sample multimedia data after acquiring the first modality information corresponding to the M sample multimedia data, respectively, to obtain first modality features corresponding to the M sample multimedia data, respectively, and may perform feature extraction on the second modality information corresponding to the M sample multimedia data, respectively, after acquiring the second modality information corresponding to the M sample multimedia data, respectively, to obtain second modality features corresponding to the M sample multimedia data, respectively. For example, when the sample multimedia data is a sample video, the computer device may perform feature extraction on video frame sets corresponding to M sample multimedia data, respectively, after obtaining video frame sets corresponding to M sample multimedia data, respectively, to obtain image features corresponding to M sample multimedia data, and may perform feature extraction on text sets corresponding to M sample multimedia data, respectively, after obtaining text sets corresponding to M sample multimedia data, to obtain text features corresponding to M sample multimedia data, respectively.

In some embodiments, when the target modality information includes first modality information and second modality information, the media feature information includes a target modal feature, and the target modal feature includes a first modal feature and a second modal feature, each of the M sample multimedia data may correspond to a plurality of pieces of first modality information. After obtaining the plurality of pieces of first modality information respectively corresponding to the M sample multimedia data, the computer device may perform feature extraction on them to obtain the modal feature corresponding to each piece of first modality information. The computer device may then determine, from the modal features corresponding to the pieces of first modality information of each sample multimedia data, the average modal feature of that sample multimedia data, and take the average modal features respectively corresponding to the M sample multimedia data as the first modal features respectively corresponding to the M sample multimedia data. The computer device may further, after obtaining the second modality information respectively corresponding to the M sample multimedia data, perform feature extraction on it to obtain the second modal features respectively corresponding to the M sample multimedia data.
For example, when the sample multimedia data is a sample video, the computer device may perform feature extraction on the plurality of pieces of first modality information (that is, the video frames in the video frame sets) respectively corresponding to the M sample videos, to obtain the image feature corresponding to each video frame in the video frame set of each sample video, and may determine the average image feature of each sample video from the image features of the video frames in its video frame set; the computer device may further, after acquiring the text information sets respectively corresponding to the M sample videos, perform feature extraction on this second modality information to obtain the text features respectively corresponding to the M sample videos. Here, the computer device may add up the image features in the image feature set corresponding to a sample video to obtain an added image feature, and divide the added image feature by the number of video frames in the sample video to obtain the average image feature corresponding to the sample video.
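The averaging step described above (add the per-frame image features, then divide by the number of video frames) can be sketched as follows. This is a minimal illustration in plain Python; the function name `average_feature` and the toy two-dimensional features are assumptions made for the example and do not come from the application.

```python
from typing import List


def average_feature(frame_features: List[List[float]]) -> List[float]:
    """Element-wise sum of per-frame feature vectors divided by the frame count."""
    if not frame_features:
        raise ValueError("at least one frame feature is required")
    dim = len(frame_features[0])
    summed = [0.0] * dim
    for feat in frame_features:
        if len(feat) != dim:
            raise ValueError("all frame features must have the same dimension")
        for i, value in enumerate(feat):
            summed[i] += value  # addition operation over the image feature set
    # Divide the added feature by the number of video frames.
    return [s / len(frame_features) for s in summed]


# Hypothetical image features of three video frames of one sample video.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = average_feature(frames)
```

In this sketch, `avg` plays the role of the first modal feature of the sample video; a real feature extractor would produce much higher-dimensional vectors.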

In some embodiments, the feature extraction process may be implemented by a feature extraction model. The computer device may perform feature extraction on the media information corresponding to the sample multimedia data by using the feature extraction model to obtain the media feature information corresponding to the sample multimedia data, where the feature extraction model may be a sub-model included in the media identification model. In other words, the computer device may input the media information corresponding to the sample multimedia data into the feature extraction model, and the feature extraction model outputs the media feature information corresponding to the sample multimedia data. In some embodiments, when the media information includes target modality information and the target modality information includes a plurality of modality information, the computer device may input each modality information corresponding to the sample multimedia data into a respective one of a plurality of feature extraction models; each feature extraction model processes the modality information input to it and outputs the corresponding modal feature, so that the plurality of modal features of the sample multimedia data are obtained. For example, when the sample multimedia data is a sample video, the computer device may perform feature extraction on the video frame sets respectively corresponding to the M sample videos through an image feature extraction model to obtain the image feature sets respectively corresponding to the M sample videos, and perform feature extraction on the text sets respectively corresponding to the M sample videos through a text feature extraction model to obtain the text features respectively corresponding to the M sample videos, where the image feature extraction model and the text feature extraction model may be sub-models included in the video recognition model.

In one embodiment, the text mentioned in the embodiments of the present application may include at least one of a title, a subtitle, a bullet-screen comment, and characters extracted from a picture of the multimedia data. In other embodiments, the text may also be other text of the multimedia data, which is not limited in the embodiments of the present application.

S203, label prediction is carried out on the M sample multimedia data respectively by utilizing the initial media identification model based on the media characteristic information respectively corresponding to the M sample multimedia data to obtain first predicted media labels respectively corresponding to the M sample multimedia data, and category prediction is carried out on the M sample multimedia data by utilizing the initial media identification model to obtain predicted media categories respectively corresponding to the M sample multimedia data.

In some embodiments, when the media feature information includes a target modal feature and the target modal feature includes a plurality of modal features, the computer device may perform feature fusion on the plurality of modal features respectively corresponding to the M sample multimedia data by using the initial media identification model, to obtain fused features respectively corresponding to the M sample multimedia data; the computer device may use the initial media identification model to perform label prediction according to the fused features, so as to obtain the first predicted media labels respectively corresponding to the M sample multimedia data; the computer device may also use the initial media identification model to perform category prediction according to the fused features, so as to obtain the predicted media categories respectively corresponding to the M sample multimedia data. For example, when the sample multimedia data is a sample video, the computer device may, after obtaining the average image features and the text features respectively corresponding to the M sample videos, perform feature fusion on the average image feature and the text feature of each sample video by using the initial video identification model, to obtain fused features respectively corresponding to the M sample videos; the computer device may use the initial video identification model to perform label prediction according to the fused features to obtain the first predicted video labels respectively corresponding to the M sample videos; and the computer device may use the initial video identification model to perform category prediction according to the fused features to obtain the predicted video categories respectively corresponding to the M sample videos.

It should be noted that, when the media feature information includes a target modal feature, the target modal feature includes a plurality of modal features, and more than one of the plurality of modal features is itself plural, the computer device may obtain, in a manner similar to the above, the first predicted media labels and the predicted media categories respectively corresponding to the M sample multimedia data. Illustratively, when the media feature information includes a target modal feature, the target modal feature includes a first modal feature, a second modal feature, and a third modal feature, and the first modal feature and the third modal feature are each plural, the computer device may determine, by using the initial media identification model, a first average modal feature corresponding to each of the M sample multimedia data according to the plurality of first modal features corresponding to each of the M sample multimedia data, and determine, by using the initial media identification model, a second average modal feature corresponding to each of the M sample multimedia data according to the plurality of third modal features corresponding to each of the M sample multimedia data; the computer device may perform feature fusion on the first average modal feature, the second average modal feature, and the second modal feature respectively corresponding to the M sample multimedia data by using the initial media identification model, to obtain fused features respectively corresponding to the M sample multimedia data; the computer device may then use the initial media identification model to perform label prediction according to the fused features to obtain the first predicted media labels respectively corresponding to the M sample multimedia data, and to perform category prediction according to the fused features to obtain the predicted media categories respectively corresponding to the M sample multimedia data.

In some embodiments, the aforementioned feature fusion process may be implemented by a feature fusion model, the aforementioned label prediction process may be implemented by a label classification model, and the aforementioned category prediction process may be implemented by a category classification model. The feature fusion model, the label classification model, and the category classification model may each be a sub-model of the media identification model.
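The division into a fusion sub-model, a label classification sub-model, and a category classification sub-model can be sketched as below. This is only an illustration under stated assumptions: concatenation is used as the fusion operation, each classifier is a single linear layer, labels are scored independently with a sigmoid (multi-label), and the category is scored with a softmax (single choice). None of these choices, nor the names `fuse`, `linear`, and `predict`, is specified by the application.

```python
import math
from typing import List, Sequence, Tuple


def fuse(image_feat: Sequence[float], text_feat: Sequence[float]) -> List[float]:
    """Assumed fusion: concatenate the two modal features."""
    return list(image_feat) + list(text_feat)


def linear(x: Sequence[float], weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """One linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs: Sequence[float]) -> List[float]:
    top = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def predict(image_feat, text_feat, tag_w, tag_b, cls_w, cls_b) -> Tuple[List[float], List[float]]:
    """Both heads read the same fused feature."""
    fused = fuse(image_feat, text_feat)
    tag_probs = [sigmoid(z) for z in linear(fused, tag_w, tag_b)]  # label head
    cls_probs = softmax(linear(fused, cls_w, cls_b))               # category head
    return tag_probs, cls_probs


# Toy example: 2-dim image feature, 1-dim text feature, 2 labels, 3 categories.
image_feat, text_feat = [0.2, 0.4], [0.1]
tag_w, tag_b = [[0.0, 0.0, 0.0] for _ in range(2)], [0.0, 0.0]
cls_w, cls_b = [[0.0, 0.0, 0.0] for _ in range(3)], [0.0, 0.0, 0.0]
tag_probs, cls_probs = predict(image_feat, text_feat, tag_w, tag_b, cls_w, cls_b)
```

With all-zero weights the label probabilities are 0.5 each and the category distribution is uniform, which makes the toy example easy to check by hand.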

S204, determining a media prediction error of the initial media identification model according to the first annotated media label, the annotated media category, the first predicted media label, and the predicted media category respectively corresponding to the M sample multimedia data.

In the embodiment of the present application, assuming that the M sample multimedia data include sample multimedia data M_a (a is a positive integer less than or equal to M), the computer device may calculate the error between the first annotated media label and the first predicted media label corresponding to M_a as the media label prediction error corresponding to M_a, and may calculate the error between the annotated media category and the predicted media category corresponding to M_a as the media category prediction error corresponding to M_a, thereby determining the media label prediction error and the media category prediction error corresponding to M_a as the media prediction error corresponding to M_a. Here, the media prediction error corresponding to M_a can be understood as the prediction error of the initial media identification model for M_a. In this manner, the computer device can obtain the media prediction errors respectively corresponding to the M sample multimedia data, and determine them as the media prediction error of the initial media identification model.

In some embodiments, the media label prediction error corresponding to sample multimedia data, such as sample multimedia data M_a, may be denoted Loss_tag, which is calculated as follows:

$$\mathrm{Loss}_{tag} = \mathrm{BCE}(I, T, Y_{tag}) \qquad \text{(Formula 1.1)}$$

where I and T are both media information of the sample multimedia data, such as sample multimedia data M_a, and Y_tag is the first annotated media label corresponding to the sample multimedia data. In some embodiments, I and T may respectively represent two kinds of modality information of sample multimedia data M_a. For example, if the sample multimedia data is a sample video, I and T may respectively represent the video frame set and the text information set of sample video M_a. BCE is a binary cross-entropy calculation function. In some embodiments, a loss function other than BCE may also be used to calculate the media label prediction error corresponding to the sample multimedia data, which is not limited in this application.

In some embodiments, the media category prediction error corresponding to sample multimedia data, such as sample multimedia data M_a, may be denoted Loss_cls, which is calculated as follows:

$$\mathrm{Loss}_{cls} = \mathrm{CE}(I, T, Y_{cls}) \qquad \text{(Formula 1.2)}$$

where Y_cls is the annotated media category corresponding to the sample multimedia data, and CE is a cross-entropy calculation function. In some embodiments, a loss function other than CE may also be used to calculate the media category prediction error corresponding to the sample multimedia data, which is not limited in this application.
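To make the two prediction errors concrete, the sketch below computes a binary cross-entropy over per-label probabilities and a cross-entropy over a category distribution. It assumes the model has already turned the media information (I, T) into predicted probabilities; the function names, the averaging over labels, and the clipping constant `eps` are illustrative assumptions rather than details taken from the application.

```python
import math
from typing import Sequence


def binary_cross_entropy(pred_probs: Sequence[float], targets: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over independent labels (multi-label setting)."""
    total = 0.0
    for p, y in zip(pred_probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)


def cross_entropy(class_probs: Sequence[float], target_index: int,
                  eps: float = 1e-12) -> float:
    """Cross-entropy of a predicted category distribution against one true category."""
    return -math.log(max(class_probs[target_index], eps))


# Hypothetical predictions for one sample: two labels and two categories.
loss_tag = binary_cross_entropy([0.5, 0.5], [1, 0])  # equals ln 2
loss_cls = cross_entropy([0.25, 0.75], 1)            # equals -ln 0.75
```

Here `loss_tag` and `loss_cls` stand in for Loss_tag and Loss_cls of a single sample; the predicted probabilities would come from the label and category heads of the model.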

S205, determining the feature extraction error of the initial media identification model according to the media feature information respectively corresponding to the M sample multimedia data.

In the embodiment of the present application, assuming that the M sample multimedia data include sample multimedia data M_a, the computer device may determine, according to the media feature information corresponding to M_a, the feature extraction error of the initial media identification model with respect to M_a. In particular, assuming that the M sample multimedia data further include sample multimedia data M_b (b is a positive integer less than or equal to M, and a is different from b), the computer device may determine the feature extraction error of the initial media identification model with respect to M_a according to the media feature information corresponding to M_a and the media feature information corresponding to M_b. Sample multimedia data M_b may be any of the M sample multimedia data other than M_a. Illustratively, M_b may be one sample multimedia data randomly drawn from the sample multimedia data that remain after M_a is removed from the M sample multimedia data. In this way, the computer device may obtain the feature extraction errors of the initial media identification model with respect to each of the M sample multimedia data, and determine these errors as the feature extraction error of the initial media identification model.
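Randomly drawing M_b from the samples other than M_a can be sketched as follows. The helper name `pick_other_index` is hypothetical, and the sketch uses 0-based indices in the range 0 to M-1, whereas the text counts a and b from 1 to M.

```python
import random


def pick_other_index(a: int, m: int, rng: random.Random) -> int:
    """Return a uniformly random index b in [0, m) with b != a."""
    if m < 2:
        raise ValueError("at least two samples are needed to pick a different one")
    if not 0 <= a < m:
        raise ValueError("a must be a valid sample index")
    b = rng.randrange(m - 1)       # draw from m - 1 candidate slots
    return b if b < a else b + 1   # skip over index a


rng = random.Random(0)  # fixed seed only to make the example repeatable
pairs = [(a, pick_other_index(a, 5, rng)) for a in range(5)]
```

Each pair `(a, b)` identifies one sample and the different sample against which its cross-sample distances would be computed.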

In some embodiments, assuming that the media feature information includes a target modal feature, and the target modal feature includes a first modal feature and a second modal feature, the process of determining, by the computer device, a feature extraction error of the initial media identification model according to the media feature information corresponding to the M sample multimedia data may be as follows:

First, the computer device may obtain a first distance between the first modal feature of sample multimedia data M_a and the second modal feature of sample multimedia data M_a. Through this process, the computer device obtains the distance between the different kinds of modality information included in M_a. This distance can reflect the degree of matching between the different kinds of modality information included in M_a.

In some embodiments, the distance between different types of modal features of the same sample multimedia data, such as the first distance, may be denoted D_p. The calculation formula of D_p may be as follows:

where F_I is the first modal feature of sample multimedia data M_a, and F_T is the second modal feature of sample multimedia data M_a. For example, when the sample multimedia data is a sample video, F_I may be the image feature of sample video M_a (specifically, for example, the average image feature of sample video M_a), and F_T may be the text feature of sample video M_a.
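The formula itself does not survive in this text. Read directly from the definition of the first distance given above, one consistent form is sketched below; the generic distance function d(·,·) is an assumption, since the concrete metric (for example, Euclidean or cosine distance) is not stated here.

$$D_p = d(F_I, F_T)$$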

Meanwhile, the computer device may obtain a second distance between the first modal feature of sample multimedia data M_a and the second modal feature of sample multimedia data M_b, and determine a third distance between the second modal feature of sample multimedia data M_a and the first modal feature of sample multimedia data M_b. Through this process, the computer device obtains the distance between the different kinds of modality information of sample multimedia data M_a and sample multimedia data M_b.

The computer device may then determine the feature extraction error of the initial media identification model with respect to sample multimedia data M_a based on the first distance, the second distance, and the third distance.

In some embodiments, the computer device may first calculate the average of the second distance and the third distance, and then determine the feature extraction error of the initial media identification model with respect to sample multimedia data M_a based on the first distance and this average distance. The average distance may be understood as the distance between different modal features of different sample multimedia data, and may reflect the degree of matching between different modal features of different sample multimedia data.

In some embodiments, the distance between different types of modal features of different sample multimedia data, such as the average distance mentioned above, may be denoted D_n. The calculation formula of D_n may be as follows:

where F'_I is the first modal feature of sample multimedia data M_b, and F'_T is the second modal feature of sample multimedia data M_b. For example, when the sample multimedia data is a sample video, F'_I may be the image feature of sample video M_b (specifically, for example, the average image feature of sample video M_b), and F'_T may be the text feature of sample video M_b.
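The formula for D_n likewise does not survive in this text. From the description of the second distance (first modal feature of M_a against second modal feature of M_b), the third distance (second modal feature of M_a against first modal feature of M_b), and their average, one consistent form is sketched below, where d(·,·) is an assumed generic distance function whose concrete metric is not stated here.

$$D_n = \frac{d(F_I, F'_T) + d(F_T, F'_I)}{2}$$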

In some implementations, the feature extraction error of the initial media identification model with respect to sample multimedia data M_a may be denoted Loss_self. The calculation formula of Loss_self may be as follows:

$$\mathrm{Loss}_{self} = \lVert D_p - D_n \rVert \qquad \text{(Formula 1.5)}$$

or, alternatively,

$$\mathrm{Loss}_{self} = \max(m, \lVert D_p - D_n \rVert) \qquad \text{(Formula 1.6)}$$

where m is a set value, such as an empirical value. As the form of Loss_self indicates, the feature extraction error of the initial media identification model with respect to sample multimedia data M_a reflects both the degree of matching between different modal features of the same sample multimedia data and the degree of matching between different modal features of different sample multimedia data; it is a contrastive loss over these two degrees of matching. Unlike the losses based on labels and categories, this modality self-supervised loss considers only the information of the multimedia data itself, so that after the initial media recognition model is trained on the first sample set, the information it extracts is richer and more diverse, and better generalization performance is achieved.
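A small numerical sketch of Formulas 1.5 and 1.6 follows. The Euclidean distance is used as the distance between modal features purely as an assumption (the text does not fix the metric), and the function names and toy feature vectors are hypothetical.

```python
import math
from typing import Optional, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed distance between two modal features of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def feature_extraction_error(f_i, f_t, f_i_other, f_t_other,
                             m: Optional[float] = None) -> float:
    """Loss_self for sample M_a, given the features of another sample M_b.

    f_i, f_t: first and second modal features of M_a.
    f_i_other, f_t_other: first and second modal features of M_b.
    m: if None, Formula 1.5 is used; otherwise Formula 1.6 with set value m.
    """
    d_p = euclidean(f_i, f_t)                    # first distance (same sample)
    second = euclidean(f_i, f_t_other)           # M_a first vs M_b second
    third = euclidean(f_t, f_i_other)            # M_a second vs M_b first
    d_n = (second + third) / 2.0                 # average cross-sample distance
    gap = abs(d_p - d_n)
    return gap if m is None else max(m, gap)


# Toy features chosen so that the distances are 5, 1 and 3.
f_i, f_t = [0.0, 0.0], [3.0, 4.0]
f_i_other, f_t_other = [3.0, 1.0], [0.0, 1.0]
loss_15 = feature_extraction_error(f_i, f_t, f_i_other, f_t_other)
loss_16 = feature_extraction_error(f_i, f_t, f_i_other, f_t_other, m=4.0)
```

In the toy example D_p is 5 and D_n is 2, so Formula 1.5 gives 3, while Formula 1.6 with m = 4 gives 4.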

Then, once the feature extraction errors of the initial media identification model with respect to each of the M sample multimedia data have all been determined, these feature extraction errors are determined as the feature extraction error of the initial media identification model.

S206, according to the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model, carrying out first adjustment on the initial media identification model to obtain a target media identification model.

Specifically, the computer device may perform a first adjustment on the initial media recognition model according to a media prediction error of the initial media recognition model and a feature extraction error of the initial media recognition model to obtain an adjusted media recognition model, and determine a target media recognition model according to the adjusted media recognition model, where the target media recognition model is used to recognize at least one of a media tag and a media category of target multimedia data.

More specifically, the computer device may first determine a total media identification error of the initial media identification model based on the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model. Then, when the initial media recognition model does not satisfy a training stop condition, the computer device may perform the first adjustment on the initial media recognition model by using the total media identification error until the initial media recognition model satisfies the training stop condition, so as to obtain an adjusted media recognition model. The training stop condition may be that the number of iterations reaches an upper limit, that the initial media recognition model converges, that the total media identification error of the initial media recognition model reaches a minimum value, and the like. The first adjustment may be to update the model parameters of the initial media identification model by using the total media identification error, to optimize the total loss function of the initial media identification model, and so on. The total loss function here may be constructed from a loss function used to calculate the media label prediction error and a loss function used to calculate the feature extraction error. Further, the total loss function may be constructed from a loss function for calculating the media label prediction error, a loss function for calculating the media category prediction error, and a loss function for calculating the feature extraction error; for example, the total loss function here may be Loss_total. After obtaining the adjusted media identification model, the computer device determines the target media recognition model according to the adjusted media identification model.

In some embodiments, assuming that the media prediction error of the initial media identification model includes the media label prediction errors and the media category prediction errors respectively corresponding to the M sample multimedia data, and the feature extraction error of the initial media identification model includes the feature extraction errors of the initial media identification model with respect to each of the M sample multimedia data, the computer device may determine the total media identification error of the initial media identification model as follows: the computer device weights the media label prediction error, the media category prediction error, and the feature extraction error corresponding to sample multimedia data M_a to obtain the media identification error of sample multimedia data M_a. In this manner, the computer device can obtain the media identification errors respectively corresponding to the M sample multimedia data. After the media identification errors respectively corresponding to the M sample multimedia data are obtained, the computer device may superpose them to obtain the total media identification error of the initial media identification model.

In some embodiments, the media identification error of sample multimedia data, such as sample multimedia data M_a, may be denoted Loss_total. The calculation formula of Loss_total may be as follows:

$$\mathrm{Loss}_{total} = A \cdot \mathrm{Loss}_{tag} + B \cdot \mathrm{Loss}_{cls} + (1 - A - B) \cdot \mathrm{Loss}_{self} \qquad \text{(Formula 1.7)}$$

where A is the weight of Loss_tag, B is the weight of Loss_cls, and (1-A-B) is the weight of Loss_self.
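The weighting of Formula 1.7 and the superposition over the M samples can be illustrated as below. The function names are hypothetical, and the check that A and B are non-negative with A + B not exceeding 1 is an added assumption so that all three weights stay non-negative; the text itself only gives the weights A, B, and 1-A-B.

```python
from typing import Iterable, Tuple


def media_identification_error(loss_tag: float, loss_cls: float, loss_self: float,
                               a: float, b: float) -> float:
    """Formula 1.7 for one sample: weighted sum of the three errors."""
    if a < 0 or b < 0 or a + b > 1:
        raise ValueError("assumed constraint: a >= 0, b >= 0, a + b <= 1")
    return a * loss_tag + b * loss_cls + (1.0 - a - b) * loss_self


def total_media_identification_error(per_sample: Iterable[Tuple[float, float, float]],
                                     a: float, b: float) -> float:
    """Superpose (sum) the per-sample media identification errors."""
    return sum(media_identification_error(t, c, s, a, b) for t, c, s in per_sample)


# Hypothetical (Loss_tag, Loss_cls, Loss_self) triples for M = 2 samples.
errors = [(1.0, 2.0, 4.0), (2.0, 2.0, 0.0)]
total = total_media_identification_error(errors, a=0.5, b=0.25)
```

With A = 0.5 and B = 0.25 the two samples contribute 2.0 and 1.5, so the total is 3.5. Summation is used here for the superposition step; averaging over the samples would be an equally plausible reading.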

In some embodiments, the manner in which the computer device determines the target media recognition model from the adjusted media recognition model may be as follows: the computer device determines the adjusted media recognition model as a target media recognition model.

In some embodiments, the manner in which the computer device determines the target media recognition model from the adjusted media recognition model may instead be as follows: the computer device trains the adjusted media recognition model with a second sample set to obtain the target media recognition model. The second sample set here may include K sample multimedia data and the second annotated media labels respectively corresponding to the K sample multimedia data. The second annotated media labels are different from the first annotated media labels mentioned above. Specifically, for the process of training the adjusted media recognition model with the second sample set to obtain the target media recognition model, reference may be made to steps S301 to S305 in fig. 3:

S301, a second sample set is obtained, wherein the second sample set comprises K sample multimedia data and second annotated media labels respectively corresponding to the K sample multimedia data.

In some embodiments, the second sample set may be obtained by sample mining according to the reference multimedia data respectively corresponding to P second media annotation tags, where the P second media annotation tags include the second annotated media labels corresponding to the K sample multimedia data, and K is a positive integer. Through sample mining, when only a small number of sample multimedia data correspond to a second annotated media label, the sample multimedia data corresponding to that label can be effectively expanded and thereby enriched. For example, assuming that only x sample multimedia data originally correspond to a second annotated media label, sample mining can increase the number of sample multimedia data corresponding to that label from x to x + y, where y is a positive integer greater than or equal to 1 and less than the number of multimedia data in the multimedia data set.

In some embodiments, the computer device performs sample mining on the reference multimedia data corresponding to the P second media annotation tags to obtain the second sample set as follows. The computer device acquires the reference multimedia data respectively corresponding to the P second media annotation tags. According to the media feature information of the reference multimedia data corresponding to a second media annotation tag P_c, the computer device may retrieve, from the multimedia data set, multimedia data matching the second media annotation tag P_c as the candidate multimedia data corresponding to P_c. In this manner, the computer device can retrieve from the multimedia data set the candidate multimedia data respectively corresponding to the P second media annotation tags. After obtaining the candidate multimedia data respectively corresponding to the P second media annotation tags, the computer device may construct the second sample set according to that candidate multimedia data. Through the above process, the computer device can implement sample mining based on the P second media annotation tags.

In some embodiments, the manner in which the computer device retrieves, from the multimedia data set, multimedia data matching the second media annotation tag P_c according to the media feature information of the reference multimedia data corresponding to P_c may be as follows: the computer device acquires the media feature information of each multimedia data in the multimedia data set, determines the media distance between the media feature information of the reference multimedia data corresponding to P_c and the media feature information of each multimedia data in the multimedia data set, and retrieves from the multimedia data set, according to the determined media distances, the multimedia data matching P_c as the candidate multimedia data corresponding to P_c. Through this process, the computer device can implement sample mining based on the second media annotation tag P_c.
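The distance-based retrieval can be sketched as a nearest-neighbour search. The matching rule used here (keep the `top_k` items with the smallest distance), the Euclidean metric, and the function names are all assumptions made for illustration; the text only states that matching multimedia data are retrieved according to the determined media distances.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed media distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve_candidates(reference_feat: Sequence[float],
                        dataset_feats: Sequence[Sequence[float]],
                        top_k: int) -> List[int]:
    """Return indices of the top_k data-set items closest to the reference feature."""
    scored = sorted(
        (euclidean(reference_feat, feat), idx) for idx, feat in enumerate(dataset_feats)
    )
    return [idx for _, idx in scored[:top_k]]


# Hypothetical features: one reference item for tag P_c and a data set of four items.
reference = [0.0, 0.0]
dataset = [[5.0, 5.0], [1.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
candidates = retrieve_candidates(reference, dataset, top_k=2)
```

The returned indices identify the candidate multimedia data for the tag. A distance threshold could be used instead of a fixed `top_k`, and for large data sets an approximate nearest-neighbour index would normally replace the full sort.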

In some embodiments, the media feature information of the reference multimedia data respectively corresponding to the P second media annotation tags and the media feature information of each multimedia data in the multimedia data set may be obtained by feature extraction using a feature extraction model. For example, the computer device may extract, through an image feature extraction model, the image features of the reference videos respectively corresponding to the P second media annotation tags and the image features of each video in the video set. The image feature extraction model here may be a CLIP model. The computer device may also extract, through a text feature extraction model, the text features of the reference videos respectively corresponding to the P second media annotation tags and the text features of each video in the video set. The text feature extraction model here may be a BERT model. In one embodiment, the feature extraction model used for extracting this media feature information may be the feature extraction model included in the initial media recognition model before the initial media recognition model is trained with the first sample set.

In some embodiments, the processing of the CLIP model, the determination of the average modal feature (e.g., the average image feature), and the processing of the BERT model are as follows:

Here, the feature extracted from I_d represents the modal feature corresponding to one of the plurality of first modal information; for example, it may represent the image feature corresponding to any one video frame I_d in a video such as a reference video, where d is a positive integer not greater than the number of video frames. F_I represents the average image feature of the reference video I. The average image feature can be taken as one of the overall image features of the video.
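The averaging step can be illustrated with a minimal sketch. It assumes the per-frame features have already been produced by some image feature extractor (for instance a CLIP image encoder) as equal-length lists of floats; the function name `average_feature` and the sample values are hypothetical and are not taken from the application.

```python
from typing import List, Sequence


def average_feature(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Return the element-wise mean of equal-length per-frame feature vectors."""
    if not frame_features:
        raise ValueError("at least one frame feature is required")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frame features must have the same dimension")
    count = len(frame_features)
    # Sum each dimension over all frames, then divide by the number of frames.
    return [sum(f[c] for f in frame_features) / count for c in range(dim)]


# Hypothetical 2-dimensional features for three frames of one video.
video_feature = average_feature([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

In practice the feature dimension would be that of the chosen extractor; the two-dimensional vectors here only keep the example readable.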

In some embodiments, the media feature information of the reference multimedia data respectively corresponding to the P second media annotation tags, and the media feature information of each multimedia data in the multimedia data set, may be extracted, before the initial media recognition model is trained, by the feature extraction model included in that untrained initial media recognition model.

In some implementations, assuming that the media feature information includes a first modal feature and a second modal feature, the computer device determines the media distance between the media feature information of the reference multimedia data corresponding to the second media annotation tag P_c and the media feature information of each multimedia data in the multimedia data set, and retrieves from the multimedia data set, according to the determined media distances, the multimedia data matching P_c as the candidate multimedia data matched with P_c, specifically as follows:

First, determine the distance between the first modal feature of the reference multimedia data corresponding to the second media annotation tag P_c and the first modal feature of each multimedia data in the multimedia data set, as the first media distance.

That is, the computer device may determine, for each multimedia data in the multimedia data set, a first media distance between its first modal feature and the first modal feature of the reference multimedia data corresponding to the second media annotation tag P_c. In some embodiments, the first media distance may be a cosine distance.

In some implementations, the computer device can normalize the first modal feature of the reference multimedia data corresponding to the second media annotation tag P_c to obtain a first normalized feature (the normalized feature of the reference multimedia data corresponding to P_c), and can normalize the first modal feature of each multimedia data in the multimedia data set to obtain a second normalized feature for each multimedia data (the normalized feature of that multimedia data). The computer device then determines the distance between the first normalized feature and the second normalized feature of each multimedia data in the multimedia data set. Normalizing each modal feature of the reference multimedia data and of each multimedia data in the multimedia data set yields a more accurate and consistent media distance.

In some embodiments, the distance between the first modal feature of the reference multimedia data corresponding to a second media annotation tag (such as P_c) and the first modal feature of any multimedia data in the multimedia data set can be denoted d, which can be calculated by the following formula:

Here, F_Itag refers to the first modal feature of the reference multimedia data corresponding to the second media annotation tag, such as the first modal feature of the reference multimedia data corresponding to P_c, and F_I refers to the first modal feature of any multimedia data in the multimedia data set. For example, when the multimedia data is video, F_Itag can refer to the image feature of the reference video corresponding to the second annotated video tag (for example, the average image feature of the reference video), and F_I can refer to the image feature of any video in the video set (for example, the average image feature of that video). The l2norm function normalizes a feature vector.
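A cosine distance over l2-normalized features can be sketched as follows. The exact form of the formula for d is not reproduced here, so this sketch assumes the common definition d = 1 - l2norm(F_Itag)ᵀ · l2norm(F_I); the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def l2norm(vector: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_distance(f_itag: Sequence[float], f_i: Sequence[float]) -> float:
    """Assumed form: 1 minus the dot product of the two normalized features."""
    if len(f_itag) != len(f_i):
        raise ValueError("features must have the same dimension")
    a, b = l2norm(f_itag), l2norm(f_i)
    similarity = sum(x * y for x, y in zip(a, b))
    return 1.0 - similarity
```

Under this assumption, identical directions give a distance of 0, orthogonal features give 1, and opposite directions give 2, so a smaller value means a closer match.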

Second, determine the distance between the second modal feature of the reference multimedia data corresponding to the second media annotation tag P_c and the second modal feature of each multimedia data in the multimedia data set, as the second media distance.

That is, the computer device may determine, for each multimedia data in the multimedia data set, a second media distance between its second modal feature and the second modal feature of the reference multimedia data corresponding to the second media annotation tag P_c. In some embodiments, the second media distance may be a cosine distance.

The manner in which the computer device determines the distance between the second modal feature of the reference multimedia data corresponding to the second media annotation tag P_c and the second modal feature of each multimedia data in the multimedia data set may refer to the manner of determining the corresponding distance between first modal features described above, and is not repeated here.

Third, retrieve multimedia data whose first media distance is smaller than a first distance threshold from the multimedia data set as a first multimedia data subset, and retrieve multimedia data whose second media distance is smaller than a second distance threshold from the multimedia data set as a second multimedia data subset.

In some embodiments, the computer device may also sort the multimedia data in the multimedia data set by first media distance to obtain a first sorting result, and determine the first multimedia data subset from the multimedia data set according to the first sorting result. For example, when multimedia data with a smaller first media distance is ranked higher, the computer device may take the top Q₁ multimedia data from the multimedia data set to construct the first multimedia data subset. In some embodiments, Q₁ may be a hyperparameter, which in the embodiments of the present application refers to a parameter set before the initial media recognition model is trained. The computer device may likewise sort the multimedia data in the multimedia data set by second media distance to obtain a second sorting result, and determine the second multimedia data subset from the multimedia data set according to the second sorting result. For example, when multimedia data with a smaller second media distance is ranked higher, the computer device may take the top Q₂ multimedia data from the multimedia data set to construct the second multimedia data subset. In some embodiments, Q₂ may also be a hyperparameter. Q₁ and Q₂ can be adjusted according to actual service requirements.

For example, assuming the multimedia data is video and the media feature information includes image features and text features, the computer device may determine the distance between the image feature of the reference video corresponding to the second media annotation tag P_c and the image feature of each video in the video set, and select from the video set the Q₁ videos whose image features are closest to the image feature of that reference video. The computer device may also determine the distance between the text feature of the reference video corresponding to P_c and the text feature of each video in the video set, and select from the video set the Q₂ videos whose text features are closest to the text feature of that reference video. Through this process, the computer device can retrieve Q₁+Q₂ videos corresponding to the second media annotation tag P_c from the video set.
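The two-modality retrieval in this example can be sketched as below. This is an illustrative sketch only: the cosine distance is assumed to be 1 minus the cosine similarity, features are plain lists of floats keyed by a hypothetical video identifier, and the names `top_q` and `mine_candidates` are not from the application.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumed distance: 1 - cosine similarity of the two feature vectors.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_q(reference: Sequence[float],
          candidates: Dict[str, Sequence[float]], q: int) -> List[str]:
    """Rank candidates by ascending distance to the reference and keep the top q."""
    ranked = sorted(candidates,
                    key=lambda vid: _cosine_distance(reference, candidates[vid]))
    return ranked[:q]


def mine_candidates(ref_image: Sequence[float], ref_text: Sequence[float],
                    image_feats: Dict[str, Sequence[float]],
                    text_feats: Dict[str, Sequence[float]],
                    q1: int, q2: int) -> Tuple[List[str], List[str]]:
    """Return the first (image-based) and second (text-based) subsets."""
    first_subset = top_q(ref_image, image_feats, q1)
    second_subset = top_q(ref_text, text_feats, q2)
    return first_subset, second_subset
```

The two subsets are returned separately so that a later step (automatic or manual) can decide which of the Q₁+Q₂ retrieved items are kept.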

Fourth, determine the matched multimedia data from the first multimedia data subset and the second multimedia data subset. In this embodiment, the computer device may determine the first multimedia data subset and the second multimedia data subset as the multimedia data matched with the second media annotation tag P_c.

In some embodiments, to improve the accuracy of the matched multimedia data, the computer device can send the second media annotation tag P_c, the first multimedia data subset, and the second multimedia data subset to a designated user terminal, which can display them. The user of the designated user terminal can manually judge whether each multimedia data in the first multimedia data subset matches P_c, and whether each multimedia data in the second multimedia data subset matches P_c, so as to screen out from the two subsets the multimedia data matched with P_c. The user can then notify the computer device, through the designated user terminal, of the multimedia data matched with P_c, so that the computer device obtains the multimedia data matched with the second media annotation tag P_c.

S302, obtaining S sample multimedia data from the K sample multimedia data to serve as a target support set, and obtaining T sample multimedia data from the K sample multimedia data to serve as a target query set.

In some embodiments, S may be the sum of the numbers of sample multimedia data in the support sets respectively corresponding to the N second annotated media tags, where S is a positive integer less than K and N is a positive integer. The S sample multimedia data may include each sample multimedia data in the support set corresponding to each of the N second annotated media tags. The target support set may consist of the support sets respectively corresponding to the N second annotated media tags.

For example, when N is 3 and the support sets corresponding to the 1st, 2nd, and 3rd second annotated media tags each include 6 sample multimedia data, S is 18. The S sample multimedia data may include the 6 sample multimedia data in each of the support sets corresponding to these 3 second annotated media tags, for a total of 18 sample multimedia data. The target support set may consist of the support sets respectively corresponding to the 3 second annotated media tags.

In some embodiments, T is the sum of the numbers of sample multimedia data in the query set corresponding to the N second annotated media tags, respectively. Wherein T is a positive integer less than K. The T sample multimedia data may include each sample multimedia data in the query set corresponding to each of the N second annotated media tags. The target query set may be a query set corresponding to the N second labeled media tags, respectively. The target support set is different from the target query set, specifically, a support set corresponding to a target second tagged media tag of the N second tagged media tags is different from a query set corresponding to the target second tagged media tag, and the target second tagged media tag is any one of the N second tagged media tags.

For example, when N is 3, the query set corresponding to the 1 st second labeled media tag includes 6 sample multimedia data, the query set corresponding to the 2 nd second labeled media tag includes 6 sample multimedia data, the query set corresponding to the 3 rd second labeled media tag includes 6 sample multimedia data, and then T is 18. The T sample multimedia data may include 6 sample multimedia data in the query set corresponding to the above 3 second annotated media tags, respectively, for a total of 18 sample multimedia data. The target query set may be a query set corresponding to 3 second annotated media tags respectively.

In some embodiments, the computer device obtains the target support set and the target query set as follows:

First, extract N second annotated media tags from the second annotated media tags respectively corresponding to the K sample multimedia data.

Second, extract, from the K sample multimedia data, E sample multimedia data corresponding to the second annotated media tag N_r;

where the second annotated media tag N_r belongs to the N second annotated media tags; r is a positive integer less than or equal to N, and E is a positive integer less than K.

Third, once E sample multimedia data corresponding to each of the N second annotated media tags have been extracted from the K sample multimedia data, divide the E sample multimedia data corresponding to each of the N second annotated media tags to obtain a support set and a query set for each tag, determine the support sets respectively corresponding to the N second annotated media tags as the target support set, and determine the query sets respectively corresponding to the N second annotated media tags as the target query set.

In some embodiments, after extracting the E sample multimedia data corresponding to the second annotated media tag N_r, the computer device can divide them into two parts: one part is used to construct the support set corresponding to N_r, and the other part is used to construct the query set corresponding to N_r. In this way, the computer device can obtain the support set and the query set respectively corresponding to each of the N second annotated media tags, determine the support sets respectively corresponding to the N second annotated media tags as the target support set, and determine the query sets respectively corresponding to the N second annotated media tags as the target query set.

In some embodiments, after extracting the E sample multimedia data corresponding to the second annotated media tag N_r, the computer device can divide them evenly into two parts: E/2 sample multimedia data are used to construct the support set corresponding to N_r, and the other E/2 sample multimedia data are used to construct the query set corresponding to N_r. In this manner, the support set includes the same number of sample multimedia data as the query set. The computer device can thus obtain the support set and the query set respectively corresponding to each of the N second annotated media tags, determine the support sets respectively corresponding to the N second annotated media tags as the target support set, and determine the query sets respectively corresponding to the N second annotated media tags as the target query set.
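The construction of the target support set and target query set can be sketched as follows. This is a minimal illustration under stated assumptions: tags and samples are plain strings, sampling is uniformly random, the split is the even E/2 split, and the function name `build_episode` is hypothetical.

```python
import random
from typing import Dict, List, Tuple


def build_episode(samples_by_tag: Dict[str, List[str]], n_tags: int,
                  e_per_tag: int, rng: random.Random
                  ) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Pick N tags, draw E samples per tag, and split them into support/query."""
    # Only tags with at least E samples can take part in an episode.
    eligible = sorted(t for t, s in samples_by_tag.items() if len(s) >= e_per_tag)
    chosen_tags = rng.sample(eligible, n_tags)
    support: Dict[str, List[str]] = {}
    query: Dict[str, List[str]] = {}
    for tag in chosen_tags:
        picked = rng.sample(samples_by_tag[tag], e_per_tag)
        half = e_per_tag // 2  # even split; an odd E leaves one extra query sample
        support[tag] = picked[:half]
        query[tag] = picked[half:]
    return support, query


# Hypothetical data: 4 tags with 12 videos each; N = 3 and E = 12 give S = T = 18.
samples_by_tag = {f"tag{t}": [f"tag{t}_video{n}" for n in range(12)] for t in range(4)}
support, query = build_episode(samples_by_tag, n_tags=3, e_per_tag=12,
                               rng=random.Random(0))
```

Because each tag's samples are drawn without replacement before the split, the support set and query set of a tag never share a sample.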

S303, training the adjusted media identification model according to the target support set to obtain a candidate media identification model.

S304, determining the media label identification error of the candidate media identification model according to the target query set.

Specifically, the target query set includes query sets corresponding to the N second labeled media tags, and the manner for determining the media tag identification error of the candidate media identification model by the computer device according to the target query set may be as follows:

First, use the candidate media identification model to perform tag recognition on sample multimedia data F_t in the query set corresponding to the second annotated media tag N_r, to obtain tag prediction information for the sample multimedia data F_t.

In this embodiment, the computer device may input the sample multimedia data F_t in the query set corresponding to the second annotated media tag N_r into the candidate media identification model, and obtain the tag prediction information for F_t through the processing of the candidate media identification model. The second annotated media tag N_r belongs to the N second annotated media tags, and r is a positive integer less than or equal to N. The tag prediction information may include a first probability that the second predicted media tag of the sample multimedia data F_t is the second annotated media tag N_r, and a second probability that the second predicted media tag of F_t is a second annotated media tag N_u. The second annotated media tag N_u is any of the N second annotated media tags other than N_r; u is a positive integer less than or equal to N, and u is different from r.

In some embodiments, the computer device may obtain the tag prediction information for the sample multimedia data F_t through the candidate media identification model as follows: the computer device determines the tag average feature corresponding to the second annotated media tag N_r; calls the candidate media identification model to extract the media feature information corresponding to F_t; calculates, according to the tag average feature corresponding to N_r and the media feature information corresponding to F_t, the first probability that the second predicted media tag of F_t is N_r; determines the tag average feature corresponding to the second annotated media tag N_u; and calls the candidate media identification model to calculate, according to the tag average feature corresponding to N_u and the media feature information corresponding to F_t, the second probability that the second predicted media tag of F_t is N_u.

In some implementations, the computer device may determine the tag average feature corresponding to the second annotated media tag N_r as follows: the computer device acquires the media feature information of each sample multimedia data in the support set corresponding to N_r, and determines the tag average feature corresponding to N_r according to that media feature information.

In some embodiments, the tag average feature of a second annotated media tag can be denoted f_i, which can be calculated by the following formula:

$$f_i = \frac{1}{k}\sum_{j=1}^{k} f_{ij}$$

where i denotes the i-th second annotated media tag, and f_ij represents the target media feature of the j-th sample multimedia data in the support set corresponding to the i-th second annotated media tag. The target media feature may be, for example, the fused feature obtained by fusing the multi-modal features of the j-th sample multimedia data. k denotes the number of sample multimedia data in the support set corresponding to the i-th second annotated media tag. Here, i is a positive integer less than or equal to N, and j is a positive integer less than or equal to k. The i-th second annotated media tag can be the second annotated media tag N_r, and the j-th sample multimedia data can be the sample multimedia data F_t.

In some embodiments, if the support set corresponding to the second annotated media tag N_r includes k sample multimedia data, the query set corresponding to N_r may include (E-k) sample multimedia data. When the media feature information includes a target modal feature, and the target modal feature includes a first modal feature and a second modal feature, the computer device may use the candidate media identification model to obtain the first modal features and second modal features respectively corresponding to the k sample multimedia data of N_r; fuse, using the candidate media identification model, the first modal feature and the second modal feature of each of the k sample multimedia data to obtain the fused features respectively corresponding to the k sample multimedia data of N_r; and determine, according to those fused features, the tag average feature corresponding to N_r. The first modal feature and second modal feature of sample multimedia data may be obtained in the manner described above, which is not repeated here.

For example, when the sample multimedia data is a sample video, the computer device can acquire the image feature sets respectively corresponding to the k sample videos of the second annotated video tag N_r, and then determine the average image features respectively corresponding to those k sample videos. The computer device can also acquire the text features respectively corresponding to the k sample videos of N_r. Thereafter, the computer device can use the initial video recognition model to fuse the average image feature and the text feature of each of the k sample videos, obtaining the fused features respectively corresponding to the k sample videos of N_r, and determine the tag average feature corresponding to N_r according to those fused features. Here, the computer device may add up the image features in the image feature set corresponding to a sample video to obtain a summed image feature, and divide the summed image feature by the number of video frames in the sample video to obtain the average image feature corresponding to that sample video.

In one embodiment, the computer device may extract the media feature information corresponding to sample multimedia data through the feature extraction model included in the candidate media identification model, for example by calling that feature extraction model to extract the media feature information corresponding to the sample multimedia data F_t.

By performing the above operations, the computer device can acquire the tag average feature corresponding to the second annotated media tag N_r and the tag average feature corresponding to the second annotated media tag N_u.

After performing the above operations, the computer device may calculate, according to the tag average feature corresponding to the second annotated media tag N_r and the media feature information corresponding to the sample multimedia data F_t, the first probability that the second predicted media tag of F_t is N_r. The computer device may also call the candidate media identification model to calculate, according to the tag average feature corresponding to the second annotated media tag N_u and the media feature information corresponding to F_t, the second probability that the second predicted media tag of F_t is N_u.

In this way, the computer device can obtain the sample multimedia data F _t The corresponding second predicted media tags are the probabilities of the N second labeled media tags, respectively. In some embodiments, the computer device may specifically calculate the probabilities that the second predicted media tags corresponding to the sample multimedia data are the plurality of second labeled media tags respectively in the following manner.

$$p(j,i) = \mathrm{Sigmoid}\left(\beta \cdot \mathrm{l2norm}(f_i)^{T} \times \mathrm{l2norm}(f'_{ij})\right) \tag{2.1}$$

where f'_ij represents the target media feature of the j-th sample multimedia data in the query set corresponding to the i-th second annotated media tag; the target media feature can be obtained by fusing the multi-modal features of the j-th sample multimedia data. The l2norm function normalizes a feature vector. The Sigmoid function may be expressed as follows:

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{2.2}$$

In view of the above formulas, by letting the i-th second annotated media tag be the second annotated media tag N_r and the j-th sample multimedia data be F_t, the computer device can calculate the first probability that the second predicted media tag of the sample multimedia data F_t is N_r. Similarly, by letting the i-th second annotated media tag be the second annotated media tag N_u and the j-th sample multimedia data be F_t, the computer device can calculate the second probability that the second predicted media tag of F_t is N_u.
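The probability calculation of Equation 2.1 can be sketched as follows. The sketch assumes the fused target media features are already available as lists of floats, treats the tag average feature as the element-wise mean of the support-set features, and uses hypothetical function names; the value chosen for β is an illustrative assumption, since no value is specified.

```python
import math
from typing import List, Sequence


def l2norm(vector: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def tag_average_feature(support_features: Sequence[Sequence[float]]) -> List[float]:
    """Mean of the target media features of the k support-set samples of one tag."""
    k = len(support_features)
    dim = len(support_features[0])
    return [sum(f[c] for f in support_features) / k for c in range(dim)]


def tag_probability(f_i: Sequence[float], f_query: Sequence[float],
                    beta: float) -> float:
    """Sigmoid of beta times the dot product of the two normalized features."""
    a, b = l2norm(f_i), l2norm(f_query)
    score = beta * sum(x * y for x, y in zip(a, b))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical 2-dimensional fused features for a tag with k = 2 support samples.
f_i = tag_average_feature([[1.0, 0.0], [3.0, 0.0]])
p_match = tag_probability(f_i, [5.0, 0.0], beta=10.0)
```

Because both features are normalized, the dot product lies in [-1, 1], so β controls how sharply the probability separates aligned from non-aligned features.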

Second, determine, according to the first probability and the second probability, the sample prediction error of the second annotated media tag N_r on the sample multimedia data F_t. This sample prediction error can be calculated by the following formula:

$$\mathrm{loss}_{ij} = -\left(\log p(j,i) + \sum_{i'} \log\bigl(1 - p(j,i')\bigr)\right) \tag{2.3}$$

wherein i' represents any second annotated media tag except the ith second annotated media tag.
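The sample prediction error can be sketched in code as follows. The sketch assumes Equation 2.3 takes the binary cross-entropy form in which both logarithmic terms are negated, so that the error shrinks as the probability of the annotated tag rises and the probabilities of the other tags fall; the function name and the clipping constant `eps` are hypothetical additions for numerical safety.

```python
import math
from typing import Sequence


def sample_prediction_error(p_true: float, p_others: Sequence[float],
                            eps: float = 1e-12) -> float:
    """Assumed form: -(log p(j,i) + sum over i' of log(1 - p(j,i')))."""
    def clip(p: float) -> float:
        # Keep probabilities strictly inside (0, 1) so the logarithm is defined.
        return min(max(p, eps), 1.0 - eps)

    loss = -math.log(clip(p_true))
    for p in p_others:
        loss -= math.log(1.0 - clip(p))
    return loss
```

For instance, with one other tag, a confident correct prediction (0.9 for the annotated tag, 0.1 for the other) yields a smaller error than the reverse assignment.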

After the computer device obtains the probabilities that the second predicted media tags corresponding to the sample multimedia data are respectively the plurality of second labeled media tags, the computer device can calculate the sample prediction error of the second media labeled tags on the sample multimedia data in the corresponding query set through the formula.

Specifically, after obtaining the probabilities that the second predicted media tag of the sample multimedia data F_t is each of the plurality of second annotated media tags, the computer device can calculate, through the above formula, the sample prediction error of the second annotated media tag N_r on the sample multimedia data F_t in the corresponding query set.

The computer device can let the i-th second annotated media tag be the second annotated media tag N_r, let the i'-th second annotated media tags be the remaining second annotated media tags, and let the j-th sample multimedia data be F_t, and thereby obtain the sample prediction error of N_r on each sample multimedia data in the corresponding query set.

Third, determine, according to the sample prediction error of the second annotated media tag N_r on the sample multimedia data F_t, the total sample prediction error of N_r on its corresponding query set; once the total sample prediction errors of the N second annotated media tags on their corresponding query sets have been obtained, determine the media tag identification error of the candidate media identification model according to the total sample prediction errors of the N second annotated media tags.

After obtaining the sample prediction error of the second annotated media tag N_r on each sample multimedia data in its corresponding query set, the computer device can sum those sample prediction errors to obtain the total sample prediction error of N_r on its corresponding query set. In this way, the computer device can obtain the total sample prediction error of each of the N second annotated media tags on its corresponding query set, as the total sample prediction errors of the N second annotated media tags. The computer device can then sum the N total sample prediction errors to obtain the media tag identification error of the candidate media identification model.

In some embodiments, the media tag identification error of the candidate media identification model may be denoted Loss_all, which can be calculated as follows:

$$\mathrm{Loss}_{all} = \sum_{i}\sum_{j} \mathrm{loss}_{ij} \tag{2.4}$$
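The two-level summation of Equation 2.4 can be sketched as follows. The data layout is an assumption made for illustration: for each tag, a list with one entry per query sample, each entry mapping every one of the N tags to its predicted probability (strictly between 0 and 1, as a Sigmoid output would be). The per-sample error is again assumed to be the negated binary cross-entropy form, and the function name is hypothetical.

```python
import math
from typing import Dict, List, Tuple


def media_tag_identification_error(
    query_probabilities: Dict[str, List[Dict[str, float]]]
) -> Tuple[Dict[str, float], float]:
    """Return the per-tag total sample prediction errors and their overall sum."""
    per_tag_total: Dict[str, float] = {}
    for tag, samples in query_probabilities.items():
        total = 0.0
        for probs in samples:
            # Error of one query sample: annotated tag term plus all other tags.
            loss = -math.log(probs[tag])
            loss -= sum(math.log(1.0 - p) for other, p in probs.items()
                        if other != tag)
            total += loss
        per_tag_total[tag] = total
    return per_tag_total, sum(per_tag_total.values())
```

The overall sum is the quantity that a second adjustment of the model parameters would try to reduce.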

S305, performing a second adjustment on the candidate media identification model by using the media tag identification error of the candidate media identification model, to obtain a target media identification model.

In this embodiment, when the candidate media identification model does not satisfy the training stop condition, the computer device may perform a second adjustment on the candidate media identification model by using the media tag identification error of the candidate media identification model, to obtain the target media identification model. The training stop condition may be that the number of iterations reaches an upper limit, that the initial media identification model converges, that the media tag identification error of the initial media identification model reaches a minimum value, and so on. The second adjustment may be to update the model parameters of the candidate media identification model with its media tag identification error so as to optimize a loss function of the candidate media identification model, such as Loss_all. In one embodiment, the computer device may perform steps S302-S305 in each iteration of the candidate media identification model. Steps S301 to S305 use a model optimization method based on meta-learning: only a part of the second media annotation tags serve as optimization targets in each iteration, and the differences between different second media annotation tags increase the diversity of training directions, thereby avoiding overfitting.

As can be seen, the computer device may obtain a first sample set, where the first sample set includes M sample multimedia data, and a first labeled media tag and a labeled media category corresponding to each of the M sample multimedia data; the computer equipment can utilize the initial media identification model to predict to obtain first predicted media labels corresponding to the M sample multimedia data respectively, and utilizes the initial media identification model to predict to obtain predicted media categories corresponding to the M sample multimedia data respectively; furthermore, the computer device may determine a media prediction error of the initial media identification model according to a first labeled media tag, a labeled media category, a first predicted media tag, and a predicted media category corresponding to the M sample multimedia data, respectively; in addition, the computer equipment can also determine the characteristic extraction error of the initial media identification model according to the media characteristic information respectively corresponding to the M sample multimedia data; furthermore, the computer equipment can perform first adjustment on the initial media recognition model according to the media prediction error of the initial media recognition model and the feature extraction error of the initial media recognition model to obtain a target media recognition model, and the initial media recognition model is trained by utilizing the first sample set in a multi-task learning mode in the process, so that the feature expression capability of the model is improved, the generalization capability of the model is also improved, and the prediction accuracy of the media recognition model for multimedia data can be effectively improved.

Fig. 4 is a schematic structural diagram of a media data processing apparatus according to an embodiment of the present application. The media data processing apparatus may be a computer program (including program code) running in a computer device; for example, the media data processing apparatus is application software. The apparatus may be configured to execute the corresponding steps in the methods provided by the embodiments of the present application. As shown in Fig. 4, the apparatus may include: an acquisition module 401, a feature extraction module 402, a prediction module 403, a determination module 404, and an adjustment module 405. Optionally, the apparatus may further include a construction module 406.

The acquisition module is configured to obtain a first sample set, where the first sample set includes M sample multimedia data, and a first labeled media tag and a labeled media category respectively corresponding to the M sample multimedia data; M is a positive integer.

Optionally, the determining module determines the media prediction error of the initial media identification model according to the first labeled media tag, the labeled media category, the first predicted media tag, and the predicted media category respectively corresponding to the M sample multimedia data, including:

determining, according to the first labeled media tag and the first predicted media tag corresponding to sample multimedia data M_a, a media tag prediction error corresponding to the sample multimedia data M_a; the sample multimedia data M_a belongs to the M sample multimedia data, and a is a positive integer less than or equal to M;

determining, according to the labeled media category and the predicted media category corresponding to the sample multimedia data M_a, a media category prediction error corresponding to the sample multimedia data M_a;

and if the media category prediction errors and the media label prediction errors respectively corresponding to the M sample multimedia data are all obtained, determining the media category prediction errors and the media label prediction errors respectively corresponding to the M sample multimedia data as the media prediction errors of the initial media identification model.
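As an illustrative, non-limiting sketch, the per-sample media tag prediction error and media category prediction error can be computed by comparing the model's predicted distribution with the labeled value. This passage does not name a loss function; cross-entropy is used here purely as an assumption, and the dictionary keys and probability values are hypothetical.

```python
import math

def cross_entropy(probabilities, target_index):
    """Negative log-probability assigned to the labeled class (assumed loss)."""
    return -math.log(probabilities[target_index])

def media_prediction_errors(samples):
    """Return one (tag_error, category_error) pair for every sample M_a."""
    errors = []
    for sample in samples:
        tag_error = cross_entropy(sample["tag_probs"], sample["tag"])
        category_error = cross_entropy(sample["category_probs"], sample["category"])
        errors.append((tag_error, category_error))
    return errors

# Hypothetical predictions for M = 2 samples.
samples = [
    {"tag_probs": [0.5, 0.5], "tag": 0, "category_probs": [0.25, 0.75], "category": 1},
    {"tag_probs": [1.0, 0.0], "tag": 0, "category_probs": [0.5, 0.5], "category": 0},
]
errors = media_prediction_errors(samples)
```

The list of pairs, taken over all M samples, plays the role of the media prediction error of the initial media identification model in this sketch.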

Optionally, the media feature information includes a first modality feature and a second modality feature, and the determining module determines the feature extraction error of the initial media identification model according to the media feature information corresponding to the M sample multimedia data, including:

obtaining a first distance between the first modal feature of sample multimedia data M_a and the second modal feature of the sample multimedia data M_a;

obtaining a second distance between the first modal feature of the sample multimedia data M_a and sample multimedia data M_b, and determining a third distance between the sample multimedia data M_a and the first modal feature of the sample multimedia data M_b; the sample multimedia data M_b is any one of the M sample multimedia data other than the sample multimedia data M_a, b is a positive integer less than or equal to M, and a is different from b;

determining, according to the first distance, the second distance, and the third distance, a feature extraction error of the initial media identification model with respect to the sample multimedia data M_a;

and if the feature extraction errors of the initial media identification model with respect to each of the M sample multimedia data have all been determined, determining those feature extraction errors as the feature extraction error of the initial media identification model.
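As an illustrative, non-limiting sketch, the three distances can be combined so that the two modal features of the same sample are pulled together while features of different samples are pushed apart. Several points here are assumptions and not taken from the text: the Euclidean metric, the pairing of modal features used for the second and third distances, the hinge-style combination, and the `margin` parameter. All names and values are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def feature_extraction_error(first_distance, second_distance, third_distance, margin=0.25):
    """Assumed hinge-style error: the matched pair (first distance) should be
    closer than each mismatched pair (second, third distance) by `margin`."""
    return (max(0.0, first_distance - second_distance + margin)
            + max(0.0, first_distance - third_distance + margin))

# Hypothetical two-modality features for samples M_a and M_b.
m_a = {"first": [0.0, 0.0], "second": [0.0, 1.0]}
m_b = {"first": [3.0, 4.0], "second": [0.0, 3.0]}

first_distance = euclidean(m_a["first"], m_a["second"])   # same sample, across modalities
second_distance = euclidean(m_a["first"], m_b["second"])  # assumed pairing
third_distance = euclidean(m_a["second"], m_b["first"])   # assumed pairing
error_a = feature_extraction_error(first_distance, second_distance, third_distance)
```

Under these assumptions the error is zero once both mismatched pairs are at least `margin` farther apart than the matched pair, and grows linearly otherwise.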

Optionally, the adjusting module performs a first adjustment on the initial media identification model according to the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model, to obtain a target media identification model, and the adjusting module includes:

determining a total media identification error of the initial media identification model according to the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model;

if the initial media identification model does not meet the training stop condition, performing the first adjustment on the initial media identification model by using the total media identification error until the initial media identification model meets the training stop condition, to obtain an adjusted media identification model;

and determining a target media identification model according to the adjusted media identification model.

Optionally, the media prediction error of the initial media identification model includes media tag prediction errors and media category prediction errors respectively corresponding to the M sample multimedia data; the feature extraction error of the initial media identification model includes the feature extraction errors of the initial media identification model with respect to each of the M sample multimedia data; the adjusting module determines the total media identification error of the initial media identification model according to the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model, including:

weighting the media tag prediction error, the media category prediction error, and the feature extraction error corresponding to sample multimedia data M_a to obtain a media identification error of the sample multimedia data M_a; the sample multimedia data M_a belongs to the M sample multimedia data, and a is a positive integer less than or equal to M;

and if the media identification errors respectively corresponding to the M sample multimedia data are all obtained, performing superposition processing on the media identification errors respectively corresponding to the M sample multimedia data to obtain a total media identification error of the initial media identification model.
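As an illustrative, non-limiting sketch, the weighting and superposition steps amount to a weighted sum per sample followed by a sum over all M samples. The weight values are not given in this passage; the default weights, function names, and numeric inputs below are hypothetical.

```python
def media_identification_error(tag_error, category_error, feature_error,
                               weights=(1.0, 1.0, 1.0)):
    """Weighted combination of the three per-sample errors for one sample M_a."""
    w_tag, w_category, w_feature = weights
    return w_tag * tag_error + w_category * category_error + w_feature * feature_error

def total_media_identification_error(per_sample_errors, weights=(1.0, 1.0, 1.0)):
    """Superpose the per-sample media identification errors over all M samples."""
    return sum(media_identification_error(*errors, weights=weights)
               for errors in per_sample_errors)

# Hypothetical (tag, category, feature) errors for M = 2 samples.
total = total_media_identification_error(
    [(0.5, 0.25, 1.0), (1.0, 0.5, 0.0)],
    weights=(1.0, 2.0, 0.5),
)
```

Exposing the weights as a parameter makes it possible to balance the tag, category, and feature extraction tasks during multi-task training.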

Optionally, the determining, by the adjusting module, the target media identification model according to the adjusted media identification model includes:

acquiring a second sample set, wherein the second sample set includes K sample multimedia data and second labeled media tags respectively corresponding to the K sample multimedia data; K is a positive integer;

obtaining S sample multimedia data from the K sample multimedia data as a target support set, and obtaining T sample multimedia data from the K sample multimedia data as a target query set; S and T are positive integers smaller than K, and the target support set is different from the target query set;

training the adjusted media identification model according to the target support set to obtain a candidate media identification model;

determining a media tag identification error of the candidate media identification model according to the target query set; and performing a second adjustment on the candidate media identification model by using the media tag identification error of the candidate media identification model to obtain a target media identification model.

Optionally, the adjusting module obtains S sample multimedia data from the K sample multimedia data as a target support set, and obtains T sample multimedia data from the K sample multimedia data as a target query set, including:

extracting N second labeled media tags from the second labeled media tags respectively corresponding to the K sample multimedia data; N is a positive integer;

extracting, from the K sample multimedia data, E sample multimedia data corresponding to second labeled media tag N_r; the second labeled media tag N_r belongs to the N second labeled media tags; r is a positive integer less than or equal to N, and E is a positive integer less than K;

if the E sample multimedia data respectively corresponding to the N second labeled media tags have all been extracted from the K sample multimedia data, dividing the E sample multimedia data respectively corresponding to the N second labeled media tags to obtain a support set and a query set respectively corresponding to the N second labeled media tags;

determining the support sets corresponding to the N second labeled media tags as the target support set; the total number of sample multimedia data in the support sets corresponding to the N second labeled media tags is S;

determining the query sets corresponding to the N second labeled media tags as the target query set; and the total number of sample multimedia data in the query sets corresponding to the N second labeled media tags is T.
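As an illustrative, non-limiting sketch, the steps above can be read as sampling one training episode: pick N tags, draw E samples per tag, and split each group into a support part and a query part. Random sampling, the fixed per-tag support size, and the names `build_episode`, `n_support`, and `seed` are assumptions introduced for the example.

```python
import random

def build_episode(samples, n_tags, per_tag, n_support, seed=None):
    """samples: list of (sample_id, tag) pairs. Returns (support, query) lists."""
    rng = random.Random(seed)
    by_tag = {}
    for sample_id, tag in samples:
        by_tag.setdefault(tag, []).append(sample_id)
    # Only tags with at least `per_tag` (E) samples can be drawn from.
    eligible = [tag for tag, ids in by_tag.items() if len(ids) >= per_tag]
    chosen_tags = rng.sample(eligible, n_tags)
    support, query = [], []
    for tag in chosen_tags:
        picked = rng.sample(by_tag[tag], per_tag)
        # Split the E samples of this tag into its support set and query set.
        support += [(s, tag) for s in picked[:n_support]]
        query += [(s, tag) for s in picked[n_support:]]
    return support, query

# Hypothetical second sample set: K = 40 samples spread over 4 tags.
second_sample_set = [(i, f"tag{i % 4}") for i in range(40)]
support, query = build_episode(second_sample_set, n_tags=3, per_tag=5, n_support=2, seed=0)
```

With these hypothetical settings, S is 3 × 2 = 6 and T is 3 × 3 = 9, and no sample appears in both sets.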

Optionally, the target query set includes query sets corresponding to N second labeled media tags, and the adjusting module determines the media tag identification error of the candidate media identification model according to the target query set, including:

performing, by using the candidate media identification model, tag identification on sample multimedia data F_t in the query set corresponding to second labeled media tag N_r, to obtain tag prediction information about the sample multimedia data F_t; the second labeled media tag N_r belongs to the N second labeled media tags; r is a positive integer less than or equal to N; the tag prediction information includes a first probability that the second predicted media tag of the sample multimedia data F_t is the second labeled media tag N_r, and a second probability that the second predicted media tag of the sample multimedia data F_t is second labeled media tag N_u; the second labeled media tag N_u is any one of the N second labeled media tags other than the second labeled media tag N_r, u is an integer less than or equal to N, and r is different from u;

determining, according to the first probability and the second probability, a sample prediction error of the second labeled media tag N_r on the sample multimedia data F_t;

determining, according to the sample prediction error of the second labeled media tag N_r on the sample multimedia data F_t, a total sample prediction error of the second labeled media tag N_r on its corresponding query set;

and if the total sample prediction errors of the N second labeled media tags on their respectively corresponding query sets have all been obtained, determining the media tag identification error of the candidate media identification model according to the total sample prediction errors of the N second labeled media tags.

Optionally, the adjusting module performs, by using the candidate media identification model, tag identification on the sample multimedia data F_t in the query set corresponding to the second labeled media tag N_r, to obtain the tag prediction information about the sample multimedia data F_t, including:

determining a tag average feature corresponding to the second labeled media tag N_r;

invoking the candidate media identification model to extract media feature information corresponding to the sample multimedia data F_t;

calculating, according to the tag average feature corresponding to the second labeled media tag N_r and the media feature information corresponding to the sample multimedia data F_t, a first probability that the second predicted media tag corresponding to the sample multimedia data F_t is the second labeled media tag N_r;

determining a tag average feature corresponding to the second labeled media tag N_u;

invoking the candidate media identification model to calculate, according to the tag average feature corresponding to the second labeled media tag N_u and the media feature information corresponding to the sample multimedia data F_t, a second probability that the second predicted media tag corresponding to the sample multimedia data F_t is the second labeled media tag N_u.
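As an illustrative, non-limiting sketch, the tag average feature can be taken as the element-wise mean of the feature vectors of a tag's samples, and the probabilities can be obtained by normalizing a similarity between the query feature and each tag average feature. Computing the averages from support-set features, using squared Euclidean distance, and applying a softmax are all assumptions in the style of prototype-based few-shot classification; the text does not fix these choices. All names and values are hypothetical.

```python
import math

def label_average_feature(features):
    """Element-wise mean of the feature vectors belonging to one tag."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]

def tag_probabilities(query_feature, prototypes):
    """Softmax over negative squared Euclidean distances to each tag average feature."""
    scores = {
        tag: -sum((q - p) ** 2 for q, p in zip(query_feature, proto))
        for tag, proto in prototypes.items()
    }
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {tag: math.exp(score - peak) for tag, score in scores.items()}
    norm = sum(exps.values())
    return {tag: value / norm for tag, value in exps.items()}

# Hypothetical support-set features for two tags, N_r and N_u.
support_features = {
    "N_r": [[0.0, 0.0], [2.0, 0.0]],
    "N_u": [[4.0, 4.0], [6.0, 4.0]],
}
prototypes = {tag: label_average_feature(f) for tag, f in support_features.items()}
probs = tag_probabilities([1.0, 1.0], prototypes)  # feature of query sample F_t
first_probability, second_probability = probs["N_r"], probs["N_u"]
```

In this sketch a query sample that lies closer to the average feature of N_r than to that of N_u receives a larger first probability than second probability.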

Optionally, the apparatus further comprises a construction module.

the P second labeled media tags include the second labeled media tags respectively corresponding to the K sample multimedia data;

retrieving, from a multimedia data set and according to the media feature information of the reference multimedia data corresponding to second labeled media tag P_c, multimedia data matching the second labeled media tag P_c as candidate multimedia data corresponding to the second labeled media tag P_c;

and if the candidate multimedia data respectively corresponding to the P second labeled media tags have all been obtained, constructing the second sample set according to the candidate multimedia data respectively corresponding to the P second labeled media tags.

Optionally, the construction module retrieves, from the multimedia data set and according to the media feature information of the reference multimedia data corresponding to the second labeled media tag P_c, the multimedia data matching the second labeled media tag P_c, including:

acquiring media feature information of each multimedia data in the multimedia data set;

determining the media distance between the media feature information of the reference multimedia data corresponding to the second labeled media tag P_c and the media feature information of each multimedia data in the multimedia data set, and retrieving, from the multimedia data set according to the determined media distances, the multimedia data matching the second labeled media tag P_c.

Optionally, the media feature information includes a first modal feature and a second modal feature; the construction module determines the media distance between the media feature information of the reference multimedia data corresponding to the second labeled media tag P_c and the media feature information of each multimedia data in the multimedia data set, and retrieves, from the multimedia data set according to the determined media distances, the multimedia data matching the second labeled media tag P_c, including:

determining the distance between the first modal feature of the reference multimedia data corresponding to the second labeled media tag P_c and the first modal feature of each multimedia data in the multimedia data set as a first media distance;

determining the distance between the second modal feature of the reference multimedia data corresponding to the second labeled media tag P_c and the second modal feature of each multimedia data in the multimedia data set as a second media distance;

retrieving multimedia data with a first media distance smaller than a first distance threshold value from the multimedia data set as a first multimedia data subset;

retrieving multimedia data with a second media distance smaller than a second distance threshold value from the multimedia data set as a second multimedia data subset;

determining, from the first multimedia data subset and the second multimedia data subset, the multimedia data matching the second labeled media tag P_c.
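As an illustrative, non-limiting sketch, the retrieval can be expressed as two threshold filters, one per modality, whose results are then combined. The text does not state how the two subsets are combined; the `combine` parameter (intersection or union), the Euclidean metric, and all names and values below are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def retrieve_matches(reference, dataset, first_threshold, second_threshold,
                     combine="intersection"):
    """Return ids of items whose modal features lie close to the reference item."""
    # First multimedia data subset: first media distance below the first threshold.
    first_subset = {item_id for item_id, item in dataset.items()
                    if euclidean(reference["first"], item["first"]) < first_threshold}
    # Second multimedia data subset: second media distance below the second threshold.
    second_subset = {item_id for item_id, item in dataset.items()
                     if euclidean(reference["second"], item["second"]) < second_threshold}
    if combine == "intersection":
        return first_subset & second_subset
    return first_subset | second_subset

# Hypothetical reference item for tag P_c and a three-item multimedia data set.
reference = {"first": [0.0, 0.0], "second": [1.0, 1.0]}
dataset = {
    "a": {"first": [0.0, 1.0], "second": [1.0, 1.0]},
    "b": {"first": [3.0, 4.0], "second": [1.0, 2.0]},
    "c": {"first": [0.5, 0.0], "second": [4.0, 5.0]},
}
strict = retrieve_matches(reference, dataset, 2.0, 2.0)
loose = retrieve_matches(reference, dataset, 2.0, 2.0, combine="union")
```

Taking the intersection favors precision (both modalities must agree), while taking the union favors recall; which of the two suits the second sample set depends on how noisy the candidate data may be.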

As can be seen, the media data processing device may obtain a first sample set, where the first sample set includes M sample multimedia data, and a first labeled media tag and a labeled media category corresponding to the M sample multimedia data, respectively; the media data processing device can predict and obtain first predicted media labels corresponding to the M sample multimedia data by using the initial media identification model, and predict and obtain predicted media categories corresponding to the M sample multimedia data by using the initial media identification model; furthermore, the media data processing device can determine a media prediction error of the initial media identification model according to a first labeled media tag, a labeled media category, a first predicted media tag and a predicted media category corresponding to the M sample multimedia data respectively; in addition, the media data processing device can also determine the feature extraction error of the initial media identification model according to the media feature information corresponding to the M sample multimedia data respectively; furthermore, the media data processing device can perform first adjustment on the initial media recognition model according to the media prediction error of the initial media recognition model and the feature extraction error of the initial media recognition model to obtain a target media recognition model, and the initial media recognition model is trained by utilizing the first sample set in a multi-task learning manner, so that the feature expression capability of the model is improved, the generalization capability of the model is also improved, and the prediction accuracy of the media recognition model for multimedia data can be effectively improved.

Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in Fig. 5, the computer device 1000 may include a processor 1001, a network interface 1004, and a memory 1005; the computer device 1000 may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory (e.g., at least one disk memory). Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in Fig. 5, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.

In the computer device 1000 shown in Fig. 5, the network interface 1004 may provide network communication functions; the user interface 1003 mainly provides an interface for user input; and the processor 1001 may be used to invoke the device control application program stored in the memory 1005 to implement:

acquiring a first sample set, wherein the first sample set includes M sample multimedia data, and first labeled media tags and labeled media categories respectively corresponding to the M sample multimedia data; M is a positive integer;

respectively performing label prediction on the M sample multimedia data by using the initial media identification model based on the media characteristic information respectively corresponding to the M sample multimedia data to obtain first predicted media labels respectively corresponding to the M sample multimedia data, and performing category prediction on the M sample multimedia data by using the initial media identification model to obtain predicted media categories respectively corresponding to the M sample multimedia data;

In the application, a computer device may obtain a first sample set, where the first sample set includes M sample multimedia data, and a first labeled media tag and a labeled media category corresponding to the M sample multimedia data, respectively; the computer equipment can utilize the initial media identification model to predict to obtain first predicted media labels corresponding to the M sample multimedia data respectively, and utilizes the initial media identification model to predict to obtain predicted media categories corresponding to the M sample multimedia data respectively; furthermore, the computer device may determine a media prediction error of the initial media identification model according to a first labeled media tag, a labeled media category, a first predicted media tag, and a predicted media category corresponding to the M sample multimedia data, respectively; in addition, the computer equipment can also determine the characteristic extraction error of the initial media identification model according to the media characteristic information respectively corresponding to the M sample multimedia data; furthermore, the computer equipment can perform first adjustment on the initial media recognition model according to the media prediction error of the initial media recognition model and the feature extraction error of the initial media recognition model to obtain a target media recognition model, and the initial media recognition model is trained by utilizing the first sample set in a multi-task learning mode in the process, so that the feature expression capability of the model is improved, the generalization capability of the model is also improved, and the prediction accuracy of the media recognition model for multimedia data can be effectively improved.

It should be understood that the computer device 1000 described in this embodiment of the present application may perform the media data processing method described in the foregoing embodiment corresponding to Fig. 2, and may also implement the media data processing apparatus described in the foregoing embodiment corresponding to Fig. 4; details are not repeated here. Likewise, the beneficial effects of the same method are not described again.

It should further be noted that an embodiment of the present application also provides a computer-readable storage medium, which stores the computer program executed by the aforementioned media data processing apparatus. The computer program includes program instructions; when a processor executes the program instructions, the media data processing method described in the embodiment corresponding to Fig. 2 can be performed, so details are not repeated here. Likewise, the beneficial effects of the same method are not described again. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, refer to the description of the method embodiments of the present application.

As an example, the program instructions described above may be executed on one computer device, or on at least two computer devices distributed over at least two sites and interconnected by a communication network, or the at least two computer devices distributed over at least two sites and interconnected by a communication network may constitute a blockchain network.

The computer-readable storage medium may be an internal storage unit of the media data processing apparatus provided in any of the foregoing embodiments or of the computer device, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.

The terms "first," "second," and the like in the description, claims, and drawings of the embodiments of the present application are used for distinguishing between different media items and not for describing a particular order. Furthermore, the term "comprises" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, apparatus, product, or device that comprises a list of steps or modules is not limited to the listed steps or modules, but may optionally include other steps or modules that are not listed or that are inherent to such process, method, apparatus, product, or device.

An embodiment of the present application further provides a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the description of the media data processing method in the embodiment corresponding to fig. 2 is implemented, and therefore, details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer program product referred to in the present application, reference is made to the description of the method embodiments of the present application.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and specifically, each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flows and/or blocks in the flowchart and/or the block diagram, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable network connection device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable network connection device, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable network connection device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable network connection device to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.

The above disclosure describes only preferred embodiments of the present application and should not be taken as limiting the scope of the claims of the present application; therefore, equivalent variations made in accordance with the claims of the present application still fall within the scope of the present application.

Claims

1. A method for media data processing, comprising:

2. The method of claim 1, wherein the determining the media prediction error of the initial media identification model according to the first annotated media tag, the annotated media category, the first predicted media tag, and the predicted media category respectively corresponding to the M sample multimedia data comprises:

3. The method of claim 2, wherein the media characteristic information comprises a first modal characteristic and a second modal characteristic;

determining a feature extraction error of the initial media identification model according to media feature information corresponding to the M sample multimedia data respectively, including:

obtaining a second distance between the first modal feature of the sample multimedia data M_a and sample multimedia data M_b, and determining a third distance between the sample multimedia data M_a and the first modal feature of the sample multimedia data M_b; the sample multimedia data M_b is any one of the M sample multimedia data other than the sample multimedia data M_a, b is a positive integer less than or equal to M, and a is different from b;

4. The method of claim 1, wherein the first adjusting the initial media identification model according to the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model to obtain the target media identification model comprises:

determining a total media identification error of the initial media identification model according to the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model;

5. The method of claim 4, wherein the media prediction error of the initial media identification model comprises media tag prediction errors and media category prediction errors respectively corresponding to the M sample multimedia data; the feature extraction error of the initial media identification model comprises the feature extraction errors of the initial media identification model with respect to each of the M sample multimedia data;

the determining the total media identification error of the initial media identification model according to the media prediction error of the initial media identification model and the feature extraction error of the initial media identification model comprises:

weighting the media tag prediction error, the media category prediction error and the feature extraction error corresponding to sample multimedia data M_a to obtain a media identification error of the sample multimedia data M_a; the sample multimedia data M_a belongs to the M sample multimedia data, and a is a positive integer less than or equal to M;
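The weighting step of claim 5 can be sketched as a weighted sum of the three per-sample errors. The function name, the use of a plain weighted sum, and the default weights are assumptions for illustration; the claim only states that the three errors are weighted.

```python
from typing import Sequence


def media_identification_error(
    tag_error: float,
    category_error: float,
    feature_error: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three errors for one sample M_a.

    The weights are hypothetical hyperparameters, not values from the source.
    """
    w_tag, w_cat, w_feat = weights
    return w_tag * tag_error + w_cat * category_error + w_feat * feature_error
```

For example, `media_identification_error(0.5, 0.2, 0.1, (1.0, 2.0, 3.0))` combines the errors as 0.5 + 0.4 + 0.3.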

6. The method of claim 4, wherein the determining the target media identification model according to the adjusted media identification model comprises:

obtaining S sample multimedia data from the K sample multimedia data as a target support set, and obtaining T sample multimedia data from the K sample multimedia data as a target query set; S and T are positive integers smaller than K, and the target support set is different from the target query set;

determining a media tag identification error of the candidate media identification model according to the target query set;

and performing a second adjustment on the candidate media identification model by using the media tag identification error of the candidate media identification model to obtain the target media identification model.
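The support/query selection in claim 6 can be sketched as follows. Drawing the two sets uniformly at random and making them fully disjoint are assumptions for illustration (the claim only requires that the two sets differ); the function name and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


def split_support_query(
    samples: Sequence[Sample],
    s: int,
    t: int,
    seed: Optional[int] = None,
) -> Tuple[List[Sample], List[Sample]]:
    """Draw S support samples and T query samples from K samples, no overlap."""
    k = len(samples)
    if not (0 < s < k and 0 < t < k):
        raise ValueError("S and T must be positive integers smaller than K")
    if s + t > k:
        raise ValueError("disjoint sets require S + T <= K")
    rng = random.Random(seed)
    picked = rng.sample(range(k), s + t)  # s + t distinct indices
    support = [samples[i] for i in picked[:s]]
    query = [samples[i] for i in picked[s:]]
    return support, query
```

Sampling indices rather than items keeps the two sets disjoint even when the sample list contains equal items.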

7. The method of claim 6, wherein the obtaining S sample multimedia data from the K sample multimedia data as a target support set and T sample multimedia data from the K sample multimedia data as a target query set comprises:

extracting, from the K sample multimedia data, E sample multimedia data corresponding to a second annotated media tag N_r; the second annotated media tag N_r belongs to N second annotated media tags; r is a positive integer less than or equal to N, and E is a positive integer less than K;

8. The method of claim 6, wherein the target query set comprises query sets respectively corresponding to N second annotated media tags, and the determining the media tag identification error of the candidate media identification model according to the target query set comprises:

performing, by using the candidate media identification model, tag identification on sample multimedia data F_t in the query set corresponding to a second annotated media tag N_r, to obtain tag prediction information of the sample multimedia data F_t; the second annotated media tag N_r belongs to the N second annotated media tags; r is a positive integer less than or equal to N; the tag prediction information comprises a first probability that the second predicted media tag of the sample multimedia data F_t is the second annotated media tag N_r, and a second probability that the second predicted media tag of the sample multimedia data F_t is a second annotated media tag N_u; the second annotated media tag N_u is any one of the N second annotated media tags other than the second annotated media tag N_r, u is an integer less than or equal to N, and r is different from u;

determining, according to the sample prediction error of the second annotated media tag N_r on the sample multimedia data F_t, a total sample prediction error of the second annotated media tag N_r on the corresponding query set;

9. The method of claim 8, wherein the performing, by using the candidate media identification model, tag identification on the sample multimedia data F_t in the query set corresponding to the second annotated media tag N_r, to obtain the tag prediction information of the sample multimedia data F_t comprises:

calculating, according to the tag average characteristic corresponding to the second annotated media tag N_r and the media characteristic information of the sample multimedia data F_t, the first probability that the second predicted media tag corresponding to the sample multimedia data F_t is the second annotated media tag N_r;

calling the candidate media identification model to calculate, according to the tag average characteristic corresponding to the second annotated media tag N_u and the media characteristic information of the sample multimedia data F_t, the second probability that the second predicted media tag corresponding to the sample multimedia data F_t is the second annotated media tag N_u.
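One common way to turn tag average characteristics into the probabilities recited in claim 9 is a prototype-style classifier: a softmax over negative squared distances between the query sample's features and each tag's average. The sketch below follows that scheme. The softmax form, the squared Euclidean distance, computing the average from a tag's support samples, and all function names are assumptions for illustration; the claims do not state the probability formula.

```python
import math
from typing import Dict, List


def tag_average_feature(features: List[List[float]]) -> List[float]:
    """Element-wise mean of the feature vectors of one tag's samples."""
    if not features:
        raise ValueError("at least one feature vector is required")
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def tag_probabilities(
    query_feature: List[float],
    tag_averages: Dict[str, List[float]],
) -> Dict[str, float]:
    """Softmax over negative squared Euclidean distances to each tag average."""
    scores = {
        tag: -sum((q - c) ** 2 for q, c in zip(query_feature, average))
        for tag, average in tag_averages.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {tag: math.exp(score - top) for tag, score in scores.items()}
    total = sum(exps.values())
    return {tag: value / total for tag, value in exps.items()}
```

Under this sketch, the entry for N_r plays the role of the first probability and each entry for a different tag N_u plays the role of a second probability; the probabilities over all N tags sum to 1.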

10. The method of claim 6, further comprising:

acquiring reference multimedia data respectively corresponding to P second annotated media tags; the P second annotated media tags comprise the second annotated media tags respectively corresponding to the K sample multimedia data;

and if the candidate multimedia data respectively corresponding to the P second annotated media tags are obtained, constructing the second sample set according to the candidate multimedia data respectively corresponding to the P second annotated media tags.

11. The method of claim 10, wherein the retrieving, from the multimedia data set according to the media characteristic information of the reference multimedia data corresponding to the second annotated media tag P_c, multimedia data matching the second annotated media tag P_c comprises:

determining media distances between the media characteristic information of the reference multimedia data corresponding to the second annotated media tag P_c and the media characteristic information of each multimedia data in the multimedia data set, and retrieving multimedia data from the multimedia data set according to the determined media distances as the multimedia data matching the second annotated media tag P_c.
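The distance-based retrieval of claim 11 can be sketched as a nearest-neighbour search over the data set. The Euclidean metric, the `top_k` cut-off, and the function name are assumptions for illustration; the claim only states that retrieval is performed according to the determined media distances.

```python
import math
from typing import List, Tuple


def retrieve_matching(
    reference_feature: List[float],
    dataset_features: List[List[float]],
    top_k: int = 1,
) -> List[Tuple[int, float]]:
    """Return (index, distance) for the top_k items closest to the reference."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    distances = [
        (index, math.dist(reference_feature, feature))
        for index, feature in enumerate(dataset_features)
    ]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return distances[:top_k]
```

A distance threshold could be used instead of a fixed `top_k`; which rule applies is not specified in this portion of the claims.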

12. A media data processing apparatus, comprising:

13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 11 when executing the computer program.

14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 11.

15. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 11 when executed by a processor.