CN115953645A

CN115953645A - Model training method and device, electronic equipment and storage medium

Info

Publication number: CN115953645A
Application number: CN202211617637.6A
Authority: CN
Inventors: 崔东林
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-12-15
Filing date: 2022-12-15
Publication date: 2023-04-11

Abstract

The disclosure provides a model training method and device, electronic equipment, and a storage medium, and relates to the field of artificial intelligence, in particular to the technical fields of neural networks, big data and the like. The specific implementation scheme is as follows: inputting video samples into a multi-modal feature extraction model to be trained to obtain video features and text features; constructing a positive sample from the video features and the text features of the same video sample, and constructing a negative sample from the video features and the text features of different video samples; training the multi-modal feature extraction model based on the positive and negative samples; extracting video features and text features with the trained multi-modal feature extraction model; and fine-tuning the network model of the target task with the fusion features of the text features and the video features. According to the embodiments of the disclosure, the positive and negative samples are labeled automatically, so the multi-modal feature extraction model can be trained on massive data, and training for a downstream target task is completed by fine-tuning only the network model of that task, which improves training efficiency and saves resources.

Description

Model training method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence, and more particularly to the technical fields of neural networks, big data, and the like.

Background

With the rapid development of internet technology, massive video resources are uploaded to the network, which greatly increases the difficulty of recommendation and search. To meet and improve the search experience of different users, higher-quality resources need to be provided to them. Whether massive video resources can be understood efficiently and accurately directly affects the accuracy of recommendation and search strategies, and ultimately affects the viewing experience and the retention rate of users.

The industry currently relies mainly on supervised deep learning to train models on small batches of data. Each model solves only one video problem from one dimension or direction, and the same procedure must be repeated for every other video problem, which makes training inefficient and repeatedly wastes processing resources.

Disclosure of Invention

The disclosure provides a model training method, a model training device, an electronic device and a storage medium.

According to an aspect of the present disclosure, there is provided a model training method, including:

respectively inputting a plurality of video samples into a multi-modal feature extraction model to be trained to obtain video features and text features of the video samples;

constructing a positive sample from the video features and the text features of the same video sample, and constructing a negative sample from the video features and the text features of different video samples, to obtain a sample set;

based on the sample set, adjusting model parameters of the multi-modal feature extraction model to be trained, and obtaining the trained multi-modal feature extraction model under the condition of meeting a training convergence condition;

aiming at a target task, extracting video features and text features of a training sample of the target task by adopting a trained multi-modal feature extraction model;

performing fusion processing on the text features of the training samples and the video features of the training samples to obtain fusion features;

and training a network model of the target task based on the fusion features.

According to another aspect of the present disclosure, there is provided a model training apparatus including:

the input module is used for respectively inputting the plurality of video samples into a multi-modal feature extraction model to be trained to obtain video features and text features of the video samples;

the construction module is used for constructing a positive sample from the video features and the text features of the same video sample, and constructing a negative sample from the video features and the text features of different video samples, to obtain a sample set;

the adjusting module is used for adjusting model parameters of the multi-modal feature extraction model to be trained based on the sample set, and obtaining the trained multi-modal feature extraction model under the condition that a training convergence condition is met;

the extraction module is used for extracting, for a target task, the video features and the text features of the training samples of the target task by adopting the trained multi-modal feature extraction model;

the fusion module is used for carrying out fusion processing on the text features of the training samples and the video features of the training samples to obtain fusion features;

and the training module is used for training the network model of the target task based on the fusion features.

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any of the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method according to any one of the embodiments of the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.

In the embodiment of the disclosure, the positive and negative samples are labeled automatically, so the method supports training the multi-modal feature extraction model on massive data, and training for a downstream target task is completed by fine-tuning only the network model of that task, thereby improving training efficiency and saving resources.

It should be understood that the above summary is for illustrative purposes only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.

Drawings

In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope. Wherein:

FIG. 1 is a flow diagram of a model training method according to an embodiment of the present disclosure;

FIG. 2 is a flow diagram of extracting video features in a model training method according to another embodiment of the present disclosure;

FIG. 3 is a flow chart of text feature extraction in a model training method according to another embodiment of the present disclosure;

FIG. 4 is an overall framework diagram of feature extraction in a model training method according to another embodiment of the present disclosure;

FIG. 5 (a) is a schematic flow chart of determining cross-entropy loss in a model training method according to another embodiment of the present disclosure;

FIG. 5 (b) is an example of determining cross-entropy loss in a model training method according to another embodiment of the present disclosure;

FIG. 6 is an overall framework diagram of a model training method according to another embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a model training apparatus according to another embodiment of the present disclosure;

FIG. 8 is a block diagram of an electronic device for implementing a model training method of an embodiment of the present disclosure.

Detailed Description

The following detailed description and technical contents of the present application are described with reference to the drawings, which are provided for reference and illustration purposes only and are not intended to limit the present application.

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The terms "first," "second," and the like in the embodiments of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises" and "comprising," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such as a list of steps or elements. A method, system, article, or apparatus is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, system, article, or apparatus.

In the prior art, a target model is generally trained in a supervised manner on a small amount of labeled data so that it can extract features from a video to be processed. However, the manual labeling cost is too high, and the results of migrating to different downstream tasks are poor. Moreover, each video problem requires repeating the same procedure to train an applicable target model, so solving several video problems at once multiplies the cost, which lowers training efficiency and repeatedly wastes processing resources. In view of this, if the annotation cost can be reduced and highly expressive video features suitable for different video problems can be extracted, processing resources can be saved and training efficiency improved while the training effect is maintained.

Based on this technical concept, the model training method extracts the video features and text features of a large number of video samples to construct a sample set, and uses the sample set to train the multi-modal feature extraction model in a self-supervised manner. The trained multi-modal feature extraction model can be applied to video feature extraction for different target tasks. Therefore, for different video problems, the multi-modal feature extraction model does not need to be trained repeatedly; only the model of the downstream task is fine-tuned. As shown in fig. 1, the model training method in the embodiment of the present disclosure includes the following steps:

S101, respectively inputting the plurality of video samples into a multi-modal feature extraction model to be trained to obtain video features and text features of the video samples.

The multi-modal feature extraction model to be trained can be any model with the capability of extracting video features and text features of video samples.

S102, constructing a positive sample from the video features and the text features of the same video sample, and constructing a negative sample from the video features and the text features of different video samples, to obtain a sample set.

The video features and text features obtained in step S101 are divided into pairs that come from the same video sample and pairs that come from different video samples. The former are used to construct positive samples and the latter to construct negative samples, and both are used to adjust the model parameters of the multi-modal feature extraction model to be trained.
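The pairing rule in step S102 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the disclosure: the function name `build_sample_set`, the dictionary keys, and the toy feature values are all assumptions introduced here.

```python
from itertools import product


def build_sample_set(video_feats, text_feats):
    """Pair every video feature with every text feature.

    A pair is labeled 1 (positive) when both features come from the same
    video sample and 0 (negative) otherwise, so no manual labeling is needed.
    """
    if len(video_feats) != len(text_feats):
        raise ValueError("expected one video feature and one text feature per sample")
    n = len(video_feats)
    samples = []
    for i, j in product(range(n), repeat=2):
        samples.append({
            "video_index": i,
            "text_index": j,
            "video_feature": video_feats[i],
            "text_feature": text_feats[j],
            "label": 1 if i == j else 0,
        })
    return samples


# Toy features for three video samples (values are made up for illustration).
video_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]
sample_set = build_sample_set(video_feats, text_feats)
```

With N video samples this yields N positive pairs and N × (N - 1) negative pairs, which is why the labels come for free from the data itself.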

S103, based on the sample set, adjusting model parameters of the multi-modal feature extraction model to be trained, and obtaining the trained multi-modal feature extraction model under the condition that a training convergence condition is met.

Based on the positive and negative samples in the sample set, in the process of adjusting the model parameters of the multi-modal feature extraction model to be trained, the corresponding relation between the video features and the text features in the positive sample and the corresponding relation between the video features and the text features in the negative sample can be respectively determined. And based on the determined corresponding relation, adjusting the model parameters of the multi-modal feature extraction model to be trained again to obtain the trained multi-modal feature extraction model. Finally, when the trained multi-modal feature extraction model is used for extracting the video features and the text features of the video sample, the video features and the text features which can more accurately express the video sample can be obtained.

S104, for a target task, extracting the video features and the text features of the training samples of the target task by adopting the trained multi-modal feature extraction model.

The target task may include a video classification task, a video retrieval task, and other specific downstream tasks. It may also include a downstream task that generates video description information. The embodiments of the present disclosure do not specifically limit the target task.

S105, carrying out fusion processing on the text features of the training samples and the video features of the training samples to obtain fusion features.

S106, training a network model of the target task based on the fusion features.

In the embodiment of the present disclosure, the video features and the text features of the same video sample are used as the multi-modal features of the video sample. The multi-modal features describe the video sample more fully and are therefore more expressive. Constructing positive samples from the video features and text features of the same video sample, and negative samples from those of different video samples, enables self-supervised training of the multi-modal feature extraction model. The model can be trained without any annotation work, so self-supervised training on massive data becomes possible. The trained multi-modal feature extraction model therefore learns its feature extraction capability from massive unlabeled data and can extract features suitable for different video problems. Furthermore, for any downstream target task, only the network model parameters of that task need to be trained and fine-tuned in a supervised manner. By combining self-supervised and supervised training, the embodiment of the disclosure solves multiple video problems more efficiently. For different video problems, only the corresponding network model needs to be adjusted, and the multi-modal feature extraction model does not need to be trained repeatedly, which improves the training efficiency of the model and saves processing resources.

In some embodiments, in order to better describe a video, in the embodiments of the present disclosure, a text feature of a training sample and a video feature of the training sample need to be fused to obtain a fused feature. The mode of the fusion process may be implemented as:

Fusion mode 1), computing a weighted average of the text features of the training samples and the video features of the training samples to obtain the fusion features.

In the embodiment of the disclosure, compared with describing the video sample by the video features and the text features separately, the weighted average reduces the feature dimension and makes the features easier for the downstream task to use.

Fusion mode 2), splicing (concatenating) the text features of the training samples and the video features of the training samples to obtain the fusion features.

In implementation, the Concat splicing function may be used to splice the text features of the training samples and the video features of the training samples. Of course, other splicing layers may be selected as needed and are all suitable for the embodiments of the present disclosure.

In the embodiment of the disclosure, the text features of the training samples and the video features of the training samples are fused, and the fusion features are taken as the features of the video samples, instead of taking only the text features or only the video features. The features finally determined for a video sample thus fit its real characteristics more closely, which benefits the training of the multi-modal feature extraction model and helps the downstream target task achieve its goal based on the fusion features.
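The two fusion modes can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names and the default weight of 0.5 are hypothetical, and the weighted average assumes that both features already have the same dimension.

```python
def fuse_weighted_average(text_feat, video_feat, text_weight=0.5):
    """Fusion mode 1: element-wise weighted average (keeps the dimension)."""
    if len(text_feat) != len(video_feat):
        raise ValueError("weighted averaging needs features of equal dimension")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    video_weight = 1.0 - text_weight
    return [text_weight * t + video_weight * v for t, v in zip(text_feat, video_feat)]


def fuse_concat(text_feat, video_feat):
    """Fusion mode 2: splice the two features end to end (dimensions add up)."""
    return list(text_feat) + list(video_feat)


# Toy features (values are made up for illustration).
text_feat = [0.2, 0.4, 0.6]
video_feat = [0.6, 0.0, 0.2]
averaged = fuse_weighted_average(text_feat, video_feat, text_weight=0.5)
spliced = fuse_concat(text_feat, video_feat)
```

The trade-off is visible in the output sizes: averaging keeps three dimensions, while splicing yields six and preserves both inputs unchanged.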

In some embodiments, training the network model of the target task based on the fused features may be implemented as:

when the target task is a classification task, a Multi-Layer perceptron (MLP) is used as a network model of the target task, and the network model of the target task is trained based on training labels of the classification task.

For example, the video classification task is used to classify videos into children, sports, and science popularization categories, and these categories can be used as training labels to train a network model of the classification task.

And under the condition that the target task comprises a classification task and a regression task, determining the classification layer and the regression layer as a network model of the target task, and training the network model of the target task based on the classification label and the position label of the target task.

The network model of the target task may include a neural network model. There are many variants of neural network models, such as the Back Propagation (BP) neural network, the probabilistic neural network, the Convolutional Neural Network (CNN, suited to image recognition), the Long Short-Term Memory network (LSTM, a recurrent network suited to speech recognition), and so on. The basic structure of the multi-layer perceptron (MLP) is derived from a biological neuron model. The most typical MLP has three layers: an input layer, a hidden layer, and an output layer. Adjacent layers are fully connected, that is, every neuron in one layer is connected to all neurons in the next layer.
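As an illustration of the three-layer structure, the following sketch runs a forward pass of a small MLP classifier over a fused feature. It is not the network of the disclosure: the layer sizes, the random initialization, the ReLU hidden activation, the softmax output, and all function names are assumptions made for the example, and the weights are untrained.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu_vec(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights (one row per output neuron) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_forward(fused_feature, hidden_layer, output_layer):
    hidden = relu_vec(dense(fused_feature, *hidden_layer))
    return softmax(dense(hidden, *output_layer))


rng = random.Random(0)
labels = ["children", "sports", "science popularization"]
hidden_layer = init_layer(n_in=4, n_out=8, rng=rng)
output_layer = init_layer(n_in=8, n_out=len(labels), rng=rng)
probs = mlp_forward([0.4, 0.2, 0.4, 0.1], hidden_layer, output_layer)
predicted = labels[probs.index(max(probs))]
```

In practice the weights would be learned from the training labels of the classification task; here the forward pass only shows how the fused feature flows through the input, hidden, and output layers.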

The neural network model has three basic elements: weights, biases, and activation functions. A weight represents the strength of the connection between two neurons, and its magnitude indicates how strongly one neuron influences the other. The bias is an important model parameter that helps classify samples correctly; it ensures that the output computed from the inputs is not activated arbitrarily. The activation function performs a non-linear mapping and, for functions such as Sigmoid and tanh, limits the output amplitude of the neuron to a fixed range, typically (-1, 1) or (0, 1). The most commonly used activation function is the Sigmoid function, which maps any number in (-∞, +∞) into the range (0, 1). Other activation functions include tanh and ReLU. tanh is a rescaled variant of the Sigmoid function whose output is centered on 0, and it often works better than Sigmoid in practice. ReLU is a more recently popular activation function: its output is 0 when the input is less than 0 and equals the input when the input is greater than 0.
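The three activation functions mentioned above can be written out directly. The sketch below uses only the Python standard library; the function names are chosen for this example. It also makes the relationship between tanh and Sigmoid concrete through the identity tanh(x) = 2·sigmoid(2x) - 1.

```python
import math


def sigmoid(x):
    """Map any real number into (0, 1); written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh_via_sigmoid(x):
    """tanh expressed as a rescaled, zero-centered sigmoid with range (-1, 1)."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    """Output 0 for negative input and the input itself otherwise."""
    return x if x > 0 else 0.0


values = [-2.0, 0.0, 2.0]
sigmoid_outputs = [sigmoid(v) for v in values]
tanh_outputs = [tanh_via_sigmoid(v) for v in values]
relu_outputs = [relu(v) for v in values]
```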

In the embodiment of the disclosure, a suitable network model may be constructed based on the classification task, and the network model of the target task may be trained based on the training labels or the neural network model of the target task may be trained based on the classification labels and the position labels of the target task.

In the embodiment of the disclosure, the MLP has strong adaptive learning capability and can handle complex multi-input and multi-output nonlinear problems. The network model is adopted to train the target task based on the fusion characteristics, so that the multi-modal characteristic extraction model can be better applied to the characteristic extraction of the downstream task, and the characteristic extraction quality is improved.

In some embodiments, the multi-modal feature extraction model to be trained comprises a picture encoder and a video encoder. For each video sample, the video features can be extracted in the manner shown in fig. 2, which includes the following contents:

S201, down-sampling the video of the video sample to obtain a frame sequence.

For down-sampling, the image frames are extracted from the video of the video sample at intervals of fixed duration or fixed frame number, and a frame sequence consisting of a series of image frames is obtained.
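Fixed-interval down-sampling amounts to choosing which frame indices to keep. The helper below is a hypothetical sketch (its name and parameters are not from the disclosure) that supports both options described above: a fixed duration, converted to a frame step through the frame rate, or a fixed number of frames.

```python
def sample_frame_indices(total_frames, fps, interval_seconds=None, frame_step=None):
    """Return the indices of the frames kept by fixed-interval down-sampling.

    Exactly one of interval_seconds (fixed duration) or frame_step
    (fixed frame count) must be given.
    """
    if (interval_seconds is None) == (frame_step is None):
        raise ValueError("specify exactly one of interval_seconds or frame_step")
    if frame_step is None:
        frame_step = max(1, round(fps * interval_seconds))
    if frame_step < 1:
        raise ValueError("frame_step must be at least 1")
    return list(range(0, total_frames, frame_step))


# A 10-second clip at 25 frames per second, sampled once per second.
by_duration = sample_frame_indices(total_frames=250, fps=25, interval_seconds=1.0)
# The same helper with a fixed step of 3 frames.
by_step = sample_frame_indices(total_frames=10, fps=25, frame_step=3)
```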

S202, extracting picture features of each frame in the frame sequence by adopting a picture encoder.

The picture encoder may include a tool capable of encoding pictures, such as a jpeg_axi picture codec.

S203, inputting the picture characteristics of the frame sequence into a video encoder to obtain the video characteristics of the video sample.

The video encoder may include a tool that obtains video features from picture features and that is composed of a dedicated audio and video compression codec chip, data and alarm input and output channels, a network interface, audio and video interfaces (such as HDMI, VGA and HD-SDI), RS232 serial interface control, protocol interface control, embedded software, and the like.

In the embodiment of the disclosure, the picture features are extracted by the picture encoder, and the picture features of the frame sequence are then input into the video encoder to obtain the video features, so that the extracted video features are more expressive, which benefits the training of the multi-modal feature extraction model.

In some embodiments, the multi-modal feature extraction model to be trained includes a speech recognition model, a text abstract extractor, and a text encoder. As shown in fig. 3, inputting the plurality of video samples into the multi-modal feature extraction model to be trained to obtain the text features of each video sample may be implemented as:

S301, obtaining the audio of each video sample in the plurality of video samples.

S302, the audio of each video sample is respectively input into the voice recognition model, and the audio text of each video sample is obtained.

The speech recognition model may be based on, for example, dynamic time warping, vector quantization, or hidden Markov models. The speech recognition model converts the audio of each video sample into audio text.

S303, respectively inputting the audio text of each video sample into a text abstract extractor to obtain the video abstract text of each video sample.

Based on the acquired audio text, the text abstract extractor extracts an abstract of the audio text, and this abstract is the video abstract text of the video sample. The text abstract extractor can use the traditional approaches of extractive summarization or generative (abstractive) summarization. An extractive summary is a video abstract text produced by extracting key sentences from the audio text and splicing them together. A generative summary restates the important content of the audio text in newly composed language; the whole process is end-to-end, similar to a translation task or a dialogue task.

The video abstract text may also be extracted by neural network models for text summarization, including extractive models, generative models, and compressive models. Extractive models mainly cast the problem as two tasks, sequence labeling and sentence ranking, and include sequence labeling methods, sentence ranking methods, and seq2seq (Sequence-to-Sequence) methods. Generative models are mainly seq2seq-based and Transformer-based models that introduce various kinds of auxiliary information. Compressive models are mainly based on the information bottleneck and may also be regarded as models that mix the extractive and generative approaches.
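To make the extractive approach concrete, the following toy sketch scores each sentence by the average frequency of its words and keeps the top-scoring sentences in their original order. It is a deliberately simple stand-in, not one of the neural models named above, and the function name, the scoring rule, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def extractive_summary(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore the original sentence order
    return " ".join(sentences[i] for i in keep)


# Made-up audio text standing in for the output of a speech recognition model.
audio_text = ("Today we train a model on video data. "
              "The model learns video features and text features. "
              "Thanks for watching.")
summary = extractive_summary(audio_text, max_sentences=1)
```

A real extractor would at least filter stop words and would more likely be a trained model; the sketch only shows the "select key sentences, then splice" shape of the extractive approach.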

S304, respectively inputting the video abstract text of each video sample into a text encoder to obtain the text features of each video sample.

For ease of understanding, fig. 4 is a schematic flow chart of generating a sample set according to an embodiment of the present disclosure. The FFmpeg (Fast Forward MPEG) tool is used to split a video sample into a sequence of video frames at one frame per second and to extract the audio from the video sample. Based on the audio of the video sample, a speech recognition model produces the audio text, a summarization model (such as MatchSum) extracts a text abstract from the audio text, and a text encoder (such as BERT) produces the audio text vector. For the video frames, a picture encoder (such as ViT) converts each frame into a picture vector, and a video encoder (such as a Transformer) then produces the video vector.
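The FFmpeg step can be sketched as two command lines, built here as Python argument lists without being executed. The file names, the 16 kHz mono WAV output, and the helper names are assumptions for illustration; the disclosure only states that frames are taken at one frame per second and that the audio is extracted.

```python
def frame_extraction_cmd(video_path, out_pattern, frames_per_second=1):
    """ffmpeg command that writes frames at a fixed rate using the fps filter."""
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={frames_per_second}", out_pattern]


def audio_extraction_cmd(video_path, audio_path, sample_rate=16000):
    """ffmpeg command that drops the video stream (-vn) and writes mono PCM audio."""
    return ["ffmpeg", "-i", video_path, "-vn", "-acodec", "pcm_s16le",
            "-ar", str(sample_rate), "-ac", "1", audio_path]


# Hypothetical paths; nothing is executed when the commands are built.
frame_cmd = frame_extraction_cmd("sample.mp4", "frames/frame_%04d.jpg")
audio_cmd = audio_extraction_cmd("sample.mp4", "sample.wav")
```

If FFmpeg is installed, each list can be passed to `subprocess.run(cmd, check=True)` to perform the extraction.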

In the embodiment of the disclosure, the voice recognition model is adopted to obtain the audio text of each video sample, and then the text abstract extractor is utilized to process the audio text to obtain the video abstract text of each video sample, so that the text characteristics have more expressive power, the video samples can be better described, and further the training of the multi-modal characteristic extraction model can be accelerated.

In other embodiments, in addition to extracting text features from the audio as shown in fig. 3, text from multiple sources may be used to extract the text features of a video sample. Correspondingly, in this embodiment, the multi-modal feature extraction model to be trained includes a text abstract extractor and a text encoder, and inputting the plurality of video samples into the multi-modal feature extraction model to be trained to obtain the text features of each video sample may be implemented as:

step A1, obtaining text information of each video sample, wherein the text information comprises a text corresponding to an audio of the video sample, a text in a picture of the video sample and text description information of the video sample.

The text description information of the video sample may include text information, video titles, category labels, comment lists, and the like in the video file.

And step A2, respectively inputting the text information of each video sample into a text abstract extractor to obtain the video abstract text of each video sample. And respectively inputting the video abstract text of each video sample into a text encoder to obtain the text characteristics of each video sample.

In some embodiments, textual information may also be extracted for the image frames in each video sample by an Optical Character Recognition (OCR) model. Both audio text and OCR text can be used as video text.

In the embodiment of the disclosure, the text features are drawn from multiple kinds of text related to the video, so they are obtained more comprehensively and describe the video samples well, which in turn accelerates the training of the multi-modal feature extraction model.

The foregoing description introduced how to construct positive and negative samples and how to extract video features and text features. To better train the multi-modal feature extraction model in a self-supervised manner, in the embodiment of the present disclosure, adjusting the model parameters of the multi-modal feature extraction model to be trained based on the sample set to obtain the trained multi-modal feature extraction model can be implemented as shown in fig. 5 (a):

S501, for each sample to be processed in the training sample set, determining the feature similarity between the video features and the text features in the sample to be processed.

When the video features and the text features in the sample to be processed do not have the same dimension, they are first converted to the same dimension, and the feature similarity between them is then calculated. The same dimension may be a preset dimension. In a specific calculation, the video features and the text features in the sample to be processed can be converted into a video vector and a text vector respectively, and the cosine similarity between the two vectors is then calculated; that is, the cosine of the angle between the video vector and the text vector in the vector space measures the similarity between the text features and the video features.
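A minimal sketch of this calculation follows. The conversion to a common dimension is shown as a linear projection, which is an assumption of this example (the disclosure does not specify how the conversion is done); the projection matrices and feature values are made up, and in a real model they would be learned.

```python
import math


def project(feature, matrix):
    """Linear projection; the matrix has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, feature)) for row in matrix]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# A 3-dimensional video feature and a 2-dimensional text feature (toy values).
video_feature = [0.5, 1.0, -0.5]
text_feature = [1.0, 0.2]

# Hypothetical projections into a preset common dimension of 2.
video_projection = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
text_projection = [[1.0, 0.0], [0.0, 1.0]]

video_vector = project(video_feature, video_projection)
text_vector = project(text_feature, text_projection)
similarity = cosine_similarity(video_vector, text_vector)
```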

S502, determining cross entropy loss based on the feature similarity corresponding to each sample to be processed and the sample label of each sample to be processed. Wherein, the sample label is a positive sample or a negative sample.

In some embodiments, the dimension of the label matrix may be determined according to the number of samples to be processed, where the samples to be processed are from the same video label and are 1, and the samples to be processed are from different video labels and are 0. Meanwhile, the feature similarity among the samples to be processed is respectively calculated, and a similarity matrix of the samples to be processed is obtained. And comparing the label matrix with the similarity matrix, and obtaining the cross entropy loss based on the feature similarity corresponding to each sample to be processed and the sample label of each sample to be processed.

The smaller the cross entropy loss, the more accurately the multi-modal feature extraction model extracts the video features. A larger cross entropy loss indicates that the multi-modal feature extraction model still needs to be trained; in this case, the model parameters of the multi-modal feature extraction model to be trained may be adjusted based on the cross entropy loss so as to reduce the cross entropy loss.

For ease of understanding, fig. 5 (b) is a schematic diagram of obtaining the cross entropy loss in the embodiment of the present disclosure. For example, if there are three video samples, namely video 1, video 2, and video 3, a 3 × 3 similarity matrix may be generated, and comparing the similarity matrix with the label matrix yields the cross entropy loss.
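The three-video example can be sketched as follows. The disclosure does not state the exact form of the cross entropy, so this sketch assumes a row-wise softmax over the similarities with a hypothetical `temperature` scaling factor; the feature vectors and all function names are likewise illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_matrix(video_vecs, text_vecs):
    """Entry [i][j] is the similarity between video i and the text of video j."""
    return [[cosine_similarity(v, t) for t in text_vecs] for v in video_vecs]

def label_matrix(n):
    """1 where video and text come from the same video, 0 otherwise."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def cross_entropy(sim, labels, temperature=0.1):
    """Assumed form: mean row-wise softmax cross entropy against the label matrix."""
    total = 0.0
    for sim_row, label_row in zip(sim, labels):
        logits = [s / temperature for s in sim_row]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total -= sum(y * (l - log_norm) for y, l in zip(label_row, logits))
    return total / len(sim)

# Hypothetical features for video 1, video 2, and video 3.
video_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_text = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

labels = label_matrix(3)
aligned_loss = cross_entropy(similarity_matrix(video_vecs, aligned_text), labels)
shuffled_loss = cross_entropy(similarity_matrix(video_vecs, shuffled_text), labels)
```

When each text feature matches the video feature of its own video, the loss is close to zero; when the text features are assigned to the wrong videos, the loss is much larger, which is the signal used to adjust the model parameters.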

Finally, in step S503, when the training convergence condition is met, the training ends and the trained multi-modal feature extraction model is obtained. When the training convergence condition is not met, the process returns to step S101 and executes the step of respectively inputting the plurality of video samples into the multi-modal feature extraction model to be trained to obtain the video features and the text features of each video sample, until the multi-modal feature extraction model converges.

In the embodiment of the disclosure, the cross entropy loss is obtained by calculating the feature similarity between the video features and the text features in the sample to be processed and comparing the feature similarity with the sample label, and the model parameters of the multi-modal feature extraction model to be trained can then be adjusted based on the cross entropy loss. The feature similarity can describe the difference between multi-modal features of the same video, and can also describe the difference between multi-modal features of different videos. Because the cross entropy loss is determined from the feature similarity, the text features and the video features extracted by the multi-modal feature extraction model are not extracted independently of each other; instead, each is learned with reference to the other for the same video, so that the video features and the text features extracted from the same video describe that video better. In addition, the features of different videos can be distinguished, and the feature differences between different videos are better captured. The feature extraction capability of the multi-modal feature extraction model is thereby improved.

To facilitate an understanding of the overall scheme of the present disclosure, reference is made to fig. 6, which is a schematic diagram of the overall flow of model training in the embodiment of the present disclosure. In the upstream model training, the multi-modal feature extraction model is trained in a self-supervised manner; once the multi-modal feature extraction model converges, that is, once the training is completed, the parameters of the multi-modal feature extraction model are fixed. In the training combined with the network model of a downstream target task (for example, an MLP for classification), the multi-modal feature extraction model is applied to different downstream tasks, and the network model of each downstream task is fine-tuned. When the downstream tasks include scenarios such as action classification and video classification, only the newly added head parameters of each downstream task need to be fine-tuned.
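The division between the fixed upstream model and the fine-tuned downstream head can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the disclosure: `frozen_extractor` is a hypothetical stand-in for the trained multi-modal feature extraction model, a single logistic layer stands in for the MLP head, and the samples and hyperparameters are invented for the example.

```python
import math

def frozen_extractor(sample):
    """Stand-in for the trained model: its output is fixed and never updated.
    The text and video features are spliced into one fusion feature."""
    return list(sample["text"]) + list(sample["video"])

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def head_forward(weights, bias, fused):
    """Downstream head: a single logistic layer (simplification of an MLP)."""
    return sigmoid(sum(w * x for w, x in zip(weights, fused)) + bias)

def fine_tune_head(samples, labels, lr=0.5, epochs=200):
    """Only the head parameters (weights, bias) are adjusted."""
    dim = len(frozen_extractor(samples[0]))
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for sample, label in zip(samples, labels):
            fused = frozen_extractor(sample)
            # Gradient of the binary cross entropy with respect to the logit.
            error = head_forward(weights, bias, fused) - label
            weights = [w - lr * error * x for w, x in zip(weights, fused)]
            bias -= lr * error
    return weights, bias

# Hypothetical training samples of a binary classification task.
samples = [
    {"text": [1.0, 0.0], "video": [1.0, 0.0]},
    {"text": [0.0, 1.0], "video": [0.0, 1.0]},
]
labels = [1, 0]

weights, bias = fine_tune_head(samples, labels)
predictions = [head_forward(weights, bias, frozen_extractor(s)) for s in samples]
```

Because the extractor is never modified, the same extracted features could be reused for a different downstream task by training only a new head.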

Based on the same technical concept, an embodiment of the present disclosure further provides a model training apparatus, as shown in fig. 7, including:

the adjusting module is used for adjusting the model parameters of the multi-modal feature extraction model to be trained on the basis of the sample set, and obtaining the trained multi-modal feature extraction model under the condition of meeting the training convergence condition;

the fusion module is used for carrying out fusion processing on the text characteristics of the training sample and the video characteristics of the training sample to obtain fusion characteristics;

In some embodiments, an input module comprises:

the sampling sub-module is used for carrying out down-sampling on the video of each video sample to obtain a frame sequence;

the extraction submodule is used for extracting picture characteristics of each frame in the frame sequence by adopting the picture encoder;

and the first input sub-module is used for inputting the picture characteristics of the frame sequence into the video encoder to obtain the video characteristics of the video sample.
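As an illustration of the down-sampling performed by the sampling sub-module, the following minimal sketch selects evenly spaced frame indices. Uniform sampling, the function name `downsample_indices`, and the parameter `target_frames` are assumptions made for this example; the disclosure does not prescribe a particular sampling strategy here.

```python
def downsample_indices(num_frames, target_frames):
    """Return evenly spaced frame indices forming the frame sequence."""
    if num_frames <= 0 or target_frames <= 0:
        raise ValueError("num_frames and target_frames must be positive")
    if target_frames >= num_frames:
        return list(range(num_frames))  # nothing to drop
    step = num_frames / target_frames
    return [int(i * step) for i in range(target_frames)]

# A hypothetical 10-frame video reduced to a 5-frame sequence.
frame_sequence = downsample_indices(10, 5)
```

Each selected frame would then be passed to the picture encoder, and the resulting picture features to the video encoder.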

In some embodiments, the input module further comprises:

the acquisition submodule is used for acquiring the audio of each video sample in the plurality of video samples;

the second input submodule is used for respectively inputting the audio of each video sample into the voice recognition model to obtain the audio text of each video sample;

the third input submodule is used for respectively inputting the audio text of each video sample into the text abstract extractor to obtain the video abstract text of each video sample;

and the fourth input sub-module is used for respectively inputting the video abstract text of each video sample into the text encoder to obtain the text characteristics of each video sample.

In some embodiments, the extraction module comprises:

the first determining submodule is used for determining the feature similarity between the video features and the text features in the samples to be processed aiming at each sample to be processed in the training sample set;

the second determining submodule is used for determining cross entropy loss based on the feature similarity corresponding to each sample to be processed and the sample label of each sample to be processed; the sample label is a positive sample or a negative sample;

the adjusting sub-module is used for adjusting the model parameters of the multi-modal feature extraction model to be trained on the basis of the cross entropy loss;

the first execution submodule is used for finishing the training under the condition of meeting the training convergence condition to obtain the trained multi-modal feature extraction model;

and the second execution submodule is used for returning and executing the step of respectively inputting the plurality of video samples into the multi-modal feature extraction model to be trained under the condition that the training convergence condition is not met to obtain the video features and the text features of the video samples.

In some embodiments, the fusion module is used for:

carrying out weighted average on the text features of the training samples and the video features of the training samples to obtain the fusion features; or,

and splicing the text features of the training sample and the video features of the training sample to obtain the fusion features.
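The two fusion options can be sketched as follows. The function names and the `text_weight` parameter are hypothetical; the weighted average assumes that the text features and the video features already have the same dimension.

```python
def fuse_weighted_average(text_feature, video_feature, text_weight=0.5):
    """Element-wise weighted average; text_weight is an assumed parameter."""
    if len(text_feature) != len(video_feature):
        raise ValueError("features must have the same dimension")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    video_weight = 1.0 - text_weight
    return [text_weight * t + video_weight * v
            for t, v in zip(text_feature, video_feature)]

def fuse_concat(text_feature, video_feature):
    """Splice the two features; the dimensions may differ."""
    return list(text_feature) + list(video_feature)

averaged = fuse_weighted_average([2.0, 4.0], [0.0, 0.0], text_weight=0.5)
spliced = fuse_concat([1.0, 2.0], [3.0])
```

The weighted average keeps the feature dimension unchanged, whereas splicing yields a fusion feature whose dimension is the sum of the two input dimensions.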

In some embodiments, a training module comprises:

the first training submodule is used for taking the multilayer perceptron as a network model of the target task under the condition that the target task is a classification task and training the network model of the target task based on a training label of the classification task;

and the second training submodule is used for determining the classification layer and the regression layer as the network model of the target task under the condition that the target task comprises the classification task and the regression task, and training the network model of the target task based on the classification label and the position label of the target task.

In some embodiments, an input module, comprising:

the acquisition submodule is used for acquiring text information of each video sample, wherein the text information comprises texts corresponding to audios of the video samples, texts in pictures of the video samples and text description information of the video samples;

the abstract extraction submodule is used for respectively inputting the text information of each video sample into the text abstract extractor to obtain the video abstract text of each video sample;

and the coding submodule is used for respectively inputting the video abstract text of each video sample into the text encoder to obtain the text characteristics of each video sample.

For a description of specific functions and examples of each module and sub-module of the apparatus in the embodiment of the present disclosure, reference may be made to the description of corresponding steps in the foregoing method embodiments, and details are not repeated here.

In the technical solution of the present disclosure, the acquisition, storage, application, and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 801 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the various methods and processes described above, such as the model training method. For example, in some embodiments, the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the model training method in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable model training apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A model training method, comprising:

based on the sample set, adjusting model parameters of the multi-modal feature extraction model to be trained, and obtaining a trained multi-modal feature extraction model under the condition that a training convergence condition is met;

aiming at a target task, extracting video features and text features of a training sample of the target task by adopting the trained multi-modal feature extraction model;

2. The method according to claim 1, wherein the multi-modal feature extraction model to be trained comprises a picture encoder and a video encoder, and wherein the respectively inputting the plurality of video samples into the multi-modal feature extraction model to be trained to obtain the video features of each video sample comprises:

for each video sample, down-sampling the video of the video sample to obtain a frame sequence;

extracting picture features for each frame in the frame sequence by adopting the picture encoder;

and inputting the picture characteristics of the frame sequence into the video encoder to obtain the video characteristics of the video sample.

3. The method according to claim 1 or 2, wherein the multi-modal feature extraction model to be trained comprises a speech recognition model, a text abstract extractor and a text encoder, and the step of inputting the plurality of video samples into the multi-modal feature extraction model to be trained to obtain the text feature of each video sample comprises:

obtaining the audio of each video sample in the plurality of video samples;

respectively inputting the audio of each video sample into the voice recognition model to obtain the audio text of each video sample;

respectively inputting the audio text of each video sample into the text abstract extractor to obtain the video abstract text of each video sample;

and respectively inputting the video abstract text of each video sample into the text encoder to obtain the text characteristics of each video sample.

4. The method according to any one of claims 1-3, wherein the adjusting model parameters of the multi-modal feature extraction model to be trained based on the sample set, and obtaining the trained multi-modal feature extraction model when a training convergence condition is satisfied, comprises:

determining feature similarity between video features and text features of the samples to be processed for each sample to be processed in the training sample set;

determining cross entropy loss based on the feature similarity corresponding to each sample to be processed and the sample label of each sample to be processed; the sample label is a positive sample or a negative sample;

adjusting model parameters of the multi-modal feature extraction model to be trained based on the cross entropy loss;

under the condition of meeting the training convergence condition, finishing training to obtain the trained multi-modal feature extraction model;

and under the condition that the training convergence condition is not met, returning to execute the step of respectively inputting the plurality of video samples into the multi-modal feature extraction model to be trained to obtain the video features and the text features of the video samples.

5. The method according to any one of claims 1-4, wherein the fusing the text features of the training samples and the video features of the training samples to obtain fused features comprises:

and splicing the text features of the training samples and the video features of the training samples to obtain the fusion features.

6. The method of any of claims 1-5, wherein the training of the network model of the target task based on the fused features comprises:

under the condition that the target task is a classification task, taking a multilayer perceptron as a network model of the target task, and training the network model of the target task based on a training label of the classification task;

and under the condition that the target task comprises a classification task and a regression task, determining a classification layer and a regression layer as a network model of the target task, and training the network model of the target task based on the classification label and the position label of the target task.

7. The method according to any one of claims 1-6, wherein the multi-modal feature extraction model to be trained comprises a text abstract extractor and a text encoder, and wherein the inputting the plurality of video samples into the multi-modal feature extraction model to be trained respectively to obtain the text features of each video sample comprises:

acquiring text information of each video sample, wherein the text information comprises a text corresponding to an audio of the video sample, a text in a picture of the video sample and text description information of the video sample;

respectively inputting the text information of each video sample into a text abstract extractor to obtain a video abstract text of each video sample;

8. A model training apparatus comprising:

the fusion module is used for carrying out fusion processing on the text features of the training samples and the video features of the training samples to obtain fusion features;

9. The apparatus of claim 8, the multi-modal feature extraction model to be trained comprising a picture encoder and a video encoder, wherein the input module comprises:

10. The apparatus of claim 8 or 9, the multi-modal feature extraction model to be trained comprising a speech recognition model, a text summarization extractor, and a text encoder, wherein the input module comprises:

the obtaining sub-module is used for obtaining the audio of each video sample in the plurality of video samples;

the second input submodule is used for respectively inputting the audio of each video sample into the voice recognition model to obtain the audio text of each video sample;

11. The apparatus of any one of claims 8-10, wherein the extraction module comprises:

the adjusting submodule is used for adjusting the model parameters of the multi-modal feature extraction model to be trained based on the cross entropy loss;

the first execution submodule is used for finishing training under the condition that a training convergence condition is met to obtain the trained multi-modal feature extraction model;

and the second execution sub-module is used for returning and executing the step of respectively inputting the plurality of video samples into the multi-modal feature extraction model to be trained under the condition that the training convergence condition is not met to obtain the video features and the text features of the video samples.

12. The apparatus of any one of claims 8-11, wherein the fusion module comprises:

13. The apparatus of any of claims 8-12, wherein the training module comprises:

the first training sub-module is used for taking the multilayer perceptron as a network model of the target task and training the network model of the target task based on a training label of the classification task under the condition that the target task is the classification task;

and the second training submodule is used for determining a classification layer and a regression layer as a network model of the target task under the condition that the target task comprises a classification task and a regression task, and training the network model of the target task based on the classification label and the position label of the target task.

14. The apparatus of any one of claims 8-13, the multi-modal feature extraction model to be trained comprising a text summarization extractor and a text encoder, wherein the input module comprises:

the obtaining submodule is used for obtaining text information of each video sample, and the text information comprises a text corresponding to the audio of the video sample, a text in a picture of the video sample and text description information of the video sample;

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.