CN115130650A

CN115130650A - Model training method and related device

Info

Publication number: CN115130650A
Application number: CN202210452459.XA
Authority: CN
Inventors: 李廷天; 孙子荀
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-04-27
Filing date: 2022-04-27
Publication date: 2022-09-30

Abstract

The embodiment of the application discloses a model training method and a related device in the field of artificial intelligence, wherein the method comprises the following steps: acquiring a plurality of training samples comprising video segments and audio segments; determining a first prediction characteristic corresponding to each training sample according to the first segment in each training sample through a first coding network; clustering based on first prediction features corresponding to the training samples, determining the category of the first segment in the training samples, and configuring a pseudo label for the second segment in the training samples according to the category of the first segment in the training samples; determining a second prediction characteristic corresponding to each training sample according to a second segment in each training sample through a second coding network, and determining a category prediction result corresponding to the second segment in each training sample; and training the second coding network based on the class prediction result and the pseudo label corresponding to the second segment in each training sample. The method can improve the characteristic coding capability of a video coding network and an audio coding network.

Description

Model training method and related device

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a model training method and a related device.

Background

In practical application, the interaction of vision and hearing can make the perception function of human more complete and accurate; for example, when people watch videos, people usually need to understand the content in the video pictures by means of sound. Based on this, when performing related tasks (such as classification tasks) on a video, image features and audio features of the video often need to be considered comprehensively; at present, the image characteristics of a video are mainly determined according to a video picture through a video coding network, and the audio characteristics of the video are determined according to the audio of the video through an audio coding network.

In the related art, the video coding network and the audio coding network are usually trained by means of contrast learning. Specifically, a video segment and an audio segment that are synchronized in a video can be used as positive samples, and a video segment and an audio segment in different videos or an unsynchronized video segment and an audio segment in the same video can be used as negative samples; then, a binary model for identifying the positive samples and the negative samples is trained, and a video coding network and an audio coding network included in the binary model are correspondingly trained in the process.

However, the feature coding capabilities of the video coding network and the audio coding network trained in the above manner are not ideal, and the image features and the audio features coded by the video coding network and the audio coding network are often difficult to be well applied to downstream tasks. The reason is that the difference between the positive samples and the negative samples used in the above training method is usually very obvious, and in the course of training the model, the trained binary model can easily and accurately distinguish the positive samples from the negative samples, and the video coding network and the audio coding network therein are not sufficiently trained.

Disclosure of Invention

The embodiment of the application provides a model training method and a related device, which can ensure that a video coding network and an audio coding network obtained by training have better characteristic coding capability, so that the model training method and the related device can be better applied to downstream tasks.

In view of this, a first aspect of the present application provides a model training method, including:

obtaining a plurality of training samples; the training sample comprises a video clip and an audio clip corresponding to the video clip;

for each training sample, determining a first prediction feature corresponding to the training sample according to a first segment in the training sample through a first coding network; the first coding network is any one of a video coding network and an audio coding network;

performing clustering processing based on first prediction features corresponding to the training samples respectively, and determining a category to which a first segment in each training sample belongs; configuring corresponding pseudo labels for second segments in the training samples according to the category to which the first segments in the training samples belong for each training sample; the second segment is different from the first segment;

for each training sample, determining a second prediction feature corresponding to the training sample according to a second segment in the training sample through a second coding network; determining a category prediction result corresponding to a second segment in the training sample according to a second prediction characteristic corresponding to the training sample; the second encoding network is any one of the video encoding network and the audio encoding network and is different from the first encoding network;

and training the second coding network based on the class prediction result and the pseudo label corresponding to the second segment in the plurality of training samples.

A second aspect of the present application provides a model training apparatus, the apparatus comprising:

the training sample acquisition module is used for acquiring a plurality of training samples; the training sample comprises a video clip and an audio clip corresponding to the video clip;

the first feature prediction module is used for determining a first prediction feature corresponding to each training sample according to a first segment in the training sample through a first coding network; the first coding network is any one of a video coding network and an audio coding network;

the first feature clustering module is used for performing clustering processing based on first prediction features corresponding to the training samples respectively and determining the category to which the first segment in each training sample belongs; configuring corresponding pseudo labels for second segments in the training samples according to the category to which the first segments in the training samples belong for each training sample; the second segment is different from the first segment;

the second network prediction module is used for determining a second prediction characteristic corresponding to each training sample according to a second segment in the training sample through a second coding network; determining a category prediction result corresponding to a second segment in the training sample according to a second prediction feature corresponding to the training sample; the second encoding network is any one of the video encoding network and the audio encoding network and is different from the first encoding network;

and the second network training module is used for training the second coding network based on the class prediction result and the pseudo label which respectively correspond to the second segments in the plurality of training samples.

A third aspect of the application provides a computer apparatus comprising a processor and a memory:

the memory is used for storing a computer program;

the processor is adapted to perform the steps of the model training method according to the first aspect according to the computer program.

A fourth aspect of the present application provides a computer-readable storage medium for storing a computer program for performing the steps of the model training method of the first aspect described above.

A fifth aspect of the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read by a processor of the computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of the model training method according to the first aspect.

According to the technical scheme, the embodiment of the application has the following advantages:

the embodiment of the application provides a model training method, wherein in the method, a plurality of training samples comprising video clips and corresponding audio clips are obtained; then, for each training sample, determining, by a first coding network (which may be any one of a video coding network and an audio coding network), a first prediction feature corresponding to the training sample according to a first segment (which is a segment suitable for being processed by the first coding network) in the training sample; then, clustering processing is carried out based on first prediction features corresponding to a plurality of training samples, the category of a first segment in each training sample is determined, and a corresponding pseudo label is configured for a second segment (another segment except the first segment in a video segment and an audio segment) in the training sample according to the category; furthermore, for each training sample, determining, by a second coding network (which is another coding network except the first coding network in the video coding network and the audio coding network), a second prediction feature corresponding to the training sample according to the second segment in the training sample, and determining, according to the second prediction feature corresponding to the training sample, a class prediction result corresponding to the second segment in the training sample; finally, the second coding network may be trained according to the class prediction result and the pseudo label corresponding to each second segment in the plurality of training samples. When the video coding network and the audio coding network are trained by the method, the clustering result of the coding characteristics generated by one coding network is utilized to determine the available supervision signals when the other coding network is trained; on one hand, marking of the training samples is avoided, processing resources consumed by marking of the training samples are saved, and meanwhile, the problem that the performance of a trained coding network is poor due to the fact that the constructed training samples have defects is solved; on the other hand, because the video segments and the audio segments in the training samples have corresponding relations, a corresponding pseudo label is configured for another segment in the training samples based on the feature clustering result corresponding to one segment in the training samples, so that the reliability of the configured pseudo label can be ensured to a certain extent, correspondingly, the pseudo label is used as a supervision signal to train another coding network, the reliable training of the coding network can be ensured, that is, the trained video coding network or audio coding network can have better feature coding capability, and can be better applied to downstream tasks.

Drawings

Fig. 1 is a schematic view of an application scenario of a model training method provided in an embodiment of the present application;

FIG. 2 is a schematic flowchart of a model training method according to an embodiment of the present disclosure;

fig. 3 is a schematic diagram illustrating an implementation principle of a video coding network and an audio coding network for collaborative training according to an embodiment of the present application;

FIG. 4 is a schematic diagram illustrating an implementation principle of applying a video coding network and an audio coding network to a target classification task according to an embodiment of the present application;

fig. 5 is a schematic diagram illustrating an implementation principle of applying a video coding network and an audio coding network to a background audio generation task according to an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating an implementation principle of a model training method according to an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;

fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence base technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like.

Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning.

The scheme provided by the embodiment of the application relates to the machine learning technology of artificial intelligence. In addition, the embodiment of the application can be applied to various scenes such as cloud technology, artificial intelligence, intelligent traffic, driving assistance and the like.

In the related art, the video coding network and the audio coding network trained by adopting a comparative learning mode have poor general performance, the feature coding capabilities of the video coding network and the audio coding network are not ideal, and the image features and the audio features obtained by coding the video coding network and the audio coding network are often difficult to be well applied to downstream tasks.

In order to solve the above problem, an embodiment of the present application provides a model training method, where when a video coding network and an audio coding network are trained by the method, a clustering result of a coding feature generated by one of the coding networks is used to assist in training the other coding network, so as to achieve an effect of collaborative training of the video coding network and the audio coding network, and improve performances of the trained video coding network and audio coding network.

Specifically, in the model training method provided in the embodiment of the present application, a plurality of training samples including a video clip and an audio clip corresponding to the video clip are obtained first. Then, for each training sample, a first prediction feature corresponding to the training sample is determined according to a first segment (which is a segment suitable for being processed by the first coding network) in the training sample through the first coding network (which may be any one of a video coding network and an audio coding network). Then, clustering processing is carried out based on the first prediction features corresponding to the training samples, the category to which the first segment in each training sample belongs is determined, and corresponding pseudo labels are configured for the second segment (the other segment except the first segment in the video segment and the audio segment) in the training samples according to the category. Furthermore, for each training sample, a second coding network (which is another coding network except the first coding network in the video coding network and the audio coding network) is used for determining a second prediction feature corresponding to the training sample according to the second segment in the training sample, and a category prediction result corresponding to the second segment in the training sample is determined according to the second prediction feature corresponding to the training sample. Finally, the second coding network may be trained according to the class prediction result and the pseudo label corresponding to each second segment in the plurality of training samples.

According to the model training method, the positive and negative samples are not constructed in the process of training the video coding network and the audio coding network, so that the problem caused by the use of the positive and negative samples in the model training method based on contrast learning in the related technology can be solved. When the video coding network and the audio coding network are trained through the method, the clustering result of the coding features generated by one coding network is utilized to determine the available supervision signals when the other coding network is trained. On one hand, marking of the training samples is avoided, processing resources consumed by marking of the training samples are saved, and meanwhile the problem that the performance of a trained coding network is poor due to the fact that the constructed training samples have defects can be avoided. On the other hand, because the video segments and the audio segments in the training samples have corresponding relations, a corresponding pseudo label is configured for another segment in the training samples based on the feature clustering result corresponding to one segment in the training samples, so that the reliability of the configured pseudo label can be ensured to a certain extent, correspondingly, the pseudo label is used as a supervision signal to train another coding network, the reliable training of the coding network can be ensured, that is, the trained video coding network or audio coding network can have better feature coding capability, and can be better applied to downstream tasks.

It should be understood that the model training method provided by the embodiment of the present application may be executed by a computer device with image processing capability and audio processing capability, and the computer device may be a terminal device or a server. The terminal equipment comprises but is not limited to a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, a vehicle-mounted terminal and the like; the server may specifically be an application server or a Web server, and in actual deployment, the server may be an independent server, or may also be a cluster server or a cloud server formed by a plurality of physical servers. The data related to the embodiment of the present application may be stored in a block chain.

In order to facilitate understanding of the model training method provided in the embodiment of the present application, an application scenario of the model training method is exemplarily described below by taking an execution subject of the model training method as a server as an example.

Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a model training method provided in an embodiment of the present application. As shown in fig. 1, the application scenario includes a server 110 and a database 120; the server 110 may access the database 120 via a network, or the database 120 may be integrated in the server 110. The server 110 is configured to execute the method provided in the embodiments of the present application to train a video coding network or an audio coding network; the database 120 stores a plurality of video clips and audio clips having corresponding relationships.

In practical applications, the server 110 may obtain multiple sets of video segments and audio segments having corresponding relationships from the database 120, and then use the obtained video segments and audio segments having corresponding relationships as training samples. For example, a video segment corresponding to a certain playing time period in a certain voiced video may be cut out as a video segment in a training sample, and a video audio corresponding to the playing time period in the voiced video may be cut out as an audio segment in the training sample, so that the video segment and the audio segment corresponding to the same playing time period in the voiced video that are cut out are the video segment and the audio segment having the corresponding relationship.

After the server 110 obtains a plurality of training samples including video segments and audio segments having a corresponding relationship, for each training sample, a first prediction feature corresponding to the training sample may be determined according to a first segment in the training sample through the first coding network 111. It should be noted that the first encoding network 111 may be any one of a video encoding network and an audio encoding network; accordingly, the first segment may be a segment of the video segment and the audio segment comprised by the training sample that is suitable for processing by the first coding network 111.

After the server 110 obtains the first predicted features corresponding to the multiple training samples through the first coding network 111, the first predicted features corresponding to the multiple training samples may be clustered to determine the category to which the first segment in each training sample belongs. In addition, since there is a corresponding relationship between the video segment and the audio segment included in each training sample, after the class to which the first segment in each training sample belongs is determined, a corresponding pseudo tag may be configured for the second segment in the training sample according to the class to which the first segment in the training sample belongs, and the pseudo tag may be used as a supervised signal in the subsequent training of the second coding network 112. It should be noted that the second encoding network 112 is any one of a video encoding network and an audio encoding network, and the second encoding network 112 is different from the first encoding network 111; accordingly, the second segment is a different one of the video segment and the audio segment comprised by the training sample that is suitable for processing by the second coding network 112 than the first segment.

After the server 110 configures the pseudo labels corresponding to the second segments in each training sample in the above manner, the second coding network 112 may be trained based on the second segments in each training sample and the corresponding pseudo labels thereof. Specifically, for each training sample, the server 110 may determine, through the second coding network 112, a second prediction feature corresponding to the training sample according to a second segment in the training sample; then, a class prediction result corresponding to the second segment in the training sample may be determined according to the second prediction feature corresponding to the training sample.

After obtaining the class prediction result corresponding to the second segment in each training sample in the manner described above, the server 110 may train the second coding network 112 based on the class prediction result and the pseudo label corresponding to the second segment in each training sample; it should be understood that the training for the video coding network can be achieved in the above manner when the second coding network 112 is a video coding network, and the training for the audio coding network can be achieved in the above manner when the second coding network 112 is an audio coding network.

It should be understood that in practical applications, the server 110 may train the first encoding network 111 based on the above manner, in addition to the second encoding network 112. Specifically, the server 110 may also configure a corresponding pseudo tag for the first segment in the training sample in a clustering manner; that is, the server 110 may perform clustering processing based on the second predicted features corresponding to the training samples, determine a class to which the second segment in each training sample belongs, and configure, for each training sample, a corresponding pseudo tag for the first segment in the training sample according to the class to which the second segment in the training sample belongs. Furthermore, for each training sample, the server 110 may determine, according to the first prediction feature corresponding to the training sample (i.e., the coding feature determined by the first coding network 111 according to the first segment in the training sample), a category prediction result corresponding to the first segment in the training sample; further, the first coding network 111 is trained based on the class prediction result and the pseudo label corresponding to the first segment in the plurality of training samples.

It should be understood that the application scenario shown in fig. 1 is only an example, and in practical applications, the model training method provided in the embodiment of the present application may also be applied to other scenarios; for example, server 110 may also obtain multiple training samples through other channels (e.g., determining training samples based on audio videos uploaded by particular subjects). The application scenario of the model training method provided in the embodiment of the present application is not limited at all.

The model training method provided by the present application is described in detail below by way of method embodiments.

Referring to fig. 2, fig. 2 is a schematic flowchart of a model training method provided in the embodiment of the present application. For convenience of description, the following embodiments are still introduced by taking the execution subject of the model training method as an example of the server. As shown in fig. 2, the model training method includes the following steps:

step 201: obtaining a plurality of training samples; the training samples comprise video clips and audio clips corresponding to the video clips.

In the embodiment of the present application, before a server trains a video coding network or an audio coding network, a plurality of unsupervised training samples for training the video coding network or the audio coding network need to be obtained, where the obtained training samples include a video segment and an audio segment that have a correspondence relationship.

The video coding network is a neural network for coding video features according to a plurality of frames of video pictures having a time sequence relationship in a video. The audio coding network described above is a neural network for coding audio features from audio of video. The video segment and the audio segment having a corresponding relationship in the training sample may be obtained based on the same voiced video, for example, the video segment in a certain playing time period may be intercepted in a certain voiced video, and the audio segment in the playing time period in the voiced video may be intercepted, so that the video segment and the audio segment corresponding to the same playing time period intercepted based on the same voiced video are the video segment and the audio segment having a corresponding relationship in the training sample; certainly, in practical application, the video segments and the audio segments having the corresponding relationship in the training sample may also be obtained in other manners, for example, a corresponding background audio segment may be labeled for a certain video segment, and the video segment and the labeled background audio segment may also be used as the video segment and the audio segment having the corresponding relationship in the training sample.

For example, the server may obtain the training samples based on a relevant database, where a plurality specifically means greater than or equal to two; for example, the server may obtain a large number of audio videos from the relevant database, and then cut out video segments and audio segments corresponding to the same playing time period from the audio videos to form training samples. The server can also obtain the training sample based on the audio video sent by the terminal equipment; for example, the server may receive an audio video uploaded by the terminal device, and then, a training sample is composed of a video segment and an audio segment which correspond to the same playing time period and are intercepted from the audio video; for another example, the server may receive a video clip uploaded by the terminal device and a background audio clip configured for the video clip, and then may compose a training sample by using the video clip and the background audio clip. The server can also directly acquire the training samples based on an open-source training video data set (such as an AudioSet data set), wherein the open-source training video data set usually includes a large amount of video data, and the server can acquire video data with a specific proportion (such as 90%) from the training video data set to construct training samples and acquire the remaining video data (such as 10%) in the training video data set to construct test samples; and acquiring a training sample and a test sample by intercepting a video segment and an audio segment corresponding to the same playing time period in the video data. The present application does not set any limit to the way and channel that the server obtains the training samples.

Step 202: for each training sample, determining a first prediction feature corresponding to the training sample according to a first segment in the training sample through a first coding network; the first encoding network is any one of a video encoding network and an audio encoding network.

After the server obtains a plurality of training samples, the server may perform feature coding processing on a first segment in each training sample through a first coding network, so as to obtain a first prediction feature corresponding to the training sample. It should be understood that the feature encoding process here is to convert the first segment in the training sample into machine-recognizable numerical information, and the converted numerical information can reflect the feature of the first segment itself to some extent.

The first coding network is any one of a video coding network and an audio coding network to be trained, the first segment is a segment suitable for being processed by the first coding network in a video segment and an audio segment included in the training sample, and the first prediction feature is a prediction coding feature corresponding to the first segment in the training sample. For example, when the first coding network is a video coding network, the first segment is a video segment in the training sample, and the first prediction feature is a prediction video coding feature corresponding to the video segment in the training sample; when the first coding network is an audio coding network, the first segment is an audio segment in the training sample, and the first prediction feature is a prediction audio coding feature corresponding to the audio segment in the training sample.

In a possible implementation manner, when the first coding network is a video coding network, before the server performs feature coding processing on the video segments in the training samples through the video coding network, the server may perform preprocessing on the video segments in the training samples. For example, the server may randomly sample a certain number of frames (e.g., 16 frames) of video pictures from the video clips in the training sample; and scaling the sampled video frame, for example, making the shorter side of the video frame 256 pixels without changing the aspect ratio of the video frame; then, the server may cut out an area of a certain size (e.g., 224 pixels × 224 pixels) from the video frame obtained by the scaling processing; further, an image tensor having a size of 16 × 3 × 224 × 224 may be constructed based on the extracted region in each frame of video picture, where 16 denotes the frame number of the video picture extracted from the video clip included in the training sample, 3 denotes a Red Green Blue (RGB) channel value, and 224 × 224 denotes the size of the region extracted from each frame of video picture; the image tensor thus obtained can be used as input data of a video coding network.

Furthermore, the server may input the image tensor obtained through the preprocessing into a video coding network, and the video coding network may output the corresponding predicted video coding feature, that is, the first predicted feature corresponding to the training sample, by analyzing and processing the input image tensor.

It should be noted that the video coding network may use an R (2+1) D network with a preset number of layers (e.g., 18 layers), and the network combines two-dimensional convolution and three-dimensional convolution, and may extract spatial information by using the two-dimensional convolution and synthesize spatiotemporal information by using the three-dimensional network, thereby being more beneficial to learning the characteristics of the video segments. In addition, the Video coding network may also be other network structures, such as a SlowFast network, an expanded convolutional network (unflexed 3D ConvNet, I3D), a behavior recognition network (C3D), a Video swing transform, and the like, and the structure of the Video coding network is not limited in any way in this application.

In another possible implementation manner, when the first coding network is an audio coding network, the server may perform preprocessing on the audio segments in the training samples before performing feature coding processing on the audio segments in the training samples through the audio coding network. For example, the server may perform short-time fourier transform on the audio clip, and perform logarithm processing on the result obtained by the short-time fourier transform to obtain a logarithmic spectrogram (for example, a tensor with a size of 40 × 100) with time as a horizontal axis, frequency as a vertical axis, and intensity as a value, which is used as input data of the audio coding network.

Furthermore, the server may input the logarithmic spectrogram obtained through the preprocessing into the audio coding network, and the audio coding network may output the corresponding predicted audio coding feature, that is, the first predicted feature corresponding to the training sample, by analyzing and processing the input logarithmic spectrogram.

It should be noted that the audio coding Network may use an 18-layer ResNet based on two-dimensional convolution, or may also use a time-series convolution Network (stn) or a Recurrent Neural Network (RNN) structure, and the structure of the audio coding Network is not limited in this application.

Step 203: performing clustering processing based on first prediction features corresponding to the training samples respectively, and determining a category to which a first segment in each training sample belongs; configuring corresponding pseudo labels for second segments in the training samples according to the category to which the first segments in the training samples belong for each training sample; the second segment is different from the first segment.

The server completes the encoding processing of the first segment in each training sample through the first encoding network, and after the first prediction features corresponding to each training sample are obtained, the server can further perform clustering processing on the first prediction features corresponding to each training sample to determine the category to which the first segment in each training sample belongs. Furthermore, for each training sample, the server may configure a corresponding pseudo tag for the second segment in the training sample according to the category to which the first segment in the training sample belongs.

It should be noted that the second segment is another segment in the training sample except for the first segment, for example, when the first segment is a video segment in the training sample, the second segment is an audio segment in the training sample, and when the first segment is an audio segment in the training sample, the second segment is a video segment in the training sample. The second segment is a processing object of a second coding network, and the second coding network is another coding network except the first coding network in the video coding network and the audio coding network to be trained, for example, when the first coding network is a video coding network, the second coding network is an audio coding network, and when the first coding network is an audio coding network, the second coding network is a video coding network.

It should be understood that the pseudo label configured for the second segment is equivalent to the category labeled for the second segment, and since the performance of the first coding network may not be reliable and stable enough in the model training stage, the category labeled for the second segment by performing the clustering process on the first prediction features generated by the first coding network and according to the result of the clustering process may not be reliable and stable enough, it is called a pseudo label; because the video segments and the audio segments in the training samples have corresponding relations, the class to which the first segment in the training samples belongs can also represent the class to which the second segment in the training samples belongs, and the pseudo tag can be used as a supervised signal used when the second coding network is trained based on the second segment in the training samples.

For example, the server may perform clustering processing on the first prediction features corresponding to the training samples by using a K-Means clustering algorithm to obtain a preset number of clustering clusters; each cluster corresponds to a category, and the category corresponding to the cluster to which the first prediction feature corresponding to the training sample belongs is the category to which the first segment in the training sample belongs. Furthermore, for each training sample, the server may use a class to which a first segment in the training sample belongs as a pseudo label corresponding to a second segment in the training sample, for example, may use an identifier of the class to which the first segment in the training sample belongs as a pseudo label corresponding to the second segment in the training sample.

For convenience of understanding, the pseudo tag configuration process is described below by taking the first coding network as a video coding network, the first segment as a video segment in a training sample, and the first prediction characteristic as a prediction video coding characteristic of the video segment in the training sample as an example. The video coding network is supposed to encode the video segment vi in the ith (i is an integer greater than or equal to 1) training sample to obtain corresponding predicted video coding characteristics F (vi); the server adopts a K-Means algorithm to perform clustering processing on the prediction video coding characteristics corresponding to each training sample, and divides the prediction video coding characteristics corresponding to each training sample into 256 classes; the category to which the predictive video coding feature corresponding to the video segment vi in the ith training sample belongs is y _vi That is, the category to which the video segment vi in the ith training sample belongs is y _vi (ii) a Based on this, for the audio segment ai in the ith training sample, the server may configure the corresponding pseudo label y for the audio segment ai _vi The pseudo label y _vi Can be used as a supervision signal when training an audio coding network.

It should be understood that, in practical application, when the server performs clustering processing on the first prediction features corresponding to the training samples, other clustering algorithms besides the K-Means clustering algorithm may also be used, and the clustering algorithm used in this application is not limited at all.

Step 204: for each training sample, determining a second prediction feature corresponding to the training sample according to a second segment in the training sample through a second coding network; determining a category prediction result corresponding to a second segment in the training sample according to a second prediction characteristic corresponding to the training sample; the second encoding network is any one of the video encoding network and the audio encoding network and is different from the first encoding network.

After the server obtains a plurality of training samples, a second prediction feature corresponding to each training sample can be determined according to a second segment in the training sample through a second coding network, and then the second segment in the training sample is classified according to the second prediction feature corresponding to the training sample, so that a class prediction result corresponding to the second segment in the training sample is obtained.

It should be noted that the second prediction feature is a prediction coding feature corresponding to a second segment in the training sample; for example, when the second segment is a video segment in the training sample and the second coding network is a video coding network, the second prediction feature is a prediction video coding feature corresponding to the video segment in the training sample, and when the second segment is an audio segment in the training sample and the second coding network is an audio coding network, the second prediction feature is a prediction audio coding feature corresponding to the audio segment in the training sample.

As described in step 202 above, before performing the feature coding process on the first segment in the training sample through the first coding network, the first segment in the training sample needs to be preprocessed to obtain the input data suitable for the first coding network to process, and the preprocessing manner of the video segment and the audio segment in step 202 is described in detail above. Similarly, before the server performs feature coding processing on the second segment in the training sample through the second coding network, the server also needs to perform preprocessing on the second segment in the training sample; when the second segment is a video segment in the training sample, the preprocessing method for the video segment described in step 202 may be adopted to perform preprocessing, and when the second segment is an audio segment in the training sample, the preprocessing method for the audio segment described in step 202 may be adopted to perform preprocessing, which may refer to the related descriptions above in detail, and will not be described herein again.

For example, the server may perform preprocessing on the second segment in each training sample by using a corresponding preprocessing manner, so as to obtain input data suitable for processing by the second coding network. Furthermore, for each training sample, the server may input data obtained by preprocessing a second segment in the training sample into a second coding network, and the second coding network may output a second prediction feature corresponding to the training sample by analyzing and processing the input data; then, the server can further predict a class prediction result corresponding to a second segment in the training sample according to a second prediction characteristic corresponding to the training sample through the classifier; the class prediction result is a class to which a second segment in the training sample belongs, which is predicted based on a second prediction feature corresponding to the training sample.

For convenience of understanding, the above process of determining the class prediction result is described below by taking the first coding network as a video coding network, the second coding network as an audio coding network, the first segment as a video segment in the training sample, and the second segment as an audio segment in the training sample as examples. Suppose that the category to which the video segment vi in the ith training sample belongs is determined to be y through the processing of step 203 _vi And configuring a corresponding pseudo label y for the audio segment ai in the ith training sample according to the pseudo label y _vi (ii) a For the ith training sample, the server may perform feature coding processing on the audio segment ai in the ith training sample through an audio coding network to obtain corresponding predicted audio coding features g (ai), and determine a class prediction result corresponding to the audio segment ai according to the predicted audio coding features g (ai) through a classifier

It should be understood that, in practical applications, the server may perform step 202 and step 203 first and then perform step 204, may also perform step 204 first and then perform step 202 and step 203, and may also perform step 204 and step 202 and step 203 simultaneously, and the present application does not limit the execution sequence between step 202 and step 203, and step 204 in any way. It should be noted that, since the above-mentioned step 202 and step 203 have a time-series relationship, the step 202 needs to be executed first, and then the step 203 needs to be executed based on the execution result of the step 202, the step 202 and the step 203 can be regarded as a whole, and the whole is parallel to the step 204.

Step 205: and training the second coding network based on the class prediction result and the pseudo label corresponding to the second segment in the plurality of training samples.

After the server obtains the pseudo label corresponding to the second segment in each training sample through step 203 and obtains the class prediction result corresponding to the second segment in each training sample through step 204, a loss function may be constructed based on the class prediction result and the pseudo label corresponding to each second segment in each training sample, and a model parameter in the second coding network may be adjusted based on the loss function, so as to achieve the purpose of training the second coding network.

For example, the server may construct a cross-entropy loss function according to the class prediction result and the pseudo label corresponding to each second segment in each training sample, and adjust the model parameter in the second coding network with the goal of reducing the cross-entropy loss function. It should be understood that in practical applications, the server may also construct other types of loss functions, and the present application does not limit the types of loss functions constructed when training the second coding network.

It should be understood that when the second coding network is a video coding network, the server can implement training on the video coding network in the above manner; when the second coding network is an audio coding network, the server can realize the training of the audio coding network through the mode.

In order to implement the collaborative training for the video coding network and the audio coding network, in the embodiment of the present application, a similar manner to that for training the second coding network may also be adopted to train the first coding network, that is, the collaborative training for the first coding network and the second coding network is implemented, so as to simultaneously train and obtain the video coding network and the audio coding network which can be put into practical application.

That is, the server may perform clustering processing based on the second prediction features corresponding to the training samples, and determine a category to which the second segment in each training sample belongs; and for each training sample, configuring a corresponding pseudo label for the first segment in the training sample according to the class to which the second segment in the training sample belongs. Then, for each training sample, the server may determine a class prediction result corresponding to the first segment in the training sample according to the first prediction feature corresponding to the training sample. The first coding network is trained based on class prediction results and pseudo labels corresponding to the first segments in the training samples.

For convenience of understanding, the following description will exemplarily describe the implementation of the collaborative training video coding network and the audio coding network by taking the first coding network as a video coding network, the second coding network as an audio coding network, the first segment as a video segment in a training sample, and the second segment as an audio segment in the training sample.

Fig. 3 is a schematic diagram illustrating an implementation principle of a video coding network and an audio coding network for collaborative training according to an embodiment of the present application. As shown in the left training process of fig. 3, the server may first fix the video coding network and train the audio coding network. When the audio coding network is trained, the server may perform feature coding processing on the video segment vi (i ═ 1,2,3, … …) in each training sample by using a fixed video coding network, to obtain predicted video coding features f (vi) (i.e., first predicted coding features) corresponding to each training sample; then, clustering the predictive video coding characteristics F (vi) corresponding to each training sample by using a K-Means clustering algorithm, and coding the predictive video corresponding to each training sampleDividing the code characteristics into 256 classes, and determining the class y of the video clips in each training sample _vi And configuring corresponding pseudo label y for the audio piece ai in each training sample _vi . Then, the server may perform feature coding processing on the audio segments ai in each training sample by using the currently trained audio coding network to obtain the predicted audio coding features g (ai) (i.e., second predicted coding features) corresponding to each training sample, and determine the class prediction results corresponding to the audio segments in each training sample according to the predicted audio coding features g (ai) through the classifier

Furthermore, the server can predict the result according to the category corresponding to the audio segment in each training sample

And a pseudo label y _vi And constructing a loss function, and training the audio coding network based on the loss function.

As shown in the training process on the right side of fig. 3, the server may fix the audio coding network and train the video coding network. When training the video coding network, the server may perform feature coding processing on the audio segments ai (i ═ 1,2,3, … …) in each training sample by using a fixed audio coding network, to obtain predicted audio coding features g (ai) (i.e., second predicted coding features) corresponding to each training sample; then, clustering the predicted audio coding features G (ai) corresponding to each training sample by using a K-Means clustering algorithm, dividing the predicted audio coding features corresponding to each training sample into 256 classes, and determining the class y to which the audio segments in each training sample belong _ai And configuring corresponding pseudo label y for video segment vi in each training sample _ai . Then, the server can utilize the currently trained video coding network to perform feature coding processing on the video segment vi in each training sample to obtain the predictive video coding feature f (vi) (i.e. the first predictive coding feature) corresponding to each training sample, and according to the predictive video coding feature f (vi) through the classifier,determining the corresponding category prediction result of the video segment in each training sample

Furthermore, the server can predict the result according to the category corresponding to the video clip in each training sample

And a pseudo label y _ai And constructing a loss function, and training the video coding network based on the loss function.

Therefore, the server can alternately train the video coding network and the audio coding network through the mode, and the clustering result of the prediction characteristics obtained by coding based on one coding network is used as a supervision signal used when the other coding network is trained, so that the video coding network and the audio coding network can be cooperatively and efficiently trained.

Optionally, considering that clustering processing is performed based on the respective corresponding prediction features (including the first prediction feature and the second prediction feature) of all the training samples each time, more calculation resources need to be consumed, and longer calculation time needs to be consumed; therefore, in order to save the computing resources and the computing time, in the embodiment of the present application, before performing the clustering process based on the prediction features corresponding to the respective training samples, the coding network of the prediction features used in the clustering process may be tested, and when the test determines that the performance of the coding network meets the preset requirement, the clustering process is performed based on the prediction features generated by the coding network.

In a possible implementation manner, the server may obtain a plurality of test samples before performing clustering processing based on the first prediction features corresponding to the training samples, where the test samples include video segments and audio segments corresponding to the video segments. Then, for each test sample, determining a first prediction characteristic corresponding to the test sample according to a first segment in the test sample through a first coding network; and determining a category prediction result corresponding to the first segment in the test sample according to the first prediction characteristic corresponding to the test sample. Then, constructing a first reference loss function based on the class prediction result and the pseudo label corresponding to the first segment in each test sample; here, the pseudo label corresponding to the first segment in the test sample is determined by clustering the second prediction features corresponding to the respective test samples, and the second prediction features are determined from the second segment in the test sample by the second coding network. Further, judging whether the first reference loss function meets a first preset loss condition; if yes, performing clustering processing based on the first prediction features corresponding to the training samples, and determining the category of the first segment in each training sample; if not, continuing to train the first coding network based on the plurality of training samples.

Specifically, before the server performs clustering processing based on the first prediction features corresponding to the training samples, the server may first test the first coding network generating the first prediction features by using the test samples. It should be noted that the test sample herein is a sample for testing the video coding network and the audio coding network in the process of training the two networks, and the test sample also includes a video segment and an audio segment having a corresponding relationship; the test sample may be obtained in a manner similar to the manner of obtaining the training sample described above, and for example, the server may use 90% of video data in an open-source training video data set (e.g., AudioSet data) as the training sample and 10% of video data in the training video data set as the test sample.

The server determines the first prediction feature corresponding to the test sample through the first coding network, and determines the implementation manner of the class prediction result corresponding to the first segment in the test sample according to the first prediction feature corresponding to the test sample, which is the same as the implementation manner of determining the class prediction result corresponding to the first segment in the training sample through the first coding network described above and determining the first prediction feature corresponding to the training sample according to the first prediction feature corresponding to the training sample, and is not described herein again.

After the server obtains the category prediction result corresponding to the first segment in each test sample, a first reference loss function, for example, a cross entropy loss function, may be constructed according to the category prediction result corresponding to the first segment in each test sample and the pseudo label. It should be noted that, here, the determination manner of the pseudo label corresponding to the first segment in the test sample is the same as the determination manner of the pseudo label corresponding to the first segment in the training sample described above; specifically, the second prediction features corresponding to the respective test samples may be determined according to the second segments in the respective test samples by using the second coding network obtained through the previous training round, and then the clustering process may be performed based on the second prediction features corresponding to the respective test samples, so as to determine the category to which the second segments in the respective test samples belong, and for each test sample, the category to which the second segments in the test sample belong may be used as the pseudo label corresponding to the first segment in the test sample.

Furthermore, the server can judge whether the first reference loss function meets a first preset loss condition; the first preset loss condition here is a condition for balancing whether the current performance of the first coding network meets the requirement of the current training turn for the first coding network; for example, the first preset loss condition may be that a decrease amplitude corresponding to the first reference loss function (which is a decrease amplitude of the first reference loss function determined this time relative to the first reference loss function determined last time) is smaller than a preset decrease amplitude, and the first preset loss condition may also be whether a loss value corresponding to the first reference loss function is smaller than a preset loss value, and the like. If the first reference loss function meets the first preset loss condition, the current first coding network can be considered to meet the requirement of the current training round on the first coding network, and the first prediction features currently determined by the first coding network are reliable, so that the clustering processing can be performed based on the first prediction features respectively corresponding to the training samples determined by the first coding network, accordingly, the clustering result obtained through the clustering processing is also reliable, and the reliability of the pseudo label configured for the second segment in the training samples can be ensured. On the contrary, if the first reference loss function does not satisfy the first preset loss condition, it may be considered that the current first coding network does not satisfy the requirement of the current training round for the first coding network, and the first prediction feature currently determined by the first coding network is not reliable, and accordingly, the first coding network needs to be trained continuously based on each training sample.

In another possible implementation manner, the server may obtain a plurality of test samples before performing clustering processing based on the second prediction features corresponding to the training samples, where the test samples include video segments and audio segments corresponding to the video segments. Then, for each test sample, determining a second prediction characteristic corresponding to the test sample according to a second segment in the test sample through a second coding network; and determining a class prediction result corresponding to the second segment in the test sample according to the second prediction characteristic corresponding to the test sample. Then, constructing a second reference loss function based on the class prediction result and the pseudo label corresponding to the second fragment in each test sample; here, the pseudo label corresponding to the second segment in the test sample is determined by clustering the first prediction features corresponding to the respective test samples, and the first prediction features are determined according to the first segment in the test sample through the first coding network. Further, judging whether the second reference loss function meets a second preset loss condition; if yes, performing clustering processing based on second prediction features corresponding to the training samples respectively, and determining the category of a second segment in each training sample; if not, continuing to train the second coding network based on the plurality of training samples.

Similarly, before performing clustering processing based on the second prediction features corresponding to the training samples, the server may also test the second coding network generating the second prediction features by using the test samples. The implementation manner of specifically testing the second coding network is similar to the implementation manner of testing the first coding network, and may refer to the related description above in detail, which is not described herein again.

It should be understood that the second prediction loss condition according to when testing the second coding network is a condition for balancing whether the current performance of the second coding network meets the requirement of the current training turn on the second coding network; for example, the second preset loss condition may be that a decrease amplitude corresponding to the second reference loss function (which is a decrease amplitude of the second reference loss function determined this time relative to the second reference loss function determined last time) is smaller than a preset decrease amplitude, and the second preset loss condition may also be that whether a loss value corresponding to the second reference loss function is smaller than a preset loss value, and the like.

In this way, before the server performs the clustering processing based on the prediction features (including the first prediction feature and the second prediction feature) corresponding to each training sample, the test samples are used for testing the coding network generating the prediction features, so that the reliability of the prediction features used in the clustering processing can be ensured, the times of clustering processing required to be executed in the process of cooperatively training the video coding network and the audio coding network can be reduced, the model training efficiency can be improved, and the waste of computing resources can be reduced.

In the embodiment of the present application, in order to ensure that the trained video coding network and audio coding network both have better performance, the server may iteratively train the video coding network and the audio coding network for a plurality of training rounds. Specifically, when the trained first coding network meets a first training end condition in the current training round and the trained second coding network meets a second training end condition in the current training round, it may be determined that the model training of the current training round is completed; then, detecting whether the number of times of the currently completed training rounds reaches a preset training number; if so, determining that the training of the first coding network and the second coding network is finished; if not, continuing to execute the model training of the next training round.

Specifically, when the server trains the first coding network and the second coding network in turn, it may be detected whether the currently-trained coding network meets the training end condition corresponding to the coding network in the current training turn. Taking as an example that in the alternate training process, the first coding network is fixed, the second coding network is trained, and then the second coding network is fixed, and the first coding network is trained; when the server trains the second coding network, whether the trained second coding network meets a second training end condition in the current training round can be detected, if so, the training of the current training round on the second coding network can be determined to be completed, the training on the first coding network is started, and if not, the training on the second coding network based on the training sample is required to be continued; the second training end condition is a condition for measuring whether to stop training for the second coding network in the current training round, and the second training end condition may be, for example, that the second reference loss function mentioned above satisfies the second preset loss condition, and the second training end condition may be, for example, that the number of times of training for the second coding network in the current training round reaches the preset number of times, and so on, and the second training end condition is not limited in this application.

After the server finishes training the second coding network in the current training round, the server may train the first coding network, and in the process of training the first coding network, the server may detect whether the trained first coding network meets a first training end condition in the current training round, and if so, may determine to finish training the first coding network in the current training round, that is, may determine to finish model training (including training the first coding network and the second coding network) in the current round, and if not, it needs to continue training the first coding network based on the training sample; the first training end condition is a condition for measuring whether to stop training for the first coding network in the current training round, and the first training end condition may be, for example, that the first reference loss function mentioned above satisfies a first preset loss condition, and the first training end condition may be, for example, that the number of times of training for the first coding network in the current training round reaches a preset number of times, and so on, and the first training end condition is not limited in this application.

After the server determines that model training of the current training round is completed, whether the number of times of the current completed training round reaches a preset training number (such as 10 times) can be detected; if so, determining that the training of the first coding network and the second coding network is completed, that is, determining that the training of the video coding network and the audio coding network is completed; if not, the model training of the next training round can be started.

Therefore, through the mode, on the basis of training the first coding network and the second coding network in turn, the first coding network and the second coding network are subjected to multi-round training, and the first coding network and the second coding network are ensured to meet the corresponding training end conditions in each round of training, so that the trained first coding network and the trained second coding network can be ensured to have better model performance.

Optionally, in the embodiment of the present application, in order to further improve the model training efficiency and ensure that the trained video coding network and audio coding network have better model performance, the embodiment of the present application may further introduce a knowledge distillation idea in the process of constructing the loss function.

The knowledge distillation is to utilize the characteristics learned by one coding network to influence the training of the other coding network, and to realize information transfer between the trained video coding network and audio coding network. In the embodiment of the present application, a class to which one segment in a training sample belongs is utilized, and a corresponding pseudo tag is configured for another segment in the training sample, which can essentially play a role in information transfer, but the information transfer mode is inefficient. The reason is that the configured pseudo tag is essentially a "hard tag" which can only represent that the data belongs to a certain category, and cannot further represent the relationship between the data and other data, nor represent the relationship between each category. For example, if a certain video segment belongs to the category of "pop piano", the predicted video coding feature corresponding to the video segment should be closer to the predicted video coding feature corresponding to the video segment belonging to the category of "pop guitar", "pop accordion", etc. because the relationship between "pop piano" and "pop guitar", etc. is closer, and conversely, the predicted video coding feature corresponding to the video segment should be further away from the predicted video coding feature corresponding to the video segment belonging to "pop basketball", because the relationship between "pop piano" and "pop basketball", etc. is farther; the pseudo label cannot reflect the distance of the relation between the data and the data, namely, in the process of model training, the relation between the video segment configured with the pseudo label 'playing piano' and the video segment configured with the 'playing guitar' is equal to the relation between the video segment configured with the pseudo label 'playing piano' and the video segment configured with the pseudo label 'playing basketball'; and the method is not beneficial to improving the expression capability of the audio and video coding network to improve the learned audio and video characteristics.

Based on this, in the embodiment of the present application, when the server trains the second coding network, the second coding network may be trained based on the class prediction result corresponding to each of the first segments in the plurality of training samples, the class prediction result corresponding to each of the second segments in the plurality of training samples, and the pseudo label. Similarly, when the server trains the first coding network, the server may train the first coding network based on the class prediction result corresponding to each of the second segments in the plurality of training samples, the class prediction result corresponding to each of the first segments in the plurality of training samples, and the pseudo label.

Specifically, when the server trains the second coding network, the server may determine a difference between a distribution of the class prediction results corresponding to the first segments in each training sample and a distribution of the class prediction results corresponding to the second segments in each training sample, and determine a difference between the class prediction results corresponding to the second segments in each training sample and the pseudo labels, construct a loss function, and train the second coding network based on the loss function. Similarly, when the server trains the first coding network, the server may determine a difference between a distribution of the class prediction results corresponding to the first segment in each training sample and a distribution of the class prediction results corresponding to the second segment in each training sample, determine a difference between the class prediction results corresponding to the first segment in each training sample and the pseudo label, construct a loss function, and train the first coding network based on the loss function.

In one possible implementation, the server may train the second coding network by: aiming at each training sample, constructing a basic loss function corresponding to the training sample according to a class prediction result and a pseudo label corresponding to a second segment in the training sample; and constructing a distillation loss function corresponding to the training sample according to the respective corresponding class prediction results of the first fragment and the second fragment in the training sample; further, the server may train the second coding network based on a base loss function and a distillation loss function corresponding to each training sample.

Specifically, the server may construct, for each training sample, a cross entropy loss function according to a difference between a class prediction result corresponding to the second segment in the training sample and the pseudo label, where the cross entropy loss function is used as a basic loss function corresponding to the training sample; in addition, for each training sample, the server may further construct a distillation loss function corresponding to the training sample according to a difference between a class prediction result corresponding to a first segment in the training sample and a class prediction result corresponding to a second segment in the training sample. Furthermore, the server can construct a comprehensive loss function according to the basic loss function and the distillation loss function corresponding to each training sample; for example, the server may construct an overall basic loss function according to the basic loss function corresponding to each training sample, construct an overall distillation loss function according to the distillation loss function corresponding to each training sample, and perform weighting processing on the overall basic loss function and the overall distillation loss function to obtain a comprehensive loss function. Finally, the server may adjust the model parameters of the second coding network based on the composite loss function.

It should be understood that the training mode of the first coding network is similar to the training mode of the second coding network, and the description thereof is omitted here.

As an example, when the server constructs the distillation loss function corresponding to the training sample, at least one of a response-based distillation loss function and a relationship-based distillation loss function may be constructed. Specifically, the server may construct a first distillation loss function (i.e., a response-based distillation loss function) according to a difference between the class prediction results corresponding to each of the first segment and the second segment in the training sample; the server may construct a second distillation loss function (i.e., a relationship-based distillation loss function) based on the difference between the class prediction result corresponding to the first segment in the training sample and the class prediction results corresponding to the first segment in the other training samples and the difference between the class prediction result corresponding to the second segment in the training sample and the class prediction results corresponding to the second segment in the other training samples.

For example, the class prediction result may include probabilities that the fragments belong to the classes, and besides a positive label (i.e., the class to which the fragment belongs and the probability that the fragment belongs to the class is highest), a negative label (i.e., the class to which the fragment does not belong and the probability that the fragment belongs to the class is not highest) also contains a large amount of model-induced reasoning knowledge and relationships between classes. For example, for a fully trained classifier, because of the high similarity between the three categories of "playing piano", "playing guitar" and "playing accordion", the classifier should have a high response to the prediction of a segment on the category of "playing piano" (i.e. the probability of predicting the segment as belonging to the category of "playing piano"), and to the prediction of the segment on the categories of "playing guitar" and "playing accordion".

The embodiments of the present application build a response-based distillation loss function, i.e., a first distillation loss function, for transferring knowledge from the encoding network of one modality to the encoding network of another modality based on the above knowledge. Specifically, for each training sample, the server may construct a first distillation loss function corresponding to the training sample according to a difference between a class prediction result corresponding to a first segment in the training sample and a class prediction result corresponding to a second segment in the training sample. When training the second coding network, the server needs to construct an overall first distillation loss function according to the respective corresponding basic first distillation loss functions of the training samples, and the server may specifically be configured toConstructing the integral first distillation loss function L by the following formula (1) _Response ：

Wherein N is the total number of training samples,

predicting the result for the category corresponding to the video segment vi in the ith training sample,

and predicting the result for the class corresponding to the audio segment ai in the ith training sample.

Furthermore, in addition to the relationships between classes, the relationships between training samples are also important knowledge that needs to be learned. The relation between the training samples learned by one coding network to be trained is transferred to the other coding network, the distance between the training samples of the same type is shortened, the distance between the training samples of different types is further widened, and the training speed of the two coding networks can be improved.

Based on the knowledge, the embodiment of the application constructs a relationship-based distillation loss function for transferring the learned relationship between the training samples from the coding network of one modality to the coding network of another modality, namely a second distillation loss function. Specifically, for each training sample, the server may construct a second distillation loss function corresponding to the training sample according to a difference between the class prediction result corresponding to the first segment in the training sample and the class prediction results corresponding to the first segments in the other training samples, and a difference between the class prediction result corresponding to the second segment in the training sample and the class prediction results corresponding to the second segments in the other training samples. When training the second coding network, the server needs to construct an overall second distillation loss function according to the respective corresponding second distillation loss functions of the training samples, and the server can specifically construct the overall second distillation loss function according to the following formula (2)Distillation loss function L _Relation ：

Wherein N is the total number of training samples,

for the class prediction result corresponding to the video segment vj in the jth training sample,

for the class prediction result corresponding to the audio segment ai in the ith training sample,

and predicting the result for the class corresponding to the audio segment aj in the jth training sample.

In this way, by introducing at least one of the first distillation loss function and the second distillation loss function in the above manner, it is possible to improve the training efficiency for the video coding network and the audio coding network and improve the feature encoding capability of the trained video coding network and audio coding network based on the introduced distillation loss function.

The video coding network and the audio coding network obtained by training through the model training method provided by the embodiment of the application can be further applied to downstream tasks, so that the realization of the downstream tasks is assisted by using the image characteristics obtained by video coding network coding and the audio characteristics obtained by audio coding network coding. In the embodiment of the present application, the downstream task may include any task implemented based on image features obtained through video coding network coding and audio features obtained through audio coding network coding, such as a video classification task, an action recognition task, a video background audio generation task, and the like, and the present application does not limit the downstream task in any way.

In a possible implementation manner, the video coding network and the audio coding network trained by the model training method provided in the embodiment of the present application may be applied to a target classification task, where the target classification task is a task of classifying videos based on video frames in the videos and audio of the videos, and the target classification task may be, for example, a motion recognition task (i.e., a task of recognizing a category to which a motion existing in a video belongs), and may also be, for example, a universal video classification task (i.e., a category to which a video belongs, such as a game video, a gourmet video, a pet video, a cosmetic video, and the like).

When the trained video coding network and audio coding network are applied to a target classification task, a small amount of first labeled samples labeled in advance can be utilized to train a classification model comprising the video coding network and the audio coding network. That is, the server may obtain a plurality of first labeled samples corresponding to the target classification task, where the first labeled samples include video segments and audio segments having a corresponding relationship, and classification tags; then, the server can determine the image characteristics corresponding to the first labeling sample according to the video segment in the first labeling sample through a video coding network in the classification model to be trained; determining the audio characteristics corresponding to the first labeled sample according to the audio segment in the first labeled sample through an audio coding network in the classification model; further, determining a category prediction result corresponding to the first labeled sample according to the image feature and the audio feature corresponding to the first labeled sample through a classifier in the classification model; finally, the classification model is trained based on the class prediction result corresponding to the first labeled sample and the classification label in the first labeled sample.

It should be noted that, the first labeled sample is a supervised training sample for training a classification model for executing a target classification task; the first labeled sample includes video segments and audio segments having corresponding relationships, where the video segments and audio segments having corresponding relationships are obtained in a manner similar to the manner of obtaining the video segments and audio segments having corresponding relationships included in the training sample introduced in step 201 above; the first labeling sample further includes a classification label, where the classification label is used to characterize a category to which the video segment and the audio segment in the first labeling sample belong, and the category characterized by the classification label is a certain classification category in the target classification task, for example, when the target classification task is an action recognition task, the category characterized by the classification label is a category to which an action existing in the video belongs.

In addition, the classification model to be trained comprises the video coding network and the audio coding network which are trained by the model training method provided by the embodiment of the application and have better characteristic coding capability after being trained, so that when the classification model is trained, only a small amount of first labeling samples under a target classification task are used for fine tuning the classification model, namely only a small amount of first labeling samples are obtained, and resources consumed for labeling the first labeling samples can be reduced.

Fig. 4 is a schematic diagram illustrating an implementation principle of applying a video coding network and an audio coding network to a target classification task according to an embodiment of the present application. As shown in fig. 4, after the server obtains the first labeled sample, the server may first pre-process the video segment and the audio segment in the first labeled sample to obtain input data that can be processed by the video coding network and the audio coding network, where the pre-processing manner is described in detail in step 202 and is not described herein again. After input data corresponding to a video segment in a first labeled sample is obtained through preprocessing by a server, the input data is input into a video coding network of a classification model to be trained, and the video coding network can correspondingly output image characteristics corresponding to the first labeled sample through analyzing and processing the input data; similarly, after the server obtains the input data corresponding to the audio segment in the first labeled sample through preprocessing, the input data is input into the audio coding network of the classification model to be trained, and the audio coding network can correspondingly output the audio features corresponding to the first labeled sample through analyzing and processing the input data. Furthermore, the image feature and the audio feature corresponding to the first labeled sample can be spliced to obtain a fusion feature, and the classification prediction result corresponding to the first labeled sample is determined according to the fusion feature through a classifier in the classification model. And finally, constructing a loss function according to the class prediction result corresponding to the first labeled sample and the classification label in the first labeled sample, and training the classification model based on the loss function so as to adjust the parameters of the video coding network, the audio coding network and the classifier in the classification model.

Therefore, the video coding network and the audio coding network obtained by training through the method provided by the embodiment of the application are applied to the classification model for realizing the target classification task, so that the labeled samples required in the process of training the classification model can be reduced, the classification model can be subjected to supervised fine adjustment based on a small amount of labeled samples under the target classification task, a better model training effect can be achieved through short-time training, and the classification model obtained by training can be ensured to have higher accuracy.

As an example, the classification model trained in the above manner may be applied to a target classification task such as motion recognition, video classification, and the like, and at this time, when the target classification task is executed based on the classification model, the server may determine, according to a first to-be-processed video segment, an image feature corresponding to the first to-be-processed video segment through a video coding network in the classification model; determining the audio characteristics corresponding to the first to-be-processed audio segment according to the first to-be-processed audio segment corresponding to the first to-be-processed video segment through an audio coding network in the classification model; and then, determining a classification result of the first to-be-processed video clip in the target classification task according to the image feature corresponding to the first to-be-processed video clip and the audio feature corresponding to the first to-be-processed audio clip by using the classifier in the classification model.

It should be noted that the first to-be-processed video clip and the first to-be-processed audio clip are processing objects when performing the target classification task, that is, the first to-be-processed video clip is a video clip to be classified, and the first to-be-processed audio clip is an audio clip corresponding to the first to-be-processed video clip, such as an audio of the first to-be-processed video clip or a background audio corresponding to the first to-be-processed video clip.

When the server executes the target classification task, the first to-be-processed video segment and the first to-be-processed audio can be preprocessed respectively to obtain input data which can be processed by a video coding network and an audio coding network. Then, determining image characteristics corresponding to the first to-be-processed video clip according to input data corresponding to the first to-be-processed video clip through a video coding network in a trained classification model; and determining the audio characteristics corresponding to the first audio clip to be processed according to the input data corresponding to the first audio clip to be processed through the audio coding network in the trained classification model. Further, through a classifier in the trained classification model, a classification result of the first to-be-processed video clip in the target classification task is obtained according to the image feature corresponding to the first to-be-processed video clip and the splicing feature of the audio feature corresponding to the first to-be-processed audio clip; for example, when the target classification task is an action recognition task, the classification result should be a category to which an action existing in the first to-be-processed video segment belongs, and for example, when the target classification task is a universal video classification task, the classification result should be a category to which the first to-be-processed video segment belongs, such as a game video, a food video, a pet video, a makeup video, and the like.

As an example, the classification model trained in the above manner may be applied to an action timing positioning task, where the action timing positioning task is to detect, for a video segment with a long duration, a category to which an action existing therein belongs, and determine occurrence timing of each detected action. When the action timing positioning task is executed based on the classification model, the server may determine, for a second video segment to be processed, sub-candidate video segments in which a preset action exists in the second video segment to be processed, and determine an arrangement order of each sub-candidate video segment. Then, the server can determine the image characteristics corresponding to each sub-candidate video clip according to the sub-candidate video clip through the video coding network in the classification model aiming at each sub-candidate video clip; and determining the audio characteristics corresponding to the sub-candidate audio segments according to the sub-candidate audio segments corresponding to the sub-candidate video segments through the audio coding network in the classification model. And then, determining the action recognition result corresponding to the sub-candidate video clip according to the image feature corresponding to the sub-candidate video clip and the audio feature corresponding to the sub-candidate audio clip by the classifier in the classification model. And finally, determining an action time sequence positioning result according to the action identification result corresponding to each sub-candidate video clip and the arrangement sequence of each sub-candidate video clip.

It should be noted that the second to-be-processed video segment is a processing object when performing the action timing positioning task, and the duration of the second to-be-processed video segment is usually long, wherein multiple actions may be involved, and wherein a sub-segment without any action occurring may be included.

When the server executes the action timing positioning task for the second to-be-processed video clip, the regressor may first detect sub-candidate video clips, in which actions may exist, in the second to-be-processed video clip, and determine starting times and ending times corresponding to the sub-candidate video clips, so as to determine a time arrangement order of the sub-candidate video clips. For each sub-candidate video clip, the server may pre-process the sub-candidate video clip and its corresponding sub-candidate audio clip to obtain input data for the video coding network and the audio coding network to process; then, the server can process the input data corresponding to the sub-candidate video clips through the video coding network in the trained classification model to obtain the image characteristics corresponding to the sub-candidate video clips; processing the input data corresponding to the sub-candidate audio segments through the audio coding network in the classification model to obtain the audio features corresponding to the sub-candidate audio segments; furthermore, by means of the classifier in the classification model, according to the image features corresponding to the sub-candidate video clips and the splicing features of the audio features corresponding to the sub-candidate audio clips, the action identification result corresponding to the sub-candidate video clips is determined, that is, the action types to which the actions existing in the sub-candidate video clips belong are determined. Finally, the server may arrange the action recognition results corresponding to the sub-candidate video clips according to the time arrangement sequence of the sub-candidate video clips, so as to obtain an action time sequence positioning result corresponding to the second to-be-processed video clip.

In another possible implementation manner, the video coding network and the audio coding network trained by the model training method provided in the embodiment of the present application may be applied to a background audio generation task, where the background audio generation task is a task for generating a corresponding background audio for a video segment.

When the video coding network and the audio coding network obtained by training are applied to a background audio generation task, a small amount of second labeled samples labeled in advance can be utilized to respectively train a feature conversion network and a background audio generation model; the feature conversion network is a neural network for converting image features corresponding to the video segments into corresponding audio features, and the background audio generation model is a neural network model for generating corresponding background audio according to the audio features.

That is, the server may obtain a plurality of second annotation samples corresponding to the background audio generation task, where the second annotation samples include video segments and annotation background audio segments having a corresponding relationship. Then, the server can determine the image characteristics corresponding to the video segment according to the video segment in the second labeled sample through a video coding network obtained through pre-training; and determining the audio characteristics corresponding to the labeled background audio segment according to the labeled background audio segment in the second labeled sample through the audio coding network obtained by pre-training. Furthermore, the server can perform feature conversion processing on the image features corresponding to the video clips in the second labeling sample through a feature conversion network to be trained to obtain reference audio conversion features; and training the feature conversion network based on the audio feature corresponding to the labeled background audio segment in the second labeled sample and the reference audio conversion feature. In addition, the server can also generate a prediction background audio segment according to the audio characteristics corresponding to the marked background audio segment through a background audio generation model to be trained; and training a background audio generation model based on the labeled background audio segment and the predicted background audio segment.

It should be noted that, the second labeled sample is a supervised training sample for training the feature transformation network and the background audio generation model for executing the background audio generation task; the second annotation sample comprises a video segment and an annotation background audio segment which have a corresponding relationship, wherein the annotation background audio segment is background audio annotated by combining the content of the video segment.

Fig. 5 is a schematic diagram illustrating an implementation principle of applying a video coding network and an audio coding network to a background audio generation task according to an embodiment of the present application. As shown in fig. 5, for a certain second labeled sample, the server may first pre-process the video segment and the labeled background audio segment in the second labeled sample, respectively, to obtain input data corresponding to the video segment and the labeled background audio segment. Then, the server can process the input data corresponding to the video segment in the second annotation sample through a pre-trained video coding network to obtain the image characteristic corresponding to the video segment; and processing the input data corresponding to the labeled background audio segment in the second labeled sample through a pre-trained audio coding network to obtain the audio features corresponding to the labeled background audio segment.

Furthermore, the server may train the feature conversion network and the background audio generation model respectively based on the image feature corresponding to the video segment in the second labeled sample and the audio feature corresponding to the labeled background audio segment in the second labeled sample.

When the feature transformation network is trained specifically, the server may first perform feature transformation processing on the image features corresponding to the video segments in the second labeled sample by using the feature transformation network to be trained, so as to obtain corresponding reference audio transformation features. Furthermore, the server may construct a loss function according to a difference between the reference audio conversion feature and the audio feature corresponding to the labeled background audio piece in the second labeled sample, and train the feature conversion network based on the loss function.

When the background audio generation model is specifically trained, the server may first generate the prediction background audio segment by using the background audio generation model to be trained, according to the audio feature corresponding to the labeled background audio segment in the second labeled sample. Furthermore, the server may construct a loss function according to a difference between the labeled background audio segment and the predicted background audio segment, and train the background audio generation model based on the loss function.

Therefore, the video coding network and the audio coding network obtained by training through the method provided by the embodiment of the application are used for training the feature conversion network and the background audio generation model for realizing the background audio generation task, and the background audio generated based on the feature conversion network and the background audio generation model obtained by training can be ensured to be more matched with the video.

As an example, the feature transformation network and the background audio generation model trained in the above manner may implement the background audio generation task by: determining image characteristics corresponding to a target video clip according to the target video clip of the background audio to be generated through a video coding network; then, performing feature conversion processing on the image features corresponding to the target video clip through a feature conversion network to obtain audio conversion features corresponding to the target video clip; and then, generating a background audio clip corresponding to the target video clip according to the audio conversion characteristics corresponding to the target video clip by using the background audio generation model.

Specifically, for a target video segment of the background audio to be generated, the server may first pre-process the target video segment to obtain input data corresponding to the target video segment, and then the server may input the input data corresponding to the target video segment into a pre-trained video coding network, where the video coding network may correspondingly output image features corresponding to the target video segment by analyzing and processing the input data. Furthermore, the server may input the image feature corresponding to the target video segment into a pre-trained feature conversion network, and the feature conversion network correspondingly obtains the audio conversion feature corresponding to the target video segment by performing feature conversion processing on the image feature. Finally, the server can process the audio conversion characteristics corresponding to the target video clip through a pre-trained background audio generation model to obtain the background audio clip corresponding to the target video clip.

It should be understood that the video coding network and the audio coding network trained by the method provided in the embodiments of the present application may be applied to other downstream tasks besides the downstream tasks described above, and the present application does not limit the downstream tasks to which the video coding network and the audio coding network may be applied.

When the model training method is used for training a video coding network and an audio coding network, a clustering result of coding features generated by one coding network is utilized to determine a supervision signal available for training the other coding network. On one hand, marking of the training samples is avoided, processing resources consumed by marking of the training samples are saved, and meanwhile the problem that the performance of a trained coding network is poor due to the fact that the constructed training samples have defects can be avoided. On the other hand, because the video segments and the audio segments in the training samples have the corresponding relationship, the pseudo labels corresponding to one segment in the training samples are configured for the other segment in the training samples based on the feature clustering result corresponding to the one segment in the training samples, so that the reliability of the configured pseudo labels can be ensured to a certain extent, correspondingly, the pseudo labels are used as supervision signals to train another coding network, the reliable training of the coding network can be ensured, namely, the trained video coding network or audio coding network can have better feature coding capability, and can be better applied to downstream tasks.

In order to further understand the model training method provided in the embodiments of the present application, the following describes an exemplary implementation principle diagram shown in fig. 6.

In an embodiment of the present application, a video coding network and audio coding may be trained based on an AudioSet of data. The AudioSet data set comprises more than two million videos, and in the embodiment of the application, 90% of the videos can be selected as training samples, and the remaining 10% of the videos can be used as test samples. The duration of each video in the data set is about 10 seconds, and the frame rate is 30 frames/second; in the embodiment of the application, a video segment and an audio segment which have a corresponding relation and are 2 seconds long can be intercepted from each video, and a training sample and a testing sample are constructed. For the video clips in the training sample and the test sample, the server may randomly sample 16 video frames from the 2-second video clip, and for each video frame, may perform scaling without changing an aspect ratio thereof, so that a shorter side is 256 pixels, and then randomly intercept an area with a size of 224 × 224 pixels from the scaled video frame, so that, through the above preprocessing, a corresponding video tensor vi with a size of 16 × 3 × 224 × 224 may be obtained for each video clip, as an input of the video coding network. For the audio clips in the training sample and the test sample, the server may perform short-time fourier transform on the 2-second audio clip, and perform logarithm calculation on the short-time fourier transform result to obtain a logarithmic spectrogram ai (which is an audio tensor with a size of 40 × 100) with time as a horizontal axis, frequency as a vertical axis, and intensity as a value, as an input of the audio coding network.

When training the video coding network and the audio coding network, the server can train the video coding network and the audio coding network in turn. Specifically, the video coding network may be fixed first, and the audio coding network may be trained. That is, the video tensors vi (i is 1,2,3, and … …) corresponding to the video clips in each training sample are sequentially processed by the video coding network, so as to obtain the predictive video coding features corresponding to each training sample. Then, clustering the prediction video coding features corresponding to the training samples by adopting a K-Means clustering algorithm, dividing the prediction video coding features into 256 classes, and determining the class y of the video segments in each training sample _vi And taking the category of the video clip in each training sample as the category corresponding to the audio clip in the training sampleA pseudo tag. Furthermore, the audio tensors ai corresponding to the audio segments in the training samples can be sequentially processed through the audio coding network to obtain the predicted audio coding features corresponding to the training samples, and the class prediction results corresponding to the audio segments in the training samples are determined according to the predicted audio coding features corresponding to the training samples

Finally, the pseudo labels corresponding to the audio segments in the training samples can be used as supervision signals, and a cross entropy loss function is constructed based on the class prediction results and the pseudo labels corresponding to the audio segments in the training samples, so that the audio coding network is trained.

After the training of the audio coding network in the current training round is completed, the server can fix the audio coding network and train the video coding network. That is, the server may perform clustering processing on the predicted audio coding features (determined by the audio coding network obtained by training in the current training round) corresponding to each training sample by using a K-Means clustering algorithm, divide the predicted audio coding features into 256 classes, and determine the class y to which the audio segment in each training sample belongs _ai And the class to which the audio clip in each training sample belongs is taken as a pseudo label corresponding to the video clip in the training sample. Then, the video tensors vi corresponding to the video segments in the training samples can be sequentially processed through a video coding network to obtain predicted video coding features corresponding to the training samples, and the category prediction results corresponding to the video segments in the training samples are determined according to the predicted video coding features corresponding to the training samples

Furthermore, the pseudo label corresponding to the video segment in each training sample can be used as a supervised signal, and a cross entropy loss function is constructed based on the class prediction result and the pseudo label corresponding to the video segment in each training sample, so as to train the video coding network.

The server may repeatedly perform the step of training the video coding network and the audio coding network in turn for 10 times, i.e. perform 10 rounds of training on the video coding network and the audio coding network, respectively.

Considering that it is time-consuming to perform clustering processing based on the predictive coding features (including the predictive video coding features and the predictive audio coding features) corresponding to all training samples, it is possible to first test the coding network process for generating the predictive coding features according to the clustering processing by using each test sample before performing the clustering processing each time, continue training the coding network before the loss function obtained based on the test sample stops descending, perform clustering processing based on the predictive coding features generated by the coding network after the coding network is sufficiently trained, and train another coding network based on the result of the clustering processing. By the method, the times of clustering can be reduced, and the model training efficiency is improved.

In addition, in the embodiment of the application, in the process of training the video coding network and the audio coding network, a knowledge distillation idea can be introduced to transfer information between the trained video coding network and audio coding network, so that the coding network of one modality can learn the knowledge of the other modality, thereby accelerating network training and improving performance.

Specifically, the class prediction result usually includes probabilities that the fragments belong to each class, and besides a positive label (i.e., the class to which the fragment belongs, and the probability that the fragment belongs to the class are the highest), a negative label (i.e., the class to which the fragment does not belong, and the probability that the fragment belongs to the class are not the highest) also includes a large amount of knowledge induced by model reasoning and relationships between classes. For example, for a fully trained classifier, because of the high similarity between the three categories of "playing piano", "playing guitar" and "playing accordion", the classifier should have a high response to the prediction of a segment on the category of "playing piano" (i.e. the probability of predicting the segment as belonging to the category of "playing piano"), and to the prediction of the segment on the categories of "playing guitar" and "playing accordion". Based on this, when training the video coding network and the audio coding network, the embodiments of the present application may construct a response-based distillation loss function as shown below:

wherein N is the total number of training samples,

In addition to the relationships between classes, the relationships between training samples are also important knowledge to learn. The relation between the training samples learned by one coding network to be trained is transferred to the other coding network, the distance between the training samples of the same type is shortened, the distance between the training samples of different types is further widened, and the training speed of the two coding networks can be improved. Based on this, when the embodiments of the present application train the video coding network and the audio coding network, a relationship-based distillation loss function as shown below can be constructed:

wherein N is the total number of training samples,

for the class prediction result corresponding to the audio piece ai in the ith training sample,

If the server trains the video coding network and the audio coding network, and simultaneously constructs a cross entropy loss function (constructed according to the difference between the class prediction result and the pseudo label) and the above-mentioned distillation loss function based on the response and the distillation loss function based on the relationship, the three functions can be weighted to obtain a comprehensive loss function, and the video coding network and the audio coding network are trained based on the comprehensive loss function.

The video coding network and the audio coding network trained in the above way can be applied to classification tasks (such as action recognition tasks, video classification tasks, etc.). In this case, the server may obtain an annotation training sample under the corresponding classification task, where such annotation training sample may include a video segment and an audio segment having a corresponding relationship, and a corresponding classification label. Then, the server can determine the image characteristics corresponding to the labeled training sample according to the video segments in the labeled training sample through a video coding network in the classification model to be trained; and determining the audio characteristics corresponding to the labeled training samples according to the audio segments in the labeled training samples through the audio coding network in the classification model. Furthermore, the server can determine a category prediction result corresponding to the labeled training sample according to the video feature and the audio feature corresponding to the labeled training sample through the classifier in the classification model; finally, the server can train the classification model according to the class prediction result corresponding to the labeled training sample and the classification label in the labeled training sample.

More specifically, the video coding network and the audio coding network trained in the above manner can be applied to classification tasks (such as game operation recognition tasks, game video classification tasks, and the like) in game scenes. For example, when applied to a game operation recognition task, the server may acquire video segments and audio segments having a corresponding relationship in a game video and corresponding classification labels (for characterizing game operations included in the game video) to construct a label training sample, and then train a game operation recognition model including a video coding network and an audio coding network based on the acquired label training sample. For another example, when applied to a game video classification task, the server may obtain video segments and audio segments having a corresponding relationship in a game video and corresponding classification labels (for characterizing a category to which the game video belongs) to construct a label training sample, and then train a game video classification model including a video coding network and an audio coding network based on the obtained label training sample.

The video coding network and the audio coding network trained in the above manner can be applied to a background audio generation task (such as a background music generation task). In this case, the server may obtain a label training sample under the corresponding background audio generation task, where such a label training sample may include a video segment and a label background audio segment corresponding thereto. Then, the server can determine the image characteristics corresponding to the labeled training sample according to the video segments in the labeled training sample through a pre-trained video coding network; and determining the audio characteristics corresponding to the labeled training sample according to the labeled background audio segments in the labeled training sample through a pre-trained audio coding network. Furthermore, the server may perform conversion processing on the image features corresponding to the labeled training sample through a feature conversion network to be trained to obtain audio conversion features corresponding to the labeled training sample, and train the feature conversion network according to a distance (such as a cosine distance) between the audio conversion features and the audio features corresponding to the labeled training sample. In addition, the server can also generate a prediction background audio segment according to the audio characteristics corresponding to the labeled training sample through a background audio generation model to be trained, and train the background audio generation model according to the prediction background audio segment and the labeled background audio segment in the labeled training sample.

More specifically, the video coding network and the audio coding network trained in the above manner can be applied to a background audio generation task in a game scene, that is, a background audio matched with an operation rhythm in a game video is generated. In this case, the server may obtain a labeling training sample including a game video clip and a corresponding labeling background audio clip; then, generating image characteristics and audio characteristics corresponding to the labeled training sample according to the game video segment and the labeled background audio segment in the labeled training sample through a video coding network and an audio coding network which are trained in advance respectively; training a feature conversion network for converting the image features into the audio features based on the image features and the audio features corresponding to the labeled training samples; and training a background audio generation model for predicting the background audio of the game video based on the audio features corresponding to the labeled training samples and the labeled background audio segments.

Aiming at the model training method described above, the present application also provides a corresponding model training device, so that the model training method can be applied and implemented in practice.

Referring to fig. 7, fig. 7 is a schematic structural diagram of a model training apparatus 700 corresponding to the model training method shown in fig. 2. As shown in fig. 7, the model training apparatus 700 includes:

a training sample obtaining module 701, configured to obtain a plurality of training samples; the training sample comprises a video clip and an audio clip corresponding to the video clip;

a first feature prediction module 702, configured to determine, for each training sample, a first prediction feature corresponding to the training sample according to a first segment in the training sample through a first coding network; the first coding network is any one of a video coding network and an audio coding network;

a first feature clustering module 703, configured to perform clustering processing based on first prediction features corresponding to the multiple training samples, and determine a category to which a first segment in each of the training samples belongs; configuring corresponding pseudo labels for second segments in the training samples according to the category to which the first segments in the training samples belong for each training sample; the second segment is different from the first segment;

a second network prediction module 704, configured to determine, for each training sample, a second prediction feature corresponding to the training sample according to a second segment in the training sample through a second coding network; determining a category prediction result corresponding to a second segment in the training sample according to a second prediction feature corresponding to the training sample; the second encoding network is any one of the video encoding network and the audio encoding network and is different from the first encoding network;

a second network training module 705, configured to train the second coding network based on the class prediction result and the pseudo label that each correspond to a second segment in the multiple training samples.

Optionally, the apparatus further comprises:

the second feature clustering module is used for clustering based on second prediction features corresponding to the training samples respectively and determining the category of a second segment in each training sample; configuring a corresponding pseudo label for a first segment in the training sample according to the class to which a second segment in the training sample belongs for each training sample;

the first network prediction module is used for determining a category prediction result corresponding to a first segment in the training samples according to a first prediction characteristic corresponding to the training samples aiming at each training sample;

and the first network training module is used for training the first coding network based on the class prediction result and the pseudo label which respectively correspond to the first segment in the plurality of training samples.

Optionally, the apparatus further includes a first network test module; the first network test module is used for:

obtaining a plurality of test samples; the test sample comprises a video clip and an audio clip corresponding to the video clip;

for each test sample, determining a first prediction characteristic corresponding to the test sample according to a first segment in the test sample through the first coding network; determining a category prediction result corresponding to a first segment in the test sample according to a first prediction characteristic corresponding to the test sample;

constructing a first reference loss function based on the category prediction result and the pseudo label corresponding to the first segment in the plurality of test samples; the pseudo label corresponding to the first segment in the test sample is determined by clustering second prediction features corresponding to the plurality of test samples, and the second prediction features are determined according to the second segment in the test sample through the second coding network;

judging whether the first reference loss function meets a first preset loss condition or not; if yes, performing clustering processing based on the first prediction features corresponding to the training samples, and determining the category of the first segment in each training sample; if not, continuing to train the first coding network based on the plurality of training samples.

Optionally, the apparatus further includes a second network test module; the second network test module is used for:

for each test sample, determining a second prediction characteristic corresponding to the test sample according to a second segment in the test sample through the second coding network; determining a category prediction result corresponding to a second segment in the test sample according to a second prediction characteristic corresponding to the test sample;

constructing a second reference loss function based on the category prediction result and the pseudo label corresponding to each second segment in the plurality of test samples; the pseudo label corresponding to the second segment in the test sample is determined by clustering a first prediction characteristic corresponding to each of the plurality of test samples, and the first prediction characteristic is determined according to the first segment in the test sample through the first coding network;

judging whether the second reference loss function meets a second preset loss condition or not; if yes, performing clustering processing based on second prediction features corresponding to the training samples, and determining a category to which a second segment in each training sample belongs; if not, continuing to train the second coding network based on the plurality of training samples.

Optionally, the apparatus further includes a training round detection module; this training round detection module is used for:

determining to complete model training of the current training round when the trained first coding network meets a first training end condition in the current training round and the trained second coding network meets a second training end condition in the current training round;

detecting whether the number of times of the currently completed training round reaches a preset training number;

if so, determining that the training of the first coding network and the second coding network is finished; if not, continuing to execute the model training of the next training round.

Optionally, the second network training module 705 is specifically configured to:

and training the second coding network based on the class prediction result corresponding to each first segment in the plurality of training samples, the class prediction result corresponding to each second segment in the plurality of training samples and the pseudo label.

for each training sample, constructing a basic loss function corresponding to the training sample according to a class prediction result and a pseudo label corresponding to a second segment in the training sample; constructing a distillation loss function corresponding to the training sample according to the respective corresponding class prediction results of the first segment and the second segment in the training sample;

and training the second coding network based on the basic loss function and the distillation loss function corresponding to the plurality of training samples respectively.

Optionally, the second network training module 705 is specifically configured to construct a distillation loss function corresponding to the training sample by at least one of the following methods:

constructing a first distillation loss function according to the difference between the corresponding class prediction results of the first fragment and the second fragment in the training sample;

and constructing a second distillation loss function according to the difference between the class prediction result corresponding to the first segment in the training sample and the class prediction results corresponding to the first segments in other training samples and the difference between the class prediction result corresponding to the second segment in the training sample and the class prediction results corresponding to the second segments in other training samples.

Optionally, the apparatus further comprises a classification model training module; the classification model training module is used for:

obtaining a plurality of first labeling samples corresponding to the target classification task; the first labeling sample comprises a video segment, an audio segment and a classification label, wherein the video segment and the audio segment have corresponding relations;

determining image characteristics corresponding to the first labeling sample according to the video segment in the first labeling sample through the video coding network in the classification model to be trained; determining audio features corresponding to the first labeled sample according to the audio segments in the first labeled sample through the audio coding network in the classification model;

determining a category prediction result corresponding to the first labeled sample according to the image characteristic and the audio characteristic corresponding to the first labeled sample through a classifier in the classification model;

and training the classification model based on the class prediction result corresponding to the first labeled sample and the classification label in the first labeled sample.

Optionally, the apparatus further comprises a first classification model application module; the first classification model application module is used for:

after the training of the classification model is completed, determining image characteristics corresponding to a first video clip to be processed according to the first video clip to be processed through the video coding network in the classification model; determining, by the audio coding network in the classification model, an audio feature corresponding to the first to-be-processed audio segment according to the first to-be-processed audio segment corresponding to the first to-be-processed video segment;

and determining a classification result of the first to-be-processed video clip in the target classification task according to the image feature corresponding to the first to-be-processed video clip and the audio feature corresponding to the first to-be-processed audio clip by the classifier in the classification model.

Optionally, the apparatus further comprises a second classification model application module; the second classification model application module is used for:

when the target classification task is an action timing sequence positioning task, after training of the classification model is completed, determining sub-candidate video segments with preset actions in a second video segment to be processed according to the second video segment to be processed, and determining the arrangement sequence of the sub-candidate video segments;

for each sub-candidate video segment, determining image features corresponding to the sub-candidate video segment according to the sub-candidate video segment through the video coding network in the classification model; determining audio features corresponding to the sub-candidate audio clips according to the sub-candidate audio clips corresponding to the sub-candidate video clips through the audio coding network in the classification model; determining, by the classifier in the classification model, an action identification result corresponding to the sub-candidate video clip according to the image feature corresponding to the sub-candidate video clip and the audio feature corresponding to the sub-candidate audio clip;

and determining an action time sequence positioning result according to the action identification result corresponding to each sub-candidate video clip and the arrangement sequence of each sub-candidate video clip.

Optionally, the apparatus further comprises a background audio generation model training module; the background audio generation model training module is used for:

obtaining a plurality of second labeling samples corresponding to the background audio generation task; the second labeling sample comprises a video segment and a labeling background audio segment which have a corresponding relation;

determining image characteristics corresponding to the video segments according to the video segments in the second labeled sample through the video coding network; determining, by the audio coding network, an audio feature corresponding to the labeled background audio segment according to the labeled background audio segment in the second labeled sample;

performing feature conversion processing on image features corresponding to the video clips in the second labeling sample through a feature conversion network to be trained to obtain reference audio conversion features; training the feature conversion network based on the audio features corresponding to the labeled background audio segments in the second labeled sample and the reference audio conversion features;

generating a predictive background audio segment according to the audio characteristics corresponding to the labeled background audio segment by using a background audio generation model to be trained; training the background audio generation model based on the labeled background audio segment and the predicted background audio segment.

Optionally, the apparatus further comprises a background audio generation model application module; the background audio generation model application module is configured to:

after the training of the feature conversion network and the background audio generation model is completed, determining image features corresponding to a target video segment of the background audio to be generated through the video coding network according to the target video segment;

performing feature conversion processing on the image features corresponding to the target video clip through the feature conversion network to obtain audio conversion features corresponding to the target video clip;

and generating a background audio clip corresponding to the target video clip according to the audio conversion characteristics corresponding to the target video clip by the background audio generation model.

When the model training device trains the video coding network and the audio coding network, the clustering result of the coding features generated by one coding network is used for determining the supervision signals available when the other coding network is trained. On one hand, marking of the training samples is avoided, processing resources consumed by marking of the training samples are saved, and meanwhile the problem that the performance of a trained coding network is poor due to the fact that the constructed training samples have defects can be avoided. On the other hand, because the video segments and the audio segments in the training samples have corresponding relations, a corresponding pseudo label is configured for another segment in the training samples based on the feature clustering result corresponding to one segment in the training samples, so that the reliability of the configured pseudo label can be ensured to a certain extent, correspondingly, the pseudo label is used as a supervision signal to train another coding network, the reliable training of the coding network can be ensured, that is, the trained video coding network or audio coding network can have better feature coding capability, and can be better applied to downstream tasks.

The embodiment of the present application further provides a computer device for training a model, where the computer device may specifically be a terminal device or a server, and the terminal device and the server provided in the embodiment of the present application will be described in terms of hardware implementation.

Referring to fig. 8, fig. 8 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 8, for convenience of explanation, only the parts related to the embodiments of the present application are shown, and details of the technology are not disclosed, please refer to the method part of the embodiments of the present application. The terminal may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sales (POS), a vehicle-mounted computer, and the like, taking the terminal as a computer as an example:

fig. 8 is a block diagram showing a partial structure of a computer related to a terminal provided in an embodiment of the present application. Referring to fig. 8, the computer includes: radio Frequency (RF) circuitry 810, memory 820, input unit 830 (including touch panel 831 and other input devices 832), display unit 840 (including display panel 841), sensor 850, audio circuitry 860 (which may connect speaker 861 and microphone 862), wireless fidelity (WiFi) module 870, processor 880, and power supply 890. Those skilled in the art will appreciate that the computer architecture shown in FIG. 8 is not intended to be limiting of computers, and may include more or fewer components than shown, or a combination of certain components, or a different arrangement of components.

The memory 820 may be used to store software programs and modules, and the processor 880 executes various functional applications of the computer and data processing by operating the software programs and modules stored in the memory 820. The memory 820 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the computer, etc. Further, the memory 820 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

The processor 880 is a control center of the computer, connects various parts of the entire computer using various interfaces and lines, performs various functions of the computer and processes data by operating or executing software programs and/or modules stored in the memory 820 and calling data stored in the memory 820. Optionally, processor 880 may include one or more processing units; preferably, the processor 880 may integrate an application processor, which mainly handles operating systems, user interfaces, applications, etc., and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor described above may not be integrated into processor 880.

In this embodiment of the present application, the processor 880 included in the terminal is further configured to execute the steps of any implementation manner of the model training method provided in this embodiment of the present application.

Referring to fig. 9, fig. 9 is a schematic structural diagram of a server 900 according to an embodiment of the present disclosure. The server 900 may vary widely in configuration or performance and may include one or more Central Processing Units (CPUs) 922 (e.g., one or more processors) and memory 932, one or more storage media 930 (e.g., one or more mass storage devices) storing applications 942 or data 944. Memory 932 and storage media 930 can be, among other things, transient storage or persistent storage. The program stored on the storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, a central processor 922 may be provided in communication with the storage medium 930 to execute a series of instruction operations in the storage medium 930 on the server 900.

The Server 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input-output interfaces 958, and/or one or more operating systems, such as a Windows Server ^TM ，Mac OS X ^TM ，Unix ^TM ,Linux ^TM ，FreeBSD ^TM And so on.

The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.

The CPU 922 is configured to execute steps of any implementation manner of the model training method provided in the embodiment of the present application.

The embodiments of the present application further provide a computer-readable storage medium, configured to store a computer program, where the computer program is configured to execute any one implementation manner of the model training method described in the foregoing embodiments.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes any one of the implementation manners of the model training method described in the foregoing embodiments.

It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing computer programs, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A method of model training, the method comprising:

2. The method of claim 1, further comprising:

performing clustering processing based on second prediction features corresponding to the training samples respectively, and determining a category to which a second segment in each training sample belongs; configuring a corresponding pseudo label for a first segment in the training sample according to the class to which a second segment in the training sample belongs for each training sample;

for each training sample, determining a category prediction result corresponding to a first segment in the training sample according to a first prediction feature corresponding to the training sample;

and training the first coding network based on the class prediction result and the pseudo label corresponding to the first segment in the plurality of training samples.

3. The method of claim 1, further comprising:

for each test sample, determining a first prediction characteristic corresponding to the test sample according to a first segment in the test sample through the first coding network; determining a category prediction result corresponding to a first segment in the test sample according to a first prediction feature corresponding to the test sample;

4. The method of claim 2, further comprising:

constructing a second reference loss function based on the category prediction result and the pseudo label corresponding to each second segment in the plurality of test samples; the pseudo label corresponding to the second segment in the test sample is determined by clustering first prediction features corresponding to the plurality of test samples, wherein the first prediction features are determined according to the first segment in the test sample through the first coding network;

5. The method according to any one of claims 2 to 4, further comprising:

6. The method of claim 1, further comprising:

the training the second coding network based on the class prediction result and the pseudo label corresponding to each second segment in the plurality of training samples comprises:

7. The method of claim 6, wherein training the second coding network based on the class prediction results corresponding to the first segments of the plurality of training samples, the class prediction results corresponding to the second segments of the plurality of training samples, and the pseudo label comprises:

for each training sample, constructing a basic loss function corresponding to the training sample according to a class prediction result and a pseudo label corresponding to a second segment in the training sample; and constructing a distillation loss function corresponding to the training sample according to the respective corresponding class prediction results of the first fragment and the second fragment in the training sample;

training the second coding network based on the basis loss function and the distillation loss function corresponding to each of the plurality of training samples.

8. The method according to claim 7, wherein the constructing the distillation loss function corresponding to the training sample according to the class prediction result corresponding to each of the first segment and the second segment in the training sample comprises at least one of:

constructing a first distillation loss function according to the difference between the class prediction results corresponding to the first segment and the second segment in the training sample;

9. The method according to claim 1 or 2, characterized in that the method further comprises:

10. The method of claim 9, wherein after completing the training of the classification model, the method further comprises:

determining image characteristics corresponding to a first video clip to be processed according to the first video clip to be processed through the video coding network in the classification model; determining, by the audio coding network in the classification model, an audio feature corresponding to the first to-be-processed audio segment according to the first to-be-processed audio segment corresponding to the first to-be-processed video segment;

11. The method of claim 9, wherein when the target classification task is an action timing positioning task, after completing the training of the classification model, the method further comprises:

aiming at a second video clip to be processed, determining sub-candidate video clips with preset actions in the second video clip to be processed, and determining the arrangement sequence of each sub-candidate video clip;

12. The method according to claim 1 or 2, characterized in that the method further comprises:

acquiring a plurality of second labeling samples corresponding to the background audio generation task; the second labeling sample comprises a video segment and a labeling background audio segment which have a corresponding relation;

generating a prediction background audio segment according to the audio characteristics corresponding to the marked background audio segment by using a background audio generation model to be trained; training the background audio generation model based on the labeled background audio segment and the predicted background audio segment.

13. The method of claim 12, wherein after completing the training of the feature transformation network and the background audio generation model, the method further comprises:

determining image characteristics corresponding to a target video clip according to the target video clip of the background audio to be generated through the video coding network;

14. A model training apparatus, the apparatus comprising:

the second network prediction module is used for determining a second prediction characteristic corresponding to each training sample according to a second segment in the training sample through a second coding network; determining a category prediction result corresponding to a second segment in the training sample according to a second prediction characteristic corresponding to the training sample; the second encoding network is any one of the video encoding network and the audio encoding network and is different from the first encoding network;

15. A computer device, the device comprising a processor and a memory;

the memory is used for storing a computer program;

the processor is configured to perform the model training method of any one of claims 1 to 13 in accordance with the computer program.

16. A computer-readable storage medium for storing a computer program for performing the model training method of any one of claims 1 to 13.

17. A computer program product comprising a computer program or instructions, characterized in that the computer program or the instructions, when executed by a processor, implement the model training method of any one of claims 1 to 13.