CN116958851A - Training method, device, equipment and storage medium for video aging model - Google Patents

Training method, device, equipment and storage medium for video aging model

Info

Publication number
CN116958851A
Authority
CN
China
Prior art keywords
feature extraction
text
video
visual
aging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211528249.0A
Other languages
Chinese (zh)
Inventor
杨善明
李和瀚
司建锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211528249.0A
Publication of CN116958851A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method, device, equipment and storage medium for a video aging model, and relates to the field of artificial intelligence. The method comprises the following steps: respectively inputting at least two kinds of modal data of a sample video into at least two feature extraction layers to obtain at least two modal features corresponding to the at least two kinds of modal data; inputting the at least two modal features into a feature fusion layer to obtain a fusion feature; outputting at least two modal prediction results according to the at least two modal features; outputting a fusion prediction result according to the fusion feature; training the at least two feature extraction layers according to the loss between the at least two modal prediction results and the aging label of the sample video; and training the at least two feature extraction layers and the feature fusion layer according to the loss between the fusion prediction result and the aging label of the sample video. The method can train a video aging model, and determining video aging with the trained video aging model improves the recognition accuracy of video aging.

Description

Training method, device, equipment and storage medium for video aging model
Technical Field
The application relates to the field of artificial intelligence, in particular to a training method, device and equipment for a video aging model and a storage medium.
Background
With the rapid development of internet communication technology, networks have become an important way for people to acquire information and share information. The server can push various information to the client, and the user can acquire various information through the client. For example, a server of a video application may push recommended videos to a client through which a user clicks to view the videos.
Videos have timeliness (aging); for example, a video about a news event may lose its news value after a period of time. For another example, a preview video for an event loses its timeliness once the event begins. Therefore, each video needs to be annotated with its video aging, and after the video aging expires, the video is no longer frequently pushed to clients.
In the related art, the video aging of a video to be pushed is determined through manual annotation. However, annotators often apply subjective criteria when judging aging, which leads to unstable labeling quality.
Disclosure of Invention
The embodiments of the application provide a training method, device, equipment and storage medium for a video aging model. The method can train a video aging model and use it to determine video aging; during training, the multi-modal feature extraction layers and the feature fusion layer of the video aging model are each constrained according to their respective prediction results, which improves the feature extraction capability of the model and thus the recognition accuracy of video aging. The technical solution is as follows.
According to an aspect of the present application, there is provided a training method of a video aging model, the method comprising:
respectively inputting at least two modal data of a sample video into at least two feature extraction layers to obtain at least two modal features corresponding to the at least two modal data;
inputting the at least two modal features into a feature fusion layer to obtain fusion features;
outputting at least two modal prediction results according to the at least two modal characteristics; outputting a fusion prediction result according to the fusion characteristics;
training the at least two feature extraction layers according to the at least two modal prediction results and the loss of the aging label of the sample video; and training the at least two feature extraction layers and the feature fusion layer according to the fusion prediction result and the loss of the aging label of the sample video.
According to another aspect of the present application, there is provided a training apparatus for a video aging model, the apparatus comprising:
the feature extraction module is used for respectively inputting at least two modal data of the sample video into at least two feature extraction layers to obtain at least two modal features corresponding to the at least two modal data;
The feature fusion module is used for inputting the at least two modal features into a feature fusion layer to obtain fusion features;
the prediction module is used for outputting at least two modal prediction results according to the at least two modal characteristics; outputting a fusion prediction result according to the fusion characteristics;
the training module is used for training the at least two feature extraction layers according to the at least two modal prediction results and the loss of the aging label of the sample video; and training the at least two feature extraction layers and the feature fusion layer according to the fusion prediction result and the loss of the aging label of the sample video.
According to another aspect of the present application, there is provided a computer device comprising a processor and a memory, wherein at least one instruction, at least one program, a code set or an instruction set is stored in the memory, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the training method of the video aging model described above.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes or a set of instructions, the at least one instruction, the at least one program, the set of codes or the set of instructions being loaded and executed by a processor to implement the method of training a video aging model as described in the above aspect.
According to another aspect of the embodiments of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the training method of the video aging model provided in the optional implementations described above.
The technical solutions provided by the embodiments of the application bring at least the following beneficial effects:
the video aging model is trained, the video aging model is used for extracting the characteristics of the multi-mode data of the video, the video aging of the video is predicted based on the fusion characteristics of the multi-mode characteristics, and the labeling quality of the video aging is ensured. In addition, as the convergence rates of the feature extraction layers of different modal data are inconsistent, in order to ensure that the feature extraction layers of each modal can be converged to an optimal state, the feature prediction output by each feature extraction layer is used to obtain a prediction result in the training process, and the corresponding feature extraction layer is trained according to the prediction result corresponding to each feature extraction layer and the loss of a real label, so that other feature extraction layers are trained with sufficient constraint force after the convergence of part of the feature extraction layers, the accuracy of feature extraction layers in feature extraction is improved, and the accuracy of video aging marking by a video aging model is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a computer device provided by an exemplary embodiment of the present application;
FIG. 2 is a method flow diagram of a training method for a video aging model provided in another exemplary embodiment of the application;
FIG. 3 is a schematic diagram of a training method for a video aging model provided by another exemplary embodiment of the present application;
FIG. 4 is a method flow diagram of a training method for a video aging model provided in another exemplary embodiment of the application;
FIG. 5 is a schematic diagram of a training method for a video aging model provided by another exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a training method for a video aging model provided by another exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a training method for a video aging model provided by another exemplary embodiment of the present application;
FIG. 8 is a method flow diagram of a training method for a video aging model provided in another exemplary embodiment of the application;
FIG. 9 is a block diagram of a training apparatus for a video aging model provided in another exemplary embodiment of the application;
fig. 10 is a schematic diagram of a server according to another exemplary embodiment of the present application;
fig. 11 is a block diagram of a terminal provided in another exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level technologies and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to identify and measure targets, and performs further graphic processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include techniques such as training of video aging models, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as training of video aging models, fingerprint recognition, and the like.
Key technologies of speech technology (Speech Technology) include automatic speech recognition (ASR), text-to-speech (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future development direction of human-computer interaction, and speech is expected to become one of the best modes of human-computer interaction in the future.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
With the research and progress of artificial intelligence technology, artificial intelligence technology has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care and smart customer service. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important role.
Fig. 1 shows a schematic diagram of a computer device 101 provided by an exemplary embodiment of the application, which computer device 101 may be a terminal or a server.
The terminal may include at least one of a smart phone, a notebook computer, a desktop computer, a tablet computer, and a smart robot. In an optional implementation manner, the video aging model obtained by training by the training method of the video aging model provided by the application can be applied to an application program with a video pushing function, and the application program can be: video applications, applet applications, information applications, social applications, shopping applications, live applications, forum applications, life class applications, office applications, and the like. Optionally, a client of the application program is installed on the terminal.
The terminal stores a video aging model 102. When the client needs to use the video pushing function, the client can call the video aging model to perform video aging annotation on a video, so as to determine whether to push the video according to its video aging.
Illustratively, the server stores a video aging model 102, and the server can call the video aging model to perform video aging annotation on a video, so as to determine whether to push the video according to its video aging and push recommended videos to the client. The training process of the video aging model can be completed by this server or by another server.
The terminal and the server are connected with each other through a wired or wireless network.
The terminal includes a first memory and a first processor. The first memory stores a training algorithm of a video aging model; the training algorithm of the video aging model is called and executed by the first processor to realize the training method of the video aging model. The first memory may include, but is not limited to, the following: random access Memory (Random Access Memory, RAM), read Only Memory (ROM), programmable Read Only Memory (Programmable Read-Only Memory, PROM), erasable Read Only Memory (Erasable Programmable Read-Only Memory, EPROM), and electrically erasable Read Only Memory (Electric Erasable Programmable Read-Only Memory, EEPROM).
The first processor may be one or more integrated circuit chips. Alternatively, the first processor may be a general purpose processor, such as a central processing unit (Central Processing Unit, CPU) or a network processor (Network Processor, NP). Optionally, the first processor may implement the training method of the video aging model provided by the present application by running a program or code.
The server includes a second memory and a second processor. The second memory stores a training algorithm of the video aging model; the training algorithm of the video aging model is called by the second processor to realize the training method of the video aging model. Alternatively, the second memory may include, but is not limited to, the following: RAM, ROM, PROM, EPROM, EEPROM. Alternatively, the second processor may be a general purpose processor, such as a CPU or NP.
As shown in fig. 1, the video aging model 102 stored in the computer device 101 includes at least two feature extraction layers, a feature fusion layer and a full connection layer (Fully Connected layers, FC), different feature extraction layers being used to extract features of different modality data of the video.
When the video aging model is used, the multi-modal data (at least two modal data) of the video are respectively input into the corresponding feature extraction layers. For example, the multi-modal data of the video includes first modal data, second modal data, third modal data … …, and nth modal data, where n is a positive integer; the first mode data is input into the first feature extraction layer, the second mode data is input into the second feature extraction layer, the third mode data is input into the third feature extraction layer … …, and the nth mode data is input into the nth feature extraction layer, so that at least two features are obtained.
And inputting at least two features into a feature fusion layer, and carrying out feature fusion to obtain fusion features. Alternatively, the feature fusion layer may employ a cross attention network to fuse the multimodal features of the video.
The fusion features are input into the full connection layer to output a prediction result (predicted video aging). Optionally, a softmax layer may be further connected after the fully connected layer, and the softmax layer outputs the prediction result.
Optionally, the video aging model 102 is trained by using the training method of the video aging model provided by the embodiment of the present application. In the training process, each feature extraction layer is connected with a full connection layer, so that a prediction result is output according to the features output by the feature extraction layer, and the feature extraction layer is trained according to the prediction result and the loss of the real tag. Meanwhile, according to the prediction result output by the fusion features and the loss of the real label, all the feature extraction layers and the feature fusion layers are trained, so that the multi-mode feature extraction capability of the model can be trained simultaneously, the feature extraction layers of all modes can be guaranteed to be converged to the optimal state, and the feature extraction capability of the model is improved.
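As an illustration of the training-phase structure described above, the following sketch shows a minimal two-branch version of such a model in PyTorch. The module names, feature dimension and the specific fusion operation are assumptions made for illustration only and are not taken from the patent:

```python
import torch
import torch.nn as nn

class VideoAgingModel(nn.Module):
    """Minimal training-phase sketch: one encoder per modality, a fusion layer,
    one FC head per modality (training only) and one FC head for the fused feature."""

    def __init__(self, visual_encoder, text_encoder, feat_dim, num_labels):
        super().__init__()
        self.visual_encoder = visual_encoder              # e.g. a 3D video backbone
        self.text_encoder = text_encoder                  # e.g. a BERT-style encoder
        self.fusion = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.fc_fusion = nn.Linear(feat_dim, num_labels)  # FC after the fusion layer
        self.fc_visual = nn.Linear(feat_dim, num_labels)  # FC1, used only during training
        self.fc_text = nn.Linear(feat_dim, num_labels)    # FC2, used only during training

    def forward(self, visual_data, text_data):
        f_v = self.visual_encoder(visual_data)            # (B, feat_dim) visual feature
        f_t = self.text_encoder(text_data)                # (B, feat_dim) text feature
        # cross-attention fusion: the text feature attends to the visual feature
        fused, _ = self.fusion(f_t.unsqueeze(1), f_v.unsqueeze(1), f_v.unsqueeze(1))
        fused = fused.squeeze(1)
        return {
            "fusion": self.fc_fusion(fused),              # fusion prediction logits
            "visual": self.fc_visual(f_v),                # visual modality prediction logits
            "text": self.fc_text(f_t),                    # text modality prediction logits
        }
```

In the application phase, only the "fusion" output of such a sketch would be used, matching the removal of the per-modality FC layers described below.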
FIG. 2 illustrates a flowchart of a method for training a video aging model provided by an exemplary embodiment of the present application. The method may be performed by a computer device, for example, a terminal or server as shown in fig. 1. The method comprises the following steps.
Step 210, at least two modal data of the sample video are respectively input into at least two feature extraction layers, so as to obtain at least two modal features corresponding to the at least two modal data.
Optionally, the video aging model is trained using a training sample set. The training sample set comprises at least one sample video and a corresponding aging label for each sample video.
Classified by video duration, the sample videos may include short videos, medium videos and long videos. And/or, the sample videos may include variety shows, TV series, movies, animations, live broadcasts and information videos. And/or, the sample videos may include news videos, food videos, life-sharing videos, topic-related videos, and the like.
The aging label can be a label marked manually, and can also be a fuzzy label automatically generated according to the related information of the video.
Alternatively, the aging labels may be divided into two granularities, coarse and fine. When the video aging is smaller than a first time threshold, the aging label uses a fine-grained label; when the video aging is larger than the first time threshold, the aging label uses a coarse-grained label. The time interval between two adjacent fine-grained labels is a smaller time interval (e.g., one hour), and the time interval between two adjacent coarse-grained labels is a larger time interval (e.g., one day). For example, for videos whose aging is less than three days, one hour may be used as the granularity, with one aging label per hour, i.e., the aging labels include: 0 hours, 1 hour, 2 hours ... 71 hours, a total of 72 fine-grained labels. For videos whose aging is greater than three days, one day may be used as the granularity, with one aging label per day, e.g., the aging labels include: three days, four days, five days, six days ... three hundred and sixty-five days (taking 365 days as the maximum video aging here, although the maximum video aging may be set smaller or larger), a total of 363 coarse-grained labels.
Regarding the format of the aging label, the aging label may contain a one-digit value indicating whether the label is a fine-grained label or a coarse-grained label; for example, when the first digit of the aging label is 0, the aging label is a fine-grained label, and when the first digit is 1, the aging label is a coarse-grained label. For example, if an aging label is 1123, the aging label is a coarse-grained label, and the video aging indicated by the aging label is 123 days; for another example, if an aging label is 024, the aging label is a fine-grained label indicating that the video aging is 24 hours.
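As a small illustration of this label format, the following helper functions (hypothetical names, not part of the patent) encode and decode aging labels using the three-day threshold from the example above:

```python
def encode_aging_label(aging_hours: int) -> str:
    """Encode a video aging value using the leading-digit format described above:
    a '0' prefix marks a fine-grained (hour) label, a '1' prefix a coarse-grained (day) label."""
    if aging_hours < 72:                 # finer than three days: hour granularity
        return f"0{aging_hours}"
    return f"1{aging_hours // 24}"       # three days or more: day granularity

def decode_aging_label(label: str) -> int:
    """Return the video aging in hours indicated by an encoded aging label."""
    value = int(label[1:])
    return value if label[0] == "0" else value * 24

# e.g. encode_aging_label(24) -> "024" (24 hours); encode_aging_label(123 * 24) -> "1123" (123 days)
```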
For example, the aging label may be used for aging-based withdrawal of videos: after the time since the video was uploaded reaches the aging label, the server may no longer actively push the video to clients, or the server may not show the video in the recommendation page of the client, or the server may reduce the probability of pushing the video to some users. For example, if the aging label of a video is 1 day and the video is uploaded to the server at 12:00 on January 1, then at 12:00 on January 2 the video is deleted from the list of videos to be recommended, that is, the server no longer actively pushes the video to clients.
Data extraction may be performed on the sample video to obtain multi-modal data of the sample video, that is, at least two kinds of modal data, where multi-modal data refers to data of multiple categories.
For example, the at least two modality data of the sample video may include: at least one of visual data (image data/image stream data composed of time-sequentially spliced multi-frame images), text data, and audio data.
The visual data may be data obtained by stitching at least two frames of images of the sample video. And/or, the visual data may be data obtained by stitching at least two frames of images extracted from the sample video by frame extraction. And/or, the visual data includes at least one frame of video picture of the sample video and the cover image of the sample video. And/or, the visual data includes at least one frame of video picture of the sample video, the cover image, and a merchandise image of a recommended merchandise associated with the sample video.
The text data includes at least one of: a first text obtained by automatic speech recognition (ASR) of the sample video, a second text obtained by optical character recognition (OCR) of the sample video, the title of the sample video, the classification of the sample video, the tags of the sample video, the author of the sample video, and the upload date of the sample video. The selected items are spliced in order to obtain the text data.
The audio data includes at least one piece of audio of the sample video. And/or, the audio data comprises: and splitting the audio of the sample video into at least one of human voice audio, background audio and noise audio. And/or the audio data includes audio data of at least one music associated with the sample video.
In addition to the data of the above modalities, the at least two kinds of modal data of the sample video may further include action data. A human body recognition model is used to perform human body recognition on at least two frames of video pictures of the sample video to obtain the human body regions in the video pictures. The human body regions in the video pictures are cropped out, and at least two human body regions are spliced in order according to the time of the corresponding video pictures to obtain at least two frames of action pictures. A human body key point recognition model is used to perform human body key point recognition on the at least two frames of action pictures (the key points are, for example, the joint points of a human body, or at least two joint points that can represent the human body posture), so as to obtain at least two frames of human body postures (a human body posture includes the human body key points and the connecting lines between them). The at least two frames of human body postures are spliced in time order to obtain the action data of the sample video.
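A minimal sketch of this action-data pipeline is shown below; `detect_person_boxes` and `estimate_keypoints` are hypothetical stand-ins for the human body recognition model and the human body key point recognition model, and the PIL-style `crop` call is likewise an assumption:

```python
def build_action_data(frames, detect_person_boxes, estimate_keypoints):
    """Sketch of the action-data construction described above: crop the human body
    regions from time-ordered video frames, estimate the key points of each region,
    and keep the resulting postures in temporal order."""
    postures = []
    for frame in frames:                            # frames are already in time order
        for box in detect_person_boxes(frame):      # human body regions in this frame
            region = frame.crop(box)                # cut out the human body region
            keypoints = estimate_keypoints(region)  # joint points representing the posture
            postures.append(keypoints)
    return postures                                 # time-ordered action data
```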
Alternatively, in addition to taking multimodal data of the sample video as a model input, user related data may be used as input for training. The video aging model thus trained can determine the video aging of the video for each user account according to the preference of different user accounts, and the same video has different video aging for different user accounts, for example, if a user is more interested in a certain type of video, the video aging of the type of video for the user can be longer. By adopting the method, the server can maintain a video list to be recommended for each user account, and the server recommends videos to the user account according to the video list to be recommended of each user account.
The user-related data may include user authorization: nickname, gender, age, interest preferences, favorite video types, video authors of interest.
In the training phase, the video aging model includes a network structure as shown in fig. 3: at least two feature extraction layers, a feature fusion layer connected to the output of each feature extraction layer, a full connection layer (FC layer) connected to the output of the feature fusion layer, and at least two FC layers respectively connected to the at least two feature extraction layers. The video aging model may also include a softmax layer connected to the output of the fully connected layer FC. One kind of modal data corresponds to one feature extraction layer, and the number of feature extraction layers is the same as the number of kinds of modal data.
For example, the at least two modality data include text data and visual data. The at least two feature extraction layers include a text feature extraction layer for extracting text features of the text data, a visual feature extraction layer for extracting visual features of the visual data. For another example, if the at least two modal data further includes audio data, the at least two feature extraction layers may further include an audio feature extraction layer for extracting audio features of the audio data (for example, the audio feature extraction layer may be implemented by using a VGGish network to perform audio feature extraction). For another example, if the at least two modality data further includes motion data, the at least two feature extraction layers may further include a motion feature extraction layer for extracting motion features of the motion data.
The feature extraction layer is used for extracting features of input data to obtain features. The feature extraction layer may employ any network structure for extracting data features. For example, one feature extraction layer may be implemented with at least one convolution layer, or one feature extraction layer may be implemented with at least one convolution layer, pooling layer, activation layer, or other network structure for extracting data features.
In an alternative embodiment, since visual data has one more time dimension than image data, visual feature extraction may use a 3D convolution or a 3D Transformer network to process the spatial and temporal dimensions; alternatively, 2D+1D convolution may be employed, with the 1D convolution processing temporal features and the 2D convolution processing spatial features. The visual feature extraction layer may use a 3D Swin Transformer network.
In an alternative embodiment, the text feature extraction layer may use a BERT (Bidirectional Encoder Representations from Transformers) network. The BERT network is built by stacking Transformer encoders.
In the application phase, the FC layer connected to each feature extraction layer in the training phase needs to be removed, i.e., the video aging model of the application phase includes the network structure shown in fig. 1: at least two feature extraction layers, a feature fusion layer connected to the output of each feature extraction layer, and a full connection layer (FC layer) connected to the output of the feature fusion layer. The video aging model may also include a softmax layer connected to the output of the fully connected layer FC.
Optionally, when the input data of the model further includes user related data, the video aging model may further include an account feature extraction layer, where the account feature extraction layer is also connected to the feature fusion layer. In the training stage, the account feature extraction layer is also connected with an FC layer. The account feature extraction layer is used for extracting account features from the related data of the user, and in the training stage, the FC layer connected with the account feature extraction layer outputs a user prediction result according to the account features, and trains the account feature extraction layer according to the user prediction result and the loss of the aging label. In the application stage, the account features output by the account feature extraction layer and at least two features output by other at least two feature extraction layers are input into a feature fusion layer together for feature fusion to obtain fusion features, and then a prediction result (a predicted aging label) is output.
And 220, inputting at least two modal features into a feature fusion layer to obtain fusion features.
And (2) inputting at least two modal features output by the at least two feature extraction layers obtained in the step (210) into a feature fusion layer for feature fusion to obtain fused features after fusion.
After the modal features of the plurality of modalities are obtained, the modal features of the plurality of modalities are fused. By way of example, the fusion approach may employ feature concatenation or cross-attention based fusion.
For example, as shown in fig. 3, the features output from the first, second, and third feature extraction layers … … and the nth feature extraction layer are input to the feature fusion layer 301, and fusion features are output.
And 230, outputting a fusion prediction result according to the fusion characteristics.
Illustratively, the FC layer is invoked to output a fusion prediction result according to the fusion characteristics. Or, calling the FC layer and the softmax layer to output a fusion prediction result according to the fusion characteristics. Or, calling the decoder to output a fusion prediction result according to the fusion characteristics.
And the fusion prediction result is an aging label of the sample video obtained by prediction according to the fusion characteristics obtained after the multi-mode characteristics are fused.
Step 240, outputting at least two modal prediction results according to at least two modal characteristics.
Illustratively, each feature extraction layer is followed by an FC layer. For example, as shown in fig. 3, the first feature extraction layer is followed by the FC1 layer, the second feature extraction layer is followed by the FC2 layer, the third feature extraction layer is followed by the FC3 layer … …, and the n-th feature extraction layer is followed by the FCn layer.
Optionally, a softmax layer may also be attached after each FC layer.
And predicting the aging label of the sample video according to each modal characteristic to obtain a modal prediction result. For example, outputting a first modality prediction result according to the first modality features output by the first feature extraction layer; outputting a second modal prediction result according to the second modal characteristics output by the second characteristic extraction layer; and outputting a third modal prediction result … … according to the third modal feature output by the third feature extraction layer, and outputting an nth modal prediction result according to the nth modal feature output by the nth feature extraction layer.
One modality prediction result corresponds to one feature extraction layer. The modal prediction results are used to calculate losses with real tags (aged tags) to train the respective feature extraction layers.
For example, a decoder may be connected after each feature extraction layer, and the decoder may output a mode prediction result according to the mode features output by the feature extraction layer.
Step 250, training at least two feature extraction layers and a feature fusion layer according to the loss of the aging label of the fusion prediction result and the sample video.
The aging label is the real label of the sample video; see the description of the aging label in step 210 for details. The video aging model is trained using the loss between the predicted fusion prediction result and the aging label.
And training model parameters of at least two feature extraction layers and the feature fusion layer according to fusion prediction results corresponding to the feature fusion layer and loss of a real tag (aging tag of a sample video). Optionally, training at least two feature extraction layers, a feature fusion layer and an FC layer connected with the feature fusion layer according to the fusion prediction result corresponding to the feature fusion layer and the loss of the real label.
Illustratively, the loss function may employ a CE loss (Cross Entropy Loss, cross entropy loss function).
For example, as shown in fig. 3, the loss of the fusion prediction result and the aging label is used to train the feature fusion layer 301, FC connected to the feature fusion layer, the first feature extraction layer, the second feature extraction layer, and the third feature extraction layer … … nth feature extraction layer.
Step 260, training at least two feature extraction layers according to the at least two modal prediction results and the loss of aging tags of the sample video.
Based on the loss function, each mode prediction result and the aging label of the sample video calculate a loss, and at least two mode prediction results can obtain at least two losses. And extracting a layer according to the characteristics corresponding to the loss training mode prediction result. Optionally, the penalty is also used to train the FC layer or decoder connected after its corresponding feature extraction layer.
Illustratively, the loss function may employ a CE loss (Cross Entropy Loss, cross entropy loss function).
For example, if the first modal prediction result is obtained based on the first feature output by the first feature extraction layer, a first loss of the first modal prediction result and the aging tag is calculated, and the first loss is used for training the first feature extraction layer.
For example, as shown in fig. 3, a first loss of the first prediction result and the aging label is used to train the first feature extraction layer and FC1, a second loss of the second prediction result and the aging label is used to train the second feature extraction layer and FC2, a third loss of the third prediction result and the aging label is used to train the third feature extraction layer and FC3, ..., and an nth loss of the nth prediction result and the aging label is used to train the nth feature extraction layer and FCn.
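A hedged sketch of one such training step is shown below, assuming PyTorch and a model like the earlier two-branch sketch that returns per-modality and fusion logits; it illustrates the loss layout described above rather than the patent's exact implementation:

```python
import torch.nn.functional as F

def training_step(model, optimizer, visual_data, text_data, aging_label):
    """One training step: the fusion loss trains all feature extraction layers and
    the feature fusion layer; each modality loss additionally constrains its own
    feature extraction layer and FC head."""
    outputs = model(visual_data, text_data)             # see the earlier model sketch
    loss_fusion = F.cross_entropy(outputs["fusion"], aging_label)
    loss_visual = F.cross_entropy(outputs["visual"], aging_label)  # first loss
    loss_text = F.cross_entropy(outputs["text"], aging_label)      # second loss
    loss = loss_fusion + loss_visual + loss_text
    optimizer.zero_grad()
    loss.backward()   # each loss only produces gradients for the layers it depends on
    optimizer.step()
    return loss.item()
```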
For example, after the training of the video aging model is completed, the video aging model may be used to predict the aging labels of videos. The multi-modal data of a video to be predicted are respectively input into the corresponding feature extraction layers for feature extraction to obtain a plurality of features, the plurality of features are input into the feature fusion layer for feature fusion to obtain a fusion feature, and the fusion feature is input into the FC layer and the softmax layer to output a prediction result, which is the aging label of the video to be predicted.
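For the application phase, a minimal usage sketch under the same assumptions (only the fusion branch output is kept):

```python
import torch

@torch.no_grad()
def predict_aging(model, visual_data, text_data):
    """Application-phase sketch: ignore the per-modality FC heads and take the
    argmax of the softmax over the fusion logits as the predicted aging label."""
    model.eval()
    logits = model(visual_data, text_data)["fusion"]
    probs = torch.softmax(logits, dim=-1)
    return probs.argmax(dim=-1)   # index of the predicted aging label
```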
In summary, in the method provided by this embodiment, the video aging model is trained to extract features from the multi-modal data of a video and to predict the video aging based on the fusion feature of the multi-modal features, which ensures the labeling quality of video aging. In addition, because the feature extraction layers of different modal data converge at different rates, in order to ensure that the feature extraction layer of each modality can converge to its optimal state, the features output by each feature extraction layer are also used to obtain a prediction result during training, and each feature extraction layer is trained according to the loss between its own prediction result and the real label. In this way, even after some feature extraction layers have converged, the remaining feature extraction layers are still trained with sufficient constraint, which improves the accuracy of feature extraction and thus the accuracy of video aging labeling by the video aging model.
Taking the case where the multi-modal data of the video includes visual data and text data as an example, a training method of the video aging model is given below.
FIG. 4 is a flowchart illustrating a method for training a video aging model according to an exemplary embodiment of the present application. The method may be performed by a computer device, for example, a terminal or server as shown in fig. 1. Based on the embodiment shown in fig. 2, step 210 is preceded by steps 201 and 202; step 210 includes step 211 and step 212; step 220 includes step 221; step 240 includes step 241 and step 242; step 260 includes step 261 and step 262. The method comprises the following steps.
Step 201, performing frame extraction processing on the sample video to obtain visual data composed of at least one frame of picture of the sample video.
Illustratively, the visual data is the visual information of the video, which is a sequence of video frames with temporal characteristics. The visual data is image stream data obtained by connecting at least one video frame of the sample video in time order. Alternatively, the visual data may be data obtained by splicing all video frames of the sample video in time order, or data obtained by splicing part of the video frames of the sample video in time order.
Because there is a lot of redundant information between frames in the original video frame sequence, which does not help model recognition much but reduces the efficiency of model inference, frame extraction is performed on the sample video to reduce the amount of visual data. For example, frames may be extracted at F frames per second, where F may vary with the size of the video aging model and the task. For example, 1 frame is extracted per second.
And performing frame extraction processing on the sample video according to a preset frame extraction frequency to obtain multi-frame pictures of the sample video, and splicing/stacking the multi-frame pictures according to a time sequence to obtain visual data.
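A minimal sketch of this frame-extraction step is shown below; the use of OpenCV and the default of one frame per second are assumptions for illustration:

```python
import cv2
import numpy as np

def extract_frames(video_path: str, frames_per_second: float = 1.0) -> np.ndarray:
    """Sample F frames per second from the video and stack them in temporal order."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / frames_per_second)), 1)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:          # keep one frame every `step` native frames
            frames.append(frame)
        index += 1
    capture.release()
    return np.stack(frames)            # (num_frames, H, W, C) visual data
```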
Step 202, generating text data based on at least one of the first text, the second text, the title of the sample video, the classification of the sample video, the tag of the sample video, the author of the sample video, and the upload date of the sample video.
The first text is text obtained by performing ASR on the sample video, and the second text is text obtained by performing OCR on the sample video. Optionally, the first text is text obtained by ASR of audio data of the sample video, and the second text is text obtained by OCR of visual data of the sample video.
The title of the sample video is a title which is formulated when the sample video author uploads the sample video, and the title can generally highly summarize the central idea of the sample video. The classification of the sample video may be an author selected classification or a server/background staff classification of the sample video. The tags of the sample video may be author-selected tags or server/background staff tags labeling the sample video. The author of the sample video includes author-related information such as author nickname, author locale, author gender, author age, author account creation time, etc.
Text data of the sample video is generated based on text content associated with the sample video. Illustratively, the at least one data is spliced to obtain text data.
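A small sketch of assembling the text data by splicing the available fields; the field order and the separator token are assumptions:

```python
def build_text_data(title=None, asr_text=None, ocr_text=None,
                    category=None, tags=None, author=None, upload_date=None,
                    separator="[SEP]"):
    """Concatenate whichever text fields are available, in a fixed order,
    to form the text data fed to the text feature extraction layer."""
    fields = [title, asr_text, ocr_text, category,
              " ".join(tags) if tags else None, author, upload_date]
    return f" {separator} ".join(str(field) for field in fields if field)
```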
Step 211, inputting the visual data of the sample video into a visual feature extraction layer to obtain visual features.
Illustratively, the at least two feature extraction layers include: a visual feature extraction layer that outputs visual features based on visual data of the sample video, and a text feature extraction layer that outputs text features based on text data of the sample video.
For example, as shown in fig. 5, the video aging model includes a visual feature extraction layer 401, a text feature extraction layer 402, a feature fusion layer 403, and an FC layer. The visual feature extraction layer 401 and the text feature extraction layer 402 are respectively connected with the input end of the feature fusion layer 403, and the output end of the feature fusion layer 403 is connected with the FC layer.
The visual feature extraction layer is used for extracting visual features in the visual data.
Alternatively, the network structure of the visual feature extraction layer may be arbitrarily set. For example, the visual feature extraction layer may employ a 3D swin transformer network.
And 212, inputting the text data of the sample video into a text feature extraction layer to obtain text features.
The text feature extraction layer is used for extracting text features in the text data.
Alternatively, the network structure of the text feature extraction layer may be arbitrarily set. For example, the text feature extraction layer may employ a BERT network. The input data of the text feature extraction layer is obtained by splicing the different types of text data (such as the title and the first text) with separator identifiers.
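As an illustration, a text feature could be extracted with a BERT encoder via the Hugging Face transformers library as sketched below; the library, the checkpoint name and the use of the [CLS] vector as the text feature are assumptions, not requirements of the patent:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

@torch.no_grad()
def extract_text_feature(text_data: str) -> torch.Tensor:
    """Encode the spliced text data and use the [CLS] vector as the text feature."""
    inputs = tokenizer(text_data, truncation=True, max_length=512, return_tensors="pt")
    outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]   # (1, hidden_size) text feature
```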
Step 221, inputting the visual feature and the text feature into a cross-attention feature fusion layer to obtain a fusion feature.
By way of example, the feature fusion layer may employ a cross-attention network. The cross-attention network allows the input sequences to come from different modalities, and can better align and fuse the input multi-modal data (the features of the multi-modal data).
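A minimal sketch of such a cross-attention fusion layer using PyTorch's multi-head attention is shown below; which modality serves as the query and which as the key/value, as well as the residual and pooling choices, are assumptions:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse text and visual token sequences: text tokens attend to visual tokens."""

    def __init__(self, feat_dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, text_tokens, visual_tokens):
        # query = text tokens, key/value = visual tokens
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        fused = self.norm(text_tokens + attended)   # residual connection + layer norm
        return fused.mean(dim=1)                    # pooled fusion feature
```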
And 230, outputting a fusion prediction result according to the fusion characteristics.
Step 241, outputting a visual prediction result according to the visual characteristics, wherein the visual prediction result comprises a prediction age of the sample video output according to the visual data.
Illustratively, the at least two modal predictions include: visual prediction results output based on visual features, and text prediction results output based on text features.
Illustratively, during the training phase, the FC1 layer and the softmax layer are connected after the visual feature extraction layer, the FC1 layer and the softmax layer being used to predict the aging signature (visual prediction result) of the sample video based on the visual features. For example, as shown in fig. 5, the visual features output by the visual feature extraction layer 401 are input to the FC1 layer and the softmax layer (not shown) to obtain visual prediction results.
Step 242, outputting a text prediction result according to the text characteristics, wherein the text prediction result comprises a prediction aging of the sample video output according to the text data.
Illustratively, in the training phase, the FC2 layer and the softmax layer are connected after the text feature extraction layer, the FC2 layer and the softmax layer being used to predict aging tags (text predictions) of the sample video based on the text features. For example, as shown in fig. 5, the text feature output by the text feature extraction layer 402 is input to the FC2 layer and the softmax layer (not shown) to obtain a text prediction result.
Step 250, training at least two feature extraction layers and a feature fusion layer according to the loss of the aging label of the fusion prediction result and the sample video.
Illustratively, as shown in fig. 5, model parameters of a feature fusion layer 403, an FC layer connected to the feature fusion layer, a visual feature extraction layer 401, and a text feature extraction layer 402 are trained according to the loss of the real label (aging label) of the fusion prediction result and the sample video.
The loss function may employ CE loss (cross entropy loss).
Step 261, training the visual feature extraction layer according to the visual prediction result and the first loss of the aging label of the sample video.
Illustratively, as shown in FIG. 5, a first loss of the visual prediction result and the real label (aged label) of the sample video is calculated, and the model parameters of the visual feature extraction layer 401 and the FC1 layer are trained according to the loss.
The loss function may adopt CE loss (cross-entropy loss) for supervision, so that after the feature extraction layer of a single modality converges, the feature extraction layers of the other modalities still receive sufficient supervision.
Because the feature extraction layers of different modes have different convergence rates, in order to enable the feature extraction layers to converge simultaneously as much as possible, the embodiment of the application also provides a method for dynamically adjusting the training gradient of the feature extraction layers of the modes, and the training gradient of the feature extraction layers of the modes is adjusted based on the difference of the prediction results of the modes.
For example, a first weighting coefficient may be calculated based on the difference between the visual prediction result and the text prediction result, and the training gradient of the visual feature extraction layer may be weighted with the first weighting coefficient; and calculating a second weighting coefficient according to the difference between the text prediction result and the visual prediction result, and weighting the training gradient of the text feature extraction layer by using the second weighting coefficient.
Namely, calculating a first gradient according to a first loss, wherein the first loss is the loss of the visual prediction result and the aging label; weighting the first gradient according to a first difference ratio to obtain a second gradient, wherein the first difference ratio is calculated according to the difference between the visual prediction result and the text prediction result; and carrying out gradient update on the model parameters of the visual characteristic extraction layer according to the second gradient.
Calculating a first ratio of the visual prediction result to the text prediction result; under the condition that the first ratio is greater than one, normalization modulation is carried out on the first ratio to obtain a first difference ratio; in the case where the first ratio is not greater than one, 1 is determined as the first difference ratio.
Normalization modulation includes: and normalizing the first ratio to obtain a first normalization result, and subtracting the first normalization result from 1 to obtain a first difference ratio.
The first difference ratio is calculated by the following formulas:

\rho^{v} = \frac{\sum \mathrm{softmax}(W_{v}\,\varphi_{v})}{\sum \mathrm{softmax}(W_{a}\,\varphi_{a})}

k^{v} = \begin{cases} 1 - \tanh(\rho^{v}), & \rho^{v} > 1 \\ 1, & \rho^{v} \le 1 \end{cases}

where v denotes the visual modality, a denotes the text modality, \rho^{v} is the first ratio corresponding to the visual prediction result, softmax() is the softmax function, tanh() is the tanh function, W_{v} denotes the model parameters of the FC1 layer connected to the visual feature extraction layer, W_{a} denotes the model parameters of the FC2 layer connected to the text feature extraction layer, \varphi_{v} is the visual feature output by the visual feature extraction layer, \varphi_{a} is the text feature output by the text feature extraction layer, \mathrm{softmax}(W_{v}\,\varphi_{v}) is the visual prediction result, \mathrm{softmax}(W_{a}\,\varphi_{a}) is the text prediction result, and k^{v} is the first difference ratio corresponding to the visual prediction result. That is, the first ratio is equal to the sum of the visual prediction results divided by the sum of the text prediction results.
That is, when the sum of the visual prediction results is greater than the sum of the text prediction results, the first difference ratio is a positive number smaller than 1, so the second gradient is smaller than the first gradient, which slows down the convergence of the visual feature extraction layer. When the sum of the visual prediction results is not greater than the sum of the text prediction results, the first difference ratio is equal to 1, so the visual feature extraction layer is trained according to the first gradient and converges at the normal speed.
The gradient update formula of the model parameters of the visual feature extraction layer is:

\theta_{v}' = \theta_{v} - \eta \cdot k^{v} \cdot g(\theta_{v})

where k^{v} is the first difference ratio corresponding to the visual modality, \eta is the learning rate, g(\theta_{v}) is the gradient value corresponding to the parameter \theta_{v} (calculated from the visual prediction result, the aging label and the loss function), \theta_{v} is a model parameter of the visual feature extraction layer before the update, and \theta_{v}' is the updated value of the model parameter \theta_{v}. That is, the updated model parameter of the visual feature extraction layer equals the model parameter before the update minus the product of the learning rate, the first difference ratio and the gradient.
Alternatively, the first ratio may be equal to a ratio of a sum of the accuracy rates of the visual predictors to a sum of the accuracy rates of the text predictors.
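A hedged sketch of this dynamic gradient modulation is shown below, assuming PyTorch and the two-branch model sketched earlier. The formulas above only state that the ratio is the sum of one modality's prediction results over the sum of the other's; here that sum is read as the summed softmax score at each sample's labelled class, which is one possible interpretation:

```python
import torch
import torch.nn.functional as F

def modulated_step(model, optimizer, visual_data, text_data, aging_label):
    """One training step with per-modality gradient modulation as described above."""
    outputs = model(visual_data, text_data)
    loss = (F.cross_entropy(outputs["fusion"], aging_label)
            + F.cross_entropy(outputs["visual"], aging_label)
            + F.cross_entropy(outputs["text"], aging_label))
    optimizer.zero_grad()
    loss.backward()

    with torch.no_grad():
        probs_v = torch.softmax(outputs["visual"], dim=-1)
        probs_t = torch.softmax(outputs["text"], dim=-1)
        # Assumption: "sum of prediction results" is read as the summed softmax
        # score at each sample's ground-truth aging label.
        s_v = probs_v.gather(1, aging_label.unsqueeze(1)).sum()
        s_t = probs_t.gather(1, aging_label.unsqueeze(1)).sum()
        rho_v, rho_t = s_v / s_t, s_t / s_v
        k_v = 1.0 - torch.tanh(rho_v) if rho_v > 1 else torch.ones(())  # first difference ratio
        k_t = 1.0 - torch.tanh(rho_t) if rho_t > 1 else torch.ones(())  # second difference ratio
        # scale the encoder gradients of whichever modality is currently ahead
        for p in model.visual_encoder.parameters():
            if p.grad is not None:
                p.grad.mul_(k_v)
        for p in model.text_encoder.parameters():
            if p.grad is not None:
                p.grad.mul_(k_t)
    optimizer.step()
    return loss.item()
```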
Step 262, training the text feature extraction layer according to the text prediction result and the second loss of the aging label of the sample video.
Illustratively, as shown in FIG. 5, a second penalty of text predictions and real tags (aged tags) of the sample video is calculated, and the model parameters of the text feature extraction layer 402 and the FC2 layer are trained based on the penalty.
The loss function may adopt CE loss (cross-entropy loss), so that even after the feature extraction layer of one modality has converged, the feature extraction layers of the other modalities still receive sufficient supervision.
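As a sketch of how the per-modality and fused losses described above could be computed (assuming FC1, FC2 and a fused classification head exist as separate modules; the names below are illustrative):

```python
import torch.nn.functional as F

def modality_losses(visual_feat, text_feat, fused_feat, labels, fc1, fc2, fc_fuse):
    """Cross-entropy (CE) losses of the visual branch, the text branch and the
    fused prediction against the aging label."""
    visual_logits = fc1(visual_feat)     # visual prediction result
    text_logits = fc2(text_feat)         # text prediction result
    fused_logits = fc_fuse(fused_feat)   # fusion prediction result

    first_loss = F.cross_entropy(visual_logits, labels)   # visual vs. aging label
    second_loss = F.cross_entropy(text_logits, labels)    # text vs. aging label
    fusion_loss = F.cross_entropy(fused_logits, labels)   # fusion vs. aging label
    return first_loss, second_loss, fusion_loss, visual_logits, text_logits
```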
The embodiment of the application also provides a method for dynamically adjusting the training gradient of the feature extraction layer of each mode, which adjusts the training gradient of the feature extraction layer of each mode based on the difference of the prediction results of each mode.
For example, a first weighting coefficient may be calculated based on the difference between the visual prediction result and the text prediction result, and the training gradient of the visual feature extraction layer may be weighted with the first weighting coefficient; and calculating a second weighting coefficient according to the difference between the text prediction result and the visual prediction result, and weighting the training gradient of the text feature extraction layer by using the second weighting coefficient.
Namely, calculating a third gradient according to a second loss, wherein the second loss is the loss of the text prediction result and the aging label; weighting the third gradient according to a second difference ratio to obtain a fourth gradient, wherein the second difference ratio is calculated according to the difference between the text prediction result and the visual prediction result; and carrying out gradient update on the model parameters of the text feature extraction layer according to the fourth gradient.
Calculating a second ratio of the text prediction result to the visual prediction result; under the condition that the second ratio is larger than 1, normalization modulation is carried out on the second ratio to obtain a second difference ratio; in the case where the second ratio is not greater than 1, 1 is determined as the second difference ratio.
Normalization modulation includes: and normalizing the second ratio to obtain a second normalization result, and subtracting the second normalization result from 1 to obtain a second difference ratio.
Illustratively, the second ratio and the second difference ratio are calculated by the following formulas:

$$\rho^{a}=\frac{\sum_{i}\operatorname{softmax}\!\left(W^{a}\varphi^{a}_{i}\right)_{y_i}}{\sum_{i}\operatorname{softmax}\!\left(W^{v}\varphi^{v}_{i}\right)_{y_i}},\qquad k^{a}=\begin{cases}1-\tanh\!\left(\rho^{a}\right), & \rho^{a}>1\\ 1, & \rho^{a}\le 1\end{cases}$$

where v denotes the visual modality and a denotes the text modality; ρ^a is the second ratio corresponding to the text prediction result; softmax() is the softmax function and tanh() is the tanh function; W^v denotes the model parameters of the FC1 layer connected to the visual feature extraction layer, and W^a denotes the model parameters of the FC2 layer connected to the text feature extraction layer; φ^v_i is the visual feature output by the visual feature extraction layer for sample i, and φ^a_i is the corresponding text feature output by the text feature extraction layer; softmax(W^v φ^v_i)_{y_i} is the visual prediction result and softmax(W^a φ^a_i)_{y_i} is the text prediction result for the labeled class y_i; k^a is the second difference ratio corresponding to the text prediction result. That is, the second ratio is equal to the sum of the text prediction results divided by the sum of the visual prediction results.
That is, when the sum of the text prediction results is greater than the sum of the visual prediction results, the second difference ratio is a positive number less than 1, thereby reducing the third gradient and slowing down the convergence speed of the text feature extraction layer. When the sum of the text prediction results is not greater than the sum of the visual prediction results, the second difference ratio is equal to 1, so that the text feature extraction layer is trained according to the third gradient and converges at the normal speed.
The gradient update formula of the model parameters of the text feature extraction layer is:

$$\theta^{a}_{t+1}=\theta^{a}_{t}-\eta\,k^{a}\,g\!\left(\theta^{a}_{t}\right)$$

where k^a is the second difference ratio corresponding to the text modality, η is the learning rate, g(θ^a_t) is the gradient value corresponding to the parameter θ^a_t (i.e., the third gradient, calculated from the text prediction result, the aging label and the loss function), θ^a_t is one model parameter of the text feature extraction layer before the update, and θ^a_{t+1} is the value of that model parameter after the update. That is, the updated model parameter of the text feature extraction layer = the model parameter before the update - the learning rate × the second difference ratio × the third gradient.
Alternatively, the second ratio may be equal to the ratio of the sum of the accuracies of the text prediction results to the sum of the accuracies of the visual prediction results.
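A minimal sketch of the modulated parameter update for both branches, assuming the first and second difference ratios k_v and k_a have been computed as sketched earlier (the second with the ratio inverted) and that the per-branch gradients were already populated by a backward pass over the losses; the layer and argument names are illustrative, not identifiers from the embodiment.

```python
import torch

def modulated_update(visual_layer, text_layer, k_v, k_a, lr):
    """Scale each branch's gradient by its difference ratio, then apply an
    SGD-style update: theta <- theta - lr * k * grad."""
    with torch.no_grad():
        for p in visual_layer.parameters():
            if p.grad is not None:
                p -= lr * k_v * p.grad   # second gradient = k_v * first gradient
        for p in text_layer.parameters():
            if p.grad is not None:
                p -= lr * k_a * p.grad   # fourth gradient = k_a * third gradient
```

In practice such a helper would be called once per training batch, after backward() and before the gradients are cleared.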
Optionally, after the video aging model is obtained by training with the above model training method, the video aging model can be applied to predict the aging label of a video. As shown in fig. 6, the visual data of the video to be predicted is input into the visual feature extraction layer 401 (a Swin Transformer network) for feature extraction to obtain visual features; the text data of the video to be predicted is input into the text feature extraction layer 402 (a BERT network) for feature extraction to obtain text features; the visual features and the text features are input into the feature fusion layer 403 (a cross-attention network) for feature fusion to obtain fusion features; and the fusion features are input into the FC layer and the softmax layer to obtain the final prediction result, which is the aging label of the video to be predicted.
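The inference flow of fig. 6 can be sketched as follows; the sub-module attribute names (visual_encoder, text_encoder, cross_attention, classifier) are assumptions made for illustration only.

```python
import torch

@torch.no_grad()
def predict_aging_label(model, frames, text_tokens):
    """Forward pass of the trained video aging model at prediction time."""
    visual_feat = model.visual_encoder(frames)        # e.g. a Swin Transformer
    text_feat = model.text_encoder(text_tokens)       # e.g. a BERT network
    fused_feat = model.cross_attention(visual_feat, text_feat)
    logits = model.classifier(fused_feat)             # FC layer
    probs = torch.softmax(logits, dim=-1)             # softmax layer
    return probs.argmax(dim=-1)                       # predicted aging label
```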
Optionally, the training method provided by the embodiments of the application can be used not only to train a video aging model that predicts aging labels, but also to train other video models that label videos in other dimensions. For example, the training method can be used to train video understanding projects such as a video classification model, a video content identification model, or a video low-quality identification model; by adopting this training method, multi-modal fusion features can be exploited to improve model performance.
In summary, according to the method provided by this embodiment, the video aging of a video is predicted based on the fusion features of its visual data and text data, which ensures the quality of video aging labeling. In addition, because the convergence speeds of the feature extraction layers corresponding to the visual data and the text data are inconsistent, in order to ensure that each feature extraction layer converges to an optimal state, a visual prediction result is obtained during training from the visual features output by the visual feature extraction layer, and the visual feature extraction layer is trained with the loss between the visual prediction result and the real label; likewise, a text prediction result is obtained from the text features output by the text feature extraction layer, and the text feature extraction layer is trained with the loss between the text prediction result and the real label. In this way, even after some feature extraction layers have converged, the remaining feature extraction layers still receive sufficient supervision during training, which improves the accuracy of feature extraction and, in turn, the accuracy of the video aging model in labeling video aging.

According to the method provided by this embodiment, the difference between the visual prediction result and the text prediction result is calculated, and the training gradients of the two feature extraction layers are dynamically adjusted according to that difference: when the prediction result of the visual feature extraction layer is more accurate, the gradient of the visual feature extraction layer is reduced and its convergence is slowed down; when the prediction result of the text feature extraction layer is more accurate, the gradient of the text feature extraction layer is reduced and its convergence is slowed down. By dynamically assigning different gradient coefficients to the feature extraction layers of different modalities according to the difference ratios of the two modalities during training, the optimization process of each modality is dynamically adjusted throughout training.

The method provided by this embodiment also gives a way to quantify the difference between the prediction results of the two modalities, and determines the gradient coefficient of each modality's feature extraction layer from the quantified difference, thereby dynamically adjusting the training gradient of each feature extraction layer, controlling the convergence speed of each modality's feature extraction layer, and enabling every feature extraction layer to converge to an optimal state.

The method provided by this embodiment is based on a multi-modal video aging classification framework. The proposed algorithm is not limited to aging classification projects; it can also be applied to video understanding projects such as video classification, high-quality video content identification and low-quality video identification, where the multi-modal fusion features can improve model performance. In actual experiments the effect was significant: the accuracy of the trained video aging model reached 90% and the recall rate reached 67.1%.
An embodiment of the application further provides a model training method in which, after pre-training based on large-scale data, the model is trained again using domain data and then fine-tuned using task-specific data. This solves the problem of domain shift that arises when a model trained on large-scale data is directly transferred to a specific task.
As shown in fig. 7, the method mainly comprises three stages: large scale pre-training 501; domain pre-training 502; downstream task 503.
In the large-scale pre-training stage, the model is trained with large-scale data, which comprise videos and labels of the videos. The labels can be labels of any dimension of the video, for example video-related information such as the classification of the video, the tags of the video, or the score of the video. In this stage, large-scale training enables the feature extraction layers of the model to preliminarily learn to extract video features. The feature extraction structure of the model trained in the large-scale pre-training stage is the same as that of the video aging model, i.e., several feature extraction layers corresponding respectively to several types of modal data, plus a feature fusion layer connected to these feature extraction layers. However, the prediction target of the model trained in the large-scale pre-training stage is the above-mentioned label, which may not be an aging label.

In the domain pre-training stage, domain data (videos and aging labels of the videos) are used; that is, the model is further trained in the field of video aging prediction. First, the feature extraction structure obtained in the large-scale pre-training stage is used to initialize the parameters of the feature extraction structure of the video aging model, and then the model is trained with the domain data to predict the aging labels of videos. The aging labels of the domain data used in this stage can be aging labels automatically annotated from video information (whose accuracy is relatively low).

Finally, in the downstream task stage, manually annotated data with accurate aging labels are used to fine-tune the video aging model from the domain pre-training stage, so that a video aging model with higher accuracy can be trained even when accurately labeled sample data are scarce.
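The three stages could be strung together roughly as below. This is a simplified sketch: it glosses over the fact that each stage uses its own classification head and label space and that only the feature extraction structure is carried over between stages (see steps 310 to 340 below); the epoch counts, learning rates, helper names and the model's forward signature are assumptions.

```python
import torch
import torch.nn.functional as F

def run_stage(model, dataloader, epochs, lr):
    """Generic supervised training loop shared by all three stages (sketch)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for visual_data, text_data, label in dataloader:
            logits = model(visual_data, text_data)
            loss = F.cross_entropy(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def three_stage_training(model, large_scale_loader, domain_loader, manual_loader):
    run_stage(model, large_scale_loader, epochs=10, lr=1e-3)  # 501: generic video labels
    run_stage(model, domain_loader, epochs=5, lr=1e-4)        # 502: automatic aging labels
    run_stage(model, manual_loader, epochs=3, lr=1e-5)        # 503: manual aging labels
    return model
```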
FIG. 8 is a flowchart illustrating a method for training a video aging model according to an exemplary embodiment of the present application. The method may be performed by a computer device, for example, a terminal or server as shown in fig. 1. The method comprises the following steps.
Step 310, invoking a first training sample set to train a first classification model, wherein samples in the first training sample set comprise videos and labels corresponding to the videos, the labels comprise classification labels of the videos in any classification dimension, and the first classification model comprises a first visual feature extraction layer, a first text feature extraction layer and a first feature fusion layer.
The first classification model is used for predicting a label corresponding to the video according to the input video, and the label may not be an aging label of the video. The label may be another, more readily available video label, such as a video category (short video, medium video, long video, movie, drama, variety show, comedy, action, documentary, and other video categories).
The feature extraction structure (including the feature extraction layer and the feature fusion layer) of the first classification model is the same as the video aging model, i.e., both include a visual feature extraction layer for extracting visual features from visual modality data, a text feature extraction layer for extracting text features from text modality data, and a feature fusion layer for fusing visual features and text features. And the visual feature extraction layer structure of the first classification model is the same as that of the video aging model, the text feature extraction layer structure of the first classification model is the same as that of the video aging model, and the feature fusion layer structure of the first classification model is the same as that of the video aging model.
The first classification model obtained by training on the first training sample set thus has a preliminary capability of extracting features from the multi-modal data of videos.
Step 320, based on the first visual feature extraction layer, the first text feature extraction layer and the first feature fusion layer, invoking a second training sample set to train a second classification model, wherein the samples in the second training sample set comprise videos and automatic aging labels of the videos, and the automatic aging labels are automatically marked according to aging related data of the videos; the second classification model comprises a second visual feature extraction layer obtained by training the first visual feature extraction layer, a second text feature extraction layer obtained by training the first text feature extraction layer and a second feature fusion layer obtained by training the first feature fusion layer.
And then, initializing the model parameters of the feature extraction structure in the second classification model by adopting the model parameters of the feature extraction structure in the first classification model. The feature extraction structure of the second classification model is also the same as that of the video aging model. The second classification model is used to predict age tags for the video based on the input video.
The second classification model adopts a second training sample set for training, and sample labels in the second training sample set are aging labels, and the aging labels can be rough labeling aging labels (the accuracy can be low). And training the second classification model by using a second training sample set, so that the feature extraction layer and the feature fusion layer pay attention to the features of the video in the aging field, and the feature extraction layer and the feature fusion layer are guided to extract the features of the video related to aging.
Illustratively, the aging tags in the second training sample set may be aging tags of the video that are automatically determined from some historical data of the video using an algorithm. For example, a time period from when a video is released to when the video click rate is less than a threshold is determined as an aging tag for the video based on historical click rate changes for the video.
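A minimal sketch of such an automatic labeling rule, under the assumption that a time-sorted click-rate history is available; the helper name and the data layout are illustrative.

```python
def automatic_aging_label(publish_time, click_history, threshold):
    """Return the time span from publication until the click rate first drops
    below `threshold`, used as a coarse (automatically annotated) aging label.

    click_history: iterable of (timestamp, click_rate) pairs sorted by time,
    with timestamps comparable to `publish_time` (e.g. UNIX seconds).
    """
    for timestamp, click_rate in click_history:
        if click_rate < threshold:
            return timestamp - publish_time
    return None  # never dropped below the threshold: treat as long-lived content
```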
The first classification model and the second classification model can also be trained by using the model training method provided by the embodiments of the application.
Step 330, initializing parameters of the visual feature extraction layer by using the model parameters of the second visual feature extraction layer; initializing parameters of the text feature extraction layer by using model parameters of the second text feature extraction layer; and initializing parameters of the feature fusion layer by using model parameters of the second feature fusion layer.
After training to obtain a second classification model, initializing parameters of the feature extraction structure of the video aging model by using model parameters of the feature extraction structure in the second classification model. Therefore, the feature extraction layer and the feature fusion layer of the video aging model have certain feature extraction capability.
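For illustration, this parameter initialization could look like the following, assuming both models expose their feature extraction structure through identically shaped sub-modules (the attribute names are assumptions):

```python
def init_from_second_model(aging_model, second_model):
    """Copy the feature-extraction structure of the second classification model
    into the video aging model before the final training stage."""
    aging_model.visual_encoder.load_state_dict(second_model.visual_encoder.state_dict())
    aging_model.text_encoder.load_state_dict(second_model.text_encoder.state_dict())
    aging_model.fusion_layer.load_state_dict(second_model.fusion_layer.state_dict())
```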
Step 340, invoking the artificial annotation training sample set to train the video aging model.
Then, the method of the embodiment shown in fig. 2 or fig. 4 is adopted to train the initialized video aging model by using the manually marked accurate training sample set, so as to obtain the video aging model. That is, in the embodiments shown in fig. 2 and 4, the aging label of the sample video is a manually labeled accurate aging label.
In summary, the method provided by this embodiment uses a multi-stage pre-training scheme and multi-modal fusion, drawing on both the video frames and the text features of the video, to further identify video content, so that the aging of a video can be judged more accurately and the corresponding life cycle length can be assigned to videos accurately. This improves users' consumption experience and the freshness of the content they consume, and helps attract and retain more users.
The following is an embodiment of the device according to the present application, and details of the embodiment of the device that are not described in detail may be combined with corresponding descriptions in the embodiment of the method described above, which are not described herein again.
Fig. 9 is a schematic structural diagram of a training device for a video aging model according to an exemplary embodiment of the present application. The apparatus may be implemented as all or part of a computer device by software, hardware, or a combination of both, the apparatus comprising:
the feature extraction module 602 is configured to input at least two types of modal data of a sample video into at least two feature extraction layers respectively, so as to obtain at least two modal features corresponding to the at least two types of modal data;
The feature fusion module 603 is configured to input the at least two modal features into a feature fusion layer to obtain a fusion feature;
a prediction module 604, configured to output at least two modal prediction results according to the at least two modal characteristics; outputting a fusion prediction result according to the fusion characteristics;
a training module 605, configured to train the at least two feature extraction layers according to the at least two modal prediction results and the loss of the aging label of the sample video; and training the at least two feature extraction layers and the feature fusion layer according to the fusion prediction result and the loss of the aging label of the sample video.
In an alternative embodiment, the at least two modal predictions include: visual prediction results output based on visual features, and text prediction results output based on text features;
the at least two feature extraction layers include: a visual feature extraction layer outputting visual features based on visual data of the sample video, and a text feature extraction layer outputting text features based on text data of the sample video;
the training module 605 is configured to train the visual feature extraction layer according to the visual prediction result and the first loss of the aging label of the sample video; and training the text feature extraction layer according to the text prediction result and the second loss of the aging label of the sample video.
In an alternative embodiment, the training module 605 is configured to calculate a first gradient according to the first loss, where the first loss is a loss of the visual prediction result and the aging label; weighting the first gradient according to a first difference ratio to obtain a second gradient, wherein the first difference ratio is calculated according to the difference between the visual prediction result and the text prediction result; and carrying out gradient update on the model parameters of the visual feature extraction layer according to the second gradient.
In an alternative embodiment, the training module 605 is configured to calculate a first ratio of the visual prediction result to the text prediction result; performing normalization modulation on the first ratio to obtain the first difference ratio under the condition that the first ratio is larger than one; in the case where the first ratio is not greater than one, 1 is determined as the first difference ratio.
In an alternative embodiment, the training module 605 is configured to calculate a third gradient according to the second loss, where the second loss is a loss of the text prediction result and the aging label; weighting the third gradient according to a second difference ratio to obtain a fourth gradient, wherein the second difference ratio is calculated according to the difference between the text prediction result and the visual prediction result; and carrying out gradient update on the model parameters of the text feature extraction layer according to the fourth gradient.
In an alternative embodiment, the training module 605 is configured to calculate a second ratio of the text prediction result to the visual prediction result; when the second ratio is greater than 1, performing normalization modulation on the second ratio to obtain the second difference ratio; in the case where the second ratio is not greater than 1, 1 is determined as the second difference ratio.
In an alternative embodiment, the feature extraction module 602 is configured to input visual data of the sample video into a visual feature extraction layer to obtain a visual feature;
the feature extraction module 602 is configured to input text data of the sample video into a text feature extraction layer to obtain text features;
the feature fusion module 603 is configured to input the visual feature and the text feature into a cross-attention feature fusion layer to obtain the fusion feature;
the prediction module 604 is configured to output a visual prediction result according to the visual feature, where the visual prediction result includes a prediction age of the sample video output according to the visual data;
the prediction module 604 is configured to output a text prediction result according to the text feature, where the text prediction result includes a prediction age of the sample video output according to the text data.
In an alternative embodiment, the at least two modality data includes visual data and text data; the apparatus further comprises:
the data processing module 601 is configured to perform frame extraction processing on the sample video to obtain the visual data composed of at least one frame of picture of the sample video;
the data processing module 601 is configured to generate the text data based on at least one of a first text, a second text, a title of the sample video, a classification of the sample video, a tag of the sample video, an author of the sample video, and an upload date of the sample video;
the first text is text obtained by performing Automatic Speech Recognition (ASR) on the sample video, and the second text is text obtained by performing Optical Character Recognition (OCR) on the sample video.
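A sketch of how the data processing module could assemble the two modal inputs; extract_frames, run_asr and run_ocr are hypothetical placeholders standing in for external tools (frame sampling, a speech recognition service, an OCR engine), and the metadata layout is likewise an assumption.

```python
def extract_frames(video_path, fps=1):
    """Placeholder: sample frames from the video (e.g. via ffmpeg/decord)."""
    raise NotImplementedError

def run_asr(video_path):
    """Placeholder: automatic speech recognition on the video's audio track."""
    raise NotImplementedError

def run_ocr(frames):
    """Placeholder: optical character recognition on the sampled frames."""
    raise NotImplementedError

def build_model_inputs(video_path, metadata, fps=1):
    """Build the visual data (sampled frames) and the text data (concatenated
    ASR text, OCR text and video metadata) for one sample video."""
    frames = extract_frames(video_path, fps=fps)

    parts = [
        run_asr(video_path),                 # first text: speech transcript
        run_ocr(frames),                     # second text: on-screen characters
        metadata.get("title", ""),
        metadata.get("category", ""),
        " ".join(metadata.get("tags", [])),
        metadata.get("author", ""),
        metadata.get("upload_date", ""),
    ]
    text = " [SEP] ".join(p for p in parts if p)
    return frames, text
```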
In an alternative embodiment, the aging label of the sample video is a manually labeled aging label;
the at least two feature extraction layers include a visual feature extraction layer and a text feature extraction layer, the apparatus further comprising:
the pre-training module 606 is configured to invoke a first training sample set to train a first classification model, where samples in the first training sample set include videos and labels corresponding to the videos, the labels include classification labels of the videos in any classification dimension, and the first classification model includes a first visual feature extraction layer, a first text feature extraction layer, and a first feature fusion layer;
The pre-training module 606 is configured to invoke a second training sample set to train a second classification model based on the first visual feature extraction layer, the first text feature extraction layer, and the first feature fusion layer, where samples in the second training sample set include video and an automatic aging tag of the video, and the automatic aging tag is automatically labeled according to aging related data of the video; the second classification model includes a second visual feature extraction layer obtained by training the first visual feature extraction layer, a second text feature extraction layer obtained by training the first text feature extraction layer, and a second feature fusion layer obtained by training the first feature fusion layer;
the pre-training module 606 is configured to perform parameter initialization on the visual feature extraction layer by using model parameters of the second visual feature extraction layer;
the pre-training module 606 is configured to perform parameter initialization on the text feature extraction layer by using model parameters of the second text feature extraction layer;
the pre-training module 606 is configured to initialize parameters of the feature fusion layer by using model parameters of the second feature fusion layer.
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. Specifically, the server 800 includes a central processing unit (CPU) 801, a system memory 804 including a random access memory (RAM) 802 and a read-only memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 also includes a basic input/output system (I/O system) 806 for facilitating the transfer of information between the various devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse, keyboard, or the like, for user account input information. Wherein both the display 808 and the input device 809 are connected to the central processing unit 801 via an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may also include an input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
Computer-readable media may include computer storage media and communication media without loss of generality. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 804 and the mass storage device 807 described above may be collectively referred to as memory.
According to various embodiments of the application, server 800 may also operate by a remote computer connected to the network through a network, such as the Internet. I.e., server 800 may be connected to a network 812 through a network interface unit 811 connected to the system bus 805, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 811.
The application also provides a terminal which comprises a processor and a memory, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to realize the training method of the video aging model provided by each method embodiment. It should be noted that the terminal may be a terminal as provided in fig. 11 below.
Fig. 11 shows a block diagram of a terminal 900 according to an exemplary embodiment of the present application. The terminal 900 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion picture expert compression standard audio plane 3), an MP4 (Moving Picture Experts Group Audio Layer IV, motion picture expert compression standard audio plane 4) player, a notebook computer, or a desktop computer. Terminal 900 may also be referred to as a user account device, portable terminal, laptop terminal, desktop terminal, and the like.
In general, the terminal 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 901 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement a method of training a video aging model provided by an embodiment of the method of the present application.
In some embodiments, the terminal 900 may further optionally include: a peripheral interface 903, and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 903 via buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 904, a display 905, a camera assembly 906, audio circuitry 907, and a power source 909.
The peripheral interface 903 may be used to connect at least one peripheral device associated with an I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 901, the memory 902, and the peripheral interface 903 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 904 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 904 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Illustratively, the radio frequency circuit 904 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, user account identity module cards, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication ) related circuits, which the present application is not limited to.
The display 905 is used to display a UI (User Interface, user account Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 905 is a touch display, the display 905 also has the ability to capture touch signals at or above the surface of the display 905. The touch signal may be input as a control signal to the processor 901 for processing. At this time, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 905 may be one, providing a front panel of the terminal 900; in other embodiments, the display 905 may be at least two, respectively disposed on different surfaces of the terminal 900 or in a folded design; in still other embodiments, the display 905 may be a flexible display disposed on a curved surface or a folded surface of the terminal 900. Even more, the display 905 may be arranged in an irregular pattern other than rectangular, i.e., a shaped screen. The display 905 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 906 is used to capture images or video. Illustratively, the camera assembly 906 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, the at least two rear cameras are any one of a main camera, a depth camera, a wide-angle camera and a tele camera, so as to realize that the main camera and the depth camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting and Virtual Reality (VR) shooting function or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The dual-color temperature flash lamp refers to a combination of a warm light flash lamp and a cold light flash lamp, and can be used for light compensation under different color temperatures.
The audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user account and the environment, converting the sound waves into electric signals and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. For purposes of stereo acquisition or noise reduction, the microphone may be plural and disposed at different portions of the terminal 900. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 907 may also include a headphone jack.
The power supply 909 is used to supply power to the various components in the terminal 900. The power supply 909 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power source 909 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 900 can further include one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 901 may control the display 905 to display the user account interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for the acquisition of motion data for games or user accounts.
The gyro sensor 912 may detect the body direction and the rotation angle of the terminal 900, and the gyro sensor 912 may collect the 3D motion of the user account to the terminal 900 in cooperation with the acceleration sensor 911. The processor 901 may implement the following functions according to the data collected by the gyro sensor 912: motion sensing (e.g., changing UI according to tilting operation of user account), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 913 may be provided at a side frame of the terminal 900 and/or at a lower layer of the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the holding signal of the user account to the terminal 900 can be detected, and the processor 901 performs the left-right hand recognition or the shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at the lower layer of the display 905, the processor 901 performs pressure operation on the display 905 according to the user account, thereby realizing control of the operability control on the UI interface. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 915 is used to collect the intensity of ambient light. In one embodiment, the processor 901 may control the display brightness of the display panel 905 based on the intensity of ambient light collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display luminance of the display screen 905 is turned up; when the ambient light intensity is low, the display luminance of the display panel 905 is turned down. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 based on the ambient light intensity collected by the optical sensor 915.
A proximity sensor 916, also referred to as a distance sensor, is typically provided on the front panel of the terminal 900. Proximity sensor 916 is used to collect the distance between the user account and the front of terminal 900. In one embodiment, when proximity sensor 916 detects a gradual decrease in the distance between the user account and the front face of terminal 900, processor 901 controls display 905 to switch from the bright screen state to the off screen state; when the proximity sensor 916 detects that the distance between the user account and the front surface of the terminal 900 gradually increases, the processor 901 controls the display 905 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 11 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.
The memory also includes one or more programs stored in the memory, the one or more programs including training methods for performing the video aging model provided by the embodiments of the present application.
The present application also provides a computer device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one section of program, a code set or an instruction set, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to implement the training method of the video aging model provided by the above method embodiments.
The application also provides a computer-readable storage medium, wherein at least one instruction, at least one section of program, a code set or an instruction set is stored in the storage medium, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by a processor to implement the training method of the video aging model provided by the above method embodiments.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the training method of the video aging model provided in the above-described alternative implementations.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc.

The foregoing is merely illustrative of the present application and is not intended to limit it; any modifications, equivalent replacements, improvements and the like made within the spirit and principles of the present application shall fall within its scope of protection.
It should be noted that, before and during the collection of user-related data, the present application may display a prompt interface or popup window, or output voice prompt information, to inform the user that relevant data is currently being collected. The present application starts to execute the relevant steps of obtaining user-related data only after obtaining the user's confirmation operation on the prompt interface or popup window; otherwise (i.e., when the user's confirmation operation on the prompt interface or popup window is not obtained), the relevant steps of obtaining user-related data are terminated, i.e., the user-related data is not obtained. In other words, all user data collected by the present application is collected with the consent and authorization of the user, and the collection, use and processing of relevant user data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

Claims (12)

1. A method for training a video aging model, the method comprising:
respectively inputting at least two modal data of a sample video into at least two feature extraction layers to obtain at least two modal features corresponding to the at least two modal data;
Inputting the at least two modal features into a feature fusion layer to obtain fusion features;
outputting at least two modal prediction results according to the at least two modal characteristics; outputting a fusion prediction result according to the fusion characteristics;
training the at least two feature extraction layers according to the at least two modal prediction results and the loss of the aging label of the sample video; and training the at least two feature extraction layers and the feature fusion layer according to the fusion prediction result and the loss of the aging label of the sample video.
2. The method of claim 1, wherein the at least two modal predictions include: visual prediction results output based on visual features, and text prediction results output based on text features;
the at least two feature extraction layers include: a visual feature extraction layer outputting visual features based on visual data of the sample video, and a text feature extraction layer outputting text features based on text data of the sample video;
training the at least two feature extraction layers according to the at least two modal predictions and the loss of aging tags of the sample video, comprising:
Training the visual feature extraction layer according to the visual prediction result and the first loss of the aging label of the sample video;
and training the text feature extraction layer according to the text prediction result and the second loss of the aging label of the sample video.
3. The method of claim 2, wherein training the visual feature extraction layer based on the visual prediction result and the first loss of the aging label of the sample video comprises:
calculating a first gradient according to the first loss, wherein the first loss is the loss of the visual prediction result and the aging label;
weighting the first gradient according to a first difference ratio to obtain a second gradient, wherein the first difference ratio is calculated according to the difference between the visual prediction result and the text prediction result;
and carrying out gradient update on the model parameters of the visual feature extraction layer according to the second gradient.
4. A method according to claim 3, characterized in that the method further comprises:
calculating a first ratio of the visual predictive result to the text predictive result;
performing normalization modulation on the first ratio to obtain the first difference ratio under the condition that the first ratio is larger than one;
In the case where the first ratio is not greater than one, 1 is determined as the first difference ratio.
5. The method of claim 2, wherein training the text feature extraction layer based on the text prediction result and a second loss of the aging tags of the sample video comprises:
calculating a third gradient according to the second loss, wherein the second loss is the loss of the text prediction result and the aging label;
weighting the third gradient according to a second difference ratio to obtain a fourth gradient, wherein the second difference ratio is calculated according to the difference between the text prediction result and the visual prediction result;
and carrying out gradient update on the model parameters of the text feature extraction layer according to the fourth gradient.
6. The method of claim 5, wherein the method further comprises:
calculating a second ratio of the text prediction result to the visual prediction result;
when the second ratio is greater than 1, performing normalization modulation on the second ratio to obtain the second difference ratio;
in the case where the second ratio is not greater than 1, 1 is determined as the second difference ratio.
7. The method according to any one of claims 1 to 6, wherein at least two modal data of the sample video are respectively input into at least two feature extraction layers to obtain at least two modal features corresponding to the at least two modal data; inputting the at least two modal features into a feature fusion layer to obtain fusion features; outputting at least two modal prediction results according to the at least two modal characteristics, including:
inputting the visual data of the sample video into a visual feature extraction layer to obtain visual features;
inputting the text data of the sample video into a text feature extraction layer to obtain text features;
inputting the visual features and the text features into a cross-attention feature fusion layer to obtain the fusion features;
outputting a visual prediction result according to the visual characteristics, wherein the visual prediction result comprises a prediction age of the sample video output according to the visual data;
and outputting a text prediction result according to the text characteristics, wherein the text prediction result comprises the prediction aging of the sample video output according to the text data.
8. The method of any one of claims 1 to 6, wherein the at least two modality data includes visual data and text data; the method further comprises the steps of:
Performing frame extraction processing on the sample video to obtain the visual data consisting of at least one frame of picture of the sample video;
generating the text data based on at least one of a first text, a second text, a title of the sample video, a category of the sample video, a tag of the sample video, an author of the sample video, and an upload date of the sample video;
the first text is text obtained by performing Automatic Speech Recognition (ASR) on the sample video, and the second text is text obtained by performing Optical Character Recognition (OCR) on the sample video.
9. The method of any one of claims 1 to 6, wherein the aging label of the sample video is a manually labeled aging label;
the at least two feature extraction layers include a visual feature extraction layer and a text feature extraction layer, the method further comprising:
invoking a first training sample set to train a first classification model, wherein samples in the first training sample set comprise videos and labels corresponding to the videos, the labels comprise classification labels of the videos in any classification dimension, and the first classification model comprises a first visual feature extraction layer, a first text feature extraction layer and a first feature fusion layer;
Invoking a second training sample set to train a second classification model based on the first visual feature extraction layer, the first text feature extraction layer and the first feature fusion layer, wherein the samples in the second training sample set comprise videos and automatic aging labels of the videos, and the automatic aging labels are automatically marked according to aging related data of the videos; the second classification model comprises a second visual feature extraction layer obtained by training the first visual feature extraction layer, a second text feature extraction layer obtained by training the first text feature extraction layer and a second feature fusion layer obtained by training the first feature fusion layer;
initializing parameters of the visual feature extraction layer by using model parameters of the second visual feature extraction layer;
initializing parameters of the text feature extraction layer by using model parameters of the second text feature extraction layer;
and initializing parameters of the feature fusion layer by using the model parameters of the second feature fusion layer.
10. A training device for a video aging model, the device comprising:
the feature extraction module is used for respectively inputting at least two modal data of the sample video into at least two feature extraction layers to obtain at least two modal features corresponding to the at least two modal data;
The feature fusion module is used for inputting the at least two modal features into a feature fusion layer to obtain fusion features;
the prediction module is used for outputting at least two modal prediction results according to the at least two modal characteristics; outputting a fusion prediction result according to the fusion characteristics;
the training module is used for training the at least two feature extraction layers according to the at least two modal prediction results and the loss of the aging label of the sample video; and training the at least two feature extraction layers and the feature fusion layer according to the fusion prediction result and the loss of the aging label of the sample video.
11. A computer device, the computer device comprising: a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set, the at least one instruction, the at least one program, the code set or instruction set being loaded and executed by the processor to implement the method of training a video aging model according to any one of claims 1 to 9.
12. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the method of training the video aging model of any of claims 1 to 9.