CN114332690A - Model generation method, video processing method and equipment - Google Patents

Model generation method, video processing method and equipment

Info

Publication number
CN114332690A
CN114332690A
Authority
CN
China
Prior art keywords
video
feature
processing
model
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111531880.1A
Other languages
Chinese (zh)
Inventor
陈思宇
康力
邓俊祺
王立波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202111531880.1A priority Critical patent/CN114332690A/en
Publication of CN114332690A publication Critical patent/CN114332690A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

Embodiments of this application provide a model generation method, a video processing method, and a device. A plurality of image frames are extracted from a video to be processed and input into a video processing model. The video processing model comprises a feature extractor and at least one computation module; the feature extractor comprises a plurality of feature extraction modules, each extracted from a different image processing model. A first video feature corresponding to the video to be processed is obtained based on the frame features that the feature extraction modules respectively extract from the image frames, and the first video feature is processed by the at least one computation module to obtain at least one processing result. The technical solution provided by the embodiments of this application improves the accuracy of video processing results.

Description

Model generation method, video processing method and equipment
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a model generation method, a video processing method and equipment.
Background
In the field of data processing, processing of video data is often involved, such as video classification, feature calculation, and the like, so that corresponding operations can be performed based on the processing result of the video data. At present, a machine learning model, such as a neural network model, is usually used to process video data, and the accuracy of a video processing result is affected by a model structure, a training mode, and the like. How to provide a video processing model with more accurate video processing results becomes a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The embodiment of the application provides a model generation method, a video processing method and video processing equipment, which are used for improving the accuracy of a video processing result.
In a first aspect, an embodiment of the present application provides a video processing method, including:
extracting a plurality of image frames from a video to be processed;
inputting the plurality of image frames into a video processing model; wherein the video processing model comprises a feature extractor and at least one computation module; the feature extractor comprises a plurality of feature extraction modules; the plurality of feature extraction modules are respectively extracted from a plurality of image processing models;
obtaining first video features corresponding to the video to be processed based on frame features respectively extracted from the plurality of image frames by the plurality of feature extraction modules;
and processing the first video features by utilizing the at least one computation module respectively to obtain at least one processing result.
In a second aspect, an embodiment of the present application provides a model generation method, including:
determining a plurality of image processing models;
respectively extracting feature extraction modules from the plurality of image processing models to obtain a plurality of feature extraction modules;
connecting the plurality of feature extraction modules in parallel to construct a feature extractor;
connecting the output of the feature extractor with a computing module of at least one video task to construct a video processing model;
keeping the model parameters of the feature extractor unchanged, and training the video processing model by using the training sample of the at least one video task; the video processing model is used for processing the video to be processed to obtain processing results respectively corresponding to the at least one video task.
In a third aspect, an embodiment of the present application provides a model generation method, including:
determining a plurality of image processing models;
respectively extracting feature extraction modules from the plurality of image processing models to obtain a plurality of feature extraction modules;
respectively connecting the outputs of the plurality of feature extraction modules with a feature fusion module to construct a feature extractor;
the feature extractor is used for extracting video features of a video to be processed; the feature extraction modules respectively extract frame features from the image frames corresponding to the video to be processed; and the feature fusion module is used for fusing the frame features respectively extracted by the feature extraction modules to obtain the video features.
In a fourth aspect, an embodiment of the present application provides a computing device, which includes a storage component and a processing component, where the storage component stores one or more computer instructions for being called and executed by the processing component to implement the video processing method according to the first aspect, or implement the model generation method according to the second aspect, or implement the model generation method according to the third aspect.
In the embodiments of this application, feature extraction modules are extracted from a plurality of image processing models and combined into a feature extractor, and the feature extractor is connected with the computation module of at least one video task to construct a video processing model. During model training, the model parameters of the feature extractor are kept unchanged and only the model parameters of the computation modules are trained, so that the video processing model can process a video to be processed and obtain processing results corresponding to the at least one video task. Because the feature extractor is built from the feature extraction modules of a plurality of image processing models, and the video features are composed of the features extracted by these modules, the accuracy of video feature extraction can be remarkably improved, and therefore the accuracy of the video processing result can be improved.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a block diagram illustrating an embodiment of a data processing system provided herein;
FIG. 2 illustrates a flow diagram of one embodiment of a model generation method provided herein;
FIG. 3a is a schematic diagram illustrating a structure of a feature extractor in a practical application according to an embodiment of the present application;
FIG. 3b is a schematic structural diagram of a video processing model in a practical application according to the embodiment of the present application;
FIG. 3c is a schematic diagram showing a video processing model in yet another practical application of the embodiment of the present application;
FIG. 3d is a schematic diagram showing a video processing model in yet another practical application of the embodiment of the present application;
FIG. 3e is a schematic structural diagram of a merging model in a practical application according to the embodiment of the present application;
FIG. 3f is a schematic diagram of an integrated process model in a practical application according to an embodiment of the present application;
FIG. 4 illustrates a flow chart of yet another embodiment of a model generation method provided herein;
FIG. 5 is a flow diagram illustrating one embodiment of a video processing method provided herein;
FIG. 6a is a flow chart illustrating a further embodiment of a video processing method provided by the present application;
FIG. 6b is a schematic diagram illustrating scene interaction in a practical application according to the embodiment of the present application;
FIG. 7 is a schematic diagram illustrating an embodiment of a model generation apparatus provided herein;
FIG. 8 illustrates a schematic structural diagram of one embodiment of a computing device provided herein;
FIG. 9 is a schematic diagram illustrating an embodiment of a video processing apparatus provided in the present application;
FIG. 10 is a block diagram illustrating one embodiment of a computing device provided herein.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In some of the flows described in the specification, claims, and figures of this application, a number of operations appear in a particular order. It should be clearly understood, however, that these operations may be performed out of the order in which they appear herein, or in parallel. Operation numbers such as 101 and 102 are merely used to distinguish different operations and do not themselves represent any order of execution. In addition, the flows may include more or fewer operations, which may be performed sequentially or in parallel. It should also be noted that the descriptions "first", "second", and so on herein are used to distinguish different messages, devices, modules, and the like; they neither represent a sequential order nor require that "first" and "second" be of different types.
The technical solution of the embodiments of this application can be applied to video data processing scenarios, which are becoming increasingly common with the development of artificial intelligence, Internet technology, and the like.
As described in the background, machine learning models such as neural network models are currently used for video processing. The inventors found that when a video processing model performs video processing, it first extracts video features and then performs the corresponding processing based on those features; the video features are therefore a key factor affecting the accuracy of the video processing result, and if the accuracy of video feature extraction can be improved, the accuracy of the video processing result can also be improved. The inventors further found that improving video feature extraction requires improving the feature extraction part of the model. Taking a neural network model as an example, feature extraction is usually performed by multiple network layers, and the accuracy of video feature extraction can be improved by increasing the number of network layers and the like.
In view of the above findings, the inventors conducted a series of studies and proposed the technical solution of this application. In the embodiments, a feature extractor is constructed from the feature extraction modules of a plurality of image processing models and used to extract video features; because the video features are composed of features extracted by the feature extraction modules of multiple image processing models, the accuracy of video feature extraction can be significantly improved, and thus the accuracy of the video processing result can be improved. When multiple video tasks are involved, they can share the feature extractor, which reduces the complexity of model training compared with training a separate model for each video task. Because the feature extractor guarantees the accuracy of the video processing result, the computation module can adopt a lightweight structure while still maintaining that accuracy, which reduces the amount of computation, increases the computation speed, and reduces the impact on the performance of the device running the model. In practical scenarios where the processing result of the video processing model is used to perform a corresponding operation, for example recommending audio data for a video to be processed, the accuracy of that operation, such as the accuracy of audio data recommendation, can also be improved.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The technical solution of the embodiment of the present application may be applied to the data processing system as described in fig. 1, and the data processing system may include, for example, a first server 101, a second server 102, a client 103, and the like. The first server 101 may perform construction, training, and the like of the video processing model, and the trained video processing model may be deployed on the second server 102 or in the client 103.
When the video processing model is deployed in the second server 102, the second server 102 may process the video to be processed based on the video processing model according to the video processing request sent by the client 103.
In the case where the video processing model is deployed in the client 103, the client 103 may directly process the video to be processed based on the video processing model, and the like.
In addition, in practical applications, the first server 101 and the second server 102 may be the same server. Namely, the same server can complete the training of the video processing model, the processing of the video to be processed and the like.
In addition, the video processing using the video processing model may be generally executed by the second server 102, or may be executed by the client 103, which is not limited in the present application.
The server mentioned above may refer to hardware or software, and when the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or implemented as a single server, or implemented as a cloud server, or implemented as an intelligent cloud computing server or an intelligent cloud host with an artificial intelligence technology. When the server is software, the server can be implemented as a plurality of software modules, or as a single software module.
The client 103 may generally be understood as an application program deployed in an electronic device, for example one or more of a smart phone, a tablet computer, and a portable computer, or of course a desktop computer; for ease of understanding, the client is represented in FIG. 1 by a device image. Various other types of applications may also be installed in the electronic device, such as search or instant messaging applications. Of course, the client 103 may also refer to a browser, a web application such as an H5 (HTML5, HyperText Markup Language version 5) application, or a light application (also referred to as an applet), and the like. This is not specifically limited in this application.
The details of implementation of the technical solution of the embodiments of the present application are set forth in the following.
Fig. 2 is a flowchart of an embodiment of a model generation method provided in an embodiment of the present application, where a technical solution of this embodiment may be executed by a first server or a second server in the system shown in fig. 1, for example, the method may include the following steps:
201: a plurality of image processing models is determined.
Wherein the plurality of image processing models may correspond to a plurality of image tasks.
The plurality of image tasks may include one or more of an image classification task, an object detection task, and an image segmentation task. The image classification tasks may include, for example, background classification tasks, such as identifying the background of mountains, rivers, cities, etc. in the image; or a target object classification task, such as identifying cats, dogs, people, cars, etc. in the image; the target detection task may include, for example, a face detection task, such as outputting an area image where a face position is located; the image segmentation task may be, for example, predicting a category of an object to which each pixel belongs, or obtaining an image of a specific region by segmentation.
Alternatively, the image tasks corresponding to the plurality of image processing models may be different, and may be used to implement different image processing, respectively.
The plurality of image processing models may be obtained by using an existing model or training. Thus, optionally, determining a plurality of image processing models comprises:
training a plurality of image processing models by utilizing training samples corresponding to a plurality of image tasks respectively; the plurality of image tasks include one or more of an image classification task, an object detection task, and an image segmentation task.
That is, the training sample corresponding to each image task is used to train and obtain an image processing model corresponding to the image task, and a plurality of image processing models can be trained and obtained for a plurality of image tasks.
The training sample corresponding to each image task may be composed of a sample image and label data corresponding to the sample image and belonging to the image task, for example, for an image classification task, the label data may refer to a specific category corresponding to the sample image.
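As a non-authoritative illustration of what such a training sample might look like (the field names and values are hypothetical, not taken from the patent):

```python
# A hypothetical training sample for an image classification task: a sample image
# paired with label data giving the specific category of that image.
classification_sample = {
    "image": "images/sample_0001.jpg",  # sample image (path is made up for illustration)
    "label": "cat",                     # label data: the category corresponding to the sample image
}
```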
202: and extracting the feature extraction modules from the plurality of image processing models respectively to obtain a plurality of feature extraction modules.
203: a plurality of feature extraction modules are connected in parallel to construct a feature extractor.
204: the output of the feature extractor is connected to a computation module of at least one video task to build a video processing model.
When an image processing model performs image processing, it extracts image features from the input image and then performs the image processing based on those features; the image features are usually a vector that can represent the image content. When a model is used for video processing, frame features need to be extracted from the image frames of the video data and the video processing is performed based on those frame features, and frame features are themselves image features. The inventors therefore realized that, in the embodiments of this application, the feature extraction module of each image processing model can be extracted, and the feature extraction modules of the plurality of image processing models can be connected in parallel to construct a feature extractor. Connecting the feature extractor with the computation module of at least one video task then yields a video processing model. Each feature extraction module can be used to extract frame features from the image frames of the video data, and the feature extractor can fuse the frame features extracted by the plurality of feature extraction modules, so that a video feature representing the video content is obtained.
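A minimal PyTorch-style sketch of this construction follows; it is one illustrative reading of the description, not the patent's actual implementation, and all class names, shapes, and task names are assumptions.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Feature extraction modules taken from several image processing models, run in parallel."""
    def __init__(self, extraction_modules):
        super().__init__()
        self.extraction_modules = nn.ModuleList(extraction_modules)

    def forward(self, frames):                # frames: (batch, f, C, H, W)
        b, f = frames.shape[:2]
        flat = frames.flatten(0, 1)           # treat every frame as an image: (batch*f, C, H, W)
        fused = []
        for module in self.extraction_modules:
            frame_feats = module(flat).flatten(1)                 # frame features, (batch*f, d_k)
            fusion_feat = frame_feats.view(b, f, -1).mean(dim=1)  # time-domain mean per sub-extractor
            fused.append(fusion_feat)
        return torch.cat(fused, dim=-1)       # first video feature

class VideoProcessingModel(nn.Module):
    """Shared feature extractor connected to the computation module(s) of one or more video tasks."""
    def __init__(self, feature_extractor, task_modules):
        super().__init__()
        self.feature_extractor = feature_extractor
        self.task_modules = nn.ModuleDict(task_modules)   # e.g. {"category": ..., "signature": ...}

    def forward(self, frames):
        video_feature = self.feature_extractor(frames)
        return {name: module(video_feature) for name, module in self.task_modules.items()}
```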
205: and keeping the model parameters of the feature extractor unchanged, and training a video processing model by using the training sample of at least one video task.
The video processing model can be used for processing the video to be processed to obtain processing results respectively corresponding to the at least one video task. The feature extractor may be configured to extract a first video feature of the video to be processed. The feature extraction modules may be configured to extract frame features corresponding to a plurality of image frames of the video to be processed, and the first video feature may be obtained by fusing the frame features respectively extracted by each feature extraction module.
In the video processing model provided by the embodiments of this application, the feature extractor may be connected with the computation modules of at least one video task. This includes the case of connecting the computation module of a single video task, so that the video processing operation of that task can be performed on the video to be processed, and the case of connecting the computation modules of a plurality of video tasks, so that the video processing operations of multiple tasks can be performed. When the model includes the computation modules of a plurality of video tasks, the tasks share one feature extractor, which reduces the amount of computation while ensuring video processing accuracy and improves the model's response speed and video processing efficiency.
Because the feature extractor is constructed from a plurality of feature extraction modules that are extracted from different image processing models, whose image tasks may differ, the feature extractor effectively extracts features of different types, making the resulting features more accurate. To reduce training complexity and training cost, when the video processing model is trained, the model parameters of the feature extractor can be kept unchanged and only the model parameters of the computation module of each video task are trained.
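For example, freezing the feature extractor and optimizing only the computation modules might look like the following sketch; it assumes the hypothetical VideoProcessingModel structure from the earlier sketch and an assumed task name, and is not the patent's code.

```python
import torch
import torch.nn as nn

def build_optimizer_with_frozen_extractor(model, lr=1e-4):
    """Keep the feature extractor's model parameters unchanged; train only the computation modules."""
    for p in model.feature_extractor.parameters():
        p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

def train_step(model, optimizer, frames, labels, task_name="category"):
    """One illustrative training step for a classification-style video task."""
    logits = model(frames)[task_name]
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```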
When the image processing model is a neural network model, it generally consists of an input layer, at least one intermediate layer, and an output layer, and each intermediate layer generates features corresponding to the image. The closer an intermediate layer is to the output layer, the less accurately its features express the general image content and the more closely they reflect the specific image processing result. Therefore, the feature extraction modules can be obtained by removing a corresponding number of network layers from the end of each image processing model, that is, removing the last several network layers starting from the output layer, with the remaining network layers forming the feature extraction module.
Optionally, the extracting the feature extraction modules from the plurality of image processing models respectively, and obtaining the plurality of feature extraction modules may include: and respectively removing the network layers with the corresponding layers from the tail ends of the image processing models according to the task types respectively corresponding to the image processing models to obtain a plurality of feature extraction modules.
That is, the network layers to be removed or retained can be chosen according to the task type. For example, for an image classification task, only the output layer may need to be removed; for an image segmentation task, since the output of the image processing model is itself an image, the features of intermediate layers close to the output layer are less pure and larger in data volume, so to avoid data redundancy the network layers from the output layer back to an intermediate position may be removed, retaining the input layer and the first several network layers.
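A sketch of such a truncation for a classification backbone is shown below; the use of torchvision and the number of layers removed are assumptions chosen only to illustrate the idea.

```python
import torch.nn as nn
from torchvision import models

def extraction_module_from_classification_model(backbone=None):
    """Remove the final classification layer and keep the rest as a feature extraction module."""
    backbone = backbone or models.resnet18(weights=None)
    # children() yields the conv/pool/residual blocks, average pooling, and fc layer in order;
    # dropping the last (fc) layer leaves a module that outputs image features.
    return nn.Sequential(*list(backbone.children())[:-1])
```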
Further, to facilitate computation, in some embodiments, connecting multiple feature extraction modules in parallel to construct a feature extractor may comprise:
respectively connecting the output of each of the plurality of feature extraction modules with a feature fusion module to obtain a plurality of sub-extractors; each feature fusion module is used for fusing the plurality of frame features output by the feature extraction module connected with it to obtain a fusion feature;
connecting the outputs of the plurality of sub-extractors to a feature connection module to obtain the feature extractor; the feature connection module is used for fusing the plurality of fusion features output by the plurality of sub-extractors to obtain the first video feature.
When video data is processed, each feature extraction module is respectively used for extracting the frame features of a plurality of image frames corresponding to the video data, so that one feature extraction module can output a plurality of frame features. In order to reduce the computational complexity, a feature fusion module can perform fusion processing on a plurality of frame features output by the feature extraction module to obtain a fusion feature.
The fusion processing may, for example, compute a time-domain mean. Since each feature is usually a vector, the time-domain mean can be computed per dimension over the feature vectors, yielding a multidimensional fusion feature. For ease of understanding, assume that $f$ image frames are input to a feature extraction module, producing $f$ frame features denoted as $\{x_1, x_2, \ldots, x_f\}$, where each frame feature is a $d$-dimensional vector $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$. The fusion feature corresponding to each dimension $j$ can then be computed as:

$$\bar{x}_j = \frac{1}{f} \sum_{i=1}^{f} x_{i,j}, \quad j = 1, \ldots, d$$
Of course, besides calculating the time-domain mean, other fusion modes may be adopted, such as a weighted mean or a sum, which is not specifically limited in this application.
Due to the fact that the plurality of sub-extractors exist, each sub-extractor can obtain one fusion feature, in order to further reduce the computational complexity, the plurality of sub-extractors can be connected to one feature connection module together, and the feature connection module can further fuse the plurality of fusion features corresponding to the plurality of sub-extractors, so that the first video feature capable of representing video content is obtained finally.
The feature connection module fuses the plurality of fusion features, for example by concatenating them: if each fusion feature is a D-dimensional vector and there are 3 fusion features, concatenation yields a 3×D-dimensional first video feature.
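In this reading, the fusion and connection steps reduce to a temporal mean followed by concatenation; a short sketch (function names are illustrative):

```python
import torch

def fuse_frame_features(frame_features):
    """Feature fusion module: time-domain mean over f frame features of dimension d."""
    # frame_features: tensor of shape (f, d) -> fusion feature of shape (d,)
    return frame_features.mean(dim=0)

def connect_fusion_features(fusion_features):
    """Feature connection module: concatenate the fusion features of all sub-extractors."""
    # e.g. 3 fusion features of dimension D -> first video feature of dimension 3*D
    return torch.cat(fusion_features, dim=-1)
```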
For further understanding, in the schematic structural diagram of the feature extractor shown in fig. 3a, the feature extractor may include a feature extraction module 301, a feature fusion module 302, and a feature connection module 303. A plurality of image frames may be input into a feature extraction module 301, so as to obtain a plurality of frame features, the plurality of frame features are processed by a feature fusion module 302 to obtain a fusion feature, and the plurality of fusion features are processed by a feature connection module 303 to obtain a first video feature.
In an actual scenario, at least one video task may include a plurality of video tasks, and therefore step 204 may be to connect the output of the feature extractor to the computing modules of the plurality of video tasks respectively to construct a video processing model.
Therefore, the video processing model can be specifically used for processing the video to be processed to obtain the processing results respectively corresponding to the plurality of video tasks.
The training of the video processing model may be to keep the model parameters of the feature extractor unchanged, and train the video processing model using a training sample of at least one video task.
In order to further improve the accuracy of the video processing result, as an optional implementation manner, the computing module may include a feature adapter and a video processor; in addition, the computing module for certain specific video tasks may also include at least one auxiliary processor.
Thus, training the video processing model using the training samples of the at least one video task while keeping the model parameters of the feature extractor unchanged may comprise:
keeping the model parameters of the feature extractor unchanged, and inputting training samples of the video tasks into the feature extractor to obtain first video features aiming at each video task;
inputting the first video characteristics into a characteristic adaptation module corresponding to the video task to obtain second video characteristics;
if the computing module corresponding to the video task comprises at least one auxiliary processor, respectively inputting the second video characteristics into the video processor and the at least one auxiliary processor corresponding to the video task to obtain a video processing result and at least one auxiliary processing result;
based on label data respectively corresponding to the video processor and the at least one auxiliary processor in the training sample, and the video processing result and the at least one auxiliary processing result, adjusting model parameters of the video processor, the at least one auxiliary processor and the feature adapter;
if the computing module corresponding to the video task does not comprise at least one auxiliary processor, inputting the second video characteristics into the video processor corresponding to the video task to obtain a video processing result;
adjusting model parameters of the video processor and the feature adapter based on the label data corresponding to the video processor in the training sample and the video processing result;
and after the training is finished, cutting the at least one auxiliary processor from the video processing model to obtain the trained video processing model.
The at least one auxiliary processor may be of a different processing type than the video processor. For example, the video processor may be used to compute specific video features, while an auxiliary processor may be used to implement video classification, and the like.
The description is given by taking as an example that the plurality of video tasks include a video category identification task and a video signature feature extraction task. In the video processing model shown in FIG. 3b, which includes the feature extractor 300, the computation module of the video category identification task connected to the output of the feature extractor 300 may include a classification feature adaptation module 31 and a video classifier 32; the computation module of the video signature feature extraction task connected to the output of the feature extractor 300 may include a signature feature adaptation module 33, a signature vector calculator 34, and at least one auxiliary classifier 35, which may for example include a multi-class classifier identifying a specific video category or a binary classifier identifying whether a specific object is present in the video.
By adding the auxiliary processors, the video processor can be trained jointly with them, so that the parameters of the feature adapter are adjusted reasonably and the situation where the model performs poorly or with low accuracy due to insufficient training is avoided. After training is finished, the auxiliary processors can be cut off, yielding the trained video processing model shown in FIG. 3c.
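One way this could be realized is sketched below: the signature task's computation module carries an auxiliary classifier that contributes only during training and is cropped afterwards. Layer sizes and names are assumptions.

```python
import torch.nn as nn

class SignatureTaskModule(nn.Module):
    """Feature adapter + signature vector calculator, with an auxiliary classifier used only for training."""
    def __init__(self, in_dim, sig_dim, num_aux_classes, hidden=512):
        super().__init__()
        self.adapter = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # feature adaptation module
        self.signature = nn.Linear(hidden, sig_dim)                          # signature vector calculator
        self.aux_classifier = nn.Linear(hidden, num_aux_classes)             # auxiliary classifier

    def forward(self, video_feature):
        adapted = self.adapter(video_feature)          # second video feature
        if self.training and self.aux_classifier is not None:
            return self.signature(adapted), self.aux_classifier(adapted)
        return self.signature(adapted)

    def crop_auxiliary(self):
        """After training, cut the auxiliary classifier from the video processing model."""
        self.aux_classifier = None
```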
In a practical application, the video category and video signature feature identified by the video processing model can be used in an audio data recommendation scenario: the similarity between the video signature feature and an audio signature feature can be calculated, for example expressed as a Euclidean distance, and the audio data matching the video to be processed is determined according to that similarity. When the video processing model is trained, the model input in a training sample of the video signature feature extraction task can be sample video data, and the label data corresponding to the signature vector calculator can be a distance range between the sample video data and the signature feature of the audio data matched with it. During training, the Euclidean distance between the video processing result computed by the signature vector calculator and the audio signature feature of that audio data is calculated, and the distance range in the label data is used as a constraint for adjusting the model parameters; when that Euclidean distance falls within the distance range, the training requirement is considered to be met and training ends.
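A hedged sketch of such a distance-range constraint expressed as a training loss; the hinge-style formulation is my assumption, since the patent only states that the labeled distance range acts as a constraint.

```python
import torch
import torch.nn.functional as F

def distance_range_loss(video_signature, audio_signature, d_min, d_max):
    """Penalize Euclidean distances between video and audio signature features
    that fall outside the labeled distance range [d_min, d_max]."""
    dist = F.pairwise_distance(video_signature, audio_signature)  # Euclidean distance per sample
    too_far = F.relu(dist - d_max)      # exceeds the allowed range
    too_close = F.relu(d_min - dist)    # falls below the allowed range
    return (too_far + too_close).mean()
```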
In yet another practical scenario, the at least one video task may comprise a video task, and step 204 may be connecting the output of the feature extractor to a computation module of the video task to construct the video processing model.
The video processing model may be a processing model that processes a video to be processed to obtain processing results respectively corresponding to the one video task.
The video processing model may be obtained based on training samples of the video task.
In this implementation scenario, the video processing model may be used to process one video task. Since there may be processing requirements for a plurality of video tasks, in order to improve processing efficiency, reduce the amount of computation, and so on, as another embodiment, after training the video processing model with the training samples corresponding to the at least one video task, the method may further include:
determining a plurality of video processing models respectively constructed by a feature extractor and a computing module of a video task; the multiple video processing models correspond to different video tasks;
integrating the video processing models to obtain a comprehensive processing model consisting of the feature extractor and the computation modules of the plurality of video tasks, each connected to the output of the feature extractor; the comprehensive processing model is used for processing the video to be processed to obtain calculation results respectively corresponding to the plurality of video tasks.
Because the plurality of video processing models are all constructed based on the same feature extractor, they can be cropped, merged, and so on to share a common feature extractor, thereby realizing a comprehensive processing model that handles a plurality of video tasks.
The difference between the comprehensive processing model and the video processing model shown in fig. 3b or fig. 3c is that the video processing model shown in fig. 3b or fig. 3c is obtained by integrating the calculation modules of a plurality of video tasks in a manner of sharing a feature extractor, and then is obtained by training with the training samples of the plurality of video tasks. The comprehensive processing model can be obtained by constructing a video processing model based on a feature extractor and finishing training for a certain video task, and then cutting and combining a plurality of video processing models, and the comprehensive processing model can be used without training.
The computing module of the video task may include a feature adaptation module and a video processor, and for a specific video task, the computing module may further include at least one auxiliary processor, and the specific video task may refer to any one of the video tasks, and may also be determined according to a type of the video task.
Then in some embodiments, keeping the model parameters of the feature extractor unchanged, training the video processing model using the training samples corresponding to the at least one video task may include:
keeping the model parameters of the feature extractor unchanged, and inputting the training sample into the feature extractor to obtain a first video feature;
inputting the first video characteristic into a characteristic adaptation module to obtain a second video characteristic;
if the calculation module comprises at least one auxiliary processor, respectively inputting the second video characteristics into the video processor and the at least one auxiliary processor to obtain a video processing result and at least one auxiliary processing result;
based on label data respectively corresponding to the video processor and the at least one auxiliary processor in the training sample, the video processing result and the at least one auxiliary processing result, adjusting model parameters of the video processor, the at least one auxiliary processor and the feature adapter until the training requirement is met;
cutting the at least one auxiliary processor from the video processing model after the training is finished;
if the calculation module does not comprise at least one auxiliary processor, inputting the second video characteristics into the video processor to obtain a video processing result;
based on the label data in the training sample and the video processing result, adjusting model parameters of a video processor and a feature adapter until the training requirement is met;
Integrating the plurality of video processing models to obtain a comprehensive processing model composed of the feature extractor and the computation modules of the plurality of video tasks connected to the output of the feature extractor may then include:
and combining the video processing models to obtain a comprehensive processing model consisting of the feature extractor and the computing modules corresponding to the video tasks respectively connected with the output of the feature extractor.
That is, the plurality of video processing models are cropped and combined so that only the comprehensive processing model remains, composed of the feature extractor and, connected to its output, the feature adaptation module and video processor of each of the plurality of video tasks.
In order to clearly understand the generation process of the integrated processing model, the following description still takes a plurality of video tasks including a video category identification task and a video signature feature extraction task as an example. FIG. 3d is a video processing model corresponding to the video category identification task, which may include a feature extractor 300, a feature adaptation module 36 connected to the feature extractor 300, and a video classifier 37 connected to the feature adaptation module; fig. 3e is a video processing model corresponding to the video signature feature extraction task, and it can be seen that the video processing model includes a feature extractor 300, a feature adaptation module 38 connected to the feature extractor 300, and a signature vector calculator 39 and at least one auxiliary classifier 40 respectively connected to the feature adaptation module. FIG. 3f is a video processing model obtained by cropping out the auxiliary classifiers after the training of the video processing model of FIG. 3e is completed. The video processing model of fig. 3f is combined with the video processing model of fig. 3d, so that an integrated processing model can be obtained, which has the same structure as the video processing model shown in fig. 3 c.
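The cropping-and-merging step amounts to re-attaching the already-trained computation modules to the single shared feature extractor; a sketch, reusing the hypothetical VideoProcessingModel structure from the earlier sketch (task names are assumptions):

```python
def merge_into_integrated_model(shared_extractor, category_model, signature_model):
    """Combine two trained single-task models that share one feature extractor into an
    integrated processing model; no further training is needed. Assumes both models follow
    the VideoProcessingModel sketch above and that auxiliary classifiers were already cropped."""
    task_modules = {
        "category": category_model.task_modules["category"],
        "signature": signature_model.task_modules["signature"],
    }
    return VideoProcessingModel(shared_extractor, task_modules)
```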
In addition, an embodiment of the present application further provides a model generation method, as shown in fig. 4, the method may include the following steps:
401: a plurality of image processing models is determined.
402: and respectively extracting the feature extraction modules from the plurality of image processing models to obtain a plurality of feature extraction modules.
403: a plurality of feature extraction modules are connected in parallel to construct a feature extractor.
Wherein the feature extractor may be configured to extract video features of the video to be processed.
The operations of steps 401 to 403 are detailed in the description of steps 201 to 203 of the embodiment shown in FIG. 2 and are not repeated here. In this embodiment, the feature extractor constructed by connecting the feature extraction modules of the plurality of image processing models may be used as an independent feature extraction model to extract video features of a video to be processed. Because the feature extractor is built from feature extraction modules taken from a plurality of image processing models, whose image tasks may differ, it effectively extracts features of different types, making the resulting features more accurate.
Based on the video processing model obtained by training in the foregoing corresponding embodiment, video processing may be performed, and referring to fig. 5, the method is a flowchart of an embodiment of a video processing method provided in this embodiment of the present application, and the method may include the following steps:
501: a plurality of image frames are extracted from a video to be processed.
One extraction manner is to take one image frame out of every predetermined number of image frames of the video to be processed. Another is to use key frames extracted from the video to be processed. Of course, the plurality of image frames may in theory also be all the video frames that make up the video to be processed.
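For example, sampling one frame out of every N frames could be done as follows; OpenCV is an assumed dependency, and any frame-decoding library would work just as well.

```python
import cv2

def extract_frames(video_path, every_n=30, max_frames=16):
    """Take one image frame out of every `every_n` frames of the video to be processed."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    cap.release()
    return frames
```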
502: a plurality of image frames are input to a video processing model.
Wherein the video processing model comprises a feature extractor and at least one computing module; the feature extractor comprises a plurality of feature extraction modules; the plurality of feature extraction modules are respectively extracted from the plurality of image processing models.
The video processing model is obtained by training samples corresponding to at least one video task respectively, and model parameters of the feature extractor are kept unchanged in the training process, so that the accuracy of the video processing model is guaranteed.
The specific construction and training modes of the video processing model may be described in the corresponding embodiments, and are not described herein again.
503: and obtaining a first video characteristic corresponding to the video to be processed based on the frame characteristics respectively extracted from the plurality of image frames by the plurality of characteristic extraction modules.
The image frames are input into the video processing model and respectively input into each feature extraction module, each feature extraction module respectively extracts frame features from the image frames, and each feature extraction module can extract and obtain a plurality of frame features corresponding to the image frames.
Based on the frame features respectively output by the feature extraction modules, the first video feature of the video to be processed can be obtained.
Optionally, when the feature extractor includes a plurality of sub-extractors, each formed by connecting the output of one feature extraction module to a feature fusion module, together with a feature connection module connected to the outputs of the plurality of sub-extractors, obtaining the first video feature corresponding to the video to be processed based on the frame features extracted from the plurality of image frames by the plurality of feature extraction modules includes:
respectively extracting the frame characteristics of the image frames by utilizing the plurality of characteristic extraction modules, and fusing the frame characteristics by utilizing the corresponding characteristic fusion modules to obtain fusion characteristics;
and fusing the plurality of fusion features output by the plurality of sub-extractors by using the feature connection module to obtain a first video feature.
504: and processing the first video characteristics by utilizing at least one calculation module respectively to obtain at least one processing result.
Each computation module may include a feature adaptation module and a video processor. The first video feature is input into the at least one computation module; the feature adaptation module in each computation module processes the first video feature to obtain a second video feature, and the video processor processes the second video feature to obtain the corresponding processing result.
The specific implementations and uses of the feature extraction module, feature fusion module, feature connection module, feature adaptation module, and video processor are described in detail in the foregoing embodiments of the model generation method and are not repeated here.
As can be seen from the foregoing examples, the video processing model provided by the embodiments of this application may be used in an audio data recommendation scenario for video data. For example, in a video sharing or video processing system, a user may upload video data they have shot, and the system may recommend audio data matching that video data, improving the attention the video receives by combining audio and video. To improve the accuracy of matching audio data with video data, the inventors considered matching based on both the video category, which may for example refer to category types such as classical or jazz, and the video signature feature, a specific type of video feature that may refer to a multi-dimensional real-valued vector uniquely characterizing the video content. With the technical solution of the embodiments of this application, a video processing model with more accurate processing results is constructed and trained, so the accuracy of the video category and video signature feature can be improved and accurate audio data recommendation can be realized. Taking audio data recommendation as an example, the technical solution is described below with reference to FIG. 6a, a flowchart of another embodiment of a video processing method provided by the embodiments of this application. The technical solution of this embodiment may be executed by a second server on which the video processing model is deployed, or of course by a client on which the video processing model is deployed, and the method may include:
601: and receiving the video to be processed uploaded by the user.
When this embodiment is executed by the second server, the to-be-processed video may be uploaded to the second server by the user through the client. For example, the client may send a video processing request containing the to-be-processed video to the second server based on the user's upload, so that the second server can determine the to-be-processed video from the request. The to-be-processed video may be shot by the client based on a user control operation, read from the local system where the client is located based on a user instruction, and so on.
602: a plurality of image frames are extracted from a video to be processed.
603: and inputting a plurality of image frames into a video processing model to obtain the video type and the video signature characteristics.
The video processing model may implement the determination of the video type and the video signature feature, that is, the video processing model may be the video processing model shown in fig. 3 c.
Of course, as another embodiment, the plurality of image frames may be input into the integrated processing model to obtain the video category and the video signature feature.
Of course, as another embodiment, a plurality of image frames may be input into the video processing model shown in fig. 3d to obtain the video category, a plurality of image frames may be input into the video processing model shown in fig. 3f to obtain the video signature, and so on.
Wherein, the video processing model is composed of a feature extractor and at least one computing module; the feature extractor may include a plurality of feature extraction modules; the plurality of feature extraction modules are respectively extracted from the plurality of image processing models.
After the plurality of image frames are input into the video processing model, a first video feature corresponding to a video to be processed can be obtained based on frame features respectively extracted from the plurality of image frames by the plurality of feature extraction modules; and respectively processing the first video characteristics by using at least one computing module to obtain the video types and the video signature characteristics.
In addition, in the case that the feature extractor specifically includes a plurality of sub-extractors each of which is configured by connecting an output of each of the plurality of feature extraction modules to a feature fusion module, and a feature connection module connected to an output of each of the plurality of sub-extractors, alternatively, specifically, the plurality of feature extraction modules may respectively extract frame features of the plurality of image frames, and fuse the plurality of frame features via the respective corresponding feature fusion modules to obtain fusion features; and fusing the plurality of fusion features output by the plurality of sub-extractors by using the feature connection module to obtain a first video feature.
The computation module that performs the video category identification task may include a feature adaptation module and a video classifier, and the computation module that performs the video signature feature extraction task may include a feature adaptation module and a signature vector calculator. The feature adaptation module in each computation module obtains a second video feature based on the first video feature; the video classifier then classifies its second video feature to obtain the corresponding video category, and the signature vector calculator performs signature vector calculation on its second video feature to obtain the video signature feature.
604: and screening at least one target audio data matched with the video to be processed from the audio database based on the video type and the video signature characteristics.
605: at least one target audio data is recommended to the user.
When the technical solution of this embodiment is executed by the second server, step 605 may specifically be sending recommendation prompt information for the at least one target audio data to the client, and the client outputs the recommendation prompt information, thereby achieving the purpose of recommending it to the user.
The embodiment can combine the video category and the video signature characteristics to screen at least one target audio data matched with the video to be processed from the audio database. Since the accuracy of the video type and the video signature feature can be ensured, the accuracy of audio data recommendation can be improved.
As an alternative, the screening of the at least one target audio data matching the video to be processed from the audio database based on the video category and the video signature feature may include:
screening at least one audio data matched with the video category from an audio database;
respectively calculating the similarity of the audio signature characteristic and the video signature characteristic of at least one piece of audio data;
determining at least one target audio data with similarity satisfying the similarity requirement;
the at least one target audio data is recommended to the user.
The similarity may refer to a feature distance between the audio signature feature and the video signature feature, such as a Euclidean distance, a cosine distance, or a Mahalanobis distance.
The at least one target audio data whose similarity satisfies the similarity requirement may be, for example, the audio data whose similarity is greater than a predetermined value, or a predetermined number of audio data taken from the ranking of the similarities in descending order, as illustrated in the sketch below.
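A sketch of this first matching strategy is given below. It assumes the signatures are NumPy vectors, uses cosine similarity as the distance measure, and assumes each audio record in the database is a dictionary with "category" and "signature" fields; the field names and the top-k value are illustrative.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def recommend_by_category(video_signature, video_category, audio_database, top_k=5):
    # 1) screen audio data whose category matches the video category
    candidates = [a for a in audio_database if a["category"] == video_category]
    # 2) score each candidate by signature similarity
    scored = [(cosine_similarity(video_signature, a["signature"]), a) for a in candidates]
    # 3) keep the top-k most similar audio data as the target audio data
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [audio for _, audio in scored[:top_k]]
```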
As another alternative, the screening of the at least one target audio data matching the video to be processed from the audio database based on the video category and the video signature feature may include:
screening at least one first audio data matched with the video category from an audio database;
respectively calculating the similarity between the video signature characteristics and the audio signature characteristics of the audio data in the audio database, and screening at least one second audio data with the similarity meeting the similarity requirement;
determining at least one third audio data present in both the at least one first audio data and the at least one second audio data;
selecting at least one target audio data from the at least one third audio data in descending order of similarity;
the at least one target audio data is recommended to the user.
The at least one target audio data may be, for example, the audio data with the largest similarity among the at least one third audio data, the first predetermined number of audio data in descending order of similarity, or all of the at least one third audio data.
The at least one third audio data is present in both the at least one first audio data and the at least one second audio data; that is, the at least one first audio data and the at least one second audio data are intersected to obtain the at least one third audio data, as illustrated in the sketch below.
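The intersection-based strategy could be sketched as follows, reusing the cosine_similarity helper and the assumed audio-record layout from the previous sketch; the audio "id" field, the similarity threshold and the top-k value are again illustrative assumptions.

```python
def recommend_by_intersection(video_signature, video_category, audio_database,
                              similarity_threshold=0.8, top_k=5):
    # first audio data: category matches the video category
    first = {a["id"] for a in audio_database if a["category"] == video_category}
    # second audio data: signature similarity satisfies the similarity requirement
    scored = {a["id"]: cosine_similarity(video_signature, a["signature"])
              for a in audio_database}
    second = {aid for aid, sim in scored.items() if sim >= similarity_threshold}
    # third audio data: present in both sets (intersection)
    third = first & second
    # target audio data: taken from the third audio data in descending order of similarity
    ranked = sorted(third, key=lambda aid: scored[aid], reverse=True)
    return ranked[:top_k]
```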
Fig. 6b shows a schematic view of a possible scene interaction in an audio data recommendation scene according to the technical solution of the embodiment of the present application, where a client 61 receives a to-be-processed video uploaded by a user, and sends the to-be-processed video to a second server 62.
The second server 62 deploys a video processing model, which may be obtained by training on the first server 63 or by training performed by the second server 62 itself.
The second server 62 may identify the video category and the video signature feature of the video to be processed, and determine at least one target audio data matching the video to be processed based on the video category and the video signature feature.
The second server 62 sends recommendation prompt information of at least one target audio data to the client 61, and the client 61 outputs the recommendation prompt information.
Fig. 7 is a schematic structural diagram of an embodiment of a model generation apparatus provided in an embodiment of the present application, where the apparatus may include:
a first determining module 701 for determining a plurality of image processing models;
a first extraction module 702, configured to extract feature extraction modules from the multiple image processing models, respectively, to obtain multiple feature extraction modules;
a first constructing module 703, configured to respectively connect the outputs of the plurality of feature extraction modules to a feature fusion module, so as to construct a feature extractor;
the feature extractor may be configured to extract video features of a video to be processed.
In some embodiments, the apparatus may further comprise:
a second construction module 704 for connecting the output of the feature extractor to a computation module of at least one video task to construct a video processing model;
a first training module 705, configured to keep the model parameters of the feature extractor unchanged, train a video processing model using a training sample of at least one video task; the video processing model is used for processing the video to be processed to obtain processing results respectively corresponding to at least one video task.
In some embodiments, the first building module may be specifically configured to respectively connect the outputs of the plurality of feature extraction modules to a feature fusion module to obtain a plurality of sub-extractors, the feature fusion module being used for fusing the frame features output by the corresponding feature extraction module to obtain a fusion feature; and to connect the outputs of the plurality of sub-extractors to a feature connection module to obtain the feature extractor, the feature connection module being used for fusing the plurality of fusion features output by the plurality of sub-extractors to obtain the first video feature.
In some embodiments, the second construction module may be specifically configured to connect the output of the feature extractor to the computing module of one video task to construct a video processing model;
the apparatus may further include:
the second determining module is used for determining a plurality of video processing models, each constructed from the feature extractor and the computing module of one video task; the plurality of video processing models correspond to different video tasks;
the model integration module is used for integrating the plurality of video processing models to obtain a comprehensive processing model formed by the feature extractor and the computing modules corresponding to the plurality of video tasks, each connected to the output of the feature extractor; the comprehensive processing model is used for processing the video to be processed to obtain calculation results respectively corresponding to the plurality of video tasks.
In some embodiments, the computing module comprises a feature adaptation module and a video processor, or the computing module comprises a feature adaptation module, a video processor, and at least one auxiliary processor;
the first training module is specifically used for keeping the model parameters of the feature extractor unchanged and inputting a training sample into the feature extractor to obtain a first video feature; inputting the first video feature into the feature adaptation module to obtain a second video feature; if the computing module comprises at least one auxiliary processor, respectively inputting the second video feature into the video processor and the at least one auxiliary processor to obtain a video processing result and at least one auxiliary processing result, and adjusting the model parameters of the video processor, the at least one auxiliary processor and the feature adaptation module based on the label data respectively corresponding to the video processor and the at least one auxiliary processor in the training sample, the video processing result and the at least one auxiliary processing result; if the computing module does not comprise an auxiliary processor, inputting the second video feature into the video processor to obtain a video processing result, and adjusting the model parameters of the video processor and the feature adaptation module based on the label data in the training sample and the video processing result;
the model integration module may be specifically configured to merge the plurality of video processing models to obtain a merged model formed by the feature extractor and the computing modules corresponding to the plurality of video tasks, each connected to the output of the feature extractor, and to cut the auxiliary processors out of the merged model to obtain the comprehensive processing model.
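A hedged sketch of such model integration follows. It assumes that each single-task video processing model exposes a shared feature_extractor and a compute_module attribute, and that auxiliary processors, if any, live in an auxiliary_processors sub-module; all of these names are assumptions for illustration, not interfaces defined by this application.

```python
import torch.nn as nn


class ComprehensiveModel(nn.Module):
    """Comprehensive processing model: one shared feature extractor, one computing module per task."""

    def __init__(self, feature_extractor: nn.Module, task_heads: dict):
        super().__init__()
        self.feature_extractor = feature_extractor
        self.task_heads = nn.ModuleDict(task_heads)

    def forward(self, frames):
        first_video_feature = self.feature_extractor(frames)
        return {task: head(first_video_feature) for task, head in self.task_heads.items()}


def integrate(models: dict) -> ComprehensiveModel:
    # All single-task models are assumed to share the same (frozen) feature extractor.
    shared_extractor = next(iter(models.values())).feature_extractor
    heads = {}
    for task, model in models.items():
        head = model.compute_module
        # Cut the auxiliary processors out of the merged model.
        if hasattr(head, "auxiliary_processors"):
            head.auxiliary_processors = nn.ModuleList()
        heads[task] = head
    return ComprehensiveModel(shared_extractor, heads)
```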
In some embodiments, the computing module comprises a feature adapter and a video processor, or the computing module comprises a feature adapter, a video processor, and at least one auxiliary processor;
the first training module may be specifically configured to keep the model parameters of the feature extractor unchanged and, for each video task, input a training sample of the video task into the feature extractor to obtain a first video feature; input the first video feature into the feature adapter corresponding to the video task to obtain a second video feature; if the computing module corresponding to the video task comprises at least one auxiliary processor, respectively input the second video feature into the video processor and the at least one auxiliary processor corresponding to the video task to obtain a video processing result and at least one auxiliary processing result, and adjust the model parameters of the video processor, the at least one auxiliary processor and the feature adapter based on the label data respectively corresponding to the video processor and the at least one auxiliary processor in the training sample, the video processing result and the at least one auxiliary processing result; if the computing module corresponding to the video task does not comprise an auxiliary processor, input the second video feature into the video processor corresponding to the video task to obtain a video processing result, and adjust the model parameters of the video processor and the feature adapter based on the label data corresponding to the video processor in the training sample and the video processing result; and after the training is finished, cut the auxiliary processors out of the video processing model to obtain the trained video processing model.
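A minimal training-step sketch for one video task with the feature extractor frozen is given below; the Adam optimiser, the cross-entropy losses and the structure of the data-loader batches are assumptions, not choices made by this application.

```python
import torch
import torch.nn as nn


def train_one_task(feature_extractor, feature_adapter, video_processor, aux_processors,
                   data_loader, epochs=1, lr=1e-4):
    # Keep the model parameters of the feature extractor unchanged.
    for p in feature_extractor.parameters():
        p.requires_grad = False
    feature_extractor.eval()

    trainable = list(feature_adapter.parameters()) + list(video_processor.parameters())
    for aux in aux_processors:
        trainable += list(aux.parameters())
    optimizer = torch.optim.Adam(trainable, lr=lr)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for frames, main_label, aux_labels in data_loader:
            with torch.no_grad():
                first_feature = feature_extractor(frames)        # first video feature
            second_feature = feature_adapter(first_feature)      # second video feature
            loss = criterion(video_processor(second_feature), main_label)
            for aux, aux_label in zip(aux_processors, aux_labels):
                loss = loss + criterion(aux(second_feature), aux_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # After training, the auxiliary processors are simply discarded (cut out of the model).
    return feature_adapter, video_processor
```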
In some embodiments, the first determining module may be specifically configured to train a plurality of image processing models by using training samples corresponding to a plurality of image tasks, respectively; the plurality of image tasks includes one or more of an image classification task, an object detection task, and an image segmentation task.
In some embodiments, the first extraction module may be specifically configured to remove, from the end of each of the plurality of image processing models, a number of network layers corresponding to the task type of that image processing model, so as to obtain the plurality of feature extraction modules.
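As an illustration only, a feature extraction module could be obtained from an image classification model by dropping its trailing classification layer, e.g. with torchvision (version 0.13 or later assumed for the weights argument); how many layers to drop for other task types is not specified by this sketch.

```python
import torch.nn as nn
from torchvision import models

# For an image classification model such as ResNet-18, dropping only the final
# fully connected layer leaves a module that maps images to frame features.
image_model = models.resnet18(weights=None)
feature_extraction_module = nn.Sequential(
    *list(image_model.children())[:-1],   # everything up to and including global average pooling
    nn.Flatten(),                         # (N, 512, 1, 1) -> (N, 512) frame features
)
```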
The model generating apparatus shown in fig. 7 may execute the model generating method of the embodiment shown in fig. 2; the implementation principle and the technical effect are similar and are not repeated here. The specific manner in which each module and unit of the model generating apparatus performs operations has been described in detail in the method embodiments and is not elaborated here.
In one possible design, the model generation apparatus of the embodiment shown in fig. 7 may be implemented as a computing device, for example the first server or the second server in the embodiment shown in fig. 1. As shown in fig. 8, the computing device may include a storage component 801 and a processing component 802;
the storage component 801 stores one or more computer instructions for execution by the processing component 802 to implement the model generation method as shown in fig. 2 or as shown in fig. 4.
Of course, the computing device may also include other components as needed, such as input/output interfaces, communication components, and so forth.
The input/output interface provides an interface between the processing components and peripheral interface modules, which may be output devices, input devices, etc. The communication component is configured to facilitate wired or wireless communication between the computing device and other devices, and the like.
The computing device may be a physical device or an elastic computing host provided by a cloud computing platform. When the computing device is a cloud server, the processing component, the storage component, and the like may be basic server resources rented or purchased from the cloud computing platform.
The embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a computer, the computer program can implement the model generation method in the embodiment shown in fig. 2 or fig. 4.
Fig. 9 is a schematic structural diagram of an embodiment of a video processing apparatus according to an embodiment of the present application, where the apparatus may include:
a data extracting module 901, configured to extract a plurality of image frames from a video to be processed;
a data processing module 902, configured to input the plurality of image frames into a video processing model, obtain a first video feature corresponding to the video to be processed based on the frame features respectively extracted from the plurality of image frames by the plurality of feature extraction modules, and process the first video feature by using the at least one computing module respectively to obtain at least one processing result; wherein the video processing model comprises a feature extractor and at least one computing module, the feature extractor comprises a plurality of feature extraction modules, and the plurality of feature extraction modules are respectively extracted from a plurality of image processing models.
In some embodiments, the feature extractor specifically includes a plurality of sub-extractors formed by connecting the outputs of the plurality of feature extraction modules with a feature fusion module respectively, and a feature connection module connected with the outputs of the plurality of sub-extractors;
the data processing module, based on the frame features respectively extracted from the image frames by the feature extraction modules, may specifically obtain the first video feature corresponding to the video to be processed, where the obtaining of the first video feature may specifically include respectively extracting the frame features of the image frames by the feature extraction modules, and fusing the frame features by the feature fusion modules respectively corresponding to the frame features to obtain a fusion feature; and fusing the plurality of fusion features output by the plurality of sub-extractors by using the feature connection module to obtain a first video feature.
In some embodiments, the at least one processing result includes a video category and a video signature feature;
the apparatus may further include:
and the receiving module is used for receiving the video to be processed uploaded by the user.
As an alternative, the data processing module may be specifically configured to: filter at least one audio data matched with the video category from an audio database; respectively calculate the similarity between the video signature feature and the audio signature feature of each of the at least one audio data; determine at least one target audio data whose similarity satisfies the similarity requirement; and recommend the at least one target audio data to the user.
As another alternative, the data processing module may be specifically configured to: filter at least one first audio data matched with the video category from an audio database; respectively calculate the similarity between the video signature feature and the audio signature feature of each audio data in the audio database, and screen at least one second audio data whose similarity satisfies the similarity requirement; determine at least one third audio data present in both the at least one first audio data and the at least one second audio data; select at least one target audio data from the at least one third audio data in descending order of similarity; and recommend the at least one target audio data to the user.
In one possible design, the video processing apparatus of the embodiment shown in fig. 9 may be implemented as a computing device, which may be, for example, a second server in the embodiment shown in fig. 1, as shown in fig. 10, and may include a storage component 1001 and a processing component 1002;
the storage component 1001 stores one or more computer instructions for execution by the processing component 1002 to implement the video processing method as shown in fig. 5.
Of course, the computing device may also include other components as needed, such as input/output interfaces, communication components, and so forth.
The input/output interface provides an interface between the processing components and peripheral interface modules, which may be output devices, input devices, etc. The communication component is configured to facilitate wired or wireless communication between the computing device and other devices, and the like.
The computing device may be a physical device or an elastic computing host provided by a cloud computing platform. When the computing device is a cloud server, the processing component, the storage component, and the like may be basic server resources rented or purchased from the cloud computing platform.
An embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a computer, the video processing method of the embodiment shown in fig. 5 can be implemented.
The processing components involved in the previous embodiments may include one or more processors executing computer instructions to perform all or part of the steps of the methods described above. Of course, the processing elements may also be implemented as one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components configured to perform the above-described methods.
The storage component is configured to store various types of data to support operations on the computing device. The storage component may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (12)

1. A video processing method, comprising:
extracting a plurality of image frames from a video to be processed;
inputting the plurality of image frames into a video processing model; wherein the video processing model comprises a feature extractor and at least one computation module; the feature extractor comprises a plurality of feature extraction modules; the plurality of feature extraction modules are respectively extracted from a plurality of image processing models;
obtaining first video features corresponding to the video to be processed based on frame features respectively extracted from the plurality of image frames by the plurality of feature extraction modules;
and processing the first video characteristics by utilizing the at least one calculation module respectively to obtain at least one processing result.
2. The method according to claim 1, wherein the feature extractor specifically comprises a plurality of sub-extractors, each formed by connecting the output of one of the plurality of feature extraction modules to a feature fusion module, and a feature connection module connected to the outputs of the plurality of sub-extractors;
the obtaining of the first video feature corresponding to the video to be processed based on the frame features respectively extracted from the plurality of image frames by the plurality of feature extraction modules comprises:
respectively extracting the frame features of the image frames by utilizing the plurality of feature extraction modules, and fusing the frame features by utilizing the corresponding feature fusion modules to obtain fusion features;
and fusing the plurality of fusion features output by the plurality of sub-extractors by using the feature connection module to obtain a first video feature.
3. The method of claim 1, wherein the at least one processing result comprises a video category and a video signature feature;
before the extracting a plurality of image frames from the video to be processed, the method further comprises:
receiving a video to be processed uploaded by a user;
after the inputting the plurality of image frames into a video processing model and obtaining at least one processing result, the method further comprises:
screening at least one audio data matched with the video category from an audio database;
respectively calculating the similarity between the video signature feature and the audio signature feature of each of the at least one audio data;
determining at least one target audio data with similarity satisfying the similarity requirement;
recommending the at least one target audio data to the user.
4. The method of claim 1, wherein the at least one processing result comprises a video category and a video signature feature;
before the extracting a plurality of image frames from the video to be processed, the method further comprises:
receiving a video to be processed uploaded by a user;
after the inputting the plurality of image frames into a video processing model and obtaining at least one processing result, the method further comprises:
screening at least one first audio data matched with the video category from an audio database;
respectively calculating the similarity between the video signature characteristics and the audio signature characteristics of the audio data in an audio database, and screening at least one second audio data with the similarity meeting the similarity requirement;
determining at least one third audio data present in both the at least one first audio data and the at least one second audio data;
selecting at least one target audio data from the at least one third audio data in descending order of similarity;
recommending the at least one target audio data to the user.
5. A method of model generation, comprising:
determining a plurality of image processing models;
respectively extracting feature extraction modules from the plurality of image processing models to obtain a plurality of feature extraction modules;
connecting the plurality of feature extraction modules in parallel to construct a feature extractor;
connecting the output of the feature extractor with a computing module of at least one video task to construct a video processing model;
keeping the model parameters of the feature extractor unchanged, and training the video processing model by using the training sample of the at least one video task; the video processing model is used for processing the video to be processed to obtain processing results respectively corresponding to the at least one video task.
6. The method of claim 5, wherein said connecting the plurality of feature extraction modules in parallel to construct a feature extractor comprises:
respectively connecting the outputs of the plurality of feature extraction modules with a feature fusion module to obtain a plurality of sub-extractors; the feature fusion module is used for fusing the frame features output by the feature extraction module to obtain fusion features;
connecting the outputs of the plurality of sub-extractors to a feature connection module to obtain the feature extractor; the feature connection module is used for fusing a plurality of fusion features output by the plurality of sub-extractors to obtain a first video feature.
7. The method of claim 5, wherein connecting the output of the feature extractor to a computation module of at least one video task to construct a video processing model comprises:
connecting the output of the feature extractor with a computing module of a video task to construct and obtain a video processing model;
after the keeping the model parameters of the feature extractor unchanged and training the video processing model by using the training sample corresponding to the at least one video task, the method further comprises:
determining a plurality of video processing models respectively constructed by the feature extractor and a computing module of a video task; the multiple video processing models correspond to different video tasks;
integrating the plurality of video processing models to obtain a comprehensive processing model formed by the feature extractor and computing modules corresponding to the plurality of video tasks, each connected with the output of the feature extractor; the comprehensive processing model is used for processing the video to be processed to obtain calculation results respectively corresponding to the plurality of video tasks.
8. The method of claim 7, wherein the computing module comprises a feature adaptation module and a video processor or the computing module comprises a feature adaptation module, a video processor and at least one auxiliary processor;
the keeping the model parameters of the feature extractor unchanged, and training the video processing model by using the training sample corresponding to the at least one video task includes:
keeping the model parameters of the feature extractor unchanged, and inputting a training sample into the feature extractor to obtain a first video feature;
inputting the first video characteristic into a characteristic adaptation module to obtain a second video characteristic;
if the computing module comprises at least one auxiliary processor, the second video characteristics are respectively input into the video processor and the at least one auxiliary processor to obtain a video processing result and at least one auxiliary processing result;
adjusting model parameters of the video processor, the at least one auxiliary processor and the feature adapter based on label data respectively corresponding to the video processor and the at least one auxiliary processor in the training sample, and the video processing result and the at least one auxiliary processing result;
cutting the at least one auxiliary processor from the video processing model after the training is finished;
if the computing module does not comprise at least one auxiliary processor, inputting the second video characteristics into the video processor to obtain a video processing result;
adjusting model parameters of the video processor and the feature adapter based on the label data in the training sample and the video processing result;
the integrating the plurality of video processing models to obtain a comprehensive processing model composed of the feature extractor and the computing modules corresponding to the plurality of video tasks connected with the feature extractor respectively comprises:
and combining the video processing models to obtain a combined model formed by the feature extractor and calculation modules corresponding to the video tasks respectively connected with the output of the feature extractor.
9. The method of claim 5, wherein the computing module comprises a feature adapter and a video processor or the computing module comprises a feature adapter, a video processor, and at least one auxiliary processor;
the keeping the model parameters of the feature extractor unchanged, training the video processing model using the training samples of the at least one video task, comprising:
keeping the model parameters of the feature extractor unchanged, and inputting the training sample of the video task into the feature extractor to obtain a first video feature aiming at each video task;
inputting the first video characteristics into a characteristic adaptation module corresponding to the video task to obtain second video characteristics;
if the computing module corresponding to the video task comprises at least one auxiliary processor, the second video characteristics are respectively input into the video processor corresponding to the video task and the at least one auxiliary processor, and a video processing result and at least one auxiliary processing result are obtained;
adjusting model parameters of the video processor, the at least one auxiliary processor and the feature adapter based on label data respectively corresponding to the video processor and the at least one auxiliary processor in the training sample, and the video processing result and the at least one auxiliary processing result;
if the computing module corresponding to the video task does not comprise at least one auxiliary processor, inputting the second video characteristics into the video processor corresponding to the video task to obtain a video processing result;
adjusting model parameters of the video processor and the feature adapter based on the label data corresponding to the video processor in the training sample and the video processing result;
and after the training is finished, cutting out the auxiliary processor from the video processing model to obtain the trained video processing model.
10. The method of claim 5, wherein the determining a plurality of image processing models comprises:
training a plurality of image processing models by utilizing training samples corresponding to a plurality of image tasks respectively; the plurality of image tasks includes one or more of an image classification task, an object detection task, and an image segmentation task.
11. A method of model generation, comprising:
determining a plurality of image processing models;
respectively extracting feature extraction modules from the plurality of image processing models to obtain a plurality of feature extraction modules;
respectively connecting the outputs of the plurality of feature extraction modules with a feature fusion module to construct a feature extractor;
the feature extractor is used for extracting video features of a video to be processed; the plurality of feature extraction modules respectively extract frame features of a plurality of image frames corresponding to the video to be processed; the feature fusion module is used for fusing the frame features respectively extracted by the feature extraction modules to obtain the video features.
12. A computing device comprising a storage component and a processing component;
the storage component stores one or more computer instructions; the one or more computer instructions are for invocation and execution by the processing component to implement the model generation method of claim 5 or 11 or to implement the video processing method of claim 1.
