CN115115986A - Video quality evaluation model production method, device, equipment and medium


Info

Publication number: CN115115986A
Application number: CN202210749997.5A
Authority: CN (China)
Prior art keywords: model, network, sub-network model, candidate
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 冯进亨, 戴长军
Current assignee: Guangzhou Huanju Shidai Information Technology Co Ltd
Original assignee: Guangzhou Huanju Shidai Information Technology Co Ltd
Application filed by Guangzhou Huanju Shidai Information Technology Co Ltd

Classifications

    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/08 Learning methods (computing arrangements based on biological models; neural networks)
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The application relates to the technical field of image recognition and discloses a video quality evaluation model production method together with a corresponding device, equipment and medium, wherein the method comprises the following steps: acquiring a sampling space corresponding to the network structure of a super-network model pre-trained to convergence, and a search space of sub-network structures corresponding to the sampling space; sampling from the sampling space according to the network structure of the super-network model to determine a plurality of sub-network structures, and obtaining a sub-network model corresponding to each sub-network structure; performing joint training on each sub-network model corresponding to the plurality of sub-network structures together with the super-network model, to obtain sub-network models trained to convergence as candidate models; and verifying the performance of each candidate model, and screening out at least one candidate model as the video quality evaluation model. The method and the device can produce an efficient video quality evaluation model.

Description

Video quality evaluation model production method, device, equipment and medium
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a method for producing a video quality assessment model, and a corresponding apparatus, computer device, and computer-readable storage medium.
Background
Video quality assessment has been studied in academia and industry since the advent of video, and is mainly divided into full-reference evaluation and no-reference evaluation. Full-reference evaluation compares the quality difference frame by frame against a given reference video, so a relatively accurate evaluation score is comparatively easy to obtain. No-reference video evaluation assigns a subjective user-experience score to the current video quality; quality indexes of network transmission such as bit rate, frame rate and resolution can be dynamically adjusted according to the no-reference score so as to control cost. As long as the adjustment has essentially no impact on the user's subjective viewing experience, user requirements are still met while network transmission resources are effectively saved, which can bring considerable economic benefits.
To achieve sufficient accuracy, video quality evaluation models realized with conventional techniques generally have complex structures and low operating efficiency. In addition, conventional model compression schemes aimed at improving operating efficiency cannot control the compressed model structure well and often produce a structure that is too simple, making the accuracy rate difficult to guarantee and failing to meet the requirements of actual services.
In view of these shortcomings of the conventional technology, the present application makes a corresponding exploration.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and provide a method for producing a video quality assessment model, and a corresponding apparatus, computer device, and computer-readable storage medium.
In order to meet the various objects of the present application, the following technical solutions are adopted:
A method for producing a video quality evaluation model, adapted to one of the objects of the present application, comprises the following steps:
acquiring a sampling space corresponding to the network structure of a super-network model pre-trained to convergence, and a search space of sub-network structures corresponding to the sampling space;
sampling from the sampling space according to the network structure of the super-network model to determine a plurality of sub-network structures, and obtaining a sub-network model corresponding to each sub-network structure;
performing joint training on each sub-network model corresponding to the plurality of sub-network structures together with the super-network model, to obtain sub-network models trained to convergence as candidate models;
and verifying the performance of each candidate model, and screening out at least one candidate model as the video quality evaluation model.
In a further embodiment, the training process of the super-network model includes the following steps:
acquiring a plurality of image frames of the video in a single training sample, performing data enhancement processing on the image frames to obtain sample data, and inputting the sample data into the super-network model;
after the super-network model extracts the image feature information of the sample data, outputting, through a prediction module, a predicted quality score corresponding to the sample data;
and calculating the loss value of the super-network model, updating the weights of the model when the loss value does not reach a preset threshold, and continuing to call other training samples for iterative training until the model converges.
In a preferred embodiment, the loss value calculation includes the following steps:
calculating, according to the supervision score corresponding to the sample data, a regression loss value of the predicted quality score that the super-network model produces for the sample data, the supervision score being a subjective quality score;
calculating a corresponding cross-entropy loss value according to the regression loss value;
applying the same out-of-order (shuffling) processing to the predicted quality scores produced by the super-network model and to the corresponding supervision scores, taking the difference between each predicted quality score and its shuffled counterpart as the input of a preset function, calculating the sum of the absolute difference between each supervision score and its shuffled counterpart and the function output, and taking the maximum of this sum and 0 as the out-of-order loss value;
calculating the sum of the regression loss value, the cross-entropy loss value and the out-of-order loss value as the loss value.
In a further embodiment, the joint training process includes the following steps:
acquiring a plurality of image frames of the video in a single training sample, performing data enhancement on the image frames to obtain sample data, and synchronously inputting the sample data into the super-network model and each sub-network model;
after the super-network model and each sub-network model respectively extract the image feature information of the sample data, predicting, through the corresponding prediction modules, the quality score corresponding to the sample data;
and calculating a corresponding joint loss value from the loss values of the super-network model and of each sub-network model, updating the corresponding weights of each sub-network model when the joint loss value does not reach a preset threshold, and continuing to call other training samples for iterative training.
In a further embodiment, the step of verifying the performance of each candidate model and screening out at least one candidate model as the video quality evaluation model includes the following steps:
performing performance measurement on each candidate model with a proxy validation set to obtain the comprehensive performance corresponding to each candidate model, the comprehensive performance including running time and/or accuracy indexes;
and selecting a candidate model whose comprehensive performance meets preset conditions for output as the video quality evaluation model.
In a further embodiment, the step of performing performance measurement on each candidate model with the proxy validation set to obtain the comprehensive performance corresponding to each candidate model includes the following steps:
performing channel search within the sub-network structure corresponding to each candidate model, and determining a plurality of compressed candidate models corresponding to sub-network structures with target compressed channels;
feeding the corresponding inputs of the training samples in the proxy validation set to the plurality of compressed candidate models, and calculating the corresponding running time and accuracy indexes;
and determining the comprehensive performance according to the running time and the accuracy indexes.
In a further embodiment, after the step of determining the comprehensive performance according to the running time and the accuracy indexes, the method further includes the following steps:
for the network structure shared by each compressed candidate model and the super-network model, multiplexing the weights corresponding to the shared network structure in the super-network model;
and performing joint training on each compressed candidate model and the super-network model, selecting the trained compressed candidate models whose comprehensive performance meets the preset conditions, and outputting them as the video quality evaluation model.
On the other hand, a video quality evaluation model production apparatus adapted to one of the objects of the present application includes a sampling acquisition module, a structure sampling module, a joint training module and a performance verification module, wherein: the sampling acquisition module is configured to acquire a sampling space corresponding to the network structure of a super-network model pre-trained to convergence, and a search space of sub-network structures corresponding to the sampling space; the structure sampling module is configured to sample from the sampling space according to the network structure of the super-network model to determine a plurality of sub-network structures, obtaining a sub-network model corresponding to each sub-network structure; the joint training module is configured to perform joint training on each sub-network model corresponding to the plurality of sub-network structures together with the super-network model, to obtain sub-network models trained to convergence as candidate models; and the performance verification module is configured to verify the performance of each candidate model and screen out at least one candidate model as the video quality evaluation model.
In a further embodiment, the sampling acquisition module includes: a first sample input submodule configured to acquire a plurality of image frames of the video in a single training sample, perform data enhancement processing on the image frames to obtain sample data, and input the sample data into the super-network model; a first quality prediction submodule configured to output, through a prediction module, a predicted quality score corresponding to the sample data after the super-network model extracts the image feature information of the sample data; and a first weight updating submodule configured to calculate the loss value of the super-network model, update the weights of the model when the loss value does not reach a preset threshold, and continue to call other training samples for iterative training until the model converges.
In a preferred embodiment, the first weight updating submodule includes: a regression loss submodule configured to calculate, according to the supervision score corresponding to the sample data, a regression loss value of the predicted quality score that the super-network model produces for the sample data, the supervision score being a subjective quality score; a cross-entropy loss submodule configured to calculate a corresponding cross-entropy loss value according to the regression loss value; an out-of-order loss submodule configured to apply the same out-of-order (shuffling) processing to the predicted quality scores produced by the super-network model and to the corresponding supervision scores, take the difference between each predicted quality score and its shuffled counterpart as the input of a preset function, calculate the sum of the absolute difference between each supervision score and its shuffled counterpart and the function output, and take the maximum of this sum and 0 as the out-of-order loss value; and a loss value submodule configured to calculate the sum of the regression loss value, the cross-entropy loss value and the out-of-order loss value as the loss value.
In a further embodiment, the joint training module includes: a second sample input submodule configured to acquire a plurality of image frames of the video in a single training sample, perform data enhancement on the image frames to obtain sample data, and synchronously input the sample data into the super-network model and each sub-network model; a second quality prediction submodule configured to predict, through the corresponding prediction modules, the quality score corresponding to the sample data after the super-network model and each sub-network model respectively extract the image feature information of the sample data; and a second weight updating submodule configured to calculate a corresponding joint loss value from the loss values of the super-network model and of each sub-network model, update the corresponding weights of each sub-network model when the joint loss value does not reach a preset threshold, and continue to call other training samples for iterative training.
In a further embodiment, the performance verification module includes: a performance calculation submodule configured to perform performance measurement on each candidate model with a proxy validation set to obtain the comprehensive performance corresponding to each candidate model, the comprehensive performance including running time and/or accuracy indexes; and a model output submodule configured to select a candidate model whose comprehensive performance meets preset conditions for output as the video quality evaluation model.
In a further embodiment, the performance calculation submodule includes: a channel search unit configured to perform channel search within the sub-network structure corresponding to each candidate model and determine a plurality of compressed candidate models corresponding to sub-network structures with target compressed channels; a model performance estimation unit configured to feed the corresponding inputs of the training samples in the proxy validation set to the plurality of compressed candidate models and calculate the corresponding running time and accuracy indexes; and a performance determination unit configured to determine the comprehensive performance according to the running time and the accuracy indexes.
In a further embodiment, the performance calculation submodule further includes, after the performance determination unit: a weight multiplexing unit configured to multiplex, for the network structure shared by each compressed candidate model and the super-network model, the weights corresponding to the shared network structure in the super-network model; and a model output unit configured to perform joint training on each compressed candidate model and the super-network model, select the trained compressed candidate models whose comprehensive performance meets the preset conditions, and output them as the video quality evaluation model.
In yet another aspect, a computer device adapted to one of the objects of the present application includes a central processing unit and a memory, the central processing unit being configured to call and execute a computer program stored in the memory, so as to perform the steps of the video quality evaluation model production method described herein.
In a further aspect, a computer-readable storage medium adapted to another object of the present application stores, in the form of computer-readable instructions, a computer program implementing the video quality evaluation model production method; when the computer program is called by a computer, it executes the steps included in the method.
The technical solution of the present application has various advantages, including but not limited to the following aspects:
on one hand, the network structure of the super-network model is sampled through the sampling space to determine corresponding sub-network models; the sub-network models are jointly trained with the super-network model, and the sub-network models trained to convergence are obtained as candidate models; the performance of each candidate model is then verified, and the candidate models with better performance are screened out as video quality evaluation models. It can be understood that the network structure of a sub-network model determined by sampling from the sampling space is a simplified version of the network structure of the super-network model; that is, a lightweight sub-network structure is realized, which can effectively improve the operating efficiency of the finally output video quality evaluation model, and a lightweight model is also easier to train to convergence, enabling efficient training. In addition, thanks to the joint training, the accuracy and generalization capability of the obtained candidate models can be guaranteed.
On the other hand, the video quality evaluation model produced by the method can perform no-reference video quality evaluation on video streams, which is intelligent and efficient and does not require large labor costs.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart diagram of an exemplary embodiment of a video quality assessment model production method of the present application;
FIG. 2 is a diagram illustrating an exemplary network architecture of a hyper-network model in an embodiment of the present application;
FIG. 3 is a diagram illustrating an exemplary example of backbone modules of a network architecture of a hyper-network model according to an embodiment of the present application;
FIG. 4 is a diagram illustrating an exemplary example of a subnet model in an embodiment of the present application;
FIG. 5 is a schematic diagram of a hyper-network model training process in an embodiment of the present application;
FIG. 6 is a schematic flow chart illustrating calculation of loss values according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating joint training in an embodiment of the present application;
FIG. 8 is a schematic flow chart of a production video quality assessment model according to an embodiment of the present application;
FIG. 9 is a schematic flow chart illustrating the determination of the overall performance in an embodiment of the present application;
FIG. 10 is a schematic flow chart illustrating training of a video quality assessment model according to an embodiment of the present application;
fig. 11 is a schematic block diagram of a video quality evaluation model production apparatus of the present application;
fig. 12 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices with receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices, such as personal computers and tablets, with or without a single-line or multi-line display; a PCS (personal communications system), which may combine voice, data processing, facsimile and/or data communications capabilities; a PDA (personal digital assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar, and/or a GPS (global positioning system) receiver; and a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client" or "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location on earth and/or in space. The "client" or "terminal device" used herein may also be a communication terminal, a web terminal, or a music/video playing terminal, for example a PDA, a MID (mobile internet device), and/or a mobile phone with a music/video playing function, and may also be a smart TV, a set-top box, or other such device.
The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., a computer program is stored in the memory, and the central processing unit calls a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" in the present application can be extended to the case of server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed on a server and accessed by a client remotely invoking an online service interface provided by that server, or may be deployed and run directly on the client for access.
Unless specified in clear text, the neural network model referred to or possibly referred to in the application can be deployed in a remote server and used for remote call at a client, and can also be deployed in a client with qualified equipment capability for direct call.
Various data referred to in the present application may be stored in a server remotely or in a local terminal device unless specified in the clear text, as long as the data is suitable for being called by the technical solution of the present application.
The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.
The embodiments to be disclosed herein can be flexibly constructed by cross-linking related technical features of the embodiments unless the mutual exclusion relationship between the related technical features is stated in the clear text, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or solve the deficiencies of the prior art. Those skilled in the art will appreciate variations therefrom.
The video quality assessment model production method can be programmed into a computer program product and is deployed in a client or a server to run, for example, in an exemplary application scenario of the application, the video quality assessment model production method can be deployed in a server of an e-commerce platform, so that the method can be executed by accessing an open interface after the computer program product runs and performing human-computer interaction with a process of the computer program product through a graphical user interface.
Referring to fig. 1, the method for producing a video quality assessment model of the present application, in an exemplary embodiment, includes the following steps:
step S1100, acquiring a sampling space corresponding to the network structure of a super-network model pre-trained to convergence, and a search space of sub-network structures corresponding to the sampling space;
an exemplary network structure of the super-network model is shown in fig. 2, which includes a backbone module 200, a global average pooling layer 201 and a prediction module 202, wherein the backbone module is connected to the global average pooling layer, and the global average pooling layer is connected to the prediction module. The backbone module 200 includes convolution modules of multiple stages, the convolution module of one stage being connected to the convolution module of the next stage as shown in fig. 3; each convolution module includes a convolution kernel and a ReLU layer, and the prediction module 202 includes an FC layer (fully connected layer). As an exemplary example, the backbone module of the super-network model includes convolution modules of 5 stages with per-stage channel counts of [64, 64, 128, 256, 512], close to the VGG model, which helps ensure the robustness and generalization capability of the super-network model; of course, those skilled in the art can flexibly set the convolution modules and their channel counts according to the disclosure herein.
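As a concrete illustration, a minimal PyTorch sketch of such a network structure follows. It is a sketch under stated assumptions rather than the patented implementation: the per-stage layer composition (a single 3x3 convolution with stride-2 downsampling per stage) and the 1024-unit width of the FC layer are illustrative choices, since the text above only fixes the per-stage channel counts, the ReLU layers, the global average pooling layer and the fully connected prediction module; all class and variable names are ours.

    import torch
    import torch.nn as nn

    class SuperNet(nn.Module):
        """Minimal sketch of the described super-network (assumptions noted above)."""

        def __init__(self, channels=(64, 64, 128, 256, 512), fc_dim=1024):
            super().__init__()
            stages, in_ch = [], 3
            for out_ch in channels:
                # Each stage: a convolution kernel followed by a ReLU layer.
                stages.append(nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                    nn.ReLU(inplace=True),
                ))
                in_ch = out_ch
            self.backbone = nn.Sequential(*stages)   # backbone module 200
            self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling layer 201
            self.head = nn.Sequential(               # prediction module 202 (FC layer)
                nn.Linear(channels[-1], fc_dim),
                nn.ReLU(inplace=True),
                nn.Linear(fc_dim, 1),                # scalar quality score
            )

        def forward(self, x):
            feats = self.backbone(x)                 # image feature information
            pooled = self.pool(feats).flatten(1)     # spatial average per channel
            return self.head(pooled).squeeze(-1)     # predicted quality score

    # Example: score a batch of 8 RGB frames of size 224x224.
    model = SuperNet()
    scores = model(torch.randn(8, 3, 224, 224))      # shape: (8,)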
The corresponding sampling space may be preset according to the network structure of the super-network model, to facilitate subsequently sampling from it to determine corresponding sub-network structures. In an embodiment, the channel count of each convolution module in the backbone module of the model may be stepwise simplified, where the stepwise simplification may be an exponential decrement with base 2, and the channel number interval formed by the stepwise-simplified channel counts is determined; the channel number interval corresponding to each convolution module is used as the sampling space. For example, the channel count of the first-stage convolution module is 64, and its channel number interval may be [16, 32, 48, 64]; likewise, the channel count of the fourth-stage convolution module is 256, and its channel number interval may be [16, 32, 64, 128, 256]. It can be seen that the difference between adjacent channel counts in a channel number interval is a power of 2; those skilled in the art can flexibly set the channel number interval of each convolution module according to the disclosure herein. Those skilled in the art will appreciate that by reasonably planning the sampling space in this way, fewer selectable options are left for the subsequent sampling of sub-network structures while a certain degree of compression of the network structure is still ensured, which greatly improves sampling efficiency.
The search space of the corresponding sub-network structures may further be preset according to the sampling space. In one embodiment, the sampling space is shown in the following list:
Stage 1: [16, 32, 48, 64]
Stage 2: [16, 32, 48, 64]
Stage 3: [16, 32, 64, 96, 128]
Stage 4: [16, 32, 64, 128, 256]
Stage 5: [128, 256, 384, 512]
wherein Stage 1 to Stage 5 are the convolution modules, and "[ ]" encloses the channel number space corresponding to each convolution module. Further, a subset from the middle portion of each channel number space may be selected as the channel number space corresponding to each convolution module in the search space, so as to obtain the search space; an exemplary example is shown in the following list:
Stage 1: [32, 48]
Stage 2: [32, 48]
Stage 3: [32, 64, 96]
Stage 4: [16, 32, 128]
Stage 5: [256, 384]
Those skilled in the art will appreciate that a search space implemented as above ensures that the channel counts of the network structures obtained by channel search within it remain sufficient, thereby ensuring the accuracy and generalization capability of the model.
The super-network model is supervision-trained with video streams carrying manually labeled scores until it is trained to convergence; concrete implementations are further disclosed in some of the embodiments below.
Step S1200, sampling and determining a plurality of sub-network structures from the corresponding sampling space according to the network structure of the super-network model, and obtaining a sub-network model corresponding to each sub-network structure;
in an embodiment, for each convolution module in the backbone module of the network structure of the super-network model, a channel count is determined by sampling from the sampling space, and the corresponding partial structure of each convolution module is obtained as the backbone module of the sub-network structure. A sampling rule may be set to constrain this sampling, namely that the sampling of each convolution module may only start from the first channel. For example, the last-stage convolution module has 512 channels, numbered 1 to 512; if 384 channels are sampled, the sampled last-stage convolution module is the partial structure covering channels 1 to 384, not, say, the partial structure covering channels 11 to 394. Sampling under this rule greatly reduces the number of distinct partial structures for a given sampled channel count, since partial structures with different starting channel numbers are eliminated, which improves sampling efficiency. Further, it can be understood that since the fully connected layer in the prediction module follows the global average pooling layer, which in turn follows the last-stage convolution module of the backbone module, the fully connected layer must adapt to changes of the last-stage convolution module. For example, when the last-stage convolution module has 512 channels, the fully connected layer has 1024 output units and an input dimension of 512, i.e. 1024 x 512 = 524,288 parameters; when the last-stage convolution module has 256 channels, the fully connected layer has 1024 output units and an input dimension of 256, i.e. 1024 x 256 = 262,144 parameters. It can be seen that the parameters of the fully connected layer change with the channel count of the last-stage convolution module, with a consistent compression ratio; accordingly, dynamic adaptation of the fully connected layer can be achieved following this compression-ratio consistency. On this basis, the backbone module and the prediction module of the sub-network structure are obtained while the global average pooling layer is kept unchanged, so that sampling of the super-network backbone is realized; that is, the network structure of the super-network remains complete, and the sub-network structure is obtained by taking parts of each convolution module in the super-network backbone. A sub-network model corresponding to each sub-network structure is thus obtained. Exemplary sub-network models are shown schematically in fig. 4, where the channel counts of sub-network 1 are [32, 32, 64, 256, 512], those of sub-network 2 are [32, 32, 128, 256, 384], and those of sub-network 3 are [32, 32, 64, 128, 256].
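In code, the first-channel sampling rule amounts to slicing each convolution weight tensor from index 0, and the compression-ratio-consistent adaptation of the fully connected layer amounts to slicing its input dimension the same way. A hedged PyTorch sketch follows; the helper names are ours, and the presence of bias terms is assumed:

    import torch.nn as nn

    def slice_conv(conv: nn.Conv2d, in_ch: int, out_ch: int) -> nn.Conv2d:
        """Partial structure of a convolution: keep output channels 1..out_ch and
        the first in_ch input channels, per the first-channel sampling rule."""
        sub = nn.Conv2d(in_ch, out_ch, conv.kernel_size,
                        stride=conv.stride, padding=conv.padding)
        sub.weight.data.copy_(conv.weight.data[:out_ch, :in_ch])
        sub.bias.data.copy_(conv.bias.data[:out_ch])
        return sub

    def slice_fc(fc: nn.Linear, in_dim: int) -> nn.Linear:
        """Adapt the fully connected layer to the last-stage channel count; the
        parameter count shrinks with the same compression ratio (e.g. 1024 x 512
        becomes 1024 x 256 when the last stage drops from 512 to 256 channels)."""
        sub = nn.Linear(in_dim, fc.out_features)
        sub.weight.data.copy_(fc.weight.data[:, :in_dim])
        sub.bias.data.copy_(fc.bias.data)
        return sub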
Step S1300, performing joint training on each sub-network model corresponding to the plurality of sub-network structures together with the super-network model, to obtain sub-network models trained to convergence as candidate models;
Each sub-network model corresponding to the plurality of sub-network structures is jointly trained with the super-network model, with the weights of the convolution module structures shared by each sub-network model and the super-network model being shared between them, so that the weights of each sub-network model are optimized during training. In one embodiment, the loss function used in the joint training is given by the following exemplary formula:
$$\mathrm{Loss}_{ofa}=\mathrm{Loss}_{supernet}+\sum_{i=1}^{n}\mathrm{Loss}_{subnet}^{(i)}$$

wherein: Loss_ofa is the loss value of the joint training of the super-network model and the sub-network models, n is the number of sub-networks that can be sampled in a single weight update, i indexes a sub-network sampled from the sampling space, Loss_subnet^(i) is the loss value of the i-th sub-network model, and Loss_supernet is the loss value of the super-network model.
The loss values of the sub-network model and the super-network model may be any loss value or a mixture of loss values such as a regression loss value and a cross-entropy loss value known to those skilled in the art.
Each sub-network model and the super-network model are jointly trained with a preset training set. Understandably, the prediction error of each sub-network model can be judged through Loss_ofa: when Loss_ofa meets the preset threshold, the sub-network models are determined to be trained to convergence, and the sub-network models trained to convergence are obtained as candidate models. The training set may be constructed by collecting a plurality of video streams of different video quality, each video stream being labeled with a corresponding ground-truth label, which may be obtained by manually scoring the video stream.
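Based on the formula as reconstructed above, a minimal sketch of the joint loss computation might look as follows; the function and argument names are ours, and subnet_losses is assumed to hold the loss values of the n sub-networks sampled for the current weight update:

    import torch

    def ofa_loss(supernet_loss: torch.Tensor, subnet_losses: list) -> torch.Tensor:
        """Joint loss: the super-network loss plus the losses of the n
        sub-networks sampled in this weight update."""
        return supernet_loss + torch.stack(subnet_losses).sum()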
Step S1400, verifying the performance of each candidate model, and screening out at least one candidate model as the video quality evaluation model.
In one embodiment, a network structure search is performed for each candidate model within the search space of sub-network structures, the performance of each candidate model is verified with a preset validation set, and for the backbone module of each candidate model's network structure, the channel count of each convolution module, the loss value and the running time are recorded. Further, the loss value and the running time are used as performance evaluation indexes of the candidate models, and at least one candidate model whose loss value and running time reach the preset thresholds is screened out as the video quality evaluation model, where the thresholds can be flexibly set by those skilled in the art according to service needs.
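A hedged sketch of such a screening loop is given below; timing a single forward pass on representative validation frames and comparing the loss and latency against thresholds is our reading of the procedure, and the threshold values are placeholders:

    import time
    import torch

    @torch.no_grad()
    def measure(model, val_frames, supervision_scores, loss_fn):
        """Return (validation loss, running time in seconds) for one candidate."""
        start = time.perf_counter()
        preds = model(val_frames)
        elapsed = time.perf_counter() - start
        return loss_fn(preds, supervision_scores).item(), elapsed

    def screen(candidates, val_frames, scores, loss_fn,
               max_loss=0.1, max_seconds=0.05):  # placeholder thresholds
        """Keep candidates whose loss value and running time reach the thresholds."""
        kept = []
        for name, model in candidates.items():
            loss, secs = measure(model, val_frames, scores, loss_fn)
            if loss <= max_loss and secs <= max_seconds:
                kept.append((name, loss, secs))
        return kept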
The video quality evaluation model is used for predicting no-reference scores for video streams. Accordingly, in one embodiment, the video quality evaluation model can be deployed in a media server corresponding to webcast live streaming in an e-commerce platform to provide corresponding services, such as predicting no-reference scores for the video streams generated by live broadcasts on the platform. A video stream consists of a plurality of temporally successive image frames; decoding a video stream yields the data of its image frames, which serve as the sample data input into the model for feature extraction when training the model.
As can be appreciated from the exemplary embodiments of the present application, the technical solution of the present application has various advantages, including but not limited to the following aspects:
on one hand, the network structure of the super-network model is sampled through the sampling space to determine corresponding sub-network models; the sub-network models are jointly trained with the super-network model, and the sub-network models trained to convergence are obtained as candidate models; the performance of each candidate model is then verified, and the candidate models with better performance are screened out as video quality evaluation models. It can be understood that the network structure of a sub-network model determined by sampling from the sampling space is a simplified version of the network structure of the super-network model; that is, a lightweight sub-network structure is realized, which can effectively improve the operating efficiency of the finally output video quality evaluation model, and a lightweight model is also easier to train to convergence, enabling efficient training. In addition, thanks to the joint training, the accuracy and generalization capability of the obtained candidate models can be guaranteed.
On the other hand, the video quality evaluation model produced by the method can perform no-reference video quality evaluation on video streams, which is intelligent and efficient and does not require large labor costs.
Referring to fig. 5, in a further embodiment, the training process of the super-network model in step S1100 includes the following steps:
step S1110, acquiring a plurality of image frames of the video in a single training sample, performing data enhancement processing on the image frames to obtain sample data, and inputting the sample data into the super-network model;
in one embodiment, according to hard indexes of video quality such as bit rate, frame rate and resolution, sufficient video streams are collected and sorted out manually to serve as the training set; each video stream serves as a single training sample, and each single training sample is labeled with a corresponding subjective quality score for its video stream, the subjective quality score serving as the supervision score.
A video stream of a single training sample is obtained from the training set and decoded into a plurality of corresponding image frames, and sample data are obtained after data enhancement processing of these image frames, where the data enhancement processing may be mirroring, rotation, scaling, brightness adjustment, contrast adjustment, Gaussian noise, Mosaic, Mixup, Cutout, CutMix, and the like.
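By way of illustration, a sketch of such a data enhancement pipeline using torchvision follows; the transform parameters are arbitrary, and batch-level augmentations such as Mosaic, Mixup, Cutout and CutMix are only noted in a comment, since they operate on batches or label pairs rather than single frames:

    import torch
    from torchvision import transforms

    # Illustrative per-frame augmentation pipeline for decoded image frames;
    # parameter values are arbitrary. Mosaic/Mixup/Cutout/CutMix would be
    # applied separately at batch level.
    augment = transforms.Compose([
        transforms.ToPILImage(),
        transforms.RandomHorizontalFlip(p=0.5),                # mirroring
        transforms.RandomRotation(degrees=10),                 # rotation
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # scaling
        transforms.ColorJitter(brightness=0.2, contrast=0.2),  # brightness, contrast
        transforms.ToTensor(),
        transforms.Lambda(lambda t: (t + 0.01 * torch.randn_like(t)).clamp(0, 1)),  # Gaussian noise
    ])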
Step S1120, after the super-network model extracts the image feature information of the sample data, outputting, through the prediction module, a predicted quality score corresponding to the sample data;
The backbone module of the super-network model extracts visual features from the sample data and outputs them as the image feature information; the global average pooling layer of the model then spatially averages the image feature information and feeds it to the prediction module of the model, where the fully connected layer flattens it into a one-dimensional vector along the corresponding channel direction and maps the vector to the corresponding classification space, obtaining the predicted quality score corresponding to the sample data.
Step S1130, calculating the loss value of the super-network model, updating the weights of the model when the loss value does not reach the preset threshold, and continuing to call other training samples for iterative training until the model converges.
The loss value of the super-network model is calculated with a preset loss function, which comprises the calculation of the regression loss value, the out-of-order loss value and the cross-entropy loss value of the model; concrete implementations are further disclosed in some of the embodiments below.
Whether the loss value of the super-network model reaches the preset threshold is judged. When it does not, the weights of the model are updated, namely the corresponding weights of the backbone module of the super-network model, and other training samples in the training set are called to continue iterative training; when the loss value of the super-network model reaches the preset threshold, the super-network model is deemed trained to convergence.
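A minimal training-loop sketch consistent with this description follows, assuming the SuperNet class and a loss function as sketched elsewhere in this document, and a data loader yielding (frames, supervision scores) batches; the optimizer choice, learning rate and threshold value are placeholders:

    import torch

    def train_supernet(model, loader, loss_fn, threshold=0.05, max_epochs=100):
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder optimizer
        for epoch in range(max_epochs):
            for frames, y_gt in loader:
                y_out = model(frames)         # predicted quality scores
                loss = loss_fn(y_out, y_gt)   # regression + cross-entropy + rank
                if loss.item() <= threshold:  # preset threshold reached:
                    return model              # model deemed converged
                opt.zero_grad()
                loss.backward()               # otherwise update the model weights
                opt.step()
        return model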
This embodiment discloses the training process of the super-network model. It can be seen that once the super-network model is trained to convergence, it can perform no-reference evaluation on a video stream and predict a corresponding quality score. Therefore, in actual application scenarios, it can replace manually defined standards and evaluate the quality of video streams intelligently and quickly, which is highly efficient.
Referring to fig. 6, in the preferred embodiment, the step S1130 of calculating the loss value includes the following steps:
step S1131, calculating, according to the supervision score corresponding to the sample data, a regression loss value of the predicted quality score that the super-network model produces for the sample data, the supervision score being a subjective quality score;
in one embodiment, the supervision score, i.e., the subjective quality score, may be obtained by having multiple persons subjectively score the video quality of each image frame in the video stream corresponding to the training sample and averaging these subjective scores.
For each image frame in the video stream corresponding to the sample data, the supervision score and the predicted quality score that the super-network model produces for that image frame are compared, and the absolute value of their difference is used as the regression loss value, an exemplary formula being:
$$\mathrm{Loss}_{mos}=\frac{1}{n}\sum_{i=1}^{n}\left|y_{out}^{i}-y_{gt}^{i}\right|$$

wherein: Loss_mos is the regression loss value, n is the number of image frames in the video stream, equal to batch_size, y_out^i is the predicted quality score the super-network model produces for the i-th image frame of the video stream corresponding to the sample data, and y_gt^i is the pre-labeled supervision score of the i-th image frame.
Step S1132, calculating a corresponding cross entropy loss value according to the regression loss value;
based on the regression Loss value, 1-Loss mos As the positive sample output probability, the corresponding cross entropy loss value is calculated, and an exemplary formula is as follows:
$$\mathrm{Loss}_{ce}=-\frac{1}{n}\sum_{i=1}^{n}\log\left(1-\left|y_{out}^{i}-y_{gt}^{i}\right|\right)$$

wherein: Loss_ce is the cross-entropy loss value, n is the number of image frames in the video stream, equal to batch_size, y_out^i is the predicted quality score the super-network model produces for the i-th image frame, and y_gt^i is its pre-labeled supervision score.
Step S1133, applying the same out-of-order (shuffling) processing to the predicted quality scores that the super-network model produces for the sample data and to the corresponding supervision scores; taking the difference between each predicted quality score and its shuffled counterpart as the input of a preset function; calculating the sum of the absolute difference between each supervision score and its shuffled counterpart and the function output; and taking the maximum of this sum and 0 as the out-of-order loss value;
the accuracy of model prediction is improved by improving the difference between prediction quality scores obtained by predicting sample data by the hyper-network model. Accordingly, the prediction quality scores obtained by the image frames in the video stream corresponding to the super network model prediction sample data and the supervision scores corresponding to the image frames in the video stream corresponding to the sample data are respectively subjected to the same disorder processing, and the disordered prediction quality scores and supervision scores after the disorder processing are correspondingly obtained, for example, the prediction quality scores are [1,2,3,4,5], the supervision scores are [2,3,3,4,3], the disordered prediction quality scores after the disorder processing are [5,4,1,3,2], and the disordered supervision scores after the disorder processing are [3,4,2,3,3 ]. Further, the difference value between the prediction score and the disordered prediction quality score is used as the input of a preset function, the preset function is sgn, and the sgn is a sign function and can indicate the positive and negative of the input content. Then, the sum of the absolute value of the difference between the supervision score and the out-of-order supervision score after out-of-order processing and the function output result is calculated, and the maximum value between the sum and 0 is taken as the out-of-order loss value, wherein an exemplary formula is as follows:
$$\mathrm{Loss}_{rank}=\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\;\left|y_{gt}^{i}-r_{gt}^{i}\right|-\operatorname{sgn}\left(y_{out}^{i}-r_{out}^{i}\right)\cdot\left(y_{out}^{i}-r_{out}^{i}\right)\right)$$

wherein: Loss_rank is the out-of-order loss value, n is the number of image frames in the video stream, equal to batch_size, y_gt^i is the pre-labeled supervision score of the i-th image frame of the video stream corresponding to the sample data, r_gt^i is the shuffled supervision score after the out-of-order processing, y_out^i is the predicted quality score the super-network model produces for the i-th image frame, and r_out^i is the shuffled predicted quality score after the out-of-order processing.
Those skilled in the art will appreciate that, in the formula above, |y_gt - r_gt| provides the adaptive magnitude of the required change in the shuffled differences of the predicted quality scores, while sgn(y_out - r_out) provides the direction of that change.
Step S1134, calculating the sum of the regression loss value, the cross-entropy loss value and the out-of-order loss value as the loss value.
The regression loss value, the cross-entropy loss value and the out-of-order loss value are summed to obtain the loss value, an exemplary formula being:
$$\mathrm{Loss}=\mathrm{Loss}_{mos}+\mathrm{Loss}_{ce}+\mathrm{Loss}_{rank}$$

wherein: Loss is the loss value corresponding to the super-network model's prediction on the sample data, Loss_mos is the regression loss value, Loss_ce is the cross-entropy loss value, and Loss_rank is the out-of-order loss value.
This embodiment discloses the calculation of the loss value corresponding to the super-network model's prediction on the sample data, together with the three loss values it comprises. It can be understood that, on one hand, the loss value calculated in this way accurately reflects the accuracy of model prediction; on the other hand, when training the super-network model, it accelerates model convergence and improves training efficiency.
Referring to fig. 7, in a further embodiment, the joint training process of step S1300 includes the following steps:
step S1310, acquiring a plurality of image frames of the video in a single training sample, performing data enhancement on the image frames to obtain sample data, and synchronously inputting the sample data into the super-network model and each sub-network model;
in one embodiment, according to hard indexes of video quality such as bit rate, frame rate and resolution, sufficient video streams are collected and sorted out manually to serve as the training set; each video stream serves as a single training sample, and each single training sample is labeled with a corresponding subjective quality score for its video stream, the subjective quality score serving as the supervision score.
A video stream of a single training sample is obtained from the training set and decoded into a plurality of corresponding image frames, and sample data are obtained after data enhancement processing of these image frames, where the data enhancement processing may be mirroring, rotation, scaling, brightness adjustment, contrast adjustment, Gaussian noise, Mosaic, Mixup, Cutout, CutMix, and the like.
Step S1320, after the super-network model and each sub-network model respectively extract the image feature information of the sample data, predicting, through the corresponding prediction modules, the quality score corresponding to the sample data;
The trunk modules of the hyper-network model and of each sub-network model extract visual features from the sample data and output them as the image feature information. The global average pooling layer of each model spatially averages the image feature information before it enters that model's prediction module, where the fully connected layer expands the pooled features along the channel direction into a one-dimensional vector and maps it to the corresponding classification space, obtaining the predicted quality score corresponding to the sample data.
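A minimal sketch of this prediction path follows, assuming a PyTorch module; QualityPredictor, the backbone argument and the layer sizes are illustrative placeholders rather than the patent's actual architecture.

```python
import torch
import torch.nn as nn

class QualityPredictor(nn.Module):
    def __init__(self, backbone: nn.Module, feat_channels: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                         # trunk module: visual features
        self.gap = nn.AdaptiveAvgPool2d(1)               # global average pooling layer
        self.fc = nn.Linear(feat_channels, num_classes)  # fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)             # image feature information
        pooled = self.gap(feats).flatten(1)  # spatial average, expanded to a 1-D vector
        return self.fc(pooled)               # mapped to the classification space
```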
Step S1330, calculating a corresponding joint loss value according to the loss values corresponding to the super network model and the sub network models, updating the corresponding implementation weights of the sub network models when the joint loss value does not reach a preset threshold, and continuing to invoke other training samples to implement iterative training.
The loss value of the hyper-network model is calculated according to steps S1131-S1134. Those skilled in the art will appreciate that each sub-network model can likewise calculate its corresponding loss value according to the disclosure of steps S1131-S1134. Further, a corresponding joint loss value is calculated from the loss values of the hyper-network model and of each sub-network model; an exemplary formula is as follows:
Loss_ofa = Loss_supernet + Σ_{i=1}^{n} Loss_subnet(i)
wherein: Loss_ofa is the loss value of the joint training of the hyper-network model and the sub-network models; n is the number of sub-networks that can be sampled in a single weight update; i indexes a sub-network sampled from the sampling space; Loss_subnet(i) is the loss value of the i-th sub-network model; and Loss_supernet is the loss value of the hyper-network model.
Whether the joint loss value of the hyper-network model and each sub-network model reaches a preset threshold is judged. When it does not, the weights of the hyper-network model and of each sub-network model are updated; because each sub-network model shares the structure weights of the convolution modules of the hyper-network model, this update covers the trunk modules of the hyper-network model and of the corresponding sub-network models. Other training samples in the training set are then invoked to continue the iterative training. When the joint loss value reaches the preset threshold, the hyper-network model and each sub-network model are deemed trained to convergence.
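One joint-training iteration might be sketched as follows, assuming the supernet and the sampled subnets share convolution weights so that a single optimizer step over the summed loss updates them together; sample_subnet and compute_loss are hypothetical helpers.

```python
import torch

def joint_step(supernet, optimizer, frames, scores, n_subnets=2):
    # sample_subnet / compute_loss: hypothetical helpers, see the text above
    optimizer.zero_grad()
    loss = compute_loss(supernet(frames), scores)           # Loss_supernet
    for _ in range(n_subnets):                              # i = 1..n sampled sub-networks
        subnet = sample_subnet(supernet)                    # shares the trunk weights
        loss = loss + compute_loss(subnet(frames), scores)  # add Loss_subnet(i)
    loss.backward()          # gradients accumulate in the shared structure weights
    optimizer.step()
    return loss.item()
```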
In this embodiment, the joint training process of each sub-network model with the hyper-network model is disclosed, so that the weights of each sub-network model are optimized during training. It can further be understood that once each sub-network model is trained to convergence, it can evaluate a video stream without a reference and predict the corresponding quality score. In an actual application scenario, the method can therefore replace manually defined standards and evaluate video stream quality intelligently and quickly, which is highly efficient.
Referring to fig. 8, in a further embodiment, the step S1400 of verifying the performance of each candidate model and screening out at least one candidate model as the video quality assessment model includes the following steps:
Step S1410, performing performance measurement on each candidate model by adopting a proxy validation set to obtain the comprehensive performance corresponding to each candidate model, wherein the comprehensive performance comprises running time and/or accuracy index;
Part of the training samples used in training the hyper-network model serve as the proxy validation set; optionally, 10% of the training samples are extracted from the training set by random sampling to obtain this subset.
The proxy validation set is input into each candidate model for performance measurement, obtaining the running time each candidate model requires for prediction together with its loss value; those skilled in the art will appreciate that each candidate model can likewise calculate its corresponding loss value according to the disclosure of steps S1131-S1134. Further, (1 − loss value) × 100% is calculated as the accuracy index, and the accuracy index and/or the running time is taken as the comprehensive performance corresponding to each candidate model.
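A minimal sketch of this measurement follows, assuming hypothetical compute_loss and proxy_loader helpers; timing by wall clock and averaging the loss over batches are illustrative choices.

```python
import time
import torch

@torch.no_grad()
def measure(candidate, proxy_loader):
    # compute_loss / proxy_loader: hypothetical helpers, see the text above
    total_loss, start = 0.0, time.perf_counter()
    for frames, scores in proxy_loader:
        total_loss += compute_loss(candidate(frames), scores).item()
    run_time = time.perf_counter() - start                 # running time
    accuracy = (1 - total_loss / len(proxy_loader)) * 100  # accuracy index in percent
    return run_time, accuracy
```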
Step S1420, selecting the candidate model whose comprehensive performance meets the preset conditions and outputting it as the video quality evaluation model.
Comprehensive performance indexes may be preset for the comprehensive performance, for example an accuracy index of 85% and/or a running time of 0.3 s. Of course, those skilled in the art can flexibly set the comprehensive performance indexes according to actual service requirements. At least one candidate model meeting the comprehensive performance indexes is selected and output as the video quality evaluation model.
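The selection itself reduces to a filter over the measured results; in this minimal sketch, results is a hypothetical mapping from candidate model to its (running time, accuracy index) pair, and the thresholds mirror the example above.

```python
# results: hypothetical {model: (run_time_seconds, accuracy_percent)} mapping
selected = [model for model, (run_time, accuracy) in results.items()
            if accuracy >= 85.0 and run_time <= 0.3]  # preset comprehensive performance indexes
```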
In this embodiment, the performance of the candidate models is measured with the proxy validation set, so that candidate models whose performance reaches the standard are selected and output as the video quality evaluation model. The video quality evaluation model is thereby chosen reasonably and preferentially, so that a considerable effect can be obtained when it is put into use in an actual service scenario.
Referring to fig. 9, in a further embodiment, step S1410 of performing performance measurement on each candidate model using the proxy validation set to obtain the comprehensive performance corresponding to each candidate model includes the following steps:
step S1411, channel searching is carried out from the sub-network structures corresponding to the candidate models, and a plurality of compression candidate models corresponding to the sub-network structures of the target compression channels are determined;
A search space corresponding to the sub-network structures may be established based on the sampling space; for the specific implementation, refer to the corresponding disclosure in step S1100. Accordingly, a channel search is performed over the sub-network structures corresponding to the candidate models according to this search space, determining a plurality of compression candidate models whose sub-network structures lie within the target compression channels of the convolution modules in the search space. For example, if the number of channels of the last-stage convolution module of a candidate model's trunk module is 256, and the target compression channel of that last-stage convolution module in the search space is [256,384], the last-stage convolution module is determined to be within its target compression channel; when the convolution modules of the other stages of the candidate model's trunk module are likewise within their respective target compression channels, the candidate model is determined to be a compression candidate model.
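A minimal sketch of this channel check follows; the interval representation of the target compression channels and the channels_of helper are illustrative assumptions.

```python
def in_target_channels(stage_channels, target_intervals):
    """stage_channels, e.g. [64, 128, 256]; target_intervals, e.g. [(48, 96), (96, 192), (256, 384)]."""
    return all(lo <= c <= hi for c, (lo, hi) in zip(stage_channels, target_intervals))

# candidates / channels_of / target_intervals: hypothetical, see the text above.
# Keep only candidates whose every trunk stage lies in its target compression channel.
compression_candidates = [m for m in candidates
                          if in_target_channels(channels_of(m), target_intervals)]
```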
Step S1412, inputting the training samples in the proxy verification set to the plurality of compressed candidate models correspondingly, and calculating corresponding running time and accuracy indexes;
As before, part of the training samples used in training the hyper-network model may serve as the proxy validation set; optionally, 10% of the training samples are randomly sampled from the training set to obtain this subset.
The proxy validation set is input into each compression candidate model for performance measurement, obtaining the running time each compression candidate model requires for prediction together with its loss value; those skilled in the art will appreciate that each compression candidate model can likewise calculate its corresponding loss value according to the disclosure of steps S1131-S1134. Further, (1 − loss value) × 100% is calculated as the accuracy index.
Step S1413, determining the comprehensive performance according to the running time and the accuracy index.
Further, the accuracy index and the running time are taken as the comprehensive performance corresponding to each compression candidate model.
In this embodiment, compression candidate models are searched out of the candidate models through the search space, so that the candidate models can be compressed into more lightweight models.
Referring to fig. 10, in a further embodiment, after the step of determining the comprehensive performance according to the running time and the accuracy index in step S1413, the method further includes the following steps:
step S1414, multiplexing weights corresponding to the shared network structure in the super network model for the network structure shared by each compression candidate model and the super network model;
For the structure that each compression candidate model shares with the convolution modules in the backbone network of the hyper-network model, the weights of the hyper-network model corresponding to the shared structure are multiplexed, for example as sketched below.
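This is a minimal sketch of the weight multiplexing, assuming the shared convolution modules can be paired by parameter name and that a narrower candidate takes the leading slice of each supernet kernel; both assumptions are illustrative.

```python
import torch

@torch.no_grad()
def multiplex_weights(candidate, supernet):
    # pairing shared modules by parameter name is an assumption for illustration
    super_params = dict(supernet.named_parameters())
    for name, p in candidate.named_parameters():
        if name in super_params:  # network structure shared with the supernet
            w = super_params[name]
            # copy the leading channels when the candidate's kernel is narrower
            p.copy_(w[tuple(slice(0, s) for s in p.shape)])
```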
And step S1415, performing combined training on each compression candidate model and the hyper-network model, selecting the compression candidate model which meets the comprehensive performance of the preset condition after training, and outputting the compression candidate model as a video quality evaluation model.
The compression candidate models and the hyper-network model are jointly trained; for the specific implementation, refer to the disclosure of steps S1310-S1330, which is not repeated here. Further, comprehensive performance indexes may be preset for the compression candidate models, for example an accuracy index of 85% and a running time of 0.3 s. Of course, those skilled in the art can flexibly set the comprehensive performance indexes according to actual service requirements. At least one compression candidate model meeting the comprehensive performance indexes is selected to generate the video quality evaluation model.
In this embodiment, jointly training the compression candidate models further improves their accuracy and generalization capability, so that a considerable effect can be obtained when they are put into use in an actual business scenario.
Referring to fig. 11, a video quality evaluation model production apparatus adapted to one of the objectives of the present application is a functional implementation of the video quality evaluation model production method of the present application. The apparatus includes a sampling acquisition module 1100, a structure sampling module 1200, a joint training module 1300, and a performance verification module 1400, wherein: the sampling acquisition module 1100 is used for acquiring a sampling space corresponding to the network structure of a hyper-network model pre-trained to convergence, and a search space of sub-network structures corresponding to the sampling space; the structure sampling module 1200 is used for determining a plurality of sub-network structures by sampling from the sampling space according to the network structure of the hyper-network model, obtaining a sub-network model corresponding to each sub-network structure; the joint training module 1300 is used for jointly training each sub-network model corresponding to the plurality of sub-network structures with the hyper-network model, obtaining the sub-network models trained to convergence as candidate models; and the performance verification module 1400 is used for verifying the performance of each candidate model and screening out at least one candidate model as the video quality evaluation model.
In a further embodiment, the sampling acquisition module 1100 includes: a first sample input submodule, used for acquiring a plurality of image frames of a video in a single training sample, performing data enhancement processing on the image frames to obtain sample data, and inputting the sample data into the hyper-network model; a first quality prediction submodule, used for outputting, through a prediction module, the predicted quality score corresponding to the sample data after the hyper-network model extracts the image feature information of the sample data; and a first weight updating submodule, used for calculating the loss value of the hyper-network model, updating the weights of the model when the loss value does not reach a preset threshold, and continuing to invoke other training samples for iterative training until the model converges.
In a preferred embodiment, the first weight updating submodule includes: a regression loss submodule, used for calculating, according to the supervision score corresponding to the sample data, the regression loss value of the prediction quality score obtained when the hyper-network model predicts the sample data, the supervision score being a subjective quality score; a cross-entropy loss submodule, used for calculating a corresponding cross-entropy loss value according to the regression loss value; an out-of-order loss submodule, used for applying the same disorder processing to the prediction quality score obtained when the hyper-network model predicts the sample data and to the supervision score corresponding to the sample data, taking the difference between the prediction quality score and the out-of-order prediction quality score as the input of a preset function, calculating the sum of the absolute value of the difference between the supervision score and the out-of-order supervision score and the function output result, and taking the maximum of the sum and 0 as the out-of-order loss value; and a loss value submodule, used for calculating the sum of the regression loss value, the cross-entropy loss value and the out-of-order loss value as the loss value.
In a further embodiment, the joint training module 1300 includes: the second sample input submodule is used for acquiring a plurality of image frames of a video in a single training sample, performing data enhancement on the image frames to acquire sample data and synchronously inputting the sample data into the super network model and each sub network model; the second quality prediction submodule is used for predicting the quality score corresponding to the sample data through the prediction module after the image characteristic information of the sample data is extracted from the hyper-network model and each sub-network model respectively; and the second weight updating submodule is used for calculating corresponding combined loss values according to the loss values corresponding to the hyper-network model and the sub-network models, updating the corresponding implementation weights of the sub-network models when the combined loss values do not reach a preset threshold value, and continuously calling other training samples to implement iterative training.
In a further embodiment, the performance verification module 1400 includes: a performance calculation submodule, used for performing performance measurement on each candidate model by adopting a proxy validation set to obtain the comprehensive performance corresponding to each candidate model, the comprehensive performance comprising running time and/or accuracy index; and a model output submodule, used for selecting the candidate model whose comprehensive performance meets the preset conditions and outputting it as the video quality evaluation model.
In a further embodiment, the performance calculation submodule includes: a channel searching unit, used for performing a channel search over the sub-network structures corresponding to the candidate models and determining a plurality of compression candidate models whose sub-network structures lie within the target compression channels; a model performance estimation unit, used for inputting the training samples of the proxy validation set correspondingly to the plurality of compression candidate models and calculating the corresponding running time and accuracy indexes; and a performance determining unit, used for determining the comprehensive performance according to the running time and the accuracy index.
In a further embodiment, the performance calculation submodule further includes, after the performance determining unit: a weight multiplexing unit, used for multiplexing, for the network structure shared by each compression candidate model and the hyper-network model, the weights corresponding to the shared network structure in the hyper-network model; and a model output unit, used for jointly training each compression candidate model with the hyper-network model, selecting the trained compression candidate models whose comprehensive performance meets the preset conditions, and outputting them as the video quality evaluation model.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Fig. 12 schematically illustrates the internal structure of the computer device. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer-readable storage medium stores an operating system, a database and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a video quality evaluation model production method. The processor provides computation and control capability and supports the operation of the whole computer device. The memory may store computer-readable instructions that, when executed by the processor, cause the processor to perform the video quality evaluation model production method of the present application. The network interface is used for connecting and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed solution and does not limit the computer devices to which the disclosed solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of each module and its submodules in fig. 11, and the memory stores the program codes and various data required for executing these modules or submodules. The network interface is used for data transmission to and from a user terminal or server. The memory stores the program codes and data required for executing all the modules/submodules of the video quality evaluation model production apparatus of the present application, and the server can invoke them to execute the functions of all the submodules.
The present application also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the video quality assessment model production method of any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a magnetic disk, an optical disk, a Read-only Memory (ROM), or a Random Access Memory (RAM).
To sum up, the present application produces the video quality evaluation model by compressing the hyper-network model through network structure search, realizing a lightweight model that can run efficiently in actual application scenarios.
Those skilled in the art will understand that the various operations, methods, steps in the flows, measures, and schemes discussed in this application can be alternated, modified, combined, or deleted. Further, other steps, measures, or schemes in the various operations, methods, or flows discussed in this application can also be alternated, modified, rearranged, decomposed, combined, or deleted. Further, prior-art steps, measures, or schemes within the operations, methods, and flows disclosed in this application can likewise be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A method for producing a video quality assessment model is characterized by comprising the following steps:
acquiring a sampling space corresponding to a network structure of a pre-trained to converged hyper-network model and a search space of a sub-network structure corresponding to the sampling space;
sampling from the corresponding sampling space according to the network structure of the hyper-network model to determine a plurality of sub-network structures, and obtaining a sub-network model corresponding to each sub-network structure;
performing joint training on each sub-network model corresponding to the plurality of sub-network structures and the super-network model to obtain a sub-network model trained to be convergent and serve as a candidate model;
and verifying the performance of each candidate model, and screening out at least one candidate model as a video quality evaluation model.
2. The method for producing a video quality assessment model according to claim 1, wherein the training process of said hyper-network model comprises the steps of:
acquiring a plurality of image frames of a video in a single training sample, performing data enhancement processing on the plurality of image frames to obtain sample data, and inputting the sample data into a super network model;
after the image characteristic information of the sample data is extracted by the hyper-network model, a quality score corresponding to the sample data is output and predicted by a prediction module;
and calculating a loss value of the hyper-network model, updating the weight of the model when the loss value of the model does not reach a preset threshold value, and continuously calling other training samples to carry out iterative training until the model converges.
3. The method for producing a video quality assessment model according to claim 2, wherein said loss value calculation comprises the steps of:
calculating a regression loss value of a prediction quality score obtained by predicting sample data by the hyper-network model according to a supervision score corresponding to the sample data, wherein the supervision score is a subjective quality score;
calculating a corresponding cross entropy loss value according to the regression loss value;
respectively carrying out the same disorder processing on the prediction quality score obtained by predicting sample data by the hyper-network model and the supervision score corresponding to the sample data, taking the difference value between the prediction quality score and the disorder prediction quality score after the disorder processing as the input of a preset function, calculating the sum value of the absolute value of the difference value between the supervision score and the disorder supervision score after the disorder processing and the function output result, and taking the maximum value of the sum value and 0 as a disorder loss value;
calculating a sum of the regression loss value, the cross entropy loss value, and the out-of-order loss value as the loss value.
4. The method for producing a video quality assessment model according to claim 1, wherein said process of joint training comprises the steps of:
obtaining a plurality of image frames of a video in a single training sample, performing data enhancement on the plurality of image frames to obtain sample data, and synchronously inputting the sample data into a super network model and each sub network model;
after the image characteristic information of the sample data is respectively extracted by the hyper-network model and each sub-network model, predicting the quality score corresponding to the sample data by a prediction module;
and calculating corresponding joint loss values according to the loss values corresponding to the hyper-network model and the sub-network models, updating corresponding implementation weights of the sub-network models when the joint loss values do not reach a preset threshold value, and continuously calling other training samples to implement iterative training.
5. The method for producing a video quality assessment model according to claim 1, wherein the step of verifying the performance of each candidate model and screening out at least one candidate model as the video quality assessment model comprises the steps of:
performing performance measurement and calculation on each candidate model by adopting an agent verification set to obtain comprehensive performance corresponding to each candidate model, wherein the comprehensive performance comprises running time and/or accuracy index;
and selecting the candidate model corresponding to the comprehensive performance meeting the preset conditions to output as a video quality evaluation model.
6. The method for producing the video quality assessment model according to claim 5, wherein the step of performing the performance measurement on each candidate model by using the proxy verification set to obtain the comprehensive performance corresponding to each candidate model comprises the following steps:
performing channel search from the sub-network structures corresponding to the candidate models, and determining a plurality of compression candidate models corresponding to the sub-network structure of the target compression channel;
correspondingly inputting training samples in the agent verification set to the plurality of compression candidate models, and calculating corresponding running time and accuracy indexes;
and determining the comprehensive performance according to the running time and the accuracy index.
7. The method for producing a video quality assessment model according to claim 6, further comprising the steps of, after the step of determining the overall performance from the run time and accuracy indicators:
multiplexing weights corresponding to the common network structure in the super network model for the network structure shared by each compression candidate model and the super network model;
and performing joint training on each compression candidate model and the hyper-network model, selecting the trained compression candidate model meeting the comprehensive performance of the preset condition, and outputting the compression candidate model as a video quality evaluation model.
8. A video quality estimation model production apparatus, comprising:
the sampling acquisition module is used for acquiring a sampling space corresponding to a network structure of a pre-trained converged hyper-network model and a search space of a sub-network structure corresponding to the sampling space;
the structure sampling module is used for sampling and determining a plurality of sub-network structures from the corresponding sampling space according to the network structure of the super-network model to obtain a sub-network model corresponding to each sub-network structure;
the combined training module is used for carrying out combined training on each sub-network model and the super-network model corresponding to the plurality of sub-network structures to obtain a sub-network model which is trained to be convergent and is used as a candidate model;
and the performance verification module is used for verifying the performance of each candidate model and screening out at least one candidate model as a video quality evaluation model.
9. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, which, when invoked by a computer, performs the steps comprised by the corresponding method.