CN112949433B - Method, device and equipment for generating video classification model and storage medium

Info

Publication number: CN112949433B
Application number: CN202110190232.8A
Authority: CN (China)
Prior art keywords: model, knowledge, learning model, loss function, knowledge learning
Legal status: Active (assumed; not a legal conclusion)
Other versions: CN112949433A (Chinese, zh)
Inventors: 黄军, 程军, 胡晓光
Current/Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd; priority to CN202110190232.8A
Published as CN112949433A; granted and published as CN112949433B

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06N 5/02: Knowledge representation; symbolic representation


Abstract

The disclosure provides a method, an apparatus, a device, and a storage medium for generating a video classification model, and relates to the field of computer technology, in particular to artificial intelligence fields such as computer vision and deep learning. The method for generating a video classification model includes: acquiring an image classification model, where the image classification model is generated by a first knowledge distillation network from an image classification data set; and training the image classification model with a second knowledge distillation network on a video classification data set to generate the video classification model. The method can improve the classification effect of the video classification model.

Description

Method, device and equipment for generating video classification model and storage medium
Technical Field
The present disclosure relates to the field of computer technology, in particular to artificial intelligence fields such as computer vision and deep learning, and more specifically to a method, an apparatus, a device, and a storage medium for generating a video classification model.
Background
Artificial Intelligence (AI) is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), covering both hardware and software. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
With the popularization of mobile devices and the great improvement of hardware performance, recording everyday life on video has gradually become a common habit. As video data grows rapidly, video classification models can be employed to classify videos so that they can be stored, managed, and so on according to their categories. To reduce the number of parameters and improve practicality, a video classification model can be generated by knowledge distillation. A knowledge distillation network comprises a knowledge migration model and a knowledge learning model.
In the related art, a single knowledge distillation pass based on a video classification data set is performed: the video classification model to be generated serves as the knowledge learning model, the output of the knowledge migration model serves as the supervision information, and the video classification model is obtained after the knowledge learning model is trained.
Disclosure of Invention
The disclosure provides a method, a device, equipment and a storage medium for generating a video classification model.
According to an aspect of the present disclosure, there is provided a method for generating a video classification model, including: acquiring an image classification model, wherein the image classification model is generated by adopting a first knowledge distillation network according to an image classification data set; and training the image classification model by adopting a second knowledge distillation network according to the video classification data set so as to generate a video classification model.
According to another aspect of the present disclosure, there is provided an apparatus for generating a video classification model, including: the acquisition module is used for acquiring an image classification model, and the image classification model is generated by adopting a first knowledge distillation network according to an image classification data set; and the generating module is used for training the image classification model by adopting a second knowledge distillation network according to the video classification data set so as to generate the video classification model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above aspects.
According to the technical scheme, the classification effect of the video classification model can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic illustration according to a fifth embodiment of the present disclosure;
fig. 6 is a schematic diagram of an electronic device for implementing any one of the methods of generating a video classification model according to the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Knowledge distillation is a common method for model compression. Unlike pruning and quantization, knowledge distillation trains a small, lightweight model using the supervision information of a larger, better-performing model, so that the small model achieves better performance and accuracy. The large model is called the knowledge migration model or teacher model, and the small model is called the knowledge learning model or student model. The supervision information obtained from the output of the teacher model is called knowledge, and the process by which the student model learns to migrate the supervision information from the teacher model is called distillation.
Some video knowledge distillation schemes exist in the related art, such as cross-modal distillation and optical flow distillation. However, these schemes perform only one stage of knowledge distillation: a single knowledge distillation network, comprising a knowledge migration network and a knowledge learning network, receives the video classification data set as input to both networks; the knowledge learning network is trained using the output of the knowledge migration network as supervision information, and the trained knowledge learning network serves as the video classification model.
Because the amount of data in video classification data sets is small, training a video classification model on such a data set alone yields a poor classification effect. Research in the image field is relatively mature, so the classification effect of a video classification model can be improved by introducing image-domain knowledge when training it. To this end, the present disclosure provides the following embodiments.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The embodiment provides a method for generating a video classification model, which comprises the following steps:
101. and acquiring an image classification model, wherein the image classification model is generated by adopting a first knowledge distillation network according to the image classification data set.
102. And training the image classification model by adopting a second knowledge distillation network according to the video classification data set so as to generate a video classification model.
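The two steps above can be sketched as a toy two-stage pipeline. The sketch below is purely illustrative (all names and the one-parameter "models" are hypothetical, not the disclosed implementation): a fixed teacher supervises a student in stage one, and the stage-one student initializes the stage-two student, which is then supervised by a video-domain teacher.

```python
# Toy sketch of the two-stage distillation pipeline (hypothetical; the real
# models are deep networks, here replaced by one-parameter linear "models").

def distill(teacher, student_params, dataset, lr=0.1, epochs=50):
    """Train the student against the teacher's outputs (the supervision
    information); the teacher itself stays fixed throughout."""
    for _ in range(epochs):
        for x in dataset:
            target = teacher(x)                    # teacher output = "knowledge"
            pred = student_params["w"] * x         # student output
            # gradient step on the squared error between student and teacher
            student_params["w"] -= lr * 2 * (pred - target) * x
    return student_params

dataset = [0.5, 1.0, 1.5]

# Stage 1: image-domain distillation produces the "image classification model".
image_teacher = lambda x: 3.0 * x                  # fixed, high-accuracy teacher
image_model = distill(image_teacher, {"w": 0.0}, dataset)

# Stage 2: the stage-1 result pre-trains (initializes) the video student,
# which is then distilled from a video-domain teacher.
video_teacher = lambda x: 2.5 * x
video_model = distill(video_teacher, dict(image_model), dataset)
```

In this toy setting the stage-one student converges to the image teacher, and the stage-two student starts from that solution rather than from scratch, mirroring the pre-training idea described below.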
The executing entity of this embodiment may be a single device, such as a server.
In some embodiments, the image classification model may be acquired by that device from another device; alternatively, it may be obtained from the device itself. Further, the image classification model may be generated online, for example generated first whenever a video classification model is to be generated; alternatively, it may be generated offline, in which case a previously generated image classification model can be obtained directly when a video classification model is to be generated.
Since the research in the image field is relatively mature and there are many image classification models with excellent performance, the knowledge of the image classification model can be transferred to the video classification model, so that the excellent performance of the image classification model can be utilized.
When generating the video classification model, the image classification model is used, for example, as a pre-training model of the video classification model, and fine-tuning is performed on the basis of this pre-training model to obtain the video classification model.
In this embodiment, both the image classification model and the video classification model are obtained by knowledge distillation, which improves the classification effect of both models while reducing their parameter counts. During the video distillation stage, the video classification model is generated by further training the image classification model; that is, the image classification model serves as a pre-training model for the video classification model, and image-domain knowledge is migrated to the video classification model through pre-training. This transfers image-domain knowledge to the video classification model simply and conveniently, improving its classification effect. In addition, because the video classification model is trained from a pre-training model, i.e., the pre-training model initializes the video classification model, and that pre-training model is an already-trained image classification model, the number of training iterations, and therefore the training time, is reduced compared with random initialization.
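A minimal sketch of the pre-training-based initialization described above, assuming hypothetical parameter names: weights shared with the image classification model are copied over, while video-only weights keep their fresh initialization.

```python
# Hypothetical sketch: initialize the video classification model from the
# image classification model rather than from random weights. Parameter
# names here are invented for illustration.

def init_from_pretrained(video_params, image_params):
    """Copy every weight the two models share; video-only weights
    (e.g. temporal layers) keep their fresh initialization."""
    for name, value in image_params.items():
        if name in video_params:
            video_params[name] = value
    return video_params

image_params = {"conv1": [0.2, -0.1], "fc": [0.7]}           # trained (stage 1)
video_params = {"conv1": [0.0, 0.0], "fc": [0.0],
                "temporal_shift": [0.0]}                     # fresh video model
video_params = init_from_pretrained(video_params, image_params)
```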
An image classification dataset, such as ImageNet, and a video classification dataset, such as Kinetics-400, are existing datasets.
Further, the first knowledge distillation network may include a first knowledge migration model and a first knowledge learning model, where the first knowledge learning model has fewer parameters than the first knowledge migration model and the first knowledge migration model has higher accuracy than the first knowledge learning model. The first knowledge learning model is trained based on the output of the first knowledge migration model, and the trained first knowledge learning model serves as the image classification model. This yields an image classification model that is both highly practical (few parameters) and highly accurate.
Further, the second knowledge distillation network may include a second knowledge migration model and a second knowledge learning model, where the second knowledge learning model has fewer parameters than the second knowledge migration model and the second knowledge migration model has higher accuracy than the second knowledge learning model. The pre-training model of the second knowledge learning model is the image classification model. The second knowledge learning model is trained based on the output of the second knowledge migration model, and the trained second knowledge learning model serves as the video classification model. This yields a video classification model that incorporates image-domain knowledge and is both highly practical and highly accurate.
Fig. 2 is a schematic diagram of a second embodiment of the present disclosure, where this embodiment provides a method for generating a video classification model, as shown in fig. 2 in combination with fig. 3, where the method includes:
201. an image classification dataset is constructed.
For example, a very large scale image classification dataset is constructed based on ImageNet.
202. A first knowledge migration model and a first knowledge learning model are selected.
The structures of the first knowledge migration model and the first knowledge learning model should be as similar as possible, while the accuracy of the first knowledge migration model should exceed that of the first knowledge learning model by as much as possible.
An image classification model can be generated based on the first knowledge migration model and the first knowledge learning model. In the embodiments of the present disclosure, unless otherwise specified, "the image classification model" refers to the model trained at the end of the first stage, which serves as the pre-training model in the second stage. Specifically: in the first stage, the first knowledge learning model is trained based on the first knowledge migration model, and both models are image classification models. The first knowledge migration model is an image classification model trained before the first stage, with higher accuracy; the first knowledge learning model is the image classification model trainable in the first stage, i.e., the one whose parameters are adjusted; and the image classification model finally obtained in the first stage is the first knowledge learning model after that training.
The first knowledge migration model includes, but is not limited to, ResNeXt101_32x16d_wsl, and the first knowledge learning model includes, but is not limited to, ResNet50_vd.
203. The image classification data set is used as input to both the first knowledge migration model and the first knowledge learning model; the first knowledge migration model processes the image classification data set to produce the first knowledge migration model output, and the first knowledge learning model processes it to produce the first knowledge learning model output.
204. And constructing a first loss function according to the output of the first knowledge migration model and the output of the first knowledge learning model, and training the first knowledge learning model according to the first loss function to generate an image classification model.
Specifically, training the first knowledge learning model according to the first loss function to generate the image classification model may include: updating parameters of the first knowledge learning model according to the first loss function, and fixing the parameters of the first knowledge migration model until the first loss function is converged; and determining a first knowledge learning model when the first loss function is converged as the image classification model.
When constructing the first loss function, the loss appropriate to the chosen knowledge distillation scheme can be used. For example, with the Simple Semi-supervised Label Distillation (SSLD) scheme, constructing the first loss function may include: applying a first activation to the first knowledge migration model output to obtain a first soft label, applying a second activation to the first knowledge learning model output to obtain a first soft prediction, and calculating the first loss function from the first soft label and the first soft prediction, where the first loss function is the JS divergence (Jensen-Shannon divergence).
Taking SSLD as an example, and referring to fig. 4: the input passes through the knowledge migration model and the knowledge learning model, producing the knowledge migration model output and the knowledge learning model output respectively; each output is then activated with softmax, yielding the soft label and the soft prediction; a loss function is calculated from the soft label and the soft prediction; and the parameters of the knowledge learning model are updated based on the loss function while the parameters of the knowledge migration model are kept fixed, until the loss function converges, completing the training of the knowledge learning model. M and N are the numbers of layers of the knowledge migration model and the knowledge learning model respectively; both are positive integers, and generally M is larger than N.
It is understood that, in the training of the image classification model, the input is the image classification data set, the knowledge transfer model and the knowledge learning model are respectively referred to as a first knowledge transfer model and a first knowledge learning model, and the output is respectively referred to as: the method comprises the steps of first knowledge migration model output, first knowledge learning model output, a first soft label and first soft prediction, wherein a loss function is called a first loss function. When the video classification model is trained, the input is a video classification data set, the knowledge migration model and the knowledge learning model are respectively called a second knowledge migration model and a second knowledge learning model, and the output is respectively called: the second knowledge migration model output, the second knowledge learning model output, the second soft label and the second soft prediction, wherein the loss function is called a second loss function.
The activation functions used in the first, second, third, and fourth activation processing of the knowledge migration model output and the knowledge learning model output are all softmax functions with a temperature T, where T is a preset hyperparameter.
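The temperature-T softmax mentioned above can be written out directly. This is the standard formulation (the function name is our own); a larger T produces softer distributions, which is what makes the teacher's soft labels informative about non-maximal classes.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: higher T gives softer distributions,
    exposing the teacher's relative confidence on non-maximal classes."""
    scaled = [z / T for z in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 2.0, 1.0]
hard = softmax_with_temperature(logits, T=1.0)   # nearly one-hot
soft = softmax_with_temperature(logits, T=4.0)   # softened soft label
```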
Through the first knowledge migration model and the first knowledge learning model, an image classification model with higher practicability (namely less parameter quantity) and higher precision can be obtained by adopting a knowledge distillation mode.
By updating the parameters of the first knowledge learning model by using the first loss function and fixing the parameters of the first knowledge migration model, the knowledge of the first knowledge migration model can be introduced into the first knowledge learning model, so that the finally obtained image classification model has the excellent performance of the first knowledge migration model.
205. A video classification dataset is constructed.
The video classification data set is, for example, Kinetics-400, which contains about 240,000 video samples in roughly 400 categories.
206. Selecting a second knowledge migration model and a second knowledge learning model; the second knowledge learning model includes the image classification model.
A video classification model can be generated based on the second knowledge migration model and the second knowledge learning model. In the embodiments of the present disclosure, unless otherwise specified, "the video classification model" refers to the model trained at the end of the second stage, which can then be used in the application stage: a video to be classified is input into the video classification model, which processes it and outputs the category information of the video. Specifically: in the second stage, the second knowledge learning model is trained based on the second knowledge migration model, and both models are video classification models. The second knowledge migration model is a video classification model trained before the second stage, with higher accuracy; the second knowledge learning model is the video classification model trainable in the second stage, i.e., the one whose parameters are adjusted; and the video classification model finally obtained in the second stage is the second knowledge learning model after that training.
To increase computation speed, the second knowledge learning model may be chosen as a practical model with few parameters, such as a 2D convolution model, including but not limited to the PP-TSM model. When the second knowledge learning model is a 2D convolution model, it comprises a spatial feature extraction model and a temporal feature extraction model: the spatial feature extraction model is initialized with the image classification model obtained in the first stage, and the temporal feature extraction model may be any suitable model for extracting temporal features, for example a Long Short-Term Memory (LSTM) network or a Temporal Shift Module (TSM).
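As an illustration of the temporal-shift idea referenced above (a simplified stand-in for the TSM operator, not the PP-TSM implementation): a fraction of the channels is shifted one step backward or forward along the time axis, letting a 2D convolution mix information across frames at no extra parameter cost.

```python
# Simplified sketch of temporal shift over a T x C feature sequence
# (lists stand in for tensors; real TSM operates on N x T x C x H x W).

def temporal_shift(frames, fold_div=4):
    """frames: list over time of per-frame channel lists (T x C).
    The first C//fold_div channels are taken from the next frame, the next
    C//fold_div from the previous frame; zero-pad at sequence boundaries."""
    T = len(frames)
    C = len(frames[0])
    fold = C // fold_div
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:                 # shifted backward in time
                out[t][c] = frames[t + 1][c] if t + 1 < T else 0.0
            elif c < 2 * fold:           # shifted forward in time
                out[t][c] = frames[t - 1][c] if t - 1 >= 0 else 0.0
            else:                        # remaining channels are unshifted
                out[t][c] = frames[t][c]
    return out

frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # T=3, C=4, fold=1
shifted = temporal_shift(frames)
```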
To improve accuracy, the second knowledge migration model may be chosen as a high-accuracy model, for example any one, or a combination of at least two, of the following:
a higher-accuracy 3D convolution model, including but not limited to SlowFast;
a deeper, higher-accuracy 2D convolution model, including but not limited to TSM with a ResNet-101 backbone;
a high-accuracy model trained on a very large video classification data set, including but not limited to VideoTag.
207. The video classification data set is used as input to both the second knowledge migration model and the second knowledge learning model; the second knowledge migration model processes the video classification data set to produce the second knowledge migration model output, and the second knowledge learning model processes it to produce the second knowledge learning model output.
208. And constructing a second loss function according to the output of the second knowledge migration model and the output of the second knowledge learning model, and training the second knowledge learning model according to the second loss function to generate a video classification model.
Specifically, training the second knowledge learning model according to the second loss function to generate the video classification model may include: updating parameters of the second knowledge learning model according to the second loss function, and fixing the parameters of the second knowledge migration model until the second loss function is converged; and determining a second knowledge learning model when the second loss function is converged as the video classification model.
Constructing the second loss function may include: applying a third activation to the second knowledge migration model output to obtain a second soft label, applying a fourth activation to the second knowledge learning model output to obtain a second soft prediction, and calculating the second loss function from the second soft label and the second soft prediction, where the second loss function is the cross entropy, the KL divergence (Kullback-Leibler divergence), or the JS divergence (Jensen-Shannon divergence).
Specifically, the cross entropy is calculated as:

$H(p, q) = -\sum_i p(x_i) \log q(x_i)$

the KL divergence as:

$D_{KL}(p \,\|\, q) = \sum_i p(x_i) \log \frac{p(x_i)}{q(x_i)}$

and the JS divergence as:

$D_{JS}(p \,\|\, q) = \frac{1}{2} D_{KL}\!\left(p \,\middle\|\, \frac{p+q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q \,\middle\|\, \frac{p+q}{2}\right)$

where p(x_i) is the true output corresponding to the i-th input x_i, which in this embodiment is the soft label, and q(x_i) is the predicted output corresponding to the i-th input x_i, which in this embodiment is the soft prediction. Depending on the knowledge distillation stage, the soft label and soft prediction used when calculating the loss function are either the first soft label and first soft prediction or the second soft label and second soft prediction.
Through the second knowledge migration model and the second knowledge learning model, with the second knowledge learning model taking the image classification model as its pre-training model, knowledge distillation combined with pre-training yields a video classification model that incorporates image-domain knowledge and has high practicality and high accuracy.
By adopting the second loss function to update the parameters of the second knowledge learning model and fixing the parameters of the second knowledge transfer model, the knowledge of the second knowledge transfer model can be introduced into the second knowledge learning model, so that the finally obtained video classification model has the excellent performance of the second knowledge transfer model.
It is to be understood that "first", "second", and the like in the embodiments of the present disclosure are used for distinction only, and do not indicate the degree of importance, the order of timing, and the like.
In this embodiment, a two-stage knowledge distillation method is adopted to generate the video classification model. In the first stage, the first knowledge learning model has fewer parameters than the first knowledge migration model, while the first knowledge migration model has higher accuracy than the first knowledge learning model; knowledge distillation can migrate the knowledge of the first knowledge migration model into the first knowledge learning model, and taking the trained first knowledge learning model as the image classification model yields an image classification model with fewer parameters and higher accuracy. In the second stage, the image classification model obtained in the first stage is used as the pre-training model of the video classification model, which introduces the good performance of the image classification model into the video classification model; a high-accuracy second knowledge migration model is adopted, its knowledge is introduced into the second knowledge learning model through knowledge distillation, and the trained second knowledge learning model serves as the video classification model, improving its accuracy. A video classification model that incorporates image-domain knowledge, performs well, and offers high practicability (fewer parameters) and high accuracy can thus be obtained.
In this embodiment, using the image classification model as the pre-training model of the video classification model allows image-domain knowledge to be migrated into the video domain through pre-training. The pre-training approach also reduces the number of knowledge distillation iterations, which greatly shortens the training time of the video classification model. Because the second knowledge learning model is a 2D convolution model while the second knowledge migration model is a higher-accuracy model, and the second knowledge learning model acquires the knowledge of the second knowledge migration model through knowledge distillation, the second knowledge learning model retains the practicability of the 2D convolution model while improving its accuracy; taking the trained second knowledge learning model as the video classification model thus yields a video classification model with higher practicability and higher accuracy.
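The stage-two update described above, taking gradient steps on the second knowledge learning model while the parameters of the second knowledge migration model stay fixed, can be sketched with a toy linear student distilling from a frozen random teacher. All shapes, the learning rate, and the linear models themselves are illustrative assumptions; the disclosure's actual models are convolutional networks over video:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_samples, n_features, n_classes = 64, 16, 5
X = rng.normal(size=(n_samples, n_features))          # stand-in for video inputs

W_teacher = rng.normal(size=(n_features, n_classes))  # frozen: never updated
soft_labels = softmax(X @ W_teacher)                  # teacher soft labels

W_student = np.zeros((n_features, n_classes))         # would hold pre-trained weights
lr = 0.5
initial_loss = -np.mean(np.sum(soft_labels * np.log(softmax(X @ W_student)), axis=1))
for step in range(200):
    soft_preds = softmax(X @ W_student)
    # Gradient of CE(soft_labels, softmax(z)) w.r.t. the logits z is
    # softmax(z) - soft_labels; backprop through the linear student follows.
    grad_logits = (soft_preds - soft_labels) / n_samples
    W_student -= lr * (X.T @ grad_logits)             # only the student moves

final_loss = -np.mean(np.sum(soft_labels * np.log(softmax(X @ W_student) + 1e-12), axis=1))
```

The same loop structure describes stage one, with images in place of videos and the first pair of models in place of the second; in practice the student's weights would be initialized from the stage-one image classification model rather than from zeros.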
Fig. 5 is a schematic diagram of a fifth embodiment of the present disclosure, and this embodiment provides an apparatus for generating a video classification model, where the apparatus 500 includes: an acquisition module 501 and a generation module 502.
The obtaining module 501 is configured to obtain an image classification model, where the image classification model is generated by using a first knowledge distillation network according to an image classification dataset; the generating module 502 is configured to train the image classification model according to the video classification dataset by using a second knowledge distillation network to generate a video classification model.
In some embodiments, the first knowledge distillation network comprises a first knowledge transfer model and a first knowledge learning model, and the obtaining module 501 is specifically configured to: respectively taking an image classification data set as the input of the first knowledge migration model and the first knowledge learning model, processing the image classification data set by adopting the first knowledge migration model to obtain a first knowledge migration model output, and processing the image classification data set by adopting the first knowledge learning model to obtain a first knowledge learning model output; and constructing a first loss function according to the first knowledge migration model output and the first knowledge learning model output, and training the first knowledge learning model according to the first loss function to generate an image classification model.
In some embodiments, the obtaining module 501 is further specifically configured to: updating parameters of the first knowledge learning model according to the first loss function, and fixing the parameters of the first knowledge migration model, until the first loss function converges; and determining the first knowledge learning model obtained when the first loss function converges as the image classification model.
In some embodiments, the second knowledge distillation network comprises a second knowledge migration model and a second knowledge learning model, and the generating module 502 is specifically configured to: respectively taking a video classification dataset as the input of the second knowledge migration model and the second knowledge learning model, processing the video classification dataset by adopting the second knowledge migration model to obtain the output of the second knowledge migration model, and processing the video classification dataset by adopting the second knowledge learning model to obtain the output of the second knowledge learning model, wherein the second knowledge learning model comprises the image classification model; and constructing a second loss function according to the output of the second knowledge migration model and the output of the second knowledge learning model, and training the second knowledge learning model according to the second loss function to generate a video classification model.
In some embodiments, the generating module 502 is further specifically configured to: updating parameters of the second knowledge learning model according to the second loss function, and fixing the parameters of the second knowledge migration model, until the second loss function converges; and determining the second knowledge learning model obtained when the second loss function converges as the video classification model.
In some embodiments, the second knowledge learning model is a 2D convolution model; and/or the accuracy of the second knowledge migration model is higher than the accuracy of the second knowledge learning model.
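A common way for a 2D convolution model to act as a video classifier, consistent with the practicability argument here though not spelled out in this disclosure, is to apply the 2D backbone to each frame independently and average the per-frame scores over time. The linear stand-in below is purely illustrative of the shape handling:

```python
import numpy as np

def frame_model_2d(frames):
    # Stand-in for a 2D-convolutional image backbone: a fixed random linear
    # map over flattened pixels (hypothetical; only the shapes matter here).
    rng = np.random.default_rng(0)
    flat = frames.reshape(frames.shape[0], -1)
    weights = rng.normal(size=(flat.shape[1], 5))  # 5 hypothetical classes
    return flat @ weights

# A video batch laid out as (batch, time, channels, height, width).
videos = np.zeros((2, 8, 3, 4, 4))
b, t = videos.shape[:2]
# Fold time into the batch axis so the 2D model sees ordinary images.
frame_scores = frame_model_2d(videos.reshape(b * t, *videos.shape[2:]))
video_scores = frame_scores.reshape(b, t, -1).mean(axis=1)  # temporal average
```

Because every frame reuses the same 2D weights, such a student keeps the parameter count and inference cost of an image model, which is the practicability the text attributes to the 2D convolution choice.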
In this embodiment, the video classification model is generated after the image classification model is trained, that is, the image classification model may be used as a pre-training model of the video classification model, and the knowledge in the image field is migrated into the video classification model in a pre-training manner, so that the migration of the knowledge in the image field into the video classification model may be easily implemented, and the classification effect of the video classification model may be improved. In addition, the video classification model is trained on the basis of a pre-training model, namely, the pre-training model is adopted to initialize the video classification model, and because the pre-training model is a trained image classification model, compared with a mode of randomly initializing the video classification model, the number of iteration rounds in training can be reduced, and the training time of the video classification model is reduced. In addition, the image classification model and the video classification model are obtained through a knowledge distillation mode, and the effects of the two models can be further improved.
It is to be understood that in the disclosed embodiments, the same or similar contents in different embodiments may be mutually referred to.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 can also store the various programs and data necessary for the operation of the electronic device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the method for generating a video classification model. For example, in some embodiments, the method for generating a video classification model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method for generating a video classification model described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the method for generating a video classification model.
Various implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in cloud computing service systems that overcomes the difficult management and weak service scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; the present disclosure is not limited in this respect.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (10)

1. A method for generating a video classification model comprises the following steps:
respectively taking an image classification data set as input of a first knowledge migration model and a first knowledge learning model, processing the image classification data set by adopting the first knowledge migration model to obtain first knowledge migration model output, and processing the image classification data set by adopting the first knowledge learning model to obtain first knowledge learning model output;
constructing a first loss function according to the first knowledge migration model output and the first knowledge learning model output, and training the first knowledge learning model according to the first loss function to generate an image classification model; and using the image classification model as a pre-training model of a second knowledge learning model;
respectively taking the video classification data sets as the input of a second knowledge migration model and a second knowledge learning model, processing the video classification data sets by adopting the second knowledge migration model to obtain the output of the second knowledge migration model, and processing the video classification data sets by adopting the second knowledge learning model to obtain the output of the second knowledge learning model;
and constructing a second loss function according to the output of the second knowledge migration model and the output of the second knowledge learning model, and training the second knowledge learning model according to the second loss function to generate a video classification model.
2. The method of claim 1, wherein the training the first knowledge learning model according to the first loss function to generate an image classification model comprises:
updating parameters of the first knowledge learning model according to the first loss function, and fixing the parameters of the first knowledge migration model, until the first loss function converges; and determining the first knowledge learning model obtained when the first loss function converges as the image classification model.
3. The method of claim 1, wherein said training the second knowledge learning model according to the second loss function to generate a video classification model comprises:
updating parameters of the second knowledge learning model according to the second loss function, and fixing the parameters of the second knowledge migration model, until the second loss function converges; and determining the second knowledge learning model obtained when the second loss function converges as the video classification model.
4. The method of claim 1, wherein,
the second knowledge learning model is a 2D convolution model; and/or,
the accuracy of the second knowledge migration model is higher than the accuracy of the second knowledge learning model.
5. An apparatus for generating a video classification model, comprising:
the acquisition module is used for respectively taking the image classification data sets as the input of a first knowledge migration model and a first knowledge learning model, processing the image classification data sets by adopting the first knowledge migration model to obtain the output of the first knowledge migration model, and processing the image classification data sets by adopting the first knowledge learning model to obtain the output of the first knowledge learning model; constructing a first loss function according to the first knowledge migration model output and the first knowledge learning model output, and training the first knowledge learning model according to the first loss function to generate an image classification model; and using the image classification model as a pre-training model of a second knowledge learning model;
the generation module is used for respectively taking the video classification data sets as the input of a second knowledge migration model and the input of a second knowledge learning model, processing the video classification data sets by adopting the second knowledge migration model to obtain the output of the second knowledge migration model, and processing the video classification data sets by adopting the second knowledge learning model to obtain the output of the second knowledge learning model; and constructing a second loss function according to the output of the second knowledge migration model and the output of the second knowledge learning model, and training the second knowledge learning model according to the second loss function to generate a video classification model.
6. The apparatus of claim 5, wherein the obtaining module is further specifically configured to:
updating parameters of the first knowledge learning model according to the first loss function, and fixing the parameters of the first knowledge migration model, until the first loss function converges; and determining the first knowledge learning model obtained when the first loss function converges as the image classification model.
7. The apparatus of claim 5, wherein the generation module is further specifically configured to:
updating parameters of the second knowledge learning model according to the second loss function, and fixing the parameters of the second knowledge migration model, until the second loss function converges; and determining the second knowledge learning model obtained when the second loss function converges as the video classification model.
8. The apparatus of claim 5, wherein,
the second knowledge learning model is a 2D convolution model; and/or,
the accuracy of the second knowledge migration model is higher than the accuracy of the second knowledge learning model.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-4.
CN202110190232.8A 2021-02-18 2021-02-18 Method, device and equipment for generating video classification model and storage medium Active CN112949433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110190232.8A CN112949433B (en) 2021-02-18 2021-02-18 Method, device and equipment for generating video classification model and storage medium


Publications (2)

Publication Number Publication Date
CN112949433A CN112949433A (en) 2021-06-11
CN112949433B true CN112949433B (en) 2022-07-22

Family

ID=76244510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110190232.8A Active CN112949433B (en) 2021-02-18 2021-02-18 Method, device and equipment for generating video classification model and storage medium

Country Status (1)

Country Link
CN (1) CN112949433B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642605A (en) * 2021-07-09 2021-11-12 Beijing Baidu Netcom Science and Technology Co., Ltd. Model distillation method, device, electronic equipment and storage medium
CN113642532B (en) * 2021-10-13 2022-02-08 Guangzhou Huya Information Technology Co., Ltd. Video classification model processing method and device and data processing equipment

Citations (2)

Publication number Priority date Publication date Assignee Title
WO2018169639A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc Recognition in unlabeled videos with domain adversarial learning and knowledge distillation
CN111950411A (en) * 2020-07-31 2020-11-17 Shanghai SenseTime Intelligent Technology Co., Ltd. Model determination method and related device

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN107220616B (en) * 2017-05-25 2021-01-19 Peking University Adaptive weight-based double-path collaborative learning video classification method
CN110188239B (en) * 2018-12-26 2021-06-22 Peking University Double-current video classification method and device based on cross-mode attention mechanism
CN110866512B (en) * 2019-11-21 2023-06-06 Nanjing University Monitoring camera shielding detection method based on video classification
CN111639710B (en) * 2020-05-29 2023-08-08 Beijing Baidu Netcom Science and Technology Co., Ltd. Image recognition model training method, device, equipment and storage medium
CN111738436B (en) * 2020-06-28 2023-07-18 Zhongshan Institute, University of Electronic Science and Technology of China Model distillation method and device, electronic equipment and storage medium
CN112232397A (en) * 2020-09-30 2021-01-15 Shanghai Eye Control Technology Co., Ltd. Knowledge distillation method and device of image classification model and computer equipment


Non-Patent Citations (2)

Title
VideoSSL: Semi-Supervised Learning for Video Classification; Longlong Jing et al.; arxiv.org; 20200229; full text *
Knowledge distillation for speech-assisted lip-reading recognition; Xu Rui; China Masters' Theses Full-text Database, Information Science and Technology; 20200815; full text *

Also Published As

Publication number Publication date
CN112949433A (en) 2021-06-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant