CN109740018B

CN109740018B - Method and device for generating video label model

Info

Publication number: CN109740018B
Application number: CN201910084734.5A
Authority: CN
Inventors: 李伟健; 王长虎
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing Volcano Engine Technology Co Ltd
Priority date: 2019-01-29
Filing date: 2019-01-29
Publication date: 2021-03-02
Anticipated expiration: 2039-01-29
Also published as: CN109740018A

Abstract

The embodiment of the disclosure discloses a method and a device for generating a video label model. One embodiment of the method comprises: acquiring at least two sample video sets; selecting a sample video set from at least two sample video sets, and using the selected sample video set, performing the following training steps: by utilizing a machine learning method, taking a positive sample video included in a sample video set as an input, taking positive category information corresponding to the input positive sample video as an expected output, taking a negative sample video in the sample video set as an input, taking negative category information corresponding to the input negative sample video as an expected output, and training an initial model; determining whether at least two sample video sets include an unselected sample video set; in response to determining not to include, determining the initial model after the last training to be a video tag model. The embodiment improves the flexibility of model training and is beneficial to improving the accuracy of video classification by using the video label model.

Description

Method and device for generating video label model

Technical Field

The embodiment of the disclosure relates to the technical field of computers, in particular to a method and a device for generating a video label model.

Background

The multi-label classification is to classify certain information into a plurality of categories, that is, one information may have a plurality of labels. The existing method for multi-label classification of video usually adopts a multi-label classification model. The model includes a plurality of sigmoid activation functions, each activation function corresponding to a tag. When the model is trained, a single sample video corresponds to a plurality of labeled labels, and the trained model can output a plurality of labels, wherein each label corresponds to a video category.

Disclosure of Invention

The embodiment of the disclosure provides a method and a device for generating a video label model and a method and a device for generating a category label set of a video.

In a first aspect, an embodiment of the present disclosure provides a method for generating a video tag model, the method including: acquiring at least two sample video sets, wherein the sample video sets correspond to preset video categories, the sample video sets comprise positive sample videos belonging to the corresponding video categories and negative sample videos not belonging to the corresponding video categories, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information; selecting a sample video set from at least two sample video sets, and using the selected sample video set, performing the following training steps: by utilizing a machine learning method, taking a positive sample video included in a sample video set as an input, taking positive category information corresponding to the input positive sample video as an expected output, taking a negative sample video in the sample video set as an input, taking negative category information corresponding to the input negative sample video as an expected output, and training an initial model; determining whether at least two sample video sets include an unselected sample video set; in response to determining not to include, determining the initial model after the last training to be a video tag model.

In some embodiments, the method further comprises: in response to determining that the at least two sample video sets include an unselected sample video set, reselecting the sample video set from the unselected sample video set, and continuing the training step using the reselected sample video set and the initial model after the last training.

In some embodiments, the positive category information and the negative category information are vectors respectively including a preset number of elements, a target element in a vector corresponding to the positive sample video is used to characterize that the positive sample video belongs to a corresponding video category, a target element in a vector corresponding to the negative sample video is used to characterize that the negative sample video does not belong to a corresponding video category, and the target element is an element located at an element position in the vector, where a correspondence relationship is established in advance with a video category corresponding to a sample video set to which the sample video corresponding to the vector belongs.

In some embodiments, the initial model is a convolutional neural network, and includes a feature extraction layer and a classification layer, where the classification layer includes a preset number of weight data, and the weight data corresponds to a preset video category, and is used to determine a probability that the input video belongs to the video category to which the weight data corresponds.

In some embodiments, training the initial model comprises: and fixing other weight data except the weight data corresponding to the sample video set in the preset number of weight data, and adjusting the weight data corresponding to the sample video set to train the initial model.

In some embodiments, the initial model further comprises a video frame extraction layer; and training an initial model by taking the positive sample video included in the sample video set as an input, taking the positive category information corresponding to the input positive sample video as an expected output, taking the negative sample video in the sample video set as an input, and taking the negative category information corresponding to the input negative sample video as an expected output, wherein the training comprises the following steps: inputting a positive sample video included in the sample video set into a video frame extraction layer to obtain a positive sample video frame set; taking the obtained positive sample video frame set as the input of a feature extraction layer, and taking positive category information corresponding to the input positive sample video as the expected output of an initial model; inputting negative sample videos included in the sample video set into a video frame extraction layer to obtain a negative sample video frame set; and taking the obtained negative sample video frame set as the input of the feature extraction layer, taking the negative category information corresponding to the input negative sample video as the expected output of the initial model, and training the initial model.

In a second aspect, an embodiment of the present disclosure provides a method for generating a category label set for a video, the method including: acquiring a video to be classified; inputting a video to be classified into a pre-trained video tag model, and generating a category tag set, wherein the category tag corresponds to a preset video category and is used for representing that the video to be classified belongs to the video category corresponding to the category tag, and the video tag model is generated according to the method described in any embodiment of the first aspect.

In a third aspect, an embodiment of the present disclosure provides an apparatus for generating a video tag model, the apparatus including: an obtaining unit configured to obtain at least two sample video sets, wherein the sample video sets correspond to preset video categories, the sample video sets include positive sample videos belonging to the corresponding video categories and negative sample videos not belonging to the corresponding video categories, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information; a training unit configured to select a sample video set from at least two sample video sets, and with the selected sample video set, perform the following training steps: by utilizing a machine learning method, taking a positive sample video included in a sample video set as an input, taking positive category information corresponding to the input positive sample video as an expected output, taking a negative sample video in the sample video set as an input, taking negative category information corresponding to the input negative sample video as an expected output, and training an initial model; determining whether at least two sample video sets include an unselected sample video set; in response to determining not to include, determining the initial model after the last training to be a video tag model.

In some embodiments, the apparatus further comprises: a selection unit configured to, in response to determining that the at least two sample video sets include an unselected sample video set, reselect the sample video set from the unselected sample video set, and continue to perform the training step using the reselected sample video set and the initial model after the last training.

In some embodiments, the training unit is further configured to: and fixing other weight data except the weight data corresponding to the sample video set in the preset number of weight data, and adjusting the weight data corresponding to the sample video set to train the initial model.

In some embodiments, the initial model further comprises a video frame extraction layer; and the training unit is further configured to: inputting a positive sample video included in the sample video set into a video frame extraction layer to obtain a positive sample video frame set; taking the obtained positive sample video frame set as the input of a feature extraction layer, and taking positive category information corresponding to the input positive sample video as the expected output of an initial model; inputting negative sample videos included in the sample video set into a video frame extraction layer to obtain a negative sample video frame set; and taking the obtained negative sample video frame set as the input of the feature extraction layer, taking the negative category information corresponding to the input negative sample video as the expected output of the initial model, and training the initial model.

In a fourth aspect, an embodiment of the present disclosure provides an apparatus for generating a category label set for a video, the apparatus including: an acquisition unit configured to acquire a video to be classified; the generating unit is configured to input a video to be classified into a pre-trained video tag model, and generate a category tag set, where the category tag corresponds to a preset video category and is used for representing that the video to be classified belongs to a video category corresponding to the category tag, and the video tag model is generated according to the method described in any embodiment of the first aspect.

In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when executed by one or more processors, cause the one or more processors to implement a method as described in any of the implementations of the first or second aspects.

In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which computer program, when executed by a processor, implements a method as described in any of the implementations of the first or second aspects.

According to the method and the device for generating the video label model, at least two sample video sets are obtained, wherein the sample video sets correspond to preset video categories and comprise positive sample videos and negative sample videos, the positive sample videos correspond to positive category information, and the negative sample videos correspond to negative category information; then, the positive sample video is used as input, the positive category information is used as expected output, the negative sample video is used as input, the negative category information is used as expected output, the initial model is trained, and finally the video label model is obtained through training.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram for one embodiment of a method for generating a video tag model, according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of one application scenario of a method for generating a video tag model according to an embodiment of the present disclosure;

FIG. 4 is a flow diagram of one embodiment of a method for generating a category label set for a video, according to an embodiment of the present disclosure;

FIG. 5 is a block diagram of one embodiment of an apparatus for generating a video tag model according to an embodiment of the present disclosure;

FIG. 6 is a block diagram illustrating an embodiment of an apparatus for generating a category label set for a video according to an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant disclosure and are not limiting of the disclosure. It should be noted that, for the convenience of description, only the parts relevant to the related disclosure are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates an exemplary system architecture 100 for a method of generating a video tag model or an apparatus for generating a video tag model, and for a method of generating a category tag set of a video or an apparatus for generating a category tag set of a video to which embodiments of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include

terminal devices

101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the

terminal devices

101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the

terminal devices

101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a video playing application, a video processing application, a web browser application, social platform software, etc., may be installed on the

terminal devices

101, 102, 103.

The

terminal apparatuses

101, 102, and 103 may be hardware or software. When the

terminal apparatuses

101, 102, 103 are hardware, various electronic apparatuses are possible. When the

terminal apparatuses

101, 102, 103 are software, they can be installed in the above-described electronic apparatuses. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.

The server 105 may be a server providing various services, such as a background model server performing model training using a sample video set uploaded by the

terminal devices

101, 102, 103. The background model server can perform model training by using the obtained at least two sample video sets to generate a video tag model, and can also send the video tag model to the terminal equipment, or process the video to be classified by using the video tag model to obtain the tag of the video to be classified.

It should be noted that the method for generating the video tag model provided by the embodiment of the present disclosure may be executed by the server 105, and may also be executed by the

terminal devices

101, 102, and 103, and accordingly, the apparatus for generating the video tag model may be disposed in the server 105, and may also be disposed in the

terminal devices

101, 102, and 103. Furthermore, the method for generating the category label set of the video provided by the embodiment of the present disclosure may be executed by the server 105, and may also be executed by the

terminal devices

101, 102, and 103, and accordingly, the apparatus for generating the category label set of the video may be disposed in the server 105, and may also be disposed in the

terminal devices

101, 102, and 103.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case that the sample video set required for training the model is not required to be acquired from a remote place, or the video to be classified is not required to be acquired from a remote place, the system architecture may not include a network, and only a server or a terminal device is required.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating a video tag model according to the present disclosure is shown. The method for generating the video label model comprises the following steps:

at step 201, at least two sample video sets are obtained.

In this embodiment, an executing subject (for example, a server or a terminal device shown in fig. 1) of the method for generating a video tag model may remotely obtain at least two sample video sets through a wired connection or a wireless connection, or locally obtain at least two sample video sets. The sample video set corresponds to a preset video category, the sample video set comprises positive sample videos belonging to the corresponding video category and negative sample videos not belonging to the corresponding video category, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information.

Specifically, the positive category information and the negative category information may include information in at least one of the following forms: letters, numbers, symbols, etc. For example, the positive category information may be "seaside" and the negative category information may be "non-seaside".

It should be noted that the positive sample video and the negative sample video used in the present embodiment include an image sequence including at least two images.

In some optional implementation manners of this embodiment, the positive category information and the negative category information are vectors including a preset number of elements, respectively, a target element in a vector corresponding to the positive sample video is used to represent that the positive sample video belongs to a corresponding video category, a target element in a vector corresponding to the negative sample video is used to represent that the negative sample video does not belong to a corresponding video category, and the target element is an element located at an element position in the vector, where a correspondence relationship with the video category corresponding to the vector is established in advance. The video category corresponding to the vector is the video category corresponding to the sample video set to which the sample video corresponding to the vector belongs.

As an example, assuming that the preset number is 100, for a sample video set, the video category corresponding to the sample video set is a seaside category, the positive category information corresponding to the positive sample video in the sample video set may be a vector (1,0,0,0, …, 0), where the vector includes 100 elements, where the first element corresponds to the seaside category. Here, the number 1 indicates that the video belongs to the seaside class, and the other element 0 indicates that the video does not belong to the video class corresponding to the element position where 0 is located. Accordingly, the negative category information may be a vector (0,0,0,0, …, 0). The other elements may have other values and are not limited to 0. The class information labeled in a vector form is generally used for training a multi-label classification model, and since one vector is used for representing whether one video belongs to one video class, the class information of the implementation mode can be regarded as a single label. When a certain sample video set is used for training, a training method of a single-label model can be adopted, and the training steps are simplified.

By using the vector to represent the category information, the video category identified by the video tag model can be flexibly expanded. For example, assume that the preset number is 100, i.e., the model can identify 100 categories of videos at the maximum. In practical application, only 10 video categories need to be identified, and the 1 st to 10 th elements in the vector respectively correspond to preset video categories. When the video tag model can identify videos of more categories, only the video categories corresponding to other elements need to be set, and therefore the identification capability of the video tag model can be flexibly expanded.

Step 202, selecting a sample video set from at least two sample video sets, and executing the following training steps by using the selected sample video set: by utilizing a machine learning method, taking a positive sample video included in a sample video set as an input, taking positive category information corresponding to the input positive sample video as an expected output, taking a negative sample video in the sample video set as an input, taking negative category information corresponding to the input negative sample video as an expected output, and training an initial model; determining whether at least two sample video sets include an unselected sample video set; in response to determining not to include, determining the initial model after the last training to be a video tag model.

In this embodiment, the executing body may execute the following sub-steps:

step 2021, select a sample video set from at least two sample video sets.

Specifically, the execution subject may select the sample video sets in various manners, such as random selection, selection in a number order of the respective sample video sets set in advance, and the like.

Next, using the selected sample video set, the following training steps (including steps 2022-2024) are performed.

Step 2022, using a machine learning method, training an initial model by taking the positive sample video included in the sample video set as an input, taking the positive category information corresponding to the input positive sample video as an expected output, taking the negative sample video in the sample video set as an input, and taking the negative category information corresponding to the input negative sample video as an expected output.

Specifically, the initial model may be various types of models, such as a recurrent neural network model, a convolutional neural network model, and the like. In the process of training the initial model, the actual output can be obtained for the positive sample video or the negative sample video input for each training. And the actual output is data actually output by the initial model and used for representing the class information. Then, the executing agent may adopt a gradient descent method, adjust parameters of the initial model based on the actual output and the expected output, use the model obtained after each parameter adjustment as the initial model for the next training, and end the training for one sample video set when a preset ending condition is met. It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds the preset time; the training times exceed the preset times; the loss value calculated using a predetermined loss function (e.g., a cross entropy loss function) is less than a predetermined loss value threshold.

As one example, the initial model may include at least two bi-classification models, each corresponding to a sample video set. For a certain binary model, the binary model may be trained based on positive sample videos and negative sample videos included in the corresponding sample video set. And finally, the trained binary classification model can determine whether the input video belongs to the video category corresponding to the binary classification model, and if so, generate a label for representing the video category corresponding to the binary classification model. Therefore, when the trained video label model is finally used for video classification, at least one label used for representing the video category can be generated, and the effect of multi-label classification is achieved.

In some optional implementations of this embodiment, the initial model may be a convolutional neural network, and includes a feature extraction layer and a classification layer. The classification layer comprises a preset number of weight data, the weight data correspond to preset video categories and are used for determining the probability that the input video belongs to the video categories corresponding to the weight data. In general, the feature extraction layer may include a convolutional layer, a pooling layer, and the like, for generating feature data of the video, which may be used to characterize features such as color, shape, and the like of images in the video. The classification layer comprises a full-concatenation layer, and the full concatenation is used for generating a feature vector (for example, 2048-dimensional vector) according to feature data output by the feature extraction layer. The weight data comprises weight coefficients which can be multiplied by the characteristic data, the weight data can also comprise bias values, and the weight coefficients and the bias values can be used for obtaining probability values corresponding to the weight data, and the probability values are used for representing the probability that the input video belongs to the video category corresponding to the weight data.

In some optional implementations of this embodiment, the executing entity may train the initial model according to the following steps:

and fixing other weight data except the weight data corresponding to the sample video set in the preset number of weight data, and adjusting the weight data corresponding to the sample video set to train the initial model.

Specifically, for a sample video set, fixing other weight data except the weight data corresponding to the sample video set, and adjusting the weight data corresponding to the sample video set by adopting a training method of a binary classification model. Thereby optimizing the weight data corresponding to the sample video set. It should be noted that the method for training the two-class model is a well-known technique widely studied and applied at present, and is not described herein again. By the implementation mode, the weight data included by the video label model can be independent from each other, and when a sample video set is used for training, other weight data are not influenced, so that the finally obtained video label model can classify videos more accurately. Due to the fact that the multiple weight data are adopted, the finally obtained video label model can divide multiple video categories of the video input into the model, and the effect of multi-label classification is achieved.

In some optional implementations of this embodiment, the initial model further includes a video frame extraction layer. The executive body may train the initial model as follows:

and inputting the positive sample video included in the sample video set into a video frame extraction layer to obtain a positive sample video frame set. And taking the obtained positive sample video frame set as the input of the feature extraction layer, taking the positive category information corresponding to the input positive sample video as the expected output of the initial model, and training the initial model. And inputting the negative sample video included in the sample video set into a video frame extraction layer to obtain a negative sample video frame set. And taking the obtained negative sample video frame set as the input of the feature extraction layer, taking the negative category information corresponding to the input negative sample video as the expected output of the initial model, and training the initial model.

Specifically, the video frame extraction layer may extract the video frame in a video frame extraction manner of various preset settings. For example, the key frames of the input sample video may be extracted as sample video frames according to an existing method of extracting key frames of a video. Or extracting the video frame as a sample video frame according to a preset playing time interval. Through the implementation mode, a certain number of video frames can be extracted from the sample video for classifying the sample video, so that the calculation amount of the model can be reduced, and the efficiency of model training is improved.

Step 2023, determine whether at least two sample video sets include an unselected sample video set.

Step 2024, in response to determining not to include, determining the initial model after the last training to be the video tag model.

In some optional implementations of this embodiment, the executing entity may, in response to determining that the at least two sample video sets include the unselected sample video set, reselect the sample video set from the unselected sample video set, and continue to execute the training step (i.e., step 2022-step 2024) using the reselected sample video set and the initial model after the last training. The manner of reselecting the sample video set from the unselected sample video set may be random selection or selection according to the numbering sequence of the sample video sets, which is not limited herein.

The video label model obtained by training according to the above steps can be used to determine probability values of the input video belonging to each preset video category, and if a certain probability value is greater than or equal to a preset probability threshold, a category label used for representing that the input video belongs to the video category corresponding to the probability value is generated. In practical applications, the video tag model may output a set of category tags, where each category tag corresponds to a preset video category to which the video used for representing the input video tag model belongs. Therefore, the trained video label model is a multi-label classification model.

With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating a video tag model according to the present embodiment. In the application scenario of fig. 3, the electronic device 301 first acquires at least two sample video sets 302. Each sample video set corresponds to a preset video category, each sample video set comprises a positive sample video belonging to the corresponding video category and a negative sample video not belonging to the corresponding video category, the positive sample video corresponds to pre-labeled positive category information, and the negative sample video corresponds to pre-labeled negative category information. For example, sample video set 3021 corresponds to the video category "seaside" and sample video set 3022 corresponds to the video category "hotel". The positive category information corresponding to each positive sample video included in the sample video set 3021 is a vector (1,0,0, …), and the negative category information corresponding to each negative sample video included is a vector (0,0,0, …). The positive category information corresponding to each positive sample video included in the sample video set 3022 is a vector (0,1,0, …), and the negative category information corresponding to each negative sample video included is a vector (0,0,0, …). Where each element position in the vector corresponds to a video category.

Then, the electronic device 301 sequentially selects sample video sets from the at least two sample video sets 302 according to a preset numbering sequence of the sample video sets, and performs the following training steps using the selected sample video sets: using a machine learning method, the initial model 303 is trained using, as input, the positive sample videos included in the sample video set, using the positive category information corresponding to the input positive sample videos as expected output, using the negative sample videos in the sample video set as input, and using the negative category information corresponding to the input negative sample videos as expected output. The initial model 303 is shown trained using a sample video set 3021. After each training using the sample video, the initial model 303 retains the adjusted parameters and continues to be trained using other sample videos. After each training with the sample video set is finished, the electronic device 301 determines whether at least two sample video sets 302 include the unselected sample video set, and if not, that is, all sample video sets are used for training, determines that the initial model after the last training is the video tag model 304.

In the method provided by the above embodiment of the present disclosure, at least two sample video sets are obtained, where the sample video sets correspond to preset video categories, and each sample video set includes a positive sample video and a negative sample video, where the positive sample video corresponds to positive category information and the negative sample video corresponds to negative category information; then, the positive sample video is used as input, the positive category information is used as expected output, the negative sample video is used as input, the negative category information is used as expected output, the initial model is trained, and finally the video label model is obtained through training.

With further reference to fig. 4, a flow 400 of one embodiment of a method for generating a category label set for a video in accordance with the present disclosure is shown. The method for generating the category label set of the video comprises the following steps:

step 401, obtaining a video to be classified.

In this embodiment, the execution subject (such as the server or the terminal device shown in fig. 1) of the method for generating the category label set of the video may acquire the video to be classified from the local or from the remote. The video to be classified is the video to be classified. It should be noted that the video to be classified adopted in the present embodiment includes an image sequence including at least two images.

And 402, inputting the video to be classified into a pre-trained video label model to generate a class label set.

In this embodiment, the executing entity may input the video to be classified into a pre-trained video tag model, and generate a category tag set. The category label corresponds to a preset video category and is used for representing that the video to be classified belongs to the video category corresponding to the category label. The category labels may be various forms of labels including, but not limited to, at least one of: letters, numbers, symbols, etc.

In this embodiment, the video tag model is generated according to the method described in the embodiment corresponding to fig. 2, which may specifically refer to each step described in the embodiment corresponding to fig. 2, and is not described herein again.

In general, the generated set of category labels may be stored in association with the video to be classified. For example, the category label set may be stored as attribute information of the video to be classified into the attribute information set of the video to be classified. Thereby increasing the comprehensiveness of the attributes characterizing the video to be classified. The attribute information set may include, but is not limited to, at least one of the following attribute information: the name, size, generation time, etc. of the video to be classified.

Alternatively, the generated category label set may be output in various manners, for example, the category label set is displayed on a display screen included in the execution main body. Or sending the category label set to other electronic equipment in communication connection with the execution main body.

According to the method provided by the embodiment of the disclosure, the video to be classified is classified by using the video label model generated by the embodiment corresponding to fig. 2, so as to generate the category label set of the video to be classified, and then the video label model for multi-label classification is generated by using the training of the single label sample, so that the accuracy and efficiency of video classification are improved.

With further reference to fig. 5, as an implementation of the method shown in fig. 2, the present disclosure provides an embodiment of an apparatus for generating a video tag model, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.

As shown in fig. 5, the apparatus 500 for generating a video tag model of the present embodiment includes: an obtaining unit 501 configured to obtain at least two sample video sets, wherein a sample video set corresponds to a preset video category, and the sample video set includes a positive sample video belonging to the corresponding video category and a negative sample video not belonging to the corresponding video category, the positive sample video corresponds to pre-labeled positive category information, and the negative sample video corresponds to pre-labeled negative category information; a training unit 502 configured to select a sample video set from at least two sample video sets, and with the selected sample video set, perform the following training steps: by utilizing a machine learning method, taking a positive sample video included in a sample video set as an input, taking positive category information corresponding to the input positive sample video as an expected output, taking a negative sample video in the sample video set as an input, taking negative category information corresponding to the input negative sample video as an expected output, and training an initial model; determining whether at least two sample video sets include an unselected sample video set; in response to determining not to include, determining the initial model after the last training to be a video tag model.

In this embodiment, the obtaining unit 501 may obtain at least two sample video sets remotely through a wired connection or a wireless connection, or obtain at least two sample video sets locally. The sample video set corresponds to a preset video category, the sample video set comprises positive sample videos belonging to the corresponding video category and negative sample videos not belonging to the corresponding video category, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information.

In this embodiment, the training unit 502 may perform the following sub-steps:

step 5021, a sample video set is selected from at least two sample video sets.

Specifically, the training unit 502 may select the sample video sets in various manners, such as random selection, selection according to the number order of each sample video set, and the like.

Next, using the selected sample video set, the following training steps (including steps 5022-5024) are performed.

Step 5022, by using a machine learning method, taking the positive sample video included in the sample video set as input, taking the positive category information corresponding to the input positive sample video as expected output, taking the negative sample video in the sample video set as input, taking the negative category information corresponding to the input negative sample video as expected output, and training an initial model.

Specifically, the initial model may be various types of models, such as a recurrent neural network model, a convolutional neural network model, and the like. In the process of training the initial model, the actual output can be obtained for the positive sample video or the negative sample video input for each training. And the actual output is data actually output by the initial model and used for representing the class information. Then, the training unit 502 may adjust parameters of the initial model based on the actual output and the expected output by using a gradient descent method, use the model obtained after each parameter adjustment as the initial model for the next training, and end the training for one sample video set when a preset end condition is satisfied. It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds the preset time; the training times exceed the preset times; the loss value calculated using a predetermined loss function (e.g., a cross entropy loss function) is less than a predetermined loss value threshold.

Step 5023, determining whether at least two sample video sets comprise unselected sample video sets.

Step 5024, in response to the fact that the initial model after the last training is not included, the initial model after the last training is determined to be a video label model.

In some optional implementations of this embodiment, the apparatus 500 may further include: a selection unit (not shown in the figures) configured to, in response to determining that the at least two sample video sets comprise an unselected sample video set, reselect the sample video set from the unselected sample video set, and continue to perform the training step using the reselected sample video set and the initial model after the last training.

In some optional implementation manners of this embodiment, the positive category information and the negative category information are respectively vectors including a preset number of elements, a target element in a vector corresponding to the positive sample video is used to represent that the positive sample video belongs to a corresponding video category, a target element in a vector corresponding to the negative sample video is used to represent that the negative sample video does not belong to a corresponding video category, the target element is an element located at an element position in the vector, where a correspondence relationship is established in advance with the video category corresponding to the vector, and the video category corresponding to the vector is a video category corresponding to a sample video set to which the sample video corresponding to the vector belongs.

In some optional implementations of this embodiment, the initial model is a convolutional neural network, and includes a feature extraction layer and a classification layer, where the classification layer includes a preset number of weight data, and the weight data corresponds to a preset video category and is used to determine a probability that an input video belongs to the video category corresponding to the weight data.

In some optional implementations of this embodiment, the training unit 502 may be further configured to: and fixing other weight data except the weight data corresponding to the sample video set in the preset number of weight data, and adjusting the weight data corresponding to the sample video set to train the initial model.

In some optional implementations of this embodiment, the initial model further includes a video frame extraction layer; and training unit 502 may be further configured to: inputting a positive sample video included in the sample video set into a video frame extraction layer to obtain a positive sample video frame set; taking the obtained positive sample video frame set as the input of a feature extraction layer, and taking positive category information corresponding to the input positive sample video as the expected output of an initial model; inputting negative sample videos included in the sample video set into a video frame extraction layer to obtain a negative sample video frame set; and taking the obtained negative sample video frame set as the input of the feature extraction layer, taking the negative category information corresponding to the input negative sample video as the expected output of the initial model, and training the initial model.

The apparatus 500 provided in the foregoing embodiment of the present disclosure obtains at least two sample video sets, where the sample video sets correspond to preset video categories, and each sample video set includes a positive sample video and a negative sample video, where the positive sample video corresponds to positive category information, and the negative sample video corresponds to negative category information; then, the positive sample video is used as input, the positive category information is used as expected output, the negative sample video is used as input, the negative category information is used as expected output, the initial model is trained, and the video label model is obtained through final training

With further reference to fig. 6, as an implementation of the method shown in fig. 4 described above, the present disclosure provides an embodiment of an apparatus for generating a category label set of a video, where the apparatus embodiment corresponds to the method embodiment shown in fig. 4, and the apparatus may be applied to various electronic devices in particular.

As shown in fig. 6, the apparatus 600 for generating a category label set of a video according to this embodiment includes: an acquisition unit 601 configured to acquire a video to be classified; a generating unit 602, configured to input a video to be classified into a pre-trained video tag model, and generate a category tag set, where a category tag corresponds to a preset video category and is used for representing that the video to be classified belongs to a video category corresponding to the category tag, and the video tag model is generated according to the method described in any embodiment of the first aspect.

In the present embodiment, the acquisition unit 601 may acquire the video to be classified from a local place or from a remote place. The video to be classified is the video to be classified. It should be noted that the video to be classified adopted in the present embodiment includes an image sequence including at least two images.

In this embodiment, the generating unit 602 may input the video to be classified into a pre-trained video tag model, and generate a category tag set. The category label corresponds to a preset video category and is used for representing that the video to be classified belongs to the video category corresponding to the category label. The category labels may be various forms of labels including, but not limited to, at least one of: letters, numbers, symbols, etc.

The apparatus 600 provided by the foregoing embodiment of the present disclosure classifies videos to be classified by using the video label model generated in fig. 2 in accordance with the embodiment, generates a category label set of the videos to be classified, and thus generates a video label model for multi-label classification by using single-label sample training, thereby improving accuracy and efficiency of video classification.

Referring now to fig. 7, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 700 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 7, electronic device 700 may include a processing means (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from storage 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure.

It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring at least two sample video sets, wherein the sample video sets correspond to preset video categories, the sample video sets comprise positive sample videos belonging to the corresponding video categories and negative sample videos not belonging to the corresponding video categories, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information; selecting a sample video set from at least two sample video sets, and using the selected sample video set, performing the following training steps: by utilizing a machine learning method, taking a positive sample video included in a sample video set as an input, taking positive category information corresponding to the input positive sample video as an expected output, taking a negative sample video in the sample video set as an input, taking negative category information corresponding to the input negative sample video as an expected output, and training an initial model; determining whether at least two sample video sets include an unselected sample video set; in response to determining not to include, determining the initial model after the last training to be a video tag model.

Further, the one or more programs, when executed by the electronic device, may further cause the electronic device to: acquiring a video to be classified; and inputting the video to be classified into a pre-trained video label model to generate a category label set.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + +, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a training unit. Where the names of these units do not in some cases constitute a limitation of the unit itself, for example, the acquisition unit may also be described as a "unit acquiring at least two sample video sets".

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, the above features and (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure are mutually replaced to form the technical solution.

Claims

1. A method for generating a video tag model, comprising:

acquiring at least two sample video sets, wherein the sample video sets correspond to preset video categories, the sample video sets comprise positive sample videos belonging to the corresponding video categories and negative sample videos not belonging to the corresponding video categories, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information;

selecting a sample video set from the at least two sample video sets, and using the selected sample video set, performing the following training steps: by utilizing a machine learning method, taking a positive sample video included in a sample video set as an input, taking positive category information corresponding to the input positive sample video as an expected output, taking a negative sample video in the sample video set as an input, taking negative category information corresponding to the input negative sample video as an expected output, and training an initial model; determining whether the at least two sample video sets include an unselected sample video set; in response to determining not to include, determining the initial model after the last training to be a video tag model;

the initial model is a convolutional neural network and comprises a feature extraction layer and a classification layer, wherein the classification layer comprises a preset number of weight data, and the weight data correspond to preset video categories and are used for determining the probability that the input video belongs to the video categories corresponding to the weight data.

2. The method of claim 1, wherein the method further comprises:

in response to determining that the at least two sample video sets include an unselected sample video set, reselecting the sample video set from the unselected sample video set, and continuing the training step using the reselected sample video set and the initial model after the last training.

3. The method according to claim 1, wherein the positive category information and the negative category information are vectors respectively including a preset number of elements, a target element in a vector corresponding to the positive sample video is used to characterize that the positive sample video belongs to a corresponding video category, a target element in a vector corresponding to the negative sample video is used to characterize that the negative sample video does not belong to a corresponding video category, the target element is an element located at an element position that previously establishes a correspondence relationship with the video category corresponding to the vector, and the video category corresponding to the vector is a video category corresponding to a sample video set to which the sample video corresponding to the vector belongs.

4. The method of claim 1, wherein the training an initial model comprises:

5. The method of claim 1 or 4, wherein the initial model further comprises a video frame extraction layer; and

the training of the initial model by taking the positive sample video included in the sample video set as input, taking the positive category information corresponding to the input positive sample video as expected output, taking the negative sample video in the sample video set as input, and taking the negative category information corresponding to the input negative sample video as expected output includes:

inputting a positive sample video included in the sample video set into a video frame extraction layer to obtain a positive sample video frame set; taking the obtained positive sample video frame set as the input of a feature extraction layer, and taking positive category information corresponding to the input positive sample video as the expected output of an initial model; inputting negative sample videos included in the sample video set into the video frame extraction layer to obtain a negative sample video frame set; and taking the obtained negative sample video frame set as the input of the feature extraction layer, taking the negative category information corresponding to the input negative sample video as the expected output of the initial model, and training the initial model.

6. A method for generating a set of category labels for a video, comprising:

acquiring a video to be classified;

inputting the video to be classified into a pre-trained video label model to generate a category label set, wherein category labels correspond to preset video categories and are used for representing that the video to be classified belongs to video categories corresponding to the category labels, and the video label model is generated according to the method of one of claims 1 to 5.

7. An apparatus for generating a video tag model, comprising:

an obtaining unit configured to obtain at least two sample video sets, wherein the sample video sets correspond to preset video categories, the sample video sets include positive sample videos belonging to the corresponding video categories and negative sample videos not belonging to the corresponding video categories, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information;

a training unit configured to select a sample video set from the at least two sample video sets, and with the selected sample video set, perform the following training steps: by utilizing a machine learning method, taking a positive sample video included in a sample video set as an input, taking positive category information corresponding to the input positive sample video as an expected output, taking a negative sample video in the sample video set as an input, taking negative category information corresponding to the input negative sample video as an expected output, and training an initial model; determining whether the at least two sample video sets include an unselected sample video set; in response to determining not to include, determining the initial model after the last training to be a video tag model;

8. The apparatus of claim 7, wherein the apparatus further comprises:

a selection unit configured to, in response to determining that the at least two sample video sets include an unselected sample video set, reselect a sample video set from the unselected sample video set, and continue performing the training step using the reselected sample video set and the initial model after the last training.

9. The apparatus according to claim 7, wherein the positive category information and the negative category information are vectors each including a preset number of elements, a target element in a vector corresponding to the positive sample video is used to characterize that the positive sample video belongs to a corresponding video category, a target element in a vector corresponding to the negative sample video is used to characterize that the negative sample video does not belong to a corresponding video category, and the target element is an element located at an element position in the vector, where a correspondence relationship is established in advance with a video category corresponding to a sample video set to which the sample video corresponding to the vector belongs.

10. The apparatus of claim 7, wherein the training unit is further configured to:

11. The apparatus of claim 7 or 10, wherein the initial model further comprises a video frame extraction layer; and

the training unit is further configured to:

12. An apparatus for generating a set of category labels for a video, comprising:

an acquisition unit configured to acquire a video to be classified;

a generating unit, configured to input the video to be classified into a pre-trained video tag model, and generate a category tag set, where a category tag corresponds to a preset video category and is used for representing that the video to be classified belongs to a video category corresponding to the category tag, and the video tag model is generated according to the method of one of claims 1 to 5.

13. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.

14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.