Summary of the invention
Embodiments of the present disclosure provide a method and apparatus for generating a video tag model, and a method and apparatus for generating a set of class labels for a video.
In a first aspect, embodiments of the present disclosure provide a method for generating a video tag model. The method includes: acquiring at least two sample video sets, where each sample video set corresponds to a preset video category and includes positive sample videos belonging to the corresponding video category and negative sample videos not belonging to the corresponding video category, each positive sample video corresponding to pre-annotated positive classification information and each negative sample video corresponding to pre-annotated negative classification information; selecting a sample video set from the at least two sample video sets and performing, using the selected sample video set, the following training steps: training an initial model using a machine learning method, with the positive sample videos included in the sample video set as inputs and the positive classification information corresponding to each input positive sample video as the desired output, and with the negative sample videos in the sample video set as inputs and the negative classification information corresponding to each input negative sample video as the desired output; determining whether the at least two sample video sets include an unselected sample video set; and in response to determining that they do not, determining that the initial model obtained after the last round of training is the video tag model.
In some embodiments, the method further includes: in response to determining that the at least two sample video sets include an unselected sample video set, reselecting a sample video set from the unselected sample video sets, and continuing to perform the training steps using the reselected sample video set and the initial model obtained after the last round of training.
In some embodiments, the positive classification information and the negative classification information are each a vector including a preset number of elements. A target element in the vector corresponding to a positive sample video characterizes that the positive sample video belongs to the corresponding video category, and a target element in the vector corresponding to a negative sample video characterizes that the negative sample video does not belong to the corresponding video category. The target element is the element located at the element position, among the element positions in the vector, for which a correspondence has been pre-established with the video category of the sample video set to which the sample video corresponding to the vector belongs.
In some embodiments, the initial model is a convolutional neural network including a feature extraction layer and a classification layer. The classification layer includes a preset number of pieces of weight data, each piece of weight data corresponding to a preset video category and used to determine the probability that an input video belongs to the video category corresponding to that piece of weight data.
In some embodiments, training the initial model includes: training the initial model by fixing, among the preset number of pieces of weight data, the pieces of weight data other than the piece corresponding to the sample video set, and adjusting the piece of weight data corresponding to the sample video set.
In some embodiments, the initial model further includes a video frame extraction layer; and taking the positive sample videos included in the sample video set as inputs with the corresponding positive classification information as the desired output, and taking the negative sample videos in the sample video set as inputs with the corresponding negative classification information as the desired output, to train the initial model, includes: inputting the positive sample videos included in the sample video set into the video frame extraction layer to obtain sets of positive sample video frames; taking the obtained sets of positive sample video frames as inputs of the feature extraction layer, with the positive classification information corresponding to each input positive sample video as the desired output of the initial model; inputting the negative sample videos included in the sample video set into the video frame extraction layer to obtain sets of negative sample video frames; and taking the obtained sets of negative sample video frames as inputs of the feature extraction layer, with the negative classification information corresponding to each input negative sample video as the desired output of the initial model, to train the initial model.
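The frame-extraction / feature-extraction / classification pipeline described above can be sketched end to end. All three stages here are hypothetical stand-ins (every-other-frame sampling, mean-pixel features, a two-score classifier), chosen only to make the data flow concrete:

```python
# Sketch of the forward path: video -> video frame extraction layer
# -> feature extraction layer -> classification layer.
def extract_frames(video, step=2):
    return video[::step]                      # keep every `step`-th frame

def extract_features(frames):
    # stand-in feature: mean pixel value per frame
    return [sum(f) / len(f) for f in frames]

def classify(features):
    # stand-in classifier: two "category" scores derived from the features
    avg = sum(features) / len(features)
    return [avg, 1 - avg]

video = [[0.2, 0.4], [0.6, 0.8], [0.1, 0.3]]  # 3 toy "frames" of 2 pixels
scores = classify(extract_features(extract_frames(video)))
```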
In a second aspect, embodiments of the present disclosure provide a method for generating a set of class labels for a video. The method includes: acquiring a video to be classified; and inputting the video to be classified into a pre-trained video tag model to generate a set of class labels, where each class label corresponds to a preset video category and characterizes that the video to be classified belongs to that video category, and the video tag model is generated according to the method described in any embodiment of the first aspect.
In a third aspect, embodiments of the present disclosure provide an apparatus for generating a video tag model. The apparatus includes: an acquiring unit configured to acquire at least two sample video sets, where each sample video set corresponds to a preset video category and includes positive sample videos belonging to the corresponding video category and negative sample videos not belonging to the corresponding video category, each positive sample video corresponding to pre-annotated positive classification information and each negative sample video corresponding to pre-annotated negative classification information; and a training unit configured to select a sample video set from the at least two sample video sets and perform, using the selected sample video set, the following training steps: training an initial model using a machine learning method, with the positive sample videos included in the sample video set as inputs and the corresponding positive classification information as the desired output, and with the negative sample videos in the sample video set as inputs and the corresponding negative classification information as the desired output; determining whether the at least two sample video sets include an unselected sample video set; and in response to determining that they do not, determining that the initial model obtained after the last round of training is the video tag model.
In some embodiments, the apparatus further includes: a selecting unit configured to, in response to determining that the at least two sample video sets include an unselected sample video set, reselect a sample video set from the unselected sample video sets and continue to perform the training steps using the reselected sample video set and the initial model obtained after the last round of training.
In some embodiments, the positive classification information and the negative classification information are each a vector including a preset number of elements. A target element in the vector corresponding to a positive sample video characterizes that the positive sample video belongs to the corresponding video category, and a target element in the vector corresponding to a negative sample video characterizes that the negative sample video does not belong to the corresponding video category. The target element is the element located at the element position, among the element positions in the vector, for which a correspondence has been pre-established with the video category of the sample video set to which the sample video corresponding to the vector belongs.
In some embodiments, the initial model is a convolutional neural network including a feature extraction layer and a classification layer. The classification layer includes a preset number of pieces of weight data, each piece of weight data corresponding to a preset video category and used to determine the probability that an input video belongs to the video category corresponding to that piece of weight data.
In some embodiments, the training unit is further configured to: train the initial model by fixing, among the preset number of pieces of weight data, the pieces of weight data other than the piece corresponding to the sample video set, and adjusting the piece of weight data corresponding to the sample video set.
In some embodiments, the initial model further includes a video frame extraction layer; and the training unit is further configured to: input the positive sample videos included in the sample video set into the video frame extraction layer to obtain sets of positive sample video frames; take the obtained sets of positive sample video frames as inputs of the feature extraction layer, with the positive classification information corresponding to each input positive sample video as the desired output of the initial model; input the negative sample videos included in the sample video set into the video frame extraction layer to obtain sets of negative sample video frames; and take the obtained sets of negative sample video frames as inputs of the feature extraction layer, with the negative classification information corresponding to each input negative sample video as the desired output of the initial model, to train the initial model.
In a fourth aspect, embodiments of the present disclosure provide an apparatus for generating a set of class labels for a video. The apparatus includes: an acquiring unit configured to acquire a video to be classified; and a generating unit configured to input the video to be classified into a pre-trained video tag model to generate a set of class labels, where each class label corresponds to a preset video category and characterizes that the video to be classified belongs to that video category, and the video tag model is generated according to the method described in any embodiment of the first aspect.
In a fifth aspect, embodiments of the present disclosure provide an electronic device including: one or more processors; and a storage apparatus storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first or second aspect.
In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium storing a computer program that, when executed by a processor, implements the method described in any implementation of the first or second aspect.
The method and apparatus for generating a video tag model provided by embodiments of the present disclosure acquire at least two sample video sets, where each sample video set corresponds to a preset video category and includes positive sample videos corresponding to positive classification information and negative sample videos corresponding to negative classification information; an initial model is then trained with the positive sample videos as inputs and the positive classification information as the desired output, and with the negative sample videos as inputs and the negative classification information as the desired output, finally obtaining a video tag model. Because the positive and negative classification information each concerns a single category, embodiments of the present disclosure can obtain a video tag model capable of multi-label classification by training only with classification information annotated as single labels, without requiring multi-label annotation. The simplicity and specificity of single-label annotation thereby improve the flexibility of model training and help improve the accuracy of classifying videos with the video tag model.
Detailed description
The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the relevant disclosure, not to limit it. It should also be noted that, for ease of description, only the parts relevant to the disclosure are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments in the present disclosure and the features in the embodiments may be combined with one another. The present disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 in which embodiments of the method or apparatus for generating a video tag model, or of the method or apparatus for generating a set of class labels for a video, of the present disclosure may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102 and 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 101, 102 and 103 to interact with the server 105 through the network 104 to receive or send messages. Various communication client applications, such as video playing applications, video processing applications, web browser applications and social platform software, may be installed on the terminal devices 101, 102 and 103.
The terminal devices 101, 102 and 103 may be hardware or software. When they are hardware, they may be various electronic devices. When they are software, they may be installed in the above electronic devices, and may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module, which is not specifically limited here.
The server 105 may be a server providing various services, for example a background model server that performs model training using sample video sets uploaded by the terminal devices 101, 102 and 103. The background model server may perform model training using the acquired at least two sample video sets to generate a video tag model, and may also send the video tag model to the terminal devices, or process a video to be classified using the video tag model to obtain labels for the video to be classified.
It should be noted that the method for generating a video tag model provided by embodiments of the present disclosure may be performed by the server 105 or by the terminal devices 101, 102 and 103; correspondingly, the apparatus for generating a video tag model may be provided in the server 105 or in the terminal devices 101, 102 and 103. Likewise, the method for generating a set of class labels for a video provided by embodiments of the present disclosure may be performed by the server 105 or by the terminal devices 101, 102 and 103; correspondingly, the apparatus for generating a set of class labels for a video may be provided in the server 105 or in the terminal devices 101, 102 and 103.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module, which is not specifically limited here.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative; there may be any number of terminal devices, networks and servers as required by the implementation. In the case where the sample video sets needed to train the model, or the video to be classified, need not be obtained remotely, the above system architecture may include no network, and only a server or a terminal device is needed.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for generating a video tag model according to the present disclosure is shown. The method for generating a video tag model includes the following steps:
Step 201: acquire at least two sample video sets.
In this embodiment, an executing body of the method for generating a video tag model (for example, the server or terminal device shown in Fig. 1) may acquire the at least two sample video sets remotely through a wired or wireless connection, or acquire them locally. Each sample video set corresponds to a preset video category and includes positive sample videos belonging to the corresponding video category and negative sample videos not belonging to the corresponding video category; each positive sample video corresponds to pre-annotated positive classification information, and each negative sample video corresponds to pre-annotated negative classification information.
Specifically, the positive classification information and the negative classification information may include information in at least one of the following forms: text, numbers, symbols and the like. For example, the positive classification information may be "seaside" and the negative classification information may be "non-seaside".
It should be noted that the positive sample videos and negative sample videos used in this embodiment include image sequences containing at least two images.
In some optional implementations of this embodiment, the positive classification information and the negative classification information are each a vector including a preset number of elements. A target element in the vector corresponding to a positive sample video characterizes that the positive sample video belongs to the corresponding video category, and a target element in the vector corresponding to a negative sample video characterizes that the negative sample video does not belong to the corresponding video category. The target element is the element located at the element position for which a correspondence with the video category corresponding to the vector has been pre-established, where the video category corresponding to a vector is the video category of the sample video set to which the sample video corresponding to the vector belongs.
As an example, suppose the preset number is 100 and, for a given sample video set, the corresponding video category is the seaside category. Then the positive classification information corresponding to a positive sample video in that sample video set may be the vector (1, 0, 0, 0, ..., 0), which includes 100 elements and in which the first element corresponds to the seaside category. Here, the number 1 indicates that the video belongs to the seaside category, and each other element 0 indicates that the video does not belong to the video category corresponding to that element position. Correspondingly, the negative classification information may be the vector (0, 0, 0, 0, ..., 0). It should be noted that the other elements may also take values other than 0. Classification information annotated in vector form is commonly used to train multi-label classification models; since each vector here characterizes only whether a single video belongs to a single video category, the classification information of this implementation can be regarded as a single label. When training on a given sample video set, the training approach of a single-label model can therefore be used, which helps simplify the training steps.
By characterizing classification information with vectors, the set of video categories recognized by the video tag model can also be extended flexibly. For example, suppose the preset number is 100, i.e., the model can recognize at most 100 categories of video, but in practice only 10 video categories need to be recognized, with the 1st to 10th elements of the vector corresponding to the preset video categories. When the video tag model needs to recognize videos of more categories, it is only necessary to assign video categories to the other elements, so that the recognition capability of the video tag model can be extended flexibly.
Step 202: select a sample video set from the at least two sample video sets and, using the selected sample video set, perform the following training steps: training an initial model using a machine learning method, with the positive sample videos included in the sample video set as inputs and the positive classification information corresponding to each input positive sample video as the desired output, and with the negative sample videos in the sample video set as inputs and the negative classification information corresponding to each input negative sample video as the desired output; determining whether the at least two sample video sets include an unselected sample video set; and in response to determining that they do not, determining that the initial model obtained after the last round of training is the video tag model.
In this embodiment, the above executing body may perform the following sub-steps:
Step 2021: select a sample video set from the at least two sample video sets.
Specifically, the above executing body may select a sample video set in various manners, for example randomly, or according to a pre-set numbering order of the sample video sets.
Then, using the selected sample video set, the following training steps (including steps 2022 to 2024) are performed.
Step 2022: train an initial model using a machine learning method, with the positive sample videos included in the sample video set as inputs and the positive classification information corresponding to each input positive sample video as the desired output, and with the negative sample videos in the sample video set as inputs and the negative classification information corresponding to each input negative sample video as the desired output.
Specifically, the initial model may be a model of any of various types, such as a recurrent neural network model or a convolutional neural network model. During training of the initial model, for each positive sample video or negative sample video input for training, an actual output can be obtained, where the actual output is data actually output by the initial model and characterizes classification information. The above executing body may then use gradient descent to adjust the parameters of the initial model based on the actual output and the desired output, take the model obtained after each parameter adjustment as the initial model for the next round of training, and end the training for a given sample video set when a preset termination condition is met. It should be noted that the preset training termination condition here may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset count; or the loss value calculated using a preset loss function (for example, a cross-entropy loss function) is less than a preset loss threshold.
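The three termination conditions listed above can be sketched as a single predicate; the threshold values here are hypothetical examples, not values from the disclosure:

```python
# Sketch of the preset termination condition: stop when any of the three
# criteria is met (time limit, iteration limit, or loss below target).
def should_stop(elapsed_s, iterations, loss,
                max_seconds=3600, max_iters=10000, loss_target=0.01):
    return (elapsed_s > max_seconds      # training time exceeds preset duration
            or iterations > max_iters    # iteration count exceeds preset count
            or loss < loss_target)       # loss under preset loss threshold

stop = should_stop(elapsed_s=10, iterations=50, loss=0.005)  # loss target met
keep = should_stop(elapsed_s=10, iterations=50, loss=0.5)    # keep training
```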
As an example, the initial model may include at least two binary classification models, each corresponding to one sample video set. A given binary classification model may be obtained by training on the positive sample videos and negative sample videos included in the corresponding sample video set. The finally trained binary classification model can determine whether an input video belongs to the video category corresponding to that binary classification model and, if so, generate a label characterizing that video category. Thus, when the finally trained video tag model is used to classify a video, at least one label characterizing a video category can be generated, achieving the effect of multi-label classification.
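The per-category binary-classifier view described above can be sketched as follows; the two lambda classifiers are hypothetical stand-ins that "classify" a toy set of keywords rather than a real video:

```python
# Sketch of multi-label classification via one binary classifier per
# category: the output label set is every category whose classifier
# answers "yes" for the input.
def multi_label(video, binary_classifiers):
    return [cat for cat, clf in binary_classifiers.items() if clf(video)]

classifiers = {
    "seaside": lambda v: "water" in v,  # stand-in binary classifier
    "hotel":   lambda v: "room" in v,   # stand-in binary classifier
}
labels = multi_label({"water", "room"}, classifiers)  # both categories match
```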
In some optional implementations of this embodiment, the above initial model may be a convolutional neural network including a feature extraction layer and a classification layer. The classification layer includes a preset number of pieces of weight data, each corresponding to a preset video category and used to determine the probability that an input video belongs to the video category corresponding to that piece of weight data. Generally, the feature extraction layer may include convolutional layers, pooling layers and the like, and is used to generate feature data of a video; the feature data may characterize features of the images in the video such as color and shape. The classification layer includes a fully connected layer that generates a feature vector (for example, a 2048-dimensional vector) from the feature data output by the feature extraction layer. Each piece of weight data includes weight coefficients, which can be multiplied with the feature data, and may also include a bias; using the weight coefficients and the bias, a probability value corresponding to the piece of weight data can be obtained, characterizing the probability that the input video belongs to the video category corresponding to that piece of weight data.
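One piece of classification-layer weight data as described above can be sketched as a weight vector plus a bias applied to the feature vector. The sigmoid squashing is an assumption (the disclosure only says the weight data yields a probability), and a 3-dimensional feature vector stands in for the 2048-dimensional one:

```python
import math

# Sketch of one piece of "weight data": weight coefficients multiplied
# with the feature vector, plus a bias, squashed to a probability.
def category_probability(features, weights, bias):
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))  # assumed sigmoid, in (0, 1)

p = category_probability([0.5, -1.0, 2.0], [1.0, 0.2, 0.3], bias=0.1)
# score = 0.5 - 0.2 + 0.6 + 0.1 = 1.0
```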
In some optional implementations of this embodiment, the above executing body may train the initial model as follows: fixing, among the above preset number of pieces of weight data, the pieces of weight data other than the piece corresponding to the sample video set, and adjusting the piece of weight data corresponding to the sample video set, to train the initial model.
Specifically, for a given sample video set, the pieces of weight data other than the piece corresponding to that sample video set are fixed, and the training approach of a binary classification model can be used to adjust the piece of weight data corresponding to that sample video set, so that this piece of weight data is optimized. It should be noted that methods of training binary classification models are well-known techniques that are currently widely researched and applied, and are not described in detail here. Through this implementation, the pieces of weight data included in the video tag model can be made independent of one another: training with one sample video set does not affect the other pieces of weight data, so that the finally obtained video tag model classifies videos more accurately. Moreover, because multiple pieces of weight data are used, the finally obtained video tag model can assign an input video to multiple video categories, achieving the effect of multi-label classification.
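The fix-others/adjust-one training step described above can be sketched as follows. The one-step gradient update and learning rate are hypothetical; the point is only that the weight row of the current sample video set's category changes while every other row stays fixed:

```python
# Sketch of per-set training: update only the weight row belonging to
# the current sample set's category; all other rows are left untouched.
def update_one_category(weight_rows, category_index, gradient, lr=0.1):
    new_rows = [row[:] for row in weight_rows]       # copy all rows
    new_rows[category_index] = [
        w - lr * g for w, g in zip(new_rows[category_index], gradient)
    ]                                                 # adjust only this row
    return new_rows

rows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
updated = update_one_category(rows, 1, gradient=[1.0, -1.0])
# rows 0 and 2 unchanged; row 1 moves against the gradient
```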
In some optional implementations of this embodiment, the initial model further includes a video frame extraction layer. The above executing body may train the initial model as follows: inputting the positive sample videos included in the sample video set into the video frame extraction layer to obtain sets of positive sample video frames; taking the obtained sets of positive sample video frames as inputs of the feature extraction layer, with the positive classification information corresponding to each input positive sample video as the desired output of the initial model, to train the initial model; inputting the negative sample videos included in the sample video set into the video frame extraction layer to obtain sets of negative sample video frames; and taking the obtained sets of negative sample video frames as inputs of the feature extraction layer, with the negative classification information corresponding to each input negative sample video as the desired output of the initial model, to train the initial model.
Specifically, the video frame extraction layer may extract video frames according to any of various preset video frame extraction modes. For example, the key frames of an input sample video may be extracted as sample video frames according to an existing method for extracting key frames of a video, or video frames may be extracted as sample video frames at a preset playback time interval. Through this implementation, a certain number of video frames can be extracted from a sample video for classifying it, which can reduce the amount of computation of the model and improve the efficiency of model training.
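Interval-based frame extraction as described above can be sketched as follows; the frame rate and interval values are hypothetical examples:

```python
# Sketch of extraction at a preset playback time interval: take one
# frame every `interval_s` seconds of playback.
def sample_frames(num_frames, fps, interval_s):
    step = int(fps * interval_s)
    return list(range(0, num_frames, step))  # indices of sampled frames

# 300 frames at 30 fps (10 s of video), one frame every 2 s -> 5 frames
indices = sample_frames(300, fps=30, interval_s=2)
```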
Step 2023: determine whether the at least two sample video sets include an unselected sample video set.
Step 2024: in response to determining that they do not, determine that the initial model obtained after the last round of training is the video tag model.
In some optional implementations of this embodiment, the above executing body may, in response to determining that the at least two sample video sets include an unselected sample video set, reselect a sample video set from the unselected sample video sets and continue to perform the above training steps (i.e., steps 2022 to 2024) using the reselected sample video set and the initial model obtained after the last round of training. The manner of reselecting a sample video set from the unselected sample video sets may be random selection or selection according to the numbering order of the sample video sets, and is not limited here.
The video tag model obtained by training through the above steps can be used to determine the probability values that an input video belongs to each preset video category; if a probability value is greater than or equal to a preset probability threshold, a class label characterizing that the input video belongs to the video category corresponding to that probability value is generated. In practical applications, the video tag model may output a set of class labels, each corresponding to a preset video category and characterizing that the video input to the video tag model belongs to that video category. The video tag model obtained by training is thus a multi-label classification model.
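The threshold-based label-set generation described above can be sketched as follows; the category names and threshold value are hypothetical examples:

```python
# Sketch of label-set generation: emit one class label per category
# whose probability meets or exceeds the preset probability threshold.
def label_set(probabilities, categories, threshold=0.5):
    return {c for c, p in zip(categories, probabilities) if p >= threshold}

labels = label_set([0.9, 0.2, 0.7], ["seaside", "hotel", "beach"], 0.5)
```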
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for generating a video tag model according to the present embodiment. In the application scenario of Fig. 3, an electronic device 301 first acquires at least two sample video sets 302. Each sample video set corresponds to a preset video category and includes positive sample videos belonging to the corresponding video category and negative sample videos not belonging to it; each positive sample video corresponds to pre-annotated positive classification information, and each negative sample video corresponds to pre-annotated negative classification information. For example, sample video set 3021 corresponds to the video category "seashore", and sample video set 3022 corresponds to the video category "hotel". The positive classification information corresponding to each positive sample video included in sample video set 3021 is the vector (1, 0, 0, ...), and the negative classification information corresponding to each negative sample video included therein is the vector (0, 0, 0, ...). The positive classification information corresponding to each positive sample video included in sample video set 3022 is the vector (0, 1, 0, ...), and the negative classification information corresponding to each negative sample video included therein is the vector (0, 0, 0, ...). Each element position in a vector corresponds to one video category.
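The label vectors above can be constructed mechanically. The sketch below assumes three preset video categories for illustration; the helper names are not from the disclosure.

```python
NUM_CATEGORIES = 3  # assumed number of preset video categories


def positive_classification_info(category_index, num_categories=NUM_CATEGORIES):
    """Vector with a 1 at the element position of the sample set's own
    category: the positive sample belongs to that video category."""
    vector = [0] * num_categories
    vector[category_index] = 1
    return vector


def negative_classification_info(num_categories=NUM_CATEGORIES):
    """All-zero vector: the negative sample does not belong to the category."""
    return [0] * num_categories


# Sample video set 3021 ("seashore", position 0) and 3022 ("hotel", position 1):
print(positive_classification_info(0))  # [1, 0, 0]
print(positive_classification_info(1))  # [0, 1, 0]
print(negative_classification_info())   # [0, 0, 0]
```

Note that a negative label is all zeros regardless of which sample set it comes from: it only asserts that the video does not belong to that set's category, saying nothing about the other categories.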
Then, the electronic device 301 selects sample video sets one by one from the above at least two sample video sets 302, in a preset numbered order of the sample video sets, and executes the following training step with each selected sample video set: using a machine learning method, taking the positive sample videos included in the sample video set as input and the positive classification information corresponding to each input positive sample video as desired output, taking the negative sample videos in the sample video set as input and the negative classification information corresponding to each input negative sample video as desired output, and training an initial model 303. As shown in the figure, sample video set 3021 is used to train the initial model 303. After each round of training with a sample video set, the initial model 303 retains its adjusted parameters, which are carried into training with the remaining sample video sets. After each round, the electronic device 301 determines whether the at least two sample video sets 302 still include an unselected sample video set; if not, i.e., all sample video sets have been used for training, the initial model obtained after the most recent training is determined to be the video tag model 304.
In the method provided by the above embodiment of the present disclosure, at least two sample video sets are acquired, where each sample video set corresponds to a preset video category and includes positive sample videos and negative sample videos, the positive sample videos corresponding to positive classification information and the negative sample videos corresponding to negative classification information. The positive sample videos are then taken as input with the positive classification information as desired output, and the negative sample videos are taken as input with the negative classification information as desired output, to train an initial model and finally obtain a video tag model. Because the positive and negative classification information each targets a single category, the embodiment of the present disclosure can train using only classification information annotated with a single label, without requiring classification information annotated with multiple labels, and still obtain a video tag model capable of multi-label classification. This exploits the simplicity and specificity of single-label annotation, improves the flexibility of model training, and helps improve the accuracy of video classification performed with the video tag model.
With further reference to Fig. 4, a flow 400 of an embodiment of the method for generating a class label set of a video according to the present disclosure is shown. The method for generating a class label set of a video includes the following steps:
Step 401: acquire a to-be-classified video.
In the present embodiment, the executing body of the method for generating a class label set of a video (e.g., the server or terminal device shown in Fig. 1) may acquire the to-be-classified video locally or remotely. The to-be-classified video is the video on which classification is to be performed. It should be noted that the to-be-classified video used in the present embodiment includes an image sequence containing at least two images.
Step 402: input the to-be-classified video into a pre-trained video tag model to generate a class label set.
In the present embodiment, the above executing body may input the to-be-classified video into the pre-trained video tag model to generate a class label set. Each class label corresponds to a preset video category and characterizes that the to-be-classified video belongs to the video category corresponding to that class label. A class label may take various forms, including but not limited to at least one of the following: text, numbers, symbols, etc.
In the present embodiment, the video tag model is generated according to the method described in the embodiment corresponding to Fig. 2 above; for details, reference may be made to the steps described in that embodiment, which are not repeated here.
In general, the generated class label set may be stored in association with the to-be-classified video. For example, the class label set may be stored, as attribute information of the to-be-classified video, into the attribute information set of the to-be-classified video, thereby making the attributes characterizing the to-be-classified video more comprehensive. The attribute information set may include, but is not limited to, at least one of the following items of attribute information: the title, size, generation time, etc. of the to-be-classified video.
Optionally, the generated class label set may be output in various manners, for example displayed on a display screen included in the above executing body, or sent to another electronic device communicatively connected to the above executing body.
In the method provided by the above embodiment of the present disclosure, the to-be-classified video is classified using the video tag model generated in the embodiment corresponding to Fig. 2 to produce the class label set of the to-be-classified video. A video tag model trained with single-label samples is thereby used for multi-label classification, improving the accuracy and efficiency of video classification.
With further reference to Fig. 5, as an implementation of the method shown in Fig. 2 above, the present disclosure provides an embodiment of an apparatus for generating a video tag model. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied in various electronic devices.
As shown in Fig. 5, the apparatus 500 for generating a video tag model of the present embodiment includes: an acquiring unit 501, configured to acquire at least two sample video sets, where each sample video set corresponds to a preset video category and includes positive sample videos belonging to the corresponding video category and negative sample videos not belonging to the corresponding video category, the positive sample videos corresponding to pre-annotated positive classification information and the negative sample videos corresponding to pre-annotated negative classification information; and a training unit 502, configured to select a sample video set from the at least two sample video sets and, using the selected sample video set, execute the following training step: using a machine learning method, taking the positive sample videos included in the sample video set as input and the positive classification information corresponding to each input positive sample video as desired output, taking the negative sample videos in the sample video set as input and the negative classification information corresponding to each input negative sample video as desired output, and training an initial model; determining whether the at least two sample video sets include an unselected sample video set; and in response to determining that they do not, determining the initial model obtained after the most recent training to be the video tag model.
In the present embodiment, the acquiring unit 501 may acquire the at least two sample video sets remotely through a wired or wireless connection, or acquire them locally. Each sample video set corresponds to a preset video category and includes positive sample videos belonging to the corresponding video category and negative sample videos not belonging to it; each positive sample video corresponds to pre-annotated positive classification information, and each negative sample video corresponds to pre-annotated negative classification information.
Specifically, the positive classification information and the negative classification information may include information in at least one of the following forms: text, numbers, symbols, etc. For example, positive classification information may be "seashore" and negative classification information may be "non-seashore".
It should be noted that the positive sample videos and negative sample videos used in the present embodiment each include an image sequence containing at least two images.
In the present embodiment, the training unit 502 may execute the following sub-steps:
Step 5021: select a sample video set from the at least two sample video sets.
Specifically, the above training unit 502 may select the sample video set in various manners, for example at random, or in the numbered order of the sample video sets.
Then, using the selected sample video set, the following training step (including steps 5022 to 5024) is executed.
Step 5022: using a machine learning method, take the positive sample videos included in the sample video set as input and the positive classification information corresponding to each input positive sample video as desired output, take the negative sample videos in the sample video set as input and the negative classification information corresponding to each input negative sample video as desired output, and train an initial model.
Specifically, the initial model may be a model of various types, such as a recurrent neural network model or a convolutional neural network model. During training of the initial model, an actual output may be obtained for each positive or negative sample video input for training; the actual output is the data actually output by the initial model and characterizes classification information. The above training unit 502 may then use gradient descent to adjust the parameters of the initial model based on the actual output and the desired output, take the model obtained after each parameter adjustment as the initial model for the next round of training, and end the training for one sample video set when a preset termination condition is met. The preset training termination condition here may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset count; the loss value calculated with a preset loss function (e.g., a cross-entropy loss function) is less than a preset loss threshold.
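The per-sample-set loop described above (repeated gradient-descent parameter adjustment until a preset termination condition is met) can be sketched as follows. The toy quadratic objective stands in for the real cross-entropy loss, and the concrete thresholds are illustrative assumptions, not the patented training procedure.

```python
import time


def train_on_sample_set(params, step_fn, max_seconds=60.0, max_steps=1000,
                        loss_threshold=1e-3):
    """Repeatedly adjust the parameters (one gradient-descent update per call
    to step_fn) and stop as soon as any preset termination condition is met:
    training time exceeded, iteration count exceeded, or loss below threshold."""
    start = time.monotonic()
    steps = 0
    loss = float("inf")
    while (time.monotonic() - start <= max_seconds
           and steps < max_steps
           and loss >= loss_threshold):
        loss = step_fn(params)  # one parameter adjustment; returns the loss
        steps += 1
    return params, loss, steps


# Toy objective (w - 3)^2 minimised by gradient descent, standing in for the
# loss over one sample video set:
def quadratic_step(params, target=3.0, lr=0.1):
    params[0] -= lr * 2.0 * (params[0] - target)  # gradient of (w - target)^2
    return (params[0] - target) ** 2


params, loss, steps = train_on_sample_set([0.0], quadratic_step, max_steps=500)
print(round(params[0], 2), steps < 500)  # converges near 3.0 before 500 steps
```

The returned parameters would then serve as the initial model for the next sample video set, as the embodiment describes.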
As an example, the initial model may include at least two binary classification models, each corresponding to one sample video set. A given binary classification model may be trained on the positive sample videos and negative sample videos included in its corresponding sample video set. The finally trained binary classification model can determine whether an input video belongs to the video category corresponding to that binary classification model and, if it does, generate a label characterizing that video category. Accordingly, when the finally trained video tag model is used for video classification, at least one label characterizing a video category can be generated, achieving the effect of multi-label classification.
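The one-binary-classifier-per-category decomposition above amounts to taking the union of the categories whose classifier answers "belongs". A minimal sketch, with trivial hand-written classifiers standing in for the trained models (the feature names and decision rules are assumptions for illustration):

```python
def multi_label_classify(features, binary_classifiers):
    """binary_classifiers maps each preset video category to a function that
    answers whether the input is judged to belong to that category; the class
    label set is the union of the categories answered in the affirmative."""
    return {name for name, belongs in binary_classifiers.items()
            if belongs(features)}


# Hypothetical per-category classifiers keyed by preset video category:
classifiers = {
    "seashore": lambda f: f.get("water", 0.0) > 0.5,
    "hotel": lambda f: f.get("rooms", 0.0) > 0.5,
}

print(sorted(multi_label_classify({"water": 0.9, "rooms": 0.7}, classifiers)))
# ['hotel', 'seashore']
```

Each classifier decides independently, so one video can trigger several categories at once, which is the multi-label effect the example describes.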
Step 5023: determine whether the at least two sample video sets include an unselected sample video set.
Step 5024: in response to determining that they do not, determine the initial model obtained after the most recent training to be the video tag model.
In some optional implementations of the present embodiment, the apparatus 500 may further include a selecting unit (not shown in the figure), configured to, in response to determining that the at least two sample video sets include an unselected sample video set, reselect a sample video set from the unselected sample video sets and continue to execute the training step using the reselected sample video set and the initial model obtained after the most recent training.
In some optional implementations of the present embodiment, the positive classification information and the negative classification information each include a vector with a preset number of elements. A target element in the vector corresponding to a positive sample video characterizes that the positive sample video belongs to the corresponding video category, and a target element in the vector corresponding to a negative sample video characterizes that the negative sample video does not belong to the corresponding video category. The target element is the element whose position in the vector has been pre-associated with the video category corresponding to the vector; the video category corresponding to a vector is the video category corresponding to the sample video set to which the vector's sample video belongs.
In some optional implementations of the present embodiment, the initial model is a convolutional neural network including a feature extraction layer and a classification layer. The classification layer includes a preset number of items of weight data, each item of weight data corresponding to a preset video category and used to determine the probability that an input video belongs to the video category corresponding to that weight data.
In some optional implementations of the present embodiment, the training unit 502 may be further configured to fix, among the preset number of items of weight data, the weight data other than the weight data corresponding to the current sample video set, and adjust the weight data corresponding to the sample video set, so as to train the initial model.
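The per-category freezing strategy above can be sketched with plain Python lists standing in for the classification layer's weight data; this is an illustrative assumption about the update rule, not the disclosed implementation.

```python
def adjust_only_category(weights, category_index, gradients, lr=0.1):
    """weights[i] is the weight data of preset video category i; only the
    entry for category_index is updated, all other entries are held fixed."""
    updated = list(weights)
    updated[category_index] = [
        w - lr * g
        for w, g in zip(weights[category_index], gradients[category_index])
    ]
    return updated


before = [[0.5, 0.5], [0.2, 0.2]]  # two categories, two weights each
after = adjust_only_category(before, 0, [[1.0, 1.0], [9.9, 9.9]])
print(after)  # [[0.4, 0.4], [0.2, 0.2]] -- only category 0 moved
```

Freezing the other categories' weight data means a sample set annotated for a single category cannot disturb what the classification layer has already learned for the other categories.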
In some optional implementations of the present embodiment, the initial model further includes a video frame extraction layer, and the training unit 502 may be further configured to: input the positive sample videos included in the sample video set into the video frame extraction layer to obtain positive sample video frame sets; take the obtained positive sample video frame sets as input to the feature extraction layer, with the positive classification information corresponding to each input positive sample video as the desired output of the initial model; input the negative sample videos included in the sample video set into the video frame extraction layer to obtain negative sample video frame sets; and take the obtained negative sample video frame sets as input to the feature extraction layer, with the negative classification information corresponding to each input negative sample video as the desired output of the initial model, to train the initial model.
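The three-layer forward path described above (frame extraction, then feature extraction, then classification) can be sketched as a pipeline. All three layers here are toy stand-ins invented for illustration; the disclosed layers are neural-network layers, not these functions.

```python
def extract_frames(video, every_n=2):
    """Video frame extraction layer stand-in: keep every n-th image of the
    video's image sequence."""
    return video[::every_n]


def extract_features(frames):
    """Feature extraction layer stand-in: mean pixel value of the frames."""
    pixels = [p for frame in frames for p in frame]
    return sum(pixels) / len(pixels)


def classification_layer(feature, weights):
    """Classification layer stand-in: one score per preset video category,
    clamped to [0, 1] so it can be read as a probability."""
    return [min(1.0, max(0.0, w * feature)) for w in weights]


video = [[0.2, 0.4], [0.9, 0.9], [0.6, 0.8]]  # three 2-pixel "frames"
scores = classification_layer(extract_features(extract_frames(video)),
                              [1.0, 0.5])
print([round(s, 3) for s in scores])  # [0.5, 0.25]
```

Sampling frames before feature extraction keeps the per-video cost bounded regardless of the video's length, which is the practical motivation for a separate frame extraction layer.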
In the apparatus 500 provided by the above embodiment of the present disclosure, at least two sample video sets are acquired, where each sample video set corresponds to a preset video category and includes positive sample videos and negative sample videos, the positive sample videos corresponding to positive classification information and the negative sample videos corresponding to negative classification information. The positive sample videos are then taken as input with the positive classification information as desired output, and the negative sample videos are taken as input with the negative classification information as desired output, to train an initial model and finally obtain a video tag model. Because the positive and negative classification information each targets a single category, the embodiment of the present disclosure can train using only classification information annotated with a single label, without requiring classification information annotated with multiple labels, and still obtain a video tag model capable of multi-label classification. This exploits the simplicity and specificity of single-label annotation, improves the flexibility of model training, and helps improve the accuracy of video classification performed with the video tag model.
With further reference to Fig. 6, as an implementation of the method shown in Fig. 4 above, the present disclosure provides an embodiment of an apparatus for generating a class label set of a video. This apparatus embodiment corresponds to the method embodiment shown in Fig. 4, and the apparatus may be applied in various electronic devices.
As shown in Fig. 6, the apparatus 600 for generating a class label set of a video of the present embodiment includes: an acquiring unit 601, configured to acquire a to-be-classified video; and a generation unit 602, configured to input the to-be-classified video into a pre-trained video tag model to generate a class label set, where each class label corresponds to a preset video category and characterizes that the to-be-classified video belongs to the video category corresponding to that class label, and the video tag model is generated according to the method described in any embodiment of the above first aspect.
In the present embodiment, the acquiring unit 601 may acquire the to-be-classified video locally or remotely. The to-be-classified video is the video on which classification is to be performed. It should be noted that the to-be-classified video used in the present embodiment includes an image sequence containing at least two images.
In the present embodiment, the generation unit 602 may input the to-be-classified video into the pre-trained video tag model to generate a class label set. Each class label corresponds to a preset video category and characterizes that the to-be-classified video belongs to the video category corresponding to that class label. A class label may take various forms, including but not limited to at least one of the following: text, numbers, symbols, etc.
In the present embodiment, the video tag model is generated according to the method described in the embodiment corresponding to Fig. 2 above; for details, reference may be made to the steps described in that embodiment, which are not repeated here.
In general, the generated class label set may be stored in association with the to-be-classified video. For example, the class label set may be stored, as attribute information of the to-be-classified video, into the attribute information set of the to-be-classified video, thereby making the attributes characterizing the to-be-classified video more comprehensive. The attribute information set may include, but is not limited to, at least one of the following items of attribute information: the title, size, generation time, etc. of the to-be-classified video.
In the apparatus 600 provided by the above embodiment of the present disclosure, the to-be-classified video is classified using the video tag model generated in the embodiment corresponding to Fig. 2 to produce the class label set of the to-be-classified video. A video tag model trained with single-label samples is thereby used for multi-label classification, improving the accuracy and efficiency of video classification.
Referring now to Fig. 7, a structural schematic diagram of an electronic device 700 (e.g., the server or terminal device shown in Fig. 1) suitable for implementing embodiments of the present disclosure is shown. Terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in Fig. 7 is merely an example and should not impose any limitation on the functions and scope of use of embodiments of the present disclosure.
As shown in Fig. 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
In general, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 707 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 708 including, for example, a magnetic tape, hard disk, etc.; and a communication device 709. The communication device 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. Although Fig. 7 shows the electronic device 700 with various devices, it should be understood that not all of the devices shown are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided. Each box shown in Fig. 7 may represent one device or, as needed, multiple devices.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded from a network and installed through the communication device 709, installed from the storage device 708, or installed from the ROM 702. When the computer program is executed by the processing device 701, the above functions defined in the methods of the embodiments of the present disclosure are executed.
It should be noted that the computer-readable medium described in embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In embodiments of the present disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in combination with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on the computer-readable medium may be transmitted with any suitable medium, including but not limited to: an electric wire, an optical cable, RF (radio frequency), etc., or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device. The above computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire at least two sample video sets, where each sample video set corresponds to a preset video category and includes positive sample videos belonging to the corresponding video category and negative sample videos not belonging to the corresponding video category, the positive sample videos corresponding to pre-annotated positive classification information and the negative sample videos corresponding to pre-annotated negative classification information; select a sample video set from the at least two sample video sets and, using the selected sample video set, execute the following training step: using a machine learning method, take the positive sample videos included in the sample video set as input and the positive classification information corresponding to each input positive sample video as desired output, take the negative sample videos in the sample video set as input and the negative classification information corresponding to each input negative sample video as desired output, and train an initial model; determine whether the at least two sample video sets include an unselected sample video set; and in response to determining that they do not, determine the initial model obtained after the most recent training to be the video tag model.
In addition, when the above one or more programs are executed by the electronic device, the electronic device may also be caused to: acquire a to-be-classified video; and input the to-be-classified video into a pre-trained video tag model to generate a class label set.
The computer program code for executing the operations of embodiments of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each box in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing a specified logical function. It should also be noted that, in some alternative implementations, the functions annotated in the boxes may occur in an order different from that annotated in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that executes the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in embodiments of the present disclosure may be implemented by means of software or by means of hardware. The described units may also be provided in a processor; for example, a processor may be described as including an acquiring unit and a training unit. The names of these units do not, under certain circumstances, constitute a limitation on the units themselves; for example, the acquiring unit may also be described as "a unit that acquires at least two sample video sets".
The above description is merely of preferred embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in embodiments of the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) embodiments of the present disclosure.