CN110232340A - Method and apparatus for establishing a video classification model and for video classification - Google Patents

Method and apparatus for establishing a video classification model and for video classification

Info

Publication number
CN110232340A
CN110232340A (application number CN201910461500.8A)
Authority
CN
China
Prior art keywords
video
feature
extracted
classification
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910461500.8A
Other languages
Chinese (zh)
Other versions
CN110232340B (en)
Inventor
牛国成
何伯磊
肖欣延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910461500.8A priority Critical patent/CN110232340B/en
Publication of CN110232340A publication Critical patent/CN110232340A/en
Application granted granted Critical
Publication of CN110232340B publication Critical patent/CN110232340B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present invention provides a method and apparatus for establishing a video classification model and for classifying videos. The method includes: obtaining videos and the class labels corresponding to each video; extracting features of preset types from each video and constructing, based on the extracted features, a combined feature corresponding to each video; feeding each feature extracted from a video, as well as the video's combined feature, into a classification model to obtain the model's output for each feature and for the combined feature; obtaining, from the model's output for each feature and for the combined feature and from the video's class labels, the loss function corresponding to each feature and to the combined feature of the same video; and determining the loss function of the classification model from the per-feature and combined-feature loss functions, then adjusting the parameters of the classification model with that loss function to obtain the video classification model. The present invention improves the classification accuracy of the resulting video classification model.

Description

Method and apparatus for establishing a video classification model and for video classification
[Technical Field]
The present invention relates to the field of natural language processing, and in particular to a method, apparatus, device and computer storage medium for establishing a video classification model and for classifying videos.
[Background Art]
When establishing a video classification model from multiple features of a video, the prior art usually optimizes the parameters of the classification model using only the loss function corresponding to the combined feature built from those features. However, when only the combined feature of the video is used to compute the loss function of the classification model, the contributions of all features of the video cannot be taken into account, and the information carried by the original single features is lost. As a result, the video classification model obtained by training classifies videos with lower accuracy.
[Summary of the Invention]
In view of this, the present invention provides a method, apparatus, device and computer storage medium for establishing a video classification model and for classifying videos, so that the contributions of all features of a video are taken into account during training of the classification model, thereby improving the classification accuracy of the resulting video classification model.
To solve the technical problem, the present invention provides a method for establishing a video classification model, the method including: obtaining training data containing videos and the class labels corresponding to each video; extracting features of preset types from each video, and constructing, based on the extracted features, the combined feature corresponding to each video; taking each feature extracted from a video and the video's combined feature, respectively, as inputs of a classification model, and obtaining the model's output for each feature and for the combined feature of the video; obtaining, from the model's output for each feature and for the combined feature and from the video's class labels, the loss function corresponding to each feature and to the combined feature of the same video; and determining the loss function of the classification model from the loss functions corresponding to the features and to the combined feature of the same video, then adjusting the parameters of the classification model with the model's loss function to obtain the video classification model.
In a preferred embodiment of the present invention, the features of preset types are at least two of: the visual feature of the video, the audio feature of the video, the text feature corresponding to the audio of the video, the features of subtitles and column titles in the video, the face feature of the video, the feature of the video cover image, the feature of the labels attached to the video, the feature of the video title, and the knowledge-base feature of the video.
In a preferred embodiment of the present invention, constructing the combined feature corresponding to each video based on the extracted features includes: concatenating the features extracted from the video to obtain a concatenation result; obtaining the combination relationships among the extracted features; adjusting each extracted feature according to its attention over the other features to obtain adjusted features; and concatenating the concatenation result, the combination relationships among the features, and the adjusted features, the result of this concatenation serving as the combined feature of the video.
In a preferred embodiment of the present invention, obtaining the loss function corresponding to each feature and to the combined feature of the same video includes: determining the classification task to which each video belongs according to the video's class labels; determining the calculation formula of the loss function according to the classification task; and obtaining, with the determined formula, the loss function corresponding to each feature and to the combined feature of the same video.
In a preferred embodiment of the present invention, determining the loss function of the classification model from the loss functions corresponding to the features and to the combined feature of the same video includes: taking the sum of the loss functions of the features and of the combined feature of the same video as the loss function of the classification model.
In a preferred embodiment of the present invention, the method further includes: setting the training objective of the classification model to be minimizing the loss function of the classification model.
To solve the technical problem, the present invention further provides a video classification method, the method including: obtaining a video to be classified; extracting features of preset types from the video to be classified, and constructing, based on the extracted features, the combined feature corresponding to the video to be classified; and taking the combined feature corresponding to the video to be classified as the input of a video classification model and determining the categories of the video according to the output of the video classification model.
In a preferred embodiment of the present invention, constructing the combined feature corresponding to the video to be classified based on the extracted features includes: concatenating the features extracted from the video to obtain a concatenation result; obtaining the combination relationships among the extracted features; adjusting each extracted feature according to its attention over the other features to obtain adjusted features; and concatenating the concatenation result, the combination relationships among the features, and the adjusted features, the result serving as the combined feature of the video to be classified.
In a preferred embodiment of the present invention, determining the categories of the video to be classified according to the output of the video classification model includes: finding, in the output of the video classification model, the probability values greater than a preset threshold; and taking the categories corresponding to the found probability values as the categories of the video to be classified.
To solve the technical problem, the present invention further provides an apparatus for establishing a video classification model, the apparatus including: a first acquisition unit for obtaining training data containing videos and the class labels corresponding to each video; a first construction unit for extracting features of preset types from each video and constructing, based on the extracted features, the combined feature corresponding to each video; a first processing unit for taking each feature extracted from a video and the video's combined feature, respectively, as inputs of a classification model and obtaining the model's output for each feature and for the combined feature of the video; a first training unit for obtaining, from the model's output for each feature and for the combined feature and from the video's class labels, the loss function corresponding to each feature and to the combined feature of the same video; and a second training unit for determining the loss function of the classification model from those loss functions and adjusting the parameters of the classification model with the model's loss function to obtain the video classification model.
In a preferred embodiment of the present invention, the features of preset types are at least two of: the visual feature of the video, the audio feature of the video, the text feature corresponding to the audio of the video, the features of subtitles and column titles in the video, the face feature of the video, the feature of the video cover image, the feature of the labels attached to the video, the feature of the video title, and the knowledge-base feature of the video.
In a preferred embodiment of the present invention, when constructing the combined feature corresponding to each video based on the extracted features, the first construction unit specifically: concatenates the features extracted from the video to obtain a concatenation result; obtains the combination relationships among the extracted features; adjusts each extracted feature according to its attention over the other features to obtain adjusted features; and concatenates the concatenation result, the combination relationships and the adjusted features, the result serving as the combined feature of the video.
In a preferred embodiment of the present invention, when obtaining the loss function corresponding to each feature and to the combined feature of the same video, the first training unit specifically: determines the classification task to which each video belongs according to the video's class labels; determines the calculation formula of the loss function according to the classification task; and obtains, with the determined formula, the loss function corresponding to each feature and to the combined feature of the same video.
In a preferred embodiment of the present invention, when determining the loss function of the classification model, the second training unit specifically takes the sum of the loss functions of the features and of the combined feature of the same video as the loss function of the classification model.
In a preferred embodiment of the present invention, the second training unit further sets the training objective of the classification model to be minimizing the loss function of the classification model.
To solve the technical problem, the present invention further provides a video classification apparatus, the apparatus including: a second acquisition unit for obtaining a video to be classified; a second construction unit for extracting features of preset types from the video to be classified and constructing, based on the extracted features, the combined feature corresponding to the video to be classified; and a second processing unit for taking the combined feature corresponding to the video to be classified as the input of a video classification model and determining the categories of the video according to the output of the video classification model.
In a preferred embodiment of the present invention, when constructing the combined feature corresponding to the video to be classified based on the extracted features, the second construction unit specifically: concatenates the features extracted from the video to obtain a concatenation result; obtains the combination relationships among the extracted features; adjusts each extracted feature according to its attention over the other features to obtain adjusted features; and concatenates the concatenation result, the combination relationships and the adjusted features, the result serving as the combined feature of the video to be classified.
In a preferred embodiment of the present invention, when determining the categories of the video to be classified according to the output of the video classification model, the second processing unit specifically: finds, in the output of the video classification model, the probability values greater than a preset threshold; and takes the categories corresponding to the found probability values as the categories of the video to be classified.
As can be seen from the above technical solutions, the present invention derives the loss function of the classification model from the loss functions corresponding to each individual feature of a video as well as the loss function corresponding to the video's combined feature, and then trains the video classification model with the resulting loss function. The contributions of all features of a video are thus taken into account during training of the classification model, the loss of information carried by the single features of the video is avoided, and the classification accuracy of the resulting video classification model is improved.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of a method for establishing a video classification model according to an embodiment of the present invention;
Fig. 2 is a flowchart of a video classification method according to an embodiment of the present invention;
Fig. 3 is a structural diagram of an apparatus for establishing a video classification model according to an embodiment of the present invention;
Fig. 4 is a structural diagram of a video classification apparatus according to an embodiment of the present invention;
Fig. 5 is a block diagram of a computer system/server according to an embodiment of the present invention.
[Detailed Description of the Embodiments]
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of a method for establishing a video classification model according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
In 101, training data is obtained, the training data containing videos and the class labels corresponding to each video.
In this step, videos and the class labels corresponding to each video are obtained as training data, which is used to train a video classification model. The training data may be obtained by a terminal or by a server.
Specifically, the class labels of each video obtained in this step form a label set indicating, for each preset category, whether the video belongs to it. The label of each category to which the video belongs may be manually marked as "1", and the labels of the other categories as "0".
For example, if the preset categories are category 1, category 2, category 3 and category 4, and the obtained video A belongs to category 1 and category 3, then the class labels of video A obtained in this step may be [1, 0, 1, 0].
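A minimal sketch of this labeling scheme follows; the category names and the helper function are illustrative assumptions, not part of the disclosure:

PRESET_CATEGORIES = ["category 1", "category 2", "category 3", "category 4"]

def encode_labels(video_categories):
    # Multi-hot label vector as described above: 1 for each category the
    # video belongs to, 0 for every other preset category.
    return [1 if c in video_categories else 0 for c in PRESET_CATEGORIES]

# Video A belongs to categories 1 and 3 -> [1, 0, 1, 0]
print(encode_labels({"category 1", "category 3"}))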
In 102, features of preset types are extracted from each video, and the combined feature corresponding to each video is constructed based on the extracted features.
In this step, features of preset types are extracted from each video obtained in step 101, and the combined feature corresponding to each video is then constructed based on the extracted features. The extraction and construction may be performed by the server from the obtained videos.
Specifically, the features of preset types extracted in this step are at least two of: the visual feature of the video, the audio feature of the video, the text feature corresponding to the audio of the video, the features of subtitles and column titles in the video, the face feature of the video, the feature of the video cover image, the feature of the labels attached to the video, the feature of the video title, and the knowledge-base feature of the video.
It can be understood that when extracting the features of preset types from a video, this step may use the prior art. The present invention does not limit the manner in which the features of preset types are extracted from a video.
For example, the visual feature of a video may be extracted as follows: sample 10 frames from the video; combine the frames in time order into an image sequence; predict each frame in the sequence with a pre-trained model, for example a ResNet (residual network) model, taking the output of the model's last fully-connected layer as the representation of the frame; and combine the representations of the images in the sequence, the combination serving as the visual feature of the video.
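A sketch of this pipeline with PyTorch/torchvision and OpenCV follows; the model choice, frame count and preprocessing are common practice and are assumptions, and the pooled feature feeding the final fully-connected layer is used as a stand-in for the frame representation:

import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()  # expose the representation before the classifier
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_feature(video_path, num_frames=10):
    # Sample frames uniformly, encode each with ResNet, stack in time order.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    reps = []
    for idx in torch.linspace(0, total - 1, num_frames).long().tolist():
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            reps.append(resnet(preprocess(rgb).unsqueeze(0)).squeeze(0))
    cap.release()
    return torch.stack(reps)  # (num_frames, 2048)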
For example, the audio feature of a video may be extracted as follows: extract the Mel-frequency cepstrum coefficients (MFCC) from the audio of the video; perform convolution processing on the extracted Mel-frequency cepstrum coefficients; and take the processing result as the audio feature of the video.
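A sketch of this step with librosa and a small 1-D convolution follows; the MFCC order, kernel size and output size are illustrative assumptions:

import librosa
import torch
import torch.nn as nn

conv = nn.Sequential(
    nn.Conv1d(in_channels=20, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed-size vector
)

def audio_feature(audio_path):
    # MFCCs over time, followed by convolution, as in the extraction above.
    y, sr = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # (20, frames)
    x = torch.from_numpy(mfcc).float().unsqueeze(0)     # (1, 20, frames)
    return conv(x).squeeze()                            # (64,)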
For example, the text feature corresponding to the audio of a video may be extracted as follows: recognize the audio of the video as text; segment the recognized text into words and obtain the word-vector sequence corresponding to the audio from the segmentation result; and process the obtained word-vector sequence with a convolutional neural network, taking the processing result as the text feature corresponding to the audio of the video.
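A sketch of the word-vector-plus-CNN step follows; the vocabulary size, embedding dimension and pooling are assumptions, and the speech recognizer that produces the token ids is treated as a black box:

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    # Convolve a word-vector sequence into a fixed-size text feature.
    def __init__(self, vocab_size=50000, embed_dim=128, out_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))
        return x.max(dim=2).values                 # max-pool over time

# token_ids would come from segmenting the speech-recognition transcript
text_feature = TextCNN()(torch.randint(0, 50000, (1, 30)))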
For example, the features of subtitles and column titles in a video may be extracted as follows: perform OCR (optical character recognition) on the subtitles in the video and generate a word sequence from the recognition result; segment the generated word sequence into words and obtain a word-vector sequence from the segmentation result; process the obtained word-vector sequence with a convolutional neural network, taking the processing result as the subtitle feature of the video; and extract the column titles in the video and map them to vectors, taking the mapping result as the column-title feature of the video.
For example, the face feature of a video may be extracted as follows: identify, with a face recognition facility, the identities of the persons appearing in the video and the positions of their faces; and map the recognized information to vectors with a BoW (bag-of-words) model, taking the mapping result as the face feature of the video.
For example, the feature of the video cover image may be extracted as follows: extract the cover image of the video; and predict the cover image with a pre-trained ResNet model, taking the output of the model's last fully-connected layer as the feature of the cover image.
For example, the feature of the labels attached to a video may be extracted as follows: obtain the labels attached to the video by its author when the video was uploaded; and map the labels to vectors with a BoW model, taking the mapping result as the feature of the labels attached to the video.
For example, the title feature of a video may be extracted as follows: obtain the title corresponding to the video; and process the obtained title with an LSTM (long short-term memory) model, taking the processing result as the title feature of the video.
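A sketch of the LSTM step follows; the dimensions and the use of the final hidden state as the title feature are assumptions:

import torch
import torch.nn as nn

embed = nn.Embedding(50000, 128)
lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)

def title_feature(title_token_ids):
    # Encode the title with an LSTM; use the final hidden state as the feature.
    x = embed(title_token_ids)     # (batch, seq_len, 128)
    _, (h_n, _) = lstm(x)          # h_n: (1, batch, 64)
    return h_n.squeeze(0)          # (batch, 64)

feat = title_feature(torch.randint(0, 50000, (1, 12)))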
For example, the knowledge-base feature of a video may be extracted as follows: extract contents such as author labels, person names and column titles from the video; query a preset knowledge base for the knowledge-base information corresponding to the extracted contents; and map the retrieved knowledge-base information to vectors, taking the mapping result as the knowledge-base feature of the video.
After the features of preset types have been extracted from each video, this step fuses the features extracted from each video so as to construct the combined feature corresponding to each video.
Specifically, when constructing the combined feature corresponding to each video based on the extracted features, this step may proceed as follows: concatenate the features extracted from the video to obtain a concatenation result; obtain the combination relationships among the extracted features, for example with a factorization machine (FM); adjust each extracted feature according to its attention over the other features, for example with an attention mechanism, to obtain the adjusted features; and concatenate the concatenation result, the combination relationships among the features, and the adjusted features, the result of this concatenation serving as the combined feature of the video (a code sketch of this combination is given below).
The prior art usually concatenates multiple features directly to obtain the combined feature, but this may prevent the resulting combined feature from fully capturing the information carried by each feature, so that the combination effect among the features is not obvious enough. With the combination manner provided by the present invention, the information carried by each feature and among the features can be fully captured, yielding a combined feature in which the combination effect among the features is more obvious.
It can be understood that this step may also directly take at least one of the above concatenation result, the combination relationships among the features, and the adjusted features as the combined feature of the video, or take the concatenation of any two of the concatenation result, the combination relationships among the features, and the adjusted features as the combined feature of the video.
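A minimal sketch of the combination described above follows, assuming each modality feature has already been projected to a common dimension; the FM-style pairwise interaction and the dot-product attention are one plausible reading of "combination relationships" and "attention", not the patent's exact formulation:

import torch
import torch.nn.functional as F

def combined_feature(features):
    # features: list of per-modality vectors, each of shape (d,)
    x = torch.stack(features)                    # (n, d)

    concat = x.flatten()                         # 1) plain concatenation

    # 2) FM-style second-order interactions: sum of element-wise products
    #    over all feature pairs, via the (sum^2 - sum of squares) identity
    interactions = 0.5 * (x.sum(0) ** 2 - (x ** 2).sum(0))  # (d,)

    # 3) attention of each feature over the *other* features
    scores = x @ x.t() / x.size(1) ** 0.5        # (n, n)
    scores.fill_diagonal_(float("-inf"))         # exclude self-attention
    adjusted = (F.softmax(scores, dim=1) @ x).flatten()

    return torch.cat([concat, interactions, adjusted])

feats = [torch.randn(64) for _ in range(4)]      # e.g. visual/audio/text/title
print(combined_feature(feats).shape)             # torch.Size([576])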
In 103, each feature extracted from a video and the combined feature corresponding to the video are taken, respectively, as inputs of the classification model, and the model's output for each feature and for the combined feature of the video is obtained.
In this step, each feature extracted from a video in step 102 and the constructed combined feature corresponding to the video are fed, one at a time, into the classification model, so as to obtain the output produced by the model for each feature and for the combined feature of the video.
For example, if feature 1, feature 2 and feature 3 are extracted from video A, and the combined feature constructed from these three features is feature 4, then this step feeds feature 1, feature 2, feature 3 and feature 4, respectively, into the classification model, obtaining the output produced by the model for each input.
The output of the classification model obtained in this step is the set of probabilities that the video corresponding to the input feature belongs to each preset category.
For example, if feature 1 of video A is fed into the classification model, the model's output may be [0.6, 0.3, 0.7, 0.1], indicating that video A belongs to category 1 with probability 0.6, to category 2 with probability 0.3, to category 3 with probability 0.7, and to category 4 with probability 0.1. Similarly, this step can obtain the model's output for features 2, 3 and 4 of video A.
It can be understood that the classification model in this step may be a support vector machine, a regression model or a deep learning model; the present invention does not limit the type of the classification model.
In 104, according to the model's output for each feature and for the combined feature of the video and the class labels corresponding to the video, the loss function corresponding to each feature and to the combined feature of the same video is obtained.
In this step, the loss function corresponding to each feature and to the combined feature of the same video is obtained from the class labels of each video acquired in step 101 and the model's outputs for each feature and for the combined feature of each video obtained in step 103.
For example, if feature 1, feature 2 and feature 3 are extracted from video A and the combined feature constructed from these three features is feature 4, then feature 1, feature 2, feature 3 and feature 4 are each fed into the classification model to obtain its outputs, and the loss functions corresponding to feature 1, feature 2, feature 3 and feature 4 are computed separately.
When obtaining the loss function corresponding to each feature and to the combined feature, this step needs to choose the corresponding calculation formula according to the classification task to which the class labels of the video belong.
Specifically, if the class labels of the video correspond to a single-label classification task, the loss function is calculated as:

loss = -Σ_i y_i · log(s_i)

where i denotes the i-th preset category; y_i denotes the label of category i in the class labels of the video; and s_i denotes the model's output for category i given the input.
If the class labels of the video correspond to a multi-label classification task, the loss function is calculated as:

loss = -Σ_i [ y_i · log(s_i) + (1 - y_i) · log(1 - s_i) ]

where i denotes the i-th preset category; y_i denotes the label of category i in the class labels of the video; and s_i denotes the model's output for category i given the input.
For example, if the class labels of video A are [0, 1, 0, 0], video A belongs to a single-label classification task, so the first formula above is used to calculate the loss functions of the features and the combined feature of video A. If the output obtained for feature 1 of video A is [0.3, 0.8, 0.2, 0.4], the loss function corresponding to feature 1 is: -log(0.8).
If the class labels of video A are [0, 1, 1, 0], video A belongs to a multi-label classification task, so the second formula above is used to calculate the loss functions of the features and the combined feature of video A. If the output obtained for feature 1 of video A is [0.3, 0.8, 0.2, 0.4], the loss function corresponding to feature 1 is: [-log(0.7)] + [-log(0.8)] + [-log(0.2)] + [-log(0.6)].
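A sketch of both formulas that reproduces the two worked examples follows (plain Python, written for clarity rather than numerical stability):

import math

def single_label_loss(y, s):
    # -sum_i y_i * log(s_i): only the labeled category contributes.
    return -sum(yi * math.log(si) for yi, si in zip(y, s))

def multi_label_loss(y, s):
    # -sum_i [ y_i*log(s_i) + (1-y_i)*log(1-s_i) ]: every category contributes.
    return -sum(yi * math.log(si) + (1 - yi) * math.log(1 - si)
                for yi, si in zip(y, s))

s = [0.3, 0.8, 0.2, 0.4]
print(single_label_loss([0, 1, 0, 0], s))  # -log(0.8), about 0.223
print(multi_label_loss([0, 1, 1, 0], s))   # -log(0.7)-log(0.8)-log(0.2)-log(0.6), about 2.700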
In 105, the loss function of the classification model is determined according to the loss functions corresponding to the features and to the combined feature of the same video, and the parameters of the classification model are adjusted with the model's loss function to obtain the video classification model.
In this step, the loss function of the classification model is determined from the loss functions of the features of the same video and the loss function of the video's combined feature obtained in step 104, and the parameters of the classification model are then adjusted with the determined loss function of the classification model to obtain the video classification model.
That is, besides capturing the loss contributed by the combined feature of the video, this step also captures the loss contributed by each feature extracted from the video, avoiding the problem that the obtained combined feature loses the information of the original features, so that the classification model can better take into account the contributions of all features of the video.
Specifically, when determining the loss function of the classification model from the loss functions corresponding to the features and to the combined feature of the same video, this step may take the sum of the loss functions of the features and of the combined feature of the same video as the loss function of the classification model.
For example, if the loss function obtained for feature 1 of video A is loss1, the loss function obtained for feature 2 is loss2, the loss function obtained for feature 3 is loss3, and the loss function obtained for feature 4 is loss4, then the loss function of the classification model is: loss1 + loss2 + loss3 + loss4.
It can be understood that when training the classification model, the training objective is to minimize the loss function of the classification model. Minimizing the loss function may include: the loss functions obtained within a preset number of iterations being equal, or the differences between the loss functions obtained within a preset number of iterations being less than or equal to a preset threshold, and so on.
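A sketch of one training step under this objective follows, assuming a shared classifier head applied to each per-feature input as well as to the combined feature; the projection of every input to a common size and the choice of optimizer are assumptions:

import torch
import torch.nn as nn

num_categories, d = 4, 64
classifier = nn.Linear(d, num_categories)       # stand-in classification model
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss(reduction="sum")     # multi-label loss from above

def training_step(per_feature_inputs, combined, labels):
    # Total loss = sum of the losses of every feature plus the combined feature.
    total = sum(bce(classifier(x), labels)
                for x in per_feature_inputs + [combined])
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()

feats = [torch.randn(d) for _ in range(3)]      # features 1-3 of one video
combined = torch.randn(d)                       # feature 4: the combined feature
labels = torch.tensor([0., 1., 1., 0.])
print(training_step(feats, combined, labels))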
After the classification model has been trained, the video classification model is obtained. With the video classification model, the set of probabilities that a video belongs to each preset category can be obtained from the combined feature of the input video to be classified, and the categories to which the video to be classified belongs can then be determined from the obtained probability set.
Fig. 2 is a flowchart of a video classification method according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
In 201, a video to be classified is obtained.
In this step, the video to be classified is obtained. It can be understood that the video to be classified may be a video shot in real time by the user through a terminal, a video stored locally on the terminal and chosen by the user, or a video selected by the user from the Internet through the terminal. After obtaining the video to be classified, this step may send it to the server for subsequent processing.
In 202, features of preset types are extracted from the video to be classified, and the combined feature corresponding to the video to be classified is constructed based on the extracted features.
In this step, features of preset types are extracted from the video to be classified obtained in step 201, and the combined feature corresponding to the video to be classified is constructed based on the extracted features. The extraction and construction may be performed by the server from the obtained video to be classified.
The features of preset types extracted from the video to be classified in this step are at least two of: the visual feature of the video, the audio feature of the video, the text feature corresponding to the audio of the video, the features of subtitles and column titles in the video, the face feature of the video, the feature of the video cover image, the feature of the labels attached to the video, the feature of the video title, and the knowledge-base feature of the video. The manner of extracting each feature has been described above and is not repeated here.
Specifically, when constructing the combined feature corresponding to the video to be classified based on the extracted features, this step may proceed as follows: concatenate the features extracted from the video to be classified to obtain a concatenation result; obtain the combination relationships among the extracted features; adjust each extracted feature according to its attention over the other features to obtain adjusted features; and concatenate the concatenation result, the combination relationships among the features, and the adjusted features, the result serving as the combined feature of the video to be classified.
In 203, the combined feature corresponding to the video to be classified is taken as the input of the video classification model, and the categories of the video to be classified are determined according to the output of the video classification model.
In this step, the combined feature of the video to be classified constructed in step 202 is fed into the pre-trained video classification model, and the categories of the video to be classified are determined from the output of the video classification model. The output of the video classification model is the set of probabilities that the video to be classified belongs to each preset category.
Specifically, when determining the categories of the video to be classified from the output of the video classification model, this step may proceed as follows: find, in the output of the video classification model, the probability values greater than a preset threshold; and take the categories corresponding to the found probability values as the categories of the video to be classified.
For example, if the video to be classified is video B and its constructed combined feature is feature 5, then feature 5 is fed into the pre-trained video classification model. If the obtained output is [0.3, 0.7, 0.8, 0.2] and the preset threshold is 0.5, the categories corresponding to the probability values 0.7 and 0.8 are determined as the categories of video B.
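A sketch of this thresholding step that reproduces the video B example follows; the category names are illustrative:

def categories_above_threshold(probs, categories, threshold=0.5):
    # Return every category whose predicted probability exceeds the threshold.
    return [c for c, p in zip(categories, probs) if p > threshold]

probs = [0.3, 0.7, 0.8, 0.2]                    # model output for video B
cats = ["category 1", "category 2", "category 3", "category 4"]
print(categories_above_threshold(probs, cats))  # ['category 2', 'category 3']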
Fig. 3 is a structural diagram of an apparatus for establishing a video classification model according to an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a first acquisition unit 31, a first construction unit 32, a first processing unit 33, a first training unit 34 and a second training unit 35.
The first acquisition unit 31 is used to obtain training data containing videos and the class labels corresponding to each video.
The first acquisition unit 31 obtains videos and the class labels corresponding to each video as training data, which is used to train a video classification model. The first acquisition unit 31 may obtain the training data through a terminal or a server.
Specifically, the class labels of each video obtained by the first acquisition unit 31 form a label set indicating, for each preset category, whether the video belongs to it. The label of each category to which the video belongs may be manually marked as "1", and the labels of the other categories as "0".
The first construction unit 32 is used to extract features of preset types from each video and to construct, based on the extracted features, the combined feature corresponding to each video.
The first construction unit 32 extracts features of preset types from each video obtained by the first acquisition unit 31, and then constructs the combined feature corresponding to each video based on the extracted features. The extraction and construction may be performed by the server from the obtained videos.
Specifically, the features of preset types extracted by the first construction unit 32 are at least two of: the visual feature of the video, the audio feature of the video, the text feature corresponding to the audio of the video, the features of subtitles and column titles in the video, the face feature of the video, the feature of the video cover image, the feature of the labels attached to the video, the feature of the video title, and the knowledge-base feature of the video.
It can be understood that the first construction unit 32 may use the prior art when extracting the features of preset types from a video. The present invention does not limit the manner in which the features of preset types are extracted from a video.
After the features of preset types have been extracted from each video, the first construction unit 32 fuses the features extracted from each video so as to construct the combined feature corresponding to each video.
Specifically, when constructing the combined feature corresponding to each video based on the extracted features, the first construction unit 32 may proceed as follows: concatenate the features extracted from the video to obtain a concatenation result; obtain the combination relationships among the extracted features; adjust each extracted feature according to its attention over the other features to obtain adjusted features; and concatenate the concatenation result, the combination relationships among the features, and the adjusted features, the result serving as the combined feature of the video.
The prior art usually concatenates multiple features directly to obtain the combined feature, but this may prevent the resulting combined feature from fully capturing the information carried by each feature, so that the combination effect among the features is not obvious enough. With the combination manner provided by the present invention, the information carried by each feature and among the features can be fully captured, yielding a combined feature in which the combination effect among the features is more obvious.
It can be understood that the first construction unit 32 may also directly take at least one of the above concatenation result, the combination relationships among the features, and the adjusted features as the combined feature of the video, or take the concatenation of any two of the concatenation result, the combination relationships among the features, and the adjusted features as the combined feature of the video.
The first processing unit 33 is used to take each feature extracted from a video and the combined feature corresponding to the video, respectively, as inputs of the classification model, and to obtain the model's output for each feature and for the combined feature of the video.
The first processing unit 33 feeds each feature extracted from a video by the first construction unit 32 and the constructed combined feature corresponding to the video, one at a time, into the classification model, so as to obtain the output produced by the model for each feature and for the combined feature of the video.
The output of the classification model obtained by the first processing unit 33 is the set of probabilities that the video corresponding to the input feature belongs to each preset category.
It can be understood that the classification model in the first processing unit 33 may be a support vector machine, a regression model or a deep learning model; the present invention does not limit the type of the classification model.
The first training unit 34 is used to obtain, from the model's output for each feature and for the combined feature of the video and the class labels corresponding to the video, the loss function corresponding to each feature and to the combined feature of the same video.
The first training unit 34 obtains the loss function corresponding to each feature and to the combined feature of the same video from the class labels of each video acquired by the first acquisition unit 31 and the model's outputs for each feature and for the combined feature of each video obtained by the first processing unit 33.
When obtaining the loss function corresponding to each feature and to the combined feature, the first training unit 34 needs to choose the corresponding calculation formula according to the classification task to which the class labels of the video belong.
Specifically, if the class labels of the video correspond to a single-label classification task, the loss function is calculated as:

loss = -Σ_i y_i · log(s_i)

where i denotes the i-th preset category; y_i denotes the label of category i in the class labels of the video; and s_i denotes the model's output for category i given the input.
If the class labels of the video correspond to a multi-label classification task, the loss function is calculated as:

loss = -Σ_i [ y_i · log(s_i) + (1 - y_i) · log(1 - s_i) ]

where i denotes the i-th preset category; y_i denotes the label of category i in the class labels of the video; and s_i denotes the model's output for category i given the input.
The second training unit 35 is used to determine the loss function of the classification model from the loss functions corresponding to the features and to the combined feature of the same video, and to adjust the parameters of the classification model with the model's loss function to obtain the video classification model.
The second training unit 35 determines the loss function of the classification model from the loss functions of the features of the same video and the loss function of the video's combined feature obtained by the first training unit 34, and then adjusts the parameters of the classification model with the determined loss function of the classification model to obtain the video classification model.
That is, besides capturing the loss contributed by the combined feature of the video, the second training unit 35 also captures the loss contributed by each feature extracted from the video, avoiding the problem that the obtained combined feature loses the information of the original features, so that the classification model can better take into account the contributions of all features of the video.
Specifically, when determining the loss function of the classification model from the loss functions corresponding to the features and to the combined feature of the same video, the second training unit 35 may take the sum of the loss functions of the features and of the combined feature of the same video as the loss function of the classification model.
It can be understood that when training the classification model, the training objective of the second training unit 35 is to minimize the loss function of the classification model. Minimizing the loss function may include: the loss functions obtained within a preset number of iterations being equal, or the differences between the loss functions obtained within a preset number of iterations being less than or equal to a preset threshold, and so on.
After the classification model has been trained, the second training unit 35 obtains the video classification model. With the video classification model, the set of probabilities that a video belongs to each preset category can be obtained from the combined feature of the input video to be classified, and the categories to which the video to be classified belongs can then be determined from the obtained probability set.
Fig. 4 is a structural diagram of a video classification apparatus according to an embodiment of the present invention. As shown in Fig. 4, the apparatus includes: a second acquisition unit 41, a second construction unit 42 and a second processing unit 43.
The second acquisition unit 41 is used to obtain a video to be classified.
The second acquisition unit 41 obtains the video to be classified. It can be understood that the video to be classified may be a video shot in real time by the user through a terminal, a video stored locally on the terminal and chosen by the user, or a video selected by the user from the Internet through the terminal. After obtaining the video to be classified, the second acquisition unit 41 sends it to the server for subsequent processing.
The second construction unit 42 is used to extract features of preset types from the video to be classified and to construct, based on the extracted features, the combined feature corresponding to the video to be classified.
The second construction unit 42 extracts features of preset types from the video to be classified obtained by the second acquisition unit 41, and constructs the combined feature corresponding to the video to be classified based on the extracted features. The extraction and construction may be performed by the server from the obtained video to be classified.
The features of preset types extracted from the video to be classified by the second construction unit 42 are at least two of: the visual feature of the video, the audio feature of the video, the text feature corresponding to the audio of the video, the features of subtitles and column titles in the video, the face feature of the video, the feature of the video cover image, the feature of the labels attached to the video, the feature of the video title, and the knowledge-base feature of the video. The manner of extracting each feature has been described above and is not repeated here.
Specifically, when constructing the combined feature corresponding to the video to be classified based on the extracted features, the second construction unit 42 may proceed as follows: concatenate the features extracted from the video to be classified to obtain a concatenation result; obtain the combination relationships among the extracted features; adjust each extracted feature according to its attention over the other features to obtain adjusted features; and concatenate the concatenation result, the combination relationships among the features, and the adjusted features, the result serving as the combined feature of the video to be classified.
The second processing unit 43 is used to take the combined feature corresponding to the video to be classified as the input of the video classification model, and to determine the categories of the video to be classified according to the output of the video classification model.
The second processing unit 43 feeds the combined feature of the video to be classified constructed by the second construction unit 42 into the pre-trained video classification model, and determines the categories of the video to be classified from the output of the video classification model. The output of the video classification model is the set of probabilities that the video to be classified belongs to each preset category.
Specifically, when determining the categories of the video to be classified from the output of the video classification model, the second processing unit 43 may proceed as follows: find, in the output of the video classification model, the probability values greater than a preset threshold; and take the categories corresponding to the found probability values as the categories of the video to be classified.
As shown in Fig. 5, the computer system/server 012 is embodied in the form of a general-purpose computing device. The components of the computer system/server 012 may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 connecting the different system components (including the system memory 028 and the processing unit 016).
The bus 018 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 012 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer system/server 012, including volatile and non-volatile media, and removable and non-removable media.
The system memory 028 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 030 and/or a cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 034 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 5, commonly referred to as a "hard disk drive"). Although not shown in Fig. 5, a disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media), may be provided. In these cases, each drive may be connected to the bus 018 through one or more data media interfaces. The memory 028 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
Program/utility 040 with one group of (at least one) program module 042, can store in such as memory In 028, such program module 042 includes --- but being not limited to --- operating system, one or more application program, other It may include the realization of network environment in program module and program data, each of these examples or certain combination.Journey Sequence module 042 usually executes function and/or method in embodiment described in the invention.
Computer system/server 012 can also with one or more external equipments 014 (such as keyboard, sensing equipment, Display 024 etc.) communication, in the present invention, computer system/server 012 is communicated with outside radar equipment, can also be with One or more enable a user to the equipment interacted with the computer system/server 012 communication, and/or with make the meter Any equipment (such as network interface card, the modulation that calculation machine systems/servers 012 can be communicated with one or more of the other calculating equipment Demodulator etc.) communication.This communication can be carried out by input/output (I/O) interface 022.Also, computer system/clothes Being engaged in device 012 can also be by network adapter 020 and one or more network (such as local area network (LAN), wide area network (WAN) And/or public network, such as internet) communication.As shown, network adapter 020 by bus 018 and computer system/ Other modules of server 012 communicate.It should be understood that although not shown in the drawings, computer system/server 012 can be combined Using other hardware and/or software module, including but not limited to: microcode, device driver, redundant processing unit, external magnetic Dish driving array, RAID system, tape drive and data backup storage system etc..
Processing unit 016 by the program that is stored in system storage 028 of operation, thereby executing various function application with And data processing, such as realize method flow provided by the embodiment of the present invention.
With the development of technology over time, the meaning of "medium" has become increasingly broad, and the propagation path of a computer program is no longer limited to tangible media; it may, for example, also be downloaded directly from a network. Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wireline, optical cable, RF, and the like, or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
With the technical solution provided by the present invention, the loss function of the classification model is derived from the loss functions corresponding to the individual features of a video together with the loss function corresponding to the video's combined feature, and the video classification model is then trained with this loss function. The contribution of every feature in the video is thereby taken into account during training, avoiding the loss of the information carried by any single feature; and because richer information is used to train the classification model, the classification accuracy of the resulting video classification model is improved.
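For illustration (assuming one classification head per individual feature plus one head for the combined feature, and cross-entropy as the per-head loss; none of these names appear in the patent), the joint loss described above could be sketched in PyTorch as:

    import torch.nn.functional as F

    def total_loss(per_feature_logits, combined_logits, target):
        # loss of the combined feature plus the loss of every individual feature
        loss = F.cross_entropy(combined_logits, target)
        for logits in per_feature_logits:
            loss = loss + F.cross_entropy(logits, target)
        return loss  # training minimizes this joint objective

Minimizing this sum lets gradients flow through every feature branch, which is why no single feature's information is lost during training.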
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division into units is only a logical functional division, and other divisions are possible in actual implementation.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing merely describes preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (20)

1. A method for establishing a video classification model, characterized in that the method comprises:
acquiring training data, the training data including videos and the category annotation result corresponding to each video;
extracting features of preset types from each video, and constructing the combined feature corresponding to each video based on the extracted features;
taking the features extracted from a video and the combined feature corresponding to the video respectively as input to a classification model, and obtaining the output of the classification model for each feature and for the combined feature of the video;
obtaining, according to the output of the classification model for each feature and for the combined feature of the video and the category annotation result corresponding to the video, the loss function corresponding to each feature and to the combined feature of the same video;
determining the loss function of the classification model according to the loss functions corresponding to the features and the combined feature of the same video, and adjusting the parameters of the classification model using the loss function of the classification model to obtain the video classification model.
2. The method according to claim 1, characterized in that the features of preset types are at least two of: the visual feature of the video, the audio feature of the video, the textual feature corresponding to the audio in the video, the features of the subtitles and directory titles in the video, the face feature in the video, the feature of the video cover image, the feature of the tags attached to the video, the feature of the video title, and the knowledge base feature of the video.
3. The method according to claim 1, characterized in that constructing the combined feature corresponding to each video based on the extracted features comprises:
concatenating the features extracted from the video to obtain a concatenation result;
obtaining the combination relations between the features extracted from the video;
adjusting the extracted features according to the attention each feature in the video pays to the other features, to obtain the adjusted features;
concatenating the concatenation result, the combination relations between the features, and the adjusted features, and taking the result as the combined feature corresponding to the video.
4. The method according to claim 1, characterized in that obtaining the loss function corresponding to each feature and to the combined feature of the same video comprises:
determining the classification task to which each video belongs according to the category annotation result corresponding to the video;
determining the calculation formula of the loss function according to the classification task to which each video belongs;
obtaining, according to the determined calculation formula of the loss function, the loss function corresponding to each feature and to the combined feature of the same video.
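By way of illustration only, and not as claim language: one common way to make the loss formula depend on the annotated classification task is to use softmax cross-entropy when each video carries exactly one category and sigmoid cross-entropy when several categories may apply; the claims do not name concrete formulas, so this mapping is an assumption:

    import torch.nn.functional as F

    def loss_for_task(logits, target, task):
        # pick the loss calculation formula from the classification task type
        if task == "single_label":
            return F.cross_entropy(logits, target)               # softmax cross-entropy
        if task == "multi_label":
            return F.binary_cross_entropy_with_logits(logits, target)
        raise ValueError("unknown classification task: " + task)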
5. The method according to claim 1, characterized in that determining the loss function of the classification model according to the loss functions corresponding to the features and the combined feature of the same video comprises:
taking the sum of the loss functions of the features of the same video and the loss function of the combined feature as the loss function of the classification model.
6. The method according to claim 1, characterized in that:
the training objective of the classification model is to minimize the loss function of the classification model.
7. A method of video classification, characterized in that the method comprises:
acquiring a to-be-classified video;
extracting features of preset types from the to-be-classified video, and constructing the combined feature corresponding to the to-be-classified video based on the extracted features;
taking the combined feature corresponding to the to-be-classified video as input to a video classification model, and determining the category of the to-be-classified video according to the output of the video classification model;
wherein the video classification model is constructed in advance according to the method of any one of claims 1 to 6.
8. The method according to claim 7, characterized in that constructing the combined feature corresponding to the to-be-classified video based on the extracted features comprises:
concatenating the features extracted from the to-be-classified video to obtain a concatenation result;
obtaining the combination relations between the features extracted from the to-be-classified video;
adjusting the extracted features according to the attention each feature extracted from the to-be-classified video pays to the other features, to obtain the adjusted features;
concatenating the concatenation result, the combination relations between the features, and the adjusted features, and taking the result as the combined feature corresponding to the to-be-classified video.
9. The method according to claim 7, characterized in that determining the category of the to-be-classified video according to the output of the video classification model comprises:
determining the probability values in the output of the video classification model that exceed a preset threshold;
taking the categories corresponding to the determined probability values as the categories of the to-be-classified video.
10. A device for establishing a video classification model, characterized in that the device comprises:
a first acquisition unit, configured to acquire training data, the training data including videos and the category annotation result corresponding to each video;
a first construction unit, configured to extract features of preset types from each video, and to construct the combined feature corresponding to each video based on the extracted features;
a first processing unit, configured to take the features extracted from a video and the combined feature corresponding to the video respectively as input to a classification model, and to obtain the output of the classification model for each feature and for the combined feature of the video;
a first training unit, configured to obtain, according to the output of the classification model for each feature and for the combined feature of the video and the category annotation result corresponding to the video, the loss function corresponding to each feature and to the combined feature of the same video;
a second training unit, configured to determine the loss function of the classification model according to the loss functions corresponding to the features and the combined feature of the same video, and to adjust the parameters of the classification model using the loss function of the classification model to obtain the video classification model.
11. The device according to claim 10, characterized in that the features of preset types are at least two of: the visual feature of the video, the audio feature of the video, the textual feature corresponding to the audio in the video, the features of the subtitles and directory titles in the video, the face feature in the video, the feature of the video cover image, the feature of the tags attached to the video, the feature of the video title, and the knowledge base feature of the video.
12. The device according to claim 10, characterized in that, when constructing the combined feature corresponding to each video based on the extracted features, the first construction unit specifically:
concatenates the features extracted from the video to obtain a concatenation result;
obtains the combination relations between the features extracted from the video;
adjusts the extracted features according to the attention each feature in the video pays to the other features, to obtain the adjusted features;
concatenates the concatenation result, the combination relations between the features, and the adjusted features, and takes the result as the combined feature corresponding to the video.
13. The device according to claim 10, characterized in that, when obtaining the loss function corresponding to each feature and to the combined feature of the same video, the first training unit specifically:
determines the classification task to which each video belongs according to the category annotation result corresponding to the video;
determines the calculation formula of the loss function according to the classification task to which each video belongs;
obtains, according to the determined calculation formula of the loss function, the loss function corresponding to each feature and to the combined feature of the same video.
14. The device according to claim 10, characterized in that, when determining the loss function of the classification model according to the loss functions corresponding to the features and the combined feature of the same video, the second training unit specifically:
takes the sum of the loss functions of the features of the same video and the loss function of the combined feature as the loss function of the classification model.
15. The device according to claim 10, characterized in that the second training unit further:
takes minimizing the loss function of the classification model as the training objective of the classification model.
16. A device of video classification, characterized in that the device comprises:
a second acquisition unit, configured to acquire a to-be-classified video;
a second construction unit, configured to extract features of preset types from the to-be-classified video, and to construct the combined feature corresponding to the to-be-classified video based on the extracted features;
a second processing unit, configured to take the combined feature corresponding to the to-be-classified video as input to a video classification model, and to determine the category of the to-be-classified video according to the output of the video classification model;
wherein the video classification model is constructed in advance by the device of any one of claims 10 to 15.
17. The device according to claim 16, characterized in that, when constructing the combined feature corresponding to the to-be-classified video based on the extracted features, the second construction unit specifically:
concatenates the features extracted from the to-be-classified video to obtain a concatenation result;
obtains the combination relations between the features extracted from the to-be-classified video;
adjusts the extracted features according to the attention each feature extracted from the to-be-classified video pays to the other features, to obtain the adjusted features;
concatenates the concatenation result, the combination relations between the features, and the adjusted features, and takes the result as the combined feature corresponding to the to-be-classified video.
18. The device according to claim 16, characterized in that, when determining the category of the to-be-classified video according to the output of the video classification model, the second processing unit specifically:
determines the probability values in the output of the video classification model that exceed a preset threshold;
takes the categories corresponding to the determined probability values as the categories of the to-be-classified video.
19. A computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1 to 9.
20. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN201910461500.8A 2019-05-30 2019-05-30 Method and device for establishing video classification model and video classification Active CN110232340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910461500.8A CN110232340B (en) 2019-05-30 2019-05-30 Method and device for establishing video classification model and video classification

Publications (2)

Publication Number Publication Date
CN110232340A true CN110232340A (en) 2019-09-13
CN110232340B CN110232340B (en) 2021-01-22

Family

ID=67858724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910461500.8A Active CN110232340B (en) 2019-05-30 2019-05-30 Method and device for establishing video classification model and video classification

Country Status (1)

Country Link
CN (1) CN110232340B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8452778B1 (en) * 2009-11-19 2013-05-28 Google Inc. Training of adapted classifiers for video categorization
CN105975594A (en) * 2016-05-09 2016-09-28 清华大学 Sentiment classification method and device based on combined feature vector and SVM[perf] (Support Vector Machine)
CN107169455A (en) * 2017-05-16 2017-09-15 中山大学 Face character recognition methods based on depth local feature
CN109190482A (en) * 2018-08-06 2019-01-11 北京奇艺世纪科技有限公司 Multi-tag video classification methods and system, systematic training method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lin Xianming (林贤明) et al., "融合多特征的行为识别方法研究" [Research on action recognition methods fusing multiple features], 《计算机工程与应用》 [Computer Engineering and Applications] *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717067A (en) * 2019-12-16 2020-01-21 北京海天瑞声科技股份有限公司 Method and device for processing audio clustering in video
CN112016586A (en) * 2020-07-08 2020-12-01 武汉智筑完美家居科技有限公司 Picture classification method and device
CN112165634A (en) * 2020-09-29 2021-01-01 北京百度网讯科技有限公司 Method for establishing audio classification model and method and device for automatically converting video
CN112165634B (en) * 2020-09-29 2022-09-16 北京百度网讯科技有限公司 Method for establishing audio classification model and method and device for automatically converting video
CN112418215A (en) * 2020-11-17 2021-02-26 峰米(北京)科技有限公司 Video classification identification method and device, storage medium and equipment
CN113032627A (en) * 2021-03-25 2021-06-25 北京小米移动软件有限公司 Video classification method and device, storage medium and terminal equipment
CN113378784A (en) * 2021-07-01 2021-09-10 北京百度网讯科技有限公司 Training method of video label recommendation model and method for determining video label
CN113449700A (en) * 2021-08-30 2021-09-28 腾讯科技(深圳)有限公司 Training of video classification model, video classification method, device, equipment and medium
CN113449700B (en) * 2021-08-30 2021-11-23 腾讯科技(深圳)有限公司 Training of video classification model, video classification method, device, equipment and medium
CN114064973A (en) * 2022-01-11 2022-02-18 人民网科技(北京)有限公司 Video news classification model establishing method, classification method, device and equipment
CN114064973B (en) * 2022-01-11 2022-05-03 人民网科技(北京)有限公司 Video news classification model establishing method, classification method, device and equipment

Also Published As

Publication number Publication date
CN110232340B (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN110232340A (en) Establish the method, apparatus of video classification model and visual classification
US20210142119A1 (en) Image classification using a mask image and neural networks
CN109117777B (en) Method and device for generating information
CN107492379B (en) Voiceprint creating and registering method and device
CN108229322B (en) Video-based face recognition method and device, electronic equipment and storage medium
CN109145680A (en) A kind of method, apparatus, equipment and computer storage medium obtaining obstacle information
CN107507612A (en) A kind of method for recognizing sound-groove and device
CN107180628A (en) Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model
CN107077201A (en) The eye gaze that spoken word in being interacted for multimodal session understands
CN110245259A (en) The video of knowledge based map labels method and device, computer-readable medium
CN108363556A (en) A kind of method and system based on voice Yu augmented reality environmental interaction
WO2019021088A1 (en) Navigating video scenes using cognitive insights
CN109582822A (en) A kind of music recommended method and device based on user speech
CN113255694A (en) Training image feature extraction model and method and device for extracting image features
CN110163257A (en) Method, apparatus, equipment and the computer storage medium of drawing-out structure information
CN109599095A (en) A kind of mask method of voice data, device, equipment and computer storage medium
CN109543560A (en) Dividing method, device, equipment and the computer storage medium of personage in a kind of video
CN107908641A (en) A kind of method and system for obtaining picture labeled data
CN109446907A (en) A kind of method, apparatus of Video chat, equipment and computer storage medium
CN110245580A (en) A kind of method, apparatus of detection image, equipment and computer storage medium
CN112182176A (en) Intelligent question answering method, device, equipment and readable storage medium
CN115658955B (en) Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN108154103A (en) Detect method, apparatus, equipment and the computer storage media of promotion message conspicuousness
US11373057B2 (en) Artificial intelligence driven image retrieval
CN114495916B (en) Method, device, equipment and storage medium for determining insertion time point of background music

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant