CN110516749A - Model training method, video processing method, apparatus, medium and computing device - Google Patents

Model training method, video processing method, apparatus, medium and computing device

Info

Publication number
CN110516749A
CN110516749A (application CN201910811249.3A)
Authority
CN
China
Prior art keywords
video
model
label
video clip
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910811249.3A
Other languages
Chinese (zh)
Inventor
孙丽坤
许盛辉
刘彦东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Media Technology Beijing Co Ltd
Original Assignee
Netease Media Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Media Technology Beijing Co Ltd
Priority to CN201910811249.3A
Publication of CN110516749A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention provide a model training method, comprising: obtaining multiple video clips; adding a label to each of the multiple video clips, where the label characterizes the amount of effective information contained in the video clip; establishing a neural network model that includes a time dimension; and training the neural network model using the multiple labeled video clips to obtain an optimized model, where the optimized model is used to extract, from a video file, the target video clip containing the largest amount of effective information. Embodiments of the present invention also provide a video processing method, a model training apparatus, a video processing apparatus, a medium, and a computing device.

Description

Model training method, video processing method, apparatus, medium and computing device
Technical field
Embodiments of the present invention relate to the field of computer technology and, more specifically, to a model training method, a video processing method, an apparatus, a medium, and a computing device.
Background
This section is intended to provide a background or context for the embodiments of the invention set forth in the claims. The description herein is not admitted to be prior art merely by its inclusion in this section.
In the prior art, telling highlight clips apart from non-highlight clips in a video file is usually analyzed and solved as a binary classification problem: a two-class model is trained with a cross-entropy loss. This training process often ignores the temporal characteristics of the video clips, so the resulting model has limited ability to distinguish between video clips of different highlight quality.
Summary of the invention
In this context, embodiments of the present invention are intended to provide a model training method, a video processing method, an apparatus, a medium, and a computing device.
In a first aspect of embodiments of the present invention, a model training method is provided, comprising: obtaining multiple video clips and adding a label to each of them, where the label characterizes the amount of effective information contained in the video clip. A neural network model including a time dimension is then established and trained using the multiple labeled video clips to obtain an optimized model. The optimized model is used to extract, from a video file, the target video clip containing the largest amount of effective information.
In one embodiment of the invention, the labels include: a first label, a second label, and a third label, where the amount of effective information characterized by the first label is greater than that characterized by the second label, which in turn is greater than that characterized by the third label.
In another embodiment of the invention, training the neural network model using the multiple labeled video clips comprises: constructing multiple sample pairs based on the multiple labeled video clips, each sample pair containing two video clips with different labels, and then training the neural network model using the multiple sample pairs to obtain the optimized model.
In yet another embodiment of the invention, training the neural network model using the multiple sample pairs to obtain the optimized model comprises: for any sample pair consisting of a first video clip and a second video clip, where the effective information characterized by the label of the first clip exceeds that characterized by the label of the second clip, inputting the sample pair into the neural network model to obtain a first score for the first video clip and a second score for the second video clip; subtracting the second score from the first score to obtain a first value; and determining a loss value from the first value using a loss function, where the loss function is monotonically decreasing. When the loss value is less than or equal to a predetermined threshold, the currently trained neural network model is determined to be the optimized model; when the loss value is greater than the predetermined threshold, the parameters of the currently trained model are optimized further and the above operations are repeated until the optimized model is obtained.
In a further embodiment of the invention, inputting the two video clips of a sample pair into the neural network model comprises: for each video clip, extracting from the clip an image sequence containing a predetermined number of images, and then inputting the image sequence into the neural network model.
In a further embodiment of the invention, establishing a neural network model including a time dimension comprises: establishing a convolutional neural network model and configuring it with one or more convolution kernels that include a time dimension.
In a further embodiment of the invention, obtaining the multiple video clips comprises: obtaining multiple video samples and Graphics Interchange Format (GIF) files associated with the multiple video samples; then, for any video sample among them, determining the start position and end position of the associated GIF file within that video sample, and extracting from the video sample the video clip spanning the start position to the end position.
In a further embodiment of the invention, adding labels to the multiple video clips comprises: adding the first label to any video clip extracted from a video sample and spanning the start position to the end position.
In a further embodiment of the invention, obtaining the multiple video clips comprises: obtaining multiple video samples and then applying video segmentation to the multiple video samples to obtain multiple video clips.
In a second aspect of embodiments of the present invention, a video processing method is provided, comprising: obtaining a video file, then processing the video file with an optimized model so as to extract, from the video file, the target video clip containing the largest amount of effective information, where the optimized model is trained with the model training method described in any of the above embodiments.
In one embodiment of the invention, processing the video file with the optimized model to extract the target video clip containing the largest amount of effective information comprises: segmenting the video file to obtain multiple candidate video clips; inputting each candidate video clip into the optimized model to obtain its score; and taking the highest-scoring candidate video clip as the target video clip.
In another embodiment of the invention, inputting a candidate video clip into the optimized model comprises: extracting from the candidate clip a test image sequence containing a predetermined number of images, and inputting the test image sequence into the optimized model.
In yet another embodiment of the invention, the method further comprises: producing a GIF file based on the target video clip, and displaying the GIF file as the cover of the video file.
In a third aspect of embodiments of the present invention, a model training apparatus is provided, comprising: a first acquisition module, a labeling module, a modeling module, and a training module. The first acquisition module is used to obtain multiple video clips. The labeling module is used to add a label to each of the multiple video clips, where the label characterizes the amount of effective information contained in the video clip. The modeling module is used to establish a neural network model including a time dimension. The training module is used to train the neural network model using the multiple labeled video clips to obtain an optimized model, where the optimized model is used to extract, from a video file, the target video clip containing the largest amount of effective information.
In one embodiment of the invention, the labels include: a first label, a second label, and a third label, where the amount of effective information characterized by the first label is greater than that characterized by the second label, which in turn is greater than that characterized by the third label.
In another embodiment of the invention, the training module includes a pair-construction submodule and a pair-training submodule. The pair-construction submodule is used to construct multiple sample pairs based on the multiple labeled video clips, each sample pair containing two video clips with different labels. The pair-training submodule is used to train the neural network model using the multiple sample pairs to obtain the optimized model.
In yet another embodiment of the invention, the pair-training submodule is specifically configured to: for any sample pair consisting of a first video clip and a second video clip, where the effective information characterized by the label of the first clip exceeds that of the second clip, input the sample pair into the neural network model to obtain a first score for the first video clip and a second score for the second video clip; subtract the second score from the first score to obtain a first value; and determine a loss value from the first value using a monotonically decreasing loss function. When the loss value is less than or equal to a predetermined threshold, the neural network model is determined to be the optimized model; when the loss value is greater than the predetermined threshold, the parameters of the neural network model are optimized and the above operations are repeated until the optimized model is obtained.
In a further embodiment of the invention, the pair-training submodule may input a sample pair into the neural network model as follows: for each video clip of the sample pair, it extracts from the clip an image sequence containing a predetermined number of images and inputs the image sequence into the neural network model.
In a further embodiment of the invention, the modeling module is specifically configured to establish a convolutional neural network model and to configure it with one or more convolution kernels including a time dimension.
In a further embodiment of the invention, the first acquisition module includes a sample-acquisition submodule and a clip-extraction submodule. The sample-acquisition submodule is used to obtain multiple video samples and GIF files associated with the multiple video samples. The clip-extraction submodule is used, for any video sample among them, to determine the start and end positions of the associated GIF file within that video sample, and to extract from the video sample the video clip spanning the start position to the end position.
In a further embodiment of the invention, the labeling module is specifically configured to add the first label to any video clip extracted from a video sample and spanning the start position to the end position.
In a further embodiment of the invention, the first acquisition module includes a sample-acquisition submodule and a segmentation submodule. The sample-acquisition submodule is used to obtain multiple video samples. The segmentation submodule is used to apply video segmentation to the multiple video samples to obtain multiple video clips.
In a fourth aspect of embodiments of the present invention, a video processing apparatus is provided, comprising: a second acquisition module and an extraction module. The second acquisition module is used to obtain a video file. The extraction module is used to process the video file with an optimized model so as to extract, from the video file, the target video clip containing the largest amount of effective information, where the optimized model is trained with the model training method described in any of the above embodiments.
In one embodiment of the invention, the extraction module includes: a video-segmentation submodule, an evaluation submodule, and a determination submodule. The video-segmentation submodule is used to segment the video file to obtain multiple candidate video clips. The evaluation submodule is used, for each candidate video clip, to input the candidate clip into the optimized model and obtain its score. The determination submodule is used to take the highest-scoring candidate video clip as the target video clip.
In another embodiment of the invention, the evaluation submodule is specifically configured to extract, from a candidate video clip, a test image sequence containing the predetermined number of images and to input that sequence into the optimized model.
In yet another embodiment of the invention, the video processing apparatus further includes a display module, used to produce a GIF file based on the target video clip and to display that GIF file as the cover of the video file.
In a fifth aspect of embodiments of the present invention, a medium is provided, storing computer-executable instructions which, when executed by a processor, implement the method described in any of the above embodiments.
In a sixth aspect of embodiments of the present invention, a computing device is provided, comprising: a memory, a processor, and executable instructions stored in the memory and runnable on the processor, where the processor, when executing the instructions, implements the method described in any of the above embodiments.
With the video processing method and apparatus according to embodiments of the present invention, a neural network model including a time dimension is trained with labeled video clips, where each label characterizes the amount of effective information contained in a clip. The resulting optimized model can evaluate the effective information contained in an unseen video clip and can therefore be used to extract, from an unseen video file, the highlight clip containing the largest amount of effective information.
Brief description of the drawings
The above and other objects, features, and advantages of exemplary embodiments of the present invention will become easier to understand by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of example rather than limitation, in which:
Figures 1A and 1B schematically show application scenarios of the model training method, the video processing method, and the corresponding apparatuses according to embodiments of the present invention;
Figure 2 schematically shows a flowchart of a model training method according to an embodiment of the present invention;
Figure 3 schematically shows a flowchart of a model training process according to an embodiment of the present invention;
Figure 4 schematically shows an example diagram of a model training process according to an embodiment of the present invention;
Figure 5 schematically shows a flowchart of a video processing method according to an embodiment of the present invention;
Figure 6 schematically shows an example diagram of a video processing procedure according to an embodiment of the present invention;
Figure 7 schematically shows a block diagram of a model training apparatus according to an embodiment of the present invention;
Figure 8 schematically shows a block diagram of a video processing apparatus according to an embodiment of the present invention;
Figure 9 schematically shows a schematic diagram of a computer-readable storage medium product according to an embodiment of the present invention; and
Figure 10 schematically shows a block diagram of a computing device according to an embodiment of the present invention.
In the drawings, identical or corresponding reference numerals indicate identical or corresponding parts.
Detailed description
The principles and spirit of the present invention are described below with reference to several illustrative embodiments. It should be understood that these embodiments are provided only so that those skilled in the art can better understand and practice the present invention, and not to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present invention can be implemented as a system, an apparatus, a device, a method, or a computer program product. Accordingly, the present disclosure may be embodied entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or in a form combining hardware and software.
According to embodiments of the present invention, a model training method, a video processing method, an apparatus, a medium, and a computing device are proposed.
Herein, it should be understood that the terms involved include: Graphics Interchange Format (GIF) files, convolutional neural networks (Convolutional Neural Network, CNN), 3D convolutional networks, video clips, and so on. A GIF file is a continuous-tone lossless compression format based on the Lempel-Ziv-Welch (LZW) algorithm. Unlike other still-image display formats, a GIF file adds a time dimension: it is a continuous, automatically playing animated picture. A CNN is a feed-forward neural network whose artificial neurons respond to surrounding units within a local receptive field; it performs outstandingly on large-scale image processing. When processing images, the convolution kernels used are usually 2-dimensional, with each kernel acting on one feature map (Feature Map); together with pooling layers and the like, the local and global features of an image can be obtained well. Convolutional layers combined with pooling layers, fully connected layers, and so on can build a complete neural network model, and the model learns its convolution kernel parameters through the back-propagation algorithm. A 3D convolutional network moves from processing 2-dimensional image data to processing 3-dimensional video data, with the convolution kernels likewise expanded from 2 dimensions to 3. The computation is similar to 2D convolution, except that the added time dimension makes it possible to capture video-level features well. A video is in fact composed of multiple video clips. These clips have a certain temporal correlation as well as a certain independence. Temporal correlation means that, when connected in chronological order, the clips form one continuous, smooth, complete video. Independence means that when certain clips are shown on their own, a user can understand their content without depending on the surrounding context. The amount of information contained in each clip also differs. In addition, any number of elements in the drawings is for illustration rather than limitation, and any naming is used only for distinction, without any limiting meaning.
Below, the principles and spirit of the present invention are explained in detail with reference to several representative embodiments of the invention.
Overview of the invention
With the model training approach of the prior art, the resulting model has limited ability to distinguish between video clips of different highlight quality. To this end, embodiments of the present invention provide a model training method, a video processing method, and apparatuses. The model training method comprises: obtaining multiple video clips; adding a label to each of the multiple video clips, where the label characterizes the amount of effective information contained in the video clip; establishing a neural network model including a time dimension; and then training the neural network model using the multiple labeled video clips to obtain an optimized model, where the optimized model is used to extract, from a video file, the target video clip containing the largest amount of effective information. Having described the basic principles of the invention, various non-limiting embodiments of the present invention are described in detail below.
Application scenarios overview
Referring first to Figures 1A and 1B, the application scenarios of the model training method, the video processing method, and the corresponding apparatuses of the embodiments of the present invention are elaborated.
Figures 1A and 1B schematically show application scenarios of the model training method, the video processing method, and the corresponding apparatuses according to embodiments of the present invention.
As shown in Figure 1A, the scenario may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types such as wired or wireless communication links.
The terminal devices 101, 102, and 103 can be various electronic devices, may have the same or different computing capabilities, and can support video playback; they include, but are not limited to, smartphones, tablets, laptop computers, desktop computers, and the like. Various client applications, such as video playback applications (examples only), can be installed on the terminal devices 101, 102, and 103. As shown in Figure 1B, a user watches, through a video playback application on the terminal device 101, video data published by the server 105.
A user may use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104, for example to receive or send messages. The server 105 may be a server providing various services, such as a back-end management server (merely an example) that provides video data, the optimized model, and so on to the terminal devices 101, 102, and 103. The back-end management server may analyze and otherwise process received user requests and feed the processing results (for example, web pages, information, or data generated according to the user request) back to the terminal devices.
It should be noted that the model training method and/or video processing method provided by the embodiments of the present disclosure can generally be executed by the server 105. Correspondingly, the model training apparatus and/or video processing apparatus provided by the embodiments of the present disclosure can generally be arranged in the server 105. The model training method and/or video processing method provided by the embodiments of the present disclosure can also be executed by the terminal devices 101, 102, and 103; correspondingly, the model training apparatus and/or video processing apparatus can be arranged in the terminal devices 101, 102, and 103. Alternatively, the model training method and/or video processing method can be executed by another server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105; correspondingly, the model training apparatus and/or video processing apparatus can also be arranged in such a server or server cluster.
It should be understood that the numbers and types of terminal devices, networks, and servers in Figures 1A and 1B are merely illustrative. Any number and any type of terminal devices, networks, and servers can be used according to actual needs.
Exemplary methods
Below, with the application scenarios of Figures 1A and 1B in mind, the model training method and video processing method according to exemplary embodiments of the present invention are described with reference to Figures 2 to 6. It should be noted that the above application scenarios are shown merely to facilitate understanding of the spirit and principles of the present invention; embodiments of the present invention are not limited in this respect and can be applied to any applicable scenario.
Figure 2 schematically shows a flowchart of a model training method according to an embodiment of the present invention.
As shown in Figure 2, the method may include the following operations S210 to S240.
In operation S210, multiple video clips are obtained.
In operation S220, a label is added to each of the multiple video clips.
Labels correspond one-to-one with video clips, and the label added to a video clip characterizes the amount of effective information that the clip contains. The more effective information a video clip contains, the more appealing the clip should be to a viewer. If any two video clips have the same label, their highlight quality is comparable; if they have different labels, their highlight quality differs.
In operation S230, a neural network model including a time dimension is established.
For the information in a video clip, temporal characteristics are extremely important. Operation S230 therefore extends the commonly used two-dimensional neural network model and establishes a neural network model including a time dimension, which can then capture the timing information of video clips. For example, a convolutional neural network including a time dimension, i.e. a 3D convolutional network, can be established, whose convolution kernels are 3D kernels obtained by adding a time dimension to 2D kernels. Illustratively, the 3D convolutional network can adopt any of a variety of neural network models such as C3D, P3D, I3D, or (2+1)D, without restriction here.
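As a concrete illustration of a convolution kernel with a time dimension, the following is a minimal sketch of a clip-scoring network built from 3D convolutions, assuming PyTorch; the layer count and sizes are illustrative assumptions rather than the architecture claimed here.

```python
import torch.nn as nn

class Clip3DScorer(nn.Module):
    """Scores a clip tensor of shape (batch, 3, frames, height, width)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # 3D kernel: (time, height, width), so temporal patterns are convolved too
            nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),   # global pooling over time and space
        )
        self.scorer = nn.Sequential(   # fully connected + scoring layer
            nn.Flatten(),
            nn.Linear(128, 1),
            nn.Sigmoid(),              # clip score in [0, 1], matching the description
        )

    def forward(self, x):
        return self.scorer(self.features(x)).squeeze(-1)
```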
In operation S240, the neural network model is trained using the multiple labeled video clips to obtain an optimized model.
Specifically, the above neural network model including a time dimension is trained using the multiple labeled video clips, and the parameters of the model are adjusted continually according to a loss function. When the loss function converges, training is determined to be complete and the optimized model is obtained. The optimized model is used to extract, from any video file, the target video clip containing the largest amount of effective information, that is, to extract the most excellent video clip from the video file as the target video clip.
Those skilled in the art will understand that the method shown in Figure 2 trains a neural network model including a time dimension using labeled video clips, where each label characterizes the amount of effective information a clip contains. The resulting optimized model can evaluate the effective information contained in an unseen video clip, and can therefore extract, from an unseen video file, the highlight clip containing the largest amount of effective information.
In one embodiment of the invention, the labels added to video clips may include: a first label, a second label, and a third label, where the amount of effective information characterized by the first label is greater than that characterized by the second label, which in turn is greater than that characterized by the third label.
In one embodiment of the invention, the multiple video clips used as training samples can be obtained in at least one of the following ways (1) and (2).
Way (1): first obtain multiple video samples and GIF files associated with the multiple video samples. Then, for any video sample among them, determine the start position and end position of the associated GIF file within that video sample, and extract from the video sample the video clip spanning the start position to the end position.
For example, the open-source dataset Video2GIF contains GIF files created by a large number of users together with the original videos, from which multiple GIF files and their corresponding video samples can be obtained. Suppose GIF1 and video sample 1 are obtained, and the start position of GIF1 in video sample 1 is determined to be 13 minutes 21 seconds and the end position 25 minutes 11 seconds. The video clip from 13:21 to 25:11 can then be extracted from video sample 1 as a training sample. Note that a GIF file is essentially an extracted video clip that has undergone compression and other processing; to obtain the complete video clip, the GIF file cannot be used directly, and the clip must instead be determined from the video sample according to the start and end positions of the GIF file.
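A minimal sketch of this re-extraction step, assuming the ffmpeg command-line tool is installed; the file names follow the example above and are otherwise illustrative.

```python
import subprocess

def extract_clip(video_path, start_sec, end_sec, out_path):
    # Cut [start_sec, end_sec] out of the source video without re-encoding.
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-ss", str(start_sec),   # start position of the GIF within the video
        "-to", str(end_sec),     # end position of the GIF within the video
        "-c", "copy",
        out_path,
    ], check=True)

# GIF1 spans 13 min 21 s to 25 min 11 s of video sample 1:
extract_clip("video_sample_1.mp4", 13 * 60 + 21, 25 * 60 + 11, "clip_first_label.mp4")
```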
Since each GIF file was manually clipped from the corresponding video sample by a user, the GIF is of relatively high quality, and the corresponding video clip contains more effective information and has higher highlight quality. Therefore, when labeling the video clips extracted by way (1), the label class characterizing a larger amount of effective information can be added. Illustratively, the first label is added to each video clip extracted from a video sample spanning the start position to the end position.
Furthermore, since the GIFs in the open-source dataset were made by a large number of users, they cover content from many angles of interest; the video clips extracted by way (1) can therefore cover many kinds of scenes and meet the needs of the subsequent model training.
Way (2): first obtain multiple video samples, then apply video segmentation to the multiple video samples to obtain multiple video clips.
For example, multiple video samples can be obtained from multiple channels. Each video sample is segmented based on scene-change detection using the FFmpeg tool, yielding multiple video clips; a segmentation sketch is given below. Labels can then be added to these clips manually. When a video clip shows a complete segment that draws a viewer's attention, such as a clash between objects in the video, a point of sudden attraction, or an accident (for example, fireworks bursting or a goal in a match), the first label can be added, indicating high highlight quality. When a video clip shows an ordinary action or scene (such as walking or drinking water), the second label can be added, indicating average highlight quality. When a video clip shows a static or nearly static scene (such as a person speaking or pondering, with no other action), or shows largely redundant, uninformative content (such as rapid flicker or violent shaking), the third label can be added, indicating low highlight quality: such a clip will not attract the viewer's attention and may even cause discomfort such as dizziness.
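A hedged sketch of the scene-based splitting in way (2). The text mentions the FFmpeg tool; the PySceneDetect library is used here instead as an illustrative stand-in for scene-change detection, and each resulting span would become one clip to be labeled.

```python
from scenedetect import detect, ContentDetector

def split_into_clips(video_path):
    # Detect scene changes and return (start_seconds, end_seconds) per scene.
    scenes = detect(video_path, ContentDetector())
    return [(start.get_seconds(), end.get_seconds()) for start, end in scenes]

for start, end in split_into_clips("video_sample.mp4"):
    print(f"clip spans {start:.2f}s to {end:.2f}s")
```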
In one embodiment of the invention, a convolutional neural network model is established and configured with one or more convolution kernels that include a time dimension.
Figure 3 schematically shows a flowchart of a model training process according to an embodiment of the present invention, illustrating how operation S240 trains the neural network model with the multiple labeled video clips to obtain the optimized model.
As shown in Figure 3, the process may include the following operations S241 and S242.
In operation S241, multiple sample pairs are constructed based on the multiple labeled video clips.
Each sample pair contains two video clips with different labels. For example, among the obtained video clips, two clips are chosen at a time to form a sample pair, and the labels of the two clips can be one of the following combinations: (first label, second label), (first label, third label), or (second label, third label). Each sample pair therefore contains one video clip of relatively high highlight quality and one of relatively low highlight quality; a minimal construction sketch follows.
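A minimal sketch of the pair construction, under the assumption that labels are encoded as integers 1 to 3 with 1 characterizing the most effective information; only the different-label requirement and the ordering come from the text.

```python
from itertools import combinations

def build_pairs(labeled_clips):
    """labeled_clips: list of (clip, label) tuples, label in {1, 2, 3}."""
    pairs = []
    for (clip_a, la), (clip_b, lb) in combinations(labeled_clips, 2):
        if la == lb:
            continue  # equal labels mean comparable quality: not a valid pair
        # order each pair so its first clip carries more effective information
        pairs.append((clip_a, clip_b) if la < lb else (clip_b, clip_a))
    return pairs
```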
In operation S242, the neural network model is trained using the multiple sample pairs to obtain the optimized model.
It will be appreciated that this embodiment trains the neural network on sample pairs, each composed of two video clips with different labels, so that during signal propagation the network gradually learns the difference between the higher-quality and lower-quality clip within a pair, yielding an optimized model capable of distinguishing video clips of different highlight quality.
Illustratively, a convolutional neural network model is established and configured with one or more convolution kernels including a time dimension, serving as the neural network model used in this example. Training the model with the multiple sample pairs to obtain the optimized model may proceed as follows: for any sample pair consisting of a first video clip and a second video clip, where the effective information characterized by the label of the first clip exceeds that of the second clip, the sample pair is input into the neural network model to obtain a first score for the first video clip and a second score for the second video clip. The second score is then subtracted from the first score to obtain a first value, and a loss value is determined from the first value using a loss function, which is monotonically decreasing. When the loss value is less than or equal to a predetermined threshold, the neural network model is determined to be the optimized model; when the loss value is greater than the predetermined threshold, the parameters of the neural network model are optimized and the above operations are repeated until the optimized model is obtained.
Figure 4 schematically shows an example diagram of a model training process according to an embodiment of the present invention.
In the example shown in Figure 4, the neural network model is realized as a 3D convolutional network combined with fully connected layers. The first video clip S+ and the second video clip S- of a sample pair are input into the neural network model separately. For instance, the first video clip is a clip carrying the first label extracted by way (1) above, and the second video clip is a clip carrying the second label obtained by way (2) above. After the operations propagate through its layers, the neural network model outputs the first score h(S+) of the first video clip and the second score h(S-) of the second video clip, where both h(S+) and h(S-) take values in the interval [0, 1]. Based on the first score h(S+), the second score h(S-), and the loss function of the neural network model, the loss value loss_p(S+, S-) is determined, which can be expressed by formula (1):

loss_p(S+, S-) = max(0, 1 - h(S+) + h(S-))^p    (1)

It will be appreciated that the goal of the above training process is to make the score of the higher-quality video clip in each sample pair exceed the score of the lower-quality clip, with the loss value growing larger the further a pair is from this goal. When the loss function converges, training of the neural network model is determined to be complete and the optimized model is obtained. A sketch of this objective and a single training step is given below.
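A hedged PyTorch sketch of formula (1) and one training step. The exponent p, the optimizer, and the threshold value are illustrative assumptions; the margin of 1 and the convergence test against a predetermined threshold follow the description above.

```python
import torch

def pair_loss(score_pos, score_neg, p=1):
    # loss_p(S+, S-) = max(0, 1 - h(S+) + h(S-))^p, decreasing in h(S+) - h(S-)
    return torch.clamp(1.0 - score_pos + score_neg, min=0.0) ** p

def train_step(model, optimizer, clip_pos, clip_neg, threshold=0.05):
    optimizer.zero_grad()
    loss = pair_loss(model(clip_pos), model(clip_neg)).mean()
    loss.backward()                  # adjust parameters while not yet converged
    optimizer.step()
    return loss.item() <= threshold  # True once the loss reaches the threshold
```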
Continuing with Figure 4, when the two video clips of a sample pair are input into the neural network model, an image sequence containing a predetermined number of images can first be extracted from each clip; the extraction can be uniform. The image sequence is then input into the neural network model, which extracts Convolutional 3D (C3D) features from the sequence and passes them through one or more fully connected layers, a scoring layer, and so on, finally producing the score of the video clip. For example, the input is an image sequence composed of 16 images of size 224 x 224, and the output is a score between 0 and 1. For the trained optimized model, the higher the score of a video clip, the more effective information the clip contains and the higher its highlight quality.
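A sketch of the uniform extraction just described: 16 evenly spaced frames resized to 224 x 224 and stacked into the (channels, time, height, width) layout a 3D network expects, assuming OpenCV and PyTorch.

```python
import cv2
import numpy as np
import torch

def sample_frames(clip_path, num_frames=16, size=224):
    cap = cv2.VideoCapture(clip_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        # jump to evenly spaced positions across the clip
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (size, size)))
    cap.release()
    x = torch.tensor(np.stack(frames), dtype=torch.float32) / 255.0  # (T, H, W, C)
    return x.permute(3, 0, 1, 2)  # (C, T, H, W), ready for Conv3d
```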
Figure 5 schematically shows a flowchart of a video processing method according to an embodiment of the present invention.
As shown in Figure 5, the method may include the following operations S510 to S530.
In operation S510, a video file is obtained.
In operation S520, an optimized model is obtained.
The optimized model is obtained by training a neural network model according to the model training method described above; the training process has been detailed above and is not repeated here.
In operation S530, the video file is processed using the optimized model so as to extract, from the video file, the target video clip containing the largest amount of effective information.
Processing the video file with the optimized model yields a score for each of one or more video clips in the video file, and the magnitude of a clip's score characterizes the amount of effective information that the clip contains.
Illustratively, processing the video file with the optimized model to extract the target video clip containing the largest amount of effective information can proceed as follows: the video file is first segmented to obtain multiple candidate video clips. Each candidate video clip is then input into the optimized model to obtain its score. The scores of the candidate clips are compared, and the highest-scoring candidate clip is taken as the target video clip of the video file; a sketch of this flow is given below.
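A sketch of this scoring flow, reusing the sample_frames sketch above and assuming the candidate clips have already been cut into separate files; names are illustrative.

```python
import torch

def pick_target_clip(model, clip_paths):
    """clip_paths: files for the candidate clips produced by segmentation."""
    model.eval()
    best_path, best_score = None, float("-inf")
    with torch.no_grad():
        for path in clip_paths:
            x = sample_frames(path).unsqueeze(0)  # (1, C, T, H, W)
            score = model(x).item()               # clip score in [0, 1]
            if score > best_score:
                best_path, best_score = path, score
    return best_path, best_score
```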
In one embodiment of the invention, the target video clip extracted from the video file can be used to make the cover of that video file, so that a cover containing a large amount of high-quality information attracts users' attention and effectively converts users into viewers of the video file. The video processing method according to an embodiment of the present invention may therefore further include: producing a GIF file based on the target video clip, and displaying that GIF file as the cover of the video file.
Figure 6 schematically shows an example diagram of a video processing procedure according to an embodiment of the present invention.
As shown in Figure 6, the video file is first segmented into multiple candidate video clips, which may include, for example, clip t-n, clip t, clip t+n, clip t+m, and so on, where t, m, and n are positive integers and t is greater than n. The segmentation can be based on scenes, or on the behavior of target objects in the video file. Each candidate clip contains multiple consecutive images, and a test image sequence containing a predetermined number of images is extracted from each candidate clip. For example, a test image sequence containing 16 images is extracted from each candidate clip, and the extraction is uniform: if a candidate clip contains 100 consecutive images, 16 of them are extracted uniformly and arranged in order to form the corresponding test image sequence. The test image sequence of each candidate clip is then input into the optimized model, which scores each candidate clip and outputs its score. Illustratively, clip t-n scores 0.6, clip t scores 0.7, clip t+n scores 0.8, and clip t+m scores 0.1. The highest-scoring clip t+n is determined to be the target video clip, and a corresponding GIF file is made based on this target clip to serve as the cover of the video file.
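A hedged sketch of the final cover-making step with the ffmpeg command-line tool; the frame rate and width used for the GIF are illustrative choices, not values from the text.

```python
import subprocess

def make_gif_cover(clip_path, gif_path):
    # Convert the winning clip into an animated GIF to serve as the cover.
    subprocess.run([
        "ffmpeg", "-y", "-i", clip_path,
        "-vf", "fps=10,scale=320:-1",  # lower fps and width keep the GIF small
        gif_path,
    ], check=True)

make_gif_cover("clip_t_plus_n.mp4", "cover.gif")  # e.g. the 0.8-scoring clip from Fig. 6
```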
As can be seen from the above, the video processing method according to embodiments of the present invention can intelligently extract, from a video file, the highlight clip containing the largest amount of effective information, make it into a GIF file, and display it to users as the cover of the video file. This can help users quickly learn what is interesting in the video file and attract them to watch it, improving user experience and increasing click-through rate and dwell time.
Exemplary apparatuses
Having described the methods of the exemplary embodiments of the present invention, the model training apparatus and video processing apparatus of the exemplary embodiments are next described in detail with reference to Figures 7 and 8.
Figure 7 schematically shows a block diagram of a model training apparatus according to an embodiment of the present invention.
As shown in Figure 7, the model training apparatus 700 includes: a first acquisition module 710, a labeling module 720, a modeling module 730, and a training module 740.
The first acquisition module 710 is used to obtain multiple video clips.
The labeling module 720 is used to add a label to each of the multiple video clips, where the label characterizes the amount of effective information contained in the video clip.
The modeling module 730 is used to establish a neural network model including a time dimension.
The training module 740 is used to train the neural network model using the multiple labeled video clips to obtain an optimized model, where the optimized model is used to extract, from a video file, the target video clip containing the largest amount of effective information.
In one embodiment of the invention, the labels include: a first label, a second label, and a third label, where the amount of effective information characterized by the first label is greater than that characterized by the second label, which in turn is greater than that characterized by the third label.
In another embodiment of the invention, the training module 740 includes a pair-construction submodule and a pair-training submodule. The pair-construction submodule is used to construct multiple sample pairs based on the multiple labeled video clips, each sample pair containing two video clips with different labels. The pair-training submodule is used to train the neural network model using the multiple sample pairs to obtain the optimized model.
In yet another embodiment of the invention, the pair-training submodule is specifically configured to: for any sample pair consisting of a first video clip and a second video clip, where the effective information characterized by the label of the first clip exceeds that of the second clip, input the sample pair into the neural network model to obtain a first score for the first video clip and a second score for the second video clip; subtract the second score from the first score to obtain a first value; and determine a loss value from the first value using a monotonically decreasing loss function. When the loss value is less than or equal to a predetermined threshold, the neural network model is determined to be the optimized model; when the loss value is greater than the predetermined threshold, the parameters of the neural network model are optimized and the above operations are repeated until the optimized model is obtained.
In a further embodiment of the invention, the pair-training submodule may input a sample pair into the neural network model as follows: for each video clip of the sample pair, it extracts from the clip an image sequence containing a predetermined number of images and inputs the image sequence into the neural network model.
In a further embodiment of the invention, the modeling module 730 is specifically configured to establish a convolutional neural network model and to configure it with one or more convolution kernels including a time dimension.
In a further embodiment of the invention, the first acquisition module 710 includes a sample-acquisition submodule and a clip-extraction submodule. The sample-acquisition submodule is used to obtain multiple video samples and GIF files associated with the multiple video samples. The clip-extraction submodule is used, for any video sample among them, to determine the start and end positions of the associated GIF file within that video sample, and to extract from the video sample the video clip spanning the start position to the end position.
In a further embodiment of the invention, the labeling module 720 is specifically configured to add the first label to any video clip extracted from a video sample and spanning the start position to the end position.
In a further embodiment of the invention, the first acquisition module 710 includes a sample-acquisition submodule and a segmentation submodule. The sample-acquisition submodule is used to obtain multiple video samples. The segmentation submodule is used to apply video segmentation to the multiple video samples to obtain multiple video clips.
Figure 8 schematically shows a block diagram of a video processing apparatus according to an embodiment of the present invention.
As shown in Figure 8, the video processing apparatus 800 includes: a second acquisition module 810 and an extraction module 820.
The second acquisition module 810 is used to obtain a video file.
The extraction module 820 is used to process the video file with an optimized model so as to extract, from the video file, the target video clip containing the largest amount of effective information, where the optimized model is trained with the model training method described in any of the above embodiments.
In one embodiment of the invention, the extraction module 820 includes: a video-segmentation submodule, an evaluation submodule, and a determination submodule. The video-segmentation submodule is used to segment the video file to obtain multiple candidate video clips. The evaluation submodule is used, for each candidate video clip, to input the candidate clip into the optimized model and obtain its score. The determination submodule is used to take the highest-scoring candidate video clip as the target video clip.
In another embodiment of the invention, the evaluation submodule is specifically configured to extract, from a candidate video clip, a test image sequence containing the predetermined number of images and to input that sequence into the optimized model.
In yet another embodiment of the invention, the video processing apparatus 800 further includes a display module, used to produce a GIF file based on the target video clip and to display that GIF file as the cover of the video file.
It should be noted that, for each module/unit/sub-unit of the apparatus embodiments, the implementation, the technical problem solved, the function realized, and the technical effect achieved are the same as or similar to those of the corresponding steps in the method embodiments, and are not repeated here.
Exemplary media
Having described the methods and apparatuses of the exemplary embodiments of the present invention, the medium for implementing the model training method and/or the video processing method of the exemplary embodiments of the present invention is introduced next.
An embodiment of the present invention provides a medium storing computer-executable instructions which, when executed by a processor, implement the model training method and/or the video processing method described in any of the above method embodiments.
In some possible implementations, aspects of the present invention can also be implemented in the form of a program product including program code; when the program product runs on a computing device, the program code causes the computing device to execute the operation steps of the model training method and/or the video processing method according to the various exemplary embodiments of the present invention described in the 'Exemplary methods' section of this specification.
The program product can employ any combination of one or more readable media. A readable medium can be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
Figure 9 schematically shows a schematic diagram of a computer-readable storage medium product according to an embodiment of the present invention. As shown in Figure 9, a program product 90 for implementing the model training method and/or the video processing method according to an embodiment of the present invention is described; it can employ a portable compact disc read-only memory (CD-ROM), includes program code, and can run on a computing device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium can be any tangible medium containing or storing a program, which can be used by, or in connection with, an instruction execution system, apparatus, or device.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave and carrying readable program code. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium can also be any readable medium other than a readable storage medium, capable of sending, propagating, or transmitting a program for use by, or in connection with, an instruction execution system, apparatus, or device.
The program code contained on a readable medium can be transmitted over any suitable medium, including, but not limited to, wireless, wired, optical cable, RF, and so on, or any suitable combination of the above.
Program code for carrying out operations of the present invention can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the 'C' language or similar programming languages. The program code can execute entirely on the user's computing device, partly on the user's computing device, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. In the case involving a remote computing device, the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (for example, through the Internet using an Internet service provider).
Exemplary computing device
Having described the methods, media, and apparatuses of the exemplary embodiments of the present invention, a computing device for implementing the model training method and/or the video processing method according to another exemplary embodiment of the present invention is introduced next.
An embodiment of the present invention further provides a computing device, comprising: a memory, a processor, and executable instructions stored in the memory and runnable on the processor, wherein the processor, when executing the instructions, implements the model training method and/or the video processing method described in any of the above method embodiments.
Those of ordinary skill in the art will appreciate that aspects of the present invention may be implemented as a system, a method, or a program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, and the like), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit", a "module", or a "system".
In some possible embodiments, a computing device for implementing the model training method and/or the video processing method according to the present invention may include at least one processing unit and at least one storage unit. The storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the operating steps of the model training method and/or the video processing method according to the various exemplary embodiments of the present invention described in the "Exemplary Methods" section of this specification.
A computing device 100 for implementing the model training method and/or the video processing method according to this embodiment of the present invention is described below with reference to Fig. 10. The computing device 100 shown in Fig. 10 is only an example and should not impose any limitation on the functionality or scope of use of embodiments of the present invention.
As shown in Fig. 10, the computing device 100 takes the form of a general-purpose computing device. Components of the computing device 100 may include, but are not limited to: the above-mentioned at least one processing unit 1001, the above-mentioned at least one storage unit 1002, and a bus 1003 connecting the different system components (including the storage unit 1002 and the processing unit 1001).
The bus 1003 includes a data bus, an address bus, and a control bus.
The storage unit 1002 may include volatile memory, such as a random access memory (RAM) 10021 and/or a cache memory 10022, and may further include a read-only memory (ROM) 10023.
The storage unit 1002 may also include a program/utility 10025 having a set of (at least one) program modules 10024, such program modules 10024 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The computing device 100 may also communicate with one or more external devices 1004 (such as a keyboard, a pointing device, or a Bluetooth device), and such communication may take place through an input/output (I/O) interface 1005. Moreover, the computing device 100 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 1006. As shown, the network adapter 1006 communicates with the other modules of the computing device 100 through the bus 1003. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computing device 100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
It should be noted that although several units/modules or sub-units/sub-modules of the model training apparatus and the video processing apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more of the units/modules described above may be embodied in a single unit/module; conversely, the features and functions of one unit/module described above may be further divided and embodied by multiple units/modules.
Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step for execution, and/or one step may be decomposed into multiple steps for execution.
Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the present invention is not limited to the specific embodiments disclosed, and the division into aspects does not mean that features in those aspects cannot be combined to advantage; that division is merely for convenience of presentation. The present invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A model training method, comprising:
obtaining a plurality of video clips;
adding a label to each of the plurality of video clips, wherein the label is used to characterize the effective information contained in the video clip;
establishing a neural network model comprising a time dimension; and
training the neural network model using the plurality of labeled video clips to obtain an optimized model, the optimized model being used to extract, from a video file, a target video segment containing the most effective information.
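By way of non-limiting illustration only, a minimal sketch of a "neural network model comprising a time dimension" follows, assuming PyTorch and a small 3D-convolutional scorer; the library, layer types, and layer sizes are illustrative assumptions and are not prescribed by the claims.

import torch
import torch.nn as nn

class ClipScorer(nn.Module):
    """Scores one video clip; a higher score means more effective information."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # convolves over time as well as space
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # pools over time, height, and width
        )
        self.head = nn.Linear(16, 1)                     # one scalar score per clip

    def forward(self, x):
        # x: (batch, channels=3, num_frames, height, width)
        h = self.features(x).flatten(1)                  # (batch, 16)
        return self.head(h).squeeze(-1)                  # (batch,)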
2. The method according to claim 1, wherein the labels include a first label, a second label, and a third label, the effective information characterized by the first label being greater than the effective information characterized by the second label, and the effective information characterized by the second label being greater than the effective information characterized by the third label.
3. The method according to claim 2, wherein training the neural network model using the plurality of labeled video clips comprises:
constructing a plurality of sample pairs based on the plurality of labeled video clips, each sample pair comprising two video clips with different labels; and
training the neural network model using the plurality of sample pairs to obtain the optimized model.
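By way of non-limiting illustration, the pair construction of this claim could be sketched as follows; the numeric label convention (0 for the first label, 1 for the second, 2 for the third, lower meaning more effective information) is an assumption made only for this example.

from itertools import combinations

def build_sample_pairs(labeled_clips):
    """labeled_clips: list of (clip, label) tuples, lower label = more
    effective information. Returns (better_clip, worse_clip) pairs."""
    pairs = []
    for (clip_a, lab_a), (clip_b, lab_b) in combinations(labeled_clips, 2):
        if lab_a == lab_b:
            continue  # the two clips of a sample pair must carry different labels
        if lab_a < lab_b:
            pairs.append((clip_a, clip_b))
        else:
            pairs.append((clip_b, clip_a))
    return pairs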
4. The method according to claim 3, wherein training the neural network model using the plurality of sample pairs to obtain the optimized model comprises:
for any sample pair comprising a first video clip and a second video clip, where the effective information characterized by the label of the first video clip is greater than the effective information characterized by the label of the second video clip:
inputting the sample pair into the neural network model to obtain a first score for the first video clip and a second score for the second video clip;
subtracting the second score from the first score to obtain a first value;
determining a loss value based on the first value using a loss function, the loss function being a monotonically decreasing function;
when the loss value is less than or equal to a predetermined threshold, determining that the neural network model is the optimized model; and
when the loss value is greater than the predetermined threshold, optimizing the parameters of the neural network model and repeating the above operations until the optimized model is obtained.
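This is a pairwise ranking objective: any monotonically decreasing function of the score difference qualifies as the loss function. A minimal PyTorch sketch follows, using the logistic loss log(1 + e^(-d)) as one such function; the choice of loss function, optimizer, and threshold value are illustrative assumptions, not part of the claim.

import torch.nn.functional as F

def train_pairwise(model, optimizer, sample_pairs, threshold=0.1):
    """sample_pairs: iterable of (first_clip, second_clip) tensor pairs, where
    the first clip's label marks the greater effective information."""
    for first_clip, second_clip in sample_pairs:
        s1 = model(first_clip)           # first score
        s2 = model(second_clip)          # second score
        d = s1 - s2                      # the "first value"
        loss = F.softplus(-d).mean()     # log(1 + exp(-d)), monotonically decreasing in d
        if loss.item() <= threshold:
            return model                 # loss at or below threshold: optimized model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model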
5. The method according to claim 4, wherein inputting the sample pair into the neural network model comprises:
for each video clip of the sample pair, extracting from the video clip an image sequence containing a predetermined number of images; and
inputting the image sequence into the neural network model.
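One plausible reading of this claim, sketched with OpenCV and NumPy, samples a fixed number of frames uniformly from each clip; the library choice, frame count, and frame resolution are assumptions, since the claim fixes only that the number of images is predetermined.

import cv2
import numpy as np

def sample_frames(video_path, num_frames=16):
    """Uniformly samples num_frames frames from the clip at video_path."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (224, 224)))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, 224, 224, 3)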
6. A video processing method, comprising:
obtaining a video file; and
processing the video file using an optimized model to extract, from the video file, a target video segment containing the most effective information, the optimized model being obtained based on the model training method according to any one of claims 1 to 5.
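The claim leaves open how candidate segments are produced from the video file; a minimal inference sketch follows, assuming the candidates come from a sliding window over the file and that each candidate is already a frame-sequence tensor the model accepts.

def extract_target_segment(candidate_clips, optimized_model):
    """candidate_clips: list of frame-sequence tensors cut from one video file.
    Returns the index of the highest-scoring clip, i.e. the segment judged to
    contain the most effective information."""
    scores = [float(optimized_model(clip)) for clip in candidate_clips]
    return scores.index(max(scores))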
7. A model training apparatus, comprising:
a first obtaining module, configured to obtain a plurality of video clips;
a labeling module, configured to add a label to each of the plurality of video clips, wherein the label is used to characterize the effective information contained in the video clip;
a modeling module, configured to establish a neural network model comprising a time dimension; and
a training module, configured to train the neural network model using the plurality of labeled video clips to obtain an optimized model, the optimized model being used to extract, from a video file, a target video segment containing the most effective information.
8. A video processing apparatus, comprising:
a second obtaining module, configured to obtain a video file; and
an extraction module, configured to process the video file using an optimized model to extract, from the video file, a target video segment containing the most effective information, the optimized model being obtained based on the model training method according to any one of claims 1 to 5.
9. A medium storing computer-executable instructions which, when executed by a processor, implement:
the model training method according to any one of claims 1 to 5; and/or
the video processing method according to claim 6.
10. A computing device, comprising: a memory, a processor, and executable instructions stored in the memory and runnable on the processor, wherein the processor, when executing the instructions, implements:
the model training method according to any one of claims 1 to 5; and/or
the video processing method according to claim 6.
CN201910811249.3A 2019-08-29 2019-08-29 Model training method, method for processing video frequency, device, medium and calculating equipment Pending CN110516749A (en)

Priority Applications (1)

Application Number: CN201910811249.3A; Priority Date: 2019-08-29; Filing Date: 2019-08-29; Title: Model training method, method for processing video frequency, device, medium and calculating equipment

Publications (1)

Publication Number: CN110516749A (en); Publication Date: 2019-11-29

Family ID: 68629341


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Title
CN107864411A (en) * 2017-10-31 2018-03-30 广东小天才科技有限公司 A kind of picture output method and terminal device
CN110166827A (en) * 2018-11-27 2019-08-23 深圳市腾讯信息技术有限公司 Determination method, apparatus, storage medium and the electronic device of video clip
CN109934249A (en) * 2018-12-14 2019-06-25 网易(杭州)网络有限公司 Data processing method, device, medium and calculating equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YIFAN JIAO et al.: "Three-Dimensional Attention-Based Deep Ranking Model for Video Highlight Detection", IEEE Transactions on Multimedia *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Title
CN113132753A (en) * 2019-12-30 2021-07-16 阿里巴巴集团控股有限公司 Data processing method and device and video cover generation method and device
CN113259601A (en) * 2020-02-11 2021-08-13 北京字节跳动网络技术有限公司 Video processing method and device, readable medium and electronic equipment
US11996124B2 (en) 2020-02-11 2024-05-28 Beijing Bytedance Network Technology Co., Ltd. Video processing method, apparatus, readable medium and electronic device
CN111314665A (en) * 2020-03-07 2020-06-19 上海中科教育装备集团有限公司 Key video segment extraction system and method for video post-scoring
CN111984821A (en) * 2020-06-22 2020-11-24 汉海信息技术(上海)有限公司 Method and device for determining dynamic cover of video, storage medium and electronic equipment
EP3961491A1 (en) * 2020-08-25 2022-03-02 Beijing Xiaomi Pinecone Electronics Co., Ltd. Method for extracting video clip, apparatus for extracting video clip, and storage medium
US11847818B2 (en) 2020-08-25 2023-12-19 Beijing Xiaomi Pinecone Electronics Co., Ltd. Method for extracting video clip, device for extracting video clip, and storage medium
CN113032624A (en) * 2021-04-21 2021-06-25 北京奇艺世纪科技有限公司 Video viewing interest degree determining method and device, electronic equipment and medium

Similar Documents

Publication Title
CN110531860B (en) Animation image driving method and device based on artificial intelligence
CN110516749A (en) Model training method, method for processing video frequency, device, medium and calculating equipment
US20220180882A1 (en) Training method and device for audio separation network, audio separation method and device, and medium
CN110580500B (en) Character interaction-oriented network weight generation few-sample image classification method
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN111476871B (en) Method and device for generating video
CN106126524B (en) Information pushing method and device
CN111611436A (en) Label data processing method and device and computer readable storage medium
EP3095091A1 (en) Method and apparatus of processing expression information in instant communication
CN110209810B (en) Similar text recognition method and device
CN113766299B (en) Video data playing method, device, equipment and medium
CN112203115B (en) Video identification method and related device
CN111738010B (en) Method and device for generating semantic matching model
CN110880324A (en) Voice data processing method and device, storage medium and electronic equipment
KR102284862B1 (en) Method for providing video content for programming education
CN113395578A (en) Method, device and equipment for extracting video theme text and storage medium
CN111159380A (en) Interaction method and device, computer equipment and storage medium
CN110531849A (en) A kind of intelligent tutoring system of the augmented reality based on 5G communication
CN114461853A (en) Training sample generation method, device and equipment of video scene classification model
CN110855487A (en) Network user similarity management method, device and storage medium
CN110309753A (en) A kind of race process method of discrimination, device and computer equipment
CN110516153B (en) Intelligent video pushing method and device, storage medium and electronic device
CN110585730B (en) Rhythm sensing method and device for game and related equipment
CN112016077A (en) Page information acquisition method and device based on sliding track simulation and electronic equipment
CN112115703B (en) Article evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20191129