CN110213668A - Method and apparatus for generating a video title, electronic device, and storage medium - Google Patents

Method and apparatus for generating a video title, electronic device, and storage medium

Info

Publication number
CN110213668A
CN110213668A (application CN201910356968.0A)
Authority
CN
China
Prior art keywords
frame image
information
network
message
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910356968.0A
Other languages
Chinese (zh)
Inventor
左凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN201910356968.0A
Publication of CN110213668A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/488 Data services, e.g. news ticker
    • H04N 21/4884 Data services, e.g. news ticker for displaying subtitles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/835 Generation of protective data, e.g. certificates
    • H04N 21/8352 Generation of protective data, e.g. certificates involving content or source identification data, e.g. Unique Material Identifier [UMID]

Abstract

The present invention provides a video title generation method, apparatus, electronic device, and storage medium. The method comprises: inputting each frame image of a target video into a pre-trained first neural network, which predicts a feature vector for each frame image; inputting the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image; inputting each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image; and inputting the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information. Because the title information is predicted from both action information and scene information, the comprehensiveness and accuracy of the title information are improved.

Description

Method and apparatus for generating a video title, electronic device, and storage medium
Technical field
Embodiments of the present invention relate to the technical field of video recommendation, and in particular to a video title generation method and apparatus, an electronic device, and a storage medium.
Background art
When recommending a video to a user, title information needs to be generated for the video to help the user decide whether to watch it. The title information may include key information from the video.
In the prior art, generating title information for a video typically involves the following steps. First, color features, contour features, scene features, and character features of the moving objects in each frame image of the video are extracted and analyzed. Second, pictures of multiple known classes are processed with the same feature-extraction method, and their contour and scene features are used to train a contour classifier and a scene classifier. Next, the feature extraction, the analysis method, and the classifiers are applied to the video to be retrieved, producing type labels for the objects in each frame image, which are used to build an object label database. Finally, after a user submits a query, a retrieval server searches the object label database for videos relevant to the query and returns an ordered result list for the user to browse.
During research into the above scheme, the inventor found that it does not make full use of all the information in the video, so the generated title information is neither comprehensive nor accurate enough.
Summary of the invention
The present invention provides a video title generation method and apparatus, an electronic device, and a storage medium to solve the above problems in the prior art.
According to a first aspect of the present invention, a video title generation method is provided. The method comprises:
inputting each frame image of a target video into a pre-trained first neural network, which predicts a feature vector for each frame image;
inputting the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image;
inputting each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image;
inputting the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information.
Optionally, the step of inputting the action information and scene information of each frame image, in chronological order, into the pre-trained fourth neural network to predict title information comprises:
identifying the language information in each frame image;
inputting the action information, scene information, and language information of each frame image, in chronological order, into the pre-trained fourth neural network, which predicts title information.
Optionally, the step of inputting the action information, scene information, and language information of each frame image, in chronological order, into the pre-trained fourth neural network to predict title information comprises:
for each frame image, splicing the corresponding action information, scene information, and language information into target information;
inputting the target information into the pre-trained fourth neural network, which predicts title information.
Optionally, the step of identifying the language information in each frame image comprises:
identifying the audio information in each frame image by speech recognition; and/or
identifying the text information in each frame image by character recognition;
fusing the audio information and/or text information into language information.
Optionally, before the step of fusing the audio information and/or text information into language information, the method further comprises:
obtaining, by face detection based on facial key points, the person information corresponding to the audio information and/or text information in each frame image;
and the step of fusing the audio information and/or text information into language information comprises:
fusing the audio information and/or text information, together with the corresponding person information, into language information.
Optionally, before the step of inputting each frame image of the target video into the pre-trained first neural network to predict a feature vector for each frame image, the method further comprises:
collecting video samples, where each frame image sample in a video sample carries the following annotation information: action annotation information, scene annotation information, and title annotation information;
training the first, second, third, and fourth neural networks with the video samples and the action, scene, and title annotation information.
Optionally, the step of training the first, second, third, and fourth neural networks with the video samples and the action, scene, and title annotation information comprises:
inputting each frame image sample of a video sample into the first neural network, which predicts a feature vector for each frame image sample;
inputting the feature vectors of the frame image samples, in chronological order, into the second neural network, which predicts action information for each frame image sample;
training the first and second neural networks according to the action annotation information and the predicted action information of each frame image sample;
inputting each frame image sample of the video sample into the third neural network, which predicts scene information for each frame image sample;
training the third neural network according to the scene annotation information and the predicted scene information of each frame image sample;
inputting the action information and scene information of the frame image samples, in chronological order, into the fourth neural network, which predicts title information;
training the fourth neural network according to the title annotation information of the video sample and the predicted title information.
Optionally, the first neural network is a three-dimensional convolutional neural network, the second and fourth neural networks are long short-term memory networks, and the third neural network is a convolutional neural network.
According to a second aspect of the present invention, a video title generation apparatus is provided. The apparatus comprises:
a feature vector prediction module, configured to input each frame image of a target video into a pre-trained first neural network, which predicts a feature vector for each frame image;
an action information prediction module, configured to input the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image;
a scene information prediction module, configured to input each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image;
a title information prediction module, configured to input the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information.
Optionally, the title information prediction module comprises:
a language information identification submodule, configured to identify the language information in each frame image;
a title information prediction submodule, configured to input the action information, scene information, and language information of each frame image, in chronological order, into the pre-trained fourth neural network, which predicts title information.
Optionally, the title information prediction submodule comprises:
a target information splicing unit, configured to splice, for each frame image, the corresponding action information, scene information, and language information into target information;
a title information prediction unit, configured to input the target information into the pre-trained fourth neural network, which predicts title information.
Optionally, the language information identification submodule comprises:
an audio information recognition unit, configured to identify the audio information in each frame image by speech recognition; and/or
a text information recognition unit, configured to identify the text information in each frame image by character recognition;
and a language information fusion unit, configured to fuse the audio information and/or text information into language information.
Optionally, the apparatus further comprises:
a person information recognition unit, configured to obtain, by face detection based on facial key points, the person information corresponding to the audio information and/or text information in each frame image;
and the language information fusion unit comprises:
a language information fusion subunit, configured to fuse the audio information and/or text information, together with the corresponding person information, into language information.
Optionally, the apparatus further comprises:
a video sample collection module, configured to collect video samples, where each frame image sample in a video sample carries the following annotation information: action annotation information, scene annotation information, and title annotation information;
a network training module, configured to train the first, second, third, and fourth neural networks with the video samples and the action, scene, and title annotation information.
Optionally, the network training module comprises:
a first prediction submodule, configured to input each frame image sample of a video sample into the first neural network, which predicts a feature vector for each frame image sample;
a second prediction submodule, configured to input the feature vectors of the frame image samples, in chronological order, into the second neural network, which predicts action information for each frame image sample;
a first training submodule, configured to train the first and second neural networks according to the action annotation information and the predicted action information of each frame image sample;
a third prediction submodule, configured to input each frame image sample of the video sample into the third neural network, which predicts scene information for each frame image sample;
a second training submodule, configured to train the third neural network according to the scene annotation information and the predicted scene information of each frame image sample;
a fourth prediction submodule, configured to input the action information and scene information of the frame image samples, in chronological order, into the fourth neural network, which predicts title information;
a third training submodule, configured to train the fourth neural network according to the title annotation information of the video sample and the predicted title information.
Optionally, the first neural network is a three-dimensional convolutional neural network, the second and fourth neural networks are long short-term memory networks, and the third neural network is a convolutional neural network.
According to a third aspect of the present invention, an electronic device is provided, comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the foregoing method when executing the program.
According to a fourth aspect of the present invention, a readable storage medium is provided; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the foregoing method.
Embodiments of the present invention provide a video title generation method and apparatus, an electronic device, and a storage medium. The method comprises: inputting each frame image of a target video into a pre-trained first neural network, which predicts a feature vector for each frame image; inputting the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image; inputting each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image; and inputting the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information. Predicting title information from action information and scene information helps improve its comprehensiveness and accuracy.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the specific steps of a video title generation method provided by Embodiment 1 of the present invention;
Fig. 2 is a flow chart of the specific steps of a video title generation method provided by Embodiment 2 of the present invention;
Fig. 3 is a structure diagram of a video title generation apparatus provided by Embodiment 3 of the present invention;
Fig. 4 is a structure diagram of a video title generation apparatus provided by Embodiment 4 of the present invention.
Specific embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 1, a flow chart of the specific steps of the video title generation method provided by Embodiment 1 of the present invention is shown.
Step 101: input each frame image of the target video into a pre-trained first neural network, which predicts a feature vector for each frame image.
Here, the target video is the video for which title information is to be generated.
The first neural network extracts feature information, i.e., key information, from an image and represents it with feature vectors. For example, for one frame image, the first neural network can identify key information such as persons, key structures, and scenes, and represent them with feature vectors, where different key information corresponds to different feature vectors.
It should be understood that any neural network that can identify the feature information of an image can serve as the first neural network; the embodiments of the present invention place no restriction on it.
Step 102: input the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image.
Here, the second neural network identifies the change information of the feature vectors across frame images and generates action information from that change information. Specifically, if the person in a preceding frame image is standing and the person in the following frame image is seated, the change in the feature vectors between the two frame images reflects that the person's action is sitting down.
It should be understood that any neural network that can identify change information in an image sequence can serve as the second neural network; the embodiments of the present invention place no restriction on it.
In addition, each piece of action information may include a subject, an action, and an object. For example, for a person hitting a ball, the subject is the person, the action is hitting, and the object is the ball. Of course, in practice there may be no object, for example when a person is dancing or a cloud is drifting; in that case the object can be set to empty.
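As an illustration only (the patent does not prescribe a data structure; all names here are hypothetical), such a subject-action-object triple could be held in a small structure whose object field is optional:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ActionInfo:
    """Subject-action-object triple predicted for one frame image."""
    subject: str               # e.g. "person"
    action: str                # e.g. "hit"
    obj: Optional[str] = None  # None when the action has no object

# A person hitting a ball, and a cloud drifting (object left empty):
hit = ActionInfo(subject="person", action="hit", obj="ball")
drift = ActionInfo(subject="cloud", action="drift")
```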
It should be noted that, in practice, steps 101 and 102 can be implemented by a single neural network, i.e., one neural network recognizes the action information of each frame image directly from the target video.
Step 103: input each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image.
Here, scene information is usually determined from the static information in the video, i.e., the information that stays relatively unchanged between a preceding frame image and a following frame image. For example, when multiple frame images contain a table tennis court, the scene information can be a table tennis court; when multiple frame images contain the image information of rain, the scene information can be a rainy day.
It should be noted that scene information can be determined from multiple frame images, but not necessarily from all the images in the target video. For example, if the target video contains 20 frame images, of which the first 10 contain the image information of a table tennis court and the last 10 contain the image information of rain, the target video has two kinds of scene information: table tennis court and rainy day.
It should be understood that any neural network that can identify static information in an image sequence can serve as the third neural network; the embodiments of the present invention place no restriction on it.
Step 104: input the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information.
Here, the fourth neural network fuses the action information and scene information to obtain title information.
Title information is a description of the target that helps the user choose whether to open the target video. Title information must satisfy natural semantics and grammar, so the fourth neural network must not simply combine the action information and scene information, but must generate grammatically and semantically well-formed title information from them according to grammatical and semantic rules. For example, if the action information is swinging a table tennis paddle and the scene information is a table tennis court, the title information can be "playing table tennis at a table tennis court".
It should be understood that any neural network that can fuse action information and scene information can serve as the fourth neural network; the embodiments of the present invention place no restriction on it.
In practice, the first, second, third, and fourth neural networks are trained in advance and preset in the video title generating device, so that they can be called directly later.
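The following is a minimal sketch of this four-network inference pipeline in PyTorch, assuming net1 through net4 are the four pre-trained networks loaded as modules with the output shapes given in the comments; all names and dimensions are illustrative, not specified by the patent:

```python
import torch

@torch.no_grad()
def generate_title(frames, net1, net2, net3, net4, vocab):
    """frames: (T, C, H, W) tensor holding the target video's frame images."""
    feats = net1(frames)                  # step 101: (T, feat_dim) feature vectors
    feats = feats.unsqueeze(0)            # add a batch dimension: (1, T, feat_dim)
    actions = net2(feats)                 # step 102: (1, T, action_dim), chronological
    scenes = net3(frames).unsqueeze(0)    # step 103: (1, T, scene_dim)
    fused = torch.cat([actions, scenes], dim=-1)  # frame-aligned action + scene
    logits = net4(fused)                  # step 104: (1, T, vocab_size) title tokens
    ids = logits.argmax(dim=-1).squeeze(0).tolist()
    return " ".join(vocab[i] for i in ids)
```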
In conclusion the embodiment of the invention provides a kind of generation methods of video title, which comprises by target Every frame image of video is input in first nerves network trained in advance, and prediction obtains the corresponding feature vector of every frame image; The corresponding feature vector of every frame image is input in nervus opticus network trained in advance according to the time sequencing of image, Prediction obtains the corresponding action message of every frame image;Every frame image of the target video is input to third mind trained in advance Through in network, prediction obtains the corresponding scene information of every frame image;It is according to the time sequencing of image that every frame image is corresponding Action message, scene information be input in advance trained fourth nerve network, prediction obtains heading message.It can be according to dynamic Make information, scene information prediction heading message, helps to improve the comprehensive and accuracy of heading message.
Embodiment two
Referring to Fig. 2, a flow chart of the specific steps of the video title generation method provided by Embodiment 2 of the present invention is shown.
Step 201: collect video samples, where each frame image sample in a video sample carries the following annotation information: action annotation information, scene annotation information, and title annotation information.
Specifically, video samples can be collected from network platforms and annotated manually.
The action annotation information supervises the training of the first and second neural networks by determining whether the action information output by the cascaded first and second networks is approximately consistent with the action annotation information.
The scene annotation information supervises the training of the third neural network by determining whether the scene information output by the third network is approximately consistent with the scene annotation information.
The title annotation information supervises the training of the fourth neural network by determining whether the title information generated by the fourth network is approximately consistent with the title annotation information.
Step 202: train the first, second, third, and fourth neural networks with the video samples and the action, scene, and title annotation information.
Specifically, the video samples can be input into the first, second, third, and fourth neural networks; the predictions output by each network are compared with the corresponding annotation information, and the network parameters are adjusted until the loss between each network's prediction and the annotation information reaches a preset range. At that point, the trained first, second, third, and fourth neural networks are obtained.
Optionally, in another embodiment of the present invention, step 202 comprises sub-steps A1 to A7:
Sub-step A1: input each frame image sample of the video sample into the first neural network, which predicts a feature vector for each frame image sample.
This step extracts feature vectors at each iteration during sample training; refer to the detailed description of step 101, which is not repeated here.
Sub-step A2: input the feature vectors of the frame image samples, in chronological order, into the second neural network, which predicts action information for each frame image sample.
This step predicts action information at each iteration during sample training; refer to the detailed description of step 102, which is not repeated here.
Sub-step A3: train the first and second neural networks according to the action annotation information and the predicted action information of each frame image sample.
Specifically, compute the loss between the action annotation information and the action information. If the loss is smaller than a preset loss threshold, the current first and second neural networks are the trained first and second neural networks; if the loss is greater than or equal to the loss threshold, adjust the parameters of the first and second neural networks and continue training.
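A sketch of this threshold-based supervision for the cascaded first and second networks (PyTorch; the optimizer, loss function, and threshold value are assumptions, since the patent leaves the loss formula to the prior art):

```python
import torch
import torch.nn as nn

def train_action_branch(net1, net2, loader, loss_threshold=0.05, max_epochs=100):
    """Train the cascaded first and second networks against action annotations."""
    params = list(net1.parameters()) + list(net2.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    criterion = nn.CrossEntropyLoss()  # any prior-art loss formula works here
    for epoch in range(max_epochs):
        total, n = 0.0, 0
        for frames, action_labels in loader:   # frames: (T, C, H, W)
            feats = net1(frames).unsqueeze(0)  # sub-step A1: feature vectors
            logits = net2(feats).squeeze(0)    # sub-step A2: (T, num_actions)
            loss = criterion(logits, action_labels)
            optimizer.zero_grad()
            loss.backward()                    # sub-step A3: adjust parameters
            optimizer.step()
            total, n = total + loss.item(), n + 1
        if total / n < loss_threshold:         # loss below the preset threshold:
            break                              # current networks are the trained ones
    return net1, net2
```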
Sub-step A4: input each frame image sample of the video sample into the third neural network, which predicts scene information for each frame image sample.
This step predicts scene information at each iteration during sample training; refer to the detailed description of step 103, which is not repeated here.
Sub-step A5: train the third neural network according to the scene annotation information and the predicted scene information of each frame image sample.
Specifically, compute the loss between the scene annotation information and the scene information. If the loss is smaller than a preset loss threshold, the current third neural network is the trained third neural network; if the loss is greater than or equal to the loss threshold, adjust the parameters of the third neural network and continue training.
In addition, the loss can be computed with formulas from the prior art; the embodiments of the present invention place no restriction on how the loss is computed.
Sub-step A6: input the action information and scene information of the frame image samples, in chronological order, into the fourth neural network, which predicts title information.
Specifically, the fourth neural network can be trained after the first, second, and third neural networks have finished training, using the action information and scene information output by the second and third networks. The action information and scene information are more accurate at that point, which helps improve the convergence speed and accuracy of the fourth neural network.
Sub-step A7: train the fourth neural network according to the title annotation information of the video sample and the predicted title information.
Specifically, for a large number of video samples, compute the loss between the title annotation information and the title information. If the loss is smaller than a preset loss threshold, the current fourth neural network is the trained fourth neural network; if the loss is greater than or equal to the loss threshold, adjust the parameters of the fourth neural network and continue training.
Optionally, in another embodiment of the present invention, the first neural network is a three-dimensional convolutional neural network, the second and fourth neural networks are long short-term memory networks, and the third neural network is a convolutional neural network.
The input of a three-dimensional convolutional neural network is three-dimensional information. In practice, an image can be represented in formats such as RGB (Red Green Blue) or YUV (luminance, chrominance, saturation), each corresponding to a three-dimensional tensor: for an RGB image, each color channel corresponds to one dimension; for a YUV image, luminance, chrominance, and saturation each correspond to one dimension. In the embodiments of the present invention, since action features are hard to capture (the objects are small and change greatly), a three-dimensional convolutional neural network can use as many dimensions of the image information as possible and extract more action features, which helps improve the accuracy of the title information.
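A minimal sketch of such a three-dimensional convolutional feature extractor (all layer sizes are assumptions; the patent does not fix an architecture). A clip of frames is treated as one (channels, time, height, width) volume so that the kernels convolve over time as well as space:

```python
import torch.nn as nn

class Conv3DFeatureNet(nn.Module):
    """Sketch of a first neural network: 3-D convolutions over a frame clip."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),  # RGB channels in
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                     # pool space, keep time
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),          # collapse space, keep time
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        x = self.backbone(clip)              # (B, 64, T, 1, 1)
        x = x.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        return self.proj(x)                  # one feature vector per frame image
```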
A long short-term memory network can learn feature information from chronologically ordered images. After training, the long short-term memory network serving as the second neural network can learn the change information between different images in a video as action information, and the long short-term memory network serving as the fourth neural network can fuse the action information, scene information, and language information from different times, in chronological order, into title information.
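A sketch of a long short-term memory network in the role of the second neural network (layer sizes assumed): it consumes the per-frame feature vectors in chronological order, and its recurrent state carries the cross-frame change information from which per-frame action predictions are made.

```python
import torch.nn as nn

class ActionLSTM(nn.Module):
    """Second neural network sketch: LSTM over chronological feature vectors."""
    def __init__(self, feat_dim=256, hidden=128, num_actions=50):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, feats):        # feats: (B, T, feat_dim), time-ordered
        out, _ = self.lstm(feats)    # hidden state carries cross-frame changes
        return self.head(out)        # (B, T, num_actions): action info per frame
```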
A convolutional neural network can be one-dimensional, two-dimensional, three-dimensional, and so on. In the embodiments of the present disclosure, since scene information is easy to determine (scene objects are large and change little), a one- or two-dimensional convolutional neural network can be preferred. Because one- and two-dimensional convolutional neural networks have lower computational complexity than a three-dimensional convolutional neural network, they increase the speed of the convolution operation and reduce the time needed to extract scene information while still guaranteeing the prediction accuracy of the scene information.
Step 203: input each frame image of the target video into the pre-trained first neural network, which predicts a feature vector for each frame image.
Refer to the detailed description of step 101; details are not repeated here.
Step 204: input the feature vectors of the frame images, in chronological order, into the pre-trained second neural network, which predicts action information for each frame image.
Refer to the detailed description of step 102; details are not repeated here.
Step 205: input each frame image of the target video into the pre-trained third neural network, which predicts scene information for each frame image.
Refer to the detailed description of step 103; details are not repeated here.
Step 206: identify the language information in each frame image.
Here, language information embodies dialogue content, voice-over content, and the like, and may include textual and audio representations. For example, if the target video is a film clip, the dialogue between characters and the narrator's voice-over represent the theme of the film.
The textual representation can be subtitles.
Optionally, in another embodiment of the present invention, step 206 comprises sub-steps B1 to B3:
Sub-step B1: identify the audio information in each frame image by speech recognition.
In practice, each frame image corresponds to a piece of voice information, and speech recognition can identify audio information, such as dialogue content and voice-over content, from that voice information.
And/or sub-step B2: identify the text information in each frame image by character recognition.
Here, character recognition can identify text information from an image; for example, subtitle information can be identified by OCR (Optical Character Recognition).
Sub-step B3: fuse the audio information and/or text information into language information.
In practice, some videos contain no text information, for example videos shot by users themselves with no subtitles added; some videos contain no audio information, for example a video about the animal world that contains only animal sounds and no human speech.
When both the audio information and the text information exist, the two are fused into language information. Specifically, the audio information and text information are first merged into one body of language information, and the duplicate content appearing in both is then deleted, keeping only one copy. This avoids generating redundant language information, which helps generate the title information quickly and keeps it concise.
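A sketch of this merge-then-deduplicate fusion, under the assumption that the speech recognizer and the character recognizer each return a list of sentences (the function name and the sample strings are hypothetical):

```python
def fuse_language_info(audio_text, subtitle_text):
    """Merge recognized speech and recognized subtitles, keeping one copy of duplicates."""
    merged, seen = [], set()
    # Sub-step B3: merge both sources, then drop repeated sentences so the
    # language information stays concise and free of redundancy.
    for sentence in (audio_text or []) + (subtitle_text or []):
        key = sentence.strip()
        if key and key not in seen:
            seen.add(key)
            merged.append(key)
    return merged

# Speech recognition and subtitles often carry the same dialogue line:
lang = fuse_language_info(["Nice serve!", "Match point."],
                          ["Nice serve!", "[applause]"])
# -> ["Nice serve!", "Match point.", "[applause]"]
```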
Optionally, in another embodiment of the present invention, before sub-step B3, step 206 further comprises sub-step B4:
Sub-step B4: obtain, by face detection based on facial key points, the person information corresponding to the audio information and/or text information in each frame image.
Here, the facial key points may include, but are not limited to, main face contour points and the facial features (eyes, eyebrows, nose, mouth, and ears).
The person information can mainly be represented by the number of persons, which can be the number of distinct faces in the image. In addition, the identified facial key points can be compared with the facial key points in a preset database to confirm identity information.
The preset database can store in advance the facial key point information of well-known figures such as celebrities, so that celebrity identities can be recognized from each frame image. For example, photos of historical figures, political leaders, business figures, entertainment stars, and other personages can be collected from various angles, their facial key point information extracted, and the results stored in the preset database.
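A sketch of the key-point lookup against such a preset database (nearest-neighbour matching; the distance measure, the threshold, and the normalization are assumptions, not specified by the patent):

```python
import numpy as np

def identify_person(keypoints, preset_db, max_dist=0.35):
    """Match one face's key points against the preset celebrity database.

    keypoints: (K, 2) array of facial key points, normalized to the face box.
    preset_db: dict mapping identity name -> (K, 2) reference key points.
    Returns the matched identity, or None for an unknown face.
    """
    best_name, best_dist = None, float("inf")
    for name, ref in preset_db.items():
        dist = np.linalg.norm(keypoints - ref, axis=1).mean()
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_dist else None
```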
Sub-step B3 then comprises sub-step B31:
Sub-step B31: fuse the audio information and/or text information, together with the corresponding person information, into language information.
Specifically, the number of persons and the celebrity identities in the person information can be added to the language information. Furthermore, the correspondence between the person information and the audio information and/or text information can be specified. For example, if the person information is a celebrity's identity information, the celebrity's corresponding speech content (including audio information and/or text information) can be determined.
In the embodiments of the present invention, person information can also be added to the language information, so that the language information contains more content, which helps improve the diversity and accuracy of the title information.
Step 207: input the action information, scene information, and language information of the frame images, in chronological order, into the pre-trained fourth neural network, which predicts title information.
Specifically, the action information, scene information, and language information are aligned by frame, i.e., the action information, scene information, and language information of the same frame image are input into the fourth neural network together.
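A sketch of this frame-wise alignment, assuming the action, scene, and language information of each frame have already been encoded as fixed-size vectors:

```python
import torch

def align_by_frame(actions, scenes, languages):
    """Concatenate same-frame action, scene, and language vectors.

    actions:   (T, action_dim)
    scenes:    (T, scene_dim)
    languages: (T, lang_dim)
    Returns (T, action_dim + scene_dim + lang_dim), one row per frame image,
    ready to be fed to the fourth neural network in chronological order.
    """
    assert actions.shape[0] == scenes.shape[0] == languages.shape[0]
    return torch.cat([actions, scenes, languages], dim=-1)
```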
The embodiments of the present invention can extract title information not only from action information and scene information but also from language information, so that the title information is more comprehensive and more accurate.
In summary, an embodiment of the present invention provides a video title generation method. The method comprises: inputting each frame image of a target video into a pre-trained first neural network, which predicts a feature vector for each frame image; inputting the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image; inputting each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image; and inputting the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information. Title information can thus be predicted from action information and scene information, which helps improve its comprehensiveness and accuracy.
Embodiment three
Referring to Fig. 3, a structure diagram of the video title generation apparatus provided by Embodiment 3 of the present invention is shown, as follows.
Feature vector prediction module 301, configured to input each frame image of a target video into a pre-trained first neural network, which predicts a feature vector for each frame image.
Action information prediction module 302, configured to input the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image.
Scene information prediction module 303, configured to input each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image.
Title information prediction module 304, configured to input the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information.
In summary, an embodiment of the present invention provides a video title generation apparatus comprising: a feature vector prediction module, configured to input each frame image of a target video into a pre-trained first neural network, which predicts a feature vector for each frame image; an action information prediction module, configured to input the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image; a scene information prediction module, configured to input each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image; and a title information prediction module, configured to input the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information. Title information can thus be predicted from action information and scene information, which helps improve its comprehensiveness and accuracy.
Embodiment 3 is the apparatus embodiment corresponding to method Embodiment 1; for details, refer to the detailed description of Embodiment 1, which is not repeated here.
Example IV
Referring to Fig. 4, a structure diagram of the video title generation apparatus provided by Embodiment 4 of the present invention is shown, as follows.
Video sample collection module 401, configured to collect video samples, where each frame image sample in a video sample carries the following annotation information: action annotation information, scene annotation information, and title annotation information.
Network training module 402, configured to train the first, second, third, and fourth neural networks with the video samples and the action, scene, and title annotation information.
Feature vector prediction module 403, configured to input each frame image of a target video into a pre-trained first neural network, which predicts a feature vector for each frame image.
Action information prediction module 404, configured to input the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image.
Scene information prediction module 405, configured to input each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image.
Title information prediction module 406, configured to input the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information. Optionally, in an embodiment of the present invention, the title information prediction module 406 comprises:
Language information identification submodule 4061, configured to identify the language information in each frame image.
Title information prediction submodule 4062, configured to input the action information, scene information, and language information of each frame image, in chronological order, into the pre-trained fourth neural network, which predicts title information.
Optionally, in another embodiment of the present invention, the network training module 402 comprises:
a first prediction submodule, configured to input each frame image sample of a video sample into the first neural network, which predicts a feature vector for each frame image sample;
a second prediction submodule, configured to input the feature vectors of the frame image samples, in chronological order, into the second neural network, which predicts action information for each frame image sample;
a first training submodule, configured to train the first and second neural networks according to the action annotation information and the predicted action information of each frame image sample;
a third prediction submodule, configured to input each frame image sample of the video sample into the third neural network, which predicts scene information for each frame image sample;
a second training submodule, configured to train the third neural network according to the scene annotation information and the predicted scene information of each frame image sample;
a fourth prediction submodule, configured to input the action information and scene information of the frame image samples, in chronological order, into the fourth neural network, which predicts title information;
a third training submodule, configured to train the fourth neural network according to the title annotation information of the video sample and the predicted title information.
Optionally, in another embodiment of the present invention, the first neural network is a three-dimensional convolutional neural network, the second and fourth neural networks are long short-term memory networks, and the third neural network is a convolutional neural network.
Optionally, in another embodiment of the present invention, the language information identification submodule 4061 comprises:
an audio information recognition unit, configured to identify the audio information in each frame image by speech recognition;
and/or a text information recognition unit, configured to identify the text information in each frame image by character recognition;
and a language information fusion unit, configured to fuse the audio information and/or text information into language information.
Optionally, in another embodiment of the present invention, the language information identification submodule 4061 further comprises:
a person information recognition unit, configured to obtain, by face detection based on facial key points, the person information corresponding to the audio information and/or text information in each frame image.
The language information fusion unit comprises:
a language information fusion subunit, configured to fuse the audio information and/or text information, together with the corresponding person information, into language information.
In summary, an embodiment of the present invention provides a video title generation apparatus comprising: a feature vector prediction module, configured to input each frame image of a target video into a pre-trained first neural network, which predicts a feature vector for each frame image; an action information prediction module, configured to input the feature vectors of the frame images, in chronological order, into a pre-trained second neural network, which predicts action information for each frame image; a scene information prediction module, configured to input each frame image of the target video into a pre-trained third neural network, which predicts scene information for each frame image; and a title information prediction module, configured to input the action information and scene information of the frame images, in chronological order, into a pre-trained fourth neural network, which predicts title information. Title information can thus be predicted from action information and scene information, which helps improve its comprehensiveness and accuracy.
Embodiment 4 is the apparatus embodiment corresponding to method Embodiment 2; for details, refer to the detailed description of Embodiment 2, which is not repeated here.
An embodiment of the present invention also provides an electronic device, comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the foregoing method when executing the program.
An embodiment of the present invention also provides a readable storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the foregoing method.
As for the apparatus embodiments, since they are basically similar to the method embodiments, the description is relatively brief; for relevant points, refer to the corresponding parts of the method embodiments.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used with the teachings herein, and the structure required to construct such a system is obvious from the above description. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages can be used to implement the content of the invention described herein, and the above description of a specific language is intended to disclose the best mode of carrying out the invention.
Numerous specific details are set forth in the specification provided here. However, it should be understood that embodiments of the present invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, to streamline the present disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all the features of a single foregoing disclosed embodiment. Thus, the claims following the specific embodiments are hereby expressly incorporated into those specific embodiments, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that the modules in a device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment can be combined into one module, unit, or component, and can furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
The various component embodiments of the present invention can be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) can be used in practice to realize some or all of the functions of some or all of the components in the video title generating device according to embodiments of the present invention. The present invention can also be implemented as a device or apparatus program for executing some or all of the methods described herein. Such a program implementing the present invention can be stored on a computer-readable medium, or can take the form of one or more signals; such signals can be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of multiple such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by the same item of hardware. The use of the words first, second, and third does not indicate any ordering; these words can be interpreted as names.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field who, within the technical scope disclosed by the present invention, can easily think of changes or replacements shall be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method for generating a video title, characterized in that the method comprises:
inputting each frame image of a target video into a pre-trained first neural network, and predicting the feature vector corresponding to each frame image;
inputting the feature vector corresponding to each frame image into a pre-trained second neural network in the temporal order of the images, and predicting the action information corresponding to each frame image;
inputting each frame image of the target video into a pre-trained third neural network, and predicting the scene information corresponding to each frame image;
inputting the action information and scene information corresponding to each frame image into a pre-trained fourth neural network in the temporal order of the images, and predicting title information.
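By way of illustration only (the patent supplies no code), the four-network pipeline of claim 1 can be sketched in PyTorch roughly as follows. All layer sizes, the action/scene/vocabulary dimensions, and the use of a plain 2D convolution for the first network are assumptions made to keep the sketch short; claim 8 specifies a three-dimensional convolutional network for the first network (see the separate sketch after claim 8).

    # Illustrative sketch only; shapes and hyperparameters are assumptions.
    import torch
    import torch.nn as nn

    class TitlePipeline(nn.Module):
        def __init__(self, feat_dim=512, n_actions=50, n_scenes=20,
                     vocab_size=10000, hidden=256):
            super().__init__()
            # First network: per-frame feature extractor.
            self.feature_net = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
            # Second network: LSTM over per-frame features -> action information.
            self.action_net = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.action_head = nn.Linear(hidden, n_actions)
            # Third network: CNN over raw frames -> scene information.
            self.scene_net = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_scenes))
            # Fourth network: LSTM over concatenated action + scene information
            # -> title token logits.
            self.title_net = nn.LSTM(n_actions + n_scenes, hidden, batch_first=True)
            self.title_head = nn.Linear(hidden, vocab_size)

        def forward(self, frames):                      # frames: (B, T, 3, H, W)
            b, t = frames.shape[:2]
            feats = self.feature_net(frames.flatten(0, 1)).view(b, t, -1)
            action, _ = self.action_net(feats)          # temporal order preserved
            action = self.action_head(action)           # per-frame action logits
            scene = self.scene_net(frames.flatten(0, 1)).view(b, t, -1)
            title, _ = self.title_net(torch.cat([action, scene], dim=-1))
            return self.title_head(title)               # (B, T, vocab_size)

    logits = TitlePipeline()(torch.randn(2, 8, 3, 64, 64))

The sketch preserves the claimed data flow: per-frame feature vectors feed the second network in the temporal order of the images, and the per-frame action and scene outputs feed the fourth network in the same order.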
2. The method according to claim 1, characterized in that the step of inputting the action information and scene information corresponding to each frame image into the pre-trained fourth neural network in the temporal order of the images and predicting title information comprises:
identifying the language information in each frame image;
inputting the action information, scene information, and language information corresponding to each frame image into the pre-trained fourth neural network in the temporal order of the images, and predicting title information.
3. The method according to claim 2, characterized in that the step of inputting the action information, scene information, and language information corresponding to each frame image into the pre-trained fourth neural network in the temporal order of the images and predicting title information comprises:
for each frame image, splicing the corresponding action information, scene information, and language information into target information;
inputting the target information into the pre-trained fourth neural network, and predicting title information.
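The "splicing" of claim 3 can be read as plain per-frame concatenation of the three encodings. A minimal sketch, assuming each kind of information has already been encoded as a fixed-length tensor (all dimensions are illustrative):

    import torch

    # Per-frame encodings for an 8-frame video; dimensions are assumptions.
    action = torch.randn(8, 50)     # action information, one row per frame
    scene = torch.randn(8, 20)      # scene information
    language = torch.randn(8, 30)   # language information (see claim 4)

    # Splice the three encodings into the per-frame target information,
    # then feed the resulting (T, 100) sequence to the fourth network.
    target = torch.cat([action, scene, language], dim=-1)
    assert target.shape == (8, 100)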
4. The method according to claim 2, characterized in that the step of identifying the language information in each frame image comprises:
identifying the audio information in each frame image by speech recognition technology; and/or
identifying the text information in each frame image by character recognition technology;
fusing the audio information and/or text information into language information.
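A rough sketch of claim 4's language-information step, assuming the open-source pytesseract package for the character recognition part; transcribe_audio is a hypothetical stand-in for whatever speech recognition engine is used, not an API named by the patent:

    from PIL import Image
    import pytesseract

    def transcribe_audio(audio_segment) -> str:
        # Hypothetical stand-in: plug in any speech recognition engine here.
        return ""

    def language_info(frame_path: str, audio_segment=None) -> str:
        # Text information via character recognition on the frame image.
        text = pytesseract.image_to_string(Image.open(frame_path)).strip()
        # Audio information via speech recognition, when audio is available.
        speech = transcribe_audio(audio_segment) if audio_segment is not None else ""
        # Fuse the two sources into a single piece of language information.
        return " ".join(s for s in (speech, text) if s)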
5. The method according to claim 4, characterized in that before the step of fusing the audio information and/or text information into language information, the method further comprises:
obtaining, by face detection technology based on face key points, the person information associated with the audio information and/or text information in each frame image;
and the step of fusing the audio information and/or text information into language information comprises:
fusing the audio information and/or text information and the corresponding person information into language information.
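Claim 5 attaches person information to the recognized audio and text. A rough sketch, assuming OpenCV's bundled Haar cascade for the face detection step; identify_person is a hypothetical placeholder for the key-point based identification, which the patent does not spell out:

    import cv2

    # Face detector shipped with OpenCV; a key-point based detector could be
    # substituted here without changing the overall flow.
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def identify_person(face_crop) -> str:
        # Hypothetical placeholder for key-point based person identification.
        return "unknown person"

    def fuse_with_person(frame, audio_text: str, ocr_text: str) -> str:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        people = [identify_person(frame[y:y + h, x:x + w]) for (x, y, w, h) in faces]
        # Fuse the person information with the audio/text information.
        return " ".join(people + [audio_text, ocr_text]).strip()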
6. The method according to any one of claims 1 to 5, characterized in that before the step of inputting each frame image of the target video into the pre-trained first neural network and predicting the feature vector corresponding to each frame image, the method further comprises:
collecting video samples, wherein each frame image sample in the video samples carries the following annotation information: action annotation information, scene annotation information, and title annotation information;
training the first neural network, the second neural network, the third neural network, and the fourth neural network using the video samples and the action annotation information, scene annotation information, and title annotation information.
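One way to organize the annotated samples of claim 6 is a per-frame record of action and scene labels plus a title for the video; the field names below are assumptions for illustration only:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class VideoSample:
        frame_paths: List[str]      # one image per frame, in temporal order
        action_labels: List[int]    # action annotation information, per frame
        scene_labels: List[int]     # scene annotation information, per frame
        title: str                  # title annotation information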
7. The method according to claim 6, characterized in that the step of training the first neural network, the second neural network, the third neural network, and the fourth neural network using the video samples and the action annotation information, scene annotation information, and title annotation information comprises:
inputting each frame image sample of the video samples into the first neural network, and predicting the feature vector corresponding to each frame image sample;
inputting the feature vector corresponding to each frame image sample into the second neural network in the temporal order of the image samples, and predicting the action information corresponding to each frame image sample;
training the first neural network and the second neural network according to the action annotation information and the predicted action information corresponding to each frame image sample;
inputting each frame image sample of the video samples into the third neural network, and predicting the scene information corresponding to each frame image sample;
training the third neural network according to the scene annotation information and the predicted scene information corresponding to each frame image sample;
inputting the action information and scene information corresponding to each frame image sample into the fourth neural network in the temporal order of the image samples, and predicting title information;
training the fourth neural network according to the title annotation information corresponding to the video samples and the predicted title information.
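Under the same illustrative assumptions as the pipeline sketch after claim 1, the three training signals of claim 7 can be sketched as separate cross-entropy losses, one per supervised output; the dummy labels, optimizer settings, and the decision to sum the losses into one backward pass are assumptions, not details fixed by the patent:

    import torch
    import torch.nn as nn

    model = TitlePipeline()                          # from the sketch after claim 1
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    ce = nn.CrossEntropyLoss()

    frames = torch.randn(2, 8, 3, 64, 64)            # dummy video samples
    action_labels = torch.randint(0, 50, (2, 8))     # action annotation, per frame
    scene_labels = torch.randint(0, 20, (2, 8))      # scene annotation, per frame
    title_tokens = torch.randint(0, 10000, (2, 8))   # title annotation tokens

    b, t = frames.shape[:2]
    feats = model.feature_net(frames.flatten(0, 1)).view(b, t, -1)
    action_logits = model.action_head(model.action_net(feats)[0])
    scene_logits = model.scene_net(frames.flatten(0, 1)).view(b, t, -1)
    title_logits = model.title_head(
        model.title_net(torch.cat([action_logits, scene_logits], dim=-1))[0])

    # The action loss trains the first and second networks, the scene loss the
    # third, and the title loss the fourth (summing them is an assumption).
    loss = (ce(action_logits.flatten(0, 1), action_labels.flatten())
            + ce(scene_logits.flatten(0, 1), scene_labels.flatten())
            + ce(title_logits.flatten(0, 1), title_tokens.flatten()))
    opt.zero_grad()
    loss.backward()
    opt.step()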
8. The method according to any one of claims 1 to 7, characterized in that the first neural network model is a three-dimensional convolutional neural network, the second neural network model and the fourth neural network model are long short-term memory (LSTM) networks, and the third neural network model is a convolutional neural network.
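Claim 8 pins down the architectures. A minimal sketch of the three-dimensional convolutional first network it names, replacing the 2D stand-in used in the pipeline sketch above; kernel sizes and channel counts are illustrative assumptions:

    import torch
    import torch.nn as nn

    # First network per claim 8: a 3D convolution over (channels, time, H, W),
    # pooled over space only so that one feature vector remains per frame.
    feature_net_3d = nn.Sequential(
        nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool3d((None, 1, 1)),   # keep the time axis, pool space
        nn.Flatten(start_dim=2))              # -> (B, 32, T)

    clip = torch.randn(2, 3, 8, 64, 64)       # (B, C, T, H, W)
    feats = feature_net_3d(clip).transpose(1, 2)   # (B, T, 32) per-frame features
    assert feats.shape == (2, 8, 32)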
9. A device for generating a video title, characterized in that the device comprises:
a feature vector prediction module, configured to input each frame image of a target video into a pre-trained first neural network and predict the feature vector corresponding to each frame image;
an action information prediction module, configured to input the feature vector corresponding to each frame image into a pre-trained second neural network in the temporal order of the images and predict the action information corresponding to each frame image;
a scene information prediction module, configured to input each frame image of the target video into a pre-trained third neural network and predict the scene information corresponding to each frame image;
a title information prediction module, configured to input the action information and scene information corresponding to each frame image into a pre-trained fourth neural network in the temporal order of the images and predict title information.
10. An electronic device, characterized in that it comprises:
a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 8 when executing the program.
11. A readable storage medium, characterized in that, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the method according to any one of claims 1 to 8.
CN201910356968.0A 2019-04-29 2019-04-29 Generation method, device, electronic equipment and the storage medium of video title Pending CN110213668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910356968.0A CN110213668A (en) 2019-04-29 2019-04-29 Generation method, device, electronic equipment and the storage medium of video title


Publications (1)

Publication Number Publication Date
CN110213668A true CN110213668A (en) 2019-09-06

Family

ID=67786730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910356968.0A Pending CN110213668A (en) 2019-04-29 2019-04-29 Generation method, device, electronic equipment and the storage medium of video title

Country Status (1)

Country Link
CN (1) CN110213668A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121731A1 (en) * 2016-11-03 2018-05-03 Nec Laboratories America, Inc. Surveillance system using adaptive spatiotemporal convolution feature representation with dynamic abstraction for video to language translation
CN106604125A (en) * 2016-12-29 2017-04-26 北京奇艺世纪科技有限公司 Video subtitle determining method and video subtitle determining device
CN107330392A (en) * 2017-06-26 2017-11-07 司马大大(北京)智能系统有限公司 Video scene annotation equipment and method
CN109117777A (en) * 2018-08-03 2019-01-01 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109089133A (en) * 2018-08-07 2018-12-25 北京市商汤科技开发有限公司 Method for processing video frequency and device, electronic equipment and storage medium
CN108965920A (en) * 2018-08-08 2018-12-07 北京未来媒体科技股份有限公司 A kind of video content demolition method and device
CN109599128A (en) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 Speech-emotion recognition method, device, electronic equipment and readable medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭招 (Guo Zhao): "Video Summarization and Title Generation Based on Spatio-temporal Information and Deep Learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021143624A1 (en) * 2020-01-17 2021-07-22 Oppo广东移动通信有限公司 Video tag determination method, device, terminal, and storage medium
CN112035705A (en) * 2020-08-31 2020-12-04 北京市商汤科技开发有限公司 Label generation method and device, electronic equipment and storage medium
CN112712073A (en) * 2021-03-29 2021-04-27 北京远鉴信息技术有限公司 Eye change feature-based living body identification method and device and electronic equipment
CN113792166A (en) * 2021-08-18 2021-12-14 北京达佳互联信息技术有限公司 Information acquisition method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2019-09-06