CN108829881A - video title generation method and device - Google Patents

Info

Publication number
CN108829881A
Authority
CN
China
Prior art keywords
information
video
scene
characteristic information
title
Prior art date
Legal status
Granted
Application number
CN201810677450.2A
Other languages
Chinese (zh)
Other versions
CN108829881B (en)
Inventor
李俊
王文
郑萌
Current Assignee
Shenzhen Tencent Network Information Technology Co Ltd
Original Assignee
Shenzhen Tencent Network Information Technology Co Ltd
Application filed by Shenzhen Tencent Network Information Technology Co Ltd
Priority to CN201810677450.2A
Publication of CN108829881A
Application granted
Publication of CN108829881B
Status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a video title generation method and device, belonging to the field of Internet technology. The method includes: obtaining sound feature information and image feature information of a video; obtaining, based on the sound feature information and the image feature information, target scene information of the video, the target scene information indicating the scene presented by the video; and generating a title of the video based on the target scene information and the image feature information. The present invention improves the efficiency of video title generation and is used to generate a title for a video according to the video itself.

Description

Video title generation method and device
Technical field
This application relates to the field of Internet technology, and in particular to a video title generation method and device.
Background
With the development of science and technology, more and more users obtain information by watching videos. Moreover, when choosing a video to watch, users usually make their selection according to the video title, so the video title has a major influence on a video's viewing rate. Here, a video title summarizes the main content of a video in text.
In the related art, video titles are usually generated as follows: an operator watches a video and, after watching it, determines the video's title according to its content.
However, when there are many videos whose titles need to be generated, this approach produces titles inefficiently.
Summary of the invention
Embodiments of the present invention provide a video title generation method and device, which can solve the problem in the related art that video titles are generated inefficiently. The technical solution is as follows:
In a first aspect, a video title generation method is provided. The method includes:
obtaining sound feature information and image feature information of a video;
obtaining, based on the sound feature information and the image feature information, target scene information of the video, the target scene information indicating the scene presented by the video; and
generating a title of the video based on the target scene information and the image feature information.
In a second aspect, a title generation method for a game video is provided. The method includes:
obtaining sound feature information and image feature information of a game video;
obtaining, based on the sound feature information and the image feature information, target game scene information of the game video, the target game scene information indicating the game scene presented by the game video; and
generating a title of the game video based on the target game scene information and the image feature information.
In a third aspect, a video title generation device is provided. The device includes:
a first obtaining module, configured to obtain sound feature information and image feature information of a video;
a second obtaining module, configured to obtain, based on the sound feature information and the image feature information, target scene information of the video, the target scene information indicating the scene presented by the video; and
a generation module, configured to generate a title of the video based on the target scene information and the image feature information.
In a fourth aspect, a terminal is provided. The terminal includes a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the video title generation method of any implementation of the first aspect, or the title generation method for a game video of the second aspect.
In a fifth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the video title generation method of any implementation of the first aspect, or the title generation method for a game video of the second aspect.
By obtaining the sound feature information and image feature information of a video, obtaining from them the scene information of the scene the video presents, and then generating the video title from that scene information together with the image feature information, a video title can be produced without an operator having to watch the video. Compared with the related art, this effectively improves the efficiency of video title generation and saves the manpower and material resources spent on determining video titles.
Moreover, obtaining scene information from both the sound feature information and the image feature information of the video, and then generating the video title from the scene information and the image feature information, increases the amount of information available for reference when generating the title, so the generated title can describe the main content of the video more accurately. This effectively improves the accuracy of the generated video title.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings in the following description show merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a model for generating video titles in the related art.
Fig. 2 is a flowchart of a video title generation method provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of a target image frame of a game video provided by an embodiment of the present invention.
Fig. 4 is a flowchart of a method for obtaining the sound feature information of a video provided by an embodiment of the present invention.
Fig. 5 is a schematic diagram of a target image frame of a shooting-game video provided by an embodiment of the present invention.
Fig. 6 is a flowchart of a method for obtaining the target scene information of a video provided by an embodiment of the present invention.
Fig. 7 is a flowchart of a method, provided by an embodiment of the present invention, for performing feature fusion on sound feature information and image feature information to obtain scene feature information.
Fig. 8 is a flowchart of a method, provided by an embodiment of the present invention, for generating a title based on a target title template and multiple target knowledge bases.
Fig. 9 is a schematic diagram of a video title generation model provided by an embodiment of the present invention.
Fig. 10 is a flowchart of a title generation method for a game video provided by an embodiment of the present invention.
Fig. 11 is a schematic structural diagram of a video title generation device provided by an embodiment of the present invention.
Fig. 12 is a schematic structural diagram of a title generation device for a game video provided by an embodiment of the present invention.
Fig. 13 is a schematic structural diagram of a terminal provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
With the development of science and technology, more and more users obtain information by watching videos, and to meet the needs of different users, service providers generally offer a large number of videos for users to watch. Before watching, a user usually selects the video to watch from the provider's large catalog according to the videos' titles, so the video title has a great influence on a video's viewing rate. For example, to maintain a healthy game ecosystem and build greater user stickiness, game service providers produce large numbers of game videos every day for users to watch; facing so many game videos, users usually choose what to watch according to the titles.
In the related art, video titles are usually generated as follows: an operator watches a video and, after watching it, determines the title according to the video's content. However, when there are many videos whose titles need to be generated, this approach produces titles inefficiently.
In the related art, video titles can also be generated by machine learning, for example by combining a recurrent neural network (RNN), a convolutional neural network (CNN), and an attention model, and using the combined model to generate video titles. Fig. 1 is a schematic diagram of the structure of this model, which includes multiple stages of networks (each dotted box in Fig. 1 denotes one stage). In each stage, one image frame of the video is input into the CNN, which extracts the frame's image features in the spatial dimension; those features are then input into the RNN, which extracts the frame's image features in the time dimension and passes the extracted features to the RNNs of the next stage and the last stage, thereby propagating image feature information. The RNN in the last stage extracts, in the time dimension, the image features of the frame that the same-stage CNN inputs to it, and generates the video title from these features together with the image features passed to it by the RNNs of the other stages. In this deep learning approach, generating a title from image features means sampling words from a preset word set according to the image features and splicing the sampled words into a title. Because the sampling process is usually uncontrollable, the title generated from the sampled words is often a semantically unclear, incoherent combination of words. Moreover, because this implementation generates the title from image features only, some information in the video is lost, so the generated title describes the video's main content poorly; that is, the accuracy of the generated title is low overall.
To this end, an embodiment of the present invention provides a video title generation method. By obtaining the sound feature information and image feature information of a video, obtaining from them the scene information of the scene the video presents, and then generating the video title from that scene information together with the image feature information, a video title can be produced without an operator having to watch the video, which effectively improves the efficiency of video title generation. Moreover, obtaining scene information from both the sound feature information and the image feature information, and then generating the title from the scene information and the image feature information, increases the amount of information available for reference when generating the title, so the generated title can describe the video's main content more accurately, effectively improving the accuracy of the generated video title.
Fig. 2 is a flowchart of a video title generation method provided by an embodiment of the present invention. As shown in Fig. 2, the method may include:
Step 201: obtain the sound feature information and image feature information of a video.
Here, the sound feature information may be information describing the attributes of a sound source, for example information describing the gunshots of different firearms, the sound of cars, and other sounds. The image feature information may be information describing the content shown in an image. For example, for an image frame of a game video, the image feature information may describe content such as the killing hero, the killed hero, the kill type, a kill under a defense tower, and the health (blood volume) of the killing hero.
Performing step 201 involves two parts: obtaining the image feature information of the video, and obtaining the sound feature information of the video. The two parts are implemented as follows:
First part: obtaining the image feature information of the video may include obtaining the image feature information of target image frames among the multiple image frames the video contains.
A target image frame may be each frame the video contains, or a frame selected at intervals of a preset duration from the video's frames. The preset duration can be set as needed; for example, taking the first frame as the starting point, every frame at an interval of 1 second (or 0.5 second) may be chosen as a target image frame.
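For illustration only, interval-based target-frame selection might look like the following sketch, which assumes OpenCV and an approximately constant frame rate (the one-second interval is just an example):

```python
import cv2

def sample_target_frames(video_path, interval_seconds=1.0):
    """Pick one target frame every interval_seconds, starting from the first frame."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS metadata is missing
    step = max(1, int(round(fps * interval_seconds)))
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of video
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```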
Further, the image feature information of a target image frame may be obtained by inputting the target image frame into a CNN, which recognizes the images at specific positions in the target frame, thereby obtaining the frame's image feature information.
The specific positions in the target image frame can be set as needed. For example, for a game video of a certain game, the image feature information to be obtained describes content such as the killing hero, the kill type, the killed hero, a kill under a defense tower, and the killing hero's health. Fig. 3 is a schematic diagram of one target image frame of such a game video. Because the killing hero's avatar is usually shown at the position indicated by dotted box A1, the kill type at the position indicated by dotted box A2, and the killed hero's avatar at the position indicated by dotted box A3, the specific positions for this target frame may be the positions shown by dotted boxes A1, A2, and A3 respectively. Recognizing the image at the position of dotted box A1 with the CNN yields that the killing hero in this frame is Angela; recognizing the image at the position of dotted box A2 yields that the kill type is a multi-kill (the "triple kill" shown in Fig. 3); recognizing the image at the position of dotted box A3 yields that the killed hero is Zhang Fei. Image feature information describing the killing hero, the killed hero, and the kill type in this target frame is thereby obtained.
In addition, a target image frame may also be recognized according to display conventions preset in the image, to obtain image feature information. For example, in the display convention of this game's videos, a circle of size S1 is usually used to display a defense tower and a circle of size S2 to display a killed hero's avatar. Therefore, when obtaining image feature information, the CNN can separately detect and recognize circles of size S1 and circles of size S2, to obtain information describing the defense tower and information describing the killed hero. After both are obtained, the distance between the defense tower and the killed hero can be obtained, giving information describing a kill under the defense tower. Likewise, since health is usually displayed above a hero's avatar (for example at the position of dotted box A4 in Fig. 3), once the hero's information is obtained, the display position of the health bar can be determined and the image shown there recognized, yielding information describing the killing hero's health.
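As a rough sketch of this region-based recognition (the region coordinates and the per-region recognizers are hypothetical, not from the patent):

```python
import numpy as np

# Hypothetical fixed HUD regions (x, y, width, height); real coordinates depend
# on the game's layout and the frame resolution.
REGIONS = {
    "killer_avatar": (40, 20, 64, 64),    # dotted box A1
    "kill_type":     (120, 20, 160, 40),  # dotted box A2
    "victim_avatar": (300, 20, 64, 64),   # dotted box A3
}

def extract_image_features(frame, recognizers):
    """Crop each fixed region of the frame and run its dedicated CNN recognizer.

    `recognizers` maps region names to models with a Keras-style predict();
    their existence and interface are assumptions for illustration.
    """
    features = {}
    for name, (x, y, w, h) in REGIONS.items():
        crop = frame[y:y + h, x:x + w]  # frame is an HxWx3 pixel array
        features[name] = recognizers[name].predict(np.expand_dims(crop, 0))
    return features
```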
Because the image content of temporally consecutive frames in a video changes continuously, and adjacent frames differ only slightly, screening the video's frames and obtaining image feature information only for the screened frames reduces the redundancy among frames, compared with an implementation that obtains image feature information for every frame the video contains. This reduces the amount of data to be processed when generating the video title and thereby speeds up title generation.
Second part: since the sound feature information can be characterized by the Mel-frequency cepstral coefficient features of the sound information, and referring to Fig. 4, obtaining the sound feature information of the video may include:
Step 2011: obtain the Mel-frequency cepstral coefficient features of the sound information.
Mel-frequency cepstral coefficients (MFCC) are cepstral parameters extracted in the Mel-scale frequency domain. The MFCC features of the sound information (usually represented as a speech signal) can be obtained by applying the following processing to it: pre-emphasis, frame blocking, computing the short-time energy, windowing (e.g., a Hamming window), a fast Fourier transform (FFT), and triangular band-pass filtering.
Optionally, the sound information may be all of the sound in the interval from the start of playback to the end of playback of the video whose title is to be generated, or sound obtained by screening that full sound information. For example, the sound information may be the sound within a preset period of the video, or the sound within a preset duration before the moment corresponding to a target image frame.
Step 2012: classify the sound information based on the MFCC features, to obtain the sound feature information.
After the MFCC features of the sound information are obtained, they can be input into an RNN, and a classifier (for example a softmax classifier or a support vector machine classifier) can be used to classify the sound information, yielding the sound feature information of that sound information.
As an example, refer to Fig. 5, which shows a target image frame of a shooting-game video. Recognizing, with a CNN, the image content at the position of solid box B1 in this target frame yields image feature information describing that the kill type in this shooting game is a multi-kill (the "seven kills" shown in Fig. 5). Correspondingly, by obtaining the 5 seconds of sound before the moment corresponding to this target frame, obtaining that sound's MFCC features, inputting them into an RNN, and outputting the sound class through a softmax classifier, sound feature information indicating that the characterized sound is the gunshot of an M rifle can be obtained.
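A minimal sketch of this audio pathway, assuming librosa for the MFCC step and a trained classifier standing in for the RNN-plus-softmax stage described above (its interface is an assumption):

```python
import librosa
import numpy as np

def sound_feature_info(audio_path, classifier, class_names):
    """MFCC features over a short clip, then a sound-class prediction."""
    signal, sr = librosa.load(audio_path, sr=None)  # e.g. the 5 s before the frame
    # librosa performs framing, windowing, FFT, and mel filtering internally.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T  # (frames, 13)
    probs = classifier.predict(mfcc[np.newaxis, ...])          # softmax output
    return class_names[int(np.argmax(probs))]                  # e.g. "M-rifle gunshot"
```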
Step 202: obtain the target scene information of the video based on the sound feature information and the image feature information.
The target scene information indicates the scene presented by the video. Referring to Fig. 6, step 202 may include:
Step 2021: perform feature fusion on the sound feature information and the image feature information, to obtain scene feature information.
The scene feature information is usually information used to distinguish different scenes. For example, it may be information used to distinguish scenes in a game such as a low-health kill scene, a kill under a tower, a multi-kill scene, or a dragon-slaying scene. Optionally, referring to Fig. 7, step 2021 may include:
Step 2021a: obtain the type of the video.
The type of a video is used to distinguish different video content. For example, types of game videos may include videos of gunfight games and videos of real-time battle games, and types of television videos may include domestic drama videos, idol drama videos, and costume drama videos.
In general, information indicating the video type is recorded in the video's associated information (for example in the video's encoded data); when performing step 2021a, this information can be read to obtain the video's type.
Step 2021b: based on the type of the video, determine the influence weights of the sound feature information and the image feature information, respectively, on the scene information.
Sound feature information and image feature information influence the scene information to different degrees for different types of videos. For example, in a gunfight-game video the sound feature information has a larger influence on the scene information, while in a real-time-battle-game video the image feature information has a larger influence. Therefore, before fusing the sound feature information and the image feature information into scene information, the influence weights of each on the scene information can be determined according to the video's type, so that the fusion performed according to these weights yields scene feature information that better matches the video content, and in turn target scene information that better matches the video content.
Step 2021c: perform feature fusion on the sound feature information and the image feature information according to the influence weights, to obtain the scene feature information.
After the influence weights of the sound feature information and the image feature information on the scene information are determined, feature fusion can be performed on the two according to those weights to obtain the scene feature information. Optionally, the fusion may use an algorithm based on Bayesian decision theory, on sparse representation theory, or on deep learning theory, among others; the embodiment of the present invention does not specifically limit this.
As an example, suppose vector W1 is the obtained sound feature information, vector W2 is the obtained image feature information, and the influence weights of the sound feature information and the image feature information on the scene information are a and b respectively. Performing feature fusion according to these weights yields the scene feature information Z = a × W1 + b × W2.
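For illustration, the weighted fusion Z = a × W1 + b × W2 with type-dependent weights might be sketched as follows (the weight table is an assumed example, not values from the patent):

```python
import numpy as np

# Assumed per-type influence weights (a, b) for (sound, image); real values
# would be set or learned per video type.
WEIGHTS_BY_TYPE = {
    "gunfight_game": (0.6, 0.4),         # sound matters more
    "realtime_battle_game": (0.3, 0.7),  # image matters more
}

def fuse_features(sound_vec, image_vec, video_type):
    """Weighted feature fusion: Z = a * W1 + b * W2."""
    a, b = WEIGHTS_BY_TYPE[video_type]
    return a * np.asarray(sound_vec) + b * np.asarray(image_vec)
```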
Step 2022: obtain the target scene information based on the scene feature information.
Optionally, a classifier model can be used to obtain the target scene information. Accordingly, step 2022 may include: inputting the scene feature information into a second classifier model, which determines the target scene information among multiple pieces of scene information according to the scene feature information. The multiple pieces of scene information may be scene information determined during the training of the second classifier model, and may include information indicating scenes such as a low-health kill scene, a kill under a tower, a multi-kill scene, or a dragon-slaying scene. Optionally, the second classifier model may be a classifier model such as a softmax classifier or a support vector machine classifier.
As an example, suppose the sound feature information obtained in step 201 describes a gunshot whose sound is that of an M rifle, and the obtained image feature information describes that the content shown in the image indicates a multi-kill. Feature fusion is performed on the sound feature information and the image feature information, and the resulting scene feature information is input into a softmax classifier; after the classifier runs, the obtained target scene information describes a multi-kill scene achieved with an M rifle.
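A minimal sketch of this second classifier stage, written here as a plain softmax layer over the fused vector; the trained weights and scene labels are assumptions:

```python
import numpy as np

def classify_scene(scene_feature, weight_matrix, bias, scene_names):
    """Softmax over the fused scene feature; returns the most likely scene label.

    weight_matrix has shape (num_scenes, feature_dim); weight_matrix and bias
    stand in for the trained parameters of the second classifier model.
    """
    logits = weight_matrix @ scene_feature + bias
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    probs = exp / exp.sum()
    return scene_names[int(np.argmax(probs))]
```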
Step 203: obtain a target title template for the video based on the target scene information and the image feature information.
Normally, the same video content can be described by multiple expressions of different formats, and the same scene can likewise be described in multiple formats; correspondingly, different video content can be described by video titles with identical or similar structures. Therefore, after the target scene information is obtained, one title template can be selected among multiple preset title templates according to the target scene information and the image feature information. That is, the format in which the video content will be described is selected.
Optionally, a classifier model can be used to obtain the video's target title template. Accordingly, step 203 may include: inputting the target scene information and the image feature information into a first classifier model, which determines the target title template among multiple title templates according to the target scene information and the image feature information. The multiple title templates may be title templates determined during the training of the first classifier model. For example, the multiple title templates may include: "(killing hero) (health) (kill type) (killed hero)"; "(killing hero) (movement feature) kills (quantity) players"; "(killing hero) flashes into a perfect ultimate combo, harvesting (quantity) kills in a row"; "(killing hero) instantly kills (quantity) players with the ultimate"; and so on. Note that in these examples the content in parentheses denotes slots to be filled with words according to the video content.
Moreover, since a video's main content is related not only to the image feature information and sound feature information but also to the timing of the multiple image frames the video contains, this classifier model may be a model that can capture the timing information of the image frames. For example, the classifier model may be a softmax classifier.
As an example, suppose the obtained target scene information describes a multi-kill scene achieved with an M rifle, and the obtained image feature information describes that the kill type is a multi-kill. After the target scene information and the image feature information are input into the softmax classifier, the target title template obtained may be: "(killing hero) (movement feature) kills (quantity) players".
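A sketch of the template-selection step; the template strings are illustrative renderings of the examples above, and the classifier's interface is an assumption:

```python
import numpy as np

TITLE_TEMPLATES = [
    "{killer} {health} {kill_type} {victim}",
    "{killer} {movement} kills {count} players",
    # ... further templates determined during classifier training
]

def select_title_template(scene_vec, image_vec, template_classifier):
    """Pick a title template from the first classifier model's softmax output."""
    joint = np.concatenate([scene_vec, image_vec])
    probs = template_classifier.predict(joint[np.newaxis, :])
    return TITLE_TEMPLATES[int(np.argmax(probs))]
```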
Step 204: obtain multiple target knowledge bases corresponding to the video based on the target scene information.
Each target knowledge base records keywords used to describe scene information, and the multiple target knowledge bases are divided according to different scene features. That is, the multiple target knowledge bases may record keywords describing the scene information from different angles. As an example, for a game video the knowledge bases may include a hero knowledge base, a kill-type knowledge base, a kill-scene knowledge base, a kill-health knowledge base, a movement-feature knowledge base, a hero-type knowledge base, and so on, which record keywords describing the scene information from different angles.
Moreover, since each piece of scene information can be characterized by multiple scene features, and each scene feature can correspond to one knowledge base, there is a correspondence among scene information, scene features, and knowledge bases. Before step 204 is performed, this correspondence can be established in advance, so that when performing step 204 it can be queried according to the target scene information to obtain the multiple target knowledge bases.
As an example, suppose the target scene information can be characterized by scene features a, b, c, d, and e, and querying the pre-established correspondence among scene information, scene features, and knowledge bases according to the target scene information shows that scene feature a corresponds to the knowledge base of the hero Angela, scene feature b to the hero-type knowledge base, scene feature c to the movement-feature knowledge base, scene feature d to the damage-output-feature knowledge base, and scene feature e to the multi-kill-scene knowledge base. The multiple target knowledge bases obtained are then: the knowledge base of the hero Angela, the hero-type knowledge base, the movement-feature knowledge base, the damage-output-feature knowledge base, and the multi-kill-scene knowledge base, and these target knowledge bases can describe the scene information from different angles.
It should be noted that the knowledge bases need to be established before step 204 is performed. They may be established by manual collection, or by means such as data mining. Also, the types of keywords recorded in a knowledge base may differ according to the type of the videos whose titles are to be generated; for example, a knowledge base may record keywords describing attributes in many respects, such as heroes, players, teams, and scene attributes, or may record data obtained by analyzing the viewing rates of videos with the same attributes. The embodiment of the present invention does not specifically limit this.
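An in-memory rendering of the scene-feature-to-knowledge-base correspondence might look like the following sketch; the feature names and base names are placeholders, not the patent's actual data:

```python
# Assumed pre-established correspondence between scene features and knowledge bases.
FEATURE_TO_KNOWLEDGE_BASE = {
    "feature_a": "hero_angela",
    "feature_b": "hero_type",
    "feature_c": "movement_feature",
    "feature_d": "damage_output_feature",
    "feature_e": "multi_kill_scene",
}

def target_knowledge_bases(scene_features, knowledge_bases):
    """Query the correspondence: one knowledge base per scene feature."""
    return [knowledge_bases[FEATURE_TO_KNOWLEDGE_BASE[f]] for f in scene_features]
```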
Step 205: generate the title based on the target title template and the multiple target knowledge bases.
Optionally, referring to Fig. 8, step 205 may include:
Step 2051: in each target knowledge base, obtain the keywords used to fill the target title template.
Since each knowledge base records multiple keywords describing a word of one attribute, and these keywords may be synonyms or near-synonyms, after the multiple target knowledge bases corresponding to the video are obtained, the multiple keywords in each knowledge base can be screened to obtain the keywords used to fill the target title template.
Optionally, the screening may be random screening. Alternatively, according to the scene feature corresponding to the knowledge base, the knowledge base's information and the scene feature's information may be input into a classifier, which selects among the knowledge base's multiple keywords to obtain the keywords used to fill the target title template.
As an example, suppose the multiple target knowledge bases are the knowledge base of the hero Angela, the hero-type knowledge base, the movement-feature knowledge base, the damage-output-feature knowledge base, and the multi-kill-scene knowledge base, where the keywords recorded in the Angela knowledge base include {Angela: {hero aliases: {Loli Angela, high-burst Angela, strong-control Angela, ...}}}, the keywords recorded in the movement-feature knowledge base include {flexible and agile, continuous repositioning, ...}, and the keywords recorded in the damage-output-feature knowledge base include {burst damage, ...}. After the keywords recorded in the knowledge bases are screened, the keyword from the Angela knowledge base used to fill the target title template is "Loli Angela", the keyword from the movement-feature knowledge base is "continuous repositioning", and the keyword from the damage-output-feature knowledge base is "burst damage".
Since the multiple keywords recorded in each knowledge base are synonyms or near-synonyms, their semantics are essentially the same, so the keyword-screening process is a controllable process. Filling the target title template with the screened keywords can therefore generate a semantically fluent video title, enhancing the readability of the generated title.
Step 2052: fill the target title template with the keywords, to obtain the title.
After the keywords used to fill the target title template are obtained, each keyword can be filled into the position of its corresponding attribute in the target title template according to the keyword's attribute, thereby obtaining the video's title.
As an example, suppose the target title template obtained in step 203 is "(killing hero) (movement feature) kills (quantity) players"; the keyword obtained in step 2051 from the Angela knowledge base is "Loli Angela", the keyword from the movement-feature knowledge base is "continuous repositioning", and the keyword from the damage-output-feature knowledge base is "burst damage"; and the kill type determined according to the image feature information is a multi-kill. When filling the target title template with the keywords, "Loli Angela" can be filled into the killing-hero position, "continuous repositioning" into the movement-feature position, and the multi-kill into the quantity position, yielding the title: "Loli Angela's continuous repositioning kills multiple players".
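A compact sketch of the screening-and-filling step under the example above; the synonym lists and the default first-synonym pick are assumptions for illustration:

```python
KNOWLEDGE_BASES = {  # assumed synonym lists, per the Angela example above
    "killer":   ["Loli Angela", "high-burst Angela", "strong-control Angela"],
    "movement": ["continuous repositioning", "flexible and agile"],
    "count":    ["multiple"],
}

def fill_template(template, pick=None):
    """Screen one keyword per slot, then fill the slots of the title template.

    `pick` may be random screening or a classifier; the default simply takes
    the first synonym, which is an assumption for illustration.
    """
    pick = pick or (lambda slot: KNOWLEDGE_BASES[slot][0])
    slots = {name: pick(name) for name in KNOWLEDGE_BASES}
    return template.format(**slots)

# fill_template("{killer}'s {movement} kills {count} players")
# -> "Loli Angela's continuous repositioning kills multiple players"
```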
Optionally, the video title generation method provided by the embodiment of the present invention can be implemented by the model shown in Fig. 9, which includes multiple stages of networks (each dotted box in Fig. 9 denotes one stage). In each stage, step 201 is performed on one target image frame: the target frame is input into a CNN, which extracts the frame's image features in the spatial dimension; those features are then input into an RNN, which extracts the frame's image features in the time dimension, yielding the target frame's image feature information. The target frame's image feature information is input into the RNNs of the next stage and the last stage, thereby propagating image feature information. Meanwhile, the sound feature information of the video's sound information is extracted. Then step 202 is performed on the image feature information and sound feature information: feature fusion is performed on the two to obtain scene feature information, which is input into a softmax classifier (not shown in Fig. 9) to obtain the target scene information. Steps 203 and 204 are then performed according to the target scene information. In step 203, the target scene information and the image feature information are input into a softmax classifier (not shown in Fig. 9) to obtain the video's target title template. In step 204, the correspondence among scene information, scene features, and knowledge bases is queried according to the target scene information, to obtain the multiple target knowledge bases corresponding to it. Finally, step 205 is performed with the target title template obtained in step 203 and the multiple target knowledge bases obtained in step 204: the target title template is filled according to the keywords recorded in the target knowledge bases, producing the video title.
In the model shown in Fig. 9, inputting the image feature information obtained by each stage into the RNNs of the next stage and the last stage (which uses an attention model) reduces the decay of image feature information, thereby improving the accuracy of the video title generated from it.
Moreover, obtaining both the image feature information and the sound feature information of the video increases the amount of information available for reference when generating the title, which can improve the machine's learning ability and enable a title generated by machine learning to describe the video's main content more accurately, effectively improving the accuracy of the generated title. Further, feature information of other modalities besides the image and sound feature information may also be obtained as needed, and the video title may be generated based on the image feature information, the sound feature information, and the other modal feature information, further improving the accuracy of the generated title.
In summary, in the video title generation method provided by the embodiment of the present invention, the sound feature information and image feature information of a video are obtained; the scene information of the scene the video presents is obtained from them; and the video title is then generated from that scene information and the image feature information. Compared with the related art, a video title can be produced without an operator having to watch the video, which effectively improves the efficiency of video title generation and saves the manpower and material resources spent on determining video titles.
Moreover, by obtaining scene information from the video's sound feature information and image feature information and then generating the title from the scene information and the image feature information, the method increases the amount of information available for reference when generating the title, enabling the generated title to describe the video's main content more accurately and thereby effectively improving the accuracy of the generated title.
It should be noted that the order of the steps of the video title generation method provided by the embodiments of the present invention may be adjusted appropriately, and steps may be added or removed as circumstances require. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention and is therefore not described again.
An embodiment of the present invention provides a title generation method for a game video. As shown in Fig. 10, the method may include:
Step 301: obtain the sound feature information and image feature information of a game video.
Step 302: obtain the target game scene information of the game video based on the sound feature information and the image feature information.
The target game scene information indicates the game scene presented by the game video.
Step 303: generate a title of the game video based on the target game scene information and the image feature information.
For the specific implementation of each of steps 301 to 303, reference may be made to the corresponding steps in the embodiment shown in Fig. 2; the embodiment of the present invention does not repeat them here.
In summary, in the title generation method for a game video provided by the embodiment of the present invention, the sound feature information and image feature information of a game video are obtained; the scene information of the game scene the game video presents is obtained from them; and the game video's title is then generated from that scene information and the image feature information. Compared with the related art, a game video's title can be produced without an operator having to watch the game video, which effectively improves the efficiency of generating game video titles and saves the manpower and material resources spent on determining them.
Moreover, by obtaining scene information from the game video's sound feature information and image feature information and then generating the title from the scene information and the image feature information, the method increases the amount of information available for reference when generating a game video's title, enabling the generated title to describe the game video's main content more accurately and thereby effectively improving the accuracy of the generated game video title.
Fig. 11 is a schematic structural diagram of a video title generation device provided by an embodiment of the present invention. As shown in Fig. 11, the device 800 may include:
a first obtaining module 801, configured to obtain the sound feature information and image feature information of a video;
a second obtaining module 802, configured to obtain the target scene information of the video based on the sound feature information and the image feature information, the target scene information indicating the scene presented by the video; and
a generation module 803, configured to generate a title of the video based on the target scene information and the image feature information.
Optionally, the generation module 803 may be configured to:
obtain a target title template of the video based on the target scene information and the image feature information;
obtain, based on the target scene information, multiple target knowledge bases corresponding to the video, each target knowledge base recording keywords used to describe scene information, the multiple target knowledge bases being divided according to different scene features; and
generate the title based on the target title template and the multiple target knowledge bases.
Optionally, the process by which the generation module 803 generates the title based on the target title template and the multiple target knowledge bases may include:
obtaining, in each target knowledge base, the keywords used to fill the target title template; and
filling the target title template with the keywords, to obtain the title.
Optionally, the process by which the generation module 803 obtains the target title template of the video based on the target scene information and the image feature information may include:
inputting the target scene information and the image feature information into a first classifier model, which determines the target title template among multiple title templates according to the target scene information and the image feature information.
Obtaining, based on the target scene information, the multiple target knowledge bases corresponding to the video includes:
querying, based on the target scene information, the correspondence between scene information and knowledge bases, to obtain the multiple target knowledge bases.
Optionally, the process by which the second obtaining module 802 obtains the target scene information of the video based on the sound feature information and the image feature information may include:
performing feature fusion on the sound feature information and the image feature information, to obtain scene feature information; and
obtaining the target scene information based on the scene feature information.
Optionally, the process by which the second obtaining module 802 performs feature fusion on the sound feature information and the image feature information to obtain the scene feature information may include:
obtaining the type of the video;
determining, based on the type of the video, the influence weights of the sound feature information and the image feature information, respectively, on the scene information; and
performing feature fusion on the sound feature information and the image feature information according to the influence weights, to obtain the scene feature information.
Optionally, the process by which the second obtaining module 802 obtains the target scene information based on the scene feature information may include: inputting the scene feature information into a second classifier model, which determines the target scene information among multiple pieces of scene information according to the scene feature information.
Optionally, the process by which the first obtaining module 801 obtains the sound feature information of the video may include:
obtaining the sound information within a preset period in the video; and
obtaining the sound feature information of the sound information.
Optionally, the process by which the first obtaining module 801 obtains the sound feature information of the sound information may include:
obtaining the Mel-frequency cepstral coefficient features of the sound information; and
classifying the sound information based on the Mel-frequency cepstral coefficient features, to obtain the sound feature information.
Optionally, the process by which the first obtaining module 801 obtains the image feature information of the video may include: obtaining, among the multiple image frames the video contains, the image feature information of target image frames.
Optionally, a target image frame is an image frame selected from the multiple image frames at intervals of a preset duration.
In conclusion video title generating means provided in an embodiment of the present invention, first obtains the sound that module obtains video Sound characteristic information and image feature information, second obtains module according to sound characteristic information and image feature information, obtains video The scene information of the scene of presentation, generation module generate video title according to the scene information and image feature information, compared to The relevant technologies produce video title by viewing video without operation personnel, effectively improve the generation of video title Efficiency has saved the man power and material for determining video title.
Also, module is obtained by second, scene letter is obtained according to the sound characteristic information and image feature information of video Breath, generation module generate video title further according to scene information and image feature information, increase when generating video title for Therefore the information content of reference, the main contents for enabling the video title generated more accurately to describe video effectively improve The accuracy of the video title generated.
Fig. 12 is a schematic structural diagram of a title generation device for a game video provided by an embodiment of the present invention. As shown in Fig. 12, the device may include:
a first obtaining module 901, configured to obtain the sound feature information and image feature information of a game video;
a second obtaining module 902, configured to obtain the target game scene information of the game video based on the sound feature information and the image feature information, the target game scene information indicating the game scene presented by the game video; and
a generation module 903, configured to generate a title of the game video based on the target game scene information and the image feature information.
In summary, in the title generation device for a game video provided by the embodiment of the present invention, the first obtaining module obtains the sound feature information and image feature information of a game video; the second obtaining module obtains, from them, the scene information of the game scene the game video presents; and the generation module generates the game video's title from that scene information and the image feature information. Compared with the related art, a game video's title can be produced without an operator having to watch the game video, which effectively improves the efficiency of generating game video titles and saves the manpower and material resources spent on determining them.
Moreover, by obtaining scene information from the game video's sound feature information and image feature information, and having the generation module generate the game video's title from the scene information and the image feature information, the amount of information available for reference when generating a game video's title increases, enabling the generated title to describe the game video's main content more accurately and thereby effectively improving the accuracy of the generated game video title.
Regarding the devices in the foregoing embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related methods and is not elaborated here.
Figure 13 shows the structural schematic diagram of the terminal 1300 of an illustrative embodiment of the invention offer.The terminal 1300 It can be portable mobile termianl, such as:Smart phone, tablet computer, MP3 player (Moving Picture Experts Group Audio Layer III, dynamic image expert's compression standard audio level 3), MP4 (Moving Picture Experts Group Audio Layer IV, dynamic image expert's compression standard audio level 4) player, laptop Or desktop computer.Terminal 1300 be also possible to referred to as user equipment, portable terminal, laptop terminal, terminal console etc. other Title.
In general, terminal 1300 includes:Processor 1301 and memory 1302.
Processor 1301 may include one or more processing cores, such as 4 core processors, 8 core processors etc..Place Reason device 1301 can use DSP (Digital Signal Processing, Digital Signal Processing), FPGA (Field- Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array, may be programmed Logic array) at least one of example, in hardware realize.Processor 1301 also may include primary processor and coprocessor, master Processor is the processor for being handled data in the awake state, also referred to as CPU (Central Processing Unit, central processing unit).Coprocessor is the low power processor for being handled data in the standby state.? In some embodiments, processor 1301 can be integrated with GPU (Graphics Processing Unit, image processor), GPU is used to be responsible for the rendering and drafting of content to be shown needed for display screen.In some embodiments, processor 1301 can also be wrapped AI (Artificial Intelligence, artificial intelligence) processor is included, the AI processor is for handling related machine learning Calculating operation.
Memory 1302 may include one or more computer readable storage mediums, which can To be non-transient.Memory 1302 may also include high-speed random access memory and nonvolatile memory, such as one Or multiple disk storage equipments, flash memory device.In some embodiments, the non-transient computer in memory 1302 can Storage medium is read for storing at least one instruction, at least one instruction performed by processor 1301 for realizing this Shen Please in embodiment of the method provide video title generation method, alternatively, the title generation method of game video.
In some embodiments, terminal 1300 is also optional includes:Peripheral device interface 1303 and at least one periphery are set It is standby.It can be connected by bus or signal wire between processor 1301, memory 1302 and peripheral device interface 1303.It is each outer Peripheral equipment can be connected by bus, signal wire or circuit board with peripheral device interface 1303.Specifically, peripheral equipment includes: In radio circuit 1304, display screen 1305, CCD camera assembly 1306, voicefrequency circuit 1307, positioning component 1308 and power supply 1309 At least one.
The peripheral device interface 1303 may be configured to connect at least one I/O (Input/Output) related peripheral device to the processor 1301 and the memory 1302. In some embodiments, the processor 1301, the memory 1302, and the peripheral device interface 1303 are integrated on the same chip or circuit board. In some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1304 is configured to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 1304 communicates with a communication network and other communication devices through electromagnetic signals. The radio frequency circuit 1304 converts electric signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electric signals. Optionally, the radio frequency circuit 1304 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 1304 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to, the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1304 may also include NFC (Near Field Communication) related circuits, which is not limited in the present application.
The display screen 1305 is configured to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1305 is a touch display screen, the display screen 1305 also has the ability to acquire touch signals on or above its surface. The touch signal may be input to the processor 1301 as a control signal for processing. In this case, the display screen 1305 may also be configured to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1305, disposed on the front panel of the terminal 1300. In other embodiments, there may be at least two display screens 1305, respectively disposed on different surfaces of the terminal 1300 or in a folded design. In still other embodiments, the display screen 1305 may be a flexible display screen disposed on a curved or folded surface of the terminal 1300. The display screen 1305 may even be arranged in a non-rectangular irregular shape, that is, a specially shaped screen. The display screen 1305 may be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
The camera assembly 1306 is configured to capture images or videos. Optionally, the camera assembly 1306 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions, or other fused shooting functions are realized. In some embodiments, the camera assembly 1306 may also include a flash lamp. The flash lamp may be a single color temperature flash lamp or a dual color temperature flash lamp. A dual color temperature flash lamp refers to a combination of a warm light flash lamp and a cold light flash lamp, and may be used for light compensation under different color temperatures.
The audio circuit 1307 may include a microphone and a loudspeaker. The microphone is configured to collect sound waves of the user and the environment, convert the sound waves into electric signals, and input them to the processor 1301 for processing, or input them to the radio frequency circuit 1304 to implement voice communication. For stereo collection or noise reduction purposes, there may be multiple microphones, respectively disposed at different parts of the terminal 1300. The microphone may also be an array microphone or an omnidirectional collection microphone. The loudspeaker is configured to convert electric signals from the processor 1301 or the radio frequency circuit 1304 into sound waves. The loudspeaker may be a traditional thin-film loudspeaker or a piezoelectric ceramic loudspeaker. When the loudspeaker is a piezoelectric ceramic loudspeaker, it can not only convert electric signals into sound waves audible to humans, but also convert electric signals into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 1307 may also include a headphone jack.
The positioning component 1308 is configured to locate the current geographic position of the terminal 1300 to implement navigation or LBS (Location Based Service). The positioning component 1308 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the GLONASS system of Russia.
The power supply 1309 is configured to supply power to the various components in the terminal 1300. The power supply 1309 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 1309 includes a rechargeable battery, the rechargeable battery may be a wired charging battery or a wireless charging battery. A wired charging battery is a battery charged through a wired line, and a wireless charging battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charging technology.
In some embodiments, the terminal 1300 further includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to, an acceleration sensor 1311, a gyroscope sensor 1312, a pressure sensor 1313, a fingerprint sensor 1314, an optical sensor 1315, and a proximity sensor 1316.
The acceleration sensor 1311 can detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 1300. For example, the acceleration sensor 1311 may be configured to detect the components of gravitational acceleration on the three coordinate axes. The processor 1301 may control the display screen 1305 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1311. The acceleration sensor 1311 may also be used for collecting game or user motion data.
The gyroscope sensor 1312 can detect the body direction and rotation angle of the terminal 1300, and may cooperate with the acceleration sensor 1311 to collect 3D actions of the user on the terminal 1300. Based on the data collected by the gyroscope sensor 1312, the processor 1301 may implement the following functions: motion sensing (for example, changing the UI according to a tilt operation of the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1313 may be disposed on a side frame of the terminal 1300 and/or a lower layer of the display screen 1305. When the pressure sensor 1313 is disposed on the side frame of the terminal 1300, a gripping signal of the user on the terminal 1300 can be detected, and the processor 1301 performs left/right-hand recognition or shortcut operations according to the gripping signal collected by the pressure sensor 1313. When the pressure sensor 1313 is disposed on the lower layer of the display screen 1305, the processor 1301 controls the operability controls on the UI interface according to the user's pressure operation on the display screen 1305. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 1314 is configured to collect the user's fingerprint, and the processor 1301 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 1301 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 1314 may be disposed on the front, back, or side of the terminal 1300. When a physical button or a manufacturer logo is provided on the terminal 1300, the fingerprint sensor 1314 may be integrated with the physical button or the manufacturer logo.
The optical sensor 1315 is configured to collect ambient light intensity. In one embodiment, the processor 1301 may control the display brightness of the display screen 1305 according to the ambient light intensity collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1305 is increased; when the ambient light intensity is low, the display brightness of the display screen 1305 is decreased. In another embodiment, the processor 1301 may also dynamically adjust the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315.
The proximity sensor 1316, also referred to as a distance sensor, is generally disposed on the front panel of the terminal 1300. The proximity sensor 1316 is configured to collect the distance between the user and the front of the terminal 1300. In one embodiment, when the proximity sensor 1316 detects that the distance between the user and the front of the terminal 1300 gradually decreases, the processor 1301 controls the display screen 1305 to switch from a screen-on state to a screen-off state; when the proximity sensor 1316 detects that the distance between the user and the front of the terminal 1300 gradually increases, the processor 1301 controls the display screen 1305 to switch from the screen-off state to the screen-on state.
Those skilled in the art will understand that the structure shown in Figure 13 does not constitute a limitation on the terminal 1300, and the terminal may include more or fewer components than illustrated, combine certain components, or adopt a different component arrangement.
An embodiment of the present invention also provides a computer-readable storage medium, which is a non-volatile storage medium. At least one instruction, at least one program, a code set, or an instruction set is stored in the storage medium, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the video title generation method, or the title generation method for a game video, provided by the above embodiments of the present application.
An embodiment of the present invention also provides a computer program product storing instructions which, when run on a computer, enable the computer to execute the video title generation method, or the title generation method for a game video, provided by the embodiments of the present invention.
An embodiment of the present invention also provides a chip including a programmable logic circuit and/or program instructions. When running, the chip is able to execute the video title generation method, or the title generation method for a game video, provided by the embodiments of the present invention.
In the embodiments of the present invention, the term "and/or" describes three kinds of logical relationships: A and/or B means that A exists alone, B exists alone, or A and B exist simultaneously.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (13)

1. A video title generation method, characterized in that the method comprises:
obtaining sound characteristic information and image characteristic information of a video;
obtaining target scene information of the video based on the sound characteristic information and the image characteristic information, the target scene information being used to indicate a scene presented by the video; and
generating a title of the video based on the target scene information and the image characteristic information.
2. The method according to claim 1, characterized in that the generating a title of the video based on the target scene information and the image characteristic information comprises:
obtaining a target title template of the video based on the target scene information and the image characteristic information;
obtaining multiple target knowledge bases corresponding to the video based on the target scene information, each target knowledge base recording keywords for describing scene information, and the multiple target knowledge bases being divided based on different scene characteristics; and
generating the title based on the target title template and the multiple target knowledge bases.
3. The method according to claim 2, characterized in that the generating the title based on the target title template and the multiple target knowledge bases comprises:
obtaining, from each target knowledge base, the keywords with which the target title template is to be filled; and
filling the target title template with the keywords to obtain the title.
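By way of illustration, the filling step recited in claims 2 and 3 might look like the following sketch, where the slot names and toy knowledge bases are invented for this example and do not reflect the actual knowledge-base structure of the embodiments:

```python
# Illustrative sketch: toy knowledge bases keyed by template slot name.
title_template = "{hero} pulls off {action} in {location}!"

target_knowledge_bases = {
    "hero": ["Hero A", "Hero B"],             # keywords describing characters
    "action": ["a pentakill", "a comeback"],  # keywords describing events
    "location": ["the jungle", "mid lane"],   # keywords describing places
}

# For each slot of the template, pick a keyword from the corresponding
# knowledge base (simply the first entry here, for brevity).
keywords = {slot: entries[0] for slot, entries in target_knowledge_bases.items()}
title = title_template.format(**keywords)
print(title)  # -> Hero A pulls off a pentakill in the jungle!
```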
4. The method according to claim 2, characterized in that:
the obtaining a target title template of the video based on the target scene information and the image characteristic information comprises: inputting the target scene information and the image characteristic information into a first classifier model, and determining, by the first classifier model, the target title template among multiple title templates according to the target scene information and the image characteristic information; and
the obtaining multiple target knowledge bases corresponding to the video based on the target scene information comprises: querying a correspondence between scene information and knowledge bases based on the target scene information, to obtain the multiple target knowledge bases.
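As a sketch of claim 4, with two explicit assumptions: the "first classifier model" is replaced here by a trivial scoring rule, and the scene-to-knowledge-base correspondence is a plain dictionary; neither stands for the actual models or data structures of the embodiments:

```python
# Illustrative sketch: a trivial scorer stands in for the first classifier model.
def select_title_template(scene: str, image_features: list, templates: list) -> str:
    # A real implementation would feed (scene info, image features) into a
    # trained classifier; here we simply prefer templates mentioning the scene.
    return max(templates, key=lambda t: scene in t)

# The correspondence between scene information and knowledge bases, as a dict.
scene_to_knowledge_bases = {
    "team_fight": ["heroes_kb", "skills_kb"],
    "solo_kill": ["heroes_kb", "locations_kb"],
}

templates = ["Epic team_fight moments: {hero}!", "{hero} secures the win"]
print(select_title_template("team_fight", [0.1, 0.9], templates))
print(scene_to_knowledge_bases["team_fight"])   # query the mapping by scene
```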
5. The method according to any one of claims 1 to 4, characterized in that the obtaining target scene information of the video based on the sound characteristic information and the image characteristic information comprises:
performing feature fusion on the sound characteristic information and the image characteristic information to obtain scene characteristic information; and
obtaining the target scene information based on the scene characteristic information.
6. The method according to claim 5, characterized in that the performing feature fusion on the sound characteristic information and the image characteristic information to obtain scene characteristic information comprises:
obtaining a type of the video;
determining, based on the type of the video, the respective influence weights of the sound characteristic information and the image characteristic information on the scene characteristic information; and
performing feature fusion on the sound characteristic information and the image characteristic information according to the influence weights, to obtain the scene characteristic information.
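A minimal sketch of this weighted fusion, assuming NumPy and invented per-type weights (the embodiments disclose neither concrete weight values nor a specific fusion formula):

```python
import numpy as np

# Invented example weights: (sound weight, image weight) per video type.
FUSION_WEIGHTS = {
    "game": (0.6, 0.4),      # e.g., commentary audio is highly informative
    "scenery": (0.2, 0.8),   # e.g., visuals dominate
}

def fuse_features(video_type: str, sound: np.ndarray, image: np.ndarray) -> np.ndarray:
    w_sound, w_image = FUSION_WEIGHTS[video_type]
    # Weighted concatenation; a learned fusion layer would be another option.
    return np.concatenate([w_sound * sound, w_image * image])

print(fuse_features("game", np.array([0.2, 0.7]), np.array([0.9, 0.1])))
```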
7. The method according to claim 5, characterized in that the obtaining the target scene information based on the scene characteristic information comprises:
inputting the scene characteristic information into a second classifier model, and determining, by the second classifier model, the target scene information among multiple pieces of scene information according to the scene characteristic information.
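For claim 7, a linear classifier can stand in for the second classifier model. This sketch assumes scikit-learn and random toy data; the embodiments do not specify any model family:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((20, 4))                              # fused scene feature vectors
y = rng.choice(["team_fight", "laning"], size=20)    # toy scene labels

# Train the stand-in "second classifier model" and pick a scene for a new video.
clf = LogisticRegression(max_iter=200).fit(X, y)
print(clf.predict(rng.random((1, 4)))[0])
```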
8. The method according to any one of claims 1 to 4, characterized in that obtaining the sound characteristic information of the video comprises:
obtaining acoustic information within a preset period of the video; and
obtaining the sound characteristic information of the acoustic information.
9. The method according to claim 8, characterized in that the obtaining the sound characteristic information of the acoustic information comprises:
obtaining mel cepstrum coefficient features of the acoustic information; and
classifying the acoustic information based on the mel cepstrum coefficient features, to obtain the sound characteristic information.
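A sketch of claims 8 and 9, assuming the librosa library for MFCC extraction; the threshold-based "classification" at the end is a placeholder for whatever trained sound classifier an implementation would actually use:

```python
import librosa

# Illustrative sketch, assuming librosa; all values below are invented.
def sound_characteristic_info(audio_path: str, start: float = 0.0, duration: float = 10.0) -> str:
    # Load only the acoustic information within the preset period.
    y, sr = librosa.load(audio_path, offset=start, duration=duration)
    # Mel cepstrum coefficient (MFCC) features of the acoustic information.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Placeholder classification over the time-averaged first MFCC coefficient.
    energy = mfcc.mean(axis=1)[0]
    return "loud_commentary" if energy > -200 else "quiet_background"

print(sound_characteristic_info("gameplay.wav"))
```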
10. The method according to any one of claims 1 to 4, characterized in that obtaining the image characteristic information of the video comprises:
obtaining image characteristic information of a target image frame among the multiple image frames included in the video.
11. The method according to claim 10, characterized in that the target image frame is an image frame selected from the multiple image frames at intervals of a preset duration.
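For claims 10 and 11, frame sampling at a preset interval could be sketched with OpenCV as follows; the interval value is invented, and a real system would then feed each sampled frame to an image feature extractor:

```python
import cv2

def sample_target_frames(video_path: str, interval_seconds: float = 2.0):
    # Illustrative sketch, assuming OpenCV; keeps one frame per preset duration.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0     # fall back if FPS is unreported
    step = max(int(fps * interval_seconds), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                   # a target image frame
            frames.append(frame)
        index += 1
    cap.release()
    return frames

print(len(sample_target_frames("match.mp4")))
```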
12. A title generation method for a game video, characterized in that the method comprises:
obtaining sound characteristic information and image characteristic information of a game video;
obtaining target game scene information of the game video based on the sound characteristic information and the image characteristic information, the target game scene information being used to indicate a game scene presented by the game video; and
generating a title of the game video based on the target game scene information and the image characteristic information.
13. A video title generation device, characterized in that the device comprises:
a first obtaining module, configured to obtain sound characteristic information and image characteristic information of a video;
a second obtaining module, configured to obtain target scene information of the video based on the sound characteristic information and the image characteristic information, the target scene information being used to indicate a scene presented by the video; and
a generation module, configured to generate a title of the video based on the target scene information and the image characteristic information.
CN201810677450.2A 2018-06-27 2018-06-27 Video title generation method and device Active CN108829881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810677450.2A CN108829881B (en) 2018-06-27 2018-06-27 Video title generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810677450.2A CN108829881B (en) 2018-06-27 2018-06-27 Video title generation method and device

Publications (2)

Publication Number Publication Date
CN108829881A true CN108829881A (en) 2018-11-16
CN108829881B CN108829881B (en) 2021-12-03

Family

ID=64138901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810677450.2A Active CN108829881B (en) 2018-06-27 2018-06-27 Video title generation method and device

Country Status (1)

Country Link
CN (1) CN108829881B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1647528A (en) * 2002-04-12 2005-07-27 三菱电机株式会社 Meta data edition device, meta data reproduction device, meta data distribution device, meta data search device, meta data reproduction condition setting device, and meta data distribution method
CN101127899A (en) * 2002-04-12 2008-02-20 三菱电机株式会社 Hint information description method
US20080175486A1 (en) * 2007-01-18 2008-07-24 Kabushiki Kaisha Toshiba Video-attribute-information output apparatus, video digest forming apparatus, computer program product, and video-attribute-information output method
CN101547326A (en) * 2008-03-27 2009-09-30 株式会社东芝 Device and method for notifying content scene appearance
CN101650958A (en) * 2009-07-23 2010-02-17 中国科学院声学研究所 Extraction method and index establishment method of movie video scene clip
CN107992500A (en) * 2016-10-27 2018-05-04 腾讯科技(北京)有限公司 A kind of information processing method and server
CN107679227A (en) * 2017-10-23 2018-02-09 柴建华 Video index label setting method, device and server

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ma Longlong et al.: "A Survey of Text Description Methods for Images" (图像的文本描述方法研究综述), Journal of Chinese Information Processing (《中文信息学报》) *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410535A (en) * 2018-11-22 2019-03-01 维沃移动通信有限公司 A kind for the treatment of method and apparatus of scene information
CN109948409A (en) * 2018-11-30 2019-06-28 北京百度网讯科技有限公司 For generating the method, apparatus, equipment and computer readable storage medium of article
US11762905B2 (en) 2018-12-28 2023-09-19 Bigo Technology Pte. Ltd. Video quality evaluation method and apparatus, device, and storage medium
WO2020134926A1 (en) * 2018-12-28 2020-07-02 广州市百果园信息技术有限公司 Video quality evaluation method, apparatus and device, and storage medium
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment
CN110263214A (en) * 2019-06-21 2019-09-20 北京百度网讯科技有限公司 Generation method, device, server and the storage medium of video title
CN110399526B (en) * 2019-07-26 2023-02-28 腾讯科技(深圳)有限公司 Video title generation method and device and computer readable storage medium
CN110399526A (en) * 2019-07-26 2019-11-01 腾讯科技(深圳)有限公司 Generation method, device and the computer readable storage medium of video title
CN111353070B (en) * 2020-02-18 2023-08-18 北京百度网讯科技有限公司 Video title processing method and device, electronic equipment and readable storage medium
CN111353070A (en) * 2020-02-18 2020-06-30 北京百度网讯科技有限公司 Video title processing method and device, electronic equipment and readable storage medium
CN111460801B (en) * 2020-03-30 2023-08-18 北京百度网讯科技有限公司 Title generation method and device and electronic equipment
CN111460801A (en) * 2020-03-30 2020-07-28 北京百度网讯科技有限公司 Title generation method and device and electronic equipment
CN112541095A (en) * 2020-11-30 2021-03-23 北京奇艺世纪科技有限公司 Video title generation method and device, electronic equipment and storage medium
CN112541095B (en) * 2020-11-30 2023-09-05 北京奇艺世纪科技有限公司 Video title generation method and device, electronic equipment and storage medium
CN113378000A (en) * 2021-07-06 2021-09-10 北京奇艺世纪科技有限公司 Video title generation method and device
CN113378000B (en) * 2021-07-06 2023-09-05 北京奇艺世纪科技有限公司 Video title generation method and device
EP4209929A1 (en) * 2022-01-10 2023-07-12 Beijing Baidu Netcom Science Technology Co., Ltd. Video title generation method and apparatus, electronic device and storage medium
CN114885189A (en) * 2022-04-14 2022-08-09 深圳创维-Rgb电子有限公司 Control method, device and equipment for opening fragrance and storage medium
CN115131698A (en) * 2022-05-25 2022-09-30 腾讯科技(深圳)有限公司 Video attribute determination method, device, equipment and storage medium
CN115131698B (en) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 Video attribute determining method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN108829881B (en) 2021-12-03

Similar Documents

Publication Publication Date Title
CN108829881A (en) video title generation method and device
CN110087123A (en) Video file production method, device, equipment and readable storage medium storing program for executing
US11574009B2 (en) Method, apparatus and computer device for searching audio, and storage medium
CN107967706A (en) Processing method, device and the computer-readable recording medium of multi-medium data
CN110222789A (en) Image-recognizing method and storage medium
CN108538311A (en) Audio frequency classification method, device and computer readable storage medium
CN110290421A (en) Frame per second method of adjustment, device, computer equipment and storage medium
CN108008930A (en) The method and apparatus for determining K song score values
CN109379643A (en) Image synthesizing method, device, terminal and storage medium
CN108063981A (en) The method and apparatus that the attribute of direct broadcasting room is set
CN109729297A (en) The method and apparatus of special efficacy are added in video
CN108848394A (en) Net cast method, apparatus, terminal and storage medium
CN109300485A (en) Methods of marking, device, electronic equipment and the computer storage medium of audio signal
CN109815150A (en) Application testing method, device, electronic equipment and storage medium
CN110956971B (en) Audio processing method, device, terminal and storage medium
CN107959893A (en) The method and apparatus for showing account head portrait
CN110491358A (en) Carry out method, apparatus, equipment, system and the storage medium of audio recording
CN110135336A (en) Training method, device and the storage medium of pedestrian's generation model
CN109285178A (en) Image partition method, device and storage medium
CN108965922A (en) Video cover generation method, device and storage medium
CN110059686A (en) Character identifying method, device, equipment and readable storage medium storing program for executing
CN107958672A (en) The method and apparatus for obtaining pitch waveform data
CN109922356A (en) Video recommendation method, device and computer readable storage medium
CN108320756A (en) It is a kind of detection audio whether be absolute music audio method and apparatus
CN109003621A (en) A kind of audio-frequency processing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant