CN108829881B - Video title generation method and device - Google Patents


Info

Publication number
CN108829881B
Authority
CN
China
Prior art keywords
information
target
video
scene
title
Prior art date
Legal status
Active
Application number
CN201810677450.2A
Other languages
Chinese (zh)
Other versions
CN108829881A (en)
Inventor
李俊
王文
郑萌
Current Assignee
Shenzhen Tencent Network Information Technology Co Ltd
Original Assignee
Shenzhen Tencent Network Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Tencent Network Information Technology Co Ltd
Priority to CN201810677450.2A
Publication of CN108829881A
Application granted
Publication of CN108829881B
Legal status: Active

Abstract

The application discloses a video title generation method and device, belonging to the field of internet technologies. The method comprises the following steps: acquiring sound feature information and image feature information of a video; acquiring target scene information of the video based on the sound feature information and the image feature information, where the target scene information is used to indicate a scene presented by the video; and generating a title of the video based on the target scene information and the image feature information. The invention improves the efficiency of video title generation and is used for generating a title for a video according to the video.

Description

Video title generation method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for generating a video title.
Background
With the development of science and technology, more and more users acquire information by watching videos. When choosing a video to watch, a user usually selects it according to its title; the video title therefore has a significant influence on the video's viewing rate. A video title summarizes the main content of a video in text.
In the related art, a video title is generally generated as follows: an operator watches the video and, after watching it, determines its title according to its content.
However, when a large number of videos require titles, this approach generates titles inefficiently.
Disclosure of Invention
The embodiment of the invention provides a video title generation method and device, which can solve the problem of low video title generation efficiency in the related art. The technical solutions are as follows:
in a first aspect, a method for generating a video title is provided, where the method includes:
acquiring sound characteristic information and image characteristic information of a video;
acquiring target scene information of the video based on the sound characteristic information and the image characteristic information, wherein the target scene information is used for indicating a scene presented by the video;
and generating a title of the video based on the target scene information and the image characteristic information.
In a second aspect, a title generation method for a game video is provided, the method comprising:
acquiring sound characteristic information and image characteristic information of a game video;
acquiring target game scene information of the game video based on the sound characteristic information and the image characteristic information, wherein the target game scene information is used for indicating a game scene presented by the game video;
and generating a title of the game video based on the target game scene information and the image characteristic information.
In a third aspect, there is provided a video title generation apparatus, the apparatus comprising:
the first acquisition module is used for acquiring sound characteristic information and image characteristic information of a video;
a second obtaining module, configured to obtain target scene information of the video based on the sound feature information and the image feature information, where the target scene information is used to indicate a scene presented by the video;
and the generating module is used for generating a title of the video based on the target scene information and the image characteristic information.
In a fourth aspect, there is provided a terminal comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the video title generation method according to any one of the first aspect or the title generation method of game video according to the second aspect.
In a fifth aspect, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the video title generation method according to any one of the first aspect or the game video title generation method according to the second aspect.
Compared with the related art, the video title can be generated without an operator watching the video, so that the generation efficiency of the video title is effectively improved, and manpower and material resources for determining the video title are saved.
Moreover, the scene information is acquired according to the sound characteristic information and the image characteristic information of the video, and the video title is generated according to the scene information and the image characteristic information, so that the information amount which can be referred to when the video title is generated is increased, the generated video title can more accurately describe the main content of the video, and the accuracy of the generated video title is effectively improved.
Drawings
In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of a model for generating a video title in the related art.
Fig. 2 is a flowchart of a video title generating method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a target image frame of a game video according to an embodiment of the present invention.
Fig. 4 is a flowchart of a method for acquiring sound characteristic information of a video according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a target image frame of a game video of a shooting game according to an embodiment of the present invention.
Fig. 6 is a flowchart of a method for acquiring target scene information of a video according to an embodiment of the present invention.
Fig. 7 is a flowchart of a method for performing feature fusion on sound feature information and image feature information to obtain scene feature information according to an embodiment of the present invention.
Fig. 8 is a flowchart of a method for generating a title based on a target title template and a plurality of target knowledge bases according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of a video title generation model according to an embodiment of the present invention.
Fig. 10 is a flowchart of a method for generating a title of a game video according to an embodiment of the present invention.
Fig. 11 is a schematic structural diagram of a video title generating apparatus according to an embodiment of the present invention.
Fig. 12 is a schematic structural diagram of a game video title generation apparatus according to an embodiment of the present invention.
Fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
With the development of science and technology, more and more users acquire information by watching videos. Moreover, in order to meet the needs of different users, a service provider generally provides a large number of videos for users to watch. Before watching, a user usually selects the desired video from this large collection according to the videos' titles; the video title therefore has a significant influence on a video's viewing rate. For example, in order to better maintain the game ecosystem and create greater user stickiness, game service providers generate a large amount of game video for users to watch every day, and faced with this large amount of game video, users often select videos to watch according to the titles of the game videos.
In the related art, a video title is generally generated as follows: an operator watches the video and, after watching it, determines its title according to its content. However, when a large number of videos require titles, this approach generates titles inefficiently.
In the related art, a video title may also be generated by a machine learning method. For example: a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN) and an Attention model are combined, and the combined model is used to generate a video title. Fig. 1 is a schematic structural diagram of this model, which includes multiple levels of networks (each dashed box in fig. 1 represents one level of the network). For each level of the network, an image frame of the video is input to the CNN, the CNN extracts image features of the image frame in the spatial dimension, these features are then input to the RNN, the RNN extracts image features of the image frame in the temporal dimension, and the extracted image features are input to the RNN of the next-level network and of the last-level network to realize the transfer of the image feature information. The RNN of the last-level network may extract, in the temporal dimension, the image features of the image frame input to the CNN of this level, and may generate the title of the video according to these image features and the image features input to the RNNs of the other levels of the network. In this deep-learning implementation, when the video title is generated from the image features, words are sampled from a preset word set according to the image features, and the sampled words are spliced to obtain the video title. However, since the sampling process is typically uncontrolled, the titles generated from the sampled words are often combinations of words that are not semantically smooth. Moreover, because this implementation generates the video title only from the image features, some information in the video is lost, so that the generated video title describes the main content of the video poorly, that is, the accuracy of the generated video title is low overall.
Therefore, the embodiment of the invention provides a video title generation method: the sound feature information and the image feature information of a video are acquired, the scene information of the scene presented by the video is acquired according to the sound feature information and the image feature information, and the video title is generated according to the scene information and the image feature information. Because the scene information is acquired from both the sound feature information and the image feature information of the video, and the video title is generated from the scene information and the image feature information, the amount of information that can be referred to when generating the video title is increased, so that the generated video title describes the main content of the video more accurately, effectively improving the accuracy of the generated video title.
Fig. 2 is a flowchart of a video title generating method according to an embodiment of the present invention, and as shown in fig. 2, the method may include:
step 201, acquiring sound characteristic information and image characteristic information of a video.
The sound feature information may be information describing sound source attributes, for example: information describing the gunfire of different firearms, the sound of a car, and other sounds. The image feature information may be information describing the content displayed in an image. For example: for an image frame of a game video, the image feature information may be information describing the killing hero, the killed hero, the kill type, a kill under the defense tower, the killed hero's blood amount, and similar content in the game.
When the step 201 is executed, the implementation process includes: the method comprises the steps of obtaining image characteristic information of a video and obtaining sound characteristic information of the video. The two parts are realized as follows:
in the first part, the implementation process of acquiring image feature information of a video may include: in a plurality of image frames included in a video, image feature information of a target image frame is acquired.
The target image frame may be each image frame included in the video, or the target image frame may be an image frame selected at a preset time interval from among a plurality of image frames included in the video. The preset duration can be set according to actual needs, for example: an image frame every 1 second (or 0.5 seconds) interval may be determined as a target image frame starting from a first image frame among a plurality of image frames included in the video.
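As an illustration of this frame-screening step, the following is a minimal sketch; OpenCV (cv2) and the 1-second default interval are assumptions made purely for illustration, not part of the claimed method.
```python
import cv2

def sample_target_frames(video_path, interval_seconds=1.0):
    """Select one target image frame every `interval_seconds` from the video."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0          # fall back to 25 fps if metadata is missing
    step = max(int(round(fps * interval_seconds)), 1)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:                             # keep the first frame, then one per interval
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```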
Furthermore, the implementation manner of acquiring the image feature information of the target image frame may include: inputting the target image frame to the CNN to identify an image at a specific position in the target image frame through the CNN so as to acquire image characteristic information of the target image frame.
The specific positions in the target image frame can be set according to actual requirements. For example, for a game video, the image feature information to be acquired is information describing the killing hero, the kill type, the killed hero, a kill under the defense tower, and the killed hero's blood amount in the game. Fig. 3 is a schematic diagram of a target image frame of a game video. Since the avatar of the killing hero is usually displayed at the position shown by dashed box A1 in the image frame, the kill type is usually displayed at the position shown by dashed box A2, and the avatar of the killed hero is usually displayed at the position shown by dashed box A3, the specific positions for the target image frame can be the positions shown by dashed boxes A1, A2 and A3, respectively. By identifying the image at the position shown by dashed box A1 through the CNN, the killing hero in the image frame can be determined; by identifying the image at the position shown by dashed box A2, the kill type in the image frame can be determined to be a multi-kill (namely the triple kill shown in fig. 3); and by identifying the image at the position shown by dashed box A3, the killed hero in the image frame can be determined. Image feature information describing the killing hero, the killed hero and the kill type in the game is thus obtained for the target image frame.
In addition, the target image frame can also be recognized according to preset display patterns in the image to acquire image feature information. For example, in the display mode of a game video, the defense tower is generally displayed as a circle of size S1, and the avatar of the killed hero is displayed as a circle of size S2; therefore, when acquiring image feature information, the CNN can detect and recognize the circle of size S1 and the circle of size S2 respectively, to acquire information describing the defense tower and information describing the killed hero. After the information describing the defense tower and the information describing the killed hero are obtained, the distance between the two can be computed to obtain information describing a kill under the defense tower. Meanwhile, since the blood amount is usually displayed above the hero's avatar (for example, at the position shown by dashed box A4 in fig. 3), after the information of the hero is acquired, the display position of the blood amount can be determined accordingly, and the image displayed at that position can be recognized to acquire information describing the killed hero's blood amount.
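The fixed-position recognition described above can be sketched as follows; the region coordinates, slot names and the `region_classifier` callable are hypothetical placeholders standing in for a trained CNN, and the frame is assumed to be an H x W x 3 array.
```python
# Hypothetical fixed regions of interest, given as (x, y, width, height) in pixels,
# corresponding to the dashed boxes A1 (killing hero), A2 (kill type), A3 (killed hero).
REGIONS = {
    "killing_hero": (20, 40, 64, 64),
    "kill_type":    (400, 80, 160, 48),
    "killed_hero":  (620, 40, 64, 64),
}

def extract_image_features(frame, region_classifier):
    """Crop each fixed region and classify it with a CNN to get image feature information."""
    features = {}
    for name, (x, y, w, h) in REGIONS.items():
        crop = frame[y:y + h, x:x + w]                       # crop the fixed region from the frame
        features[name] = region_classifier(name, crop)       # e.g. returns a hero name or kill type
    return features
```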
Because the image content of temporally consecutive image frames in a video changes continuously, and temporally adjacent image frames differ only slightly, screening image frames from the video and acquiring image feature information only for the screened frames reduces redundant information in the image frames and the amount of data to be processed when generating the video title, compared with acquiring image feature information for every image frame included in the video, thereby increasing the speed of title generation.
In the second part, since the sound characteristic information can be characterized by the mel-frequency cepstrum coefficient characteristic of the sound information, referring to fig. 4, the implementation process of acquiring the sound characteristic information of the video may include:
and step 2011, acquiring a mel frequency cepstrum coefficient characteristic of the sound information.
Mel-Frequency Cepstral Coefficients (MFCCs) are cepstral parameters extracted in the mel-scale frequency domain. The MFCC features of the sound information can be obtained by performing the following processing on the sound information (usually expressed as a speech signal): pre-emphasis, framing (frame blocking), short-time energy calculation, windowing (Hamming window), fast Fourier transform (FFT), and triangular band-pass filtering.
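Purely as an illustration, the MFCC features could be computed with an audio library such as librosa; the library choice and parameter values are assumptions of this sketch, while the patent itself only specifies the processing chain listed above.
```python
import librosa
import numpy as np

def mfcc_features(audio_path, n_mfcc=13):
    """Load the sound information and return its MFCC feature matrix (n_mfcc x frames)."""
    signal, sample_rate = librosa.load(audio_path, sr=None)          # keep the original sample rate
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # simple pre-emphasis
    return librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
```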
Optionally, the sound information may be all the sound information of the video for which the title is to be generated, from its start playing time to its end playing time, or may be sound information obtained by screening all the sound information. For example: the sound information may be the sound information within a preset time period in the video, or the sound information within a preset time period before the time corresponding to the target image frame.
Step 2012, classifying the sound information based on the mel-frequency cepstrum coefficient characteristics to obtain sound characteristic information.
After obtaining the mel-frequency cepstrum coefficient features of the sound information, the mel-frequency cepstrum coefficient features can be input into the RNN, and the sound information is classified by using a classifier (such as a softmax classifier or a support vector machine classifier) to obtain the sound feature information of the sound information.
For example, referring to fig. 5, by identifying through the CNN the image content at the position shown by solid box B1 in a target image frame of a game video of a shooting game, image feature information describing that the kill type is a multi-kill (i.e., the "seven kills" shown in fig. 5) can be obtained. Accordingly, by acquiring the sound information in the 5 seconds before the time corresponding to the target image frame, obtaining its MFCC features, inputting these features into the RNN, and outputting a sound category through a softmax classifier, sound feature information indicating that the sound characterized by the sound information is the gunfire of an M gun can be obtained.
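A minimal sketch of such an RNN-plus-softmax sound classifier follows; PyTorch, the GRU layer and the layer sizes are illustrative assumptions, and the set of sound categories would come from training, as described above.
```python
import torch
import torch.nn as nn

class SoundClassifier(nn.Module):
    """GRU over MFCC frames followed by a softmax layer over sound categories."""
    def __init__(self, n_mfcc=13, hidden=64, n_classes=10):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, mfcc_seq):
        # mfcc_seq: (batch, time, n_mfcc); note librosa returns (n_mfcc, time), so transpose first
        _, last_hidden = self.rnn(mfcc_seq)        # last_hidden: (1, batch, hidden)
        logits = self.fc(last_hidden.squeeze(0))
        return torch.softmax(logits, dim=-1)       # probability of each sound category
```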
Step 202, acquiring target scene information of the video based on the sound characteristic information and the image characteristic information.
Wherein the target scene information is used to indicate a scene of the video presentation. Referring to fig. 6, the implementation of step 202 may include:
step 2021, feature fusion is performed on the sound feature information and the image feature information to obtain scene feature information.
The scene feature information is generally information for distinguishing different scenes. For example: the scene feature information may be information used for distinguishing scenes such as a low-health (residual blood) kill scene, a kill-under-the-tower scene, a multi-kill scene, or a dragon-slaying scene in a game. Optionally, referring to fig. 7, the implementation process of this step 2021 may include:
step 2021a, type of video is acquired.
The type of video is used to distinguish different video content. For example, the types of game video may include: videos of gun-battle games, videos of real-time battle games, and the like; and the types of television videos may include: family drama videos, idol drama videos, costume drama videos, and the like.
Typically, the information related to the video (e.g. in the video code) is recorded with information indicating the type of the video, and when the step 2021a is executed, the information can be read to obtain the type of the video.
Step 2021b, determining influence weights of the sound feature information and the image feature information on the scene information respectively based on the type of the video.
For different types of videos, the sound feature information and the image feature information influence the scene information to different degrees. For example: in a video of a gun-battle game, the sound feature information has a large influence on the scene information, while in a video of a real-time battle game, the image feature information has a large influence. Therefore, before feature fusion is performed, the influence weights of the sound feature information and the image feature information on the scene information can be determined according to the type of the video, so that feature fusion is performed according to these weights to obtain scene feature information that better matches the video content, and then target scene information that better matches the video content is obtained from the scene feature information.
Step 2021c, according to the influence weight, performing feature fusion on the sound feature information and the image feature information to obtain scene feature information.
After determining the influence weight of the sound characteristic information and the image characteristic information on the scene information, feature fusion can be performed on the sound characteristic information and the image characteristic information according to the influence weight to obtain the scene characteristic information. Optionally, when feature fusion is performed on the sound feature information and the image feature information, feature fusion may be performed by using an algorithm based on a bayesian decision theory, an algorithm based on a sparse representation theory, an algorithm based on a deep learning theory, or the like, which is not specifically limited in the embodiment of the present invention.
For example, assume that the vector W1 is the acquired sound feature information, the vector W2 is the acquired image feature information, and the influence weights of the sound feature information and the image feature information on the scene information are a and b, respectively; then the scene feature information obtained by performing feature fusion on the sound feature information and the image feature information according to the influence weights is Z = a × W1 + b × W2.
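The weighted fusion in this example reduces to a simple weighted combination of the two feature vectors; a sketch follows, in which the per-type weights are illustrative assumptions and the two vectors are assumed to share the same dimensionality.
```python
import numpy as np

# Hypothetical influence weights (sound, image) keyed by video type.
FUSION_WEIGHTS = {
    "gun_battle_game": (0.7, 0.3),        # sound matters more
    "real_time_battle_game": (0.3, 0.7),  # image matters more
}

def fuse_features(sound_vec, image_vec, video_type):
    """Feature fusion by influence weight: Z = a * W1 + b * W2."""
    a, b = FUSION_WEIGHTS.get(video_type, (0.5, 0.5))
    return a * np.asarray(sound_vec) + b * np.asarray(image_vec)
```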
Step 2022, obtaining target scene information based on the scene feature information.
Optionally, the target scene information may be obtained by using a classifier model. Accordingly, the implementation process of this step 2022 may include: inputting the scene feature information into a second classifier model, and determining, by the second classifier model, the target scene information among a plurality of pieces of scene information according to the scene feature information. The plurality of pieces of scene information may be the scene information determined during training of the second classifier model, and may include information indicating scenes such as a low-health kill scene, a kill-under-the-tower scene, a multi-kill scene, or a dragon-slaying scene. Optionally, the second classifier model may be a classifier model such as a softmax classifier or a support vector machine classifier.
As an example, assume that the sound feature information acquired in step 201 describes that the sound is the gunfire of an M gun, and the acquired image feature information describes that the displayed content indicates that the kill type is a multi-kill. Feature fusion is performed on the sound feature information and the image feature information, the scene feature information obtained through the fusion is input into a softmax classifier, and after the operation of the softmax classifier, the target scene information is obtained as information describing that the scene is a multi-kill scene achieved with an M gun.
And step 203, acquiring a target title template of the video based on the target scene information and the image characteristic information.
Generally, the same video content may be described by a plurality of expressions having different formats, and the same scene may be described by a plurality of expressions having different formats. Accordingly, different video content may be described by video titles having the same or similar format. Therefore, after the target scene information is acquired, one title template can be selected from the preset title templates according to the target scene information and the image feature information, that is, the format for describing the video content is selected.
Optionally, a target title template of the video may be obtained by using a classifier model, and accordingly, the implementation process of step 203 may include: inputting the target scene information and the image feature information into a first classifier model, and determining, by the first classifier model, a target title template among a plurality of title templates according to the target scene information and the image feature information. The plurality of title templates may be title templates determined in the training process of the first classifier model. For example, the plurality of title templates may include: (killing hero) (blood amount) (kill type) (killed hero). (killing hero) (movement feature) kills (number) people. (killing hero) reaps (number) kills with a perfect combo to take the win. (killing hero) takes down (number) people in seconds. And so on. Note that, in this example, the content in parentheses is a word that needs to be filled in according to the video content.
Also, since the main content of the video is related not only to the image feature information and the sound feature information but also to the timing of the plurality of image frames included in the video, the first classifier model may be a model capable of acquiring timing information of the image frames. For example, the first classifier model may be a softmax classifier.
For example, assuming that the obtained target scene information describes that the scene is a multi-kill scene achieved with an M gun, and the obtained image feature information describes that the kill type is a multi-kill, after the target scene information and the image feature information are input into a softmax classifier, the target title template is obtained as: (killing hero) (movement feature) kills (number) people.
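For illustration, the candidate templates can be represented as strings with named slots, and the target template selected by a classifier; the template wording, slot names and `template_classifier` callable below are paraphrases and placeholders, not text fixed by the patent.
```python
# Hypothetical title templates with named slots, paraphrasing the examples above.
TITLE_TEMPLATES = [
    "{killing_hero} {blood_amount} {kill_type} {killed_hero}",
    "{killing_hero} {movement_feature} kills {number} people",
    "{killing_hero} reaps {number} kills with a perfect combo",
    "{killing_hero} takes down {number} people in seconds",
]

def select_target_template(template_classifier, scene_info, image_features):
    """The first classifier model picks one target title template from the candidates."""
    index = template_classifier(scene_info, image_features)   # hypothetical call returning an index
    return TITLE_TEMPLATES[index]
```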
And 204, acquiring a plurality of target knowledge bases corresponding to the video based on the target scene information.
Each target knowledge base records keywords for describing scene information, and the plurality of target knowledge bases are obtained by division based on different scene features. That is, the target knowledge bases may record keywords describing the scene information from different angles. For example, for a game video, the plurality of knowledge bases may include: a hero knowledge base, a kill-type knowledge base, a kill-scene knowledge base, a kill blood-amount knowledge base, a movement-feature knowledge base, a hero-type knowledge base, and the like, and the plurality of target knowledge bases record keywords describing the scene information from different angles.
Moreover, each scene information can be characterized by a plurality of scene features, and each scene feature can correspond to one knowledge base, so that the scene information, the scene features and the knowledge base have corresponding relations. Before the step 204 is executed, a corresponding relationship between the scene information, the scene characteristics, and the knowledge base may be established in advance, so that when the step 204 is executed, the corresponding relationship may be queried according to the target scene information to obtain a plurality of target knowledge bases.
For example, assume that the target scene information can be represented by scene feature a, scene feature b, scene feature c, scene feature d and scene feature e, and that querying the pre-established correspondence between scene information and knowledge bases according to the target scene information shows that scene feature a corresponds to the knowledge base of the hero Angela, scene feature b corresponds to the hero-type knowledge base, scene feature c corresponds to the movement-feature knowledge base, scene feature d corresponds to the output-feature knowledge base, and scene feature e corresponds to the multi-kill-scene knowledge base. The plurality of target knowledge bases obtained are then: the knowledge base of the hero Angela, the hero-type knowledge base, the movement-feature knowledge base, the output-feature knowledge base, and the multi-kill-scene knowledge base, and these target knowledge bases can describe the scene information from different angles.
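A sketch of the scene-information to scene-feature to knowledge-base correspondence described above; all identifiers are illustrative stand-ins for whatever tables are established in advance.
```python
# Hypothetical pre-established correspondences between scene information, scene features
# and knowledge bases, mirroring the example above.
SCENE_TO_FEATURES = {
    "multi_kill_with_m_gun": ["feature_a", "feature_b", "feature_c", "feature_d", "feature_e"],
}
FEATURE_TO_KNOWLEDGE_BASE = {
    "feature_a": "hero_angela_kb",
    "feature_b": "hero_type_kb",
    "feature_c": "movement_feature_kb",
    "feature_d": "output_feature_kb",
    "feature_e": "multi_kill_scene_kb",
}

def target_knowledge_bases(scene_info):
    """Query the correspondences to obtain the target knowledge bases for this scene."""
    return [FEATURE_TO_KNOWLEDGE_BASE[feature] for feature in SCENE_TO_FEATURES[scene_info]]
```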
It should be noted that, before step 204 is executed, the knowledge bases need to be established in advance. A knowledge base can be established by manual collection, by data mining, or in other ways. Moreover, depending on the type of video for which a title is to be generated, the types of keywords recorded in the knowledge base may differ. For example: the knowledge base may also include keywords describing a plurality of attributes such as hero, player, team and scene attributes, or may further include data obtained by analyzing the viewing rates of videos having the same attribute, and the like.
Step 205, generating a title based on the target title template and the plurality of target knowledge bases.
Optionally, referring to fig. 8, the implementation process of step 205 may include:
step 2051, obtaining keywords for filling the target title template in each target knowledge base.
Since each knowledge base records a plurality of keywords describing one attribute, and these keywords may be synonyms or near-synonyms, after the plurality of target knowledge bases corresponding to the video are obtained, the keywords in each knowledge base can be screened to obtain the keywords for filling the target title template.
Optionally, the screening may be random screening; or, according to the scene feature corresponding to each knowledge base, the information of the knowledge base and of the scene feature may be input into a classifier, and the classifier selects keywords in the knowledge base to obtain the keywords for filling the target title template.
For example, assume that the plurality of target knowledge bases are: the knowledge base of the hero Angela, the hero-type knowledge base, the movement-feature knowledge base, the output-feature knowledge base, and the multi-kill-scene knowledge base, where the keywords recorded in the knowledge base of the hero Angela include: { Angela: { hero alternative names: { Loli Angela, high-burst Angela, strong-control Angela, etc. } } }, the keywords recorded in the movement-feature knowledge base include: { flexible and agile, continuous displacement, etc. }, and the keywords recorded in the output-feature knowledge base include: { explosive output, etc. }. By screening the keywords recorded in the knowledge bases, the following can be obtained: the keyword from the knowledge base of the hero Angela for filling the target title template is: Loli Angela; the keyword from the movement-feature knowledge base for filling the target title template is: continuous displacement; and the keyword from the output-feature knowledge base for filling the target title template is: explosive output.
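A sketch of the keyword screening above; random screening is shown for simplicity, and a classifier could replace `random.choice` as the step describes. The knowledge-base contents and slot names are illustrative assumptions.
```python
import random

# Hypothetical knowledge bases: each records synonyms/near-synonyms for one attribute
# and the template slot that attribute fills.
KNOWLEDGE_BASES = {
    "hero_angela_kb": {"slot": "killing_hero",
                       "keywords": ["Loli Angela", "high-burst Angela", "strong-control Angela"]},
    "movement_feature_kb": {"slot": "movement_feature",
                            "keywords": ["flexible and agile", "continuous displacement"]},
    "output_feature_kb": {"slot": "output_feature",
                          "keywords": ["explosive output"]},
}

def screen_keywords(kb_names):
    """Pick one keyword per target knowledge base to fill the target title template."""
    return {KNOWLEDGE_BASES[name]["slot"]: random.choice(KNOWLEDGE_BASES[name]["keywords"])
            for name in kb_names if name in KNOWLEDGE_BASES}
```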
Because a plurality of keywords recorded in each knowledge base are synonyms or near synonyms and the semantics represented by the keywords are basically the same, the process of screening the keywords is a controllable process, and the keywords obtained by screening are adopted to fill the target title template, so that the video title with smooth semantics can be generated, and the readability of the generated video title is enhanced.
Step 2052, fill the target title template with keywords to obtain the title.
After the keywords for filling the target title template are obtained, the keywords can be filled to the position of the target title template corresponding to the attributes according to the attributes of the keywords, so as to obtain the title of the video.
For example, assume that the target title template obtained in step 203 is: (killing hero) (movement feature) kills (number) people; the keyword obtained in step 2051 from the knowledge base of the hero Angela for filling the target title template is: Loli Angela; the keyword from the movement-feature knowledge base is: continuous displacement; the keyword from the output-feature knowledge base is: explosive output; and the kill type determined according to the image feature information is: multi-kill. When the target title template is filled with the keywords, "Loli Angela" can be filled into the killing-hero position of the target title template, "continuous displacement" into the movement-feature position, and the multi-kill into the number position, so that the obtained title is: Loli Angela continuous displacement kills multiple people.
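Filling the selected template then amounts to substituting the screened keywords into the matching slots; the following sketch continues the previous hypothetical snippets.
```python
def fill_template(template, keywords, extra_slots=None):
    """Insert each keyword into the template slot matching its attribute."""
    slots = dict(keywords)
    if extra_slots:                   # e.g. values derived from the image feature information,
        slots.update(extra_slots)     # such as the kill count
    return template.format(**slots)

# Worked example following the text above:
title = fill_template(
    "{killing_hero} {movement_feature} kills {number} people",
    {"killing_hero": "Loli Angela", "movement_feature": "continuous displacement"},
    extra_slots={"number": "multiple"},
)
# title == "Loli Angela continuous displacement kills multiple people"
```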
Optionally, the video title generation method provided in the embodiment of the present invention may be implemented by using the model shown in fig. 9, where the model includes multiple levels of networks (each dashed box in fig. 9 represents one level of the network). For each level of the network, step 201 is performed on a target image frame: the target image frame is input to the CNN, the image features of the image frame in the spatial dimension are extracted through the CNN, these features are then input to the RNN, and the image features of the image frame in the temporal dimension are extracted through the RNN, so as to obtain the image feature information of the target image frame. The image feature information of the target image frame is also input into the RNN of the next-level network and of the last-level network, to realize the transfer of the image feature information. Meanwhile, the sound feature information of the video's sound information is extracted. Then, step 202 is executed according to the image feature information and the sound feature information: feature fusion is performed on the sound feature information and the image feature information to obtain scene feature information, and the scene feature information is input to a softmax classifier (not shown in fig. 9) to obtain target scene information. Then, step 203 and step 204 are performed respectively according to the target scene information. In step 203, the target scene information and the image feature information are input to a softmax classifier (not shown in fig. 9) to acquire a target title template of the video. In step 204, according to the target scene information, the correspondence between scene information, scene features and knowledge bases is queried to obtain a plurality of target knowledge bases corresponding to the target scene information. Then, according to the target title template obtained in step 203 and the plurality of target knowledge bases obtained in step 204, step 205 is executed: the target title template is filled according to the keywords recorded in the target knowledge bases, and the video title can thus be generated.
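Putting the pieces together, the flow through the model of fig. 9 can be summarized with the following orchestration sketch; it reuses the hypothetical helpers from the earlier snippets, and `cnn_rnn_features` plus the classifier arguments are additional placeholders rather than components defined by the patent.
```python
def generate_video_title(video_path, audio_path, video_type,
                         cnn_rnn_features, sound_classifier,
                         scene_classifier, template_classifier):
    """End-to-end sketch of fig. 9; every model argument is a placeholder callable."""
    frames = sample_target_frames(video_path)                       # step 201, image part
    image_features = [cnn_rnn_features(frame) for frame in frames]  # CNN spatial + RNN temporal features
    sound_features = sound_classifier(mfcc_features(audio_path))    # step 201, sound part
    fused = fuse_features(sound_features, image_features[-1], video_type)   # step 2021
    scene_info = scene_classifier(fused)                            # step 2022
    template = select_target_template(template_classifier, scene_info, image_features)  # step 203
    keywords = screen_keywords(target_knowledge_bases(scene_info))  # steps 204 and 2051
    # In practice, slots such as the kill count would also be derived from the image features.
    return fill_template(template, keywords)                        # step 2052
```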
In the model shown in fig. 9, by inputting the image feature information acquired by each level of the network into the RNN (i.e., the Attention model) of the next-level network and of the last-level network, the attenuation of the image feature information can be reduced, thereby improving the accuracy of the video title generated according to the image feature information.
In addition, by acquiring the video image characteristic information and the sound characteristic information, the information amount which can be referred to when the video title is generated is increased, the learning capability of the machine can be improved, so that the main content of the video can be more accurately described by the video title generated by adopting a machine learning method, and the accuracy of the generated video title can be effectively improved. Furthermore, other modality feature information besides the image feature information and the sound feature information can be obtained according to actual needs, and the video title is generated based on the image feature information, the sound feature information and the other modality feature information, so that the accuracy of the generated video title is further improved.
In summary, according to the video title generation method provided by the embodiment of the present invention, by acquiring the sound characteristic information and the image characteristic information of the video, acquiring the scene information of the scene presented by the video according to the sound characteristic information and the image characteristic information, and then generating the video title according to the scene information and the image characteristic information, compared with the related art, the video title can be generated without an operator watching the video, so that the generation efficiency of the video title is effectively improved, and manpower and material resources for determining the video title are saved.
In addition, the video title generation method acquires the scene information according to the sound characteristic information and the image characteristic information of the video, and generates the video title according to the scene information and the image characteristic information, so that the information amount which can be referred to when the video title is generated is increased, the generated video title can more accurately describe the main content of the video, and the accuracy of the generated video title is effectively improved.
It should be noted that, the order of the steps of the video title generating method provided in the embodiment of the present invention may be appropriately adjusted, and the steps may also be increased or decreased according to the circumstances, and any method that can be easily conceived by those skilled in the art within the technical scope of the present invention shall be included in the protection scope of the present invention, and therefore, no further description is given.
An embodiment of the present invention provides a title generation method for a game video, as shown in fig. 10, the method may include:
and step 301, acquiring sound characteristic information and image characteristic information of the game video.
And step 302, acquiring target game scene information of the game video based on the sound characteristic information and the image characteristic information.
The target game scene information is used for indicating the game scene of the game video presentation.
And step 303, generating a title of the game video based on the target game scene information and the image characteristic information.
In the above steps 301 to 303, the specific implementation process of each step may refer to the corresponding step in the embodiment shown in fig. 2, and this is not described again in this embodiment of the present invention.
In summary, according to the title generation method for the game video provided by the embodiment of the present invention, by acquiring the sound characteristic information and the image characteristic information of the game video, acquiring the scene information of the game scene presented by the game video according to the sound characteristic information and the image characteristic information, and then generating the title of the game video according to the scene information and the image characteristic information, compared with the related art, the title of the game video can be generated without an operator watching the game video, so that the generation efficiency of the title of the game video is effectively improved, and manpower and material resources for determining the title of the game video are saved.
In addition, the title generation method of the game video acquires the scene information according to the sound characteristic information and the image characteristic information of the game video, and generates the title of the game video according to the scene information and the image characteristic information, so that the amount of information which can be referred to when the title of the game video is generated is increased, the generated title of the game video can more accurately describe the main content of the game video, and the accuracy of the generated title of the game video is effectively improved.
Fig. 11 is a schematic structural diagram of a video title generating apparatus according to an embodiment of the present invention, and as shown in fig. 11, the apparatus 800 may include:
a first obtaining module 801, configured to obtain sound characteristic information and image characteristic information of a video.
A second obtaining module 802, configured to obtain target scene information of the video based on the sound feature information and the image feature information, where the target scene information is used to indicate a scene presented by the video.
A generating module 803, configured to generate a title of the video based on the target scene information and the image feature information.
Optionally, the generating module 803 may be configured to:
and acquiring a target title template of the video based on the target scene information and the image characteristic information.
Acquiring a plurality of target knowledge bases corresponding to the video based on the target scene information, where keywords used for describing the scene information are recorded in each target knowledge base, and the plurality of target knowledge bases are obtained by division based on different scene features.
A title is generated based on the target title template and the plurality of target knowledge bases.
Optionally, the generating module 803 may generate the title based on the target title template and the plurality of target knowledge bases, and the process may include:
and acquiring keywords for filling the target title template in each target knowledge base.
And filling the target title template with the keywords to obtain the title.
Optionally, the process of the generating module 803, based on the target scene information and the image feature information, acquiring a target title template of the video may include:
and inputting the target scene information and the image characteristic information into a first classifier model, and determining a target title template in the plurality of title templates by the first classifier model according to the target scene information and the image characteristic information.
Based on the target scene information, acquiring a plurality of target knowledge bases corresponding to the video, including:
and inquiring the corresponding relation between the scene information and the knowledge bases based on the target scene information to obtain a plurality of target knowledge bases.
Optionally, the process of acquiring, by the second acquiring module 802, the target scene information of the video based on the sound characteristic information and the image characteristic information may include:
and carrying out feature fusion on the sound feature information and the image feature information to obtain scene feature information.
And acquiring target scene information based on the scene characteristic information.
Optionally, the process of obtaining the scene feature information by performing feature fusion on the sound feature information and the image feature information by the second obtaining module 802 may include:
the type of video is obtained.
And respectively determining the influence weight of the sound characteristic information and the image characteristic information on the scene information based on the type of the video.
And according to the influence weight, performing feature fusion on the sound feature information and the image feature information to obtain scene feature information.
Optionally, the process of acquiring the target scene information by the second acquiring module 802 based on the scene characteristic information may include: and inputting the scene characteristic information into a second classifier model, and determining target scene information in the plurality of pieces of scene information by the second classifier model according to the scene characteristic information.
Optionally, the process of acquiring the sound characteristic information of the video by the first acquiring module 801 may include:
and acquiring sound information in a preset time period in the video.
And acquiring sound characteristic information of the sound information.
Alternatively, the process of acquiring the sound characteristic information of the sound information by the first acquiring module 801 may include:
and acquiring the Mel cepstrum coefficient characteristics of the sound information.
And classifying the sound information based on the Mel cepstrum coefficient characteristics to obtain sound characteristic information.
Optionally, the process of acquiring the image characteristic information of the video by the first acquiring module 801 may include: in a plurality of image frames included in a video, image feature information of a target image frame is acquired.
Optionally, the target image frame is an image frame selected at intervals of a preset duration in the plurality of image frames.
In summary, in the video title generation apparatus provided in the embodiment of the present invention, the first obtaining module obtains the sound characteristic information and the image characteristic information of the video, the second obtaining module obtains the scene information of the scene presented by the video according to the sound characteristic information and the image characteristic information, and the generating module generates the video title according to the scene information and the image characteristic information.
And the second acquisition module acquires the scene information according to the sound characteristic information and the image characteristic information of the video, and the generation module generates the video title according to the scene information and the image characteristic information, so that the information amount which can be referred to when the video title is generated is increased, the generated video title can more accurately describe the main content of the video, and the accuracy of the generated video title is effectively improved.
Fig. 12 is a schematic structural diagram of a title generation apparatus for game video according to an embodiment of the present invention, and as shown in fig. 12, the apparatus 900 may include:
the first obtaining module 901 is configured to obtain sound characteristic information and image characteristic information of a game video.
A second obtaining module 902, configured to obtain target game scene information of the game video based on the sound feature information and the image feature information, where the target game scene information is used to indicate a game scene presented by the game video.
And a generating module 903, configured to generate a title of the game video based on the target game scene information and the image feature information.
In summary, in the video title generation apparatus provided in the embodiment of the present invention, the first obtaining module obtains the sound characteristic information and the image characteristic information of the game video, the second obtaining module obtains the scene information of the game scene presented by the game video according to the sound characteristic information and the image characteristic information, and the generating module generates the title of the game video according to the scene information and the image characteristic information.
Moreover, the scene information is acquired according to the sound characteristic information and the image characteristic information of the game video, and the generating module generates the title of the game video according to the scene information and the image characteristic information, so that the information amount which can be referred to when the title of the game video is generated is increased, the generated title of the game video can more accurately describe the main content of the game video, and the accuracy of the generated title of the game video is effectively improved.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 13 is a schematic structural diagram of a terminal 1300 according to an exemplary embodiment of the present invention. The terminal 1300 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion video Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion video Experts compression standard Audio Layer 4), a notebook computer, or a desktop computer. Terminal 1300 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, terminal 1300 includes: a processor 1301 and a memory 1302.
Processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). Processor 1301 may also include a main processor, also referred to as a Central Processing Unit (CPU), which is a processor for Processing data in the wake state, and a coprocessor. A coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing content that the display screen needs to display. In some embodiments, processor 1301 may further include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. The memory 1302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1302 is used to store at least one instruction for execution by processor 1301 to implement a video title generation method provided by method embodiments herein, or a title generation method for game video.
In some embodiments, terminal 1300 may further optionally include: a peripheral interface 1303 and at least one peripheral. Processor 1301, memory 1302, and peripheral interface 1303 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral device interface 1303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1304, display screen 1305, camera assembly 1306, audio circuitry 1307, positioning assembly 1308, and power supply 1309.
Peripheral interface 1303 may be used to connect at least one peripheral associated with I/O (Input/Output) to processor 1301 and memory 1302. In some embodiments, processor 1301, memory 1302, and peripheral interface 1303 are integrated on the same chip or circuit board. In some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1304 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1304 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the World Wide Web, metropolitan area networks, intranets, successive generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1304 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 1305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1305 is a touch display screen, the display screen 1305 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 1301 as a control signal for processing. At this point, the display screen 1305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1305, disposed on the front panel of the terminal 1300. In other embodiments, there may be at least two display screens 1305, disposed on different surfaces of the terminal 1300 or in a folded design. In still other embodiments, the display screen 1305 may be a flexible display disposed on a curved or folded surface of the terminal 1300. The display screen 1305 may even be arranged in a non-rectangular irregular shape, that is, an irregularly shaped screen. The display screen 1305 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1306 is used to capture images or video. Optionally, camera assembly 1306 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1306 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1307 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 1301 for processing, or to the radio frequency circuit 1304 for voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of the terminal 1300. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1301 or the radio frequency circuit 1304 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal into a sound wave audible to humans, or into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 1307 may also include a headphone jack.
The positioning component 1308 is used to determine the current geographic location of the terminal 1300 to implement navigation or LBS (Location Based Service). The positioning component 1308 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 1309 is used to supply power to the various components in the terminal 1300. The power supply 1309 may be an alternating current source, a direct current source, a disposable battery, or a rechargeable battery. When the power supply 1309 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also be used to support fast charging technology.
In some embodiments, terminal 1300 also includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to: acceleration sensor 1311, gyro sensor 1312, pressure sensor 1313, fingerprint sensor 1314, optical sensor 1315, and proximity sensor 1316.
The acceleration sensor 1311 can detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 1300. For example, the acceleration sensor 1311 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 1301 may control the display screen 1305 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1311. The acceleration sensor 1311 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1312 may detect the body direction and the rotation angle of the terminal 1300, and the gyro sensor 1312 may cooperate with the acceleration sensor 1311 to acquire a 3D motion of the user with respect to the terminal 1300. Processor 1301, based on the data collected by gyroscope sensor 1312, may perform the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1313 may be disposed on a side bezel of terminal 1300 and/or underlying display 1305. When the pressure sensor 1313 is disposed on the side frame of the terminal 1300, a user's holding signal to the terminal 1300 may be detected, and the processor 1301 performs left-right hand recognition or shortcut operation according to the holding signal acquired by the pressure sensor 1313. When the pressure sensor 1313 is disposed at a lower layer of the display screen 1305, the processor 1301 controls an operability control on the UI interface according to a pressure operation of the user on the display screen 1305. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1314 is used for collecting the fingerprint of the user, and the processor 1301 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the identity of the user according to the collected fingerprint. When the identity of the user is identified as a trusted identity, the processor 1301 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 1314 may be disposed on the front, back, or side of the terminal 1300. When a physical button or vendor Logo is provided on the terminal 1300, the fingerprint sensor 1314 may be integrated with the physical button or vendor Logo.
The optical sensor 1315 is used to collect the ambient light intensity. In one embodiment, the processor 1301 may control the display brightness of the display screen 1305 according to the ambient light intensity collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display luminance of the display screen 1305 is increased. When the ambient light intensity is low, the display brightness of the display screen 1305 is reduced. In another embodiment, the processor 1301 can also dynamically adjust the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315.
The proximity sensor 1316, also known as a distance sensor, is typically disposed on the front panel of the terminal 1300. The proximity sensor 1316 is used to collect the distance between the user and the front face of the terminal 1300. In one embodiment, when the proximity sensor 1316 detects that the distance between the user and the front face of the terminal 1300 gradually decreases, the processor 1301 controls the display screen 1305 to switch from the bright screen state to the dark screen state. When the proximity sensor 1316 detects that the distance between the user and the front face of the terminal 1300 gradually increases, the processor 1301 controls the display screen 1305 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 13 is not intended to be limiting with respect to terminal 1300 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be employed.
An embodiment of the present invention further provides a computer-readable storage medium, where the storage medium is a non-volatile storage medium, and at least one instruction, at least one program, a code set, or an instruction set is stored in the storage medium, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the video title generation method or the title generation method for game video provided in the foregoing embodiments of the present application.
Embodiments of the present invention further provide a computer program product, in which instructions are stored, and when the computer program product runs on a computer, the computer is enabled to execute the video title generation method provided by the embodiments of the present invention, or the title generation method of a game video.
The embodiment of the invention also provides a chip, which comprises a programmable logic circuit and/or a program instruction, and when the chip runs, the video title generation method provided by the embodiment of the invention or the title generation method of the game video can be executed.
In the embodiments of the present invention, the relational qualifier "and/or" represents three logical relations: "A and/or B" means that A exists alone, B exists alone, or both A and B exist.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A video title generation method, comprising:
acquiring sound characteristic information and image characteristic information of a video;
acquiring target scene information of the video based on the sound characteristic information and the image characteristic information, wherein the target scene information is used for indicating a scene presented by the video;
acquiring a target title template of the video based on the target scene information and the image characteristic information;
acquiring a plurality of target knowledge bases corresponding to the video based on the target scene information, wherein each target knowledge base records keywords for describing scene information, and the target knowledge bases are obtained by division according to different scene characteristics;
generating the title based on the target title template and the plurality of target knowledge bases.
2. The method of claim 1, wherein generating the title based on the target title template and the plurality of target repositories comprises:
acquiring, from each target knowledge base, keywords for filling the target title template;
and filling the target title template by adopting the keywords to obtain the title.
3. The method of claim 1,
the obtaining a target title template of the video based on the target scene information and the image feature information includes:
inputting the target scene information and the image characteristic information into a first classifier model, and determining the target title template in a plurality of title templates by the first classifier model according to the target scene information and the image characteristic information;
the obtaining of the plurality of target knowledge bases corresponding to the video based on the target scene information includes:
and querying a correspondence between scene information and knowledge bases based on the target scene information to obtain the plurality of target knowledge bases.
4. The method according to any one of claims 1 to 3, wherein the obtaining target scene information of the video based on the sound feature information and the image feature information comprises:
performing feature fusion on the sound feature information and the image feature information to obtain scene feature information;
and acquiring the target scene information based on the scene characteristic information.
5. The method according to claim 4, wherein the performing feature fusion on the sound feature information and the image feature information to obtain scene feature information comprises:
acquiring the type of the video;
respectively determining the influence weight of the sound characteristic information and the image characteristic information on the scene information based on the type of the video;
and according to the influence weight, performing feature fusion on the sound feature information and the image feature information to obtain scene feature information.
6. The method of claim 4, wherein the obtaining the target scene information based on the scene feature information comprises:
and inputting the scene characteristic information into a second classifier model, and determining the target scene information in a plurality of pieces of scene information by the second classifier model according to the scene characteristic information.
7. The method according to any one of claims 1 to 3, wherein obtaining the sound feature information of the video comprises:
acquiring sound information in a preset time period in the video;
and acquiring sound characteristic information of the sound information.
8. The method of claim 7, wherein the obtaining the sound feature information of the sound information comprises:
acquiring Mel cepstrum coefficient characteristics of the sound information;
and classifying the sound information based on the Mel cepstrum coefficient characteristics to obtain the sound characteristic information.
9. The method according to any one of claims 1 to 3, wherein obtaining image feature information of the video comprises:
acquiring image feature information of a target image frame from among a plurality of image frames included in the video.
10. The method of claim 9, wherein the target image frame is an image frame selected from the plurality of image frames at intervals of a preset duration.
11. A title generation method of a game video, the method comprising:
acquiring sound characteristic information and image characteristic information of a game video;
acquiring target game scene information of the game video based on the sound characteristic information and the image characteristic information, wherein the target game scene information is used for indicating a game scene presented by the game video;
acquiring a target title template of the game video based on the target game scene information and the image characteristic information;
acquiring a plurality of target knowledge bases corresponding to the game video based on the target game scene information, wherein each target knowledge base records keywords for describing scene information, and the target knowledge bases are obtained by division according to different scene characteristics;
generating a title of the game video based on the target title template and the plurality of target knowledge bases.
12. A video title generation apparatus, comprising:
the first acquisition module is used for acquiring sound characteristic information and image characteristic information of a video;
a second obtaining module, configured to obtain target scene information of the video based on the sound feature information and the image feature information, where the target scene information is used to indicate a scene presented by the video;
the generating module is used for acquiring a target title template of the video based on the target scene information and the image characteristic information; acquiring a plurality of target knowledge bases corresponding to the video based on the target scene information, wherein each target knowledge base records keywords for describing scene information, and the target knowledge bases are obtained by division according to different scene characteristics;
generating the title based on the target title template and the plurality of target knowledge bases.
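By way of editorial illustration only, the template-filling flow recited in claims 1 to 3 can be sketched as follows. The scene label, title templates, knowledge-base contents and slot names below are invented for readability and do not appear in the patent.

```python
# Illustration of the template-filling flow recited in claims 1 to 3. All scene
# labels, templates, knowledge-base contents and slot names below are invented
# examples for readability; they are not taken from the patent.

# Hypothetical mapping from target scene information to its target knowledge bases,
# each knowledge base recording keywords that describe one scene characteristic.
SCENE_TO_KNOWLEDGE_BASES = {
    "multi_kill": [
        {"slot": "hero", "keywords": ["Hero A"]},
        {"slot": "feat", "keywords": ["penta kill"]},
    ],
}

# Hypothetical title templates from which a first classifier model would pick one.
TITLE_TEMPLATES = ["{hero} pulls off a {feat}!", "Watch {hero} achieve a {feat}"]

def fill_title(target_scene: str, template_index: int) -> str:
    template = TITLE_TEMPLATES[template_index]        # target title template
    bases = SCENE_TO_KNOWLEDGE_BASES[target_scene]    # target knowledge bases for the scene
    # Take one keyword from each knowledge base to fill the template slots (claim 2).
    keywords = {base["slot"]: base["keywords"][0] for base in bases}
    return template.format(**keywords)

print(fill_title("multi_kill", 0))  # -> "Hero A pulls off a penta kill!"
```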
CN201810677450.2A 2018-06-27 2018-06-27 Video title generation method and device Active CN108829881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810677450.2A CN108829881B (en) 2018-06-27 2018-06-27 Video title generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810677450.2A CN108829881B (en) 2018-06-27 2018-06-27 Video title generation method and device

Publications (2)

Publication Number Publication Date
CN108829881A CN108829881A (en) 2018-11-16
CN108829881B true CN108829881B (en) 2021-12-03

Family

ID=64138901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810677450.2A Active CN108829881B (en) 2018-06-27 2018-06-27 Video title generation method and device

Country Status (1)

Country Link
CN (1) CN108829881B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410535A (en) * 2018-11-22 2019-03-01 维沃移动通信有限公司 A kind for the treatment of method and apparatus of scene information
CN109948409A (en) * 2018-11-30 2019-06-28 北京百度网讯科技有限公司 For generating the method, apparatus, equipment and computer readable storage medium of article
CN111401100B (en) 2018-12-28 2021-02-09 广州市百果园信息技术有限公司 Video quality evaluation method, device, equipment and storage medium
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment
CN110263214A (en) * 2019-06-21 2019-09-20 北京百度网讯科技有限公司 Generation method, device, server and the storage medium of video title
CN110399526B (en) * 2019-07-26 2023-02-28 腾讯科技(深圳)有限公司 Video title generation method and device and computer readable storage medium
CN111353070B (en) * 2020-02-18 2023-08-18 北京百度网讯科技有限公司 Video title processing method and device, electronic equipment and readable storage medium
CN111460801B (en) * 2020-03-30 2023-08-18 北京百度网讯科技有限公司 Title generation method and device and electronic equipment
CN112541095B (en) * 2020-11-30 2023-09-05 北京奇艺世纪科技有限公司 Video title generation method and device, electronic equipment and storage medium
CN113378000B (en) * 2021-07-06 2023-09-05 北京奇艺世纪科技有限公司 Video title generation method and device
CN114357989B (en) * 2022-01-10 2023-09-26 北京百度网讯科技有限公司 Video title generation method and device, electronic equipment and storage medium
CN114885189A (en) * 2022-04-14 2022-08-09 深圳创维-Rgb电子有限公司 Control method, device and equipment for opening fragrance and storage medium
CN115131698B (en) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 Video attribute determining method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1647528A (en) * 2002-04-12 2005-07-27 三菱电机株式会社 Meta data edition device, meta data reproduction device, meta data distribution device, meta data search device, meta data reproduction condition setting device, and meta data distribution method
CN101127899A (en) * 2002-04-12 2008-02-20 三菱电机株式会社 Hint information description method
CN101547326A (en) * 2008-03-27 2009-09-30 株式会社东芝 Device and method for notifying content scene appearance
CN101650958A (en) * 2009-07-23 2010-02-17 中国科学院声学研究所 Extraction method and index establishment method of movie video scene clip
CN107679227A (en) * 2017-10-23 2018-02-09 柴建华 Video index label setting method, device and server
CN107992500A (en) * 2016-10-27 2018-05-04 腾讯科技(北京)有限公司 A kind of information processing method and server

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5010292B2 (en) * 2007-01-18 2012-08-29 株式会社東芝 Video attribute information output device, video summarization device, program, and video attribute information output method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1647528A (en) * 2002-04-12 2005-07-27 三菱电机株式会社 Meta data edition device, meta data reproduction device, meta data distribution device, meta data search device, meta data reproduction condition setting device, and meta data distribution method
CN101127899A (en) * 2002-04-12 2008-02-20 三菱电机株式会社 Hint information description method
CN101547326A (en) * 2008-03-27 2009-09-30 株式会社东芝 Device and method for notifying content scene appearance
CN101650958A (en) * 2009-07-23 2010-02-17 中国科学院声学研究所 Extraction method and index establishment method of movie video scene clip
CN107992500A (en) * 2016-10-27 2018-05-04 腾讯科技(北京)有限公司 A kind of information processing method and server
CN107679227A (en) * 2017-10-23 2018-02-09 柴建华 Video index label setting method, device and server

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Research on Text Description Methods for Images; Ma Longlong et al.; Journal of Chinese Information Processing; 2018-04-15 (No. 04); pp. 5-16 *

Also Published As

Publication number Publication date
CN108829881A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN108829881B (en) Video title generation method and device
CN107967706B (en) Multimedia data processing method and device and computer readable storage medium
CN109522426B (en) Multimedia data recommendation method, device, equipment and computer readable storage medium
CN109151593B (en) Anchor recommendation method, device and storage medium
CN110572711B (en) Video cover generation method and device, computer equipment and storage medium
CN108683927B (en) Anchor recommendation method and device and storage medium
CN108008930B (en) Method and device for determining K song score
CN110650379B (en) Video abstract generation method and device, electronic equipment and storage medium
WO2019114514A1 (en) Method and apparatus for displaying pitch information in live broadcast room, and storage medium
CN110163066B (en) Multimedia data recommendation method, device and storage medium
CN110300274B (en) Video file recording method, device and storage medium
CN110572716B (en) Multimedia data playing method, device and storage medium
WO2019128593A1 (en) Method and device for searching for audio
CN109068160B (en) Method, device and system for linking videos
CN110956971B (en) Audio processing method, device, terminal and storage medium
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN111048111B (en) Method, device, equipment and readable storage medium for detecting rhythm point of audio
CN110139143B (en) Virtual article display method, device, computer equipment and storage medium
CN110865754A (en) Information display method and device and terminal
CN110933468A (en) Playing method, playing device, electronic equipment and medium
CN112487940B (en) Video classification method and device
CN112115282A (en) Question answering method, device, equipment and storage medium based on search
CN111031391A (en) Video dubbing method, device, server, terminal and storage medium
CN111276122A (en) Audio generation method and device and storage medium
CN111432245B (en) Multimedia information playing control method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant