CN112887588B - Method and apparatus for generating video - Google Patents

Method and apparatus for generating video

Info

Publication number
CN112887588B
Authority
CN
China
Prior art keywords
video
audio
user
shooting
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110025035.0A
Other languages
Chinese (zh)
Other versions
CN112887588A (en)
Inventor
郑新建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics China R&D Center, Samsung Electronics Co Ltd filed Critical Samsung Electronics China R&D Center
Priority to CN202110025035.0A priority Critical patent/CN112887588B/en
Publication of CN112887588A publication Critical patent/CN112887588A/en
Application granted granted Critical
Publication of CN112887588B publication Critical patent/CN112887588B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules
    • H04N 23/64 Computer-aided capture of images, e.g. transfer from script file into camera, check of taken image quality, advice or proposal for image composition or decision on when to take image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

A method and an apparatus for generating a video are provided. The method comprises: receiving audio selected by a user; obtaining a target video shooting model based on the selected audio; generating a command sequence based on the target video shooting model; and sending the command sequence to the user so as to guide the user in shooting a video while the audio is played. The method and apparatus address the problems that a user does not know how to set shooting parameters to capture a video matching the emotion the user wants to express, and that post-production audio and video editing is complex and labor-intensive, thereby improving the user experience.

Description

Method and apparatus for generating video
Technical Field
The present invention relates to the field of intelligent terminal technologies, and in particular, to a method and an apparatus for generating a video.
Background
Shooting a video and adding background music to it is a popular form of creative expression. Music can convey a particular emotion, and the emotional characteristics it expresses may change as the music progresses. People often express an emotion by creating short videos. However, because photographers' skill levels vary widely and shooting devices offer a large number of functions, an ordinary photographer may not know which function to use at a given moment, or how to set the parameters, to capture video that matches the feeling to be expressed at that moment. In addition, matching music to video in post-production is done mostly by hand; the process is complex, requires a video editor to spend considerable time and effort, and carries a high labor cost, which is inconvenient for users.
Disclosure of Invention
An object of the exemplary embodiments of the present invention is to provide a method and an apparatus for generating a video, so as to solve the problems in the prior art that a user does not know how to set shooting parameters to capture a video matching the emotion the user wants to express, and that editing audio into a video in post-production is complex and labor-intensive.
According to an aspect of exemplary embodiments of the present invention, there is provided a method of generating a video, which may include: receiving audio selected by a user; obtaining a target video shooting model based on the audio selected by the user; generating a command sequence based on the target video shooting model; and sending a command sequence for guiding the user to shoot to the user so as to guide the user to shoot the video while playing the audio.
Based on the user-selected audio, the step of obtaining the target video capture model may include: if the audio selected by the user is not in the music library, performing emotional characteristic analysis on the audio selected by the user to obtain a target video shooting model; and if the audio selected by the user is in the music library, selecting a video shooting model corresponding to the audio selected by the user from a plurality of pre-stored video shooting models as a target video shooting model in combination with the framing characteristics of the shooting scene and the type of the shooting equipment.
The step of performing emotional characteristic analysis on the audio selected by the user to obtain the target video shooting model may include: performing emotional characteristic analysis on the audio selected by the user to generate audio emotional characteristic information; and selecting a video shooting model corresponding to the audio with audio emotional characteristic information which is most similar to the audio emotional characteristic information of the audio selected by the user from the plurality of pre-stored video shooting models by combining the view finding characteristic of the shooting scene and the type of the shooting equipment, and taking the video shooting model as a target video shooting model.
The video shooting model corresponding to any one audio can be obtained by the following steps: performing emotional characteristic analysis on any audio to generate audio emotional characteristic information of any audio; generating spectrogram slicing feature information of any audio frequency based on the audio frequency emotion feature information of any audio frequency through a music emotion model; acquiring video shooting characteristics based on the generated spectrogram slicing characteristic information, the view finding characteristics of a shooting scene and the type of shooting equipment through a video emotion model; and generating a video shooting model based on the video shooting characteristics and a command sequence for guiding a user to shoot, wherein the music emotion model represents the relationship between the audio emotion characteristic information and the spectrogram slice characteristic information, and the video emotion model represents the relationship between the spectrogram slice characteristic information, the view finding characteristics of a shooting scene, the type of shooting equipment and the video shooting characteristics. The video capture features may include: one or more of a video frame rendering feature, a capture focus object in a capture scene, a framing feature of a capture scene, a capture perspective, a capture device type, and a perspective change trajectory.
The command sequence may include: a first command sequence for one or more of controlling a photographing apparatus to move, changing a photographing angle of view, and adjusting a composition; a second command sequence for controlling a respective parameter, wherein the respective parameter comprises: one or more of a focus object, a blurred background, a focal length, a sensitivity, an exposure level, and a filter of the camera.
When the shooting device is a portable mobile terminal, the first command sequence may be a command prompting the user to move the shooting device and to adjust one or more of the composition features and the shooting view angle; when the shooting device is a drone, the first command sequence may include one or more of commands that control raising and lowering of the drone's lens, tracking shots, bird's-eye rotation, simulated jib-arm movement, and arc push-in/pull-out.
The method may further comprise: receiving evaluation of a user on the shot video; and optimizing the target video shooting model in response to the evaluation of the shot video by the user, wherein when the evaluation of the shot video by the user is less than a specific value, the target video shooting model is optimized.
The method may further comprise: performing final processing on the video shot by the user in response to the command sequence to generate a final video, wherein the final processing comprises performing, on the video data, one or more of: rectifying composition features of some video frames, adjusting rendering features of some video frames, deduplication, global tone unification, and synchronous rectification of the audio and video frames.
The video matching any given audio comes from a video file that includes that audio.
According to an aspect of exemplary embodiments of the present invention, there is provided an apparatus for generating a video, the apparatus including: an audio receiving unit that receives audio selected by a user; a model generation unit that obtains a target video shooting model based on the audio selected by the user; a command generation unit that generates a command sequence based on the target video shooting model; and a video generation unit that sends the command sequence for guiding the user to shoot to the user, so as to guide the user in shooting a video while the audio is played.
Based on the audio selected by the user, the process of obtaining the target video capture model may include: if the audio selected by the user is not in the music library, performing emotional characteristic analysis on the audio selected by the user to obtain a target video shooting model; and if the audio selected by the user is in the music library, selecting a video shooting model corresponding to the audio selected by the user as a target video shooting model from a plurality of pre-stored video shooting models by combining the view finding characteristics of the shooting scene and the type of the shooting equipment.
The process of performing emotional feature analysis on the audio selected by the user to obtain the target video shooting model may include: performing emotional characteristic analysis on the audio selected by the user to generate audio emotional characteristic information; and selecting a video shooting model corresponding to the audio with audio emotional characteristic information most similar to the audio emotional characteristic information of the audio selected by the user as a target video shooting model by combining the view finding characteristic of the shooting scene and the type of the shooting equipment from the plurality of pre-stored video shooting models.
The video shooting model corresponding to any given audio can be obtained by: performing emotional feature analysis on the audio to generate its audio emotional feature information; generating, through a music emotion model, spectrogram slice feature information of the audio based on its audio emotional feature information; obtaining video shooting features, through a video emotion model, based on the generated spectrogram slice feature information, the framing feature of the shooting scene, and the type of the shooting device; and generating a video shooting model based on the video shooting features and a command sequence for guiding the user to shoot, wherein the music emotion model represents the relationship between the audio emotional feature information and the spectrogram slice feature information, and the video emotion model represents the relationship between the spectrogram slice feature information, the framing feature of the shooting scene, the type of the shooting device, and the video shooting features.
The video capture features may include: one or more of a video frame rendering feature, a capture focus object in a capture scene, a framing feature of a capture scene, a capture perspective, a capture device type, and a perspective change trajectory.
The command sequence may include: a first command sequence for one or more of controlling a photographing apparatus to move, changing a photographing angle of view, and adjusting a composition; a second command sequence for controlling a respective parameter, wherein the respective parameter comprises: one or more of a focus object, a blurred background, a focal length, a sensitivity, an exposure level, and a filter of the camera.
When the shooting device is a portable mobile terminal, the first command sequence may be a command prompting the user to move the shooting device and to adjust one or more of the composition features and the shooting view angle; when the shooting device is a drone, the first command sequence may include one or more of commands that control raising and lowering of the drone's lens, tracking shots, bird's-eye rotation, simulated jib-arm movement, and arc push-in/pull-out.
The apparatus may further include a video evaluation unit that receives the user's evaluation of the shot video and optimizes the target video shooting model in response to that evaluation, wherein the target video shooting model is optimized when the user's score for the shot video is less than a specific value.
The apparatus may further include a final processing unit that performs final processing on the video shot by the user in response to the command sequence to generate a final video, wherein the final processing can comprise performing, on the video data, one or more of: rectifying composition features of some video frames, adjusting rendering features of some video frames, deduplication, global tone unification, and synchronous rectification of the audio and video frames.
The video matching any given audio comes from a video file that includes that audio.
According to another aspect of exemplary embodiments of the present invention, there is provided an apparatus for generating a video, the apparatus including: a processor; a memory storing a computer program which, when executed by the processor, performs the method of generating video as described above.
According to another aspect of exemplary embodiments of the present invention, there is provided a computer-readable storage medium having stored therein a computer program which, when executed, performs the method of generating a video as described above.
According to the method and apparatus for generating a video, a command sequence for guiding the user to shoot is sent to the user based on the audio emotional feature information of the audio selected by the user, so that the user can be assisted, while the audio is played, in shooting a high-quality video that better matches what the user intends to express.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings which illustrate exemplary embodiments, wherein:
fig. 1 illustrates a diagram of a method of generating a video based on an AI technique;
FIG. 2 illustrates a flow diagram for obtaining a video capture model corresponding to audio in a music library according to an exemplary embodiment of the present invention;
FIG. 3 illustrates a diagram for generating a sequence of audio expression units according to an exemplary embodiment of the present invention;
FIG. 4 illustrates a diagram of generating a sequence of video expression units according to an exemplary embodiment of the present invention;
fig. 5 is a diagram illustrating capturing of a video using a mobile terminal according to an exemplary embodiment of the present invention;
fig. 6 is a diagram illustrating shooting of a video using a drone according to an exemplary embodiment of the present invention;
fig. 7 is a diagram illustrating an apparatus for generating a video based on the AI technique.
Detailed Description
Various example embodiments will now be described more fully with reference to the accompanying drawings, in which some example embodiments are shown.
Fig. 1 illustrates a diagram of a method of generating a video based on an artificial intelligence AI technique. As an example, the device for generating the video may be an electronic device that has a camera function and is capable of capturing a video, such as a mobile communication terminal (e.g., a smartphone), a tablet computer, a personal digital assistant, an automatically controllable device (e.g., a drone), and the like.
Referring to fig. 1, in step S10, audio selected by a user is received.
In one example, the user-selected audio may be music already stored in the video generating device or music newly uploaded by the user to the video generating device.
In step S20, a target video shooting model is obtained based on the audio selected by the user.
After receiving the user-selected audio, it is determined whether the user-selected audio is in a music library. If the audio selected by the user is not in the music library, performing emotional characteristic analysis on the audio selected by the user to obtain a target video shooting model; and if the audio selected by the user is in the music library, selecting a video shooting model corresponding to the audio selected by the user from a plurality of pre-stored video shooting models as a target video shooting model in combination with the framing characteristics of the shooting scene and the type of the shooting equipment.
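To make the branching above concrete, the following is a minimal Python sketch of the selection step, assuming a library keyed by audio id that stores a Valence-Arousal vector and pre-stored models per (framing, device) pair; the function and structure names (`obtain_target_model`, `library`, the Euclidean similarity) are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch only: selecting a target video shooting model for user-selected audio.
# `library` maps audio_id -> (valence_arousal, {(framing, device): model}); these names
# and the Euclidean similarity measure are assumptions introduced for this example.

def obtain_target_model(audio_id, emotion_vec, library, framing, device):
    if audio_id in library:
        # Audio is in the music library: use its pre-stored model for this scene and device.
        _, models = library[audio_id]
        return models[(framing, device)]
    # Otherwise: find the library audio whose emotional features are most similar
    # and reuse its model for the current framing feature and device type.
    def distance(item):
        (v, a), _ = item[1]
        return (v - emotion_vec[0]) ** 2 + (a - emotion_vec[1]) ** 2
    _, (_, models) = min(library.items(), key=distance)
    return models[(framing, device)]

library = {
    "song_a": ((0.7, 0.8), {("landscape", "drone"): "model_a_landscape_drone"}),
    "song_b": ((-0.5, -0.2), {("landscape", "drone"): "model_b_landscape_drone"}),
}
print(obtain_target_model("unknown", (0.6, 0.9), library, "landscape", "drone"))
# -> model_a_landscape_drone (song_a is emotionally closest)
```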
The step of performing emotional characteristic analysis on the audio selected by the user to obtain the target video shooting model may include: performing emotional characteristic analysis on the audio selected by the user to generate audio emotional characteristic information; and selecting a video shooting model corresponding to the audio with audio emotional characteristic information which is most similar to the audio emotional characteristic information of the audio selected by the user from the plurality of pre-stored video shooting models by combining the view finding characteristic of the shooting scene and the type of the shooting equipment, and taking the video shooting model as a target video shooting model.
Further, the audio in the music library comes from a predetermined number of video files. These video files may be files that the user manually imports as preferred references, or files that are automatically searched for and downloaded from dedicated video websites according to label classification. Such video files are shot with relatively professional technique, have high shooting quality, or adopt a currently popular shooting and rendering style, and they cover shooting modes using various shooting devices (for example, a smartphone with a gimbal (pan-tilt head), a drone, and the like). By training the video shooting model with these video files, a video shooting model that closely matches the received audio can be obtained, better assisting the user in completing video shooting that meets the intended expression.
Audio emotional feature information may be generated by performing emotional feature analysis on the audio in the music library according to the Valence-Arousal emotion model, so as to obtain, based on video expression units and the command corresponding to each video expression unit, a video shooting model corresponding to the audio having that emotional feature information; a model obtained in this way is referred to as a pre-stored video shooting model. The Valence-Arousal emotion model describes emotion along two dimensions, valence and arousal, representing the degree of pleasure (valence) and the degree of excitement (arousal) that an emotion brings to a person.
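As a simple illustration of the two-dimensional Valence-Arousal representation described above, the sketch below maps a (valence, arousal) pair to a coarse emotion label; the quadrant names and the [-1, 1] value range are common conventions assumed for illustration, not terms defined by the patent.

```python
# Minimal illustration of the Valence-Arousal plane; labels are illustrative assumptions.

def emotion_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair in [-1, 1] x [-1, 1] to a coarse emotion label."""
    if valence >= 0 and arousal >= 0:
        return "excited / joyful"       # pleasant, high energy
    if valence >= 0:
        return "calm / content"         # pleasant, low energy
    if arousal >= 0:
        return "angry / anxious"        # unpleasant, high energy
    return "sad / depressed"            # unpleasant, low energy

print(emotion_quadrant(0.7, 0.8))  # excited / joyful
```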
The steps of obtaining a video capture model corresponding to audio in a music library are described in detail below with reference to fig. 2, 3, and 4.
Fig. 2 illustrates a flowchart of obtaining a video capture model corresponding to audio in a music library according to an exemplary embodiment of the present invention. Fig. 3 illustrates a diagram of generating a sequence of audio expression units according to an exemplary embodiment of the present invention. Fig. 4 illustrates a diagram for generating a sequence of video expression units according to an exemplary embodiment of the present invention.
In step S210, emotional feature analysis is performed on the given audio to generate audio emotional feature information of that audio.
In one example, the given audio may be audio in the music library. The content of the audio may be a piece of music. Music is an art that expresses emotion and serves as a carrier for human thought and intention. Music can express, to different degrees, emotions such as joy, harmony, lightness, happiness, fury, anxiety, worry, grief, and pain, as well as emotions such as reverence and worship. In one example, the audio in the music library may be subjected to emotional feature analysis according to the Valence-Arousal emotion model to generate audio emotional feature information (e.g., e1, e2, e3, e4, e5, e6, e7, e8 in fig. 3). In another example, the audio in the music library may be emotionally labeled by means of a big-data questionnaire to generate the audio emotional feature information of the audio in the music library.
In step S220, spectrogram slice feature information of the audio is generated, through a music emotion model, based on the audio emotional feature information of the audio.
As shown in fig. 3, the audio may be represented using a spectrogram (also called a speech spectrogram), i.e., a graph obtained by Fourier analysis of the audio signal, with frequency on the vertical axis and time on the horizontal axis. The spectrogram combines the characteristics of a frequency spectrum and a time-domain waveform and shows how the spectrum of the signal changes over time. According to the audio emotional feature information (e.g., e1, e2, e3, e4, e5, e6, e7, e8) generated by the emotional feature analysis of the audio in the music library, the spectrogram corresponding to the audio is divided, and spectrogram slice feature information corresponding to the audio emotional feature information is generated.
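The following sketch shows, under the assumption that the librosa library is available and that each emotion feature e1..e8 carries a (start, end) time in seconds, how a spectrogram could be computed by Fourier analysis and divided into slices aligned with the emotion segments; the function name and segment format are assumptions for illustration.

```python
# A sketch of computing a spectrogram and slicing it along emotion segments.
# Assumes librosa is installed; segment boundaries are illustrative placeholders.

import librosa
import numpy as np

def spectrogram_slices(audio_path, emotion_segments, sr=22050, hop_length=512):
    """Return one dB-scaled spectrogram slice per emotion segment.

    emotion_segments: list of (start_sec, end_sec, emotion_feature) tuples,
    e.g. the e1..e8 segments in Fig. 3.
    """
    y, sr = librosa.load(audio_path, sr=sr)
    stft = librosa.stft(y, hop_length=hop_length)        # Fourier analysis
    spec_db = librosa.amplitude_to_db(np.abs(stft))      # rows: frequency, cols: time

    slices = []
    for start, end, emotion in emotion_segments:
        c0 = int(start * sr / hop_length)                # time in seconds -> frame column
        c1 = int(end * sr / hop_length)
        slices.append((emotion, spec_db[:, c0:c1]))
    return slices
```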
In one example, a complete piece of audio corresponds to a sequence of audio expression units. A sequence of audio expression units may comprise a plurality of audio expression units. An expression unit is the smallest unit that can define an emotion. In fig. 3, a plurality of audio expression units are shown as u1, u2, u3, u4, u5, u6, u7 and u8, and the audio emotional feature information e1, e2, e3, e4, e5, e6, e7, e8 corresponds to the audio expression units u1, u2, u3, u4, u5, u6, u7 and u8, respectively. The audio expression unit may correspond to speech spectrogram slice feature information corresponding to the audio emotion feature information. In other words, the audio expression unit may include audio emotional feature information and spectrogram slice feature information corresponding to the audio emotional feature information.
In one example, the music emotion model may be generated using a deep learning technique (e.g., a convolutional neural network (CNN), a recurrent neural network (RNN), etc.), trained with a set of audio emotional feature information and the corresponding set of spectrogram slice feature information. As shown in fig. 3, the music emotion model represents the relationship between audio emotional feature information and spectrogram slice feature information. The music emotion model can automatically generate, from audio emotional feature information, the spectrogram slice feature information corresponding to that emotional feature information, without the step of dividing the spectrogram of the audio to produce the slice feature information. In one example, for the same type of emotional feature information, the music emotion model can generate, according to the matching degree with the emotional feature information of the audio, a plurality of spectrogram slice feature information items. For example, the music emotion model may generate spectrogram slice feature information 1, 2, …, n corresponding to emotional feature 1, where n is a positive integer greater than 1.
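As a hedged illustration of such a learned mapping, the toy PyTorch model below takes an audio emotional feature vector (e.g., a Valence-Arousal pair) and outputs a spectrogram-slice feature vector; the architecture, dimensions, and class name are assumptions and stand in for whatever CNN/RNN the patent's music emotion model would actually use.

```python
# Toy sketch of a "music emotion model": emotion feature vector in, slice feature vector out.
# All sizes and the MLP architecture are illustrative assumptions.

import torch
import torch.nn as nn

class MusicEmotionModel(nn.Module):
    def __init__(self, emotion_dim=2, slice_feature_dim=128, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emotion_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, slice_feature_dim),
        )

    def forward(self, emotion_features):
        # emotion_features: (batch, emotion_dim), e.g. Valence-Arousal pairs
        return self.net(emotion_features)

model = MusicEmotionModel()
slice_features = model(torch.tensor([[0.7, 0.8]]))   # one emotion feature -> slice features
print(slice_features.shape)                          # torch.Size([1, 128])
```

In practice such a network would be trained on the paired sets of emotion features and slice features described above; the forward pass here only illustrates the input/output relationship.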
In step S230, video shooting features are obtained through the video emotion model based on the generated spectrogram slice feature information, the framing feature of the shooting scene, and the type of the shooting device. The framing feature of the shooting scene indicates the current type of subject being framed (e.g., people, landscape, buildings, etc.). The shooting device type indicates the device used for the current shooting (e.g., a mobile terminal, a drone, etc.).
As shown in fig. 4, videos corresponding to the audios in the music library are divided based on the generated spectrogram slice feature information, and video slices (e.g., v1, v2, v3, v4, v5, v6, v7, v 8) are generated. That is, the audio emotional feature information e1, e2, e3, e4, e5, e6, e7, e8 corresponds to the video slices v1, v2, v3, v4, v5, v6, v7, v8, respectively.
In one example, after the video corresponding to the audio is divided, video shooting features are extracted from the video slices (e.g., from video frames) according to the framing features of the shooting scene (e.g., portrait, building, mountain, grassland, etc.), the shooting device type (e.g., mobile terminal, drone, etc.), and so on, using deep learning, image processing, simultaneous localization and mapping (SLAM), and similar techniques.
The video emotion model can be generated using a deep learning technique (e.g., CNN, RNN, etc.), trained with sets of spectrogram slice feature information, framing features of the shooting scene, and shooting device types, together with the corresponding set of video shooting features. As shown in fig. 4, the video emotion model represents the relationship between the spectrogram slice feature information, the framing feature of the shooting scene, the shooting device type, and the video shooting features. In one example, the video shooting features may include one or more of: a video frame rendering feature, a shooting focus object in the shooting scene, a shooting device type, a framing feature of the shooting scene, a shooting view angle, and a view-angle change trajectory.
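The toy sketch below mirrors that relationship: spectrogram slice features plus an encoded framing feature and device type go in, and a video shooting feature vector comes out. The label encodings, embedding sizes, and output dimension are assumptions for illustration only; embeddings are simply one convenient way to feed categorical framing and device information into the network alongside the continuous slice features.

```python
# Toy sketch of a "video emotion model": slice features + framing + device -> shooting features.
# Encodings, dimensions, and feature names are illustrative assumptions.

import torch
import torch.nn as nn

FRAMING = {"portrait": 0, "landscape": 1, "building": 2}
DEVICE = {"mobile": 0, "drone": 1}

class VideoEmotionModel(nn.Module):
    def __init__(self, slice_dim=128, capture_dim=32, hidden=64):
        super().__init__()
        self.framing_emb = nn.Embedding(len(FRAMING), 8)
        self.device_emb = nn.Embedding(len(DEVICE), 4)
        self.net = nn.Sequential(
            nn.Linear(slice_dim + 8 + 4, hidden),
            nn.ReLU(),
            nn.Linear(hidden, capture_dim),  # e.g. rendering style, focus object, view angle, trajectory
        )

    def forward(self, slice_feat, framing_id, device_id):
        x = torch.cat([slice_feat,
                       self.framing_emb(framing_id),
                       self.device_emb(device_id)], dim=-1)
        return self.net(x)

model = VideoEmotionModel()
out = model(torch.randn(1, 128),
            torch.tensor([FRAMING["building"]]),
            torch.tensor([DEVICE["drone"]]))
print(out.shape)  # torch.Size([1, 32])
```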
In step S240, a video shooting model is generated based on the video shooting characteristics and a command sequence for guiding the user to shoot.
In one example, the video shooting model can be generated using a deep learning technique (e.g., CNN, RNN, etc.), trained with a set of video shooting features and the corresponding set of command sequences. The video shooting model represents the relationship between video shooting features and the command sequence used to guide the user in shooting. That is, a command sequence that guides the user to shoot can be generated from the video shooting features through the video shooting model.
In one example, the command sequence to guide the user to take a photograph may include: a first command sequence for one or more of controlling a photographing apparatus to move, changing a photographing angle of view, and adjusting a composition; a second command sequence for controlling a respective parameter, wherein the respective parameter comprises: one or more of a focusing object, a blurring background, a focal length, a sensitivity ISO, an exposure level, and a filter of the camera.
In other words, the video capture feature obtained from the video frame is converted into a first command sequence and a second command sequence that guide the user to capture. Visual angle change tracks in the video shooting characteristics can be converted into commands for guiding a user to control the motion of shooting equipment; the shooting view angle in the video shooting feature can be converted into a command for guiding a user to change the shooting view angle; the framing feature of the shooting scene in the video capture feature may be converted into a command that guides the user to adjust the composition. A video frame rendering feature in the video capture feature and a capture focus object in the capture scene may be translated into a second command sequence directing the user to control the respective parameter.
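A minimal sketch of that conversion is given below, assuming the video shooting features arrive as a dictionary; the field names (`trajectory`, `view_angle`, `focus_object`, etc.) and command tuples are hypothetical placeholders rather than a format defined by the patent.

```python
# Sketch of splitting video shooting features into the two command sequences described above.
# Feature keys and command strings are illustrative assumptions.

def build_command_sequences(capture_features):
    first_sequence = []   # device movement, shooting view angle, composition
    second_sequence = []  # camera parameters

    if "trajectory" in capture_features:        # view-angle change trajectory -> movement command
        first_sequence.append(("move_along", capture_features["trajectory"]))
    if "view_angle" in capture_features:
        first_sequence.append(("set_view_angle", capture_features["view_angle"]))
    if "framing" in capture_features:
        first_sequence.append(("adjust_composition", capture_features["framing"]))

    for key in ("focus_object", "blur_background", "focal_length",
                "iso", "exposure", "filter"):
        if key in capture_features:
            second_sequence.append((key, capture_features[key]))

    return first_sequence, second_sequence

first, second = build_command_sequences(
    {"trajectory": "semicircle", "focus_object": "stone", "iso": 100})
print(first, second)
```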
The command sequences differ depending on the shooting device. For example, when the shooting device is a portable mobile terminal, the first command sequence is a command that prompts the user to move the shooting device and to adjust one or more of the composition features and the shooting view angle. When the shooting device is a drone, the first command sequence includes one or more of commands that control raising and lowering of the drone's lens, tracking shots, bird's-eye rotation, simulated jib-arm movement, and arc push-in/pull-out.
Further, when the user shoots video with a mobile communication terminal (e.g., a smartphone) as the shooting device, the first command sequence may prompt the user to move the device along a recommended trajectory, adjust composition features (e.g., adjust the proportion of a character sculpture in the frame, or whether the sculpture is centered in the frame), adjust the shooting view angle, and so on. When the user shoots video with a drone, the first command sequence can control raising and lowering of the drone's lens, tracking shots, bird's-eye rotation, simulated jib-arm movement, arc push-in/pull-out, and the like.
In one example, a complete piece of video corresponds to a sequence of video presentation units. A sequence of video expression units includes a plurality of video expression units. In fig. 4, a plurality of video expression units are shown as m1, m2, m3, m4, m5, m6, m7, and m8. Referring to fig. 3, the audio emotional feature information e1, e2, e3, e4, e5, e6, e7, and e8 correspond to the video slices v1, v2, v3, v4, v5, v6, v7, and v8 and the video expression units m1, m2, m3, m4, m5, m6, m7, and m8, respectively. Each video presentation unit corresponds to a command sequence corresponding to audio emotional characteristic information.
Returning to fig. 1, in step S30, a command sequence is generated based on the target video shooting model.
In step S40, the command sequence for guiding the user to shoot is sent to the user, so as to guide the user in shooting a video while the audio is played.
In one example, when the user shoots a video, a command sequence such as that shown in fig. 5 (e.g., focus on the character sculpture or the stone; adjust the sensitivity (ISO), exposure level, and filter; prompt the user to move the shooting device along a recommended trajectory around the character sculpture as the focus object; etc.) may be sent to the user. The user may choose to execute the command sequence manually or have the shooting device execute it automatically, so that the pose of the shooting device can be adjusted in real time for the optimal shooting view angle, composition, and so on. In one example, when the user chooses automatic execution, the shooting device invokes the received commands in sequence, which reduces the user's involvement in operating the device's functions and lets the user concentrate on the content being shot, thereby improving the user experience.
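The sketch below illustrates the two execution modes described above for a command sequence aligned with audio playback; the command tuple format, the timestamps, and the prompt wording are assumptions for illustration.

```python
# Sketch of dispatching a command sequence during playback, in manual-confirm or automatic mode.
# The (timestamp, name, value) command format is an illustrative assumption.

import time

def run_commands(commands, auto=True):
    """commands: list of (timestamp_sec, name, value) aligned with audio playback."""
    start = time.monotonic()
    for ts, name, value in commands:
        # Wait until playback reaches the time this command is aligned to.
        delay = ts - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        if auto:
            print(f"[auto] applying {name} = {value}")        # device applies the setting itself
        else:
            answer = input(f"Apply {name} = {value}? [y/n] ")  # user confirms each step
            if answer.lower() == "y":
                print(f"[manual] applied {name} = {value}")

run_commands([(0.0, "focus_object", "stone"), (2.5, "iso", 100)])
```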
The method of generating a video may further include: receiving evaluation of a user on the shot video; and optimizing the target video shooting model in response to the evaluation of the shot video by the user, wherein when the score of the shot video by the user is less than a specific value, the target video shooting model is optimized.
In one example, a user makes an evaluation (e.g., a rating) of a captured video by browsing to determine whether the captured video effect has reached its own desired effect. The target video shooting model can be optimized according to the evaluation of the user.
In one example, the method of generating a video may further comprise: performing final processing on the video shot by the user in response to the command sequence to generate a final video, wherein the final processing includes performing, on the video data, one or more of: rectifying composition features of some video frames, adjusting rendering features of some video frames, removing redundancy, unifying the global tone, and synchronously rectifying the audio and video frames.
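A possible shape of that final-processing step is sketched below as a simple pipeline; each stage is a placeholder function standing in for the real frame- and audio-level operations, so only the control flow is meaningful here.

```python
# Sketch of a final-processing pipeline applying the listed post-capture steps in sequence.
# The stage functions are placeholders introduced for illustration.

def final_processing(frames, audio, steps=None):
    pipeline = steps or [
        rectify_composition,      # correct composition of some frames
        adjust_rendering,         # adjust rendering features of some frames
        deduplicate_frames,       # remove redundant/duplicate frames
        unify_global_tone,        # global tone unification
    ]
    for step in pipeline:
        frames = step(frames)
    return synchronize_audio_video(frames, audio)   # align audio with the processed frames

# Placeholder stages so the sketch runs end to end.
def rectify_composition(frames): return frames
def adjust_rendering(frames): return frames
def deduplicate_frames(frames): return list(dict.fromkeys(frames))
def unify_global_tone(frames): return frames
def synchronize_audio_video(frames, audio): return {"frames": frames, "audio": audio}

print(final_processing(["f1", "f2", "f2", "f3"], "audio.wav"))
```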
In another example, the target video shooting model may also be generated in a user-defined manner. If the user wants to shoot a video like one shot by someone else, the user may import that video into the application software and select the option to shoot the same kind of work. A video shooting model may then be generated by the method shown in fig. 2, and control commands may be generated based on that model to guide the user in shooting a similar work.
Fig. 5 is a diagram illustrating capturing of a video using a mobile terminal according to an exemplary embodiment of the present invention.
In one example, as shown in fig. 5, when a user performs video shooting using a mobile communication terminal (e.g., a smartphone) as a shooting device, the user may install an application APP of the video generation device on the mobile terminal device (e.g., a smartphone), select music for use in the current shooting on the APP, select a shooting device type (e.g., select the shooting device as a mobile terminal), and start framing shooting. In one example, the current view may be transmitted to the cloud server in real-time, but is not limited thereto.
In one example, as shown in FIG. 5, the user shoots a square scene, with large stones, foliage, a cement floor, distant buildings, sky, and so on appearing in the first framed picture. Starting from the first video frame (also called the first frame picture), the video frames corresponding to the audio expression unit A1 are received, and a command sequence is generated based on the target video shooting model and sent to the user. The command sequence includes a first command sequence and a second command sequence. The first command sequence prompts the user to move the shooting device and to adjust one or more of the composition features and the shooting view angle, and the second command sequence controls the corresponding parameters (e.g., the camera's focus object, background blurring, focal length, sensitivity, exposure level, filter, etc.). In this example, according to the video shooting model, a large stone is selected as the focus object, and the user is prompted to move the smartphone along a semicircular trajectory.
In one example, the user may set the command sequence to be executed by himself or automatically by the photographing apparatus. When the user sets execution of the command sequence by himself, the photographing apparatus may receive the command sequence as the music is played, and prompt the user whether to execute this command sequence (e.g., prompt the user whether to adjust the exposure level to a certain value, or prompt the user whether to adjust the focus to a certain value, etc.). When the user sets that the shooting equipment automatically executes the command sequence, the shooting equipment can automatically call the received command sequence in sequence according to the received command sequence, so that the participation of the user in calling the functions of the shooting equipment is reduced, the user is enabled to concentrate on shooting contents, and the user experience is improved.
In one example, when the user stays in the same screen all the time, since the audio emotion feature information changes all the time as the audio expression unit advances, the command sequence sent to the user also changes with the change of the audio emotion feature information, and different shooting effects are generated even for the same viewfinder screen. For example, even under the same screen, a command sequence for guiding the user to switch a focus object (e.g., to switch the focus object from a large stone to a human sculpture), to apply different filters, to adjust screen rendering characteristics, and the like may be transmitted to the user to obtain the best photographing effect.
In one example, when the change of the framing picture taken by the user exceeds a certain threshold (e.g., when the user pauses the shooting after finishing shooting one scene and finds a second scene to continue shooting, objects such as a character sculpture, a flat ground, etc. that do not exist in the original shooting picture appear in a new shooting picture), the target video shooting model may be updated based on the audio emotional characteristic information expressed by the audio expression units A2, A3, and A4, and a new command sequence may be generated based on the updated target video shooting model to guide the user to obtain the best shooting effect. In this example, a sequence of commands may be sent to the user to sculpt the person in focus, adjust the ISO, adjust the exposure level, adjust the filters, prompt the user to sculpt this focus object around the person, move the capture device in the recommended trajectory, and so on.
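One simple way such a scene-change check could be realized is sketched below, using a mean absolute pixel difference against a threshold; the metric and the threshold value 0.35 are assumptions for illustration, not values specified by the patent.

```python
# Sketch of a scene-change check: update the command sequence when framing changes enough.
# The difference metric and threshold are illustrative assumptions.

import numpy as np

def scene_changed(prev_frame, new_frame, threshold=0.35):
    """Return True when the mean absolute pixel difference exceeds the threshold."""
    diff = np.abs(prev_frame.astype(float) - new_frame.astype(float)) / 255.0
    return diff.mean() > threshold

prev = np.zeros((120, 160, 3), dtype=np.uint8)
new = np.full((120, 160, 3), 200, dtype=np.uint8)
if scene_changed(prev, new):
    print("framing changed beyond threshold: regenerate the command sequence "
          "from the remaining audio expression units (A2, A3, A4)")
```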
After the video capture is complete, final processing (e.g., rectifying composition features of some video frames, adjusting rendering features of some video frames, deduplication, global tone unification, and synchronous rectification of the audio and video frames) may be performed on the video shot by the user in response to the command sequence, to generate the final video.
The generated target video can be sent to a user, and the user judges whether the video effect achieves the expression effect expected by the user through browsing and evaluates (e.g., scores) the finally generated target video.
Fig. 6 is a diagram illustrating shooting of a video using a drone according to an exemplary embodiment of the present invention.
In one example, as shown in fig. 6, when a user uses an automatically controllable device (e.g., a drone) for video shooting, the user may install an application APP of the video generation device on, for example, a smartphone, select music for use in this shooting on the APP, select a shooting device type (e.g., select the shooting device as a drone), and start drone framing shooting. In one example, the current view may be transmitted to the cloud server in real-time, but is not limited thereto.
Starting from the first video frame (also called the first frame picture), the video frames corresponding to the audio expression unit A1 are received, and a command sequence is generated based on the target video shooting model and sent to the user. The command sequence includes a first command sequence and a second command sequence. The first command sequence controls raising and lowering of the drone's lens, tracking shots, bird's-eye rotation, simulated jib-arm movement, arc push-in/pull-out, and the like, and the second command sequence prompts the user to control the corresponding parameters (e.g., the camera's focus object, background blurring, focal length, sensitivity, exposure level, filter, etc.).
In this example, according to the audio emotional feature information of the audio expression unit A1, combined with the current framing features, a command sequence is sent to the user instructing the drone to select a distant sculpture as the focus object and to use a vertical bottom-to-top pull-up shooting trajectory, where the change in the drone's pull-up speed is related to the audio emotional feature information. In the subsequent shooting, according to the differing audio emotional feature information of the audio expression units A2, A3, and A4, command sequences are sent to the user instructing the drone to select different focus objects and to complete the video shooting with different shooting trajectories (e.g., bird's-eye rotation, arc push-in/pull-out, simulated jib arm, etc.) and shooting parameters (e.g., exposure, rendering effect, etc.).
After the video capture is complete, final processing (e.g., rectifying composition features of some video frames, adjusting rendering features of some video frames, deduplication, global tone unification, and synchronous rectification of the audio and video frames) may be performed on the video shot by the user in response to the command sequence, to generate the final video.
The generated target video can be sent to a user, and the user judges whether the video effect reaches the expression effect expected by the user through browsing and evaluates (e.g., scores) the finally generated target video.
Fig. 7 is a diagram illustrating an apparatus for generating a video based on the AI technique.
As shown in fig. 7, the apparatus for generating a video according to the embodiment includes an audio receiving unit 10, a model generating unit 20, a command generating unit 30, a video generating unit 40, a video evaluating unit 50, and a final processing unit 60.
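The structural sketch below shows how the units listed above could cooperate in code; the class and method names mirror the unit names in Fig. 7, while the call signatures and the rating threshold are assumptions for illustration, not an implementation defined by the patent.

```python
# Structural sketch of the apparatus in Fig. 7 as cooperating units.
# Method names and the rating threshold (3) are illustrative assumptions.

class VideoGenerationApparatus:
    def __init__(self, audio_receiving_unit, model_generation_unit,
                 command_generation_unit, video_generation_unit,
                 video_evaluation_unit, final_processing_unit):
        self.audio_receiving_unit = audio_receiving_unit
        self.model_generation_unit = model_generation_unit
        self.command_generation_unit = command_generation_unit
        self.video_generation_unit = video_generation_unit
        self.video_evaluation_unit = video_evaluation_unit
        self.final_processing_unit = final_processing_unit

    def generate(self, user_audio, framing, device_type):
        audio = self.audio_receiving_unit.receive(user_audio)
        model = self.model_generation_unit.obtain_model(audio, framing, device_type)
        commands = self.command_generation_unit.generate(model)
        video = self.video_generation_unit.guide_shooting(commands, audio)
        score = self.video_evaluation_unit.collect_rating(video)
        if score is not None and score < 3:          # assumed threshold for optimization
            self.model_generation_unit.optimize(model, score)
        return self.final_processing_unit.finalize(video, audio)
```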
The audio receiving unit 10 receives the audio selected by the user. In one example, the audio selected by the user may be music already stored in the video generating device or music newly uploaded by the user to the video generating device.
The model generation unit 20 obtains a target video shooting model based on the audio selected by the user.
After receiving the user-selected audio, it is determined whether the user-selected audio is in the music library. If the audio selected by the user is not in the music library, performing emotional characteristic analysis on the audio selected by the user to obtain a target video shooting model; and if the audio selected by the user is in the music library, selecting a video shooting model corresponding to the audio selected by the user from a plurality of pre-stored video shooting models as a target video shooting model in combination with the framing characteristics of the shooting scene and the type of the shooting equipment.
The step of performing emotional characteristic analysis on the audio selected by the user to obtain the target video shooting model may include: performing emotional characteristic analysis on the audio selected by the user to generate audio emotional characteristic information; and selecting a video shooting model corresponding to the audio with audio emotional characteristic information which is most similar to the audio emotional characteristic information of the audio selected by the user from the plurality of pre-stored video shooting models by combining the view finding characteristic of the shooting scene and the type of the shooting equipment, and taking the video shooting model as a target video shooting model.
Further, the audio in the music library comes from a predetermined number of video files. These video files may be files that the user manually imports as preferred references, or files that are automatically searched for and downloaded from dedicated video websites according to label classification. Such video files are shot with relatively professional technique, have high shooting quality, or adopt a currently popular shooting and rendering style, and they cover shooting modes using various shooting devices (for example, a smartphone with a gimbal (pan-tilt head), a drone, and the like). The music library includes the audio extracted from a large number of such video files. By training the video shooting model with these video files, a video shooting model that closely matches the received audio can be obtained, better assisting the user in completing video shooting that meets the intended expression.
Audio emotional feature information may be generated by performing emotional feature analysis on the audio in the music library according to the Valence-Arousal emotion model, so as to obtain a video shooting model corresponding to the audio having that emotional feature information; a model obtained in this way is referred to as a pre-stored video shooting model. The Valence-Arousal emotion model describes emotion along two dimensions, valence and arousal, representing the degree of pleasure (valence) and the degree of excitement (arousal) that an emotion brings to a person.
After receiving the user-selected audio, it is determined whether the user-selected audio is audio in a music library. If the audio selected by the user is an audio in the music library, a video photographing model corresponding to the audio selected by the user is selected from among the pre-stored video photographing models as a target video photographing model. And if the audio selected by the user is not in the music library, performing emotional characteristic analysis on the audio selected by the user to generate audio emotional characteristic information, searching the audio with the audio emotional characteristic information which is most similar to the audio emotional characteristic information of the audio selected by the user in the music library, and taking the video shooting model corresponding to the audio as a target video shooting model.
The video shooting model corresponding to any one of the audios can be obtained by: performing emotional characteristic analysis on any audio to generate audio emotional characteristic information of any audio; generating spectrogram slicing feature information of any audio frequency based on the audio frequency emotion feature information of any audio frequency through a music emotion model; acquiring video shooting characteristics based on the generated spectrogram slice characteristic information, the view finding characteristics of a shooting scene and the type of shooting equipment through a video emotion model; and generating a video shooting model based on the video shooting characteristics and a command sequence for guiding a user to shoot, wherein the music emotion model represents the relationship between the audio emotion characteristic information and the spectrogram slice characteristic information, and the video emotion model represents the relationship between the spectrogram slice characteristic information, the view finding characteristics of a shooting scene, the type of shooting equipment and the video shooting characteristics.
In one example, the given audio may be audio in the music library. The content of the audio may be a piece of music. Music is an art that expresses emotion and serves as a carrier for human thought and intention. Music can express, to different degrees, emotions such as joy, harmony, lightness, happiness, fury, anxiety, worry, grief, and pain, as well as emotions such as reverence and worship. In one example, the audio in the music library may be subjected to emotional feature analysis according to the Valence-Arousal emotion model to generate audio emotional feature information (e.g., e1, e2, e3, e4, e5, e6, e7, e8 in fig. 3). In another example, the audio in the music library may be emotionally labeled by means of a big-data questionnaire to generate the audio emotional feature information of the audio in the music library.
As shown in fig. 3 above, the audio may be represented using a spectrogram (also called a speech spectrogram), i.e., a graph obtained by Fourier analysis of the audio signal, with frequency on the vertical axis and time on the horizontal axis. The spectrogram combines the characteristics of a frequency spectrum and a time-domain waveform and shows how the spectrum of the signal changes over time. According to the audio emotional feature information (e.g., e1, e2, e3, e4, e5, e6, e7, e8) generated by the emotional feature analysis of the audio in the music library, the spectrogram corresponding to the audio is divided, and spectrogram slice feature information corresponding to the audio emotional feature information is generated.
In one example, a complete piece of audio corresponds to a sequence of audio expression units. A sequence of audio expression units may comprise a plurality of audio expression units. An expression unit is the smallest unit that can define an emotion. As shown in fig. 3 above, a plurality of audio expression units are shown as u1, u2, u3, u4, u5, u6, u7 and u8, and the audio emotional feature information e1, e2, e3, e4, e5, e6, e7, e8 corresponds to the audio expression units u1, u2, u3, u4, u5, u6, u7 and u8, respectively. Each audio expression unit corresponds to the spectrogram slice feature information corresponding to its audio emotional feature information. In other words, an audio expression unit may include audio emotional feature information and the spectrogram slice feature information corresponding to that emotional feature information.
In one example, the music emotion model may be generated using a deep learning technique (e.g., a convolutional neural network (CNN), a recurrent neural network (RNN), etc.), trained with a set of audio emotional feature information and the corresponding set of spectrogram slice feature information. The music emotion model represents the relationship between audio emotional feature information and spectrogram slice feature information, and can automatically generate, from audio emotional feature information, the spectrogram slice feature information corresponding to that emotional feature information, without the step of dividing the spectrogram of the audio to produce the slice feature information.
Video shooting features are then obtained, through the video emotion model, based on the generated spectrogram slice feature information.
In one example, videos corresponding to audio in a music library are divided based on generated spectrogram slice feature information, generating video slices (e.g., v1, v2, v3, v4, v5, v6, v7, v 8). That is, the audio affective characteristic information e1, e2, e3, e4, e5, e6, e7, e8 corresponds to the video slices v1, v2, v3, v4, v5, v6, v7, v8, respectively.
In one example, after a video corresponding to audio is divided, video capture features are extracted from a video slice (e.g., a certain video frame) according to a specific scene (e.g., portrait, building, mountain, grassland, etc.) using a deep learning technique, an image processing technique, a simultaneous localization and mapping SLAM technique, and the like.
The video emotion model can be generated using a deep learning technique (e.g., CNN, RNN, etc.), trained with sets of spectrogram slice feature information, framing features of the shooting scene, and shooting device types, together with the corresponding set of video shooting features. The video emotion model represents the relationship between the spectrogram slice feature information, the framing feature of the shooting scene, the shooting device type, and the video shooting features.
In one example, the video capture features may include: one or more of a video frame rendering feature, a capture focus object in a capture scene, a framing feature of a capture scene, a capture perspective, a capture device type, and a perspective change trajectory.
In one example, the video shooting model can be generated using a deep learning technique (e.g., CNN, RNN, etc.), trained with a set of video shooting features and the corresponding set of command sequences. The video shooting model represents the relationship between video shooting features and the command sequence used to guide the user in shooting. That is, a command sequence that guides the user to shoot can be generated from the video shooting features through the video shooting model.
In one example, the command sequence that guides the user to shoot may include: a first command sequence for one or more of controlling the motion of the shooting device, changing the shooting perspective, and adjusting the composition; and a second command sequence for controlling corresponding parameters, wherein the corresponding parameters include one or more of: the focus object, background blurring, focal length, sensitivity (ISO), exposure level, and filter of the camera.
In other words, the video shooting features obtained from the video frames are converted into the first command sequence and the second command sequence that guide the user in shooting. The perspective change trajectory in the video shooting features may be converted into a command guiding the user to control the motion of the shooting device; the shooting perspective may be converted into a command guiding the user to change the shooting perspective; and the framing feature of the shooting scene may be converted into a command guiding the user to adjust the composition. The video frame rendering feature and the focus object in the shooting scene may be converted into the second command sequence guiding the user to control the corresponding parameters.
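As a non-limiting illustration, the following Python sketch (all field and key names are assumptions) shows one way the first and second command sequences could be represented and how individual video shooting features could be mapped onto them:

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class GuidanceCommands:
    first: List[str] = field(default_factory=list)               # motion / perspective / composition
    second: List[Dict[str, Any]] = field(default_factory=list)   # parameter settings

def to_commands(capture: Dict[str, Any]) -> GuidanceCommands:
    cmd = GuidanceCommands()
    if "perspective_trajectory" in capture:
        cmd.first.append(f"move device along {capture['perspective_trajectory']}")
    if "shooting_perspective" in capture:
        cmd.first.append(f"change shooting perspective to {capture['shooting_perspective']}")
    if "framing" in capture:
        cmd.first.append(f"adjust composition: {capture['framing']}")
    if "focus_object" in capture:
        cmd.second.append({"focus": capture["focus_object"]})
    if "rendering" in capture:
        cmd.second.append({"iso": capture["rendering"].get("iso"),
                           "exposure": capture["rendering"].get("exposure"),
                           "filter": capture["rendering"].get("filter")})
    return cmd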
The command sequences also differ depending on the shooting device. For example, when the shooting device is a portable mobile terminal, the first command sequence is a command prompting the user to move the shooting device and to adjust one or more of the composition feature and the shooting perspective. When the shooting device is a drone, the first command sequence includes one or more of commands to control the framing, panning, tracking shots, bird's-eye rotation, simulated swing arm, and arc propulsion/zoom-out of the drone's lens.
Further, when the user performs video shooting with a mobile communication terminal (e.g., a smartphone) as the shooting device, the first command sequence may prompt the user to move the shooting device along a recommended trajectory, to adjust composition features (e.g., the proportion of a character sculpture in the shot picture, or whether the character sculpture is centered in the shot picture), to adjust the shooting perspective, and the like. When the user performs video shooting with a drone, the first command sequence may control the framing and panning of the drone's lens, tracking shots, bird's-eye rotation, simulated swing arm, arc propulsion/zoom-out, and the like.
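For illustration, a minimal sketch (an assumption, not a definitive implementation) of how the first command sequence could differ by shooting device type: a phone receives prompts for the user, while a drone receives lens-motion commands it can execute:

def first_commands(device_type: str, trajectory: str) -> list:
    if device_type == "phone":
        return [f"prompt: move the phone along {trajectory}",
                "prompt: adjust the composition as guided",
                "prompt: adjust the shooting perspective as guided"]
    if device_type == "drone":
        return [f"drone: fly along {trajectory}",
                "drone: track the focus object",
                "drone: bird's-eye rotation",
                "drone: simulated swing arm, then arc propulsion/zoom-out"]
    return []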
In one example, a complete piece of video corresponds to a sequence of video expression units, and a sequence of video expression units includes a plurality of video expression units. As shown in FIG. 4 above, a plurality of video expression units are shown as m1, m2, m3, m4, m5, m6, m7 and m8. As shown in FIG. 3 above, the audio emotional feature information e1, e2, e3, e4, e5, e6, e7 and e8 corresponds to the video slices v1 to v8 and to the video expression units m1 to m8, respectively. Each video expression unit corresponds to the command sequence that corresponds to its audio emotional feature information.
The command generation unit 30 generates a command sequence based on the target video shooting model.
The video generation unit 40 transmits the command sequence for guiding shooting to the user, so as to guide the user to shoot the video while the audio is played.
In one example, when the user shoots a video, a command sequence as shown in FIG. 5 below (e.g., focusing on the character sculpture and the stone, adjusting the sensitivity ISO, the exposure level and the filter, prompting the user to move the shooting device around the character sculpture, which is the focus object, along a recommended trajectory, etc.) may be transmitted to the user. The user may set the command sequence to be executed manually or to be executed automatically by the shooting device, so that the posture of the shooting device can be adjusted in real time for an optimal shooting perspective, an optimal composition, and so on. In one example, when the user sets the command sequence to be executed automatically by the shooting device, the shooting device may invoke the received commands in turn. This reduces the user's involvement in invoking functions of the shooting device and allows the user to concentrate on the content being shot, thereby improving the user experience.
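As a non-limiting illustration, the following Python sketch (names and the pacing are assumptions) shows how the received command sequence could be dispatched while the audio plays, either as prompts to the user or as calls invoked automatically on the shooting device:

import time

def run_command_sequence(commands, auto_execute, device, interval=2.0):
    for cmd in commands:
        if auto_execute:
            device.invoke(cmd)          # device applies focus/ISO/exposure/motion itself
        else:
            print(f"Guide: {cmd}")      # user performs the adjustment manually
        time.sleep(interval)            # stay roughly in step with audio playback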
The video evaluation unit 50 receives the user's evaluation of the shot video and optimizes the target video shooting model in response to that evaluation; specifically, the target video shooting model is optimized when the user's score for the shot video is less than a specific value.
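For illustration, a minimal sketch of this feedback loop; the score threshold and the fine-tuning helper are assumptions:

SCORE_THRESHOLD = 3.0    # assumed "specific value" on a 1-5 rating scale

def on_user_score(score: float, model, captured_features, command_sequence):
    if score < SCORE_THRESHOLD:
        # Re-train / fine-tune on this example so future guidance improves.
        fine_tune(model, captured_features, command_sequence)   # hypothetical helper

def fine_tune(model, features, commands):
    pass   # placeholder: one optimization step over (features, commands)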
The final processing unit 60 performs final processing on the video shot by the user in response to the command sequence, to generate a final video. The final processing includes performing, on the video data, one or more of: rectifying the composition features of some video frames, adjusting the rendering features of some video frames, deduplication, global tone unification, and synchronous rectification of the audio and video frames.
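As a non-limiting illustration, a minimal Python sketch (step names are assumptions) of such a final processing pipeline, applying each step to the video data in turn:

def finalize(video_data):
    steps = [
        rectify_composition,      # correct composition of some frames
        adjust_rendering,         # adjust rendering of some frames
        deduplicate_frames,       # remove duplicated content
        unify_global_tone,        # global tone unification
        sync_audio_and_video,     # synchronous rectification of audio and video frames
    ]
    for step in steps:
        video_data = step(video_data)
    return video_data

# Placeholder implementations; a real pipeline would apply image/video processing here.
def rectify_composition(v): return v
def adjust_rendering(v): return v
def deduplicate_frames(v): return v
def unify_global_tone(v): return v
def sync_audio_and_video(v): return v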
There is also provided, in accordance with an exemplary embodiment of the present invention, an apparatus for generating a video. The apparatus comprises: a processor and a memory. The memory is for storing a computer program. The computer program, when executed by a processor, causes the processor to perform the above-described method of generating a video.
There is also provided, in accordance with an exemplary embodiment of the present invention, a computer readable storage medium having a computer program stored therein. The computer program, when executed, performs the above-described method of generating a video.
Further, it should be understood that each unit in the apparatus for generating a video according to an exemplary embodiment of the present invention may be implemented as a hardware component and/or a software component. For example, those skilled in the art may implement the individual units using field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs), according to the processing performed by each unit.
Also, the method of generating a video according to an exemplary embodiment of the present invention may be implemented as a computer program in a computer-readable recording medium. A person skilled in the art can implement the computer program from the description of the method above; the method of the present invention is performed when the computer program is executed on a computer.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the following claims.

Claims (20)

1. A method of generating a video, the method comprising:
receiving audio selected by a user;
obtaining a target video shooting model based on the audio selected by the user;
generating a command sequence based on the target video shooting model;
sending a command sequence for guiding the user to shoot to the user, guiding the user to shoot the video while playing the audio,
the step of obtaining the target video shooting model based on the audio selected by the user comprises the following steps:
if the audio selected by the user is not in the music library, selecting, from a plurality of pre-stored video shooting models and in combination with the framing feature of the shooting scene and the shooting device type, the video shooting model corresponding to the audio whose audio emotional feature information is most similar to the audio emotional feature information of the audio selected by the user, as the target video shooting model.
2. The method of claim 1, wherein the step of obtaining the target video capture model based on the user-selected audio further comprises:
and if the audio selected by the user is in the music library, selecting a video shooting model corresponding to the audio selected by the user from a plurality of pre-stored video shooting models as a target video shooting model in combination with the framing characteristics of the shooting scene and the type of the shooting equipment.
3. The method of claim 1, wherein the step of obtaining the target video capture model based on the user-selected audio further comprises:
and if the audio selected by the user is not in the music library, performing emotional characteristic analysis on the audio selected by the user to generate audio emotional characteristic information.
4. The method of claim 2, wherein the video capture model corresponding to any audio is obtained by:
performing emotional characteristic analysis on any audio to generate audio emotional characteristic information of any audio;
generating spectrogram slicing feature information of any audio based on the audio emotion feature information of any audio through a music emotion model;
acquiring video shooting characteristics based on the generated spectrogram slice characteristic information, the view finding characteristics of a shooting scene and the type of shooting equipment through a video emotion model;
generating a video capture model based on the video capture features and a sequence of commands for directing a user to capture,
wherein, the music emotion model represents the relationship between the audio emotion characteristic information and the spectrogram segmentation characteristic information,
the video emotion model embodies the relation between spectrogram slice characteristic information, view finding characteristics of a shooting scene, shooting equipment type and video shooting characteristics.
5. The method of claim 4, wherein the video capture features comprise: one or more of a video frame rendering feature, a capture focus object in a capture scene, a framing feature of a capture scene, a capture perspective, a capture device type, and a perspective change trajectory.
6. The method of claim 1, wherein the command sequence comprises:
a first command sequence for one or more of controlling a photographing apparatus to move, changing a photographing angle of view, and adjusting a composition;
a second command sequence for controlling the corresponding parameter,
wherein, the corresponding parameters include: one or more of a focus object, a blurred background, a focal length, a sensitivity, an exposure level, and a filter of the camera.
7. The method of claim 6, wherein,
when the shooting equipment is a portable mobile terminal, the first command sequence is a command for prompting a user to move the shooting equipment and adjusting one or more of composition characteristics and shooting visual angles;
when the capture device is a drone, the first command sequence includes one or more of commands to control the panning, track capture, bird's eye rotation, simulated swing arm, and arc propulsion/zoom out of the lens of the drone.
8. The method of claim 1, further comprising:
receiving evaluation of a user on the shot video;
optimizing a target video capture model in response to a user's evaluation of captured video,
and when the score of the user on the shot video is smaller than a specific value, optimizing the target video shooting model.
9. The method of claim 1, further comprising:
final processing of the video captured by the user in response to the command sequence, generating a final video,
wherein the final processing step comprises: performing, on the video data, one or more of: rectifying composition features of some video frames, adjusting rendering features of some video frames, deduplication, global tone unification, and synchronous rectification of the audio and video frames.
10. An apparatus to generate video, the apparatus comprising:
an audio receiving unit receiving an audio selected by a user;
the model generation unit is used for obtaining a target video shooting model based on the audio selected by the user;
a command generation unit that generates a command sequence based on the target video shooting model;
a video generating unit that transmits a command sequence for guiding a user to take a picture to the user to guide the user to take a video while playing audio,
wherein the process of obtaining the target video shooting model based on the audio selected by the user comprises:
if the audio selected by the user is not in the music library, selecting, from a plurality of pre-stored video shooting models and in combination with the framing feature of the shooting scene and the shooting device type, the video shooting model corresponding to the audio whose audio emotional feature information is most similar to the audio emotional feature information of the audio selected by the user, as the target video shooting model.
11. The apparatus of claim 10, wherein the process of obtaining the target video capture model based on the user-selected audio further comprises:
and if the audio selected by the user is in the music library, selecting a video shooting model corresponding to the audio selected by the user from a plurality of pre-stored video shooting models as a target video shooting model in combination with the framing characteristics of the shooting scene and the type of the shooting equipment.
12. The apparatus of claim 11, wherein the process of obtaining the target video capture model based on the user-selected audio further comprises:
and if the audio selected by the user is not in the music library, performing emotional characteristic analysis on the audio selected by the user to generate audio emotional characteristic information.
13. The apparatus of claim 11, wherein the video shooting model corresponding to any one of the audios is obtained by:
performing emotional characteristic analysis on any audio to generate audio emotional characteristic information of any audio;
generating spectrogram slicing feature information of any audio frequency based on the audio frequency emotion feature information of any audio frequency through a music emotion model;
acquiring video shooting characteristics based on the generated spectrogram slice characteristic information, the view finding characteristics of a shooting scene and the type of shooting equipment through a video emotion model;
generating a video capture model based on the video capture characteristics and a sequence of commands for guiding a user in capturing,
wherein, the music emotion model embodies the relationship between the audio emotion characteristic information and the spectrogram slice characteristic information,
the video emotion model embodies the relation among spectrogram slicing feature information, view finding features of shooting scenes, shooting equipment types and video shooting features.
14. The apparatus of claim 13, wherein the video capture features comprise: one or more of a video frame rendering feature, a capture focus object in a capture scene, a framing feature of a capture scene, a capture perspective, a capture device type, and a perspective change trajectory.
15. The apparatus of claim 10, wherein the command sequence comprises:
a first command sequence for one or more of controlling a photographing apparatus motion, changing a photographing angle of view, and adjusting a composition;
a second command sequence for controlling the corresponding parameter,
wherein, the corresponding parameters include: one or more of a focus object, a blurred background, a focal length, a sensitivity, an exposure level, and a filter of the camera.
16. The apparatus of claim 15, wherein,
when the shooting equipment is a portable mobile terminal, the first command sequence is a command for prompting a user to move the shooting equipment and adjusting one or more of composition characteristics and shooting visual angles;
when the capture device is a drone, the first command sequence includes one or more of commands to control the panning, track capture, bird's eye rotation, simulated swing arm, and arc propulsion/zoom out of the lens of the drone.
17. The apparatus of claim 10, further comprising:
receiving evaluation of a user on the shot video;
optimizing a target video capture model in response to a user's evaluation of captured video,
and when the score of the user on the shot video is smaller than a specific value, optimizing the target video shooting model.
18. The apparatus of claim 10, further comprising:
final processing of the video captured by the user in response to the command sequence, generating a final video,
wherein the final processing step comprises: performing one or more of rectifying a composition feature of a partial video frame, adjusting a partial video frame rendering feature, deduplication, global tone unification, and simultaneous rectification processing of audio and video frames on the video data.
19. An apparatus for generating video, the apparatus comprising:
a processor;
memory storing a computer program which, when executed by the processor, performs the method of any one of claims 1 to 9.
20. A computer readable storage medium having a computer program stored therein, which when executed performs the method of any one of claims 1 to 9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110025035.0A CN112887588B (en) 2021-01-08 2021-01-08 Method and apparatus for generating video

Publications (2)

Publication Number Publication Date
CN112887588A CN112887588A (en) 2021-06-01
CN112887588B true CN112887588B (en) 2023-04-07

Family

ID=76047385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110025035.0A Active CN112887588B (en) 2021-01-08 2021-01-08 Method and apparatus for generating video

Country Status (1)

Country Link
CN (1) CN112887588B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6093289B2 (en) * 2013-12-10 2017-03-08 株式会社フレイ・スリー Video processing apparatus, video processing method, and program
CN108377334B (en) * 2018-04-03 2021-06-04 阿里巴巴(中国)有限公司 Short video shooting method and device and electronic terminal
CN108900768A (en) * 2018-07-12 2018-11-27 北京微播视界科技有限公司 Video capture method, apparatus, terminal, server and storage medium
CN109120992A (en) * 2018-09-13 2019-01-01 北京金山安全软件有限公司 Video generation method and device, electronic equipment and storage medium
CN114554301A (en) * 2018-09-29 2022-05-27 深圳市大疆创新科技有限公司 Video processing method, video processing device, shooting system and computer readable storage medium
WO2021012081A1 (en) * 2019-07-19 2021-01-28 深圳市大疆创新科技有限公司 Gimbal control method and device, and computer readable storage medium

Similar Documents

Publication Publication Date Title
US11321385B2 (en) Visualization of image themes based on image content
US11210768B2 (en) Digital image auto exposure adjustment
US7805066B2 (en) System for guided photography based on image capturing device rendered user recommendations according to embodiments
US11223760B2 (en) Video processing method and device, shooting system, and computer-readable storage medium
US11785328B2 (en) System and camera device for capturing images
CN111083138B (en) Short video production system, method, electronic device and readable storage medium
CN112702521B (en) Image shooting method and device, electronic equipment and computer readable storage medium
CN111756996A (en) Video processing method, video processing apparatus, electronic device, and computer-readable storage medium
CN108156385A (en) Image acquiring method and image acquiring device
CN106101576B (en) A kind of image pickup method, device and the mobile terminal of augmented reality photo
US20220284637A1 (en) Customizing soundtracks and hairstyles in modifiable videos of multimedia messaging application
US10582125B1 (en) Panoramic image generation from video
CN111314620B (en) Photographing method and apparatus
CN113709545A (en) Video processing method and device, computer equipment and storage medium
WO2018057449A1 (en) Auto-directing media construction
CN112887588B (en) Method and apparatus for generating video
CN114697539A (en) Photographing recommendation method and device, electronic equipment and storage medium
CN110047115B (en) Star image shooting method and device, computer equipment and storage medium
WO2013187796A1 (en) Method for automatically editing digital video files
CN110139021B (en) Auxiliary shooting method and terminal equipment
KR20200028830A (en) Real-time computer graphics video broadcasting service system
WO2023178589A1 (en) Filming guiding method, electronic device, system and storage medium
JP6632134B2 (en) Image processing apparatus, image processing method, and computer program
CN112188085B (en) Image processing method and handheld pan-tilt camera
CN116309918B (en) Scene synthesis method and system based on tablet personal computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant