WO2021057740A1 - Video generation method, apparatus, electronic device, and computer-readable medium - Google Patents

Video generation method, apparatus, electronic device, and computer-readable medium

Info

Publication number
WO2021057740A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
duration
audio
music
initial
Application number
PCT/CN2020/116921
Other languages
English (en)
French (fr)
Inventor
王妍 (Wang Yan)
刘舒 (Liu Shu)
Original Assignee
北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Application filed by 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Priority to KR1020227010159A (published as KR20220045056A)
Priority to BR112022005713A (published as BR112022005713A2)
Priority to JP2022519290A (published as JP7355929B2)
Priority to EP20868358.1A (published as EP4024880A4)
Publication of WO2021057740A1
Priority to US17/706,542 (published as US11710510B2)

Classifications

    • H04N21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V20/35: Categorising the entire scene, e.g. birthday party or wedding scene
    • G10L25/57: Speech or voice analysis techniques specially adapted for comparison or discrimination, for processing of video signals
    • G11B27/031: Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/036: Insert-editing
    • G11B27/10: Indexing; addressing; timing or synchronising; measuring tape travel
    • G11B27/19: Indexing, addressing, timing or synchronising by using information detectable on the record carrier
    • G11B27/28: Indexing, addressing, timing or synchronising by using information signals recorded by the same method as the main recording
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream
    • H04N21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N21/8106: Monomedia components involving special audio data, e.g. different tracks for different languages
    • H04N21/8547: Content authoring involving timestamps for synchronizing content

Definitions

  • The embodiments of the present disclosure relate to the field of computer technology, and in particular to video generation methods, apparatuses, electronic devices, and computer-readable media.
  • This summary is provided to introduce concepts in a brief form; these concepts are described in detail in the detailed description that follows.
  • This summary is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
  • The purpose of some embodiments of the present disclosure is to propose an improved video generation method, apparatus, electronic device, and computer-readable medium to solve the technical problems mentioned in the background section above.
  • In a first aspect, some embodiments of the present disclosure provide a video generation method, the method including: acquiring image material and audio material, wherein the image material includes picture material; determining music points of the audio material, wherein the music points are used to divide the audio material into multiple audio segments; using the image material, generating one video segment for each music segment in the audio material to obtain multiple video segments, wherein corresponding music segments and video segments have the same duration; and splicing the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and adding the audio material as the video sound track to obtain a synthesized video.
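  • For illustration only (this sketch is not part of the patent text), the method of the first aspect can be summarized in Python as below; every helper name (determine_music_points, split_audio_at, make_video_segment, splice_with_audio) is a hypothetical placeholder for a step described above:

        def generate_video(image_materials, audio_material):
            # Determine music points and divide the audio into segments.
            points = determine_music_points(audio_material)
            music_segments = split_audio_at(audio_material, points)
            # Generate one video segment per music segment; corresponding
            # music and video segments have the same duration.
            video_segments = [make_video_segment(image_materials, seg.duration)
                              for seg in music_segments]
            # Splice the video segments in the order their music segments
            # appear, then add the audio material as the sound track.
            return splice_with_audio(video_segments, audio_material)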
  • In a second aspect, some embodiments of the present disclosure provide a video generation apparatus.
  • The apparatus includes: an acquisition unit configured to acquire image material and audio material, wherein the image material includes picture material; a determination unit configured to determine music points of the audio material, wherein the music points are used to divide the audio material into multiple audio segments; a generation unit configured to use the image material to generate one video segment for each music segment in the audio material to obtain multiple video segments, wherein corresponding music segments and video segments have the same duration; and a synthesis unit configured to splice the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and to add the audio material as the video sound track to obtain a synthesized video.
  • In a third aspect, some embodiments of the present disclosure provide an electronic device, including: one or more processors; and a storage apparatus on which one or more programs are stored which, when executed by the one or more processors, cause the one or more processors to implement the method of any implementation of the first aspect.
  • In a fourth aspect, some embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any implementation of the first aspect.
  • One of the above embodiments of the present disclosure has the following beneficial effects: dividing the audio at music points makes it possible to generate the individual video segments of the synthesized video, which reduces the time users spend processing materials and makes editing simpler. Furthermore, the video segments in the synthesized video can be generated from picture materials, so that when a user has no video material, or very little, the user can still edit a video from pictures, making the edited video content more diverse.
  • FIGS. 1A-1D are schematic diagrams of an application scenario of a video generation method according to some embodiments of the present disclosure;
  • FIG. 2 is a flowchart of some embodiments of a video generation method according to the present disclosure;
  • FIGS. 3A-3D are schematic diagrams of an application scenario of picture material motion according to some embodiments of the present disclosure;
  • FIG. 4 is a flowchart of still other embodiments of a video generation method according to the present disclosure;
  • FIG. 5 is a schematic structural diagram of some embodiments of a video generation apparatus according to the present disclosure;
  • FIG. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to embodiments of the present disclosure.
  • FIGS. 1A-1D are schematic diagrams of different application scenarios of a video generation method according to some embodiments of the present disclosure.
  • As shown in the application scenario of FIG. 1A, the user may first select multiple image materials on the upload page 1017 of the terminal device 101, for example, pictures 1011-1014 shown on page 1017.
  • The user clicks the position shown by selection box 1015 to select picture 1011 and picture 1012, then clicks the "Next" button 1016, and the terminal device 101 generates image material 104 and image material 105 from the selected pictures.
  • Based on the number of image materials obtained (shown as 2 in the figure), the music point 107 in the acquired audio material 106 is determined.
  • The audio material 106 is divided into music segment A and music segment B at music point 107.
  • Image material 104 and image material 105 are processed according to the durations of music segment A and music segment B to obtain the corresponding video segments 1041 and 1051.
  • Video segments 1041 and 1051 are spliced according to the times at which music segment A and music segment B appear in the audio material 106, and the audio material 106 is added as the audio track of the spliced video to obtain the synthesized video 108.
  • Unlike FIG. 1A, in the scenarios of FIGS. 1B-1D the terminal device 101 sends image information 102, which includes the number of image materials (shown as 2 in the figure), to the server 103.
  • As shown in FIG. 1C, the server 103 determines the music point 107 in the acquired audio material 106 and divides the audio material 106 into music segment A and music segment B at music point 107.
  • As shown in FIG. 1D, the server 103 sends information 109, which includes the durations of music segment A and music segment B, to the terminal device 101.
  • The terminal device 101 processes image material 104 and image material 105 according to the durations of music segment A and music segment B to obtain the corresponding video segments 1041 and 1051.
  • The duration of video segment 1041 equals that of music segment A, and the duration of video segment 1051 equals that of music segment B.
  • The terminal device 101 splices video segments 1041 and 1051 according to the times at which music segment A and music segment B appear in the audio material 106, and adds the audio material 106 as the audio track of the spliced video to obtain the synthesized video 108.
  • It can be understood that the video generation method may be executed by the terminal device 101, by the server 103, through interaction between the terminal device 101 and the server 103, or by various software programs.
  • The terminal device 101 may be, for example, any of various electronic devices with a display screen, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, desktop computers, and so on.
  • The execution body may also be embodied as the server 103, as software, and so on.
  • When the execution body is software, it may be installed in the electronic devices listed above and may be implemented, for example, as multiple pieces of software or software modules for providing distributed services, or as a single piece of software or software module. No specific limitation is made here.
  • It should be understood that the numbers of mobile phones and servers in FIG. 1 are merely illustrative; there may be any number of mobile phones and servers according to implementation needs.
  • With continued reference to FIG. 2, a flow 200 of some embodiments of the video generation method according to the present disclosure is shown. The video generation method includes the following steps:
  • Step 201: acquire image material and audio material.
  • In some embodiments, the execution body of the video generation method may acquire the image material and the audio material through a wired or wireless connection, wherein the image material includes picture material.
  • As an example, the picture material may be pictures stored locally by the user or pictures downloaded by the user from the Internet.
  • The audio material may be music stored locally by the user or music on the Internet.
  • In some optional implementations, the image material may include video material in addition to picture material.
  • As an example, the video material may be a video uploaded by the user, a video stored locally by the user, or a video downloaded by the user from the Internet.
  • Since the image material can include both video material and picture material, the types of usable image material are increased.
  • Step 202: determine the music points of the audio material.
  • In some embodiments, the execution body may first determine candidate music points of the audio material.
  • A candidate music point may be a point in the audio material that satisfies a set beat-change condition.
  • The execution body may then select a target number of music points from the candidate music points obtained.
  • The target number may be determined according to the number of acquired image materials, determined according to the number of strong beats in the audio material, or set by the user.
  • As an example, when 10 image materials are acquired, 9 music points may be determined.
  • A strong beat is usually a beat with strong musical intensity.
  • As an example, a candidate music point may be a position in the audio material where a set change in musicality occurs.
  • Positions where the musicality changes may include positions where the beat changes and positions where the melody changes.
  • On this basis, candidate music points may be determined as follows: the execution body analyzes the audio material to determine beat points and note onset points, where a beat point is a position where the beat changes and a note onset point is a position where the melody changes.
  • Specifically, on the one hand, a deep-learning-based beat analysis algorithm may be used to analyze the audio material to obtain the beat points in the audio material and their timestamps; on the other hand, a short-time spectral analysis may be performed on the audio material to obtain the note onset points in the audio material and their timestamps.
  • Here, the note onset points may be obtained by an onset detector. The beat points and note onset points obtained in these two ways are then unified, merged, and de-duplicated to obtain the candidate music points; a code sketch of this procedure follows.
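  • For illustration only, a minimal sketch of the beat-plus-onset analysis described above, using the librosa audio library; the 50 ms de-duplication window and the even spread used to select the target number of points are assumptions rather than details from the patent:

        import numpy as np
        import librosa

        def candidate_music_points(audio_path):
            y, sr = librosa.load(audio_path)
            # Beat analysis: beat points as timestamps in seconds.
            _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
            beat_times = librosa.frames_to_time(beat_frames, sr=sr)
            # Short-time spectral analysis: note onset points as timestamps.
            onset_times = librosa.onset.onset_detect(y=y, sr=sr, units='time')
            # Unify the two point sets, then merge and de-duplicate
            # (points closer than 50 ms count as the same point).
            merged = np.sort(np.concatenate([beat_times, onset_times]))
            if merged.size == 0:
                return []
            candidates = [merged[0]]
            for t in merged[1:]:
                if t - candidates[-1] > 0.05:
                    candidates.append(t)
            return candidates

        def select_music_points(candidates, target_number):
            # Keep a target number of points, here spread evenly.
            idx = np.linspace(0, len(candidates) - 1, num=target_number)
            return [candidates[i] for i in np.unique(idx.astype(int))]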
  • Step 203: using the image material, generate one video segment for each music segment in the audio material to obtain multiple video segments.
  • In some embodiments, for each music segment in the audio material, the execution body may generate, based on the image material, a video segment with the same duration as that music segment.
  • As an example, if the audio material is divided into 3 music segments whose durations are 1 second, 2 seconds, and 3 seconds, the durations of the corresponding video segments may also be 1 second, 2 seconds, and 3 seconds.
  • As one example, the execution body may generate multiple video segments from one image material.
  • For example, suppose the execution body acquires a 10-second image material and an 8-second audio material, and divides the audio material into 3 audio segments of 2 seconds, 3 seconds, and 5 seconds according to the music points; the execution body may then cut 3 different video segments of 2 seconds, 3 seconds, and 5 seconds out of the image material. As another example, the execution body may generate one video segment from one image material.
  • For example, when one video material is used to generate a video segment for a music segment: if the duration of the video material is greater than that of the music segment, a video segment equal in duration to the music segment is cut from the original material; if the duration of the video material is less than that of the music segment, the original material is speed-changed to lengthen its duration, and the speed-changed material is used as the video segment, so that the duration of the video segment equals that of the music segment. It can be understood that, for the picture material in the image material, a variety of implementations can be used to generate video segments from the picture material. A sketch of the video-material case follows.
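  • As an illustration of the trim-or-speed-change rule above, a sketch using the moviepy library (the start-of-clip cut point is an assumption; the patent does not say where the segment is taken from):

        from moviepy.editor import VideoFileClip
        from moviepy.video.fx.all import speedx

        def video_segment_for(material_path, segment_duration):
            clip = VideoFileClip(material_path)
            if clip.duration >= segment_duration:
                # Material long enough: cut out a segment of equal duration.
                return clip.subclip(0, segment_duration)
            # Material too short: slow it down (variable-speed processing)
            # so that it lasts exactly as long as the music segment.
            return speedx(clip, final_duration=segment_duration)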
  • As one example, the generated multiple video segments include a second video segment, and the second video segment is formed by motion of the picture material.
  • The second video segment may be a picture material to which a motion effect has been added.
  • The motion effects may include, but are not limited to, at least one of the following: zooming in, zooming out, panning left, and panning right.
  • Zooming in may be: initially, the central area of the picture is displayed in the display frame of the page, as shown in FIG. 3A; then the picture gradually becomes smaller, and the area of the picture shown in the display frame gradually expands until the complete picture is displayed, as shown in FIG. 3B.
  • Zooming out may be: initially, the complete picture is displayed in the display frame of the page, as shown in FIG. 3B; then the picture gradually becomes larger, and the area shown in the display frame gradually shrinks until a preset-size central area of the picture is displayed, as shown in FIG. 3A.
  • Panning left may be: initially, a preset right-hand area of the picture is displayed in the display frame of the page, as shown in FIG. 3D; then the picture moves left relative to the display frame, and the displayed area gradually moves left until the preset left-hand area of the picture is displayed, as shown in FIG. 3C, so that the picture visually moves from right to left.
  • Panning right may be: initially, a preset left-hand area of the picture is displayed in the display frame of the page, as shown in FIG. 3C; then the picture moves right relative to the display frame, and the displayed area gradually moves right until the preset right-hand area of the picture is displayed, as shown in FIG. 3D, so that the picture visually moves from left to right.
  • Adding motion to the picture material makes the transition between picture material and video material more natural.
  • The motion rate of the picture may, for example, be determined according to the formula curScale = curTime / (EndTime - StartTime) * (EndScale - StartScale), where curTime is the time at which the picture currently appears in the video, EndTime is the time at which the picture stops moving, and StartTime is the time at which the picture starts moving, so that EndTime - StartTime is the length of time for which the picture moves.
  • For panning effects, curScale may be the position of the currently displayed area in the picture, EndScale may be the position of the displayed area when the picture stops moving, and StartScale may be the position of the displayed area when the picture starts moving, so that EndScale - StartScale is the change in position of the displayed area during the motion.
  • For zooming effects, curScale may be the size of the currently displayed area in the picture, EndScale may be the size of the displayed area when the picture stops moving, and StartScale may be the size of the displayed area when the picture starts moving, so that EndScale - StartScale is the change in size of the displayed area during the motion.
  • The amount of size change and the amount of position change may be set manually.
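  • One plausible reading of the formula, as a sketch: curTime is treated as the time elapsed since StartTime, so the displayed area is linearly interpolated over the motion interval (variable names follow the patent):

        def cur_scale(cur_time, start_time, end_time, start_scale, end_scale):
            # curScale = curTime / (EndTime - StartTime) * (EndScale - StartScale),
            # applied here as an offset from the starting position or size.
            progress = (cur_time - start_time) / (end_time - start_time)
            return start_scale + progress * (end_scale - start_scale)

  • For panning, start_scale and end_scale would be display-area positions; for zooming, display-area sizes.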
  • As another example, the generated multiple video segments include a first video segment, and the first video segment is generated by adding an animation effect to the picture material.
  • The first video segment may be the picture material after the animation effect has been added.
  • The animation effect may be a foreground effect randomly added to the picture material.
  • A foreground animation effect is a dynamic animation added on top of the picture, for example an animation of falling rain. Adding animation effects to the picture material makes it visually more attractive and improves the visual effect for the user.
  • Here, when a picture material is used to generate a video segment, a video material of a preset duration (for example, 3 seconds) may first be generated by adding motion or an animation effect, and a video segment with the same duration as the audio segment may then be generated from that video material.
  • In some optional implementations, the animation effect added to the picture material may be determined according to the scene category of the picture material.
  • The scene category may represent the scene presented in the picture material.
  • For example, scene categories may include, but are not limited to, at least one of the following: general scene categories and indoor categories.
  • General scene categories may include, but are not limited to, at least one of the following: babies, beaches, buildings, cars, cartoons, and animals.
  • Indoor categories may include, but are not limited to, at least one of the following: bookstores, coffee shops, KTV (karaoke), and shopping malls.
  • As one example, the execution body may identify whether the picture material contains preset scene information to determine the scene category.
  • Adding animation effects to the picture material according to its scene category increases the association between the picture and the effect; for example, if the scene information in the picture material is "snowman", the animation effect may be "fluttering snowflakes".
  • As another example, the scene category of the picture material may be obtained by analyzing the picture material with a machine learning model, where the machine learning model has been trained on a set of training samples.
  • The training samples in the training sample set include sample picture materials and the sample scene categories corresponding to them. Determining the scene category with a model increases speed and saves labor.
  • As an example, the machine learning model may be obtained by performing the following training steps on the training sample set: input the sample picture materials of at least one training sample in the set into an initial machine learning model to obtain the scene category corresponding to each sample picture material; compare the scene category obtained for each sample picture material with the corresponding sample scene category; determine the prediction accuracy of the initial machine learning model from the comparison results; determine whether the prediction accuracy is greater than a preset accuracy threshold; in response to determining that the accuracy is greater than the preset threshold, use the initial machine learning model as the trained machine learning model; in response to determining that the accuracy is not greater than the preset threshold, adjust the parameters of the initial machine learning model, form a training sample set from unused training samples, use the adjusted model as the initial machine learning model, and perform the training steps again.
  • After training, the machine learning model can be used to characterize the correspondence between picture materials and scene categories.
  • The machine learning model mentioned above may be a convolutional neural network model.
  • In some optional implementations, the training sample set includes sample pictures and the scene categories of the sample pictures, and the machine learning model is trained with the sample pictures as input and the scene categories of the sample pictures as the desired output. A training-loop sketch follows.
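  • For illustration only, a compact PyTorch-style sketch of the training procedure described above; the batch size, optimizer, learning rate, and accuracy threshold of 0.9 are assumptions, not values from the patent:

        import torch
        import torch.nn as nn
        from torch.utils.data import DataLoader

        def train_scene_classifier(model, train_set, threshold=0.9, max_rounds=10):
            # train_set yields (picture_tensor, scene_category_index) pairs.
            loader = DataLoader(train_set, batch_size=32, shuffle=True)
            optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
            loss_fn = nn.CrossEntropyLoss()
            for _ in range(max_rounds):
                correct, total = 0, 0
                for pictures, labels in loader:
                    logits = model(pictures)
                    loss = loss_fn(logits, labels)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    # Compare predicted scene categories with sample categories.
                    correct += (logits.argmax(dim=1) == labels).sum().item()
                    total += labels.numel()
                # Use the model once prediction accuracy exceeds the threshold;
                # otherwise adjust the parameters and train again.
                if correct / total > threshold:
                    break
            return model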
  • Step 204: splice the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and add the audio material as the video sound track to obtain a synthesized video.
  • In some embodiments, the execution body of the video generation method may splice the video segments corresponding to the music segments together in the order in which the music segments appear in the audio material, and add the audio material to the audio track of the spliced video to obtain the synthesized video.
  • As an example, the audio material may be divided into 3 segments in order according to the music points.
  • Segment A may run from 0 seconds to 2 seconds, segment B from 2 seconds to 5 seconds, and segment C from 5 seconds to 10 seconds.
  • The corresponding video segments are segment a, segment b, and segment c, so the spliced video can be expressed as abc.
  • The audio material is added to the audio track of the spliced video abc to obtain the synthesized video; a splicing sketch follows.
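  • For illustration only, a splicing sketch using the moviepy library; the segment ordering and audio-track step mirror the a-b-c example above:

        from moviepy.editor import AudioFileClip, concatenate_videoclips

        def synthesize(video_segments, audio_path):
            # video_segments are ordered by the time their corresponding
            # music segments appear in the audio material (a, b, c -> "abc").
            spliced = concatenate_videoclips(video_segments)
            # Add the audio material as the audio track of the spliced video.
            return spliced.set_audio(AudioFileClip(audio_path))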
  • If the image material were only video material, the types of image material would be limited and the content relatively uniform, which would affect the diversity of the synthesized video content.
  • By also allowing picture material, the types of image materials are enriched, thereby increasing the diversity of the synthesized video content.
  • With further reference to FIG. 4, in still other embodiments, the video generation method includes the following steps:
  • Step 401: obtain initial audio.
  • In some embodiments, the execution body of the video generation method may obtain the initial audio through a wired or wireless connection.
  • The initial audio may be music stored locally by the user or music on the network.
  • As an example, some music may first be recommended to the user; if the user cannot find the needed music among the recommendations, the user can manually search for other music, and the music selected by the user is obtained as the initial audio.
  • Step 402: determine the duration of the audio material according to the total duration of the image material and the duration of the initial audio.
  • In some embodiments, the execution body may calculate the total duration of all the image materials based on the multiple acquired image materials.
  • The duration of a video material may be the duration of the video, and the duration of a picture material may be set manually, for example, to 4 seconds.
  • In some optional implementations, determining the duration of the audio material according to the total duration of the image material and the duration of the initial audio includes: determining an initial duration according to the total duration of the image material and the duration of the initial audio, where the initial duration may be the duration of the initial audio or the total duration of the image material; if the initial duration is greater than a duration threshold, determining the duration threshold as the duration of the audio material; and if the initial duration is less than the duration threshold, determining the initial duration as the duration of the audio material. The duration threshold may be set manually, for example, to 20 seconds. Setting the threshold makes it possible to control the duration of the audio material.
  • In some optional implementations, determining the initial duration according to the total duration of the image material and the duration of the initial audio includes: if the total duration of the image material is greater than the duration of the initial audio, determining the duration of the initial audio as the initial duration; and if the total duration of the image material is less than the duration of the initial audio, reducing the total duration of the image material to obtain the duration of the audio material.
  • The total duration of the image material may be reduced by multiplying it by a target ratio or by subtracting a preset duration from it.
  • The target ratio and the preset duration may be set manually, and the preset duration needs to be less than the total duration.
  • This approach makes it possible to flexibly control the duration of the audio material. A sketch of the duration logic follows.
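  • For illustration only, the duration logic above as a small function; the 20-second threshold comes from the example in the text, while the 0.8 target ratio is an assumed value:

        def audio_material_duration(image_total, initial_audio_duration,
                                    duration_threshold=20.0, target_ratio=0.8):
            # Determine the initial duration from the two inputs.
            if image_total > initial_audio_duration:
                initial_duration = initial_audio_duration
            else:
                # Reduce the total duration of the image material, here by
                # multiplying by a target ratio (subtracting a preset
                # duration would work equally well).
                initial_duration = image_total * target_ratio
            # Clamp the result with the duration threshold.
            return min(initial_duration, duration_threshold)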
  • Step 403: extract the audio material from the initial audio according to the duration of the audio material.
  • The execution body extracts, from the initial audio, an audio clip whose duration equals the determined duration of the audio material.
  • Step 404: obtain image material and audio material.
  • Step 405: determine the music points of the audio material.
  • Step 406: using the image material, generate one video segment for each music segment in the audio material to obtain multiple video segments.
  • Step 407: splice the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and add the audio material as the video sound track to obtain a synthesized video.
  • For steps 404-407 and the technical effects they bring, reference may be made to steps 201-204 in the embodiments corresponding to FIG. 2, which are not repeated here.
  • The video generation method disclosed in some embodiments of the present disclosure determines the duration of the audio material based on the acquired initial audio, the total duration of the image material, and the duration of the initial audio, and extracts the audio material from the initial audio, so that the duration of the audio material is adapted to the duration of the synthesized video.
  • With further reference to FIG. 5, the present disclosure provides some embodiments of a video generation apparatus. These apparatus embodiments correspond to the method embodiments described above with reference to FIG. 2, and the apparatus can be applied to various electronic devices.
  • The video generation apparatus 500 of some embodiments includes: an acquisition unit 501, a determination unit 502, a generation unit 503, and a synthesis unit 504.
  • The acquisition unit 501 is configured to acquire image materials and audio materials, wherein the image materials include picture materials.
  • The determination unit 502 is configured to determine music points of the audio material, wherein the music points are used to divide the audio material into multiple audio segments.
  • The generation unit 503 is configured to use the image material to generate one video segment for each music segment in the audio material to obtain multiple video segments, wherein corresponding music segments and video segments have the same duration.
  • The synthesis unit 504 is configured to splice the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and to add the audio material as the video sound track to obtain a synthesized video.
  • In some embodiments, the multiple video segments in the generation unit 503 of the video generation apparatus 500 include a first video segment, and the first video segment is generated by adding an animation effect to the picture material.
  • In some embodiments, the animation effect added to the picture material in the video generation apparatus 500 is determined according to the scene category of the picture material.
  • In some embodiments, the scene category of the picture material in the video generation apparatus 500 is obtained by analyzing the picture material with a machine learning model, where the machine learning model has been trained on a set of training samples.
  • In some embodiments, the training sample set in the video generation apparatus 500 includes sample pictures and the scene categories of the sample pictures.
  • The machine learning model is trained with the sample pictures as input and the scene categories of the sample pictures as the desired output.
  • In some embodiments, the multiple video segments in the generation unit 503 of the video generation apparatus 500 include a second video segment, and the second video segment is formed by motion of the picture material.
  • In some embodiments, the image material in the acquisition unit 501 of the video generation apparatus 500 also includes video material.
  • In some embodiments, the multiple video segments in the generation unit 503 of the video generation apparatus 500 include a third video segment, and the third video segment is extracted from the video material.
  • In some embodiments, the video generation apparatus 500 further includes: a first acquisition unit configured to obtain initial audio; a first determination unit configured to determine the duration of the audio material according to the total duration of the image material and the duration of the initial audio, wherein the duration of the audio material is less than the total duration of the image material; and an extraction unit configured to extract the audio material from the initial audio according to the duration of the audio material.
  • In some embodiments, the first determination unit of the video generation apparatus 500 includes: a first determination subunit configured to determine an initial duration according to the total duration of the image material and the duration of the initial audio; a second determination subunit configured to determine the duration threshold as the duration of the audio material if the initial duration is greater than the duration threshold; and a third determination subunit configured to determine the initial duration as the duration of the audio material if the initial duration is less than the duration threshold.
  • In some embodiments, the first determination subunit in the first determination unit of the video generation apparatus 500 is further configured to: determine the duration of the initial audio as the initial duration if the total duration of the image material is greater than the duration of the initial audio; and reduce the total duration of the image material to obtain the duration of the audio material if the total duration of the image material is less than the duration of the initial audio.
  • Referring now to FIG. 6, a schematic structural diagram of an electronic device (for example, the server in FIG. 1) 600 suitable for implementing some embodiments of the present disclosure is shown.
  • Terminal devices in some embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • The terminal device shown in FIG. 6 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • As shown in FIG. 6, the electronic device 600 may include a processing apparatus (such as a central processing unit or a graphics processor) 601, which may perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage apparatus 608 into a random access memory (RAM) 603.
  • In the RAM 603, various programs and data required for the operation of the electronic device 600 are also stored.
  • The processing apparatus 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • The following apparatuses may be connected to the I/O interface 605: an input apparatus 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; an output apparatus 607 including, for example, a liquid crystal display (LCD), speaker, and vibrator; a storage apparatus 608 including, for example, a memory card; and a communication apparatus 609.
  • The communication apparatus 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 6 shows an electronic device 600 having various apparatuses, it should be understood that it is not required to implement or possess all of the illustrated apparatuses; more or fewer apparatuses may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one apparatus or, as needed, multiple apparatuses.
  • In particular, according to some embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program.
  • For example, some embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart.
  • In such embodiments, the computer program may be downloaded and installed from the network through the communication apparatus 609, installed from the storage apparatus 608, or installed from the ROM 602.
  • When the computer program is executed by the processing apparatus 601, the above-mentioned functions defined in the methods of some embodiments of the present disclosure are performed.
  • It should be noted that the computer-readable medium in some embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • In some embodiments of the present disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code.
  • Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
  • The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • The computer-readable signal medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
  • The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: a wire, an optical cable, RF (radio frequency), or any suitable combination of the above.
  • In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any network currently known or developed in the future.
  • The computer-readable medium may be included in the electronic device, or it may exist alone without being assembled into the electronic device.
  • The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtain image material and audio material, wherein the image material includes picture material; determine music points of the audio material, wherein the music points are used to divide the audio material into multiple audio segments; use the image material to generate one video segment for each music segment in the audio material to obtain multiple video segments, wherein corresponding music segments and video segments have the same duration; and splice the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and add the audio material as the video sound track to obtain a synthesized video.
  • The computer program code used to perform the operations of some embodiments of the present disclosure may be written in one or more programming languages or a combination thereof.
  • The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • The program code may be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in the flowchart or block diagrams may represent a module, program segment, or part of code, which contains one or more executable instructions for realizing the specified logical function.
  • It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in a different order from the order marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The units described in some embodiments of the present disclosure may be implemented in software or in hardware.
  • The described units may also be provided in a processor; for example, a processor may be described as including an acquisition unit, a determination unit, a generation unit, and a synthesis unit.
  • The names of these units do not in some cases constitute a limitation on the units themselves.
  • For example, the acquisition unit may also be described as "a unit for acquiring image material and audio material".
  • Exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
  • According to one or more embodiments of the present disclosure, a video generation method is provided, including: acquiring image material and audio material, wherein the image material includes picture material; determining music points of the audio material, wherein the music points are used to divide the audio material into multiple audio segments; using the image material, generating one video segment for each music segment in the audio material to obtain multiple video segments, wherein corresponding music segments and video segments have the same duration; and splicing the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and adding the audio material as the video sound track to obtain a synthesized video.
  • According to one or more embodiments of the present disclosure, the multiple video segments include a first video segment, and the first video segment is generated by adding an animation effect to the picture material.
  • According to one or more embodiments of the present disclosure, the animation effect is determined according to the scene category of the picture material.
  • According to one or more embodiments of the present disclosure, the scene category of the picture material is obtained by analyzing the picture material with a machine learning model, wherein the machine learning model has been trained on a set of training samples.
  • According to one or more embodiments of the present disclosure, the training sample set includes sample pictures and the scene categories of the sample pictures, and the machine learning model is trained with the sample pictures as input and the scene categories of the sample pictures as the desired output.
  • According to one or more embodiments of the present disclosure, the multiple video segments include a second video segment, and the second video segment is formed by motion of the picture material.
  • According to one or more embodiments of the present disclosure, the image material also includes video material.
  • According to one or more embodiments of the present disclosure, the multiple video segments include a third video segment, and the third video segment is extracted from the video material.
  • According to one or more embodiments of the present disclosure, the method further includes: obtaining initial audio; determining the duration of the audio material according to the total duration of the image material and the duration of the initial audio, wherein the duration of the audio material is less than the total duration of the image material; and extracting the audio material from the initial audio according to the duration of the audio material.
  • According to one or more embodiments of the present disclosure, determining the duration of the audio material according to the total duration of the image material and the duration of the initial audio includes: determining an initial duration according to the total duration of the image material and the duration of the initial audio; if the initial duration is greater than a duration threshold, determining the duration threshold as the duration of the audio material; and if the initial duration is less than the duration threshold, determining the initial duration as the duration of the audio material.
  • According to one or more embodiments of the present disclosure, determining the initial duration according to the total duration of the image material and the duration of the initial audio includes: if the total duration of the image material is greater than the duration of the initial audio, determining the duration of the initial audio as the initial duration; and if the total duration of the image material is less than the duration of the initial audio, reducing the total duration of the image material to obtain the duration of the audio material.
  • According to one or more embodiments of the present disclosure, a video generation apparatus is provided, including: an acquisition unit configured to acquire image material and audio material, wherein the image material includes picture material; a determination unit configured to determine music points of the audio material, wherein the music points are used to divide the audio material into multiple audio segments; a generation unit configured to use the image material to generate one video segment for each music segment in the audio material to obtain multiple video segments, wherein corresponding music segments and video segments have the same duration; and a synthesis unit configured to splice the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and to add the audio material as the video sound track to obtain a synthesized video.
  • According to one or more embodiments of the present disclosure, an electronic device is provided, including: one or more processors; and a storage apparatus on which one or more programs are stored which, when executed by the one or more processors, cause the one or more processors to implement the method described in any of the above embodiments.
  • According to one or more embodiments of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the above embodiments.

Abstract

Embodiments of the present disclosure disclose a video generation method, apparatus, electronic device, and computer-readable medium. A specific implementation of the method includes: acquiring image material and audio material, wherein the image material includes picture material; determining music points of the audio material, wherein the music points are used to divide the audio material into multiple audio segments; using the image material, generating one video segment for each music segment in the audio material to obtain multiple video segments, wherein corresponding music segments and video segments have the same duration; and splicing the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and adding the audio material as the video sound track to obtain a synthesized video. This implementation enriches the types of material that can be used to generate a video.

Description

Video generation method, apparatus, electronic device, and computer-readable medium
Cross-reference to related application
This application claims priority to Chinese Patent Application No. 201910919296.X, entitled "Video generation method, apparatus, electronic device, and computer-readable medium", filed on September 26, 2019, the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to video generation methods, apparatuses, electronic devices, and computer-readable media.
Background
With the rapid development of multimedia technology, video processing technology has also advanced quickly, and video processing software has become a common type of software on terminals, widely used in a variety of scenarios. In many cases, users need to edit materials such as videos and music into a single video. At present, however, users editing a video with video software usually have to spend a great deal of time and effort processing the various materials, so current video editing approaches are not simple enough for users.
Summary
This summary is provided to introduce concepts in a brief form; these concepts are described in detail in the detailed description that follows. This summary is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
The purpose of some embodiments of the present disclosure is to propose an improved video generation method, apparatus, electronic device, and computer-readable medium to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a video generation method, including: acquiring image material and audio material, wherein the image material includes picture material; determining music points of the audio material, wherein the music points are used to divide the audio material into multiple audio segments; using the image material, generating one video segment for each music segment in the audio material to obtain multiple video segments, wherein corresponding music segments and video segments have the same duration; and splicing the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and adding the audio material as the video sound track to obtain a synthesized video.
In a second aspect, some embodiments of the present disclosure provide a video generation apparatus, including: an acquisition unit configured to acquire image material and audio material, wherein the image material includes picture material; a determination unit configured to determine music points of the audio material, wherein the music points are used to divide the audio material into multiple audio segments; a generation unit configured to use the image material to generate one video segment for each music segment in the audio material to obtain multiple video segments, wherein corresponding music segments and video segments have the same duration; and a synthesis unit configured to splice the multiple video segments together according to the times at which their corresponding music segments appear in the audio material, and to add the audio material as the video sound track to obtain a synthesized video.
In a third aspect, some embodiments of the present disclosure provide an electronic device, including: one or more processors; and a storage apparatus on which one or more programs are stored which, when executed by the one or more processors, cause the one or more processors to implement the method of any implementation of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any implementation of the first aspect.
One of the above embodiments of the present disclosure has the following beneficial effects: dividing the audio at music points makes it possible to generate the individual video segments of the synthesized video, which reduces the time users spend processing materials and makes editing simpler. Furthermore, the video segments in the synthesized video can be generated from picture materials, so that when a user has no video material, or very little, the user can still edit a video from pictures, making the edited video content more diverse.
Brief description of the drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that parts and elements are not necessarily drawn to scale.
FIGS. 1A-1D are schematic diagrams of an application scenario of a video generation method according to some embodiments of the present disclosure;
FIG. 2 is a flowchart of some embodiments of a video generation method according to the present disclosure;
FIGS. 3A-3D are schematic diagrams of an application scenario of picture material motion according to some embodiments of the present disclosure;
FIG. 4 is a flowchart of still other embodiments of a video generation method according to the present disclosure;
FIG. 5 is a schematic structural diagram of some embodiments of a video generation apparatus according to the present disclosure;
FIG. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to embodiments of the present disclosure.
Detailed description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should also be noted that, for ease of description, only the parts related to the relevant invention are shown in the drawings. The embodiments of the present disclosure and the features in the embodiments may be combined with each other without conflict.
The names of the messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure is described in detail below with reference to the accompanying drawings and in conjunction with embodiments.
FIGS. 1A-1D are schematic diagrams of different application scenarios of a video generation method according to some embodiments of the present disclosure. In the application scenario shown in FIG. 1A, the user may first select multiple image materials on the upload page 1017 of the terminal device 101, for example, pictures 1011-1014 shown on page 1017. The user clicks the position shown by selection box 1015 to select picture 1011 and picture 1012, then clicks the "Next" button 1016, and the terminal device 101 generates image material 104 and image material 105 from the selected pictures 1011 and 1012. Based on the number of image materials obtained (shown as 2 in the figure), the music point 107 in the acquired audio material 106 is determined, and the audio material 106 is divided into music segment A and music segment B at music point 107. Image material 104 and image material 105 are processed according to the durations of music segment A and music segment B to obtain the corresponding video segments 1041 and 1051. Video segments 1041 and 1051 are spliced according to the times at which music segment A and music segment B appear in the audio material 106, and the audio material 106 is added as the audio track of the spliced video to obtain the synthesized video 108.
Unlike FIG. 1A, in the application scenarios shown in FIGS. 1B-1D, the terminal device 101 sends image information 102, which includes the number of image materials (shown as 2 in the figure), to the server 103. In FIG. 1C, the server 103 determines the music point 107 in the acquired audio material 106 and divides the audio material 106 into music segment A and music segment B at music point 107. In FIG. 1D, the server 103 sends information 109, which includes the durations of music segment A and music segment B, to the terminal device 101. The terminal device 101 processes image material 104 and image material 105 according to the durations of music segment A and music segment B to obtain the corresponding video segments 1041 and 1051, where the duration of video segment 1041 equals that of music segment A and the duration of video segment 1051 equals that of music segment B. The terminal device 101 splices video segments 1041 and 1051 according to the times at which music segment A and music segment B appear in the audio material 106, and adds the audio material 106 as the audio track of the spliced video to obtain the synthesized video 108.
It can be understood that the video generation method may be executed by the terminal device 101, by the server 103, through interaction between the terminal device 101 and the server 103, or by various software programs. The terminal device 101 may be, for example, any of various electronic devices with a display screen, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, desktop computers, and so on. The execution body may also be embodied as the server 103, as software, and so on. When the execution body is software, it may be installed in the electronic devices listed above and may be implemented, for example, as multiple pieces of software or software modules for providing distributed services, or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the numbers of mobile phones and servers in FIG. 1 are merely illustrative. There may be any number of mobile phones and servers according to implementation needs.
继续参考图2,示出了根据本公开的视频生成方法的一些实施例的流程200。该视频生成方法,包括以下步骤:
步骤201,获取影像素材和音频素材。
在一些实施例中,视频生成方法的执行主体(例如,图1所示的服务器103)可以通过有线连接方式或者无线连接方式,获取影像素材和音频素材,其中,上述影像素材包括图片素材。作为示例,上述影像素材可以是用户存储在本地的图片,还可以是用户从网上下载的图片。上述音频素材可以是用户存储在本地的音乐,也可以是网络上的音乐。
在一些实施例的一些可选的实现方式中,影像素材除了包括图片素材,还可以包括视频素材。作为示例,上述视频素材可以是用户上传的视频,也可以是用户存储在本地的视频,还可以是用户从网上下载的视频。其中,由于影像素材可以包括视频素材也可以包括图片素材,由此影像素材的类型得以增加。
Step 202: determine music points of the audio material.
In some embodiments, the execution body may first determine candidate music points of the audio material. Here, a candidate music point may be a point in the audio material that satisfies a preset beat-change condition. The execution body may then select a target number of music points from the obtained candidate music points. The target number may be determined according to the number of acquired image materials, determined according to the number of strong beats in the audio material, or set by the user. As an example, when 10 image materials are acquired, 9 music points may be determined. A strong beat is usually a beat with strong musical dynamics.
As an example, a candidate music point may be a position in the audio material where a preset change in musicality occurs. Positions where musicality changes may include positions where the beat changes and positions where the melody changes. On this basis, candidate music points may be determined as follows: the execution body analyzes the audio material to determine the beat points and note onset points therein, where a beat point is a position where the beat changes and a note onset point is a position where the melody changes. Specifically, on the one hand, a deep-learning-based beat analysis algorithm may be used to analyze the audio material to obtain the beat points and their timestamps; on the other hand, short-time spectral analysis may be performed on the audio material to obtain the note onset points and their timestamps, where the note onset points may be obtained by an onset detector. The beat points and note onset points obtained by the two approaches are then unified, merged, and de-duplicated to obtain the candidate music points.
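By way of illustration only, the candidate music point detection described above can be sketched in Python as follows. The use of the open-source librosa library and the merge tolerance are assumptions of this sketch; the disclosure does not prescribe a specific beat-analysis algorithm or onset detector.

    import librosa

    def candidate_music_points(audio_path, tolerance=0.1):
        # Load the audio material.
        y, sr = librosa.load(audio_path)
        # Beat points: positions where the beat changes, as timestamps.
        _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
        beat_times = librosa.frames_to_time(beat_frames, sr=sr)
        # Note onset points: positions where the melody changes, obtained
        # by short-time spectral analysis (an onset detector).
        onset_times = librosa.onset.onset_detect(y=y, sr=sr, units='time')
        # Unify the two point sets, then merge and de-duplicate points
        # that lie closer together than `tolerance` seconds.
        merged = sorted(list(beat_times) + list(onset_times))
        candidates = []
        for t in merged:
            if not candidates or t - candidates[-1] > tolerance:
                candidates.append(t)
        return candidates

A target number of music points, for example one fewer than the number of image materials, could then be selected from the returned candidates.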
Step 203: use the image materials to generate one video segment for each music segment in the audio material, obtaining a plurality of video segments.
In some embodiments, for each music segment in the audio material, the execution body may generate, based on the image materials, a video segment with the same duration as that music segment. As an example, suppose the music material is divided into 3 music segments with durations of 1 second, 2 seconds, and 3 seconds; the durations of the corresponding video segments may then also be 1 second, 2 seconds, and 3 seconds, respectively. As one example, the execution body may generate multiple video segments from a single image material. For instance, suppose the execution body acquires one 10-second image material and one 8-second audio material, and divides the audio material into 3 audio segments of 2, 3, and 5 seconds according to the music points; the execution body may then crop 3 different video segments of 2, 3, and 5 seconds from that image material. As another example, the execution body may generate one video segment from one image material. For instance, when one image material is used to generate one video segment for one music segment: if the duration of the image material is greater than that of the music segment, a video segment equal in duration to the music segment is cut from the original image material; if the duration of the image material is less than that of the music segment, the original image material is speed-adjusted to lengthen its duration, and the speed-adjusted material is used as the video segment, so that the video segment's duration equals the music segment's duration. It should be understood that, for the picture materials among the image materials, a variety of implementations can be used to generate video segments from picture materials.
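For illustration, the trim-or-slow-down strategy described above could be sketched as follows, using the moviepy 1.x API; the library choice is an assumption of this sketch, not something the disclosure prescribes.

    from moviepy.editor import VideoFileClip, vfx

    def fit_to_duration(material, target):
        # Material long enough: cut out a segment of the required length.
        if material.duration >= target:
            return material.subclip(0, target)
        # Material too short: slow it down so that its duration stretches
        # to exactly the music segment's duration (a factor below 1
        # lengthens the clip).
        factor = material.duration / target
        return material.fx(vfx.speedx, factor)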
As one example, the generated plurality of video segments include a second video segment, which is formed by moving the picture material. The second video segment may be the picture material with a motion effect added. The motion effect may include, but is not limited to, at least one of the following: zoom in, zoom out, pan left, and pan right. As an example, zoom in may work as follows: initially, the central region of the picture is displayed in the display frame of the page, as shown in FIG. 3A; the picture then gradually shrinks, and the region of the picture shown in the display frame gradually expands until the complete picture is displayed, as shown in FIG. 3B. Zoom out may work as follows: initially, the complete picture is displayed in the display frame, as shown in FIG. 3B; the picture then gradually enlarges, and the region of the picture shown in the display frame gradually shrinks until a central region of preset size is displayed, as shown in FIG. 3A. Pan left may work as follows: initially, a preset right-hand region of the picture is displayed in the display frame, as shown in FIG. 3D; the picture then moves left relative to the display frame, and the displayed region gradually shifts left until a preset left-hand region of the picture is displayed, as shown in FIG. 3C, so that visually the picture moves from right to left. Pan right may work as follows: initially, a preset left-hand region of the picture is displayed in the display frame, as shown in FIG. 3C; the picture then moves right relative to the display frame, and the displayed region gradually shifts right until a preset right-hand region of the picture is displayed, as shown in FIG. 3D, so that visually the picture moves from left to right. Adding motion to picture materials makes the transition between picture materials and video materials more natural.
The motion rate of the picture may, for example, be determined according to the following formula: curScale = (curTime / (EndTime - StartTime)) * (EndScale - StartScale), where curTime is the time at which the picture currently appears in the video, EndTime is the time at which the picture stops moving, and StartTime is the time at which the picture starts moving, so that EndTime - StartTime is the length of time the picture moves. For motion effects such as pan left and pan right, curScale may be the position of the currently displayed region of the picture, EndScale may be the position of the displayed region when the picture stops moving, and StartScale may be the position of the displayed region when the picture starts moving, so that EndScale - StartScale is the change in position of the displayed region during the motion. For motion effects such as zoom in and zoom out, curScale may be the size of the currently displayed region of the picture, EndScale may be the size of the displayed region when the picture stops moving, and StartScale may be the size of the displayed region when the picture starts moving, so that EndScale - StartScale is the change in size of the displayed region during the motion. The size change and position change may be set manually.
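Read as a linear interpolation, with curTime measured from the start of the motion and StartScale added to turn the computed change into an absolute value, the formula can be sketched as follows; this reading is an editorial assumption, since the text states only the formula given above.

    def current_scale(cur_time, start_time, end_time, start_scale, end_scale):
        # Fraction of the motion completed at cur_time, clamped to [0, 1].
        progress = (cur_time - start_time) / (end_time - start_time)
        progress = max(0.0, min(1.0, progress))
        # For pans, scale is the position of the display region; for zooms,
        # it is the size of the display region.
        return start_scale + progress * (end_scale - start_scale)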
As another example, the generated plurality of video segments include a first video segment, which is generated by adding a dynamic effect to the picture material. The first video segment may be the picture material with a dynamic effect added. The dynamic effect may be a foreground dynamic effect added to the picture material at random; a foreground dynamic effect is a dynamic animation effect overlaid on the picture, for example an animation of falling rain. Adding dynamic effects makes the picture materials visually more attractive and improves the user's visual experience.
Here, when generating a video segment from a picture material, a video material of a preset duration (for example, 3 seconds) may first be generated by adding motion or a dynamic effect, and a video segment with the same duration as the audio segment may then be generated from that video material.
In some optional implementations of some embodiments, the dynamic effect added to the picture material may be determined according to the scene category of the picture material. The scene category may represent the scene presented in the picture material. For example, scene categories may include, but are not limited to, at least one of the following: general scene categories and indoor categories. General scene categories may include, but are not limited to, at least one of: baby, beach, building, car, cartoon, and animal. Indoor categories may include, but are not limited to, at least one of: bookstore, café, KTV (karaoke), and shopping mall.
It should be understood that a variety of implementations can be used to obtain the scene category of a picture material.
As one example, the execution body may determine the scene category by recognizing whether the picture material contains preset scene information. Adding a dynamic effect based on the scene category increases the association between the picture and the effect; for example, if the scene information in the picture material is "snowman", the dynamic effect may be an animation of "drifting snowflakes".
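A minimal sketch of such a scene-to-effect mapping is given below; the mapping entries and the random fallback, which matches the randomly added foreground effect mentioned earlier, are illustrative assumptions.

    import random

    # Illustrative mapping from recognized scene information to a
    # foreground dynamic effect; the entries are assumptions of this sketch.
    EFFECTS_BY_SCENE = {
        "snowman": "drifting_snowflakes",
        "beach": "sun_glare",
        "cafe": "rising_steam",
    }
    ALL_EFFECTS = ["drifting_snowflakes", "falling_rain",
                   "sun_glare", "rising_steam"]

    def pick_effect(scene_category):
        # Prefer an effect associated with the scene; otherwise fall back
        # to a randomly chosen foreground effect.
        return EFFECTS_BY_SCENE.get(scene_category,
                                    random.choice(ALL_EFFECTS))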
As another example, the scene category of the picture material may be obtained by analyzing the picture material with a machine learning model, where the machine learning model has been trained on a set of training samples. Each training sample in the training sample set includes a sample picture material and a corresponding sample scene category. Determining the scene category with a model increases speed and saves labor.
As an example, the machine learning model may be obtained by performing the following training steps based on the training sample set: inputting the sample picture materials of at least one training sample in the set into an initial machine learning model to obtain a scene category for each sample picture material; comparing the scene category obtained for each sample picture material with the corresponding sample scene category; determining the prediction accuracy of the initial machine learning model from the comparison results; determining whether the prediction accuracy is greater than a preset accuracy threshold; if the accuracy is greater than the preset threshold, taking the initial machine learning model as the trained machine learning model; and if the accuracy is not greater than the preset threshold, adjusting the parameters of the initial machine learning model, forming a training sample set from previously unused training samples, taking the adjusted initial machine learning model as the initial machine learning model, and performing the training steps again.
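The training procedure above can be sketched generically as follows; the predict/update interface of the model object is hypothetical, since the text says only that the model may be a convolutional neural network.

    def train(model, sample_batches, accuracy_threshold=0.9):
        # Each batch pairs sample picture materials with their labeled
        # sample scene categories.
        for pictures, true_categories in sample_batches:
            predicted = [model.predict(p) for p in pictures]
            correct = sum(p == t for p, t in zip(predicted, true_categories))
            accuracy = correct / len(true_categories)
            # Prediction accuracy above the preset threshold: take the
            # model as the trained machine learning model.
            if accuracy > accuracy_threshold:
                return model
            # Otherwise adjust the model's parameters and repeat the
            # training step on previously unused samples.
            model.update(pictures, true_categories)
        return model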
It should be understood that, after the above training, the machine learning model can be used to characterize the correspondence between picture materials and scene categories. The machine learning model mentioned above may be a convolutional neural network model.
In some optional implementations of some embodiments, the training sample set includes sample pictures and the scene categories of the sample pictures, and the machine learning model is trained with the sample pictures as input and the scene categories of the sample pictures as the desired output.
Step 204: splice the plurality of video segments together according to the times at which their respective music segments appear in the audio material, and add the audio material as the video's audio track to obtain a synthesized video.
In some embodiments, the execution body of the video generation method may splice the video segments corresponding to the music segments together in the order in which the music segments appear in the audio material, and add the audio material to the audio track of the spliced video to obtain the synthesized video. As an example, the audio material may be divided into 3 sequential segments according to the music points: segment A from 0 to 2 seconds, segment B from 2 to 5 seconds, and segment C from 5 to 10 seconds. The corresponding video segments are segments a, b, and c, so the spliced video can be denoted abc. Adding the audio material to the audio track of the spliced video abc yields the synthesized video.
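For illustration only, the splicing step could be sketched with moviepy 1.x as follows; the library choice and the argument names are assumptions of this sketch.

    from moviepy.editor import AudioFileClip, concatenate_videoclips

    def compose(video_segments, audio_path):
        # `video_segments` are assumed to be ordered a, b, c, ... to match
        # the order of music segments A, B, C, ... in the audio material.
        spliced = concatenate_videoclips(video_segments)
        # Add the audio material as the audio track of the spliced video.
        return spliced.set_audio(AudioFileClip(audio_path))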
As can be seen from the above examples, if the image materials consist only of video materials, the types of material are limited and the content is correspondingly limited, which in turn affects the diversity of the synthesized video's content. By acquiring image materials that include picture materials, the types of image material are enriched, thereby increasing the diversity of the synthesized video's content.
With continued reference to FIG. 4, a flow 400 of further embodiments of a video generation method according to the present disclosure is shown. The video generation method includes the following steps:
Step 401: acquire initial audio.
In some embodiments, the execution body of the video generation method (for example, the server 103 shown in FIG. 1) may acquire the initial audio via a wired or wireless connection. The initial audio may be music stored locally by the user or music from the Internet. As an example, some music may first be recommended to the user; if the user cannot find the desired music among the recommendations, the user can manually search for other music, and the music selected by the user is acquired as the initial audio.
Step 402: determine the duration of the audio material according to the total duration of the image materials and the duration of the initial audio.
In some embodiments, the execution body may compute the total duration of all acquired image materials. Among the image materials, the duration of a video material may be the duration of the video, and the duration of a picture material may be set manually, for example to 4 seconds. The total duration is compared with the duration of the acquired initial audio, and the duration of the audio material is determined from the comparison result, such that the duration of the audio material is less than the total duration of the image materials.
In some optional implementations of some embodiments, determining the duration of the audio material according to the total duration of the image materials and the duration of the initial audio includes: determining an initial duration according to the total duration of the image materials and the duration of the initial audio, where the initial duration may be the duration of the initial audio or the total duration of the image materials; if the initial duration is greater than a duration threshold, determining the threshold as the duration of the audio material, where the threshold may be set manually, for example to 20 seconds; and if the initial duration is less than the duration threshold (20 seconds in this example), determining the initial duration as the duration of the audio material. Setting a threshold makes it possible to control the duration of the audio material.
In some optional implementations of some embodiments, determining the initial duration according to the total duration of the image materials and the duration of the initial audio includes: if the total duration of the image materials is greater than the duration of the initial audio, determining the duration of the initial audio as the initial duration; and if the total duration of the image materials is less than the duration of the initial audio, reducing the total duration of the image materials to obtain the duration of the audio material. As an example, the total duration may be reduced by multiplying it by a target ratio or by subtracting a preset duration from it, where the target ratio and the preset duration may be set manually, and the preset duration must be less than the total duration. This approach allows the duration of the audio material to be controlled flexibly.
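The duration rules of the two optional implementations above can be sketched together as follows; the 20-second threshold is the text's own example, and the shrink ratio is an assumed stand-in for the target ratio or preset reduction.

    def audio_material_duration(total_image_duration, initial_audio_duration,
                                threshold=20.0, shrink_ratio=0.8):
        if total_image_duration > initial_audio_duration:
            # The materials outlast the initial audio: take the initial
            # audio's duration as the initial duration.
            initial = initial_audio_duration
        else:
            # Otherwise reduce the total material duration (here by a
            # target ratio; subtracting a preset duration is the stated
            # alternative).
            initial = total_image_duration * shrink_ratio
        # Clamp the result by the duration threshold.
        return threshold if initial > threshold else initial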
Step 403: extract the audio material from the initial audio according to the duration of the audio material.
In some embodiments, the execution body extracts the audio material from the initial audio according to the duration of the audio material.
Step 404: acquire image materials and an audio material.
Step 405: determine music points of the audio material.
Step 406: use the image materials to generate one video segment for each music segment in the audio material, obtaining a plurality of video segments.
Step 407: splice the plurality of video segments together according to the times at which their respective music segments appear in the audio material, and add the audio material as the video's audio track to obtain a synthesized video.
In some embodiments, for the specific implementation of steps 404-407 and the technical effects they bring, reference may be made to steps 201-204 in the embodiments corresponding to FIG. 2, which will not be repeated here.
In the video generation method disclosed by some embodiments of the present disclosure, initial audio is acquired, the duration of the audio material is determined according to the total duration of the image materials and the duration of the initial audio, and the audio material is extracted from the initial audio, so that the duration of the audio material is adapted to the duration of the synthesized video.
With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a video generation apparatus. These apparatus embodiments correspond to the method embodiments described above with reference to FIG. 2, and the apparatus is specifically applicable to various electronic devices.
As shown in FIG. 5, the video generation apparatus 500 of some embodiments includes: an acquisition unit 501, a determination unit 502, a generation unit 503, and a synthesis unit 504. The acquisition unit 501 is configured to acquire image materials and an audio material, where the image materials include picture materials; the determination unit 502 is configured to determine music points of the audio material, where the music points are used to divide the audio material into a plurality of audio segments; the generation unit 503 is configured to use the image materials to generate one video segment for each music segment in the audio material, obtaining a plurality of video segments, where corresponding music segments and video segments have the same duration; and the synthesis unit 504 is configured to splice the plurality of video segments together according to the times at which their respective music segments appear in the audio material, and to add the audio material as the video's audio track to obtain a synthesized video.
In some embodiments, the plurality of video segments in the generation unit 503 of the video generation apparatus 500 include a first video segment, which is generated by adding a dynamic effect to the picture material.
In some embodiments, in the video generation apparatus 500, the dynamic effect added to the picture material is determined according to the scene category of the picture material.
In some embodiments, in the video generation apparatus 500, the scene category of the picture material is obtained by analyzing the picture material with a machine learning model, where the machine learning model has been trained on a set of training samples.
In some embodiments, in the video generation apparatus 500, the training sample set includes sample pictures and the scene categories of the sample pictures, and the machine learning model is trained with the sample pictures as input and the scene categories of the sample pictures as the desired output.
In some embodiments, the plurality of video segments in the generation unit 503 of the video generation apparatus 500 include a second video segment, which is formed by moving the picture material.
In some embodiments, the image materials in the acquisition unit 501 of the video generation apparatus 500 further include video materials.
In some embodiments, the plurality of video segments in the generation unit 503 of the video generation apparatus 500 include a third video segment, which is extracted from the video material.
In some embodiments, the video generation apparatus 500 further includes: a first acquisition unit configured to acquire initial audio; a first determination unit configured to determine the duration of the audio material according to the total duration of the image materials and the duration of the initial audio, where the duration of the audio material is less than the total duration of the image materials; and an extraction unit configured to extract the audio material from the initial audio according to the duration of the audio material.
In some embodiments, the first determination unit of the video generation apparatus 500 includes: a first determination subunit configured to determine an initial duration according to the total duration of the image materials and the duration of the initial audio; a second determination subunit configured to determine the duration threshold as the duration of the audio material if the initial duration is greater than the duration threshold; and a third determination subunit configured to determine the initial duration as the duration of the audio material if the initial duration is less than the duration threshold.
In some embodiments, the first determination subunit of the first determination unit of the video generation apparatus 500 is further configured to: determine the duration of the initial audio as the initial duration if the total duration of the image materials is greater than the duration of the initial audio; and reduce the total duration of the image materials to obtain the duration of the audio material if the total duration of the image materials is less than the duration of the initial audio.
Referring now to FIG. 6, a schematic structural diagram of an electronic device (for example, the server in FIG. 1) 600 suitable for implementing some embodiments of the present disclosure is shown. Terminal devices in some embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The terminal device shown in FIG. 6 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the electronic device 600 may include a processing apparatus (for example, a central processing unit or a graphics processing unit) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage apparatus 608 into a random-access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing apparatus 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604, to which an input/output (I/O) interface 605 is also connected.
Generally, the following apparatuses may be connected to the I/O interface 605: input apparatuses 606 including, for example, a touchscreen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output apparatuses 607 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage apparatuses 608 including, for example, a memory card; and a communication apparatus 609. The communication apparatus 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows the electronic device 600 with various apparatuses, it should be understood that it is not required to implement or provide all of the apparatuses shown; more or fewer apparatuses may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one apparatus or multiple apparatuses as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication apparatus 609, installed from the storage apparatus 608, or installed from the ROM 602. When the computer program is executed by the processing apparatus 601, the above functions defined in the methods of some embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described in some embodiments of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In some embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to: electrical wire, optical cable, RF (radio frequency), and the like, or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
The computer-readable medium may be contained in the electronic device, or it may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire image materials and an audio material, where the image materials include picture materials; determine music points of the audio material, where the music points are used to divide the audio material into a plurality of audio segments; use the image materials to generate one video segment for each music segment in the audio material, obtaining a plurality of video segments, where corresponding music segments and video segments have the same duration; and splice the plurality of video segments together according to the times at which their respective music segments appear in the audio material, and add the audio material as the video's audio track to obtain a synthesized video.
Computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including an acquisition unit, a determination unit, a generation unit, and a synthesis unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the acquisition unit may also be described as "a unit for acquiring image materials and an audio material".
The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), and so on.
According to one or more embodiments of the present disclosure, a video generation method is provided, including: acquiring image materials and an audio material, where the image materials include picture materials; determining music points of the audio material, where the music points are used to divide the audio material into a plurality of audio segments; using the image materials to generate one video segment for each music segment in the audio material, obtaining a plurality of video segments, where corresponding music segments and video segments have the same duration; and splicing the plurality of video segments together according to the times at which their respective music segments appear in the audio material, and adding the audio material as the video's audio track to obtain a synthesized video.
According to one or more embodiments of the present disclosure, the plurality of video segments include a first video segment, which is generated by adding a dynamic effect to the picture material.
According to one or more embodiments of the present disclosure, the dynamic effect is determined according to the scene category of the picture material.
According to one or more embodiments of the present disclosure, the scene category of the picture material is obtained by analyzing the picture material with a machine learning model, where the machine learning model has been trained on a set of training samples.
According to one or more embodiments of the present disclosure, the training sample set includes sample pictures and the scene categories of the sample pictures, and the machine learning model is trained with the sample pictures as input and the scene categories of the sample pictures as the desired output.
According to one or more embodiments of the present disclosure, the plurality of video segments include a second video segment, which is formed by moving the picture material.
According to one or more embodiments of the present disclosure, the image materials further include video materials.
According to one or more embodiments of the present disclosure, the plurality of video segments include a third video segment, which is extracted from the video material.
According to one or more embodiments of the present disclosure, the method further includes: acquiring initial audio; determining the duration of the audio material according to the total duration of the image materials and the duration of the initial audio, where the duration of the audio material is less than the total duration of the image materials; and extracting the audio material from the initial audio according to the duration of the audio material.
According to one or more embodiments of the present disclosure, determining the duration of the audio material according to the total duration of the image materials and the duration of the initial audio includes: determining an initial duration according to the total duration of the image materials and the duration of the initial audio; if the initial duration is greater than a duration threshold, determining the threshold as the duration of the audio material; and if the initial duration is less than the threshold, determining the initial duration as the duration of the audio material.
According to one or more embodiments of the present disclosure, determining the initial duration according to the total duration of the image materials and the duration of the initial audio includes: if the total duration of the image materials is greater than the duration of the initial audio, determining the duration of the initial audio as the initial duration; and if the total duration of the image materials is less than the duration of the initial audio, reducing the total duration of the image materials to obtain the duration of the audio material.
According to one or more embodiments of the present disclosure, the apparatus includes: an acquisition unit configured to acquire image materials and an audio material, where the image materials include picture materials; a determination unit configured to determine music points of the audio material, where the music points are used to divide the audio material into a plurality of audio segments; a generation unit configured to use the image materials to generate one video segment for each music segment in the audio material, obtaining a plurality of video segments, where corresponding music segments and video segments have the same duration; and a synthesis unit configured to splice the plurality of video segments together according to the times at which their respective music segments appear in the audio material, and to add the audio material as the video's audio track to obtain a synthesized video.
According to one or more embodiments of the present disclosure, an electronic device is provided, including: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any of the above embodiments.
According to one or more embodiments of the present disclosure, a computer-readable medium is provided, storing a computer program which, when executed by a processor, implements the method described in any of the above embodiments.
The above description is merely a description of some preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art should understand that the scope of invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example technical solutions formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (14)

  1. A video generation method, comprising:
    acquiring image materials and an audio material, wherein the image materials include picture materials;
    determining music points of the audio material, wherein the music points are used to divide the audio material into a plurality of audio segments;
    using the image materials to generate one video segment for each music segment in the audio material, obtaining a plurality of video segments, wherein corresponding music segments and video segments have the same duration;
    splicing the plurality of video segments together according to the times at which their respective music segments appear in the audio material, and adding the audio material as the video's audio track to obtain a synthesized video.
  2. The method according to claim 1, wherein the plurality of video segments include a first video segment, which is generated by adding a dynamic effect to the picture material.
  3. The method according to claim 2, wherein the dynamic effect is determined according to a scene category of the picture material.
  4. The method according to claim 3, wherein the scene category of the picture material is obtained by analyzing the picture material with a machine learning model, wherein the machine learning model has been trained on a set of training samples.
  5. The method according to claim 4, wherein the training sample set includes sample pictures and the scene categories of the sample pictures, and the machine learning model is trained with the sample pictures as input and the scene categories of the sample pictures as the desired output.
  6. The method according to claim 1, wherein the plurality of video segments include a second video segment, which is formed by moving the picture material.
  7. The method according to claim 1, wherein the image materials further include video materials.
  8. The method according to claim 7, wherein the plurality of video segments include a third video segment, which is extracted from the video material.
  9. The method according to any one of claims 1 to 8, wherein the method further comprises:
    acquiring initial audio;
    determining the duration of the audio material according to the total duration of the image materials and the duration of the initial audio, wherein the duration of the audio material is less than the total duration of the image materials;
    extracting the audio material from the initial audio according to the duration of the audio material.
  10. The method according to claim 9, wherein determining the duration of the audio material according to the total duration of the image materials and the duration of the initial audio comprises:
    determining an initial duration according to the total duration of the image materials and the duration of the initial audio;
    if the initial duration is greater than a duration threshold, determining the duration threshold as the duration of the audio material;
    if the initial duration is less than the duration threshold, determining the initial duration as the duration of the audio material.
  11. The method according to claim 10, wherein determining the initial duration according to the total duration of the image materials and the duration of the initial audio comprises:
    if the total duration of the image materials is greater than the duration of the initial audio, determining the duration of the initial audio as the initial duration;
    if the total duration of the image materials is less than the duration of the initial audio, reducing the total duration of the image materials to obtain the duration of the audio material.
  12. A video generation apparatus, comprising:
    an acquisition unit configured to acquire image materials and an audio material, wherein the image materials include picture materials;
    a determination unit configured to determine music points of the audio material, wherein the music points are used to divide the audio material into a plurality of audio segments;
    a generation unit configured to use the image materials to generate one video segment for each music segment in the audio material, obtaining a plurality of video segments, wherein corresponding music segments and video segments have the same duration;
    a synthesis unit configured to splice the plurality of video segments together according to the times at which their respective music segments appear in the audio material, and to add the audio material as the video's audio track to obtain a synthesized video.
  13. An electronic device, comprising:
    one or more processors;
    a storage apparatus storing one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-11.
  14. A computer-readable medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1-11.
PCT/CN2020/116921 2019-09-26 2020-09-22 Video generation method and apparatus, electronic device, and computer-readable medium WO2021057740A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
KR1020227010159A KR20220045056A (ko) 2019-09-26 2020-09-22 Video generation method and apparatus, electronic device, and computer-readable medium
BR112022005713A BR112022005713A2 (pt) 2019-09-26 2020-09-22 Video generation method and apparatus, electronic device, and computer-readable medium
JP2022519290A JP7355929B2 (ja) 2019-09-26 2020-09-22 Video generation method and apparatus, electronic device, and computer-readable medium
EP20868358.1A EP4024880A4 (en) 2019-09-26 2020-09-22 VIDEO GENERATING METHOD AND APPARATUS, ELECTRONIC DEVICE AND COMPUTER READABLE MEDIA
US17/706,542 US11710510B2 (en) 2019-09-26 2022-03-28 Video generation method and apparatus, electronic device, and computer readable medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910919296.X 2019-09-26
CN201910919296.XA CN112565882A (zh) 2019-09-26 Video generation method and apparatus, electronic device, and computer-readable medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/706,542 Continuation US11710510B2 (en) 2019-09-26 2022-03-28 Video generation method and apparatus, electronic device, and computer readable medium

Publications (1)

Publication Number Publication Date
WO2021057740A1 true WO2021057740A1 (zh) 2021-04-01

Family

ID=75029985

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116921 WO2021057740A1 (zh) 2019-09-26 2020-09-22 Video generation method and apparatus, electronic device, and computer-readable medium

Country Status (7)

Country Link
US (1) US11710510B2 (zh)
EP (1) EP4024880A4 (zh)
JP (1) JP7355929B2 (zh)
KR (1) KR20220045056A (zh)
CN (1) CN112565882A (zh)
BR (1) BR112022005713A2 (zh)
WO (1) WO2021057740A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495916A (zh) * 2022-04-15 2022-05-13 Tencent Technology (Shenzhen) Co., Ltd. Method, apparatus, device, and storage medium for determining an insertion time point for background music
CN115243107A (zh) * 2022-07-08 2022-10-25 华人运通(上海)云计算科技有限公司 Method, apparatus, system, electronic device, and medium for playing short videos

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113438543B (zh) * 2021-06-22 2023-02-03 深圳市大头兄弟科技有限公司 Matching method, apparatus, device, and storage medium for converting a document into a video
CN113676772B (zh) * 2021-08-16 2023-08-08 Shanghai Bilibili Technology Co., Ltd. Video generation method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005181646A (ja) * 2003-12-19 2005-07-07 Omron Corp Music image output system, music image output method, and music data generation server apparatus
WO2015120333A1 (en) * 2014-02-10 2015-08-13 Google Inc. Method and system for providing a transition between video clips that are combined with a sound track
CN105814634A (zh) * 2013-12-10 2016-07-27 Google Inc. Providing beat matching
US20170026719A1 (en) * 2015-06-17 2017-01-26 Lomotif Private Limited Method for generating a composition of audible and visual media
US20180286458A1 (en) * 2017-03-30 2018-10-04 Gracenote, Inc. Generating a video presentation to accompany audio

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4285287B2 (ja) * 2004-03-17 2009-06-24 Seiko Epson Corporation Image processing apparatus, image processing method, program therefor, and recording medium
US7512886B1 (en) * 2004-04-15 2009-03-31 Magix Ag System and method of automatically aligning video scenes with an audio track
CN103793446B (zh) * 2012-10-29 2019-03-01 Tang Xiaoou Music video generation method and system
CN103810504B (zh) * 2014-01-14 2017-03-22 Samsung Electronics (China) R&D Center Image processing method and apparatus
CN104199841B (zh) * 2014-08-06 2018-01-02 武汉图歌信息技术有限责任公司 Video editing method for generating an animation from pictures and splicing it with video clips
US10204273B2 (en) * 2015-10-20 2019-02-12 Gopro, Inc. System and method of providing recommendations of moments of interest within video clips post capture
CN109147771B (zh) * 2017-06-28 2021-07-06 广州视源电子科技股份有限公司 Audio segmentation method and system
CN108111909A (zh) * 2017-12-15 2018-06-01 Guangzhou Baiguoyuan Information Technology Co., Ltd. Video image processing method, computer storage medium, and terminal
CN108202334B (zh) * 2018-03-22 2020-10-23 Donghua University Dancing robot capable of recognizing music beats and styles
CN110149517B (zh) * 2018-05-14 2022-08-23 Tencent Technology (Shenzhen) Co., Ltd. Video processing method and apparatus, electronic device, and computer storage medium
CN109379643B (zh) * 2018-11-21 2020-06-09 Beijing Dajia Internet Information Technology Co., Ltd. Video synthesis method, apparatus, terminal, and storage medium
CN110233976B (zh) * 2019-06-21 2022-09-09 Guangzhou Kugou Computer Technology Co., Ltd. Video synthesis method and apparatus

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495916A (zh) * 2022-04-15 2022-05-13 Tencent Technology (Shenzhen) Co., Ltd. Method, apparatus, device, and storage medium for determining an insertion time point for background music
CN114495916B (zh) * 2022-04-15 2022-07-12 Tencent Technology (Shenzhen) Co., Ltd. Method, apparatus, device, and storage medium for determining an insertion time point for background music
CN115243107A (zh) * 2022-07-08 2022-10-25 华人运通(上海)云计算科技有限公司 Method, apparatus, system, electronic device, and medium for playing short videos
CN115243107B (zh) * 2022-07-08 2023-11-21 华人运通(上海)云计算科技有限公司 Method, apparatus, system, electronic device, and medium for playing short videos

Also Published As

Publication number Publication date
BR112022005713A2 (pt) 2022-06-21
US20220223183A1 (en) 2022-07-14
EP4024880A1 (en) 2022-07-06
EP4024880A4 (en) 2022-10-19
CN112565882A (zh) 2021-03-26
JP2022549700A (ja) 2022-11-28
US11710510B2 (en) 2023-07-25
KR20220045056A (ko) 2022-04-12
JP7355929B2 (ja) 2023-10-03

Similar Documents

Publication Publication Date Title
WO2021093737A1 (zh) Method and apparatus for generating a video, electronic device, and computer-readable medium
WO2021057740A1 (zh) Video generation method and apparatus, electronic device, and computer-readable medium
US11587593B2 (en) Method and apparatus for displaying music points, and electronic device and medium
WO2021196903A1 (zh) Video processing method and apparatus, readable medium, and electronic device
JP7199527B2 (ja) Image processing method, apparatus, and hardware apparatus
JP6971292B2 (ja) Method, apparatus, server, computer-readable storage medium, and computer program for aligning paragraphs with video
US20220351454A1 (en) Method and apparatus for displaying lyric effects, electronic device, and computer readable medium
WO2021098670A1 (zh) Video generation method and apparatus, electronic device, and computer-readable medium
CN113365134B (zh) Audio sharing method, apparatus, device, and medium
WO2022007565A1 (zh) Augmented reality image processing method and apparatus, electronic device, and storage medium
CN111970571B (zh) Video production method, apparatus, device, and storage medium
US20220353587A1 (en) Method and apparatus for generating music poster, electronic device, and medium
US20230307004A1 (en) Audio data processing method and apparatus, and device and storage medium
US20240064367A1 (en) Video processing method and apparatus, electronic device, and storage medium
WO2023169356A1 (zh) Image processing method, apparatus, device, and storage medium
CN112153460A (zh) Video soundtrack method and apparatus, electronic device, and storage medium
JP2023525091A (ja) Image special-effect setting method, image recognition method, apparatus, and electronic device
EP4344230A1 (en) Video generation method, apparatus, and device, storage medium, and program product
CN109815408B (zh) Method and apparatus for pushing information
WO2021018176A1 (zh) Text special-effect processing method and apparatus
CN117596452A (zh) Video generation method, apparatus, medium, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20868358; Country of ref document: EP; Kind code of ref document: A1

ENP Entry into the national phase
    Ref document number: 2022519290; Country of ref document: JP; Kind code of ref document: A
    Ref document number: 20227010159; Country of ref document: KR; Kind code of ref document: A

NENP Non-entry into the national phase
    Ref country code: DE

REG Reference to national code
    Ref country code: BR; Ref legal event code: B01A
    Ref document number: 112022005713; Country of ref document: BR

ENP Entry into the national phase
    Ref document number: 2020868358; Country of ref document: EP
    Effective date: 20220328

ENP Entry into the national phase
    Ref document number: 112022005713; Country of ref document: BR; Kind code of ref document: A2
    Effective date: 20220325