CN111405357A - Audio and video editing method and device and storage medium - Google Patents

Audio and video editing method and device and storage medium

Info

Publication number
CN111405357A
CN111405357A (application CN201910001833.2A)
Authority
CN
China
Prior art keywords
note
audio
determining
turning point
background audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910001833.2A
Other languages
Chinese (zh)
Inventor
耿军 (Geng Jun)
马春阳 (Ma Chunyang)
杨昌源 (Yang Changyuan)
陈羽飞 (Chen Yufei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority: CN201910001833.2A
Publication: CN111405357A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008: …involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/439: Processing of audio elementary streams
    • H04N21/4394: …involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/47: End-user applications
    • H04N21/472: End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205: …for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The disclosure relates to an audio and video editing method, an apparatus, and a storage medium. The method includes the following steps: performing shot detection on at least one video to be edited to determine at least one shot section in the at least one video to be edited; determining a turning point in background audio corresponding to the at least one video to be edited; and synthesizing the at least one shot section and the background audio so that shot switching time points between adjacent shot sections match the turning point. By analyzing the background audio to determine its turning points, the method makes the shot sections correspond to audio sections of the background audio and matches changes in the music to the rhythm of shot switching, so that a synthesized video can be obtained in which changes in the music and changes in the picture are coordinated.

Description

Audio and video editing method and device and storage medium
Technical Field
The present disclosure relates to the field of video technologies, and in particular, to an audio and video editing method and apparatus, and a storage medium.
Background
Technology drives media upgrading, and major platforms are actively promoting the upgrade to video media, converting e-commerce and social software content from picture-and-text form to video form. Audio and video editing is both the key step and the difficult part of audio and video content production. How to combine a plurality of video segments according to certain rules to form a video in which the music and the picture changes are coordinated as a whole is a problem to be solved urgently.
Disclosure of Invention
In view of this, the present disclosure provides an audio and video editing method, an apparatus and a storage medium.
According to an aspect of the present disclosure, there is provided an audio and video editing method, including:
performing shot detection on at least one video to be edited, and determining at least one shot section in the at least one video to be edited;
determining a turning point in background audio corresponding to the at least one video to be edited;
and synthesizing the at least one shot section and the background audio so that shot switching time points between adjacent shot sections match the turning point.
In one possible implementation, determining the turning point in the background audio corresponding to the at least one video to be clipped includes one or both of:
determining turning points in the background audio according to the energy of the notes in the background audio;
determining a turning point in the background audio according to a time interval between notes in the background audio.
In one possible implementation manner, the method further includes:
determining a priority of the turning point;
determining a target turning point from the turning points according to the priorities of the turning points;
synthesizing the at least one shot section and the background audio so that shot switching time points between adjacent shot sections match the turning point, comprising:
and synthesizing the at least one shot section and the background audio so that shot switching time points between adjacent shot sections are matched with the target turning point.
In one possible implementation, determining the priority of the turning point includes one or both of:
determining the priorities of turning points of different types according to the speed of the background audio;
and determining the priorities of different turning points of the same type according to the energies of the notes corresponding to those turning points.
In one possible implementation, determining a turning point in the background audio according to the energy of the musical notes in the background audio includes:
if the ratio of the energy of a first note in the background audio to the energy of a second note is greater than a first threshold, determining a first-type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note immediately following the second note; or,
if the difference between the energy of the first note and the energy of the second note is greater than a second threshold, determining a first-type turning point in the background audio according to the occurrence time of the first note.
In a possible implementation manner, the priority of the first-type turning point determined according to the occurrence time of the first note is positively correlated with the ratio of the energy of the first note to the energy of the second note; or,
the priority of the first-type turning point determined according to the occurrence time of the first note is positively correlated with the difference between the energy of the first note and the energy of the second note.
In one possible implementation, determining a turning point in the background audio according to the energy of the musical notes in the background audio includes:
if the similarity between a first audio segment and a second audio segment in the background audio is greater than a third threshold, determining third-type turning points in the background audio according to the occurrence time of the last note of the first audio segment and the occurrence time of the last note of the second audio segment, respectively.
In a possible implementation manner, the priority of the third-type turning point determined according to the occurrence time of the last note of the first audio segment is positively correlated with the similarity of the first audio segment and the second audio segment; the priority of the third-type turning point determined according to the occurrence time of the last note of the second audio segment is positively correlated with the similarity of the first audio segment and the second audio segment.
In one possible implementation, determining a turning point in the background audio according to the energy of the musical notes in the background audio includes:
and if the strong-weak alternation of the note energies of a first audio segment in the background audio conforms to a specified strong-weak alternating pattern, determining a fourth-type turning point in the background audio according to the occurrence time of the last note of the first audio segment.
In one possible implementation, the priority of the fourth-type turning point determined according to the occurrence time of the last note of the first audio segment is positively correlated with the energy difference between adjacent notes of the first audio segment.
In one possible implementation, determining a turning point in the background audio according to a time interval between notes in the background audio includes:
and if the time interval between a first note and a second note in the background audio is greater than a fourth threshold, determining a second-type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note immediately following the second note.
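The interval rule above can be sketched in Python as follows. This is an illustration only: the onset-time representation and the threshold value of 0.8 seconds are assumptions for the example, not values given in the disclosure.

```python
def second_type_turning_points(onsets, gap_threshold=0.8):
    """Find second-type turning points: whenever the time interval between
    a note and the note before it exceeds gap_threshold (seconds), the
    later note's onset time is taken as a turning point.

    onsets: sorted list of note onset times, in seconds.
    """
    return [t_cur for t_prev, t_cur in zip(onsets, onsets[1:])
            if t_cur - t_prev > gap_threshold]
```

For onsets [0.0, 0.2, 0.4, 1.5, 1.7], the single long gap (0.4 to 1.5) yields the turning point 1.5.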
According to another aspect of the present disclosure, there is provided an audio-video editing apparatus including:
the shot detection module is used for carrying out shot detection on at least one video to be clipped and determining at least one shot section in the at least one video to be clipped;
the first determining module is used for determining a turning point in background audio corresponding to the at least one video to be edited;
and the synthesis module is used for synthesizing the at least one shot section and the background audio so as to enable shot switching time points between adjacent shot sections to be matched with the turning point.
In one possible implementation, the first determining module includes one or both of a first determining submodule and a second determining submodule;
the first determining submodule is used for determining turning points in the background audio according to the energy of notes in the background audio;
the second determining submodule is used for determining turning points in the background audio according to time intervals among the notes in the background audio.
In one possible implementation, the apparatus further includes:
a second determining module for determining the priority of the turning point;
the third determining module is used for determining a target turning point from the turning points according to the priority of the turning points;
the synthesis module is configured to:
and synthesizing the at least one shot section and the background audio so that shot switching time points between adjacent shot sections are matched with the target turning point.
In one possible implementation, the second determining module includes one or both of a third determining submodule and a fourth determining submodule;
the third determining submodule is used for determining the priorities of turning points of different types according to the speed of the background audio;
and the fourth determining submodule is used for determining the priorities of different turning points of the same type according to the energies of the notes corresponding to those turning points.
In one possible implementation, the first determining sub-module is configured to:
if the ratio of the energy of a first note in the background audio to the energy of a second note is greater than a first threshold, determine a first-type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note immediately following the second note; or,
if the difference between the energy of the first note and the energy of the second note is greater than a second threshold, determine a first-type turning point in the background audio according to the occurrence time of the first note.
In a possible implementation manner, the priority of the first-type turning point determined according to the occurrence time of the first note is positively correlated with the ratio of the energy of the first note to the energy of the second note; or,
the priority of the first-type turning point determined according to the occurrence time of the first note is positively correlated with the difference between the energy of the first note and the energy of the second note.
In one possible implementation, the first determining sub-module is configured to:
if the similarity between a first audio segment and a second audio segment in the background audio is greater than a third threshold, determine third-type turning points in the background audio according to the occurrence time of the last note of the first audio segment and the occurrence time of the last note of the second audio segment, respectively.
In a possible implementation manner, the priority of the third-type turning point determined according to the occurrence time of the last note of the first audio segment is positively correlated with the similarity of the first audio segment and the second audio segment; the priority of the third-type turning point determined according to the occurrence time of the last note of the second audio segment is positively correlated with the similarity of the first audio segment and the second audio segment.
In one possible implementation, the first determining sub-module is configured to:
and if the strong-weak alternation of the note energies of a first audio segment in the background audio conforms to a specified strong-weak alternating pattern, determine a fourth-type turning point in the background audio according to the occurrence time of the last note of the first audio segment.
In one possible implementation, the priority of the fourth-type turning point determined according to the occurrence time of the last note of the first audio segment is positively correlated with the energy difference between adjacent notes of the first audio segment.
In one possible implementation, the second determining submodule is configured to:
and if the time interval between a first note and a second note in the background audio is greater than a fourth threshold, determine a second-type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note immediately following the second note.
According to another aspect of the present disclosure, there is provided an audio-video editing apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the above-described method.
In the embodiments of the disclosure, shot detection is performed on at least one video to be clipped to determine at least one shot section in the at least one video to be clipped; a turning point in background audio corresponding to the at least one video to be clipped is determined; and the at least one shot section and the background audio are synthesized so that shot switching time points between adjacent shot sections match the turning points. By analyzing the background audio to determine its turning points, the shot sections are made to correspond to audio sections of the background audio, changes in the music are matched with the rhythm of shot switching, and a synthesized video with coordinated music and picture changes can be obtained.
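One minimal way to realize the matching described above is to snap each planned shot-switching time to the nearest turning point of the background audio. This sketch is an illustration under that assumption, not the specific algorithm defined by the disclosure:

```python
def match_cuts_to_turning_points(cut_times, turning_points):
    """Snap each planned shot-switching time to the nearest audio turning
    point, so that music changes line up with shot changes.

    cut_times: planned shot-cut times (seconds).
    turning_points: non-empty list of turning-point times (seconds).
    """
    return [min(turning_points, key=lambda tp: abs(tp - t))
            for t in cut_times]
```

For example, planned cuts at 2.1 s and 5.4 s against turning points [0.0, 2.0, 4.0, 5.5] snap to [2.0, 5.5].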
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of an audio-video editing method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram illustrating matching of shot switching time points and turning points between adjacent shot segments in an audio and video editing method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating a shot switching time point between adjacent shot segments and a target turning point in an audio-video editing method according to an embodiment of the present disclosure.
Fig. 4 shows a block diagram of an audio and video editing apparatus according to an embodiment of the present disclosure.
Fig. 5 is a block diagram illustrating an audio and video editing apparatus 800 according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating an audio and video editing apparatus 1900 according to an exemplary embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flowchart of an audio and video editing method according to an embodiment of the present disclosure. The method may be executed by an audio and video editing apparatus, for example a terminal device, a server, or another processing device. The terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. In some possible implementations, the method may be implemented by a processor calling computer-readable instructions stored in a memory. As shown in Fig. 1, the audio and video editing method includes steps S11 to S13.
In step S11, shot detection is performed on at least one video to be clipped, and at least one shot section in the at least one video to be clipped is determined.
In the disclosed embodiments, one or more videos to be clipped may be synthesized with background audio, where each video to be clipped may include one or more shot sections. The embodiments of the disclosure may use related techniques to perform shot detection on the video to be clipped and determine the shot switching time points in it, i.e., determine the different shot sections in the video to be clipped.
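The disclosure leaves the shot-detection technique to related art. As a hedged illustration only, a naive frame-differencing sketch (the brightness-vector frame representation and the threshold value are assumptions, not part of the patent) could look like this:

```python
def detect_shot_boundaries(frames, threshold=0.5):
    """Naive shot detection: report frame index i as a shot boundary when
    the mean absolute brightness difference between frame i-1 and frame i
    exceeds threshold.

    frames: list of equal-length brightness vectors, one per video frame.
    """
    boundaries = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        # mean absolute per-pixel difference between consecutive frames
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            boundaries.append(i)
    return boundaries
```

Three dark frames followed by two bright frames, for instance, produce a single boundary at index 3; production systems would instead use histogram- or feature-based detectors.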
In a possible implementation manner, when the number of videos to be clipped is greater than 1, the order of the videos in the composite video may be determined according to the order in which the videos to be clipped were selected.
In step S12, a turning point in the background audio corresponding to the at least one video to be clipped is determined.
In the embodiment of the present disclosure, the background audio corresponding to the at least one video to be clipped may refer to audio corresponding to background music used for synthesizing the at least one video to be clipped.
In one possible implementation, determining the turning point in the background audio corresponding to the at least one video to be edited includes one or both of: determining turning points in the background audio according to the energy of the notes in the background audio; and determining turning points in the background audio according to the time intervals between notes in the background audio.
As an example of this implementation, determining a turning point in the background audio from the energy of the notes in the background audio includes: if the ratio of the energy of a first note in the background audio to the energy of a second note is greater than a first threshold, determining a first-type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note immediately following the second note; or, if the difference between the energy of the first note and the energy of the second note is greater than a second threshold, determining a first-type turning point in the background audio according to the occurrence time of the first note. In this example, the turning points in the background audio include first-type turning points.
For example, the first threshold may be equal to 2. In this example, a first note may be regarded as a peak note if the ratio of its energy to the energy of the second note is greater than the first threshold.
For example, determining a first-type turning point in the background audio according to the occurrence time of the first note may be: determining the occurrence time of the first note as a first-type turning point in the background audio.
As another example, determining a first-type turning point in the background audio according to the occurrence time of the first note may be: determining the occurrence time of the note preceding the first note (i.e., the second note) as a first-type turning point in the background audio.
It should be noted that, although determining the first-type turning point according to the occurrence time of the first note has been described above by taking as examples determining the occurrence time of the first note, or the occurrence time of the preceding note (i.e., the second note), as the first-type turning point, those skilled in the art will understand that the present disclosure is not limited thereto. A person skilled in the art can flexibly determine the first-type turning point according to the occurrence time of the first note based on the requirements of the actual application scenario and/or personal preference. For example, a note interval may be determined from the first note, with the first note as its middle point and a preset length; the occurrence time of any note in that interval may then be used to determine a first-type turning point in the background audio.
In this example, the priority of the first-type turning point determined from the occurrence time of the first note is, for example, positively correlated with the ratio of the energy of the first note to the energy of the second note; or, it is positively correlated with the difference between the energy of the first note and the energy of the second note.
In other words, the greater the ratio of the energy of the first note to the energy of the second note, the higher the priority of the first-type turning point determined from the occurrence time of the first note; alternatively, the greater the difference between the energy of the first note and the energy of the second note, the higher that priority.
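The energy-ratio variant of the first-type rule, together with its ratio-based priority, can be sketched as follows. The (onset, energy) note representation and the use of the raw energy ratio as the priority score are illustrative assumptions:

```python
def first_type_turning_points(notes, ratio_threshold=2.0):
    """Detect first-type turning points with the energy-ratio rule: if
    energy(current) / energy(previous) exceeds ratio_threshold, the
    current note's onset is a turning point, and its priority grows with
    the ratio (positive correlation, as described in the text).

    notes: list of (onset_time, energy) pairs in playing order.
    Returns a list of (onset_time, priority) pairs.
    """
    points = []
    for (_, e_prev), (t_cur, e_cur) in zip(notes, notes[1:]):
        if e_prev > 0 and e_cur / e_prev > ratio_threshold:
            # use the ratio itself as the priority score
            points.append((t_cur, e_cur / e_prev))
    return points
```

With notes [(0.0, 1.0), (0.5, 3.0), (1.0, 3.2)] only the jump from energy 1.0 to 3.0 qualifies, giving a turning point at 0.5 with priority 3.0.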
As another example of this implementation, determining a turning point in the background audio from the energy of the notes in the background audio includes: if the similarity between a first audio segment and a second audio segment in the background audio is greater than a third threshold, determining third-type turning points in the background audio according to the occurrence time of the last note of the first audio segment and the occurrence time of the last note of the second audio segment, respectively. In this example, the turning points in the background audio include third-type turning points.
In this example, the similarity of the first audio segment and the second audio segment may be determined based on the differences in the energy of the notes at corresponding positions of the two segments. For example, if the energies of the notes of the first audio segment are, in order, x1 to xn, and the energies of the notes of the second audio segment are y1 to yn, the distance between the first audio segment and the second audio segment can be expressed, for example, as the Euclidean distance d = sqrt((x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2).
The greater the distance between the first audio segment and the second audio segment, the lower their similarity; the smaller the distance, the higher their similarity.
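Under the assumption that the distance over per-note energies takes the Euclidean form (the original gives the formula only as an image, so this concrete form and the `max_distance` cutoff are assumptions), the similarity test can be sketched as:

```python
import math

def segment_distance(xs, ys):
    """Distance between two audio segments, computed from the differences
    in note energies at corresponding positions (assumed Euclidean form).
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

def segments_similar(xs, ys, max_distance=1.0):
    """Smaller distance means higher similarity; treat the segments as
    similar when the distance is at most max_distance (illustrative)."""
    return segment_distance(xs, ys) <= max_distance
```

Identical energy sequences give distance 0 (maximal similarity), and the distance grows as the per-note energies diverge.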
In this example, the third-type turning points in the background audio may be determined according to the occurrence time of the last note of the first audio segment and the occurrence time of the last note of the second audio segment. For example, the occurrence time of the last note of the first audio segment and the occurrence time of the last note of the second audio segment may each be determined as a third-type turning point in the background audio.
As another example, the occurrence time of the note following the last note of the first audio segment and the occurrence time of the note following the last note of the second audio segment may each be determined as a third-type turning point in the background audio.
It should be noted that, although determining the third-type turning points has been described above by taking these two examples, those skilled in the art will understand that the present disclosure is not limited thereto. A person skilled in the art can flexibly determine the third-type turning points in the background audio according to the occurrence times of the last notes of the first and second audio segments based on the requirements of the actual application scenario and/or personal preference.
In this example, the priority of the third-type turning point determined according to the occurrence time of the last note of the first audio segment is, for example, positively correlated with the similarity of the first audio segment and the second audio segment, as is the priority of the third-type turning point determined according to the occurrence time of the last note of the second audio segment.
In other words, the higher the similarity of the first audio segment and the second audio segment, the higher the priorities of the third-type turning points determined according to the occurrence times of the last notes of the two segments.
As another example of this implementation, determining a turning point in the background audio from the energy of the notes in the background audio includes: if the strong-weak alternation of the note energies of a first audio segment in the background audio conforms to a specified strong-weak alternating pattern, determining a fourth-type turning point in the background audio according to the occurrence time of the last note of the first audio segment. In this example, the turning points in the background audio include fourth-type turning points.
In this example, the specified alternating pattern of intensity may include one or more strong/weak alternating patterns. For example, one strong/weak alternating pattern is "strong, weak" repeated, with the number of notes for the pattern set to 10 or more.
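A check against such a strong/weak alternating pattern could be sketched as follows; the function name, the strict comparisons, and the default note count are illustrative assumptions of this sketch, not the exact method of the disclosure.

```python
def matches_alternating_pattern(energies, min_notes=10):
    """Return True if the note energies alternate strong/weak (each
    "strong" note louder than the "weak" note after it) for at least
    `min_notes` notes, per the example pattern above."""
    if len(energies) < min_notes:
        return False
    for i in range(len(energies) - 1):
        if i % 2 == 0:
            # a "strong" position must be louder than the next note
            if energies[i] <= energies[i + 1]:
                return False
        else:
            # a "weak" position must be quieter than the next note
            if energies[i] >= energies[i + 1]:
                return False
    return True
```

When this function returns True for the first audio segment, the occurrence time of its last note would be taken as a fourth-type turning point.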
In this example, the fourth type of turning point in the background audio is determined from the time of occurrence of the last note of the first audio piece, which may be, for example: the time of occurrence of the last note of the first audio piece is determined as a fourth type of turning point in the background audio.
As another example, determining the fourth type turning point in the background audio according to the occurrence time of the last note of the first audio piece may be: the time of occurrence of the next note to the last note of the first audio piece is determined as a fourth type of inflection point in the background audio.
It should be noted that, although two examples of determining the fourth-type turning point in the background audio from the occurrence time of the last note of the first audio segment are described above (determining that occurrence time itself as a fourth-type turning point, or determining the occurrence time of the note immediately following it as a fourth-type turning point), those skilled in the art can understand that the present disclosure should not be limited thereto. Those skilled in the art can flexibly determine the fourth-type turning point in the background audio from the occurrence time of the last note of the first audio segment according to the requirements of the actual application scene and/or personal preference.
In this example, the priority of the fourth-type turning point determined according to the occurrence time of the last note of the first audio segment is, for example, positively correlated with the energy difference values of the adjacent notes of the first audio segment.
In this example, the greater the energy difference between adjacent notes of the first audio piece, the higher the priority of the fourth type of inflection point determined by the time of occurrence of the last note of the first audio piece.
In this example, the standard deviation of the energy difference values of adjacent notes of the first audio segment may be expressed, for example, as

$$\sigma=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N-1}\bigl(d_i-\bar{d}\bigr)^{2}},\qquad d_i=x_{i+1}-x_i,$$

where $N$ represents the number of notes of the first audio segment, $x_i$ represents the energy of the $i$-th note, $\bar{d}$ represents the mean of the differences $d_i$, and $1 \le i \le N-1$.
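A minimal sketch of computing the standard deviation of the energy difference values of adjacent notes; the differences d_i = x_{i+1} - x_i and the use of a population standard deviation are assumptions of this illustration, and the names are hypothetical.

```python
import math

def energy_diff_std(energies):
    """Standard deviation of the adjacent-note energy differences
    d_i = x_{i+1} - x_i over the notes of an audio segment."""
    diffs = [b - a for a, b in zip(energies, energies[1:])]
    mean = sum(diffs) / len(diffs)
    variance = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return math.sqrt(variance)
```

A larger value here would indicate bigger energy jumps between adjacent notes, and hence a higher priority for the corresponding fourth-type turning point.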
As an example of this implementation, determining a turning point in the background audio according to a time interval between notes in the background audio comprises: and if the time interval between the first note and the second note in the background audio is larger than a fourth threshold value, determining a second type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the next note of the second note. In this example, the inflection points in the background audio comprise inflection points of a second type.
For example, determining the second type of turning point in the background audio according to the occurrence time of the first note may be: the time of occurrence of the first note is determined as a second type of inflection point in the background audio.
As another example, determining the second-type turning point in the background audio according to the occurrence time of the first note may be: determining the occurrence time of the note preceding the first note (i.e., the second note) as a second-type turning point in the background audio.
It should be noted that, although two examples of determining the second-type turning point in the background audio from the occurrence time of the first note are described above (determining the occurrence time of the first note itself as a second-type turning point, or determining the occurrence time of the preceding note, i.e. the second note, as a second-type turning point), those skilled in the art can understand that the present disclosure should not be limited thereto. Those skilled in the art can flexibly determine the second-type turning point in the background audio from the occurrence time of the first note according to the requirements of the actual application scene and/or personal preference.
For example, the fourth threshold may be equal to 1.8 × 60/t, where t represents the number of beats per minute of the background audio, i.e. the threshold corresponds to a duration of 1.8 beats.
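The time-interval rule could be sketched as follows; representing note onsets as times in seconds and passing the tempo in beats per minute are assumptions of this illustration.

```python
def interval_turning_points(onsets, bpm):
    """Return the occurrence times of notes whose gap from the
    previous note exceeds the fourth threshold 1.8 * 60 / bpm."""
    threshold = 1.8 * 60.0 / bpm  # seconds spanned by 1.8 beats
    return [cur for prev, cur in zip(onsets, onsets[1:]) if cur - prev > threshold]
```

Each returned time corresponds to the "first note" of the rule above, whose occurrence time may then serve as a second-type turning point.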
In step S13, the at least one shot section and the background audio are synthesized so that the shot switching time points between adjacent shot sections match the turning points.
In one possible implementation, the matching of the shot-cut time points between adjacent shot segments and the inflection points may indicate that the shot-cut time points between adjacent shot segments are at the inflection points, i.e., the shot-cut time points between adjacent shot segments are aligned with the inflection points.
In another possible implementation manner, the matching of the shot switching time points between adjacent shot sections and the turning points may indicate that the difference between them, measured in beats, is less than a seventh threshold. In this implementation, the shot switching time points between adjacent shot sections are not required to be perfectly aligned with the turning points; for example, the two may differ by one beat.
Fig. 2 is a schematic diagram illustrating the matching of shot switching time points between adjacent shot segments and turning points in an audio and video editing method according to an embodiment of the present disclosure. In Fig. 2, the line segment represents the time axis, A1 to A4 indicate the shot switching time points, and B1 to B8 indicate the turning points. As shown in Fig. 2, in the embodiment of the disclosure, turning points equal in number to the shot switching time points can be selected from the turning points as target turning points, so that the shot switching time points are matched with the target turning points. In the embodiment of the present disclosure, the target turning points may be selected from the turning points in any manner; for example, turning points equal in number to the shot switching time points may be randomly selected as the target turning points.
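The random selection of target turning points equal in number to the shot switching time points might look like this sketch; the function name and the optional seeding are illustrative assumptions.

```python
import random

def pick_target_turning_points(turning_points, num_cut_points, seed=None):
    """Randomly choose as many turning points as there are shot
    switching time points, returned in chronological order."""
    rng = random.Random(seed)
    return sorted(rng.sample(turning_points, num_cut_points))
```

Other selection strategies (e.g. priority-based, as described next) can replace the random sampling without changing the calling code.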
In one possible implementation, the method further includes: determining the priority of the turning point; determining a target turning point from the turning points according to the priorities of the turning points; synthesizing the at least one shot section and the background audio such that shot-cut time points and inflection points between adjacent shot sections match, including: the at least one shot section and the background audio are synthesized such that shot-cut time points between adjacent shot sections match the target inflection point.
As an example of this implementation, determining the priority of the turning point includes one or both of: determining the priorities of the turning points of different classes according to the speed of the background audio; and determining the priorities of the same type of different turning points according to the energies of the notes corresponding to the same type of different turning points.
In this example, the speed of the background audio may refer to the number of beats per minute of the background audio.
For example, determining the priorities of turning points of different classes according to the speed of the background audio may include: if the speed of the background audio is less than a fifth threshold, then first-type turning point > second-type turning point > third-type turning point > fourth-type turning point. For example, the fifth threshold is 60 bpm. Here ">" indicates a higher priority; for example, second-type turning point > third-type turning point indicates that the second-type turning point has a higher priority than the third-type turning point.
As another example, determining the priorities of turning points of different classes according to the speed of the background audio may include: if the speed of the background audio is greater than or equal to the fifth threshold and less than a sixth threshold, then third-type turning point > fourth-type turning point > second-type turning point > first-type turning point. For example, the sixth threshold is 120 bpm.
As another example, determining the priorities of turning points of different classes according to the speed of the background audio may include: if the speed of the background audio is greater than or equal to the sixth threshold, then first-type turning point > fourth-type turning point > second-type turning point > third-type turning point.
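One way to encode the three tempo bands as priority orderings is sketched below; the exact chains within each band are an interpretation of the text, and the threshold defaults of 60 and 120 bpm follow the examples given above.

```python
def priority_order(bpm, fifth_threshold=60, sixth_threshold=120):
    """Return the turning-point classes (1-4) from highest to lowest
    priority for the given background-audio speed in beats per minute."""
    if bpm < fifth_threshold:
        return [1, 2, 3, 4]   # first > second > third > fourth
    if bpm < sixth_threshold:
        return [3, 4, 2, 1]   # third > fourth > second > first
    return [1, 4, 2, 3]       # first > fourth > second > third
```

The returned ordering can then be used to rank candidate turning points before selecting the target turning points.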
As an example of this implementation, the shot-cut time point between adjacent shot segments matches the target inflection point, which may indicate that the shot-cut time point between adjacent shot segments is at the target inflection point, i.e., the shot-cut time point between adjacent shot segments is aligned with the target inflection point.
As another example of this implementation, the matching of the shot switching time point between adjacent shot sections and the target turning point may indicate that the difference between them, measured in beats, is less than an eighth threshold. In this example, the shot switching time point between adjacent shot sections is not required to be perfectly aligned with the target turning point; for example, the two may differ by one beat.
Fig. 3 is a schematic diagram illustrating a shot switching time point between adjacent shot segments and a target turning point in an audio-video editing method according to an embodiment of the present disclosure.
In this implementation, target turning points are determined from the turning points according to the priorities of the turning points, and the at least one shot section and the background audio are synthesized such that the shot switching time points between adjacent shot sections match the target turning points, thereby enabling the rhythm of shot switching to better match the rhythm of the music changes.
In the embodiment of the present disclosure, video segments having the same duration as the corresponding audio segments are cut from each of the shot segments and synthesized, so that the shot switching time points between adjacent shot segments are matched with the turning points. For example, in the example shown in Fig. 3, the audio segment corresponding to shot segment Cj is Dj, where 1 ≤ j ≤ 5. In this implementation, a video segment having the same duration as audio segment Dj is cut from shot segment Cj and synthesized, so that the shot switching time points between adjacent shot segments are matched with the target turning points. For example, if D1 has a duration of 3 seconds and C1 is longer than 3 seconds, a 3-second video segment is cut from C1 for synthesis.
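The trimming step, cutting from each shot segment a video segment as long as its corresponding audio segment, can be sketched as follows; clamping to the shot's own length when the shot is shorter is an assumed fallback not specified in the text.

```python
def trim_to_audio(shot_durations, audio_durations):
    """For each shot segment C_j, keep a video segment whose duration
    equals that of the corresponding audio segment D_j (clamped to the
    shot's own length when the shot is shorter)."""
    return [min(shot, audio) for shot, audio in zip(shot_durations, audio_durations)]
```

Concatenating segments of these durations places each shot switch at the end of its paired audio segment, i.e. at a turning point.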
In the embodiment of the disclosure, shot detection is performed on at least one video to be edited, at least one shot segment in the at least one video to be edited is determined, a turning point in the background audio corresponding to the at least one video to be edited is determined, and the at least one shot segment and the background audio are synthesized so that the shot switching time points between adjacent shot segments are matched with the turning points. By determining the turning points through analysis of the background audio, the shot segments are made to correspond to audio segments of the background audio, the music changes are matched with the rhythm of shot switching, and a synthesized video in which the music changes and the picture changes are coordinated is obtained. The embodiment of the disclosure can reduce the audio and video editing and production cost and time for ordinary video users and merchants.
Fig. 4 shows a block diagram of an audiovisual editing apparatus according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus includes: a shot detection module 41, configured to perform shot detection on at least one video to be edited, and determine at least one shot section in the at least one video to be edited; a first determining module 42, configured to determine a turning point in the background audio corresponding to the at least one video to be edited; a synthesizing module 43 for synthesizing the at least one shot section and the background audio so that the shot switching time point and the inflection point between adjacent shot sections are matched.
In one possible implementation, the first determination module 42 includes one or both of a first determination sub-module and a second determination sub-module; the first determining submodule is used for determining a turning point in the background audio according to the energy of the musical notes in the background audio; and the second determining submodule is used for determining the turning point in the background audio according to the time interval between the notes in the background audio.
In one possible implementation, the apparatus further includes: the second determining module is used for determining the priority of the turning point; the third determining module is used for determining a target turning point from the turning points according to the priority of the turning points; the synthesis module 43 is configured to: the at least one shot section and the background audio are synthesized such that shot-cut time points between adjacent shot sections match the target inflection point.
In one possible implementation, the second determining module includes one or both of a third determining submodule and a fourth determining submodule; the third determining submodule is used for determining the priority of the turning points of different classes according to the speed of the background audio; and the fourth determining submodule is used for determining the priorities of the same type of different turning points according to the energies of the notes corresponding to the same type of different turning points.
In one possible implementation, the first determining submodule is configured to: if the ratio of the energy of the first note to the energy of the second note in the background audio is greater than a first threshold, determine a first-class turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note following the second note; or, if the difference between the energy of the first note and the energy of the second note is greater than a second threshold, determine a first-class turning point in the background audio according to the occurrence time of the first note.
In one possible implementation manner, the priority of the first-class turning point determined according to the occurrence time of the first note is positively correlated with the ratio of the energy of the first note to the energy of the second note; or the priority of the first-class turning point determined according to the appearance time of the first note is positively correlated with the difference between the energy of the first note and the energy of the second note.
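The first-class turning point rule (energy ratio or energy difference against a threshold) might be sketched like this; the `(occurrence_time, energy)` note representation and the parameter names are illustrative assumptions.

```python
def first_class_turning_points(notes, ratio_threshold=None, diff_threshold=None):
    """Return occurrence times of notes whose energy exceeds that of the
    preceding note by more than a ratio threshold (first branch) or an
    absolute difference threshold (second branch).
    `notes` is a list of (occurrence_time, energy) pairs."""
    points = []
    for (_, prev_energy), (time, energy) in zip(notes, notes[1:]):
        if ratio_threshold is not None:
            if prev_energy > 0 and energy / prev_energy > ratio_threshold:
                points.append(time)
        elif diff_threshold is not None and energy - prev_energy > diff_threshold:
            points.append(time)
    return points
```

The larger the ratio (or difference) at a detected note, the higher the priority that could be assigned to the resulting first-class turning point, per the correlation described above.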
In one possible implementation, the first determining submodule is configured to: and if the similarity between the first audio segment and the second audio segment in the background audio is greater than a third threshold, respectively determining a third turning point in the background audio according to the occurrence time of the last note of the first audio segment and the occurrence time of the last note of the second audio segment.
In a possible implementation manner, the priority of the third turning point determined according to the occurrence time of the last note of the first audio segment is positively correlated with the similarity between the first audio segment and the second audio segment; the priority of the third turning point determined according to the occurrence time of the last note of the second audio segment is positively correlated with the similarity of the first audio segment and the second audio segment.
In one possible implementation, the first determining submodule is configured to: and if the alternating phenomenon of the intensity of the energy of the notes of the first audio segment in the background audio conforms to the specified alternating pattern of the intensity, determining a fourth turning point in the background audio according to the occurrence time of the last note of the first audio segment.
In one possible implementation, the priority of the fourth type turning point determined according to the occurrence time of the last note of the first audio piece is positively correlated with the energy difference value of the adjacent notes of the first audio piece.
In one possible implementation, the second determining submodule is configured to: and if the time interval between the first note and the second note in the background audio is larger than a fourth threshold value, determining a second type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the next note of the second note.
In the embodiment of the disclosure, shot detection is performed on at least one video to be clipped, at least one shot section in the at least one video to be clipped is determined, a turning point in background audio corresponding to the at least one video to be clipped is determined, and the at least one shot section and the background audio are synthesized, so that shot switching time points and turning points between adjacent shot sections are matched, thereby determining the turning point in the background audio by analyzing the background audio, enabling the shot sections to correspond to audio sections of the background audio, enabling music change to be matched with shot switching rhythm, and obtaining a synthesized video with music change and picture change coordinated.
Fig. 5 is a block diagram illustrating an audiovisual editing apparatus 800 according to an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800. The sensor assembly 814 may also detect a change in the position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the device 800 to perform the above-described methods.
Fig. 6 is a block diagram illustrating an audiovisual editing apparatus 1900 in accordance with an exemplary embodiment. For example, the apparatus 1900 may be provided as a server. Referring to FIG. 6, the device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The device 1900 may further include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the apparatus 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, it is noted that the foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (24)

1. An audio-video editing method, characterized by comprising:
performing shot detection on at least one video to be edited, and determining at least one shot section in the at least one video to be edited;
determining a turning point in background audio corresponding to the at least one video to be edited;
and synthesizing the at least one shot section and the background audio so that shot switching time points between adjacent shot sections are matched with the turning point.
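As a non-limiting illustration of the method of claim 1, the sketch below aligns shot-cut time points with audio turning points. All function names, data shapes, and the snap-to-next-turning-point rule are hypothetical; the claim does not prescribe an implementation.

```python
def align_cuts_to_turning_points(shot_segments, turning_points):
    """Place each shot-cut time point on a turning point of the background audio.

    shot_segments  -- list of (start, end) times in seconds, from shot detection
    turning_points -- sorted list of turning-point times in the background audio
    Returns the edited timeline as a list of (start, end) intervals.
    """
    timeline = []
    t = 0.0
    for start, end in shot_segments:
        natural_end = t + (end - start)
        # Snap the cut to the first turning point at or after the shot's
        # natural end; keep the natural end if no turning point remains.
        cut = next((tp for tp in turning_points if tp >= natural_end), natural_end)
        timeline.append((t, cut))
        t = cut
    return timeline
```

Under this rule, each shot is lengthened slightly so its outgoing cut lands exactly on a musical turning point; other matching strategies (trimming shots, or choosing the nearest point on either side) would equally satisfy the claim's "matched with the turning point" condition.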
2. The method of claim 1, wherein determining the turning point in the background audio corresponding to the at least one video to be edited comprises one or both of:
determining turning points in the background audio according to the energy of the notes in the background audio;
determining a turning point in the background audio according to a time interval between notes in the background audio.
3. The method of claim 1 or 2, further comprising:
determining a priority of the turning point;
determining a target turning point from the turning points according to the priorities of the turning points;
synthesizing the at least one shot section and the background audio such that shot-cut time points between adjacent shot sections match the inflection point, comprising:
and synthesizing the at least one shot section and the background audio so that shot switching time points between adjacent shot sections are matched with the target turning point.
4. The method of claim 3, wherein determining the priority of the turning point comprises one or both of:
determining the priorities of turning points of different types according to the tempo of the background audio;
and determining the priorities of different turning points of the same type according to the energies of the notes corresponding to those turning points.
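Claim 4 ties inter-type priority to the tempo of the background audio and intra-type priority to note energy. The sketch below is one illustrative reading; the concrete tempo rule (fast music favours energy-jump points, slow music favours pause points) and the 120 BPM boundary are invented for the example, not taken from the patent.

```python
def rank_turning_points(points, tempo_bpm):
    """points: list of (time, type_id, note_energy) tuples.

    Returns the points sorted from highest to lowest priority:
    first by a tempo-dependent ranking of the turning-point types,
    then, within a type, by descending note energy.
    """
    if tempo_bpm >= 120:
        # Hypothetical rule: fast music favours energy-jump (type 1) points.
        type_rank = {1: 0, 2: 1, 3: 2, 4: 3}
    else:
        # Hypothetical rule: slow music favours pause (type 2) points.
        type_rank = {2: 0, 1: 1, 4: 2, 3: 3}
    return sorted(points, key=lambda p: (type_rank[p[1]], -p[2]))
```

A target turning point per claim 3 could then simply be the head of this sorted list nearest to each candidate cut time.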
5. The method of claim 2, wherein determining a turning point in the background audio based on the energy of the notes in the background audio comprises:
if the ratio of the energy of a first note to the energy of a second note in the background audio is greater than a first threshold, determining a first-type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note following the second note; or
if the difference between the energy of the first note and the energy of the second note is greater than a second threshold, determining a first-type turning point in the background audio according to the occurrence time of the first note.
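The two alternatives of claim 5 can be sketched as one pass over the note sequence; the threshold values below are illustrative, since the patent leaves them unspecified.

```python
def first_type_turning_points(notes, ratio_threshold=3.0, diff_threshold=0.5):
    """notes: list of (onset_time, energy) pairs in playback order.

    A first-type turning point sits at the onset of a note whose energy
    jumps relative to the preceding note, either by ratio or by difference
    (the two branches mirror the claim's two alternatives).
    """
    points = []
    for (_, prev_energy), (onset, energy) in zip(notes, notes[1:]):
        if prev_energy > 0 and energy / prev_energy > ratio_threshold:
            points.append(onset)  # ratio branch
        elif energy - prev_energy > diff_threshold:
            points.append(onset)  # difference branch
    return points
```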
6. The method according to claim 5, wherein the priority of the first-type turning point determined according to the occurrence time of the first note is positively correlated with the ratio of the energy of the first note to the energy of the second note; or
the priority of the first-type turning point determined according to the occurrence time of the first note is positively correlated with the difference between the energy of the first note and the energy of the second note.
7. The method of claim 2, wherein determining a turning point in the background audio based on the energy of the notes in the background audio comprises:
if the similarity between a first audio segment and a second audio segment in the background audio is greater than a third threshold, determining third-type turning points in the background audio according to the occurrence time of the last note of the first audio segment and the occurrence time of the last note of the second audio segment, respectively.
8. The method according to claim 7, wherein the priority of the third-type turning point determined according to the occurrence time of the last note of the first audio segment is positively correlated with the similarity between the first audio segment and the second audio segment; and the priority of the third-type turning point determined according to the occurrence time of the last note of the second audio segment is positively correlated with the similarity between the first audio segment and the second audio segment.
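Claims 7 and 8 place turning points at the ends of two similar audio segments (e.g. a repeated musical phrase). The patent does not name a similarity measure, so cosine similarity of the segments' energy envelopes is used below as one plausible, purely illustrative choice.

```python
import math

def third_type_turning_points(seg_a, seg_b, last_note_a, last_note_b,
                              sim_threshold=0.9):
    """seg_a, seg_b: equal-length energy envelopes of two audio segments;
    last_note_a, last_note_b: onset times of each segment's last note.

    If the two segments are similar enough, a third-type turning point is
    placed at the last note of each segment; otherwise none is emitted.
    """
    dot = sum(a * b for a, b in zip(seg_a, seg_b))
    norm = math.sqrt(sum(a * a for a in seg_a)) * math.sqrt(sum(b * b for b in seg_b))
    similarity = dot / norm if norm else 0.0
    return [last_note_a, last_note_b] if similarity > sim_threshold else []
```

Per claim 8, the similarity value itself could double as the priority of both emitted points.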
9. The method of claim 2, wherein determining a turning point in the background audio based on the energy of the notes in the background audio comprises:
if the strong-weak alternation of the note energies of a first audio segment in the background audio conforms to a specified intensity alternation pattern, determining a fourth-type turning point in the background audio according to the occurrence time of the last note of the first audio segment.
10. The method of claim 9, wherein the priority of the fourth-type turning point determined according to the occurrence time of the last note of the first audio segment is positively correlated with the energy difference between adjacent notes of the first audio segment.
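One way to read claims 9 and 10 is as a strict strong/weak alternation test on note energies; the sign-alternation rule and the `min_swing` value below are illustrative choices, since the patent only speaks of a "specified intensity alternation pattern".

```python
def fourth_type_turning_point(notes, min_swing=0.3):
    """notes: list of (onset_time, energy) pairs for one audio segment.

    Returns the onset of the segment's last note if the note energies
    alternate strong/weak with a swing of at least `min_swing`;
    returns None if the segment does not match the pattern.
    """
    energies = [e for _, e in notes]
    diffs = [b - a for a, b in zip(energies, energies[1:])]
    alternates = (
        len(diffs) >= 2
        and all(abs(d) >= min_swing for d in diffs)           # every swing is large enough
        and all(d * e < 0 for d, e in zip(diffs, diffs[1:]))  # direction flips each step
    )
    return notes[-1][0] if alternates else None
```

Per claim 10, the average `abs(d)` over the segment could serve as the point's priority.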
11. The method of claim 2, wherein determining a turning point in the background audio based on a time interval between notes in the background audio comprises:
if the time interval between a first note and a second note in the background audio is greater than a fourth threshold, determining a second-type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note following the second note.
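Claim 11's interval test reduces to scanning note onsets for a long gap; the 0.8-second threshold below is an illustrative value, not specified in the patent.

```python
def second_type_turning_points(onsets, gap_threshold=0.8):
    """onsets: sorted note onset times in seconds.

    A second-type turning point sits at each note that follows a silence
    longer than `gap_threshold` seconds (i.e. at the note after a pause).
    """
    return [b for a, b in zip(onsets, onsets[1:]) if b - a > gap_threshold]
```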
12. An audio-video editing apparatus, comprising:
the shot detection module is used for performing shot detection on at least one video to be edited and determining at least one shot section in the at least one video to be edited;
the first determining module is used for determining a turning point in background audio corresponding to the at least one video to be edited;
and the synthesis module is used for synthesizing the at least one shot section and the background audio so as to enable shot switching time points between adjacent shot sections to be matched with the turning point.
13. The apparatus of claim 12, wherein the first determination module comprises one or both of a first determination submodule and a second determination submodule;
the first determining submodule is used for determining turning points in the background audio according to the energy of notes in the background audio;
the second determining submodule is used for determining turning points in the background audio according to time intervals among the notes in the background audio.
14. The apparatus of claim 12 or 13, further comprising:
a second determining module for determining the priority of the turning point;
the third determining module is used for determining a target turning point from the turning points according to the priority of the turning points;
the synthesis module is configured to:
and synthesizing the at least one shot section and the background audio so that shot switching time points between adjacent shot sections are matched with the target turning point.
15. The apparatus of claim 14, wherein the second determination module comprises one or both of a third determination submodule and a fourth determination submodule;
the third determining submodule is used for determining the priorities of turning points of different types according to the tempo of the background audio;
and the fourth determining submodule is used for determining the priorities of different turning points of the same type according to the energies of the notes corresponding to those turning points.
16. The apparatus of claim 13, wherein the first determination submodule is configured to:
if the ratio of the energy of a first note to the energy of a second note in the background audio is greater than a first threshold, determine a first-type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note following the second note; or
if the difference between the energy of the first note and the energy of the second note is greater than a second threshold, determine a first-type turning point in the background audio according to the occurrence time of the first note.
17. The apparatus according to claim 16, wherein the priority of the first-type turning point determined according to the occurrence time of the first note is positively correlated with the ratio of the energy of the first note to the energy of the second note; or
the priority of the first-type turning point determined according to the occurrence time of the first note is positively correlated with the difference between the energy of the first note and the energy of the second note.
18. The apparatus of claim 13, wherein the first determination submodule is configured to:
if the similarity between a first audio segment and a second audio segment in the background audio is greater than a third threshold, determine third-type turning points in the background audio according to the occurrence time of the last note of the first audio segment and the occurrence time of the last note of the second audio segment, respectively.
19. The apparatus according to claim 18, wherein the priority of the third-type turning point determined according to the occurrence time of the last note of the first audio segment is positively correlated with the similarity between the first audio segment and the second audio segment; and the priority of the third-type turning point determined according to the occurrence time of the last note of the second audio segment is positively correlated with the similarity between the first audio segment and the second audio segment.
20. The apparatus of claim 13, wherein the first determination submodule is configured to:
if the strong-weak alternation of the note energies of a first audio segment in the background audio conforms to a specified intensity alternation pattern, determine a fourth-type turning point in the background audio according to the occurrence time of the last note of the first audio segment.
21. The apparatus of claim 20, wherein the priority of the fourth-type turning point determined according to the occurrence time of the last note of the first audio segment is positively correlated with the energy difference between adjacent notes of the first audio segment.
22. The apparatus of claim 13, wherein the second determination submodule is configured to:
if the time interval between a first note and a second note in the background audio is greater than a fourth threshold, determine a second-type turning point in the background audio according to the occurrence time of the first note, wherein the first note is the note following the second note.
23. An audio-video editing apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of claims 1 to 11.
24. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 11.
CN201910001833.2A 2019-01-02 2019-01-02 Audio and video editing method and device and storage medium Pending CN111405357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910001833.2A CN111405357A (en) 2019-01-02 2019-01-02 Audio and video editing method and device and storage medium

Publications (1)

Publication Number Publication Date
CN111405357A true CN111405357A (en) 2020-07-10

Family

ID=71428294

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111741233A (en) * 2020-07-16 2020-10-02 腾讯科技(深圳)有限公司 Video dubbing method and device, storage medium and electronic equipment
CN111901626A (en) * 2020-08-05 2020-11-06 腾讯科技(深圳)有限公司 Background audio determining method, video editing method, device and computer equipment
CN111901626B (en) * 2020-08-05 2021-12-14 腾讯科技(深圳)有限公司 Background audio determining method, video editing method, device and computer equipment
CN112637622A (en) * 2020-12-11 2021-04-09 北京字跳网络技术有限公司 Live broadcasting singing method, device, equipment and medium
CN112866584A (en) * 2020-12-31 2021-05-28 北京达佳互联信息技术有限公司 Video synthesis method, device, terminal and storage medium
CN112866584B (en) * 2020-12-31 2023-01-20 北京达佳互联信息技术有限公司 Video synthesis method, device, terminal and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030160944A1 (en) * 2002-02-28 2003-08-28 Jonathan Foote Method for automatically producing music videos
US20060078305A1 (en) * 2004-10-12 2006-04-13 Manish Arora Method and apparatus to synchronize audio and video
CN101640057A (en) * 2009-05-31 2010-02-03 北京中星微电子有限公司 Audio and video matching method and device therefor
CN104768049A (en) * 2014-01-08 2015-07-08 奥多比公司 Audio and Video Synchronizing Perceptual Model
CN107124624A (en) * 2017-04-21 2017-09-01 腾讯科技(深圳)有限公司 The method and apparatus of video data generation
US9794632B1 (en) * 2016-04-07 2017-10-17 Gopro, Inc. Systems and methods for synchronization based on audio track changes in video editing
CN108419035A (en) * 2018-02-28 2018-08-17 北京小米移动软件有限公司 The synthetic method and device of picture video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200710