CN111901626B - Background audio determining method, video editing method, device and computer equipment

Info

Publication number: CN111901626B
Application number: CN202010775464.5A
Authority: CN (China)
Prior art keywords: audio, target, video, candidate, sequence
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN111901626A
Inventor: 余自强
Current Assignee: Tencent Technology Shenzhen Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority: CN202010775464.5A
Publication of application CN111901626A, followed by grant and publication of CN111901626B

Classifications

    • H: Electricity; H04: Electric communication technique; H04N: Pictorial communication, e.g. television; H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/233: Processing of audio elementary streams (server-side processing of content or additional data)
    • H04N 21/2393: Interfacing the upstream path of the transmission network, involving handling client requests
    • H04N 21/251: Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N 21/439: Processing of audio elementary streams (client-side processing of content or additional data)
    • H04N 21/8456: Structuring of content by decomposing it in the time domain, e.g. into time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a background audio determining method, a video editing method, an apparatus, and computer equipment. The method comprises the following steps: acquiring a content time length sequence corresponding to target content, in which the playing time lengths of the content segments of the target content are arranged in content playing order; acquiring a drum point time interval sequence for each candidate audio, in which the interval lengths between the drum points of the candidate audio are arranged in the order in which the drum points occur in the audio; acquiring a target similarity between the content time length sequence and the drum point time interval sequence of each candidate audio; and determining the background audio corresponding to the target content from the candidate audio set according to the target similarities. The background audio may also be recommended by combining the target similarity with an artificial-intelligence-based recommendation model. With this method, the matching degree of the selected audio can be improved.

Description

Background audio determining method, video editing method, device and computer equipment
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a background audio determining method, a video editing method, an apparatus, and a computer device.
Background
With the development of short video technology, shooting and editing short videos has become common; for example, many users are keen to produce beat-synchronized videos. When making a short video, it is usually necessary to pick suitable background music for the selected video.

In conventional technology, background music can be recommended according to user preference by an artificial-intelligence recommendation model. However, the selected background music often fails to match the video; that is, the matching degree between the background music and the video is low.
Disclosure of Invention
In view of the above, it is necessary to provide a background audio determining method, a video clipping method, an apparatus, and a computer device to address the technical problem of the low matching degree between background music and video.
A background audio determining method, the method comprising: acquiring a content time length sequence corresponding to target content for which background audio is to be determined, where the target content comprises a plurality of content segments and the playing time lengths of the content segments form the content time length sequence in content playing order; acquiring a drum point time interval sequence corresponding to each candidate audio in a candidate audio set, where each candidate audio corresponds to a plurality of drum points and the interval lengths between the drum points form the drum point time interval sequence in the order of the drum points in the candidate audio; acquiring a target similarity between the content time length sequence and the drum point time interval sequence corresponding to the candidate audio; and determining the background audio corresponding to the target content from the candidate audio set according to the target similarities corresponding to the candidate audios.

A background audio determining apparatus, the apparatus comprising: a content time length sequence acquisition module, configured to acquire a content time length sequence corresponding to target content for which background audio is to be determined, where the target content comprises a plurality of content segments and the playing time lengths of the content segments form the content time length sequence in content playing order; a drum point time interval sequence acquisition module, configured to acquire a drum point time interval sequence corresponding to each candidate audio in a candidate audio set, where each candidate audio corresponds to a plurality of drum points and the interval lengths between the drum points form the drum point time interval sequence in the order of the drum points in the candidate audio; a target similarity acquisition module, configured to acquire a target similarity between the content time length sequence and the drum point time interval sequence corresponding to the candidate audio; and a background audio determining module, configured to determine the background audio corresponding to the target content from the candidate audio set according to the target similarities corresponding to the candidate audios.

A computer device comprising a memory and a processor, the memory storing a computer program, where the processor, when executing the computer program, implements the steps of the background audio determining method above.

A computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the background audio determining method above.
With the background audio determining method, apparatus, computer device and storage medium above, the content time length sequence corresponding to the target content is obtained, the drum point time interval sequence corresponding to each candidate audio in the candidate audio set is obtained, the target similarity between the content time length sequence and each drum point time interval sequence is obtained, and the background audio corresponding to the target content is determined from the candidate audio set according to the target similarities. Because the content time length sequence is formed from the playing time lengths of the content segments of the target content in content playing order, it reflects the content playing rhythm of the target content; because each drum point time interval sequence is formed from the interval lengths between drum points in the order of the drum points in the candidate audio, it reflects the music rhythm of that candidate audio. Selecting the background audio according to the target similarity between the two sequences therefore yields background audio whose music rhythm matches the content playing rhythm of the target content, improving the matching degree between the background audio and the video.
A video clipping method, the method comprising: acquiring the time length corresponding to each clipped video segment on a video clip page, and forming a content time length sequence according to the playing order of the clipped video segments in the target video; acquiring the background audio corresponding to the target video, where the background audio is determined from a candidate audio set according to the target similarity between the content time length sequence and the drum point time interval sequence corresponding to each candidate audio; and aligning, on the video clip interface, the starting position of the background audio in the audio track with the starting position of the target video in the video track.

A video clipping apparatus, the apparatus comprising: a content time length sequence forming module, configured to acquire the time length corresponding to each clipped video segment on a video clip page and form a content time length sequence according to the playing order of the clipped video segments in the target video; a background audio acquisition module, configured to acquire the background audio corresponding to the target video, where the background audio is determined from a candidate audio set according to the target similarity between the content time length sequence and the drum point time interval sequence corresponding to each candidate audio; and a position alignment module, configured to align, on the video clip interface, the starting position of the background audio in the audio track with the starting position of the target video in the video track.

A computer device comprising a memory and a processor, the memory storing a computer program, where the processor, when executing the computer program, implements the steps of the video clipping method above.

A computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the video clipping method above.
With the video clipping method, apparatus, computer device and storage medium above, the time length corresponding to each clipped video segment on the video clip page is acquired, a content time length sequence is formed according to the playing order of the clipped video segments in the target video, the background audio corresponding to the target video is acquired, and, on the video clip interface, the starting position of the background audio in the audio track is aligned with the starting position of the target video in the video track. Because the content time length sequence follows the playing order of the clipped video segments in the target video, it reflects the content playing rhythm of the target content; because the drum point time interval sequence reflects the music rhythm of a candidate audio, selecting the background audio according to the target similarity between the two sequences yields background audio whose music rhythm matches the content playing rhythm, improving the matching degree between the background audio and the video. In addition, aligning the starting position of the background audio in the audio track with the starting position of the target video in the video track on the video clip interface aligns the background audio and the target video automatically; the user does not need to align them manually, which saves the time of manual adjustment and improves video clipping efficiency.
Drawings
FIG. 1 is a diagram of an application environment of a background audio determination method in some embodiments;
FIG. 2 is a flow diagram illustrating a method for background audio determination in some embodiments;
FIG. 3A is a flow diagram illustrating a method for determining background audio in some embodiments;
FIG. 3B is a schematic diagram of a distance matrix in some embodiments;
FIG. 4A is a flow diagram illustrating a method for determining background audio in some embodiments;
FIG. 4B is a diagram illustrating matching of a sequence of drum time intervals to a sequence of content time lengths in some embodiments;
FIG. 5A is a flow chart illustrating steps in some embodiments for obtaining a sequence of drum point time intervals;
FIG. 5B is a spectral diagram of a Mel filter bank in some embodiments;
FIG. 5C is a spectrogram in some embodiments;
FIG. 5D is a filtered spectrogram in some embodiments;
FIG. 5E is a schematic of a sequence of amplitude difference values and a sequence of amplitude difference threshold values in some embodiments;
FIG. 5F is a schematic view of a drum point in some embodiments;
FIG. 5G is a schematic diagram of a drum spot sequence obtained in some embodiments;
FIG. 6A is a flow diagram that illustrates a method for video editing in some embodiments;
FIG. 6B is a schematic diagram of a video clip interface in some embodiments;
FIG. 6C is a timing diagram of a video clip in some embodiments;
FIG. 7 is a flow diagram illustrating a method for background audio determination in some embodiments;
FIG. 8 is a block diagram of the background audio determination means in some embodiments;
FIG. 9 is a block diagram of a video clipping device in some embodiments;
FIG. 10 is a diagram of the internal structure of a computer device in some embodiments;
FIG. 11 is a diagram of the internal structure of a computer device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The background audio determining method provided by the present application can be applied in the application environment shown in FIG. 1, where the terminal 102 communicates with the server 104 via a network. The background audio determining method may be applied to the server 104. Specifically, the server 104 may obtain a content time length sequence corresponding to the target content for which background audio is to be determined; the target content may include a plurality of content segments, and the playing time lengths of the content segments may form the content time length sequence in content playing order. The server 104 may obtain a drum point time interval sequence corresponding to each candidate audio in a candidate audio set; each candidate audio may correspond to a plurality of drum points, and the interval lengths between the drum points may form the drum point time interval sequence corresponding to the candidate audio in the order of the drum points in the candidate audio. The server 104 may obtain the target similarity between the content time length sequence and the drum point time interval sequence corresponding to each candidate audio, and determine the background audio corresponding to the target content from the candidate audio set according to the target similarities. The server 104 may transmit to the terminal 102 at least one of the background audio, the parent audio to which the background audio belongs, or the position information of the background audio within that parent audio. The terminal 102 may then perform video clipping according to the transmitted information.
The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device; the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers. It can be understood that the background audio determining method provided by the embodiments of the present application may also be executed on a terminal.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.

Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

With the research and progress of artificial intelligence technology, it has been developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The solution provided by the embodiments of the present application may involve technologies such as machine learning. For example, the server may compute audio features of the candidate audios and content features of the target content, and input the audio features and the content features into a feature similarity model; the feature similarity model may compute the feature similarity between the audio features and the content features, obtaining and outputting the feature similarity between each candidate audio and the target content. The server may then determine the background audio corresponding to the target content from the candidate audio set according to both the feature similarity and the target similarity corresponding to each candidate audio.
In some embodiments, as shown in FIG. 2, a background audio determining method is provided. The method is described below as applied to the server in FIG. 1 and includes the following steps:
s202, acquiring a content time length sequence corresponding to target content of the background audio to be determined; the target content includes a plurality of content segments, and the playback time lengths of the respective content segments form a content time length sequence in the content playback order.
Specifically, background audio refers to music used to set the atmosphere; inserting background audio into a video can enhance the video's emotional expression. The background audio is, for example, a song or a part of a song. The target content may include at least one of pictures or videos; for example, the target content may be a video composed of individual video segments that a user has added to a video clipping tool. A content segment may be a picture or a video. The target content includes a plurality of content segments, and the content segments differ in characteristics: for example, adjacent content segments may differ in at least one of the information they present, their content generation time (e.g., video capture time), their content source, or their content generation location (e.g., capture location). The content segments switch as the target content plays; by obtaining background music that reflects the characteristic differences between content segments, the switching of content segments can follow changes in the music rhythm, so that the rhythm of the background music matches the content playing rhythm. For example, when editing a video, video segments from multiple videos need to be stitched together. A content segment may be an independently shot video segment, or may be obtained by dividing the target content. The playing time length of a content segment is the duration of that segment, and the content playing order is the order in which the content segments play in the target content. For example, suppose the target content sequentially includes content segment A, content segment B, and content segment C, whose playing time lengths are 30 seconds, 35 seconds, and 20 seconds respectively. Arranging these playing time lengths in order gives the content time length sequence corresponding to the target content, namely "30, 35, 20".
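As a minimal illustration (not part of the patent text), the construction of the content time length sequence can be sketched in Python; the segment labels and durations below are the hypothetical values from the example above:

```python
# Hypothetical example: segments listed in content playing order.
segments = [("A", 30.0), ("B", 35.0), ("C", 20.0)]  # (segment id, seconds)

# The content time length sequence is just the durations in that order.
content_time_lengths = [duration for _, duration in segments]
print(content_time_lengths)  # [30.0, 35.0, 20.0]
```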
In some embodiments, the terminal may generate the content time length sequence from the target content and send the server a background audio push request corresponding to the target content; the request may carry the content time length sequence. For example, during video clipping, an audio adding button may be provided on the terminal's video clip interface. The terminal may detect a selection operation on the audio adding button, obtain the time lengths of the video segments on the video clip interface in response to that operation, arrange them according to the playing order of the video segments in the video to obtain the video segment time length sequence, and send a background audio recommendation request to the server. The server may also actively push background audio to the terminal.
In some embodiments, the background audio push request may carry target content, the server may divide the target content to obtain a plurality of content segments, and sort the playing time lengths corresponding to the content segments according to the content playing sequence of the content segments in the target content to obtain a content time length sequence corresponding to the target content. The server can divide the target content according to the plot types included in the target content to obtain content segments corresponding to the plot types respectively.
S204, acquiring the drum point time interval sequence corresponding to each candidate audio in the candidate audio set; each candidate audio corresponds to a plurality of drum points, and the interval lengths between the drum points form the drum point time interval sequence corresponding to the candidate audio according to the order of the drum points in the candidate audio.
Specifically, the candidate audio set may include a plurality of candidate audios. A candidate audio may be a complete audio, or an audio segment obtained by segmenting a complete audio. For example, if the candidate audio comes from a song, the server may segment the song according to the time length of the target content to obtain a plurality of segments whose lengths match the length of the target content, each of which may be used as a candidate audio; the server may also use the segment corresponding to the climax part of the song as a candidate audio. When slicing, the candidate audios may overlap one another. The time lengths of the candidate audios in the candidate audio set may be the same or different. The length of a candidate audio may be matched with the playing time length of the target content, where "matched" means that the difference between the two is within a preset difference threshold. The preset difference threshold may be set as desired, for example, 2 seconds.
A drum point of a candidate audio is a rhythm point in the candidate audio: a point where the sound is prominent, which can be regarded as a point where the sound suddenly increases. For example, an audio frame whose amplitude satisfies a condition may be taken as the audio frame where a drum point is located. The amplitude condition may be at least one of the amplitude being greater than an amplitude threshold, or the amplitude change value being greater than a change threshold. The amplitude change value is the difference between the amplitude of the current audio frame and the amplitude of a forward audio frame within a preset distance; a forward audio frame is an audio frame preceding the current audio frame, and the preset distance may be expressed as a time distance or as a number of audio frames. For example, the difference between the amplitudes of the current audio frame and the adjacent forward audio frame may be calculated as the amplitude change value, and when the amplitude change value is greater than the change threshold, the current audio frame is the audio frame where a drum point is located. The amplitude threshold and the change threshold may be set as needed; for example, the amplitude threshold may be a fixed value, or may be the average amplitude of the audio frames in the candidate audio. The time at which a drum point is located can be represented by the time of the audio frame at which the amplitude suddenly rises.
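The amplitude-change test described above can be sketched as follows; this is a simplified illustration under assumed inputs (a list of per-frame amplitudes), not the patent's exact implementation:

```python
# Simplified sketch of the amplitude-change test described above: a frame
# is marked as a drum point when its amplitude exceeds the amplitude of
# the adjacent forward (previous) frame by more than a change threshold.
def detect_drum_frames(frame_amplitudes, change_threshold):
    drum_frame_indices = []
    for i in range(1, len(frame_amplitudes)):
        amplitude_change = frame_amplitudes[i] - frame_amplitudes[i - 1]
        if amplitude_change > change_threshold:
            drum_frame_indices.append(i)  # frame index where a drum point sits
    return drum_frame_indices
```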
An audio frame is obtained by framing the candidate audio; framing means dividing the candidate audio into a number of short sections, each section being one frame. The order of the drum points in the candidate audio is the playing order of the audio frames corresponding to the drum points. The interval length between drum points is the length of time between adjacent drum points. For example, suppose a candidate audio includes 3 drum points: the first at the 6th second of the audio, the second at the 7th second, and the third at the 9th second. The intervals, measured from the start of the audio, are then 6, 1, and 2 in order, i.e., the drum point time interval sequence is "6, 1, 2".
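A minimal sketch of this interval computation, using the hypothetical drum point times from the example above (note that the first interval is measured from the start of the audio to the first drum point):

```python
drum_point_times = [6.0, 7.0, 9.0]  # seconds at which the drum points occur

intervals = []
previous = 0.0  # the first interval starts at the beginning of the audio
for t in drum_point_times:
    intervals.append(t - previous)
    previous = t
print(intervals)  # [6.0, 1.0, 2.0]
```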
In some embodiments, multiple audios may be stored on the server. The drum points corresponding to the candidate audios in the candidate audio set may be detected in advance, or detected in response to a background music recommendation request. The server may store the detected drum point information in a database; the drum point information includes the time of each drum point within the corresponding audio. The server can compute the drum points of each stored audio to obtain the drum point sequence corresponding to that audio, and store the audio and its drum point sequence in association. A drum point sequence is the drum points sorted in their order within the audio. The server may select at least one audio (referred to as a parent audio) from the audio library and segment it into a plurality of candidate audios, obtaining the candidate audio set corresponding to that parent audio. The server may take the drum points corresponding to a candidate audio from the drum point sequence of the parent audio and derive the candidate audio's drum point time interval sequence from them; of course, the server may instead compute the drum points of the candidate audio directly and derive the interval sequence from the computed drum points. When acquiring parent audios, the server can use all the audios in the music library as parent audios, or select audios based on an artificial-intelligence recommendation model according to the user's interests, so that the final background audio both suits the user's preference and matches the rhythm of the clipped video.
In some embodiments, the server may establish correspondences between the stored audios and scene types; the same audio may correspond to multiple scene types, and different audios may correspond to the same scene type. The scene type may include at least one of a landscape class, a human class, or an animal class. The server can obtain a parent audio by selecting from the stored audios whose scene types match the scene type of the target content.

In some embodiments, the server may record the key portion of each stored audio, and multiple candidate audios may be chosen from the key portion of the parent audio, where the key portion may be the climax portion of the audio.
In some embodiments, the server may frame the candidate audio according to a preset window size to obtain a plurality of audio frames. For example, with a preset window size of 1024 sampling points and a sampling frequency of 44.1 kHz (kilohertz), about 43 non-overlapping frames are obtained per second of audio.
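A sketch of this non-overlapping framing step; the window size and sampling rate are those of the example above, and overlap handling, if any, is not specified by the text:

```python
# Split a sample array into non-overlapping frames of 1024 samples each.
def frame_signal(samples, window_size=1024):
    n_frames = len(samples) // window_size
    return [samples[i * window_size:(i + 1) * window_size]
            for i in range(n_frames)]

# At 44.1 kHz, one second of audio yields about 43 such frames.
print(44100 // 1024)  # 43
```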
In some embodiments, the candidate audio may be pre-processed prior to framing the candidate audio. The pre-processing may include at least one of denoising, dynamic range compression, or pre-emphasis. The performance of the audio may be enhanced by pre-processing the candidate audio.
S206, acquiring the target similarity between the content time length sequence and the drum point time interval sequence corresponding to the candidate audio.
Specifically, the target similarity refers to the similarity between the content time length sequence and the drum point time interval sequence corresponding to a candidate audio. The server may calculate, for each candidate audio, the target similarity between the content time length sequence and that candidate audio's drum point time interval sequence, and take it as the target similarity corresponding to that candidate audio.
In some embodiments, the server may calculate the distances between the drum point time intervals in the drum point time interval sequence and the content time lengths in the content time length sequence, and determine the target similarity based on the calculated distances. The server may calculate the distance between a drum point time interval and a content time length through a distance formula, for example the Euclidean distance. Specifically, the server may form a distance matrix from these distances and determine the target similarity through a path between the start matrix point and the end matrix point of the distance matrix.
In some embodiments, the server may transform the content time length sequence multiple times until it becomes the drum point time interval sequence of the candidate audio, and obtain the target similarity according to the number of transformations. The number of transformations is negatively correlated with the target similarity: the more transformations needed, the smaller the target similarity; the fewer transformations, the larger the target similarity.
In some embodiments, to improve the selection efficiency of the background audio, the server may obtain target similarities only for some candidate audios in the candidate audio set. Specifically, the server may count the drum point time intervals in each candidate audio's interval sequence, and calculate the difference between that count and the number of content time lengths in the content time length sequence, obtaining the count difference corresponding to each candidate audio. Candidate audios whose count difference is smaller than a preset count difference threshold are selected from the candidate audio set as target audios, and the target similarity between the content time length sequence and each target audio's drum point time interval sequence is obtained. Of course, the server may also obtain the target similarities for all candidate audios in the candidate audio set.
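This pre-filter can be sketched as below; the candidate data layout and the threshold value are hypothetical:

```python
# Keep only candidates whose number of drum point time intervals is close
# to the number of content time lengths, before any similarity is computed.
def prefilter(candidates, content_lengths, max_count_diff=2):
    n = len(content_lengths)
    return [c for c in candidates
            if abs(len(c["intervals"]) - n) < max_count_diff]
```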
S208, determining background audio corresponding to the target content from the candidate audio set according to the target similarity corresponding to the candidate audio.
Specifically, the target similarity corresponding to a candidate audio is the similarity between that candidate audio's drum point time interval sequence and the content time length sequence. At least one background audio is determined for the target content; there may be, for example, 10. The server may take, from the candidate audio set, the candidate audios whose target similarity satisfies a similarity condition as the background audios corresponding to the target content. The similarity condition may be at least one of the target similarity being greater than a similarity threshold, or the target similarity being ranked before a preset rank. The similarity threshold may be set as desired, for example 0.8; the preset rank is, for example, 5. After determining the background audios corresponding to the target content, the server may push information about them to the terminal so that the user can play them. When receiving the user's selection instruction for a background audio, the terminal may combine the selected background audio with the target content, synthesizing target content that includes the selected background audio, so that the background audio plays while the target content plays.

It can be understood that the background audio determined by the server is not necessarily the audio finally synthesized with the target content. For example, if 10 background audios are determined for the target content, push information for those 10 audios, such as play links and titles, can be pushed to the terminal, and the user selects one of them to be synthesized with the target content. Of course, the server may also obtain a single background audio and synthesize it with the target content itself.
In some embodiments, the background audio may be selected directly according to the target similarity, or determined from the candidate audio set by combining the target similarity with other information. The target similarity is positively correlated with the probability of a candidate audio being selected as background audio: the greater the target similarity, the greater the probability of selection. For example, a target recommendation degree can be derived from the target similarity. The server may obtain the target heat corresponding to each candidate audio, obtain the target recommendation degree of each candidate audio according to its target similarity and target heat, and select the background audio corresponding to the target content from the candidate audios according to their target recommendation degrees. The heat of an audio reflects its popularity. The server can take the candidate audios whose target recommendation degree satisfies a recommendation condition as the background audios corresponding to the target content and push them to the terminal. The recommendation condition may be at least one of the target recommendation degree being greater than a recommendation threshold, or the target recommendation degree being ranked before a preset recommendation rank. The recommendation threshold may be set as desired, for example 90%; the preset recommendation rank may be, for example, 6. For example, the target recommendation degree may be calculated by formula (1): s = d + w · p (1), where s denotes the target recommendation degree, d denotes the target similarity, w denotes the weight corresponding to the target heat (which can be set as needed), and p denotes the target heat. The target heat can be determined according to the degree of attention the candidate audio receives.
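A worked sketch of formula (1); the similarity, heat, and weight values below are hypothetical and only illustrate the ranking step:

```python
# Formula (1): s = d + w * p, where d is the target similarity,
# p is the target heat, and w is the weight of the heat term.
def target_recommendation(d, p, w=0.5):
    return d + w * p

# Hypothetical candidates as (target similarity d, target heat p).
candidates = {"audio_1": (0.82, 0.40), "audio_2": (0.75, 0.90)}
ranked = sorted(candidates,
                key=lambda name: target_recommendation(*candidates[name]),
                reverse=True)
print(ranked)  # ['audio_2', 'audio_1'] with w = 0.5
```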
In some embodiments, the server may obtain the scene types applicable to each candidate audio in the candidate audio set; a scene type applicable to a candidate audio is a scene type corresponding to that audio. The server can determine the target recommendation degree according to the target similarity, the target heat, and the scene type. Specifically, the server may obtain the scene type corresponding to the target content, determine the scene similarity between a candidate audio's scene type and the target content's scene type, and determine the target recommendation degree according to the target similarity, the target heat, and the scene similarity. The scene similarity may be positively correlated with the target recommendation degree.
In some embodiments, the background audio may be an audio segment, and the server may obtain the position information of the background audio within its parent audio, where the parent audio is the audio from which the background audio originates, i.e., the background audio is part of the parent audio. The server may return the parent audio and the position information of the background audio within it to the terminal; of course, the server may also return the background audio directly.
With the background audio determining method above, the content time length sequence corresponding to the target content is obtained, a drum point time interval sequence is obtained for each candidate audio in the candidate audio set, the target similarity between the content time length sequence and each drum point time interval sequence is obtained, and the background audio corresponding to the target content is determined from the candidate audio set according to the target similarities. Because the content time length sequence is formed from the playing time lengths of the content segments in content playing order, it reflects the content playing rhythm of the target content; because each drum point time interval sequence is formed from the interval lengths between drum points in the order of the drum points in the candidate audio, it reflects the music rhythm of that candidate audio. Selecting the background audio according to the target similarity between the two sequences therefore yields background audio whose music rhythm matches the content playing rhythm of the target content, improving the matching degree between the background audio and the video.
At present, more and more users make short videos, an internet content dissemination format in which the content generally plays for under one minute on new internet media. When making a short video, suitable background audio needs to be selected for the video or pictures. Typically, the user manually auditions candidate background audios one by one and tries to capture the rhythm of each by ear. However, listening to background audios one by one takes a long time, the rhythm captured by ear is inaccurate, and the resulting background audio matches poorly with the material (videos or pictures) selected for the short video. With the background audio determining method above, background audio that matches the selected material well can be acquired automatically and quickly, video picture switching can be made to coincide with the music beat, and both the efficiency and the accuracy of selecting a video's background audio are improved.
In some embodiments, as shown in FIG. 3A, the step S206 of obtaining the target similarity between the content time length sequence and the drum point time interval sequence corresponding to the candidate audio includes:
s302, obtaining the distance between each drum point time interval in the drum point time interval sequence and each content time length in the content time length sequence, and obtaining a target distance matrix formed by the distances.
Specifically, the size of the target distance matrix may be M × N or N × M, where M is the number of drum point time intervals in the drum point time interval sequence and N is the number of content time lengths in the content time length sequence. When the size is M × N, the matrix value at row i, column j represents the distance between the i-th drum point time interval and the j-th content time length; when the size is N × M, the matrix value at row j, column i represents the distance between the j-th content time length and the i-th drum point time interval, where 1 ≤ i ≤ M and 1 ≤ j ≤ N. The smaller the distance between a drum point time interval and a content time length, the greater the similarity; the larger the distance, the smaller the similarity. The distance between a drum point time interval and a content time length may be a Euclidean distance. For example, the distance between the i-th drum point time interval Q_i and the j-th content time length P_j may be d(Q_i, P_j) = (Q_i - P_j)².
In some embodiments, the server may obtain an overall distance matrix formed from the drum point time interval sequence of the parent audio corresponding to the candidate audio and the content time length sequence, and take from it the target distance matrix formed by the candidate audio's drum point time interval sequence and the content time length sequence. For example, suppose the parent audio's drum point time interval sequence, in seconds, is Q = "0.1, 2, 5, 4, 6, 2, 2.4, 3, 5, 2, 1.7, 3.0, 1.4, 0.7, 1.2, 1.6, 1.0, 1.0, 1.1, 0.9, 0.9, 1.3, 2.9, 0.9, 1.3, 4, 2, 1, 3", and the content time length sequence, in seconds, is P = "3.3, 1.2, 0.9, 1.1, 1.5, 1.0, 1.0, 1.1, 0.9, 0.9, 1.3". FIG. 3B shows the overall distance matrix obtained from Q and P; its size is 29 × 11. In FIG. 3B, the target distance matrix formed by P and the drum point time interval sequence C of candidate audio A is the matrix inside rectangular frame a, where C is "3.0, 1.4, 0.7, 1.2, 1.6, 1.0, 1.0, 1.1, 0.9, 0.9, 1.3". It is understood that the parent audio may be split into other audio segments, which can also serve as candidate audios. For example, another candidate audio obtained by splitting may have the drum point time interval sequence "0.1, 2, 5, 4, 6, 2, 2.4, 3, 5", and the similarity between that sequence and P can then be calculated as well.
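As a sketch, the target distance matrix for candidate audio A can be built from C and P with the squared distance given above, d(Q_i, P_j) = (Q_i - P_j)²:

```python
# Build the M x N distance matrix of squared differences between each
# drum point time interval and each content time length.
def distance_matrix(intervals, lengths):
    return [[(q - p) ** 2 for p in lengths] for q in intervals]

C = [3.0, 1.4, 0.7, 1.2, 1.6, 1.0, 1.0, 1.1, 0.9, 0.9, 1.3]  # candidate audio A
P = [3.3, 1.2, 0.9, 1.1, 1.5, 1.0, 1.0, 1.1, 0.9, 0.9, 1.3]  # content lengths
D = distance_matrix(C, P)  # 11 x 11 target distance matrix
```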
S304, obtaining the shortest path from the start matrix point to the end matrix point of the target distance matrix.
Specifically, the start matrix point is the position of the minimum row and minimum column of the target distance matrix, and the end matrix point is the position of the maximum row and maximum column. In FIG. 3B, the start matrix point of the target distance matrix formed by C and P is at row 12, column 1 of the overall distance matrix, and the end matrix point is at row 22, column 11. A matrix point may be written as (row coordinate, column coordinate); for example, matrix point (1, 2) denotes the matrix point at row 1, column 2.
In some embodiments, there may be multiple paths from the start matrix point to the end matrix point of the target distance matrix, each path including both the start and end matrix points. The server may calculate, for each path, the sum of the matrix values of the matrix points it contains as that path's sum, and determine the shortest path from the start matrix point to the end matrix point according to the path sums. Note that because the drum point time interval sequence and the content time length sequence are both ordered in time, when a path contains matrix point (i, j), the next matrix point in the path may only be one of (i+1, j), (i, j+1), or (i+1, j+1).
S306, obtaining the target similarity according to the distance of the shortest path.
Specifically, the distance of the shortest path may be the sum of the distance values of the matrix points on the shortest path. The target similarity is negatively correlated with the distance of the shortest path: the smaller the distance, the greater the similarity, and the larger the distance, the smaller the similarity. The server may calculate the target similarity from the distance of the shortest path; for example, the reciprocal of the distance of the shortest path may be used as the target similarity.
In this embodiment, because the entries of the target distance matrix are the distances between the drum point time intervals and the content time lengths, the distance along the shortest path accurately reflects the similarity between the content time length sequence and the drum point time interval sequence, so the target similarity obtained from the shortest-path distance is highly accurate.
In some embodiments, as shown in FIG. 4A, the step S304 of obtaining the shortest path from the start matrix point to the end matrix point of the target distance matrix includes:
and S402, taking the termination matrix point as the current matrix point.
S404, the forward matrix point with the minimum distance value in the forward matrix points corresponding to the current matrix point is obtained and used as the target path point corresponding to the shortest path, and the target path point is used as the updated current matrix point.
The forward matrix points corresponding to the current matrix point are the matrix points whose row coordinate is 1 less than that of the current matrix point, whose column coordinate is 1 less, or both. As shown in fig. 3B, the terminating matrix point of the target distance matrix formed by the drum point time interval sequence C corresponding to the candidate audio and P is (22,11); taking (22,11) as the current matrix point, the forward matrix points corresponding to (22,11) are the three positions (22,10), (21,11), and (21,10). The target path point refers to the forward matrix point with the minimum distance value among the forward matrix points corresponding to the current matrix point. As shown in fig. 3B, of the matrix values corresponding to (22,10), (21,11), and (21,10), which include 0.16 and 0, the matrix value 0 corresponding to (21,10) is the smallest, so (21,10) can be set as the target path point. One path point corresponds to one matrix point.
Specifically, after the server obtains a target path point from the current matrix point, it may take that target path point as the updated current matrix point, and obtain the forward matrix point with the smallest distance value among the forward matrix points corresponding to the updated current matrix point as the next target path point corresponding to the shortest path.
S406, judging whether the current matrix point is the initial matrix point of the target distance matrix.
If not, the process returns to step S404. If so, the process proceeds to step S408.
S408, taking the path formed by each target path point as the shortest path from the starting matrix point to the ending matrix point of the target distance matrix.
Specifically, the shortest path is the path composed of the target path points, i.e., the shortest path includes each target path point. As shown in fig. 3B, the gray area is the shortest path from the start matrix point to the end matrix point of the target distance matrix formed by the drum point time interval sequence C corresponding to the candidate audio and P. The server can calculate the sum of the distance values corresponding to the matrix points on the shortest path, and obtain the target similarity according to this sum. Fig. 4B shows the matching relationship between the content time length sequence P and the drum point time interval sequence C corresponding to the candidate audio.
In this embodiment, the path formed by the target path points is used as the shortest path from the starting matrix point to the ending matrix point of the target distance matrix. Because each target path point is the forward matrix point with the smallest distance value among the forward matrix points corresponding to the current matrix point, every path point on the shortest path is guaranteed to be the selectable path point with the smallest distance value, which improves the accuracy of the shortest path.
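A minimal Python sketch of steps S402-S408 follows, operating on a target distance matrix such as `target` from the earlier sketch. It implements the greedy backward walk described here, always stepping to the cheapest forward matrix point, and ends with the reciprocal-based similarity of S306; the small epsilon guarding against a zero distance is an added assumption.

```python
def shortest_path(dist):
    """Walk from the terminating matrix point back to the start matrix
    point, at each step choosing the forward matrix point with the
    minimum distance value, per steps S402-S408. dist: 2-D NumPy array."""
    i, j = dist.shape[0] - 1, dist.shape[1] - 1   # terminating matrix point
    total, path = dist[i, j], [(i, j)]
    while (i, j) != (0, 0):
        forward = []                               # forward matrix points
        if i > 0:
            forward.append((i - 1, j))
        if j > 0:
            forward.append((i, j - 1))
        if i > 0 and j > 0:
            forward.append((i - 1, j - 1))
        i, j = min(forward, key=lambda rc: dist[rc])   # target path point
        total += dist[i, j]
        path.append((i, j))
    return total, path[::-1]

# distance, path = shortest_path(target)   # `target` from the sketch above
# similarity = 1.0 / (distance + 1e-9)     # S306: reciprocal of the distance
```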
In some embodiments, the step of obtaining candidate audio in the set of candidate audio comprises: acquiring the content playing time length corresponding to the target content; and acquiring audio to be divided, and dividing the audio to be divided according to the content playing time length to obtain candidate audio in the candidate audio set, wherein the time length of the candidate audio is matched with the content playing time length.
Specifically, the content play time length refers to the play time length of the target content. The audio to be divided, i.e., the parent audio described above, may be multiple in number. The matching of the time length of the candidate audio and the content playing time means that the time length difference value between the time length of the candidate audio and the content playing time length is within a difference threshold value, for example, within 1 second, for example, the time length of the candidate audio is the same as the content playing time length.
In some embodiments, the server may select a plurality of candidate audios whose time length is the content playing time length from the audio to be divided according to a preset translation interval. For example, if the time length of the audio to be divided is 100 seconds, the content playing time length is 10 seconds, and the preset translation interval is 5 seconds, the server may select the first 10 seconds of the audio to be divided as one candidate audio, and then select a further 10 seconds starting from 10 + 5 = 15 seconds, that is, the content from 15 seconds to 25 seconds of the audio to be divided, as another candidate audio, thereby obtaining a plurality of candidate audios of 10 seconds in length.
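A sketch of this division; following the example in this paragraph, each new candidate starts one translation interval after the previous candidate ends (variable names are illustrative).

```python
def candidate_segments(total_len, window_len, interval):
    """Cut candidates of length window_len (the content playing time
    length) out of audio to be divided of length total_len; the next
    candidate starts `interval` seconds after the previous one ends."""
    segments, start = [], 0.0
    while start + window_len <= total_len:
        segments.append((start, start + window_len))
        start += window_len + interval
    return segments

# 100 s audio to be divided, 10 s content, 5 s translation interval:
print(candidate_segments(100, 10, 5))
# [(0.0, 10.0), (15.0, 25.0), (30.0, 40.0), (45.0, 55.0),
#  (60.0, 70.0), (75.0, 85.0), (90.0, 100.0)]
```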
In this embodiment, the audio to be divided is divided according to the content playing time length corresponding to the target content to obtain candidate audio in the candidate audio set, so that the time length of the candidate audio is consistent with the content playing time length corresponding to the target content, thereby improving the matching degree of the candidate audio and the target content in time length.
In some embodiments, the method further comprises: and acquiring the position information of the background audio in the corresponding audio to be divided. And pushing the audio to be divided corresponding to the background audio and the position information to a terminal corresponding to the target content.
Specifically, the audio to be divided corresponding to the background audio refers to the audio to be divided to which the background audio belongs, and the position information of the background audio in the corresponding audio to be divided may include at least one of a start time point or an end time point of the background audio in the audio to be divided.
In some embodiments, the server may intercept the background audio from the audio to be divided corresponding to the background audio, and push the background audio to the terminal corresponding to the target content.
In this embodiment, the audio to be divided corresponding to the background audio and the position information are pushed to the terminal corresponding to the target content, which makes it convenient for the user to determine the background audio corresponding to the target content and can improve the efficiency with which the user determines the final background audio. For example, the background audio may be part of a song. By sending the parent audio corresponding to the background audio to the terminal, the user can determine the position of the background audio in the parent audio, and can also, according to preference and with reference to the position information of the background audio, intercept from the parent audio the audio to be synthesized with the target content. For example, assuming the position of the background audio in the parent audio is 6 seconds to 20 seconds, the user may adjust it to 5.9 seconds to 19.8 seconds after listening to the parent audio.
In some embodiments, as shown in fig. 5A, the step of obtaining drum points corresponding to candidate audio includes:
S502, obtaining an audio frame sequence corresponding to the candidate audio.
In particular, the sequence of audio frames may comprise at least one audio frame. The server may frame the candidate audio to obtain a plurality of audio frames, and sort the plurality of audio frames obtained by the framing according to a playing sequence of the audio frames in the candidate audio to obtain an audio frame sequence corresponding to the candidate audio.
S504, amplitude difference values between frequency spectrums of adjacent audio frames in the audio frame sequence are obtained, and an amplitude difference value sequence is obtained.
Specifically, the frequency spectrum is a frequency domain representation of the audio, including at least one frequency and a frequency-corresponding amplitude. Since the waveform of the audio in the time domain changes rapidly and is not easy to observe, it can be observed in the frequency domain, and the frequency spectrum of the audio may change slowly with time. Wherein, the time domain is used for describing the relation of the audio signal to the time, and the change of the signal along with the time can be expressed by the time domain waveform of the audio signal. The adjacent audio frames refer to audio frames adjacent in the playback order in the candidate audio. The server may calculate a difference between amplitudes of the same frequency in the frequency spectrums of the adjacent audio frames, and add the difference between the amplitudes of the respective same frequencies as an amplitude difference value.
In some embodiments, the amplitude difference value sequence may include amplitude difference values corresponding to respective audio frames. The server may calculate an amplitude difference value between the frequency spectrum of the current audio frame and the frequency spectrum of the forward audio frame of the current audio frame as the amplitude difference value of the current audio frame. Therefore, the frequency spectrum of the audio frame is compressed into one dimension, and convenience is provided for quickly positioning the drum point. The current audio frame may be any one of the audio frame sequences, so that the server may calculate the amplitude difference value corresponding to each audio frame. The server may sort the amplitude difference values corresponding to the audio frames according to the playing order of the audio frames in the candidate audio, so as to obtain an amplitude difference value sequence. The calculation formula of the amplitude difference value can be expressed as formula (2):
SF(k) = Σ_i (s(k, i) − s(k−1, i))        (2)
where SF(k) represents the amplitude difference value corresponding to the k-th audio frame, s(k, i) represents the amplitude corresponding to the i-th frequency in the frequency spectrum of the k-th audio frame, s(k−1, i) represents the amplitude corresponding to the i-th frequency in the frequency spectrum of the (k−1)-th audio frame, and the sum is taken over all frequencies i.
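Formula (2) translates directly into code; the sketch below assumes the frequency spectrums of the audio frames have been stacked into a magnitude spectrogram with one row per audio frame (the function name is illustrative).

```python
import numpy as np

def amplitude_difference_sequence(spec):
    """spec: magnitude spectrogram of shape (num_frames, num_freqs).
    SF(k) sums s(k, i) - s(k-1, i) over all frequencies i."""
    sf = np.diff(spec, axis=0).sum(axis=1)
    return np.concatenate(([0.0], sf))   # frame 0 has no forward frame
```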
In some embodiments, the server may perform difference calculation on the frequency spectrums corresponding to the audio frames in the audio frame sequence to obtain the spectral flux corresponding to each audio frame, and use the obtained spectral fluxes as the amplitude difference values of the audio frames to obtain the amplitude difference value sequence. The difference may include at least one of a first-order difference or a second-order difference. The spectral flux of an audio frame characterizes the magnitude of the change of its frequency spectrum relative to the frequency spectrum of the forward audio frame, and the difference calculation computes the difference between the frequency spectrum of an audio frame and that of its forward audio frame. For drum point positioning, only the positive spectral flux in the candidate audio needs attention, so the influence of negative spectral flux on drum point positioning can be avoided through the second-order difference, yielding accurate drum points.
In some embodiments, the server may perform time-frequency transformation on an audio frame to obtain the frequency spectrum corresponding to the audio frame. Time-frequency transformation converts the time-domain representation of the audio frame into a frequency-domain representation, and the frequency-domain representation obtained by the conversion is the frequency spectrum of the audio frame. Specifically, the time-frequency transformation may be implemented by a Fourier transform, which may include at least one of the Fast Fourier Transform (FFT) or the short-time Fourier transform (STFT); this is not particularly limited here.
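As a self-contained sketch of such a time-frequency transformation, the following uses NumPy's real FFT; the Hann window and the 1024-sample window with 512-sample hop are illustrative assumptions, not values fixed by this embodiment.

```python
import numpy as np

def magnitude_spectrogram(signal, win_size=1024, hop=512):
    """Frame the time-domain signal, window each frame, and take the
    magnitude of its FFT, yielding one frequency spectrum per audio frame."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(signal) - win_size) // hop
    frames = np.stack([signal[k * hop : k * hop + win_size] * window
                       for k in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # (num_frames, num_freqs)
```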
In some embodiments, the server may filter the frequency spectrums of the audio frames through a preset filter to obtain filtered frequency spectrums corresponding to the audio frames, and calculate the amplitude difference values using the filtered frequency spectrums to obtain the amplitude difference value sequence. Specifically, the preset filter may be a mel filter bank, and the server may filter the frequency spectrum of an audio frame through the mel filter bank to obtain the filtered frequency spectrum (referred to as a mel frequency spectrum) corresponding to the audio frame. The mel filter bank comprises a plurality of triangular filters. The frequency range audible to the human ear is 20 Hz to 20000 Hz, but the human ear does not perceive sound linearly on the hertz scale. For example, after the human ear has adapted to a pitch of 1000 Hz, raising the pitch to 2000 Hz is not perceived as a doubling of the frequency. On the mel scale, human perception of pitch is linear: if the mel frequencies of two audio segments differ by a factor of two, the pitches heard by the human ear also differ by a factor of two. The mel scale is a non-linear frequency scale determined based on the human ear's sensory judgment of equidistant pitch changes, and the relationship between the mel frequency and the hertz frequency is: F_mel = 1125 ln(1 + f/700), where F_mel is the mel frequency and f is the hertz frequency. When the frequency is small, F_mel changes quickly with f; when the frequency is large, F_mel rises very slowly with f. Therefore, filtering the frequency spectrum of an audio frame with the mel filters yields a mel frequency spectrum that conforms to human auditory characteristics. Fig. 5B shows a spectral diagram of a mel filter bank. In fig. 5B, H1(k)~H6(k) denote 6 triangular filters, f(0) denotes the minimum frequency of the mel filter bank, f(7) denotes the maximum frequency, and f(1)~f(6) correspond to the center frequencies of H1(k)~H6(k). As can be seen from fig. 5B, the triangular filters of the mel filter bank are dense and tall at low frequencies and sparse and short at high frequencies, which matches the objective rule that the ear becomes less sensitive as the frequency rises. Filtering with a mel filter bank also reduces the dimensionality in the frequency domain; for example, if the mel filter bank includes 24 triangular filters, the frequency spectrum is reduced to 24 dimensions. Of course, the preset filter may be another type of filter, which is not particularly limited here.
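The quoted mel/hertz relationship can be checked numerically; the inverse mapping below is an added convenience (in practice it is used to place the center frequencies of the triangular filters) and not part of this embodiment's text.

```python
import numpy as np

def hz_to_mel(f):
    """F_mel = 1125 * ln(1 + f / 700), as given in the text."""
    return 1125.0 * np.log(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (np.exp(np.asarray(m, dtype=float) / 1125.0) - 1.0)

# Doubling the hertz frequency from 1000 Hz to 2000 Hz is far from a
# doubling on the mel scale: hz_to_mel(1000) ~ 998, hz_to_mel(2000) ~ 1519.
```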
In some embodiments, in order to improve the filtering efficiency, before performing filtering, the server may splice the frequency spectrums corresponding to the respective audio frames along the time domain, that is, splice the frequency spectrums according to the playing sequence of the respective audio frames in the candidate audio, so as to obtain the spectrogram corresponding to the candidate audio. As shown in fig. 5C, a spectrogram is shown. The horizontal axis of the spectrogram represents time and the vertical axis represents frequency. The server may filter the spectrogram through a preset filter to obtain a filtered spectrogram, where the filtered spectrogram includes filtered spectrums corresponding to each audio frame. As shown in fig. 5D, a filtered spectrogram is shown.
S506, an amplitude difference value greater than the amplitude difference threshold is obtained from the amplitude difference value sequence and is used as the target amplitude difference value.
In particular, the amplitude difference threshold may be set as desired. Each audio frame corresponds to an amplitude difference threshold. The amplitude difference threshold may be the same for different audio frames. The server may compare the amplitude difference value corresponding to each audio frame in the amplitude difference value sequence with the corresponding amplitude difference threshold, and when the comparison result is that the amplitude difference value of the audio frame is greater than the corresponding amplitude difference threshold, take the amplitude difference value of the audio frame as the target amplitude difference value.
In some embodiments, the server may arrange the amplitude difference thresholds corresponding to the audio frames according to the playing sequence of the audio frames in the candidate audio to obtain an amplitude difference threshold sequence, and compare the amplitude difference values in the amplitude difference value sequence with the amplitude difference thresholds at the same positions in the threshold sequence to obtain the target amplitude difference values. That is, peak detection is performed on the amplitude difference value sequence using the amplitude difference thresholds, and the obtained peak values are used as the target amplitude difference values. Fig. 5E illustrates a sequence of amplitude difference thresholds and a sequence of amplitude difference values in some embodiments. Fig. 5F shows the obtained target amplitude difference values and the audio frames corresponding to them.
In some embodiments, the server may calculate a statistical value of amplitude difference values corresponding to a plurality of audio frames, respectively, to obtain an amplitude difference threshold value corresponding to each of the plurality of audio frames, respectively. The statistical value is, for example, an average value.
And S508, taking the audio frame corresponding to the target amplitude difference value as a drum point corresponding to the candidate audio.
Specifically, there may be a plurality of target amplitude difference values, and thus there may be a plurality of drum points corresponding to the candidate audio. The server may use the audio frames corresponding to the target amplitude difference values as the drum points corresponding to the candidate audio. After the server obtains the drum points, the drum point time interval sequence can be determined according to the times of the drum points in the candidate audio. Adjacent drum points refer to drum points that are adjacent in the drum point sequence; for example, the drum point time interval sequence Q of 30 drum points is Q = [0.1, 2, 5, 4, 6, 2, 2.4, 3, 5, 2, 1.7, 3.0, 1.4, 0.7, 1.2, 1.6, 1.0, 1.0, 1.1, 0.9, 0.9, 1.3, 2.9, 0.9, 1.3, 4, 2, 1, 3] in seconds.
In some embodiments, the server may calculate the difference between the time information of two adjacent drum points in the drum point sequence corresponding to the candidate audio as the time interval between the adjacent drum points. For example, the server may calculate the difference between the start time points, in the candidate audio, of the audio frames corresponding to two adjacent drum points as the time interval between those drum points. Fig. 5G is a schematic diagram of the drum point sequence obtained in some embodiments. The initial signal in fig. 5G is an audio signal, the identification function refers to the sequence of amplitude difference values, and the drum point sequence refers to the sequence of drum points corresponding to the initial signal. In fig. 5G, the initial signal is first preprocessed, a spectrogram corresponding to the preprocessed signal is then obtained, differencing is performed on the spectrogram to obtain the identification function, and peak detection is performed on the identification function to obtain the drum point sequence.
In this embodiment, the time interval between adjacent drum points is calculated according to the position information corresponding to each drum point, and a drum point time interval sequence corresponding to the candidate audio is obtained, where a drum point is an audio frame corresponding to the target amplitude difference value, and the target amplitude difference value is an amplitude difference value greater than an amplitude difference threshold value in the amplitude difference value sequence, so that the target amplitude difference value can accurately locate an audio frame with an increased amplitude, that is, a drum point can be accurately located, the accuracy of the drum point is improved, and the accuracy of the drum point time interval sequence is improved.
In some embodiments, the step of obtaining an amplitude difference threshold comprises: calculating a difference average value corresponding to the amplitude difference value in the amplitude difference value sequence; and obtaining an amplitude difference threshold value according to the difference average value.
The difference average value refers to the average of a plurality of amplitude difference values. The server may divide the candidate audio to obtain a plurality of audio intervals, and calculate the average of the amplitude difference values corresponding to the audio frames in an audio interval as the difference average value corresponding to that audio interval. The time length corresponding to an audio interval can be set as required, for example 0.5 second, and the time lengths of the audio intervals may be the same or different, which is not particularly limited here. For example, when the sampling rate during framing is 44100 Hz and the window size is 1024, that is, the duration of each audio frame is about 43 ms, then if the time length corresponding to the audio interval is 0.5 second, the number of audio frames in the audio interval is 0.5/0.043 ≈ 11, and the server may calculate the average of the amplitude difference values corresponding to the 11 audio frames in the audio interval as the difference average value corresponding to that interval.
In some embodiments, the server may multiply the difference average by a predetermined constant to obtain the amplitude difference threshold. For example, the server may multiply the average difference value corresponding to the audio interval by a preset constant to obtain an amplitude difference threshold corresponding to each audio frame in the audio interval. The preset constant can be set as required, for example, 1.2.
In some embodiments, the server may obtain an amplitude difference threshold corresponding to the audio interval according to the difference average value corresponding to the audio interval, and use the amplitude difference threshold corresponding to the audio interval as an amplitude difference threshold corresponding to each audio frame in the audio interval.
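Combining this threshold construction with the peak detection of S506 gives the following sketch. The 43 ms frame duration, 0.5 s audio interval and preset constant 1.2 are the example figures from this section; requiring a local maximum in addition to the threshold comparison is an added assumption approximating the peak detection. `sf` is assumed to be a NumPy array.

```python
import numpy as np

def drum_point_intervals(sf, frame_dur=0.043, interval_len=0.5, c=1.2):
    """sf: amplitude difference value sequence, one value per audio frame.
    Returns the drum point frame indices and the drum point time
    interval sequence."""
    n = max(1, int(round(interval_len / frame_dur)))   # frames per interval
    thresholds = np.empty(len(sf))
    for start in range(0, len(sf), n):
        # Threshold = difference average of the audio interval * constant.
        thresholds[start:start + n] = sf[start:start + n].mean() * c
    above = sf > thresholds
    # Added assumption: a drum point is also a local maximum of sf.
    local_max = np.r_[False, (sf[1:-1] > sf[:-2]) & (sf[1:-1] >= sf[2:]), False]
    drums = np.nonzero(above & local_max)[0]
    return drums, np.diff(drums * frame_dur)   # intervals between drum points
```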
In this embodiment, the amplitude difference threshold is obtained according to the difference average value by calculating the difference average value corresponding to the amplitude difference value in the amplitude difference value sequence, so that the accuracy of the amplitude difference threshold is improved, and the amplitude difference threshold can be flexibly changed according to the candidate audio.
In some embodiments, the target content is a target video, and the step of obtaining the content time length sequence corresponding to the target content includes: performing scene recognition on the video frames of the target video to obtain the target scene type corresponding to each video frame; segmenting the target video according to the target scene types corresponding to the video frames to obtain the video segments corresponding to the target video; and forming the content time length sequence corresponding to the target video from the video segment durations corresponding to the respective video segments according to the playing sequence of the video segments in the target video.
In particular, the target video may include a plurality of video images. One video frame corresponds to one video image. The video frames may correspond to scene types. The scene type may include at least one of a landscape class, a human class, or an animal class. The target scene type refers to a scene type corresponding to the video frame. The server can identify scenes corresponding to all video frames in the target video respectively to obtain scene types corresponding to all the video frames respectively.
In some embodiments, the server may determine segmentation positions by examining the relationship between the scene types of two adjacent video frames. For example, the adjacent position between two adjacent video frames having different scene types may be taken as a segmentation position. The server can then segment the target video according to the segmentation positions to obtain the video segments corresponding to the target video.
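A minimal sketch of this segmentation, assuming a constant frame rate and one scene label per video frame (both illustrative assumptions not fixed by the embodiment):

```python
def content_time_length_sequence(scene_labels, fps):
    """Cut wherever adjacent video frames have different scene types and
    return the video segment durations (seconds) in playing order."""
    if not scene_labels:
        return []
    durations, run = [], 1
    for prev, cur in zip(scene_labels, scene_labels[1:]):
        if cur == prev:
            run += 1
        else:
            durations.append(run / fps)
            run = 1
    durations.append(run / fps)
    return durations

labels = ["landscape"] * 50 + ["person"] * 75 + ["animal"] * 25
print(content_time_length_sequence(labels, 25))   # [2.0, 3.0, 1.0]
```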
In this embodiment, the video segment durations corresponding to the respective video segments form the content time length sequence corresponding to the target content according to the playing sequence of the video segments in the target video, and the video segments are obtained by segmenting the target video according to the target scene types corresponding to the video frames. The segmented video segments therefore each have a corresponding scene type, and the content time lengths in the content time length sequence are the time lengths of video segments of multiple scene types, so the content time length sequence accurately reflects the characteristics of the target video, and the music tempo corresponding to the background audio matches the scene switching of the target video.
In some embodiments, the target content may be a clipped video. As shown in fig. 6A, a video clipping method is provided, which is described by taking the method as an example applied to the terminal in fig. 1, and includes the following steps:
S602, acquiring the time length corresponding to each clip video segment in the video clip page, and forming a content time length sequence according to the playing sequence of the clip video segments in the target video.
Specifically, a video clip page refers to a page of a video clip tool used to make a video clip; background music can be added to a video through the video clip tool. The time length corresponding to a clip video segment refers to the playing time length of that video segment. The target video refers to the video formed by splicing the plurality of clip video segments. The playing sequence of the clip video segments in the target video can be determined according to the sequence in which the clip video segments are added to the video clip page; for example, a clip video segment added to the video clip page earlier is played earlier in the target video than one added later. As shown in fig. 6B, a video clip page is shown; the video clip page includes 5 clip video segments, clip video segment 1 to clip video segment 5, and a video switching manner may be set between adjacent clip video segments. The video formed by clip video segment 1 to clip video segment 5 is the target video.
S604, obtaining a background audio corresponding to the target video; the background audio is determined from the candidate audio set according to the content time length sequence and the target similarity of the drum point time interval sequence corresponding to the candidate audio.
Specifically, the terminal may obtain the background audio corresponding to the target video from the server. The terminal may send a background audio recommendation request corresponding to the target video to the server, and the request may carry the content time length sequence corresponding to the target video. The server may rank the audios in the audio list according to the target similarity between each candidate audio and the target video to obtain a ranked audio list, where the greater the similarity between a candidate audio and the target video, the earlier the corresponding audio is ranked. The server may use the audios ranked before a preset position in the ranked audio list, for example the top 5 audios, as the background audios matched with the rhythm of the target content, and send push information of these background audios to the terminal.
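The ranking amounts to a descending sort with a cut-off; in the sketch below, the (audio_id, similarity) pair format and the top-5 cut-off are illustrative assumptions.

```python
def ranked_push_list(candidates, top_k=5):
    """candidates: iterable of (audio_id, target_similarity) pairs.
    Greater similarity ranks earlier; the audios before the preset
    position are pushed as background audio."""
    ordered = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [audio_id for audio_id, _ in ordered[:top_k]]
```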
The steps of how to obtain the background audio according to the content time length sequence and the target similarity of the drum point time interval sequence corresponding to the candidate audio may refer to the description of the background audio determining method, and are not described herein again.
In some embodiments, in order to improve the flexibility of background audio selection, the server may send related information corresponding to the background audios to the terminal, the terminal may display the related information, such as play addresses and titles, on the video clip interface, and the user may select the background audio corresponding to the target video from the pushed audios. As shown in fig. 6B, the area of the video clip page in rectangular frame B is a recommended audio presentation area, which presents the related information of the plurality of background audios pushed to the terminal in list form. The terminal can switch the background audios displayed according to sliding operations (including upward and downward sliding operations) on the recommended audio presentation area, which makes it convenient for the user to select an audio. The server can also send the position information of a background audio in its corresponding parent audio to the terminal, and the terminal can display this position information, which makes it convenient for the user to determine the background audio of the target video and improves the selection efficiency. As shown in fig. 6B, the video clip page may further show a collection button, a download button, and a use button: the corresponding audio may be collected through a selection operation on the collection button, downloaded through a selection operation on the download button, and added to the audio track through a selection operation on the use button. As shown in fig. 6C, a timing chart of obtaining the background audio corresponding to the target video includes the following steps: 1. the user adds clip video segments to the video clip interface of the terminal; 2. the terminal sends a background audio recommendation request to the server; 3. the server acquires the audio list; 4. drum point detection is performed; 5. the server returns the sorted audio list; 6. the user selects the desired background audio. The drum point detection comprises framing, Fourier transformation, mel filter filtering, differencing, averaging, peak detection and drum point sequence generation. The audio list is a list composed of a plurality of audios stored in the server, and the sorted audio list refers to the audio list after the arrangement order of the audios has been adjusted.
And S606, aligning the starting position of the background audio in the audio track with the starting position of the target video in the video track on the video clip interface.
Specifically, the terminal may add background audio to an audio track of the video clip interface and align the start position of the background audio with the start position of the target video in the video track.
In the video clipping method, the time lengths corresponding to the clip video segments in the video clip page are acquired, and a content time length sequence is formed according to the playing sequence of the clip video segments in the target video. The background audio corresponding to the target video is acquired, and on the video clip interface the start position of the background audio in the audio track is aligned with the start position of the target video in the video track. Because the content time length sequence is formed according to the playing sequence of the clip video segments in the target video, it can reflect the content playing rhythm of the target content, and the drum point time interval sequence can reflect the music rhythm of the candidate audio. Selecting the background audio according to the target similarity between the content time length sequence and the drum point time interval sequence corresponding to the candidate audio therefore selects a background audio whose music rhythm matches the content playing rhythm of the target content, improving the matching degree of the background audio and the video. In addition, on the video clipping interface, the start position of the background audio in the audio track is aligned with the start position of the target video in the video track, so the background audio and the target video are aligned automatically; the user does not need to align them through manual adjustment, which saves the time consumed by manual adjustment and improves the video clipping efficiency.
In some embodiments, the candidate audios are divided from the corresponding parent audio according to the video playing time length of the target video, and the step S606 of aligning the starting position of the background audio in the audio track with the starting position of the target video in the video track on the video clip interface includes: and acquiring the position information of the background audio in the corresponding parent audio. And displaying the parent audio on the audio track according to the position information on the video clip interface, wherein the starting position of the background audio on the audio track is aligned with the starting position of the target video on the video track.
Specifically, the position information of the background audio in the corresponding parent audio may include at least one of a start time point or an end time point of the background audio in the corresponding parent audio.
In some embodiments, the terminal may obtain a starting time point of the background audio in the corresponding parent audio, where the starting time point is, for example, 3 seconds, and intercept the parent audio with the starting time point as an intercept point to obtain the background audio. And aligning the starting position of the background audio on the audio track with the intercepting point as the starting point with the starting position of the target video on the video track, thereby realizing the alignment of the starting position of the background audio on the audio track with the starting position of the target video on the video track.
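On the terminal side, the interception reduces to sample arithmetic; a sketch assuming the parent audio is available as an array of samples (names are illustrative):

```python
def intercept_background(parent_samples, sample_rate, start_s, video_len_s):
    """Cut the background audio out of the parent audio at the pushed
    start time point, for the duration of the target video."""
    lo = int(round(start_s * sample_rate))
    hi = lo + int(round(video_len_s * sample_rate))
    return parent_samples[lo:hi]

# E.g. start time point 3 s in the parent audio, 10 s target video:
# clip = intercept_background(parent, 44100, 3.0, 10.0)
```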
In this embodiment, the start position of the background audio in the audio track is aligned with the start position of the video in the video track, the background audio can be automatically added to the audio track, and the background audio is automatically aligned with the target video, so the user does not need to align the background audio with the target video through manual adjustment, which saves the time consumed by manual adjustment and improves the video clipping efficiency. In addition, the user can, with reference to the position information of the background audio and according to preference, move the audio segment of the parent audio that has been added to the audio track, and intercept the audio to be synthesized with the target content, which improves the flexibility of audio selection.
In some embodiments, as shown in fig. 7, there is provided a background audio determining method including:
S702, performing scene recognition on video frames of a target video to obtain a target scene type corresponding to each video frame, segmenting the target video according to the target scene types corresponding to the video frames to obtain video segments corresponding to the target video, and forming a content time length sequence corresponding to the target video from the video segment durations corresponding to the video segments according to the playing sequence of the video segments in the target video;

S704, acquiring the content playing time length corresponding to the target video, acquiring audio to be divided, and dividing the audio to be divided according to the content playing time length to obtain candidate audio in a candidate audio set, wherein the time length of the candidate audio matches the content playing time length;

S706, obtaining an audio frame sequence corresponding to the candidate audio, obtaining amplitude difference values between frequency spectrums of adjacent audio frames in the audio frame sequence to obtain an amplitude difference value sequence, calculating the difference average value corresponding to the amplitude difference values in the amplitude difference value sequence, obtaining an amplitude difference threshold according to the difference average value, obtaining the amplitude difference values larger than the amplitude difference threshold from the amplitude difference value sequence as target amplitude difference values, taking the audio frames corresponding to the target amplitude difference values as drum points corresponding to the candidate audio, and calculating the time intervals between adjacent drum points according to the position information corresponding to each drum point to obtain a drum point time interval sequence corresponding to the candidate audio;

S708, acquiring the drum point time interval sequences corresponding to the candidate audios in the candidate audio set; each candidate audio corresponds to a plurality of drum points, and the interval lengths between the drum points form the drum point time interval sequence corresponding to the candidate audio according to the sequence of the drum points in the candidate audio;

S710, obtaining the distance between each drum point time interval in the drum point time interval sequence and each content time length in the content time length sequence to obtain a target distance matrix formed by the distances;

S712, taking the termination matrix point as the current matrix point, and acquiring the forward matrix point with the minimum distance value among the forward matrix points corresponding to the current matrix point as a target path point corresponding to the shortest path;

S714, taking the target path point as the updated current matrix point, and returning to the step of acquiring the forward matrix point with the minimum distance value among the forward matrix points corresponding to the current matrix point as the target path point corresponding to the shortest path, until the initial matrix point of the target distance matrix is reached;

S716, taking the path formed by the target path points as the shortest path from the starting matrix point to the ending matrix point of the target distance matrix;

S718, obtaining the target similarity according to the distance of the shortest path;

S720, determining the background audio corresponding to the target video from the candidate audio set according to the target similarity corresponding to the candidate audio;

S722, acquiring the position information of the background audio in the corresponding audio to be divided, and pushing the audio to be divided corresponding to the background audio and the position information to the terminal corresponding to the target video.
It should be understood that although the various steps in the flow charts of fig. 2-7 are shown in order as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict order limitation on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-7 may include multiple sub-steps or stages; these are not necessarily completed at the same moment, but may be executed at different times, and they are not necessarily executed in sequence, but may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In some embodiments, as shown in fig. 8, there is provided a background audio determining apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: a content time length sequence acquisition module 802, a drumbeat time interval sequence acquisition module 804, a target similarity acquisition module 806, and a background audio determination module 808, wherein:
a content time length sequence obtaining module 802, configured to obtain a content time length sequence corresponding to a target content of a background audio to be determined; the target content includes a plurality of content segments, and the playback time lengths of the respective content segments form a content time length sequence in the content playback order.
A drumbeat time interval sequence obtaining module 804, configured to obtain a drumbeat time interval sequence corresponding to each candidate audio in the candidate audio set; the candidate audio frequency corresponds to a plurality of drum points, and the interval length between the drum points forms a drum point time interval sequence corresponding to the candidate audio frequency according to the sequence of the drum points in the candidate audio frequency.
And a target similarity obtaining module 806, configured to obtain a target similarity between the content time length sequence and the drumbeat time interval sequence corresponding to the candidate audio.
And the background audio determining module 808 is configured to determine a background audio corresponding to the target content from the candidate audio set according to the target similarity corresponding to the candidate audio.
In some embodiments, the target similarity obtaining module 806 includes:
and the target distance matrix obtaining unit is used for obtaining the distance between each drum point time interval in the drum point time interval sequence and each content time length in the content time length sequence to obtain a target distance matrix formed by the distances.
And the shortest path acquiring unit is used for acquiring the shortest path from the starting matrix point to the ending matrix point of the target distance matrix.
And the target similarity obtaining unit is used for obtaining the target similarity according to the distance of the shortest path.
In some embodiments, the shortest path obtaining unit is further configured to: taking the termination matrix point as a current matrix point, and acquiring a forward matrix point with the minimum distance value from forward matrix points corresponding to the current matrix point as a target path point corresponding to the shortest path; taking the target path point as an updated current matrix point, returning to the step of acquiring the forward matrix point with the minimum distance value from the forward matrix points corresponding to the current matrix point as the target path point corresponding to the shortest path until the initial matrix point of the target distance matrix is reached; and taking the path formed by each target path point as the shortest path from the starting matrix point to the ending matrix point of the target distance matrix.
In some embodiments, the background audio determination apparatus further comprises a candidate audio acquisition module, the candidate audio acquisition module comprising:
and the target time length obtaining unit is used for obtaining the content playing time length corresponding to the target content.
And the candidate audio obtaining unit is used for obtaining the audio to be divided, dividing the audio to be divided according to the content playing time length to obtain candidate audio in the candidate audio set, wherein the time length of the candidate audio is matched with the content playing time length.
In some embodiments, the background audio determining means further comprises:
and the position information acquisition module is used for acquiring the position information of the background audio in the corresponding audio to be divided.
And the pushing module is used for pushing the audio to be divided corresponding to the background audio and the position information to a terminal corresponding to the target content.
In some embodiments, the background audio determining apparatus further comprises a drumhead obtaining module, which includes:
and the audio frame sequence acquisition unit is used for acquiring the audio frame sequence corresponding to the candidate audio.
And the amplitude difference value sequence obtaining unit is used for obtaining amplitude difference values between frequency spectrums of adjacent audio frames in the audio frame sequence to obtain an amplitude difference value sequence.
And the target amplitude difference value obtaining unit is used for obtaining an amplitude difference value which is larger than the amplitude difference threshold value from the amplitude difference value sequence and is used as the target amplitude difference value.
And the drum point determining unit is used for taking the audio frame corresponding to the target amplitude difference value as the drum point corresponding to the candidate audio.
In some embodiments, the background audio determining apparatus further comprises an amplitude difference threshold obtaining module, the amplitude difference threshold obtaining module comprising:
and the difference average value calculating unit is used for calculating the difference average value corresponding to the amplitude difference value in the amplitude difference value sequence.
And the amplitude difference threshold value obtaining unit is used for obtaining an amplitude difference threshold value according to the difference average value.
In some embodiments, the target content is a target video, and the background audio determining apparatus further includes a content time length sequence obtaining module, where the content time length sequence obtaining module includes:
and the target scene obtaining unit is used for carrying out scene recognition on the video frames of the target video to obtain the target scene type corresponding to each video frame.
And the video clip obtaining unit is used for segmenting the target video according to the target scene type corresponding to the video frame to obtain the video clip corresponding to the target video.
And the content time length sequence forming unit is used for forming the content time length sequence corresponding to the target video according to the playing sequence of the corresponding video clips in the target video by using the video clip duration corresponding to each video clip.
For the specific definition of the background audio determining apparatus, reference may be made to the above definition of the background audio determining method, which is not described herein again. The various modules in the background audio determination apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In some embodiments, as shown in fig. 9, there is provided a video clipping apparatus, which may be a part of a computer device using software modules or hardware modules, or a combination of both, the apparatus specifically comprising: a content time length sequence forming module 902, a background audio acquisition module 904, and a position alignment module 906, wherein:
a content time length sequence forming module 902, configured to obtain time lengths corresponding to the clip video segments in the video clip page, and form a content time length sequence according to a playing sequence of the clip video segments in the target video.
A background audio acquiring module 904, configured to acquire a background audio corresponding to the target video; the background audio is determined from the candidate audio set according to the content time length sequence and the target similarity of the drum point time interval sequence corresponding to the candidate audio.
A position alignment module 906 for aligning a start position of the background audio in the audio track with a start position of the target video in the video track on the video clip interface.
In some embodiments, the candidate audio is divided from the corresponding parent audio according to the video playing time length of the target video, and the position alignment module 906 includes:
and the position information acquisition unit is used for acquiring the position information of the background audio in the corresponding parent audio.
And the position alignment unit is used for displaying the parent audio on the audio track according to the position information on the video clip interface, wherein the starting position of the background audio on the audio track is aligned with the starting position of the target video on the video track.
In some embodiments, the video clipping device further comprises a background audio selection module, the background audio selection module comprising:
the drum point time interval sequence acquisition unit is used for acquiring drum point time interval sequences corresponding to all candidate audios in the candidate audio set; the candidate audio frequency corresponds to a plurality of drum points, and the interval length between the drum points forms a drum point time interval sequence corresponding to the candidate audio frequency according to the sequence of the drum points in the candidate audio frequency.
And the target similarity acquiring unit is used for acquiring the target similarity between the content time length sequence and the drum point time interval sequence corresponding to the candidate audio.
And the background audio determining unit is used for determining the background audio corresponding to the target video from the candidate audio set according to the target similarity corresponding to the candidate audio.
For specific limitations of the video clipping apparatus, reference may be made to the limitations of the video clipping method above, and further description is omitted here. The various modules in the video clipping device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store data such as a sequence of content time lengths, a set of candidate audios and a sequence of drum beat time intervals. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a background audio determination method.
In some embodiments, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 11. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a video clipping method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the configurations shown in fig. 10 and 11 are merely block diagrams of portions of configurations related to aspects of the present application, and do not constitute limitations on the computing devices to which aspects of the present application may be applied, as a particular computing device may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
In some embodiments, there is further provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above method embodiments when executing the computer program.
In some embodiments, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (24)

1. A method for background audio determination, the method comprising:
acquiring a content time length sequence corresponding to a target video for which background audio is to be determined; the target video comprises a plurality of video segments, and the video playing time lengths of the video segments form the content time length sequence according to the video playing sequence;
acquiring drum point time interval sequences corresponding to the candidate audios in the candidate audio set; each candidate audio corresponds to a plurality of drum points, and the interval lengths between the drum points form the drum point time interval sequence corresponding to the candidate audio according to the order of the drum points in the candidate audio; the candidate audios in the candidate audio set are obtained by division from their corresponding parent audios according to the video playing time length of the target video;
acquiring target similarity between the content time length sequence and the drum point time interval sequence corresponding to the candidate audio;
obtaining a target recommendation degree corresponding to the candidate audio according to a target similarity corresponding to the candidate audio and a target heat degree corresponding to the candidate audio, wherein the target recommendation degree is positively correlated with both the target heat degree and the target similarity, and determining a background audio corresponding to the target video from the candidate audio set based on the target recommendation degree;
and sending the parent audio of the background audio corresponding to the target video, and the position information of the background audio within the corresponding parent audio, to a terminal corresponding to the target video, so that the terminal displays the position information and the corresponding parent audio, or so that the terminal displays the parent audio on an audio track and, according to the position information, aligns the starting position of the background audio in the parent audio on the audio track with the starting position of the target video on the video track.
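As a non-limiting illustration of the scoring step recited in claim 1, the sketch below combines similarity and heat multiplicatively; the multiplicative form and the candidate representation are assumptions, since the claim only requires that the recommendation degree be positively correlated with both terms.

```python
# Illustrative sketch only; not part of the claims. The multiplicative
# combination is an assumption -- any monotonically increasing function of
# similarity and heat satisfies the positive correlation the claim recites.

def recommendation_degree(similarity: float, heat: float) -> float:
    """Score that grows with both the target similarity and the target heat."""
    return similarity * heat

def pick_background_audio(candidates):
    """candidates: iterable of (audio_id, similarity, heat) triples."""
    return max(candidates, key=lambda c: recommendation_degree(c[1], c[2]))

# Example: a moderately similar but popular clip can beat a very similar
# but obscure one.
best = pick_background_audio([("a", 0.9, 0.2), ("b", 0.7, 0.8), ("c", 0.4, 0.9)])
# best == ("b", 0.7, 0.8)
```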
2. The method of claim 1, wherein obtaining the target similarity between the content time length sequence and the drum point time interval sequence corresponding to the candidate audio comprises:
acquiring the distance between each drum point time interval in the drum point time interval sequence and each content time length in the content time length sequence to obtain a target distance matrix formed by the distances;
obtaining a shortest path from a starting matrix point to an ending matrix point of the target distance matrix;
and obtaining the target similarity according to the distance of the shortest path.
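A minimal sketch of the distance matrix recited in claim 2 follows; it is illustrative only, and the absolute-difference metric is an assumption, since the claim does not fix how the distance between a drum point time interval and a content time length is measured.

```python
# Illustrative sketch only; not part of the claims. Absolute difference is an
# assumed distance between a drum-point time interval and a content time length.
import numpy as np

def target_distance_matrix(drum_intervals, content_lengths):
    """Element (i, j) holds the distance between drum point time interval i
    and content time length j."""
    d = np.asarray(drum_intervals, dtype=float)[:, None]
    c = np.asarray(content_lengths, dtype=float)[None, :]
    return np.abs(d - c)
```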
3. The method of claim 2, wherein obtaining the shortest path from a starting matrix point to an ending matrix point of the target distance matrix comprises:
taking the termination matrix point as a current matrix point, and acquiring, among the forward matrix points corresponding to the current matrix point, the forward matrix point with the minimum distance value as a target path point corresponding to the shortest path;
taking the target path point as the updated current matrix point, and returning to the step of acquiring, among the forward matrix points corresponding to the current matrix point, the forward matrix point with the minimum distance value as the target path point corresponding to the shortest path, until the starting matrix point of the target distance matrix is reached;
and taking the path formed by each target path point as the shortest path from the starting matrix point to the ending matrix point of the target distance matrix.
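The greedy backtracking that claim 3 recites might look like the sketch below, which walks from the termination matrix point back to the starting matrix point; treating the three DTW-style predecessors (up, left, diagonal) as the "forward matrix points", and converting the accumulated path distance to a similarity via 1/(1+distance), are both assumptions.

```python
# Illustrative sketch only; not part of the claims. The set of "forward matrix
# points" (up, left, diagonal) and the distance-to-similarity conversion are
# assumptions. `matrix` is the target distance matrix built above.

def shortest_path_similarity(matrix):
    i, j = matrix.shape[0] - 1, matrix.shape[1] - 1   # termination matrix point
    total = matrix[i, j]
    path = [(i, j)]
    while (i, j) != (0, 0):
        forward = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        forward = [(r, c) for r, c in forward if r >= 0 and c >= 0]
        i, j = min(forward, key=lambda rc: matrix[rc])  # greedy minimum step
        total += matrix[i, j]
        path.append((i, j))
    return 1.0 / (1.0 + total), path[::-1]             # higher = more similar
```

Note that this follows the claim's greedy step selection rather than a global dynamic-programming optimum; the two coincide only for well-behaved matrices.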
4. The method of claim 1, wherein the step of obtaining the candidate audio in the candidate audio set comprises:
acquiring the video playing time length corresponding to the target video;
and acquiring audio to be divided, and dividing the audio to be divided according to the video playing time length to obtain candidate audio in a candidate audio set, wherein the time length of the candidate audio is matched with the video playing time length.
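Claim 4's division step can be pictured as sliding a window of the video's length across the parent audio; the one-second hop between candidate start positions in this sketch is an assumption.

```python
# Illustrative sketch only; not part of the claims. The hop size between
# candidate windows is an assumption.

def divide_parent_audio(parent_duration, video_duration, hop=1.0):
    """Yield (start, end) segments of the parent audio whose time length
    matches the video playing time length."""
    start = 0.0
    while start + video_duration <= parent_duration:
        yield (start, start + video_duration)
        start += hop

# A 200 s song and a 30 s video yield candidates starting at 0 s, 1 s, ...
candidates = list(divide_parent_audio(200.0, 30.0))
```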
5. The method of claim 4, further comprising:
acquiring position information of the background audio in the corresponding audio to be divided;
and pushing the audio to be divided corresponding to the background audio and the position information to a terminal corresponding to the target video.
6. The method of claim 1, wherein obtaining the drum points corresponding to the candidate audios comprises:
acquiring an audio frame sequence corresponding to the candidate audio;
obtaining amplitude difference values between the frequency spectra of adjacent audio frames in the audio frame sequence, to obtain an amplitude difference value sequence;
acquiring an amplitude difference value larger than an amplitude difference threshold value from the amplitude difference value sequence as a target amplitude difference value;
and taking the audio frame corresponding to the target amplitude difference value as a drum point corresponding to the candidate audio.
7. The method of claim 6, wherein the step of obtaining the amplitude difference threshold comprises:
calculating a difference average value corresponding to the amplitude difference value in the amplitude difference value sequence;
and obtaining the amplitude difference threshold value according to the difference average value.
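Claims 6 and 7 together describe a spectral-flux style onset detector: frames whose spectrum jumps by more than a mean-derived threshold are taken as drum points. The sketch below is illustrative only; the frame length, the half-wave rectification, and the "mean times a factor" threshold rule are assumptions.

```python
# Illustrative sketch only; not part of the claims. Frame length, half-wave
# rectification, and the threshold factor are assumptions.
import numpy as np

def drum_points(samples, frame_len=1024, factor=1.5):
    frames = [samples[k:k + frame_len]
              for k in range(0, len(samples) - frame_len + 1, frame_len)]
    spectra = [np.abs(np.fft.rfft(f)) for f in frames]
    # claim 6: amplitude differences between spectra of adjacent frames
    flux = [np.sum(np.maximum(b - a, 0.0)) for a, b in zip(spectra, spectra[1:])]
    # claim 7: threshold derived from the average amplitude difference
    threshold = factor * np.mean(flux)
    # frames whose difference exceeds the threshold are drum points
    return [idx + 1 for idx, v in enumerate(flux) if v > threshold]
```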
8. The method of claim 1, wherein the step of obtaining the time length sequence of the content corresponding to the target video comprises:
carrying out scene recognition on the video frames of the target video to obtain target scene types corresponding to the video frames;
segmenting the target video according to the target scene type corresponding to the video frame to obtain a video segment corresponding to the target video;
and forming a content time length sequence corresponding to the target video according to the playing sequence of the corresponding video clips in the target video by using the video clip duration corresponding to each video clip.
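Claim 8's segmentation can be illustrated by grouping consecutive frames that share a scene type and converting each group's frame count into a duration; the per-frame scene labels (produced by some scene classifier, in playing order) and the fixed frame rate here are assumptions.

```python
# Illustrative sketch only; not part of the claims. One scene label per video
# frame and a constant frame rate are assumptions.
from itertools import groupby

def content_time_lengths(frame_scene_labels, fps=25.0):
    """Split the video wherever the scene type changes and return each
    resulting segment's duration, in playing order."""
    return [len(list(group)) / fps for _, group in groupby(frame_scene_labels)]

# Example: 4 street frames then 2 indoor frames at 2 fps -> [2.0, 1.0].
durations = content_time_lengths(["street"] * 4 + ["indoor"] * 2, fps=2.0)
```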
9. A method of video clipping, the method comprising:
acquiring the time length corresponding to each clipped video segment in a video clip page, and forming a content time length sequence according to the playing sequence of the clipped video segments in a target video;
acquiring a background audio corresponding to the target video; the background audio is determined from a candidate audio set according to a target recommendation degree corresponding to a candidate audio, the target recommendation degree is positively correlated with a target similarity and a target heat degree corresponding to the candidate audio, and the target similarity refers to the similarity between the content time length sequence and a drum point time interval sequence corresponding to the candidate audio; the candidate audio is obtained by division from the corresponding parent audio according to the video playing time length of the target video;
receiving parent audio of background audio corresponding to the target video and position information of the background audio in the parent audio, wherein the parent audio is sent by a server;
and displaying the position information and the corresponding parent audio on the video clip interface, or displaying the parent audio on an audio track, and aligning the starting position of the background audio in the parent audio on the audio track with the starting position of the target video on the video track according to the position information.
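If the position information is taken to be the background audio's start offset within the parent audio (an assumption; the claim does not fix the representation), the alignment in claim 9 reduces to simple arithmetic on track timestamps:

```python
# Illustrative sketch only; not part of the claims. Position information is
# assumed to be the background audio's start offset inside the parent audio.

def parent_audio_track_start(video_track_start, bg_offset_in_parent):
    """Place the parent audio on the audio track so that its background-audio
    portion starts exactly where the target video starts on the video track."""
    return video_track_start - bg_offset_in_parent

# Example: the video starts at t=0 s and the background audio begins 12.5 s
# into the parent song, so the parent audio is anchored at t=-12.5 s (the
# editor would trim or hide the part before t=0).
start = parent_audio_track_start(0.0, 12.5)
```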
10. The method of claim 9, wherein the obtaining the background audio corresponding to the target video comprises:
sending a background music recommendation request corresponding to the target video to a server;
and receiving the background music returned by the server in response to the background music recommendation request to obtain the background audio corresponding to the target video.
11. The method of claim 9, wherein the step of obtaining the background audio corresponding to the target video comprises:
acquiring drum point time interval sequences corresponding to the candidate audios in the candidate audio set; each candidate audio corresponds to a plurality of drum points, and the interval lengths between the drum points form the drum point time interval sequence corresponding to the candidate audio according to the order of the drum points in the candidate audio;
acquiring target similarity between the content time length sequence and the drum point time interval sequence corresponding to the candidate audio;
and determining the background audio corresponding to the target video from the candidate audio set according to the target similarity corresponding to the candidate audio.
12. An apparatus for background audio determination, the apparatus comprising:
the content time length sequence acquisition module is used for acquiring a content time length sequence corresponding to a target video of the background audio to be determined; the target video comprises a plurality of video segments, and the video playing time length of each video segment forms the content time length sequence according to the video playing sequence;
the drum point time interval sequence acquisition module is used for acquiring drum point time interval sequences corresponding to the candidate audios in the candidate audio set; each candidate audio corresponds to a plurality of drum points, and the interval lengths between the drum points form the drum point time interval sequence corresponding to the candidate audio according to the order of the drum points in the candidate audio; the candidate audio is obtained by division from the corresponding parent audio according to the video playing time length of the target video;
a target similarity obtaining module, configured to obtain a target similarity between the content time length sequence and the drumbeat time interval sequence corresponding to the candidate audio;
a background audio determining module, configured to obtain a target recommendation degree corresponding to the candidate audio according to a target similarity corresponding to the candidate audio and a target heat degree corresponding to the candidate audio, where the target recommendation degree has a positive correlation with the target heat degree and the target similarity, and determine a background audio corresponding to the target video from the candidate audio set based on the target recommendation degree;
and sending the parent audio of the background audio corresponding to the target video, and the position information of the background audio within the corresponding parent audio, to a terminal corresponding to the target video, so that the terminal displays the position information and the corresponding parent audio, or so that the terminal displays the parent audio on an audio track and, according to the position information, aligns the starting position of the background audio in the parent audio on the audio track with the starting position of the target video on the video track.
13. The apparatus of claim 12, wherein the target similarity obtaining module comprises:
a target distance matrix obtaining unit, configured to obtain distances between each drum point time interval in the drum point time interval sequence and each content time length in the content time length sequence, and obtain a target distance matrix formed by the distances;
a shortest path obtaining unit, configured to obtain a shortest path from a start matrix point to an end matrix point of the target distance matrix;
and the target similarity obtaining unit is used for obtaining the target similarity according to the distance of the shortest path.
14. The apparatus of claim 13, wherein the shortest path obtaining unit is further configured to:
taking the termination matrix point as a current matrix point, and acquiring, among the forward matrix points corresponding to the current matrix point, the forward matrix point with the minimum distance value as a target path point corresponding to the shortest path;
taking the target path point as the updated current matrix point, and returning to the step of acquiring, among the forward matrix points corresponding to the current matrix point, the forward matrix point with the minimum distance value as the target path point corresponding to the shortest path, until the starting matrix point of the target distance matrix is reached;
and taking the path formed by each target path point as the shortest path from the starting matrix point to the ending matrix point of the target distance matrix.
15. The apparatus of claim 12, further comprising a candidate audio acquisition module comprising:
a target time length obtaining unit, configured to obtain a video playing time length corresponding to the target video;
and the candidate audio obtaining unit is used for obtaining the audio to be divided, dividing the audio to be divided according to the video playing time length to obtain candidate audio in a candidate audio set, wherein the time length of the candidate audio is matched with the video playing time length.
16. The apparatus of claim 15, further comprising:
the position information acquisition module is used for acquiring the position information of the background audio in the corresponding audio to be divided;
and the pushing module is used for pushing the audio to be divided corresponding to the background audio and the position information to a terminal corresponding to the target video.
17. The apparatus of claim 12, further comprising a drum point obtaining module, the drum point obtaining module comprising:
an audio frame sequence obtaining unit, configured to obtain an audio frame sequence corresponding to the candidate audio;
an amplitude difference value sequence obtaining unit, configured to obtain an amplitude difference value between frequency spectrums of adjacent audio frames in the audio frame sequence, so as to obtain an amplitude difference value sequence;
a target amplitude difference value obtaining unit, configured to obtain an amplitude difference value greater than an amplitude difference threshold from the amplitude difference value sequence, as a target amplitude difference value;
and the drum point determining unit is used for taking the audio frame corresponding to the target amplitude difference value as the drum point corresponding to the candidate audio.
18. The apparatus of claim 17, further comprising an amplitude difference threshold acquisition module, the amplitude difference threshold acquisition module comprising:
the difference average value calculating unit is used for calculating a difference average value corresponding to the amplitude difference value in the amplitude difference value sequence;
and the amplitude difference threshold value obtaining unit is used for obtaining the amplitude difference threshold value according to the difference average value.
19. The apparatus of claim 12, further comprising a content time length sequence obtaining module, wherein the content time length sequence obtaining module comprises:
the target scene obtaining unit is used for carrying out scene recognition on the video frames of the target video to obtain the target scene type corresponding to each video frame;
a video clip obtaining unit, configured to segment the target video according to a target scene type corresponding to the video frame, so as to obtain a video clip corresponding to the target video;
and the content time length sequence forming unit is used for forming the content time length sequence corresponding to the target video according to the playing sequence of the corresponding video clip in the target video by using the video clip duration corresponding to each video clip.
20. A video clipping apparatus, characterized in that the apparatus comprises:
the content time length sequence forming module is used for acquiring the time length corresponding to each clip video segment in a video clip page and forming a content time length sequence according to the playing sequence of the clip video segments in the target video;
the background audio acquisition module is used for acquiring a background audio corresponding to the target video; the background audio is determined from a candidate audio set according to a target recommendation degree corresponding to a candidate audio, the target recommendation degree is positively correlated with a target similarity and a target heat degree corresponding to the candidate audio, and the target similarity refers to the similarity between the content time length sequence and a drum point time interval sequence corresponding to the candidate audio; the candidate audio is obtained by division from the corresponding parent audio according to the video playing time length of the target video;
the position alignment module is used for receiving a parent audio of a background audio corresponding to the target video and sent by a server, and position information of the background audio in the parent audio; and displaying the position information and the corresponding parent audio on the video clip interface, or displaying the parent audio on an audio track, and aligning the starting position of the background audio in the parent audio on the audio track with the starting position of the target video on the video track according to the position information.
21. The apparatus of claim 20, wherein the background audio acquisition module is further configured to:
sending a background music recommendation request corresponding to the target video to a server;
and receiving the background music returned by the server in response to the background music recommendation request to obtain the background audio corresponding to the target video.
22. The apparatus of claim 20, further comprising a background audio selection module, wherein the background audio selection module comprises:
the drum point time interval sequence acquisition unit is used for acquiring drum point time interval sequences corresponding to the candidate audios in the candidate audio set; each candidate audio corresponds to a plurality of drum points, and the interval lengths between the drum points form the drum point time interval sequence corresponding to the candidate audio according to the order of the drum points in the candidate audio;
a target similarity obtaining unit, configured to obtain a target similarity between the content time length sequence and the drumbeat time interval sequence corresponding to the candidate audio;
and the background audio determining unit is used for determining the background audio corresponding to the target video from the candidate audio set according to the target similarity corresponding to the candidate audio.
23. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 or 9 to 11 when executing the computer program.
24. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8 or 9 to 11.
CN202010775464.5A 2020-08-05 2020-08-05 Background audio determining method, video editing method, device and computer equipment Active CN111901626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010775464.5A CN111901626B (en) 2020-08-05 2020-08-05 Background audio determining method, video editing method, device and computer equipment

Publications (2)

Publication Number Publication Date
CN111901626A CN111901626A (en) 2020-11-06
CN111901626B true CN111901626B (en) 2021-12-14

Family

ID=73246915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010775464.5A Active CN111901626B (en) 2020-08-05 2020-08-05 Background audio determining method, video editing method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN111901626B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489676A (en) * 2020-12-15 2021-03-12 腾讯音乐娱乐科技(深圳)有限公司 Model training method, device, equipment and storage medium
CN112685592B (en) * 2020-12-24 2023-05-26 上海掌门科技有限公司 Method and device for generating sports video soundtrack
CN112822543A (en) * 2020-12-30 2021-05-18 北京达佳互联信息技术有限公司 Video processing method and device, electronic equipment and storage medium
CN112839257B (en) * 2020-12-31 2023-05-09 四川金熊猫新媒体有限公司 Video content detection method, device, server and storage medium
CN113473177B (en) * 2021-05-27 2023-10-31 北京达佳互联信息技术有限公司 Music recommendation method, device, electronic equipment and computer readable storage medium
CN113438547B (en) * 2021-05-28 2022-03-25 北京达佳互联信息技术有限公司 Music generation method and device, electronic equipment and storage medium
CN113365147B (en) * 2021-08-11 2021-11-19 腾讯科技(深圳)有限公司 Video editing method, device, equipment and storage medium based on music card point
CN113488083B (en) * 2021-08-23 2023-03-21 北京字节跳动网络技术有限公司 Data matching method, device, medium and electronic equipment
CN113573161B (en) * 2021-09-22 2022-02-08 腾讯科技(深圳)有限公司 Multimedia data processing method, device, equipment and storage medium
CN114422824A (en) * 2021-12-29 2022-04-29 阿里巴巴(中国)有限公司 Data processing method, video processing method, display method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016123788A1 (en) * 2015-02-06 2016-08-11 Empire Technology Development Llc Rhythm based multimedia generator
CN107481739A (en) * 2017-08-16 2017-12-15 成都品果科技有限公司 Audio cutting method and device
CN109168084A (en) * 2018-10-24 2019-01-08 麒麟合盛网络技术股份有限公司 A kind of method and apparatus of video clipping
CN110602550A (en) * 2019-08-09 2019-12-20 咪咕动漫有限公司 Video processing method, electronic equipment and storage medium
CN110782908A (en) * 2019-11-05 2020-02-11 广州欢聊网络科技有限公司 Audio signal processing method and device
CN110992993A (en) * 2019-12-17 2020-04-10 Oppo广东移动通信有限公司 Video editing method, video editing device, terminal and readable storage medium
CN111405357A (en) * 2019-01-02 2020-07-10 阿里巴巴集团控股有限公司 Audio and video editing method and device and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090273712A1 (en) * 2008-05-01 2009-11-05 Elliott Landy System and method for real-time synchronization of a video resource and different audio resources
WO2015089095A1 (en) * 2013-12-10 2015-06-18 Google Inc. Providing beat matching
US20180295427A1 (en) * 2017-04-07 2018-10-11 David Leiberman Systems and methods for creating composite videos
CN109587554B (en) * 2018-10-29 2021-08-03 百度在线网络技术(北京)有限公司 Video data processing method and device and readable storage medium
US11804246B2 (en) * 2018-11-02 2023-10-31 Soclip! Automatic video editing using beat matching detection
CN109379643B (en) * 2018-11-21 2020-06-09 北京达佳互联信息技术有限公司 Video synthesis method, device, terminal and storage medium
CN110797055B (en) * 2019-10-29 2021-09-03 北京达佳互联信息技术有限公司 Multimedia resource synthesis method and device, electronic equipment and storage medium
CN110769309B (en) * 2019-11-04 2023-03-31 北京字节跳动网络技术有限公司 Method, device, electronic equipment and medium for displaying music points
CN110958386B (en) * 2019-11-12 2022-05-06 北京达佳互联信息技术有限公司 Video synthesis method and device, electronic equipment and computer-readable storage medium
CN111182347B (en) * 2020-01-07 2021-03-23 腾讯科技(深圳)有限公司 Video clip cutting method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code; ref country code: HK; ref legal event code: DE; ref document number: 40030676
SE01 Entry into force of request for substantive examination
GR01 Patent grant