CN109949792B - Multi-audio synthesis method and device - Google Patents

Multi-audio synthesis method and device

Info

Publication number
CN109949792B
CN109949792B (application CN201910245364.9A)
Authority
CN
China
Prior art keywords
audio
combined
cutting
time
audios
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910245364.9A
Other languages
Chinese (zh)
Other versions
CN109949792A (en)
Inventor
邱慧
巩仔明
饶洪福
贾绍坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Youquan Information Technology Co ltd
Original Assignee
Youxinpai Beijing Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youxinpai Beijing Information Technology Co ltd filed Critical Youxinpai Beijing Information Technology Co ltd
Priority to CN201910245364.9A priority Critical patent/CN109949792B/en
Publication of CN109949792A publication Critical patent/CN109949792A/en
Application granted granted Critical
Publication of CN109949792B publication Critical patent/CN109949792B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The application provides a multi-audio synthesis method and device. The method comprises the following steps: cutting the audio to be edited into at least two audios to be combined; generating a mute audio according to the total duration of the at least two audios to be combined; and merging the at least two audios to be combined into the same channel of the mute audio, in a preset order, to obtain a combined audio. In the scheme provided by the application, the mute audio is generated according to the audios to be combined, and the audios to be combined are then merged into the same channel of the mute audio to obtain the combined audio. Because the audios to be combined are all merged into one channel of the mute audio, the sound-mixing phenomenon caused by differing channels is avoided while the audios are still merged normally, so the content of the combined audio is clear and the audio quality is improved.

Description

Multi-audio synthesis method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for synthesizing multiple audios.
Background
With the continuous development of internet technology, more and more users publish self-produced audio on the internet. Most of this audio is clipped and merged by the users themselves.
In the related art, multi-audio merging techniques generally connect a plurality of cut-out audios directly end to end and then convert them into the same audio format. Although this approach merges quickly and efficiently, the merged audio suffers from sound mixing: each clipped audio usually occupies its own channel, and when several such audios are joined end to end, their differing channels can cause the sounds in the combined audio to mix.
Because of this mixing problem, directly connecting and merging multiple audios in the related art leaves the content of the final merged audio unclear and its quality too low.
Disclosure of Invention
The application provides a multi-audio synthesis method and device, which can be used to solve the problem in the related art that directly connecting and merging a plurality of audios leaves the content of the final combined audio unclear and the audio quality too low.
In a first aspect, the present application provides a method for synthesizing multiple audios, the method comprising:
cutting the audio to be edited into at least two audios to be combined;
generating a mute audio according to the total duration of the at least two audios to be combined;
and combining the at least two audios to be combined to the same sound channel of the mute audio according to the preset sequence of the at least two audios to be combined to obtain a combined audio.
Optionally, the cutting the audio to be edited into at least two audios to be combined includes:
acquiring a cutting time stamp of the audio to be cut, wherein the cutting time stamp is used for indicating the starting time and the ending time of cutting;
acquiring a cut audio within the time indicated by the cut timestamp according to the cut timestamp;
and converting the cut audio into the same audio format to obtain the audio to be combined.
Optionally, the obtaining a cut timestamp of the audio to be clipped includes:
detecting audio content meeting preset conditions in the audio to be edited, wherein the preset conditions comprise at least one of the following: the audio volume is greater than a preset volume, the audio frequency is greater than a preset frequency, or the audio includes preset speech content;
determining the time period of the audio content meeting the preset condition in the audio to be clipped;
and generating the cutting time stamp according to the time period.
Optionally, the mute audio is a two-channel audio, and the mute audio is in the same audio format as the audio to be combined.
Optionally, before the cutting the audio to be edited into at least two audios to be combined, the method further includes:
and cutting out the empty frame of the audio to be clipped according to the start time stamp and the end time stamp of the audio to be clipped, wherein the start time stamp and the end time stamp are used for indicating the start time and the end time of playing the audio to be clipped.
Optionally, the merging the audio to be merged into the same channel of the mute audio includes:
determining a merging time stamp of the audio to be merged according to the preset sequence and the duration of the audio to be merged, wherein the merging time stamp is used for indicating the starting time and the ending time of the audio to be merged;
and inserting the audio to be combined into the mute audio according to the combination timestamp to obtain the combined audio.
In a second aspect, the present application provides a multi-audio synthesizing apparatus, comprising:
the audio cutting module is used for cutting the audio to be cut into at least two audios to be combined;
the audio generation module is used for generating mute audio according to the total duration of the at least two audios to be combined;
and the audio merging module is used for merging the at least two audios to be merged to the same sound channel of the mute audio according to the preset sequence of the at least two audios to be merged to obtain a merged audio.
Optionally, the audio cropping module includes:
the time acquisition unit is used for acquiring a cutting time stamp of the audio to be cut, and the cutting time stamp is used for indicating the starting time and the ending time of cutting;
the audio acquisition unit is used for acquiring the cut audio in the time indicated by the cutting time stamp according to the cutting time stamp;
and the audio conversion unit is used for converting the cut audio into the same audio format to obtain the audio to be combined.
Optionally, the time obtaining unit is configured to:
detecting audio content meeting preset conditions in the audio to be edited, wherein the preset conditions comprise at least one of the following: the audio volume is greater than a preset volume, the audio frequency is greater than a preset frequency, or the audio includes preset speech content;
determining the time period of the audio content meeting the preset condition in the audio to be clipped;
and generating the cutting time stamp according to the time period.
Optionally, the mute audio is a two-channel audio, and the mute audio is in the same audio format as the audio to be combined.
Optionally, the apparatus further comprises:
and the blank frame cutting module is used for cutting the blank frame of the audio to be clipped according to the start timestamp and the end timestamp of the audio to be clipped, wherein the start timestamp and the end timestamp are used for indicating the start time and the end time of playing the audio to be clipped.
Optionally, the audio combining module is configured to:
determining a merging time stamp of the audio to be merged according to the preset sequence and the duration of the audio to be merged, wherein the merging time stamp is used for indicating the starting time and the ending time of the audio to be merged;
and inserting the audio to be combined into the mute audio according to the combination timestamp to obtain the combined audio.
In a third aspect, the present application provides a terminal comprising a processor and a memory, the memory storing a computer program, the computer program being loaded and executed by the processor to implement the method for synthesizing multiple audios as described in the first aspect above.
In the scheme provided by the application, the terminal generates the mute audio according to the audio to be combined, and then combines the audio to be combined to the same sound channel of the mute audio to obtain the combined audio. Because the audio to be combined is combined to the same sound channel of the mute audio, the sound mixing phenomenon caused by different sound channels is avoided under the condition that the audio can be normally combined. The content of the combined audio can be clear, and the audio quality is improved.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for synthesizing multiple audios according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an audio synthesis process provided by one embodiment of the present application;
fig. 3 is a schematic block diagram of a multi-audio synthesizing apparatus according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In the method provided by the embodiments of the application, the execution subject of each step may be a terminal. Alternatively, the execution subject of each step may be an audio synthesis process running in the terminal, where the audio synthesis process refers to the process of an audio recording program. The terminal may be an electronic device such as a mobile phone, a tablet computer, an e-book reader, a multimedia playing device, a wearable device, or a laptop. For convenience of explanation, the following method embodiments describe the execution subject of each step as a terminal, but the embodiments are not limited thereto.
Referring to fig. 1, a flow chart of a method for synthesizing multi-audio according to an embodiment of the present application is shown. The method may include several steps as follows.
Step 101, cutting the audio to be clipped into at least two audios to be combined.
Merging multiple audios usually means cutting parts out of several audios and then joining the cut-out parts end to end into a new audio. The terminal therefore cuts at least two audios to be combined out of the audio to be edited. The audio to be edited refers to the original audio before the terminal merges anything. It may be a single audio, in which case the terminal cuts and recombines that one audio, deleting content and adjusting its order; or it may be several audios, from which the terminal cuts out multiple segments and merges them. For example, suppose the audio to be edited consists of two recordings: audio A, lasting 5 minutes, and audio B, lasting 10 minutes. To merge minutes 2 to 4 of audio A with minutes 5 to 8 of audio B into a new audio, the terminal cuts out those two segments respectively.
Optionally, the terminal first obtains a cut timestamp of the audio to be edited. The cut timestamp indicates the start time and end time at which the terminal cuts the audio. The terminal cuts the audio at the times indicated by the cut timestamp to obtain at least one cut audio, and then converts the cut audios into the same audio format to obtain the audios to be combined, which avoids merge failures caused by inconsistent formats during the subsequent merging.
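The patent gives no code, but the cutting step above can be sketched in a few lines. The sketch assumes, purely for illustration, that an audio is held as a list of PCM samples at a known sample rate; `cut_audio` and its parameters are hypothetical names.

```python
def cut_audio(samples, sample_rate, start_sec, end_sec):
    """Return the portion of an audio between the start and end times
    indicated by a cut timestamp, by converting times to sample indices."""
    start = int(start_sec * sample_rate)
    end = int(end_sec * sample_rate)
    return samples[start:end]
```

For example, with a (toy) sample rate of 2 samples per second, cutting seconds 1 to 3 out of a 5-second audio returns samples 2 through 5.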
In one possible implementation, the cut timestamp of the audio to be clipped is set by the user. The user specifies the time period required to be cut in the audio to be cut in the terminal. Accordingly, the terminal acquires a cut timestamp indicating the time period.
In another possible implementation, the terminal generates the cut timestamp of the audio to be edited automatically. The terminal detects content in the audio that meets preset conditions, treats that content as the content to be cut, and generates a cut timestamp from the time period it occupies. The purpose of merging is to cut the target content out of the audio to be edited and synthesize it into a new audio, and the audio content meeting the preset conditions is that target content. The preset conditions can be set according to practical experience and requirements: the audio volume may be required to exceed a preset volume, or the audio frequency may be required to exceed a preset frequency so that only clear audio is extracted. If the audio to be edited includes a user's speech, the preset condition may also be that preset speech content is included. The terminal obtains the speech content through speech recognition and text extraction, then detects whether it includes preset words. If not, the terminal determines that the corresponding audio does not need to be cut; if so, the terminal cuts the audio where the preset words are located, for example an audio consisting of a number of consecutive audio frames around the frame containing the preset words, such as 20 audio frames. Optionally, the terminal may also separate the audio to be edited from a video before cutting out the audios to be combined.
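The volume-based variant of automatic timestamp generation described above can be sketched as follows. The representation (a list of samples, a per-sample volume threshold, a sample rate of 1 sample per time unit in the test) and all names are assumptions for illustration, not from the patent.

```python
def cut_timestamps(samples, sample_rate, min_volume):
    """Return (start_sec, end_sec) periods in which the audio volume
    exceeds the preset volume; each period becomes one cut timestamp."""
    periods = []
    start = None
    for i, s in enumerate(samples):
        loud = abs(s) > min_volume
        if loud and start is None:
            start = i                       # a loud segment begins
        elif not loud and start is not None:
            periods.append((start / sample_rate, i / sample_rate))
            start = None                    # the segment ended
    if start is not None:                   # audio ends while still loud
        periods.append((start / sample_rate, len(samples) / sample_rate))
    return periods
```

The same scanning structure would apply to the other preset conditions (frequency above a threshold, or frames containing recognized preset words), with only the `loud` predicate replaced.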
Illustratively, the audio to be edited is recorded audio of a multi-day weather forecast for a plurality of cities. The terminal needs to synthesize all weather audio of city a into one new audio. The terminal identifies the voice content including the city name 'A' in the audio to be clipped, and then clips the audio consisting of 20 audio frames before and after the audio frame where the 'A' is located.
Optionally, the audio format of the audios to be combined is unified as the Windows Media Audio (WMA) format.
Optionally, before the terminal cuts the audio to be edited, it removes the empty frames of the audio. An empty frame is an audio frame in which no sound is recorded; for example, the beginning and end of a song recording typically include soundless frames. To prevent empty frames from appearing in the merged audio, the terminal removes them in advance. The empty frames are usually at the beginning and end of the audio to be edited, i.e. before the sound actually starts playing and after it actually stops. The terminal cuts off the empty frames according to the start timestamp and end timestamp of the audio to be edited, which indicate the start time and end time of its actual playback respectively. For example, suppose the audio to be edited is a song recording lasting 3 minutes 56 seconds, but the song starts playing at 0 minutes 15 seconds and ends at 3 minutes 54 seconds, after which there is no more music. The start and end timestamps are then 0 minutes 15 seconds and 3 minutes 54 seconds respectively. These timestamps can be preset by a technician, or determined by the terminal from the volume of each frame in the audio to be edited: a frame whose volume is less than or equal to 0 is determined to be an empty frame.
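A minimal sketch of the empty-frame trimming just described, under the same illustrative sample-list representation (one sample standing in for one frame; names are hypothetical): leading and trailing zero-volume samples are dropped, while interior silence is kept.

```python
def trim_empty_frames(samples):
    """Drop the leading and trailing silent (zero-volume) frames of an
    audio, corresponding to cutting outside the start/end timestamps."""
    first = next((i for i, s in enumerate(samples) if abs(s) > 0), None)
    if first is None:
        return []                       # the audio is entirely silent
    last = max(i for i, s in enumerate(samples) if abs(s) > 0)
    return samples[first:last + 1]
```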
And 102, generating a mute audio according to the total duration of at least two audios to be combined.
After obtaining the audios to be combined, the terminal generates a mute audio, i.e. an audio file containing no sound. The terminal generates the mute audio according to the audios to be combined: it determines their total duration and generates a mute audio whose duration is greater than or equal to that total. Illustratively, if the audios to be combined include audio c with a duration of 20 seconds and audio d with a duration of 30 seconds, the terminal generates a mute audio of 50 seconds.
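Continuing the illustrative sample-list representation (the two-channel dictionary and all names are assumptions, not the patent's data layout), generating the mute audio amounts to allocating a zero-amplitude buffer per channel, sized to the clips' total length:

```python
def make_mute_audio(clips):
    """Generate two-channel silence whose duration equals the total
    duration of the clips to be merged (step 102)."""
    total = sum(len(clip) for clip in clips)
    return {"left": [0] * total, "right": [0] * total}
```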
Optionally, the mute audio generated by the terminal is a two-channel (binaural) audio whose audio format is the same as that of the audios to be combined; when generating the mute audio, the terminal sets its format to that of the audios to be combined. In the subsequent merging step, the terminal can then merge the mute audio and the audios to be combined directly, since their formats already match, avoiding merge failures caused by inconsistent audio formats.
And 103, combining the audio to be combined to the same sound channel of the mute audio according to the preset sequence of the at least two audio to be combined to obtain a combined audio.
The preset order of the audios to be combined refers to their positional order in the final combined audio. The order may be preset according to practical experience and requirements. For example, suppose the audio to be edited consists of recordings of multi-day weather forecasts for several cities, and the terminal needs to synthesize all the weather audio for city A into one new audio; the audios to be combined are then the daily weather forecasts for city A, and the preset order may be the date order of those forecasts.
Optionally, the terminal receives an order instruction input by the user when merging the audios. The user can directly specify the merging order of the actually cut-out audios by entering an order instruction in the terminal, which instructs the terminal in what order to merge them. Accordingly, the terminal receives the instruction and merges the audios to be combined in the indicated order.
Synthesizing an audio on the terminal is, in effect, splicing the audios to be combined into one new audio. To do so, the terminal merges the audios to be combined with the mute audio. Since the amplitude of the mute audio is zero, i.e. it contains no sound, the sound in the final combined audio is exactly the sound of the audios to be combined; the mute audio does not interfere with it. When merging, the terminal inserts the audios to be combined into the same channel of the mute audio. Before merging, the terminal adjusts the volume of the audios to be combined to satisfy the requirements of the channel they are inserted into. Optionally, the terminal may insert the audios into the same channel of the mute audio through FFmpeg (Fast Forward MPEG).
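The patent suggests FFmpeg for this step; as a format-agnostic illustration of the same idea (not the patent's implementation, and with all names invented for the sketch), the insertion can be modeled as writing each clip's samples into a single channel of the silent buffer at a given offset, leaving the other channel silent:

```python
def merge_into_channel(mute, channel, clips):
    """Insert each (offset, samples) clip into one channel of the mute
    audio; the other channel stays silent, so no cross-channel mixing occurs."""
    merged = {name: list(track) for name, track in mute.items()}
    for offset, samples in clips:
        merged[channel][offset:offset + len(samples)] = samples
    return merged
```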
Optionally, the terminal determines a merge timestamp for each audio to be combined according to the preset order and the durations of the audios. The merge timestamp indicates the start time and end time of that audio within the combined audio. For example, suppose the terminal needs to merge two audios, audio a and audio b, with audio b preceding audio a in the preset order; audio a lasts 1 minute and audio b lasts 4 minutes. The terminal determines that the time period indicated by the merge timestamp of audio a is minute 4 to minute 5, and that of audio b is minute 0 to minute 4. The terminal then inserts each audio into the mute audio during the period its merge timestamp indicates. Because the merge timestamps state each audio's exact time period, the inaccuracy of simple end-to-end joining, where one audio can overlap the tail of the previous one, is avoided.
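The merge-timestamp computation is a running sum over the durations taken in the preset order; a sketch (function and variable names are illustrative):

```python
def merge_timestamps(durations_in_order):
    """Given clip durations already sorted into the preset order, return
    each clip's (start, end) merge timestamp in the combined audio."""
    stamps = []
    t = 0
    for d in durations_in_order:
        stamps.append((t, t + d))
        t += d
    return stamps
```

Applied to the example above (audio b of 4 minutes, then audio a of 1 minute), this yields minutes 0-4 for audio b and minutes 4-5 for audio a.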
Illustratively, fig. 2 shows the process by which the terminal merges audio. The audio to be edited comprises audio 201 and audio 202. The terminal cuts audio 201 to obtain the audio to be combined 2011, which is the segment from 1 minute 15 seconds to 2 minutes 15 seconds of audio 201. The terminal cuts audio 202 to obtain the audios to be combined 2021 and 2022: audio 2021 is the segment from 0 minutes 20 seconds to 1 minute 20 seconds of audio 202, and audio 2022 is the segment from 2 minutes 15 seconds to 3 minutes 0 seconds of audio 202. The terminal generates a mute audio 203 with a duration of 2 minutes 45 seconds; it is a two-channel audio comprising channel 1 and channel 2. The preset order designated by the user is audio 2021, audio 2011, audio 2022. The terminal determines that the merge timestamp of audio 2021 is 00:00-01:00, that of audio 2011 is 01:00-02:00, and that of audio 2022 is 02:00-02:45. According to these three timestamps, the terminal merges audios 2021, 2011, and 2022 into channel 1 of the mute audio 203, obtaining the combined audio.
In the method provided by the embodiment of the application, the terminal generates the mute audio according to the audio to be combined, and then combines the audio to be combined to the same sound channel of the mute audio to obtain the combined audio. Because the audio to be combined is combined to the same sound channel of the mute audio, the sound mixing phenomenon caused by different sound channels is avoided under the condition that the audio can be normally combined. The content of the combined audio can be clear, and the audio quality is improved.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 3 is a schematic block diagram illustrating a multi-audio synthesizing apparatus according to an exemplary embodiment. The device has the function of realizing the method, and the function can be realized by hardware or by hardware executing corresponding software. The apparatus may include: an audio cropping module 301, an audio generation module 302 and an audio merging module 303.
An audio clipping module 301, configured to clip the audio to be clipped into at least two audio to be combined.
The audio generating module 302 is configured to generate a mute audio according to the total duration of the at least two audios to be combined.
And an audio merging module 303, configured to merge the at least two audios to be merged into the same channel of the mute audio according to a preset sequence of the at least two audios to be merged to obtain a merged audio.
In the device provided by the embodiment of the application, the terminal generates the mute audio according to the audio to be combined, and then combines the audio to be combined to the same sound channel of the mute audio to obtain the combined audio. Because the audio to be combined is combined to the same sound channel of the mute audio, the sound mixing phenomenon caused by different sound channels is avoided under the condition that the audio can be normally combined. The content of the combined audio can be clear, and the audio quality is improved.
Optionally, the audio clipping module 301 includes:
the time acquisition unit is used for acquiring a cutting time stamp of the audio to be cut, and the cutting time stamp is used for indicating the starting time and the ending time of cutting; the audio acquisition unit is used for acquiring the cut audio in the time indicated by the cutting time stamp according to the cutting time stamp; and the audio conversion unit is used for converting the cut audio into the same audio format to obtain the audio to be combined.
Optionally, the time obtaining unit is configured to:
detecting audio content meeting preset conditions in the audio to be edited, wherein the preset conditions comprise at least one of the following: the audio volume is greater than a preset volume, the audio frequency is greater than a preset frequency, or the audio includes preset speech content; determining the time period of the audio content meeting the preset conditions in the audio to be edited; and generating the cut timestamp according to the time period.
Optionally, the mute audio is a two-channel audio, and the mute audio is in the same audio format as the audio to be combined.
Optionally, the apparatus further comprises: and the blank frame cutting module is used for cutting the blank frame of the audio to be clipped according to the start timestamp and the end timestamp of the audio to be clipped, wherein the start timestamp and the end timestamp are used for indicating the start time and the end time of playing the audio to be clipped.
Optionally, the audio combining module 303 is configured to:
determining a merging time stamp of the audio to be merged according to the preset sequence and the duration of the audio to be merged, wherein the merging time stamp is used for indicating the starting time and the ending time of the audio to be merged; and inserting the audio to be combined into the mute audio according to the combination timestamp to obtain the combined audio.
The application also provides a multi-audio synthesizing terminal, which comprises a processor and a memory. Wherein the memory stores a computer program. The computer program is loaded and executed by the processor to implement the multi-audio synthesizing method.
In an exemplary embodiment, there is also provided a computer-readable storage medium having a computer program stored therein, the computer program being loaded and executed by a terminal to implement the multi-audio synthesizing method provided by the above-described embodiments. Alternatively, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Those skilled in the art will clearly understand that the techniques in the embodiments of the present application may be implemented by way of software plus a required general hardware platform. Based on such understanding, the technical solutions in the embodiments of the present application may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. A method for synthesizing multiple audios, the method comprising:
cutting the audio to be edited into at least two audios to be combined;
generating a mute audio according to the total duration of the at least two audios to be combined;
according to the preset sequence of the at least two audios to be combined, combining the at least two audios to be combined to the same sound channel of the mute audio to obtain a combined audio;
wherein, cutting the audio to be edited into at least two audios to be combined comprises: detecting the audio tone quality of the audio to be edited, and if the audio tone quality is greater than a preset frequency, determining the time period of the audio content of which the audio tone quality is greater than the preset frequency in the audio to be edited; generating a cutting time stamp according to the time period, wherein the cutting time stamp is used for indicating the starting time and the ending time of cutting; acquiring a cut audio within the time indicated by the cut timestamp according to the cut timestamp; and converting the cut audio into the same audio format to obtain the audio to be combined.
2. The method of claim 1, wherein the mute audio is a two-channel audio, and wherein the mute audio is in the same audio format as the audios to be combined.
3. The method of claim 1, wherein, before cutting the audio to be edited into the at least two audios to be combined, the method further comprises:
cutting out empty frames of the audio to be edited according to a start timestamp and an end timestamp of the audio to be edited, wherein the start timestamp and the end timestamp indicate a start time and an end time of playing the audio to be edited.
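The empty-frame trimming of claim 3 can be sketched like this, run before the cutting step. Leading and trailing all-zero frames are treated as "empty" here; the frame size and the zero-sample test are illustrative assumptions rather than details taken from the patent.

```python
def playback_timestamps(samples, sample_rate, frame_ms=20):
    """Return (start_ms, end_ms): where playable content begins and ends,
    skipping empty (all-zero) frames at either end of the audio."""
    per_frame = sample_rate * frame_ms // 1000
    frames = [samples[i:i + per_frame] for i in range(0, len(samples), per_frame)]
    nonempty = [i for i, fr in enumerate(frames) if any(fr)]
    if not nonempty:
        return (0, 0)  # the audio contains nothing but empty frames
    return (nonempty[0] * frame_ms, (nonempty[-1] + 1) * frame_ms)

def trim_empty_frames(samples, sample_rate, frame_ms=20):
    """Cut out the empty frames according to the start/end timestamps."""
    start_ms, end_ms = playback_timestamps(samples, sample_rate, frame_ms)
    per_ms = sample_rate // 1000
    return samples[start_ms * per_ms:end_ms * per_ms]
```

On a 1 kHz stream of 40 zero samples, 20 nonzero samples, and 40 more zeros, the timestamps come out as (40, 60) and trimming keeps only the 20 ms of content in between.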
4. The method of claim 1, wherein merging the at least two audios to be combined into the same channel of the mute audio comprises:
determining a merging timestamp of each audio to be combined according to the preset sequence and the durations of the audios to be combined, wherein the merging timestamp indicates a start time and an end time of that audio to be combined; and
inserting the audios to be combined into the mute audio according to the merging timestamps to obtain the merged audio.
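Claims 1 and 4 together can be sketched as: derive each segment's merging timestamp from the preset sequence and the durations, build a mute audio of the total duration, and insert every segment into the same (here, left) channel. Two-channel audio is modeled as parallel left/right sample lists; all names here are illustrative assumptions.

```python
def merge_timestamps(durations_ms):
    """Derive each audio's (start_ms, end_ms) merging timestamp from the
    preset sequence (the list order) and the audios' durations."""
    stamps, t = [], 0
    for d in durations_ms:
        stamps.append((t, t + d))
        t += d
    return stamps

def merge_to_channel(audios, sample_rate):
    """Insert the audios to be combined into the left channel of a mute
    two-channel audio, yielding the merged (left, right) audio."""
    per_ms = sample_rate // 1000
    durations = [len(a) // per_ms for a in audios]
    total_ms = sum(durations)                 # total duration of the inputs
    left = [0] * (total_ms * per_ms)          # mute audio, left channel
    right = [0] * (total_ms * per_ms)         # mute audio, right channel
    for (start, _end), a in zip(merge_timestamps(durations), audios):
        left[start * per_ms:start * per_ms + len(a)] = a
    return left, right
```

Because every segment lands in the same channel of pre-generated silence, segments never overlap across channels, which is the mixing artifact the method is designed to avoid.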
5. A multi-audio synthesizing apparatus, comprising:
an audio cutting module, configured to cut an audio to be edited into at least two audios to be combined;
an audio generation module, configured to generate a mute audio according to a total duration of the at least two audios to be combined; and
an audio merging module, configured to merge the at least two audios to be combined into a same channel of the mute audio according to a preset sequence of the at least two audios to be combined, to obtain a merged audio,
wherein the audio cutting module comprises:
a time acquisition unit, configured to detect an audio tone quality of the audio to be edited, and if the audio tone quality is greater than a preset frequency, determine a time period, within the audio to be edited, of audio content whose audio tone quality is greater than the preset frequency, and generate a cutting timestamp according to the time period, wherein the cutting timestamp indicates a start time and an end time of the cutting;
an audio acquisition unit, configured to acquire, according to the cutting timestamp, a cut audio within the time indicated by the cutting timestamp; and
an audio conversion unit, configured to convert the cut audios into a same audio format to obtain the audios to be combined.
6. The apparatus of claim 5, wherein the mute audio is a two-channel audio, and wherein the mute audio is in the same audio format as the audios to be combined.
7. The apparatus of claim 5, further comprising:
a blank frame cutting module, configured to cut out blank frames of the audio to be edited according to a start timestamp and an end timestamp of the audio to be edited, wherein the start timestamp and the end timestamp indicate a start time and an end time of playing the audio to be edited.
8. The apparatus of claim 5, wherein the audio merging module is configured to:
determine a merging timestamp of each audio to be combined according to the preset sequence and the durations of the audios to be combined, wherein the merging timestamp indicates a start time and an end time of that audio to be combined; and
insert the audios to be combined into the mute audio according to the merging timestamps to obtain the merged audio.
CN201910245364.9A 2019-03-28 2019-03-28 Multi-audio synthesis method and device Active CN109949792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910245364.9A CN109949792B (en) 2019-03-28 2019-03-28 Multi-audio synthesis method and device


Publications (2)

Publication Number Publication Date
CN109949792A (en) 2019-06-28
CN109949792B (en) 2021-08-13

Family

ID=67012413


Country Status (1)

Country Link
CN (1) CN109949792B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111182315A (en) * 2019-10-18 2020-05-19 腾讯科技(深圳)有限公司 Multimedia file splicing method, device, equipment and medium
CN111145778B (en) * 2019-11-28 2023-04-04 科大讯飞股份有限公司 Audio data processing method and device, electronic equipment and computer storage medium
CN111653263B (en) * 2020-06-12 2023-03-31 百度在线网络技术(北京)有限公司 Volume adjusting method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984465A (en) * 2012-12-20 2013-03-20 北京中科大洋科技发展股份有限公司 Program synthesis system and method applicable to networked intelligent digital media
CN104168435A (en) * 2014-08-15 2014-11-26 北京彩云动力教育科技有限公司 Method and system for batched mergence and playing of audio files
CN106155470A (en) * 2015-04-21 2016-11-23 阿里巴巴集团控股有限公司 A kind of audio file generation method and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4168334B2 (en) * 2003-06-13 2008-10-22 ソニー株式会社 Editing apparatus and editing method
CN102568470B (en) * 2012-01-11 2013-12-25 广州酷狗计算机科技有限公司 Acoustic fidelity identification method and system for audio files
US20140270115A1 (en) * 2013-03-12 2014-09-18 J. Stephen Burnett Electronic Message Aggregation and Sharing System and Apparatus
CN104754178B (en) * 2013-12-31 2018-07-06 广州励丰文化科技股份有限公司 audio control method
CN104244086A (en) * 2014-09-03 2014-12-24 陈飞 Video real-time splicing device and method based on real-time conversation semantic analysis
EP2996269A1 (en) * 2014-09-09 2016-03-16 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio splicing concept
CN107622775B (en) * 2015-03-20 2020-12-18 Oppo广东移动通信有限公司 Method for splicing songs containing noise and related products
CN104780438A (en) * 2015-03-20 2015-07-15 广东欧珀移动通信有限公司 Method and device for splicing video and song audio
US9531867B2 (en) * 2015-03-31 2016-12-27 Verizon Patent And Licensing Inc. Methods and systems for determining a voice quality score for a mobile telephone
CN105681715B (en) * 2016-03-03 2019-11-15 腾讯科技(深圳)有限公司 A kind of audio/video processing method and device




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231127

Address after: 230012 In the factory building of Anhui Guogou Energy Co., Ltd., 100 meters east of the intersection of Guanjing Road and Luban Road in Xinzhan District, Hefei City, Anhui Province

Patentee after: Hefei Youquan Information Technology Co.,Ltd.

Address before: 100102 room 323701, building 5, yard 1, Futong East Street, Chaoyang District, Beijing

Patentee before: YOUXINPAI (BEIJING) INFORMATION TECHNOLOGY Co.,Ltd.