CN112637632A - Audio processing method and device, electronic equipment and storage medium - Google Patents

Audio processing method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN112637632A
CN112637632A (application CN202011497991.0A; granted as CN112637632B)
Authority
CN
China
Prior art keywords
audio data
video
audio
processed
human voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011497991.0A
Other languages
Chinese (zh)
Other versions
CN112637632B (en)
Inventor
李钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011497991.0A
Publication of CN112637632A
Application granted
Publication of CN112637632B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • H04N21/2335Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4398Processing of audio elementary streams involving reformatting operations of audio signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally

Abstract

The present disclosure relates to an audio processing method, an apparatus, an electronic device, and a storage medium. The method comprises: performing human voice detection on original audio data of a video to be processed to obtain a human voice detection result; acquiring soundtrack audio data of the video to be processed; and superimposing the original audio data of the video to be processed and the soundtrack audio data according to the human voice detection result. Because the superimposition is guided by the human voice detection result, the original human voice in the video's audio is retained while the soundtrack volume in the corresponding video segments is reduced, which improves the audio processing effect and makes the processed video richer and more expressive.

Description

Audio processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.
Background
During video editing, a user-imported video is transcoded and its image frames are extracted. The definition, color richness, saliency, and other properties of the frames are analyzed, and on this basis the quality of the video frames is scored with different weights. The highest-quality and most engaging video content is then cut out according to certain quantization criteria (such as the allowed clip duration range, or whether cuts must land on music beats), and suitable music and image decoration effects are selected for the video according to other dimensions (such as the scene of the video content).
In the related art, when a video, or a mixture of video and pictures, is intelligently cut and edited by image processing technology, the emphasis is on basic information such as the basic characteristics of the video picture (definition, color richness, picture saliency, and the like) and the content scene; the original audio in the video is not used as reference information for intelligent editing. At present, when a video is intelligently cut and edited, the original audio content in the video is discarded: for example, the volume of the original sound in the video is set to zero by default, and a suitable soundtrack is then selected based on the content and scene of the video to generate a composite video.
However, although the audio in a video is important information in the video content, it is completely erased in the process of intelligent cutting and editing in the related art. The original audio of the video (such as human voice) is therefore lost, and the expressiveness of the video is reduced.
Disclosure of Invention
The present disclosure provides an audio processing method, an audio processing apparatus, an electronic device, and a storage medium, so as to at least solve the technical problem in the related art that a video has poor expressiveness because its audio is completely erased during intelligent cutting and editing. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided an audio processing method, including:
performing human voice detection on original audio data of a video to be processed to obtain a human voice detection result;
acquiring soundtrack audio data of the video to be processed;
and superimposing the original audio data of the video to be processed and the soundtrack audio data according to the human voice detection result.
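The three steps above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the helper names (`detect_voice`, `superimpose`), the per-sample amplitude-threshold "detector", and the gain values 1.0 and 0.3 are all illustrative assumptions.

```python
# Illustrative sketch of the claimed three-step method. The voice detector
# is mocked by an amplitude threshold; the real method uses a trained
# voice detection model.

def detect_voice(original, threshold=0.5):
    """Return a per-sample mask: True where 'human voice' is detected."""
    return [abs(s) > threshold for s in original]

def superimpose(original, soundtrack, voice_mask, duck=0.3):
    """Mix original and soundtrack audio: duck the soundtrack where voice
    occurs, and mute the original where no voice occurs."""
    mixed = []
    for orig, track, voiced in zip(original, soundtrack, voice_mask):
        if voiced:
            mixed.append(orig * 1.0 + track * duck)   # keep voice, duck music
        else:
            mixed.append(orig * 0.0 + track * 1.0)    # music only
    return mixed

original = [0.9, 0.8, 0.0, 0.1]      # voice in the first two samples
soundtrack = [0.5, 0.5, 0.5, 0.5]
mask = detect_voice(original)
mixed = superimpose(original, soundtrack, mask)
print(mask)    # [True, True, False, False]
```

The key point mirrors the abstract: the mix is driven entirely by the detection result, so the original voice survives and the soundtrack fills the unvoiced spans.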
Optionally, the superimposing the original audio data of the video to be processed and the soundtrack audio data according to the human voice detection result includes:
respectively performing gain processing on the original audio data and the soundtrack audio data according to the human voice detection result;
and superimposing the gain-processed original audio data and soundtrack audio data.
Optionally, the human voice detection result includes audio time periods in which human voice occurs, and the respectively performing gain processing on the original audio data and the soundtrack audio data according to the human voice detection result includes:
multiplying the original audio data corresponding to an audio time period in which human voice occurs by a first gain coefficient, and multiplying the soundtrack audio data corresponding to that time period by a second gain coefficient;
and multiplying the original audio data corresponding to an audio time period in which no human voice occurs by a third gain coefficient, and multiplying the soundtrack audio data corresponding to that time period by a fourth gain coefficient.
Optionally, the first gain coefficient is 1 and the second gain coefficient is smaller than 1; the third gain coefficient is 0 and the fourth gain coefficient is 1.
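The four-coefficient rule can be sketched as below. The coefficient values 1, 0, and 1 follow the clause above; the second coefficient of 0.4 is an arbitrary illustrative choice for "smaller than 1".

```python
# Sketch of the four-gain-coefficient rule: per segment, gain each stream
# depending on whether the segment contains human voice.

G1, G2, G3, G4 = 1.0, 0.4, 0.0, 1.0   # G2 < 1 per the disclosure; 0.4 is arbitrary

def apply_gains(original, soundtrack, voiced):
    """Return (gained_original, gained_soundtrack) for one segment."""
    if voiced:   # voice present: keep original at full level, duck soundtrack
        return [s * G1 for s in original], [s * G2 for s in soundtrack]
    else:        # no voice: mute original, soundtrack at full level
        return [s * G3 for s in original], [s * G4 for s in soundtrack]

orig_v, track_v = apply_gains([0.8, 0.6], [0.5, 0.5], voiced=True)
orig_s, track_s = apply_gains([0.8, 0.6], [0.5, 0.5], voiced=False)
print(orig_v, orig_s)  # [0.8, 0.6] [0.0, 0.0]
```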
Optionally, the method further includes:
for a connection point between an audio time period in which human voice occurs and an audio time period in which it does not, gradually changing the corresponding gain coefficients on both sides of the connection point within preset time periods on each side of the connection point.
Optionally, the human voice detection result further includes: a probability of occurrence of a human voice, the method further comprising:
judging whether the probability of occurrence of human voice reaches a preset threshold;
and if the probability of occurrence of human voice reaches the preset threshold, taking the audio time period in which that probability reaches the threshold as an audio time period in which human voice occurs.
Optionally, the human voice detection result further includes: at least one of a probability of noise occurrence and a probability of music occurrence, the method further comprising:
judging whether the probability of the occurrence of the human voice is greater than at least one of the probability of the occurrence of the noise and the probability of the occurrence of the music;
and if so, executing the step of judging whether the probability of occurrence of human voice reaches the preset threshold.
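The two-stage decision in the preceding clauses can be sketched as follows. The probability values and the 0.5 threshold are illustrative assumptions; in the disclosure they come from the voice detection model and a preset configuration.

```python
# Sketch of the two-stage check: a segment counts as voiced only if its
# voice probability (a) exceeds its noise/music probabilities, and then
# (b) reaches the preset threshold.

def voiced_periods(segments, threshold=0.5):
    """segments: list of (start, end, p_voice, p_noise, p_music) tuples.
    Returns the (start, end) periods treated as containing human voice."""
    periods = []
    for start, end, p_voice, p_noise, p_music in segments:
        if p_voice <= max(p_noise, p_music):   # voice is not the dominant class
            continue
        if p_voice >= threshold:               # voice probability high enough
            periods.append((start, end))
    return periods

segments = [
    (0, 5, 0.9, 0.05, 0.05),   # clearly voice
    (5, 10, 0.4, 0.3, 0.3),    # voice dominant but below the threshold
    (10, 15, 0.2, 0.1, 0.7),   # music dominant
]
print(voiced_periods(segments))  # [(0, 5)]
```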
Optionally, the performing human voice detection on the original audio data of the video to be processed to obtain a human voice detection result includes:
acquiring original audio data of the video to be processed;
dividing the original audio data of the video to be processed into a plurality of audio data segments according to a set time;
and performing human voice detection on each of the plurality of audio data segments through a voice detection model to obtain a human voice detection result.
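The acquire-divide-detect steps can be sketched as below. The "model" here is a mock energy threshold standing in for the disclosure's voice detection model, and the sample rate and duration are illustrative.

```python
# Sketch of splitting audio into fixed-duration segments and running a
# (mocked) voice detection model on each.

def split_segments(samples, rate, seconds):
    """Divide audio samples into consecutive segments of `seconds` each."""
    step = int(rate * seconds)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def mock_voice_model(segment):
    """Mock detector: 'voice' if mean absolute amplitude is high enough."""
    return sum(abs(s) for s in segment) / len(segment) > 0.3

samples = [0.9] * 10 + [0.0] * 10   # 1 s of loud audio, 1 s of silence
segs = split_segments(samples, rate=10, seconds=1)
print([mock_voice_model(s) for s in segs])  # [True, False]
```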
Optionally, after the performing human voice detection on the original audio data of the video to be processed to obtain a human voice detection result, the method further includes:
cutting the video to be processed based on the audio time periods in which human voice occurs and the content characteristics of the video to be processed, to obtain cut video segments;
and superimposing, according to the human voice detection result, the original audio data in the video segments obtained after cutting and the soundtrack audio data.
Optionally, the cutting the video to be processed based on the audio time periods in which human voice occurs and the content characteristics of the video to be processed includes:
analyzing the content pictures and content scenes of the video to be processed through a video depth analysis model, to obtain video image frames meeting a preset content condition;
and cutting the video to be processed based on the video image frames corresponding to the audio time periods in which human voice occurs and the obtained video image frames meeting the preset content condition, to obtain cut video segments.
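One way the voiced periods and the content-qualified frames might be combined when cutting is sketched below. The disclosure does not specify how the two frame sets are merged; keeping the union of them is an assumption for illustration, as are the toy timestamps and content flags.

```python
# Sketch of selecting frames for cutting: a frame is kept if it falls inside
# a voiced audio period OR passes the (mocked) content check.

def frames_to_keep(voiced_periods, frame_times, content_ok):
    """Return timestamps of frames to retain in the cut."""
    kept = []
    for t, ok in zip(frame_times, content_ok):
        in_voice = any(start <= t < end for start, end in voiced_periods)
        if in_voice or ok:
            kept.append(t)
    return kept

voiced = [(2, 4)]                                    # voice from t=2 to t=4
times = [0, 1, 2, 3, 4, 5]
ok = [True, False, False, False, False, True]        # frames passing content check
print(frames_to_keep(voiced, times, ok))  # [0, 2, 3, 5]
```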
According to a second aspect of the embodiments of the present disclosure, there is provided an audio processing apparatus including:
a detection module configured to perform human voice detection on the original audio data of a video to be processed to obtain a human voice detection result;
a first acquisition module configured to acquire soundtrack audio data of the video to be processed;
and a superposition processing module configured to superimpose the original audio data of the video to be processed and the soundtrack audio data according to the human voice detection result.
Optionally, the superposition processing module includes:
a gain processing module configured to respectively perform gain processing on the original audio data and the soundtrack audio data according to the human voice detection result;
and a superposition module configured to superimpose the gain-processed original audio data and soundtrack audio data.
Optionally, the human voice detection result obtained by the detection module includes audio time periods in which human voice occurs, and the gain processing module includes:
a first calculation module configured to multiply the original audio data corresponding to an audio time period in which human voice occurs by a first gain coefficient, and multiply the soundtrack audio data corresponding to that time period by a second gain coefficient;
and a second calculation module configured to multiply the original audio data corresponding to an audio time period in which no human voice occurs by a third gain coefficient, and multiply the soundtrack audio data corresponding to that time period by a fourth gain coefficient.
Optionally, the first gain coefficient applied by the first calculation module is 1 and the second gain coefficient is smaller than 1; the third gain coefficient applied by the second calculation module is 0 and the fourth gain coefficient is 1.
Optionally, the apparatus further comprises:
and a gradual change processing module configured to, for a connection point between an audio time period in which human voice occurs and an audio time period in which it does not, gradually change the corresponding gain coefficients on both sides of the connection point within preset time periods on each side.
Optionally, the human voice detection result obtained by the detection module further includes a probability of occurrence of human voice, and the apparatus further includes:
a first judging module configured to judge whether the probability of occurrence of human voice reaches a preset threshold;
and a determining module configured to, when the first judging module judges that the probability of occurrence of human voice reaches the preset threshold, take the audio time period in which that probability reaches the threshold as an audio time period in which human voice occurs.
Optionally, the human voice detection result obtained by the detection module further includes at least one of a probability of occurrence of noise and a probability of occurrence of music, and the apparatus further includes:
a second judging module configured to judge whether the probability of occurrence of human voice is greater than at least one of the probability of occurrence of noise and the probability of occurrence of music;
and the first judging module is further configured to, when the second judging module judges that the probability of occurrence of human voice is greater than at least one of the probability of occurrence of noise and the probability of occurrence of music, judge whether the probability of occurrence of human voice reaches the preset threshold.
Optionally, the detection module includes:
a second acquisition module configured to acquire original audio data of the video to be processed;
a dividing module configured to divide the original audio data of the video to be processed into a plurality of audio data segments according to a set time;
and an audio detection module configured to perform human voice detection on each of the plurality of audio data segments through a voice detection model to obtain a human voice detection result.
Optionally, the apparatus further comprises:
a cutting module configured to, after the detection module performs human voice detection on the original audio data of the video to be processed to obtain a human voice detection result, cut the video to be processed based on the audio time periods in which human voice occurs and the content characteristics of the video to be processed, to obtain cut video segments;
and the superposition processing module is further configured to superimpose, according to the human voice detection result, the original audio data in the video segments obtained after cutting and the corresponding soundtrack audio data.
Optionally, the cutting module includes:
a video content analysis module configured to analyze the content pictures and content scenes of the video to be processed through a video depth analysis model, to obtain video image frames meeting a preset content condition;
and a video content cutting module configured to cut the video to be processed based on the video image frames corresponding to the audio time periods in which human voice occurs and the obtained video image frames meeting the preset content condition, to obtain cut video segments.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform any of the audio processing methods described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein instructions, when executed by a processor of an electronic device, cause the electronic device to perform any one of the audio processing methods described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, wherein instructions of the computer program product, when executed by a processor of an electronic device, cause the electronic device to perform any one of the audio processing methods described above.
The technical scheme provided by the embodiment of the disclosure at least has the following beneficial effects:
in the audio processing method shown in the present exemplary embodiment, a human voice detection is performed on audio data in a video to be processed, so as to obtain a human voice detection result; carrying out audio processing on the original audio data in the video to be processed and the corresponding audio data of the soundtrack; acquiring dubbing music audio data of the video to be processed; and superposing the original audio data of the video to be processed and the audio data of the score according to the human voice detection result. That is to say, in the present disclosure, the original audio data of the video to be processed and the soundtrack audio data are superimposed according to the human voice detection result, so that the original human voice in the original audio data in the video is retained, the soundtrack volume in the corresponding video segment is reduced, the audio processing effect in the video is improved, and the processed video has more richness and expressiveness.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating an audio processing method according to an example embodiment.
FIG. 2 is another flow diagram illustrating an audio processing method according to an example embodiment.
Fig. 3 is a block diagram illustrating an audio processing device according to an example embodiment.
Fig. 4 is a block diagram illustrating an overlay processing module according to an example embodiment.
Fig. 5 is another block diagram illustrating an audio processing device according to an example embodiment.
Fig. 6 is yet another block diagram illustrating an audio processing device according to an example embodiment.
FIG. 7 is a block diagram illustrating a detection module according to an example embodiment.
Fig. 8 is a block diagram illustrating a structure of an electronic device according to an example embodiment.
Fig. 9 is a block diagram illustrating the structure of a device provided with an audio processing apparatus, according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating an audio processing method according to an exemplary embodiment. As shown in Fig. 1, the method is used in a terminal and includes the following steps:
in step 101, performing human voice detection on the audio data of a video to be processed to obtain a human voice detection result;
in step 102, acquiring soundtrack audio data of the video to be processed;
in step 103, superimposing the original audio data of the video to be processed and the soundtrack audio data according to the human voice detection result.
The audio processing method of the present disclosure may be applied to a terminal, a server, and the like, without limitation; the terminal may be an electronic device such as a smartphone, a notebook computer, or a tablet computer.
The following describes, with reference to fig. 1, specific implementation steps of an audio processing method provided in an embodiment of the present disclosure in detail.
Firstly, step 101 is executed to perform human voice detection on audio data in a video to be processed to obtain a human voice detection result.
The method specifically comprises the following steps:
1) Acquiring original audio data of the video to be processed.
When it is detected that a user has imported a video to be processed into an audio/video production interface, the original audio data in the video is obtained. The imported material may be a single video, multiple videos, or a mixture of at least one video and at least one picture; the embodiment is not limited.
In this embodiment, the user may import the video to be processed into an audio/video production interface, for example one for quickly publishing works. When the imported video is detected, the background decodes its audio/video streams and extracts the corresponding audio data. Audio/video decoding is well known to those skilled in the art and is not described again here. Alternatively, the video may first be format-converted (for example, converted directly into audio) before the corresponding audio data is extracted; the embodiment is not limited.
2) Dividing the obtained original audio data into a plurality of audio data segments according to a set time.
In this step, the set time may define equal segments (for example, 5 seconds each) or unequal ones (for example, incremental durations of 3, 5, 7, and 10 seconds); the durations may also be decremental, or otherwise set as needed, and the embodiment is not limited.
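The unequal-duration case can be sketched as below, using the incremental 3/5/7/10-second durations mentioned above; treating any leftover samples as one final segment is an illustrative assumption.

```python
# Sketch of splitting audio by a list of unequal "set times" (in seconds),
# with any remainder kept as a final segment.

def split_by_durations(samples, rate, durations):
    """Cut audio into segments whose lengths follow `durations`."""
    segments, pos = [], 0
    for d in durations:
        if pos >= len(samples):
            break
        step = int(rate * d)
        segments.append(samples[pos:pos + step])
        pos += step
    if pos < len(samples):
        segments.append(samples[pos:])   # leftover samples
    return segments

samples = list(range(30))                        # 30 s at 1 sample/s, for clarity
segs = split_by_durations(samples, 1, [3, 5, 7, 10])
print([len(s) for s in segs])  # [3, 5, 7, 10, 5]
```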
3) Performing human voice detection on each of the plurality of audio data segments through a voice detection model to obtain a human voice detection result.
In this step, the plurality of audio data segments may be sent to an audio detection Software Development Kit (SDK), and the audio detection SDK performs voice detection on each of the plurality of audio data segments through a voice detection model to obtain a voice detection result.
An SDK is a set of development tools used to build application software for a specific software package, software framework, hardware platform, operating system, and the like. The audio detection SDK is a software package customized for audio detection.
The voice detection model, which may also be referred to as a human voice detection model, is used for detecting human voice in a video and may integrate a voice detector or the like. For example, audio data is input into the model, which performs human voice detection and outputs a detection result. The detection result may include the audio time periods in which human voice occurs, and may further include the probability of occurrence of human voice and at least one of the probability of occurrence of music and the probability of occurrence of noise. In practical applications, other parameters may be included as needed; the embodiment is not limited.
Next, step 102 is executed to acquire the soundtrack audio data of the video to be processed.
In this step, corresponding soundtrack audio data may be obtained from recommended soundtracks according to the content of the video to be processed; alternatively, soundtrack audio data may be searched for and selected according to user instructions. Specifically, this includes the following ways:
In one way, the system selects, from a library, one or more soundtracks related to the content of the video to be processed and recommends them to the user; after the user selects a soundtrack, its audio data is acquired as the soundtrack audio data of the video to be processed.
In another way, a corresponding soundtrack is selected directly according to the content of the video to be processed, that is, the soundtrack audio data of the video to be processed is acquired.
Finally, step 103 is executed: the original audio data of the video to be processed and the soundtrack audio data are superimposed according to the human voice detection result.
The method specifically comprises the following steps:
1) Respectively performing gain processing on the original audio data and the soundtrack audio data according to the human voice detection result.
in this step, the gain processing is to adjust the loudness of the audio data based on the result of the human voice detection. The gain processing in the present disclosure may be processed by an Automatic Gain Control (AGC) algorithm (or an automatic gain control circuit, etc.). For example, based on the human voice detection result, the original audio data is multiplied by a gain factor, which is also equivalent to simultaneously multiplying the gain factor at each frequency in the frequency domain, but since human hearing senses all frequencies not linearly and follows an equal loudness curve, after processing, certain frequencies are heard to be enhanced, and certain frequencies are weakened, which results in amplification of language distortion. And similarly, carrying out AGC processing on the dubbing music audio data based on the human voice detection result.
2) Superimposing the gain-processed original audio data and soundtrack audio data.
In this step, the gain-processed original audio data and soundtrack audio data are superimposed, so that the volume of the superimposed audio is adjusted according to the human voice detection result. That is, in the superimposed audio, the original human voice is retained and the soundtrack volume is reduced while the voice is present; in the audio time periods in which no human voice occurs, the volume of the original audio data is set to zero and the soundtrack is kept at its original volume. Redundant sounds outside the voiced time periods are thereby eliminated, the soundtrack plays at its normal level in those periods, and a volume-adjusted video that retains the original human voice is obtained.
In the audio processing method shown in the present exemplary embodiment, human voice detection is performed on the audio data in the video to be processed to obtain a human voice detection result; the soundtrack audio data of the video to be processed is acquired; and the original audio data of the video to be processed and the soundtrack audio data are superimposed according to the human voice detection result. That is to say, because the original audio data and the soundtrack audio data are superimposed according to the human voice detection result, the original human voice in the video is retained while the soundtrack volume in the corresponding video clip is reduced, improving the audio processing effect and making the processed video richer and more expressive.
Optionally, in another embodiment, on the basis of the foregoing embodiment, the human voice detection result includes: audio time periods in which human voice occurs;
the respectively performing gain processing on the original audio data and the soundtrack audio data according to the human voice detection result comprises:
multiplying the original audio data corresponding to the audio time periods in which the human voice appears by a first gain coefficient, and multiplying the soundtrack audio data corresponding to those time periods by a second gain coefficient; and multiplying the original audio data corresponding to the audio time periods in which the human voice does not appear by a third gain coefficient, and multiplying the soundtrack audio data corresponding to those time periods by a fourth gain coefficient. The first gain coefficient is 1 and the second gain coefficient is smaller than 1; the third gain coefficient is 0 and the fourth gain coefficient is 1.
That is, in this step, the original audio data in the audio time periods in which the human voice occurs is multiplied by the first gain coefficient, so that its volume remains at the original human voice volume (a gain of 1); the soundtrack audio data in those time periods is multiplied by the second gain coefficient, so that the soundtrack volume is reduced (a gain smaller than 1); the original audio data in the audio time periods in which no human voice occurs is muted (a gain of 0); and the soundtrack in those time periods is kept at its original volume (a gain of 1).
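The four-coefficient gain scheme above can be sketched as follows (an illustrative Python fragment, not part of the disclosure; the value 0.3 for the second gain coefficient and the boolean per-sample mask representation are assumptions):

```python
import numpy as np

def mix_by_voice_mask(original, soundtrack, voice_mask, bgm_gain=0.3):
    """Apply the embodiment's four gain coefficients and superimpose
    the two signals. `voice_mask` is True where human voice was
    detected. `bgm_gain` stands in for the second gain coefficient;
    the embodiment only requires it to be smaller than 1."""
    m = voice_mask.astype(original.dtype)
    orig = original * m                             # gains: 1 (voiced) / 0 (unvoiced)
    bgm = soundtrack * (bgm_gain * m + (1.0 - m))   # gains: bgm_gain / 1
    return orig + bgm
```

In voiced samples the mix is the full-volume original voice plus an attenuated soundtrack; in unvoiced samples only the soundtrack at its normal volume remains.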
In the audio processing method shown in this exemplary embodiment, the original audio data and the soundtrack audio data are subjected to gain processing respectively according to the human voice detection result, and the gain-processed original audio data and soundtrack audio data are superimposed. This realizes volume adjustment of the mixed audio data: the original human voice is retained in the audio time periods in which it appears and the soundtrack volume there is reduced, while in the audio time periods without human voice the original audio is muted and the soundtrack is kept at its normal volume. Redundant sounds in the video to be processed are thereby eliminated outside the voiced time periods, and a volume-adjusted video that retains the original human voice is obtained. By superimposing the gain-processed original audio data and soundtrack audio data according to the human voice detection result, the video clips containing human voice are detected accurately, the human voice in those clips is preserved while the corresponding soundtrack volume is reduced, and the intelligent editing effect is improved. Redundant information and noise in the audio are eliminated while the volume of the original human voice is retained to the greatest extent, improving the audio processing effect and making the processed video richer and more expressive.
Optionally, in another embodiment, on the basis of the above embodiment, the method may further include: for each connection point between an audio time period in which the human voice appears and an audio time period in which it does not, gradually changing the gain coefficients on both sides of the connection point within preset time periods on either side of it.
That is to say, in the embodiment of the present disclosure, within the preset time periods on both sides of each connection point between a voiced and an unvoiced audio time period, the gain coefficients are ramped gradually from the value used in one time period to the value used in the other, so that the audio volume transitions smoothly.
Referring to fig. 2, another flowchart of an audio processing method according to an embodiment of the present disclosure is shown, where the method includes:
step 201: carrying out voice detection on audio data in a video to be processed to obtain a voice detection result, wherein the voice detection result comprises: probability of occurrence of human voice;
step 201 is the same as step 101, and is described in detail above, and will not be described herein again.
Step 202: judging whether the probability of the occurrence of the voice reaches a preset threshold value or not; if yes, go to step 203; otherwise, go to step 207;
in this step, the highest probability of occurrence of the human voice is 1. The probability value may be determined from the prominence of the human voice in the audio: the clearer and stronger the human voice, the larger the probability value. Whether the probability reaches the preset threshold is then judged. Of course, the judgment may also be made according to the duration of the human voice in the video to be processed, by comparing that duration with a corresponding threshold; the specific judgment process is similar and is not repeated here.
Step 203: and taking the audio time period when the probability of the occurrence of the voice reaches a preset threshold value as the audio time period when the voice occurs.
In this step, it is judged whether the probability of the occurrence of the human voice reaches the preset threshold. If it does, the corresponding part of the video to be processed is considered to contain clear, high-quality human voice; that is, the video segment whose voice probability reaches the preset threshold is regarded as a high-quality segment, and the audio time period in which the probability reaches the threshold (i.e., the audio time period in which the human voice appears) is determined. The video image frames corresponding to that audio time period are then marked as high-quality frames and given an additional score.
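Steps 202-203 amount to thresholding per-frame voice probabilities into time periods. A minimal sketch (the function name, frame-based representation, and the 0.5 threshold are assumptions for illustration; the patent only speaks of a "preset threshold"):

```python
def voiced_periods(frame_probs, frame_dur, threshold=0.5):
    """Turn per-frame voice probabilities into (start, end) audio time
    periods, in seconds, in which the human voice is taken to occur."""
    periods, start = [], None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i * frame_dur                    # voiced period begins
        elif p < threshold and start is not None:
            periods.append((start, i * frame_dur))   # voiced period ends
            start = None
    if start is not None:                            # voice runs to the end
        periods.append((start, len(frame_probs) * frame_dur))
    return periods
```

Frames whose probability never reaches the threshold fall into no period, matching step 207 (discarding sub-threshold time periods).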
Step 204: cutting the video to be processed based on the audio time period in which the voice appears and the content characteristics of the video to be processed to obtain a cut video segment;
in this step, the video to be processed is input into a video depth analysis model, and the content features of the video are analyzed for content picture and content scene to obtain the video image frames meeting preset content conditions, for example frames with a clear picture and a rich content scene. The video to be processed is then cut based on both the video image frames corresponding to the audio time periods in which the human voice appears and the obtained frames meeting the preset content conditions, to obtain a cut video segment.
That is to say, if the probability of occurrence of the human voice in the detection result reaches the preset threshold, the corresponding part can be considered to contain relatively clear, high-quality human voice, and when cutting the video to be processed, the image frames corresponding to that part are given an additional score.
In this embodiment, the video to be processed is input into a video depth analysis model, and its content is analyzed for picture and content scene to obtain video image frames with a clear picture and a rich content scene. The analysis result of the video depth analysis model thus also serves as a criterion for judging whether a video image frame is high-quality and, together with the audio time periods in which the voice probability reaches the preset threshold, forms the basis for cutting out a highlight, high-quality video.
Step 205: according to the human voice detection result, overlapping the original audio data and the dubbing music audio data in the cut video segment;
the specific superimposing process is described in detail in the corresponding description above and is not repeated here.
Step 206: setting the volume of the audio data in the audio time periods in which the human voice appears in the superimposed audio data to the original volume and reducing the soundtrack volume in those time periods; and setting the volume of the audio data in the audio time periods in which no human voice appears to zero and setting the soundtrack volume in those time periods to its original value.
In this step, after the more highlight-worthy video segment in the video to be processed has been cut out, high-quality human voice is detected within it. When the duration of the human voice reaches a certain threshold, the original human voice in the voiced audio time periods is retained and the soundtrack volume in those periods is attenuated; for the remaining unvoiced audio time periods in the cut video segment, the volume of the original video sound is set to 0 and the soundtrack volume is set to its default normal value.
Step 207: and discarding the audio time period when the probability of the occurrence of the human voice does not reach a preset threshold value.
In this step, the audio time period for which the probability of occurrence of human voice does not reach the preset threshold may be discarded or ignored.
In the audio processing method shown in the present exemplary embodiment, human voice detection is performed on the audio data in the video to be processed to obtain a human voice detection result that includes the probability of occurrence of the human voice. If the probability reaches the preset threshold, the audio time period in which it does so is determined; the video to be processed is cut based on the video image frames corresponding to that audio time period and the obtained video image frames meeting the preset content conditions, yielding a cut video segment. The original audio data and the soundtrack audio data in the cut video segment are then superimposed according to the human voice detection result: the original human voice in the voiced audio time periods is retained and the soundtrack volume there is reduced, while redundant sounds in the remaining time periods of the video segment are eliminated and the soundtrack is kept at its normal volume, producing a volume-adjusted video. That is to say, in the present disclosure, the human voice in the video to be processed is taken into the criterion for judging whether a video image is high-quality, and after the volume of the processed audio data is adjusted according to the human voice detection result, the high-quality human voice in the video is retained while the soundtrack volume in the corresponding time periods is reduced.
That is to say, the present disclosure can more accurately identify the high-quality video clips containing human voice in the imported video, thereby improving the intelligent editing effect. While redundant information and noise in the audio are removed, the volume of the original sound in the video is retained to the greatest extent, improving the audio processing effect and making the processed video richer and more expressive.
Optionally, in another embodiment, on the basis of the above embodiment, the human voice detection result may further include: at least one of a probability of noise occurrence and a probability of music occurrence, the method further comprising:
judging whether the probability of the occurrence of the human voice is greater than at least one of the probability of the occurrence of the noise and the probability of the occurrence of the music; and if so, executing the step of judging whether the probability of the occurrence of the voice reaches a preset threshold value.
That is to say, in this embodiment, when the probability of the occurrence of the human voice in the detection result is greater than both the probability of the occurrence of music and the probability of the occurrence of noise, and the voice probability also exceeds the threshold, the video can be considered to contain relatively clear, high-quality human voice. When clipping the video, the image frames in the corresponding time period of the video are retained and given an additional score.
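The combined judgment can be written as a one-line predicate (illustrative only; the function name and the 0.5 threshold are assumptions, and the "greater than both noise and music" reading follows the paragraph above):

```python
def is_clear_voice(p_voice, p_noise, p_music, threshold=0.5):
    """A segment counts as clear, high-quality human voice only when
    the voice probability exceeds both the noise and music
    probabilities AND reaches the preset threshold."""
    return p_voice > max(p_noise, p_music) and p_voice >= threshold
```

A segment where music dominates, or where the voice probability is high relative to noise and music but still below the threshold, is rejected.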
The method can more accurately identify the high-quality video clips containing human voice in the video to be processed and improve the intelligent editing effect of the video. While redundant information and noise in the audio are removed, the volume of the original human voice in the video is retained to the greatest extent, so that the audio-adjusted video is richer and more expressive.
It is noted that, for simplicity of explanation, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present disclosure is not limited by the order of acts described, as some steps may, in accordance with the present disclosure, occur in other orders and/or concurrently. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required in order to implement the disclosure.
Fig. 3 is a block diagram illustrating an audio processing device according to an example embodiment. Referring to fig. 3, the apparatus includes a detection module 301, a first acquisition module 302, and a superposition processing module 303, wherein,
the detection module 301 is configured to perform voice detection on original audio data in a video to be processed to obtain a voice detection result;
the first obtaining module 302 is configured to perform obtaining of the soundtrack audio data of the video to be processed;
the superimposition processing module 303 is configured to perform superimposition processing on the original audio data of the video to be processed and the soundtrack audio data according to the human voice detection result.
Optionally, in another embodiment, on the basis of the foregoing embodiment, the superimposition processing module 303 includes: a gain processing module 401 and a superimposing module 402, the block diagram of which is shown in fig. 4, wherein,
the gain processing module 401 is configured to perform gain processing on the original audio data and the soundtrack audio data according to the human voice detection result;
the superimposing module 402 is configured to perform superimposing the gain-processed original audio data and the soundtrack audio data.
Optionally, in another embodiment, on the basis of the above embodiment, the human voice detection result obtained by the detection module includes: audio time periods in which human voice occurs; the gain processing module comprises: a first computing module and a second computing module, wherein,
the first calculation module is configured to multiply the original audio data corresponding to an audio time period in which the human voice appears by a first gain coefficient, and multiply the dubbing music audio data corresponding to the audio time period by a second gain coefficient;
the second calculation module is configured to multiply the original audio data corresponding to the audio time period in which the human voice does not appear by a third gain coefficient, and multiply the dubbing audio data corresponding to the audio time period by a fourth gain coefficient.
Optionally, in another embodiment, on the basis of the foregoing embodiment, the first gain coefficient multiplied by the first calculating module is 1, and the second gain coefficient is smaller than 1;
the third gain coefficient multiplied by the second calculation module is 0, and the fourth gain coefficient is 1.
Optionally, in another embodiment, on the basis of the above embodiment, the apparatus may further include: a gradation processing module, wherein,
the gradual change processing module is configured to execute connection points of audio time periods in which the human voice appears and audio time periods in which the human voice does not appear, and perform gradual change processing on gain coefficients on two sides of the connection points within preset time periods on the two sides of the connection points.
Optionally, in another embodiment, on the basis of the above embodiment, the human voice detection result obtained by the detection module further includes: a probability of occurrence of the human voice; the device further comprises a first judging module 501 and a determining module 502, the schematic structural diagram of which is shown in fig. 5, wherein,
the first judging module 501 is configured to execute a judgment to determine whether the probability of the occurrence of the voice reaches a preset threshold;
the determining module 502 is configured to execute, when the first determining module 501 determines that the probability of the occurrence of the human voice reaches a preset threshold, taking an audio time period in which the probability of the occurrence of the human voice reaches the preset threshold as the audio time period in which the human voice occurs.
Optionally, in another embodiment, on the basis of the above embodiment, the human voice detection result obtained by the detection module further includes: at least one of a probability of occurrence of noise and a probability of occurrence of music, the apparatus may further include: a second determination module 601, which is schematically shown in fig. 6, wherein,
the second judging module 601 is configured to execute a judgment whether the probability of the occurrence of the human voice is greater than at least one of the probability of the occurrence of the noise and the probability of the occurrence of the music;
the first determining module 501 is further configured to determine whether the probability of the occurrence of the human voice reaches a preset threshold when the second determining module 601 determines that the probability of the occurrence of the human voice is greater than at least one of the probability of the occurrence of the noise and the probability of the occurrence of the music.
Optionally, in another embodiment, on the basis of the above embodiment, the detection module 301 includes: a second obtaining module 701, a dividing module 702 and an audio detecting module 703, which are schematically shown in fig. 7, wherein,
the second obtaining module 701 is configured to perform obtaining of original audio data of a video to be processed;
the dividing module 702 is configured to divide the original audio data acquired by the second obtaining module 701 into a plurality of audio data segments according to a set time;
the audio detection module 703 is configured to perform human voice detection on each of the plurality of audio data segments through a sound detection model, so as to obtain a human voice detection result.
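The work of the dividing module 702 can be sketched as splitting the sample stream into fixed-length chunks before each one is fed to the sound detection model (the 1-second segment length and list-of-samples representation are assumptions; the patent only says "according to a set time"):

```python
def split_audio(samples, sample_rate, segment_seconds=1.0):
    """Divide the original audio data into fixed-length segments,
    the last segment possibly shorter, for per-segment human voice
    detection by the sound detection model."""
    seg_len = int(sample_rate * segment_seconds)
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
```

Each returned segment would then yield one human voice detection result, which together form the per-time-period result used in the superimposing steps.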
Optionally, in another embodiment, on the basis of the above embodiment, the apparatus further includes: a cutting module, wherein,
the cutting module is configured to, after the detection module performs human voice detection on the original audio data in the video to be processed to obtain the human voice detection result, cut the video to be processed based on the audio time periods in which the human voice appears and the content features of the video to be processed, to obtain a cut video segment;
the superposition processing module is also configured to perform superposition processing on the original audio data in the video segment obtained after the cutting module cuts and the corresponding dubbing music audio data according to the voice detection result.
Optionally, in another embodiment, on the basis of the above embodiment, the cutting module includes: a video content analysis module and a video content cropping module, wherein,
the video content analysis module is configured to perform content picture and content scene analysis on the content characteristics of the video to be processed through a video depth analysis model to obtain video image frames meeting preset content conditions;
the video content cutting module is configured to cut the video to be processed based on the video image frames corresponding to the audio time periods in which the human voice appears and the obtained video image frames meeting the preset content conditions, to obtain a cut video segment.
In an exemplary embodiment, the present disclosure also provides an electronic device including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio processing method as described above.
In an exemplary embodiment, the present disclosure also provides a storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the audio processing method as described above.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment related to the method, and reference may be made to part of the description of the embodiment of the method for the relevant points, and the detailed description will not be made here.
In an exemplary embodiment, a storage medium comprising instructions, such as a memory comprising instructions, executable by a processor of an electronic device to perform the above method is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 8 is a block diagram illustrating an electronic device 800 in accordance with an example embodiment. For example, the electronic device 800 may be a mobile terminal or a server, and in the embodiment of the present disclosure, the electronic device is taken as a mobile terminal as an example for description. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; it may also detect a change in the position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in its temperature. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the audio processing methods illustrated above.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the audio processing method illustrated above, is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product, the instructions of which, when executed by the processor 820 of the electronic device 800, cause the electronic device 800 to perform the audio processing method illustrated above.
Fig. 9 is a block diagram illustrating an apparatus 900 for audio processing according to an example embodiment. For example, the apparatus 900 may be provided as a server. Referring to fig. 9, the apparatus 900 includes a processing component 922, which further includes one or more processors, and memory resources, represented by memory 932, for storing instructions, such as applications, that are executable by the processing component 922. The application programs stored in memory 932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 922 is configured to execute instructions to perform the above-described methods.
The device 900 may also include a power component 926 configured to perform power management of the device 900, a wired or wireless network interface 950 configured to connect the device 900 to a network, and an input/output (I/O) interface 958. The apparatus 900 may operate based on an operating system stored in the memory 932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. An audio processing method, comprising:
performing human voice detection on original audio data of a video to be processed to obtain a human voice detection result;
acquiring soundtrack audio data of the video to be processed;
and superimposing the original audio data of the video to be processed and the soundtrack audio data according to the human voice detection result.
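The "human voice detection" step of claim 1 can be sketched with a toy energy-based voice-activity detector. This is only an illustration of the shape of the detection result consumed by the later claims; the patent does not specify the detector, and a real implementation would use a trained voice-activity or audio-classification model. The function name, frame length, and threshold below are hypothetical.

```python
import numpy as np

def naive_voice_detection(audio, frame_len=400, threshold=0.01):
    """Toy stand-in for human voice detection: flag each fixed-length
    frame whose RMS energy exceeds a threshold. Returns one boolean
    per frame (True = human voice assumed present)."""
    n_frames = len(audio) // frame_len
    # Drop the trailing partial frame and view the signal frame-by-frame.
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold
```

The boolean-per-frame output corresponds to the "audio time periods in which human voice occurs" of claim 3; a model-based detector would additionally return the per-class probabilities used in claims 6 and 7.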
2. The audio processing method according to claim 1, wherein the superimposing the original audio data of the video to be processed and the soundtrack audio data according to the human voice detection result comprises:
respectively performing gain processing on the original audio data and the soundtrack audio data according to the human voice detection result;
and superimposing the gain-processed original audio data and soundtrack audio data.
3. The audio processing method according to claim 2, wherein the human voice detection result comprises audio time periods in which human voice occurs, and the respectively performing gain processing on the original audio data and the soundtrack audio data according to the human voice detection result comprises:
multiplying the original audio data corresponding to an audio time period in which human voice occurs by a first gain coefficient, and multiplying the soundtrack audio data corresponding to that time period by a second gain coefficient;
and multiplying the original audio data corresponding to an audio time period in which no human voice occurs by a third gain coefficient, and multiplying the soundtrack audio data corresponding to that time period by a fourth gain coefficient.
4. The audio processing method according to claim 3, wherein the first gain coefficient is 1 and the second gain coefficient is less than 1; the third gain coefficient is 0 and the fourth gain coefficient is 1.
5. The audio processing method of claim 3, further comprising:
for a junction point between an audio time period in which human voice occurs and an audio time period in which no human voice occurs, gradually varying the corresponding gain coefficients within preset time periods on both sides of the junction point.
6. The audio processing method according to claim 3, wherein the human voice detection result further comprises a probability of human voice occurring, the method further comprising:
determining whether the probability of human voice occurring reaches a preset threshold;
and if so, taking the audio time period in which the probability of human voice occurring reaches the preset threshold as an audio time period in which human voice occurs.
7. The audio processing method according to claim 6, wherein the human voice detection result further comprises at least one of a probability of noise occurring and a probability of music occurring, the method further comprising:
determining whether the probability of human voice occurring is greater than at least one of the probability of noise occurring and the probability of music occurring;
and if so, performing the step of determining whether the probability of human voice occurring reaches the preset threshold.
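The gain scheduling of claims 3 through 5 — ducking the soundtrack under voiced periods, muting the original audio elsewhere, and smoothing the coefficients at each junction — can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, default coefficients, and the moving-average crossfade are all assumptions (claim 5 only requires the gains to change gradually, not any particular smoothing).

```python
import numpy as np

def mix_with_voice_gains(original, soundtrack, voice_frames, frame_len,
                         g_voice=(1.0, 0.3), g_novoice=(0.0, 1.0),
                         fade_len=200):
    """Superimpose original and soundtrack audio per the human voice
    detection result. voice_frames holds one boolean per frame of
    frame_len samples; g_voice = (first, second) and g_novoice =
    (third, fourth) gain coefficients of claims 3-4."""
    n = min(len(original), len(soundtrack))
    g_orig = np.empty(n)
    g_bgm = np.empty(n)
    # Expand the per-frame detection flags into per-sample gain curves.
    for i in range(0, n, frame_len):
        go, gb = g_voice if voice_frames[i // frame_len] else g_novoice
        g_orig[i:i + frame_len] = go
        g_bgm[i:i + frame_len] = gb
    # Smooth both gain curves so coefficients vary gradually around each
    # voiced/unvoiced junction point (claim 5) instead of jumping.
    kernel = np.ones(fade_len) / fade_len
    g_orig = np.convolve(g_orig, kernel, mode="same")
    g_bgm = np.convolve(g_bgm, kernel, mode="same")
    return original[:n] * g_orig + soundtrack[:n] * g_bgm
```

With the claim 4 defaults, voiced stretches keep the original audio at full level over a ducked soundtrack (second coefficient below 1), while unvoiced stretches play the soundtrack alone (third coefficient 0, fourth coefficient 1).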
8. An audio processing apparatus, comprising:
a detection module configured to perform human voice detection on original audio data of a video to be processed to obtain a human voice detection result;
a first acquisition module configured to acquire soundtrack audio data of the video to be processed;
and a superimposition processing module configured to superimpose the original audio data of the video to be processed and the soundtrack audio data according to the human voice detection result.
9. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the audio processing method of any of claims 1 to 7.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio processing method of any of claims 1 to 7.
CN202011497991.0A 2020-12-17 2020-12-17 Audio processing method and device, electronic equipment and storage medium Active CN112637632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011497991.0A CN112637632B (en) 2020-12-17 2020-12-17 Audio processing method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112637632A true CN112637632A (en) 2021-04-09
CN112637632B CN112637632B (en) 2023-04-07

Family

ID=75316495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011497991.0A Active CN112637632B (en) 2020-12-17 2020-12-17 Audio processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112637632B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113573136A (en) * 2021-09-23 2021-10-29 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN117119266A (en) * 2023-02-16 2023-11-24 荣耀终端有限公司 Video score processing method, electronic device, and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170090863A1 (en) * 2015-09-25 2017-03-30 Xiaomi Inc. Method, apparatus, and storage medium for controlling audio playing
CN110211556A (en) * 2019-05-10 2019-09-06 北京字节跳动网络技术有限公司 Processing method, device, terminal and the storage medium of music file
CN110351579A (en) * 2019-08-16 2019-10-18 深圳特蓝图科技有限公司 A kind of intelligent editing algorithm of video
CN110706679A (en) * 2019-09-30 2020-01-17 维沃移动通信有限公司 Audio processing method and electronic equipment
CN111556254A (en) * 2020-04-10 2020-08-18 早安科技(广州)有限公司 Method, system, medium and intelligent device for video cutting by using video content
CN111724757A (en) * 2020-06-29 2020-09-29 腾讯音乐娱乐科技(深圳)有限公司 Audio data processing method and related product



Also Published As

Publication number Publication date
CN112637632B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN106911961B (en) Multimedia data playing method and device
EP3147907A1 (en) Control method and apparatus for playing audio
CN112637632B (en) Audio processing method and device, electronic equipment and storage medium
CN110890083B (en) Audio data processing method and device, electronic equipment and storage medium
CN110677734B (en) Video synthesis method and device, electronic equipment and storage medium
CN107871494B (en) Voice synthesis method and device and electronic equipment
CN105451056B (en) Audio and video synchronization method and device
US20170034336A1 (en) Event prompting method and device
CN109410973B (en) Sound changing processing method, device and computer readable storage medium
CN108845787B (en) Audio adjusting method, device, terminal and storage medium
CN110619873A (en) Audio processing method, device and storage medium
CN110931028B (en) Voice processing method and device and electronic equipment
CN111696553A (en) Voice processing method and device and readable medium
CN109256145B (en) Terminal-based audio processing method and device, terminal and readable storage medium
CN108600503B (en) Voice call control method and device
CN107247794B (en) Topic guiding method in live broadcast, live broadcast device and terminal equipment
CN116758896A (en) Conference audio language adjustment method, device, electronic equipment and storage medium
CN113079493A (en) Information matching display method and device and electronic equipment
CN110728180B (en) Image processing method, device and storage medium
CN112509596A (en) Wake-up control method and device, storage medium and terminal
CN111988704B (en) Sound signal processing method, device and storage medium
CN112201267A (en) Audio processing method and device, electronic equipment and storage medium
US11682412B2 (en) Information processing method, electronic equipment, and storage medium
CN112541402A (en) Data processing method and device and electronic equipment
CN108491180B (en) Audio playing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant