CN117711434A - Audio processing method and device, electronic equipment and computer readable storage medium - Google Patents

Audio processing method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN117711434A
Authority
CN
China
Prior art keywords
audio
frame
noise reduction
speech
reduction effect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311757605.0A
Other languages
Chinese (zh)
Inventor
武倩平
陈靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shuhang Technology Beijing Co ltd
Original Assignee
Shuhang Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shuhang Technology Beijing Co ltd
Priority to CN202311757605.0A
Publication of CN117711434A
Legal status: Pending

Landscapes

  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The application discloses an audio processing method and device, electronic equipment and a computer readable storage medium. The method comprises the following steps: acquiring original audio, noisy audio and audio to be detected, wherein the original audio comprises voice, the noisy audio is obtained by adding noise to the original audio, and the audio to be detected is obtained by using a target audio processing algorithm to reduce noise in the noisy audio; determining, according to the original audio and the audio to be detected, a first noise reduction effect of the target audio processing algorithm on the voice in the noisy audio; determining, according to the noisy audio and the audio to be detected, a second noise reduction effect of the target audio processing algorithm on non-voice in the noisy audio; and determining a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect. In this way, the noise reduction effect of the target audio processing algorithm can be evaluated from different angles, namely its effect on voice and its effect on non-voice.

Description

Audio processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a computer readable storage medium.
Background
In practical applications, audio usually has to be processed by an audio processing algorithm to meet requirements, and such processing includes removing noise from the audio. How to evaluate the noise reduction effect of an audio processing algorithm is therefore of great importance.
Disclosure of Invention
The application provides an audio processing method and device, electronic equipment and a computer readable storage medium, so as to evaluate the noise reduction effect of an audio processing algorithm.
In a first aspect, there is provided an audio processing method, the method comprising:
acquiring original audio, noisy audio and audio to be tested, wherein the original audio comprises voice, the noisy audio is obtained by adding noise to the original audio, and the audio to be tested is obtained by utilizing a target audio processing algorithm to reduce noise of the noisy audio;
determining a first noise reduction effect of the target audio processing algorithm on the voice in the noisy audio according to the original audio and the audio to be detected;
determining a second noise reduction effect of the target audio processing algorithm on non-voice in the noisy audio according to the noisy audio and the audio to be detected;
and determining a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect.
In combination with any one of the embodiments of the present application, the determining, according to the original audio and the audio to be detected, a first noise reduction effect of the target audio processing algorithm on the voice in the noisy audio includes:
dividing the original audio into n frames of first audio frames, wherein n is an integer greater than 1;
dividing the audio to be detected into n frames of second audio frames;
determining a voice frame from the n-frame first audio frames, wherein the voice frame is an audio frame comprising voice;
determining a third audio frame corresponding to the voice frame from the n frames of second audio frames;
determining that a first noise reduction effect comprises first residual noise after the voice in the noisy audio is processed by the target audio processing algorithm under the condition that the energy of the third audio frame is larger than the energy of the voice frame;
and under the condition that the energy of the third audio frame is smaller than the energy of the voice frame, determining that the first noise reduction effect comprises that the voice in the noisy audio is distorted after being processed by the target audio processing algorithm.
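The framing and energy comparison described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the equal-length framing, the sum-of-squares energy definition, and the "unchanged" label for equal energies are all assumptions that the source does not specify.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], n: int) -> List[List[float]]:
    """Split samples into n equal-length frames (trailing samples are dropped)."""
    if n <= 1:
        raise ValueError("n must be an integer greater than 1")
    frame_len = len(samples) // n
    if frame_len == 0:
        raise ValueError("not enough samples for n frames")
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n)]


def frame_energy(frame: Sequence[float]) -> float:
    """Energy taken as the sum of squared samples (an assumed definition)."""
    return sum(x * x for x in frame)


def classify_speech_frame(voice_frame: Sequence[float],
                          third_frame: Sequence[float]) -> str:
    """Compare a voice frame of the original audio with the matching
    frame of the audio to be detected."""
    e_voice = frame_energy(voice_frame)
    e_third = frame_energy(third_frame)
    if e_third > e_voice:
        return "residual_noise"  # more energy than the clean voice: noise remains
    if e_third < e_voice:
        return "distortion"      # less energy than the clean voice: voice was removed
    return "unchanged"           # equal energies: a case the source does not cover
```

In this sketch the frame at index i of the audio to be detected is treated as the "third audio frame" corresponding to the voice frame at index i of the original audio, which assumes the two signals are time-aligned and of equal length.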
In combination with any one of the embodiments of the present application, in a case where the energy of the third audio frame is greater than the energy of the speech frame, the first noise reduction effect further includes a first energy of the first residual noise, and the method further includes:
and obtaining the first energy according to a first difference between the energy of the third audio frame and the energy of the voice frame, wherein the first difference is positively correlated with the first energy.
In combination with any one of the embodiments of the present application, the first noise reduction effect further includes a first suppression effect, where the first suppression effect is an effect of the target audio processing algorithm to suppress noise of speech in the noisy audio, and when the first energy is obtained, the method further includes:
dividing the noisy audio into n frames of fourth audio frames;
determining a fifth audio frame corresponding to the voice frame from the n-frame fourth audio frames;
and determining the first inhibition effect according to a first difference value between the energy of the fifth audio frame and the first energy, wherein the smaller the first difference value is, the worse the first inhibition effect is.
In combination with any one of the embodiments of the present application, in a case where the number of the speech frames is greater than 1, the first noise reduction effect further includes a first stability, where the first stability is stability of the target audio processing algorithm for suppressing noise of speech in the noisy audio;
after determining a fifth audio frame corresponding to the speech frame from the n-frame fourth audio frames, the method further comprises:
the first stability is determined based on a first variance of the first difference.
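One way to read the first energy, the first suppression effect and the first stability numerically is sketched below. The source only states that the first difference is positively correlated with the first energy, so taking the first energy equal to that difference is an assumption. The source also does not make clear which quantity the first variance is computed over; this sketch assumes it is the per-frame value used for the suppression effect. All function names are hypothetical.

```python
from statistics import pvariance
from typing import Sequence


def first_energy(e_third: float, e_voice: float) -> float:
    """Residual-noise energy, assumed equal to the difference between the
    energy of the third audio frame and the energy of the voice frame."""
    if e_third <= e_voice:
        raise ValueError("defined only when the third frame has more energy")
    return e_third - e_voice


def first_suppression_value(e_fifth: float, residual_energy: float) -> float:
    """Energy of the noisy (fifth) frame minus the residual-noise energy.
    A smaller value is read as a worse suppression effect."""
    return e_fifth - residual_energy


def first_stability(values: Sequence[float]) -> float:
    """Population variance over the voice frames; a lower variance is
    read here as more stable suppression (an assumed interpretation)."""
    if len(values) < 2:
        raise ValueError("stability needs more than one voice frame")
    return pvariance(values)
```

For two voice frames with original energies 1.0 and 2.0, detected energies 1.5 and 2.5, and noisy energies 3.0 and 4.0, the residual energies are both 0.5, the suppression values are 2.5 and 3.5, and their variance is 0.25.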
In combination with any one of the embodiments of the present application, in a case where the energy of the third audio frame is smaller than the energy of the speech frame, the first noise reduction effect further includes a speech distortion degree, where the speech distortion degree is a degree of distortion of speech in the noisy audio after being processed by the target audio processing algorithm, and the method further includes:
and in the case that the energy of the third audio frame is smaller than the energy of the voice frame, determining the voice distortion degree according to a second difference between the energy of the voice frame and the energy of the third audio frame, wherein the second difference is positively correlated with the voice distortion degree.
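As a small illustration of the distortion measure, the sketch below takes the distortion degree equal to the second difference. The source only requires a positive correlation, so this identity mapping, the function name, and the zero returned when no energy was lost are assumptions.

```python
def speech_distortion_degree(e_voice: float, e_third: float) -> float:
    """Distortion degree for one voice frame, assumed equal to the energy
    lost between the original voice frame and the processed frame."""
    if e_third >= e_voice:
        return 0.0  # no energy lost: this sketch reports no distortion
    return e_voice - e_third
```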
In combination with any one of the embodiments of the present application, the determining a speech frame from the n-frame first audio frame includes:
and determining the voice frame in the n-frame first audio frame by detecting the voice activity of the n-frame first audio frame.
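The source does not say which voice activity detection method is used. As a stand-in, the sketch below uses a simple mean-power threshold on each frame of the original audio; the threshold value and the function name are assumptions, and a production system would more likely use a dedicated detector.

```python
from typing import List, Sequence


def detect_voice_frames(frames: Sequence[Sequence[float]],
                        threshold: float = 1e-4) -> List[int]:
    """Return indices of frames whose mean power exceeds the threshold.
    Energy thresholding is only a placeholder for voice activity detection."""
    voice_indices = []
    for i, frame in enumerate(frames):
        power = sum(x * x for x in frame) / len(frame) if frame else 0.0
        if power > threshold:
            voice_indices.append(i)
    return voice_indices
```

Because the original audio is optionally noise-free, a threshold of this kind is more defensible on it than on the noisy audio, which is consistent with the method detecting voice frames in the first audio frames.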
In combination with any one of the embodiments of the present application, the second noise reduction effect includes a second energy of a second residual noise, where the second residual noise is a residual noise after a noise in a non-speech in the noisy audio is processed by the target audio processing algorithm;
the determining, according to the noisy audio and the audio to be detected, a second noise reduction effect of the target audio processing algorithm on non-speech in the noisy audio includes:
dividing the original audio into n frames of first audio frames, wherein n is an integer greater than 1;
dividing the audio to be detected into n frames of second audio frames;
determining a non-voice frame from the n-frame first audio frames, wherein the non-voice frame is an audio frame other than a voice frame;
determining a sixth audio frame corresponding to the non-speech frame from the n-frame second audio frames;
and obtaining the second energy according to the energy of the sixth audio frame.
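The second energy can be sketched as the energy of each frame of the audio to be detected that falls outside the voice frames. The function name and the sum-of-squares energy are assumptions, and the frames are assumed to be index-aligned with the frames of the original audio.

```python
from typing import Dict, Iterable, Sequence


def second_energies(second_frames: Sequence[Sequence[float]],
                    voice_indices: Iterable[int]) -> Dict[int, float]:
    """Map each non-voice frame index to the energy of the matching
    (sixth) frame in the audio to be detected."""
    voice = set(voice_indices)
    return {
        i: sum(x * x for x in frame)
        for i, frame in enumerate(second_frames)
        if i not in voice
    }
```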
In combination with any one of the embodiments of the present application, the second noise reduction effect further includes a second suppression effect, where the second suppression effect is an effect that the target audio processing algorithm suppresses non-speech noise in the noisy audio, and when the second energy is obtained, the method further includes:
dividing the noisy audio into n frames of fourth audio frames;
determining a seventh audio frame corresponding to the non-speech frame from the n-frame fourth audio frames;
and determining the second suppression effect according to a second difference value between the energy of the seventh audio frame and the second energy, wherein the smaller the second difference value is, the worse the second suppression effect is.
In combination with any one of the embodiments of the present application, in a case where the number of non-speech frames is greater than 1, the second noise reduction effect further includes a second stability, where the second stability is stability of the target audio processing algorithm to suppress non-speech noise in the noisy audio;
after determining a seventh audio frame corresponding to the non-speech frame from the n-frame fourth audio frames, the method further comprises:
and determining the second stability according to a second variance of the second difference.
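For non-voice frames, the suppression measure and its stability can be sketched directly on frames. As before, the names are hypothetical, the energy is assumed to be a sum of squares, and the variance is assumed to be taken over the per-frame suppression values, which the source leaves open.

```python
from statistics import pvariance
from typing import List, Sequence


def second_suppression_values(fourth_frames: Sequence[Sequence[float]],
                              second_frames: Sequence[Sequence[float]],
                              non_voice_indices: Sequence[int]) -> List[float]:
    """Per non-voice frame: energy of the noisy (seventh) frame minus the
    energy left in the audio to be detected. Smaller means worse suppression."""
    def energy(frame: Sequence[float]) -> float:
        return sum(x * x for x in frame)

    return [energy(fourth_frames[i]) - energy(second_frames[i])
            for i in non_voice_indices]


def second_stability(values: Sequence[float]) -> float:
    """Population variance of the suppression values (lower read as more stable)."""
    if len(values) < 2:
        raise ValueError("stability needs more than one non-voice frame")
    return pvariance(values)
```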
In combination with any one of the embodiments of the present application, the determining, according to the first noise reduction effect and the second noise reduction effect, the third noise reduction effect of the target audio processing algorithm includes:
and taking the first noise reduction effect and the second noise reduction effect as the third noise reduction effect.
In combination with any one of the embodiments of the present application, the determining, according to the first noise reduction effect and the second noise reduction effect, the third noise reduction effect of the target audio processing algorithm includes:
in the case where the first noise reduction effect comprises a first suppression effect on noise in the speech frame and the second noise reduction effect comprises a second suppression effect on noise in the non-speech frame, determining that the third noise reduction effect comprises a third stability that characterizes a stability of the target audio processing algorithm suppressing noise in audio based on a difference of the first suppression effect and the second suppression effect.
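The third stability compares how well noise is suppressed in voice frames with how well it is suppressed in non-voice frames. The sketch below assumes each suppression effect is summarized by the mean of its per-frame values and that the absolute gap between the two means is the stability measure; neither choice is stated in the source.

```python
from statistics import mean
from typing import Sequence


def third_stability(first_values: Sequence[float],
                    second_values: Sequence[float]) -> float:
    """Absolute gap between mean suppression on voice frames and on
    non-voice frames; a smaller gap is read as more consistent behaviour."""
    return abs(mean(first_values) - mean(second_values))
```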
In a second aspect, there is provided an audio processing apparatus comprising:
an acquisition unit, configured to acquire original audio, noisy audio and audio to be detected, wherein the original audio comprises voice, the noisy audio is obtained by adding noise to the original audio, and the audio to be detected is obtained by using a target audio processing algorithm to reduce noise in the noisy audio;
a determining unit, configured to determine a first noise reduction effect of the target audio processing algorithm on the voice in the noisy audio according to the original audio and the audio to be detected;
the determining unit is further configured to determine a second noise reduction effect of the target audio processing algorithm on non-voice in the noisy audio according to the noisy audio and the audio to be detected;
the determining unit is further configured to determine a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect.
In combination with any one of the embodiments of the present application, the determining unit is configured to:
dividing the original audio into n frames of first audio frames, wherein n is an integer greater than 1;
dividing the audio to be detected into n frames of second audio frames;
determining a voice frame from the n-frame first audio frames, wherein the voice frame is an audio frame comprising voice;
determining a third audio frame corresponding to the voice frame from the n frames of second audio frames;
determining that a first noise reduction effect comprises first residual noise after the voice in the noisy audio is processed by the target audio processing algorithm under the condition that the energy of the third audio frame is larger than the energy of the voice frame;
and under the condition that the energy of the third audio frame is smaller than the energy of the voice frame, determining that the first noise reduction effect comprises that the voice in the noisy audio is distorted after being processed by the target audio processing algorithm.
In combination with any one of the embodiments of the present application, the audio processing apparatus further includes:
and the processing unit is used for obtaining the first energy according to a first difference between the energy of the third audio frame and the energy of the voice frame, and the first difference is positively correlated with the first energy.
In combination with any one of the embodiments of the present application, the first noise reduction effect further includes a first suppression effect, where the first suppression effect is an effect that the target audio processing algorithm suppresses noise of speech in the noisy audio, and the audio processing apparatus further includes: a dividing unit for dividing the noisy audio into n frames of fourth audio frames;
the determining unit is further configured to determine a fifth audio frame corresponding to the voice frame from the n-frame fourth audio frames;
the determining unit is further configured to determine the first suppression effect according to a first difference between the energy of the fifth audio frame and the first energy, where the smaller the first difference is, the worse the first suppression effect is.
In combination with any one of the embodiments of the present application, in a case where the number of the speech frames is greater than 1, the first noise reduction effect further includes a first stability, where the first stability is stability of the target audio processing algorithm for suppressing noise of speech in the noisy audio;
the determining unit is further configured to: the first stability is determined based on a first variance of the first difference.
In combination with any one of the embodiments of the present application, when the energy of the third audio frame is smaller than the energy of the speech frame, the first noise reduction effect further includes a speech distortion degree, where the speech distortion degree is a degree of distortion of speech in the noisy audio after being processed by the target audio processing algorithm, and the determining unit is further configured to determine, when the energy of the third audio frame is smaller than the energy of the speech frame, the speech distortion degree according to a second difference between the energy of the speech frame and the energy of the third audio frame, where the second difference is positively correlated with the speech distortion degree.
In combination with any one of the embodiments of the present application, the determining unit is configured to determine the speech frame in the n-frame first audio frame by performing speech activity detection on the n-frame first audio frame.
In combination with any one of the embodiments of the present application, the second noise reduction effect includes a second energy of a second residual noise, where the second residual noise is a residual noise after a noise in a non-speech in the noisy audio is processed by the target audio processing algorithm;
the determining unit is used for:
dividing the original audio into n frames of first audio frames, wherein n is an integer greater than 1;
dividing the audio to be detected into n frames of second audio frames;
determining a non-voice frame from the n-frame first audio frames, wherein the non-voice frame is an audio frame other than a voice frame;
determining a sixth audio frame corresponding to the non-speech frame from the n-frame second audio frames;
and obtaining the second energy according to the energy of the sixth audio frame.
In combination with any one of the embodiments of the present application, the second noise reduction effect further includes a second suppression effect, where the second suppression effect is an effect that the target audio processing algorithm suppresses non-speech noise in the noisy audio, and the audio processing apparatus further includes: a dividing unit for dividing the noisy audio into n frames of fourth audio frames;
the determining unit is further configured to determine a seventh audio frame corresponding to the non-speech frame from the n-frame fourth audio frames;
the determining unit is further configured to determine the second suppression effect according to a second difference between the energy of the seventh audio frame and the second energy, where the smaller the second difference is, the worse the second suppression effect is.
In combination with any one of the embodiments of the present application, in a case where the number of non-speech frames is greater than 1, the second noise reduction effect further includes a second stability, where the second stability is stability of the target audio processing algorithm to suppress non-speech noise in the noisy audio;
the determining unit is further configured to determine the second stability according to a second variance of the second difference.
In combination with any one of the embodiments of the present application, the determining, according to the first noise reduction effect and the second noise reduction effect, the third noise reduction effect of the target audio processing algorithm includes:
and taking the first noise reduction effect and the second noise reduction effect as the third noise reduction effect.
In combination with any one of the embodiments of the present application, the determining unit is configured to: in the case where the first noise reduction effect comprises a first suppression effect on noise in the speech frame and the second noise reduction effect comprises a second suppression effect on noise in the non-speech frame, determining that the third noise reduction effect comprises a third stability that characterizes a stability of the target audio processing algorithm suppressing noise in audio based on a difference of the first suppression effect and the second suppression effect.
In a third aspect, an electronic device is provided, comprising: a processor and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform a method as described in the first aspect and any one of its possible implementations.
In a fourth aspect, there is provided another electronic device comprising: a processor, a transmitting means, an input means, an output means and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of the first aspect and any implementation thereof as described above.
In a fifth aspect, there is provided a computer readable storage medium having stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect and any implementation thereof as described above.
In a sixth aspect, there is provided a computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the method of the first aspect and any implementation thereof.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
In the application, the original audio comprises voice, the noisy audio is obtained by adding noise to the original audio, and the audio to be detected is obtained by using a target audio processing algorithm to reduce noise in the noisy audio. After acquiring the original audio, the noisy audio and the audio to be detected, the audio processing device determines a first noise reduction effect of the target audio processing algorithm on the voice in the noisy audio according to the original audio and the audio to be detected, and determines a second noise reduction effect of the target audio processing algorithm on the non-voice in the noisy audio according to the noisy audio and the audio to be detected. It then determines a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect, so that the noise reduction effect of the target audio processing algorithm can be evaluated from different angles, and the accuracy of the third noise reduction effect can be improved.
Drawings
In order to more clearly describe the technical solutions in the embodiments or the background of the present application, the following description will describe the drawings that are required to be used in the embodiments or the background of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the technical aspects of the application.
Fig. 1 is a schematic flow chart of an audio processing method according to an embodiment of the present application;
fig. 2 is a flow chart of another audio processing method according to an embodiment of the present application;
fig. 3 is a flow chart of yet another audio processing method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio processing device according to an embodiment of the present application;
fig. 5 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The embodiments of the present application are executed by an audio processing device, which may be any electronic equipment capable of executing the technical scheme disclosed in the method embodiments of the application. Optionally, the audio processing device may be one of the following: computer, server.
It should be understood that the method embodiments of the present application may also be implemented by way of a processor executing computer program code. Embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application. Referring to fig. 1, fig. 1 is a flow chart of an audio processing method according to an embodiment of the present application.
101. Acquire the original audio, the noisy audio and the audio to be tested.
In the embodiment of the application, the original audio comprises voice, the noisy audio is obtained by adding noise to the original audio, and the audio to be detected is obtained by using a target audio processing algorithm to reduce noise in the noisy audio. Optionally, the original audio is speech audio without noise. If the noisy audio is obtained by adding target noise to the original audio, then the noise in the noisy audio is the target noise, and when the target audio processing algorithm performs noise reduction on the noisy audio, it performs noise reduction on the target noise in the noisy audio. The target audio processing algorithm may be any audio processing algorithm capable of reducing noise in audio. For example, the target audio processing algorithm is a noise reduction algorithm; as another example, the target audio processing algorithm is obtained by combining a noise reduction algorithm with an audio encoding algorithm.
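The source says only that the noisy audio is obtained by adding noise to the original audio. One common way to do this, shown here purely as an assumed illustration, is to scale the noise so that the mixture has a chosen signal-to-noise ratio; the function name and the `snr_db` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def add_noise_at_snr(original: Sequence[float],
                     noise: Sequence[float],
                     snr_db: float) -> List[float]:
    """Mix noise into the original audio at a target SNR in decibels."""
    if len(original) != len(noise) or not original:
        raise ValueError("original and noise must be non-empty and equal in length")
    p_signal = sum(x * x for x in original) / len(original)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_signal == 0.0 or p_noise == 0.0:
        raise ValueError("signal and noise must both have non-zero power")
    # Choose the scale so that p_signal / (scale**2 * p_noise) == 10**(snr_db / 10).
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * v for s, v in zip(original, noise)]
```

The audio to be tested would then be whatever the target audio processing algorithm returns for this mixture, so that all three signals share the same length and time alignment.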
It should be understood that, in the embodiment of the present application, the steps of acquiring the original audio, acquiring the noisy audio, and acquiring the audio to be tested may be performed separately or simultaneously, which is not limited in this application.
102. Determine a first noise reduction effect of the target audio processing algorithm on the voice in the noisy audio according to the original audio and the audio to be detected.
In the embodiment of the present application, the first noise reduction effect characterizes the processing effect of the target audio processing algorithm on the voice in the noisy audio. In one possible implementation, the first noise reduction effect includes speech distortion: if performing noise reduction on the noisy audio causes the target audio processing algorithm to distort the speech in the noisy audio, the first noise reduction effect includes speech distortion.
In another possible implementation, the first noise reduction effect includes a removal effect of noise in the speech, for example, the target audio processing algorithm removes 70% of the noise in the speech in the noisy audio by performing a noise reduction process on the noisy audio, and then the first noise reduction effect includes removing 70% of the noise in the speech.
Because the voice in the noisy audio is the voice in the original audio, and the voice in the audio to be detected is obtained by processing the voice in the noisy audio with the target audio processing algorithm, how the voice in the noisy audio is changed by the target audio processing algorithm can be determined from the original audio and the audio to be detected, and thus the noise reduction effect of the algorithm on the voice in the noisy audio can be determined. Although the noisy audio also includes voice, the voice in the noisy audio carries added noise compared with the voice in the original audio. Determining the first noise reduction effect from the original audio and the audio to be detected, rather than from the noisy audio and the audio to be detected, therefore reduces interference from the added noise and can improve the accuracy of the first noise reduction effect.
In one possible implementation manner, the audio processing device obtains the first noise reduction effect by comparing the voice in the original audio with the voice in the audio to be tested. Because the noisy audio is obtained by adding noise to the original audio, and the audio to be tested is obtained by using the target audio processing algorithm to reduce noise in the noisy audio, the difference between the voice in the audio to be tested and the voice in the original audio characterizes the influence of the target audio processing algorithm on the voice of the original audio. For example, if the energy of the voice in the audio to be tested is larger than the energy of the voice in the original audio, the voice in the audio to be tested contains a signal other than the voice in the original audio. Since the noisy audio is obtained by adding noise to the original audio, and the audio to be tested is obtained by reducing noise in the noisy audio, that signal is noise, which indicates that noise remains in the voice after processing by the target audio processing algorithm.
In another possible implementation, the audio processing device determines an intelligibility difference of speech intelligibility of speech in the original audio and speech intelligibility of speech in the audio under test. And determining a first noise reduction effect according to the intelligibility difference. Because noise in the voice affects the voice intelligibility of the voice, the smaller the intelligibility difference is, the less the noise in the voice of the audio to be detected is, and the better the noise removing effect of the target audio processing algorithm is.
103. And determining a second noise reduction effect of the target audio processing algorithm on non-voice in the noisy audio according to the noisy audio and the audio to be detected.
In the embodiment of the present application, the non-speech is audio other than speech. The second noise reduction effect characterizes a processing effect of the target audio processing algorithm on non-speech in the noisy audio. In one possible implementation, the second noise reduction effect includes whether noise remains in the non-speech after processing by the target audio processing algorithm. In another possible implementation, for example, if the target audio processing algorithm removes 90% of the noise in the non-speech by denoising the noisy audio, then the second noise reduction effect includes removing 90% of the noise in the non-speech.
In one possible implementation manner, the audio processing device obtains the second noise reduction effect by comparing non-speech in the noisy audio with non-speech in the audio to be detected. Because the noisy audio is obtained by adding noise to the original audio, and the audio to be detected is obtained by reducing noise of the noisy audio by utilizing the target audio processing algorithm, the difference between the non-speech in the audio to be detected and the non-speech in the noisy audio can characterize the influence of the processing of the target audio processing algorithm on the non-speech in the noisy audio. For example, because non-speech contains no speech, any signal that remains in the non-speech of the audio to be detected is noise, which indicates that noise remains after the processing of the target audio processing algorithm; and the closer the energy of the non-speech in the audio to be detected is to the energy of the non-speech in the noisy audio, the less noise has been removed.
104. And determining a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect.
In this embodiment of the present application, the third noise reduction effect is the noise reduction effect of the target audio processing algorithm. The audio processing device determines the third noise reduction effect according to the first noise reduction effect and the second noise reduction effect, that is, according to both the noise reduction effect of the target audio processing algorithm on the voice and its noise reduction effect on the non-voice, so that the noise reduction effect of the target audio processing algorithm can be evaluated from different angles. Because the third noise reduction effect is determined in this way, it reflects both the noise reduction effect on the voice and the noise reduction effect on the non-voice, and therefore better reflects the noise reduction effect of the target audio processing algorithm on different parts of the audio.
In one possible implementation, the audio processing apparatus uses the first noise reduction effect and the second noise reduction effect as the third noise reduction effect. At this time, the third noise reduction effect includes a noise reduction effect of the target audio processing algorithm on the voice and a noise reduction effect of the target audio processing algorithm on the non-voice.
In another possible implementation, the audio processing device performs weighted summation on the first noise reduction effect and the second noise reduction effect to obtain the third noise reduction effect.
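The weighted summation can be sketched as follows. This is only a minimal illustration, assuming that each noise reduction effect has already been reduced to a single numeric score on a comparable scale; the function name `combine_effects` and the weights are hypothetical, because the application does not specify them.

```python
def combine_effects(first_effect: float, second_effect: float,
                    w_speech: float = 0.5, w_non_speech: float = 0.5) -> float:
    """Return a weighted sum of the speech and non-speech scores.

    Assumption: both effects are scalar scores on a comparable scale.
    The weights are illustrative and are not taken from the application.
    """
    return w_speech * first_effect + w_non_speech * second_effect


# With equal weights the result is the plain average of the two scores.
third_effect = combine_effects(0.8, 0.6)
```

Choosing a larger `w_speech` would make the combined score more sensitive to how the algorithm treats speech than to how it treats non-speech.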
In the embodiment of the application, the original audio comprises voice, the noisy audio is obtained by adding noise to the original audio, and the audio to be detected is obtained by using a target audio processing algorithm to reduce the noise of the noisy audio. After the audio processing device acquires the original audio, the noisy audio and the audio to be detected, it determines a first noise reduction effect of the target audio processing algorithm on the voice in the noisy audio according to the original audio and the audio to be detected, and determines a second noise reduction effect of the target audio processing algorithm on the non-voice in the noisy audio according to the noisy audio and the audio to be detected. It then determines a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect, so that the noise reduction effect of the target audio processing algorithm can be evaluated from different angles, and the accuracy of the third noise reduction effect can be improved.
In an alternative embodiment, the audio processing device performs the following steps in performing step 102:
201. Dividing the original audio into n frames of first audio frames.
In the embodiment of the present application, n is an integer greater than 1. The audio processing device may obtain n frames of the first audio frame by dividing the original audio into n segments.
In one possible implementation, the audio processing device obtains the length of the original audio and divides the original audio into n segments according to that length to obtain the n first audio frames. Optionally, any two of the n first audio frames do not overlap.
For example, the playing duration of the original audio is 50 seconds, i.e. the length of the original audio is 50 seconds, and then the audio processing device may divide the original audio into 25 segments of first audio frames with a length of 2 seconds. At this time, n is 25, and the length of each first audio frame is 2 seconds.
For another example, the playing duration of the original audio is 51 seconds, that is, the length of the original audio is 51 seconds, and the audio processing apparatus may divide the original audio into 25 first audio frames with a length of 2 seconds and 1 first audio frame with a length of 1 second. At this time, n is 26, i.e., there are 26 first audio frames in total.
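The two examples above can be reproduced with a short sketch. It assumes that the audio is available as a sequence of samples with a known sample rate; the function name `split_into_frames` and the toy sample rate are hypothetical and chosen only for illustration.

```python
def split_into_frames(samples, sample_rate, frame_seconds=2.0):
    """Split a sample sequence into consecutive non-overlapping frames.

    The last frame may be shorter when the duration is not a multiple
    of frame_seconds, as in the 51-second example.
    """
    frame_len = int(round(frame_seconds * sample_rate))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


SAMPLE_RATE = 8  # toy rate chosen only to keep the example small
frames_50s = split_into_frames([0.0] * (50 * SAMPLE_RATE), SAMPLE_RATE)
frames_51s = split_into_frames([0.0] * (51 * SAMPLE_RATE), SAMPLE_RATE)
print(len(frames_50s), len(frames_51s), len(frames_51s[-1]) / SAMPLE_RATE)
```

Applying the same function with the same frame length to the audio to be detected and to the noisy audio keeps the frame indices comparable across the three signals.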
202. Dividing the audio to be detected into n frames of second audio frames.
The implementation manner of the audio processing device for dividing the audio to be detected into n frames of second audio frames is the same as the implementation manner of dividing the original audio into n frames of first audio frames, and will not be repeated here.
In one possible implementation, the audio processing device aligns the audio to be detected to the original audio to obtain aligned audio, in which each phoneme is aligned with the same phoneme in the original audio. The audio processing device then divides the aligned audio into n frames to obtain the n second audio frames, so that audio frames having the same phonemes are aligned between the n first audio frames and the n second audio frames.
203. A speech frame is determined from the n-frame first audio frame.
In this embodiment, the speech frame is an audio frame including speech. In one possible implementation, the audio processing apparatus determines a speech frame in the n-frame first audio frame by performing speech activity detection on the n-frame first audio frame.
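The application does not specify how speech activity detection is carried out. As a stand-in only, the sketch below marks a frame as a speech frame when its energy exceeds a threshold; the energy definition (sum of squared samples), the function names and the threshold value are assumptions made for illustration, and a practical detector would usually be more robust.

```python
def frame_energy(frame):
    """Energy of a frame, taken here as the sum of squared samples."""
    return sum(x * x for x in frame)


def detect_speech_frames(first_audio_frames, threshold):
    """Return indices of frames whose energy exceeds the threshold.

    Assumption: a plain energy threshold stands in for speech activity
    detection; the threshold value is hypothetical.
    """
    return [i for i, frame in enumerate(first_audio_frames)
            if frame_energy(frame) > threshold]


frames = [[0.0, 0.0], [0.5, -0.5], [0.01, 0.0]]
speech_indices = detect_speech_frames(frames, threshold=0.1)
```

Because detection is run on the original audio, which contains no added noise, even a simple criterion of this kind is less likely to mistake noise for speech than it would be on the noisy audio.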
204. And determining a third audio frame corresponding to the voice frame from the n frames of second audio frames.
For example, if the second frame of the n first audio frames is a speech frame, then the third audio frame is the second frame of the n second audio frames.
205. In the case where the energy of the third audio frame is greater than the energy of the speech frame, determining that the first noise reduction effect includes that first residual noise is present after the voice in the noisy audio is processed by the target audio processing algorithm.
Since the speech in the noisy audio is the same as the speech in the original audio, an energy of the third audio frame that is greater than the energy of the speech frame means that the third audio frame contains noise in addition to the signal in the speech frame. That is, noise is present in the third audio frame obtained by the processing of the target audio processing algorithm, so the target audio processing algorithm has not removed the noise in the speech of the noisy audio. The audio processing apparatus may therefore determine that the first noise reduction effect includes that noise residue exists after the speech in the noisy audio is processed by the target audio processing algorithm; this noise residue is referred to as the first residual noise.
206. And under the condition that the energy of the third audio frame is smaller than the energy of the voice frame, determining that the first noise reduction effect comprises that the voice in the noisy audio is distorted after being processed by the target audio processing algorithm.
Since the speech in the noisy audio is the same as the speech in the original audio, an energy of the third audio frame that is less than the energy of the speech frame indicates that noise reduction by the target audio processing algorithm has reduced the speech in the noisy audio, that is, the processing by the target audio processing algorithm has distorted the speech in the noisy audio. The audio processing device may therefore determine that the first noise reduction effect includes distortion of speech in the noisy audio after processing by the target audio processing algorithm.
In this embodiment, the original audio is divided into n frames of first audio frames and the audio to be detected is divided into n frames of second audio frames. A speech frame is determined from the n frames of first audio frames, and a third audio frame corresponding to the speech frame is determined from the n frames of second audio frames. In the case that the energy of the third audio frame is greater than the energy of the speech frame, it is determined that the first noise reduction effect includes that first residual noise is present after the speech in the noisy audio is processed by the target audio processing algorithm, so that the first noise reduction effect can be determined by whether noise residue is present in the speech. In the case where the energy of the third audio frame is smaller than the energy of the speech frame, it is determined that the first noise reduction effect includes that the speech in the noisy audio is distorted after being processed by the target audio processing algorithm, so that the first noise reduction effect can be determined by whether the speech is distorted.
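Steps 204 to 206 can be sketched as a per-frame energy comparison. The sketch assumes that frames with the same index correspond to each other (that is, the audio to be detected has already been aligned to the original audio) and that frame energy is the sum of squared samples; the function names and labels are hypothetical. The case of equal energies is not covered by the description and is labeled separately here.

```python
def frame_energy(frame):
    """Energy of a frame, taken here as the sum of squared samples."""
    return sum(x * x for x in frame)


def classify_speech_frames(first_frames, second_frames, speech_indices):
    """Label each speech frame following steps 205 and 206.

    Assumption: frames at the same index correspond to each other.
    """
    labels = {}
    for i in speech_indices:
        e_speech = frame_energy(first_frames[i])   # speech frame (original audio)
        e_third = frame_energy(second_frames[i])   # third audio frame (audio to be detected)
        if e_third > e_speech:
            labels[i] = "residual_noise"   # step 205: first residual noise present
        elif e_third < e_speech:
            labels[i] = "distortion"       # step 206: speech was attenuated
        else:
            labels[i] = "unchanged"        # equality is not covered by the text
    return labels


original = [[0.5, -0.5], [0.0, 0.0], [0.4, 0.4]]
processed = [[0.6, -0.5], [0.0, 0.0], [0.2, 0.2]]
labels = classify_speech_frames(original, processed, speech_indices=[0, 2])
```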
As an optional implementation manner, in a case where the energy of the third audio frame is greater than the energy of the speech frame, the first noise reduction effect further includes a first energy of the first residual noise, where the first energy is energy of the residual noise in the speech after being processed by the target audio processing algorithm. In such an embodiment, the audio processing device further performs the steps of:
301. The first energy is obtained from a first difference between the energy of the third audio frame and the energy of the speech frame.
In this embodiment, the first difference is positively correlated with the first energy, that is, the more the energy of the third audio frame is greater than the energy of the speech frame, the greater the energy of the residual noise.
In one possible implementation, the audio processing device takes the first difference as the first energy. In another possible implementation, the audio processing device takes the sum of the first difference and the first preset value as the first energy. In a further possible implementation, the audio processing device takes as the first energy a product of the first difference and the second preset value.
In this embodiment, the audio processing apparatus obtains a first energy of the first residual noise according to a first difference between the energy of the third audio frame and the energy of the speech frame when the energy of the third audio frame is greater than the energy of the speech frame, whereby the first noise reduction effect can be determined by the first energy.
Alternatively, in the case where the number of speech frames is greater than 1, the audio processing apparatus may determine one first energy based on each frame of speech frame, respectively, to obtain at least one first energy. Determining an average of the at least one first energy to obtain a third energy. A first noise reduction effect of the target audio processing algorithm is determined by the third energy.
Alternatively, in the case where the number of speech frames is greater than 1, the audio processing apparatus may determine one first energy based on each frame of speech frame, respectively, to obtain at least one first energy. Determining a maximum of the at least one first energy results in a fourth energy. The first noise reduction effect of the target audio processing algorithm is determined by the fourth energy.
Alternatively, in the case where the number of speech frames is greater than 1, the audio processing apparatus may determine one first energy based on each frame of speech frame, respectively, to obtain at least one first energy. Determining a minimum of the at least one first energy results in a fifth energy. The first noise reduction effect of the target audio processing algorithm is determined by the fifth energy.
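Step 301 and the three aggregation options above can be illustrated together. In this sketch, `offset` plays the role of the first preset value and `scale` that of the second preset value; folding both into one formula is a simplification made here, and the function name and the energy values are hypothetical.

```python
from statistics import mean


def first_energy(e_third, e_speech, offset=0.0, scale=1.0):
    """First energy derived from the first difference.

    With the default offset and scale, the first difference itself is
    returned. The presets are illustrative, not values from the application.
    """
    first_difference = e_third - e_speech
    if first_difference <= 0:
        raise ValueError("first energy is defined only when e_third > e_speech")
    return scale * first_difference + offset


# (energy of third audio frame, energy of speech frame) for each speech frame
pairs = [(1.5, 1.0), (2.0, 1.0), (1.25, 1.0)]
energies = [first_energy(e3, es) for e3, es in pairs]

third_energy = mean(energies)   # average over the speech frames
fourth_energy = max(energies)   # frame with the most residual noise
fifth_energy = min(energies)    # frame with the least residual noise
```

The average summarizes typical behavior, while the maximum and minimum bound the residual noise across the speech frames; which summary is preferable depends on the evaluation goal.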
As an optional implementation manner, the first noise reduction effect further includes a first suppression effect, where the first suppression effect is an effect of the target audio processing algorithm to suppress noise of the voice in the noisy audio. In case the first energy is derived, the audio processing device further performs the steps of:
401. Dividing the above noisy audio into n frames of fourth audio frames.
The implementation manner of the audio processing device for dividing the noisy audio into the n frames of the fourth audio frame is the same as the implementation manner of dividing the original audio into the n frames of the first audio frame, and will not be described again.
In one possible implementation, the audio processing device aligns the noisy audio to the original audio to obtain aligned audio, in which each phoneme is aligned with the same phoneme in the original audio. The audio processing device then divides the aligned audio into n frames to obtain the n fourth audio frames, so that audio frames having the same phonemes are aligned between the n first audio frames and the n fourth audio frames.
402. And determining a fifth audio frame corresponding to the voice frame from the n-frame fourth audio frames.
For example, if the second frame of the n first audio frames is a speech frame, then the fifth audio frame is the second frame of the n fourth audio frames.
403. And determining the first suppression effect according to a second difference between the energy of the fifth audio frame and the first energy.
In the embodiment of the present application, the smaller the second difference, the worse the first suppression effect. In one possible implementation, the audio processing device takes the second difference as the first suppression effect. In another possible implementation, the audio processing apparatus takes the sum of the second difference and the third preset value as the first suppression effect. In a further possible implementation, the audio processing device takes the product of the second difference and the fourth preset value as the first suppression effect.
In this embodiment, when the first energy is obtained, the audio processing apparatus obtains the first suppression effect from the second difference between the energy of the fifth audio frame and the first energy, and can thereby determine the first noise reduction effect from the first suppression effect.
Alternatively, in the case where the number of speech frames is greater than 1, the audio processing apparatus may determine one first suppression effect based on each frame of speech frame, respectively, to obtain at least one first suppression effect. Determining an average of the at least one first suppression effect results in a third suppression effect. And determining the first noise reduction effect of the target audio processing algorithm through the third suppression effect.
Alternatively, in the case where the number of speech frames is greater than 1, the audio processing apparatus may determine one first suppression effect based on each frame of speech frame, respectively, to obtain at least one first suppression effect. Determining a maximum value of the at least one first suppression effect results in a fourth suppression effect. And determining the first noise reduction effect of the target audio processing algorithm through the fourth suppression effect.
Alternatively, in the case where the number of speech frames is greater than 1, the audio processing apparatus may determine one first suppression effect based on each frame of speech frame, respectively, to obtain at least one first suppression effect. Determining a minimum value of the at least one first suppression effect results in a fifth suppression effect. And determining the first noise reduction effect of the target audio processing algorithm through the fifth suppression effect.
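Step 403 and the aggregation options above can be sketched as follows. The sketch takes the second difference itself as the first suppression effect by default; `offset` and `scale` stand for the third and fourth preset values, and all names and numbers are hypothetical.

```python
from statistics import mean


def first_suppression_effect(e_fifth, first_energy, offset=0.0, scale=1.0):
    """First suppression effect derived from the second difference.

    e_fifth is the energy of the fifth audio frame (noisy audio) and
    first_energy is the energy of the first residual noise for the same
    speech frame. The presets are illustrative assumptions.
    """
    second_difference = e_fifth - first_energy
    return scale * second_difference + offset


# (energy of fifth audio frame, first energy) for each speech frame
per_frame = [(3.0, 0.5), (4.0, 1.0), (2.5, 0.25)]
effects = [first_suppression_effect(e5, e1) for e5, e1 in per_frame]

third_suppression = mean(effects)   # average over the speech frames
fourth_suppression = max(effects)   # best-suppressed speech frame
fifth_suppression = min(effects)    # worst-suppressed speech frame
```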
As an alternative embodiment, in the case that the number of speech frames is greater than 1, the first noise reduction effect further includes a first stability, where the first stability is the stability with which the target audio processing algorithm suppresses noise in the speech of the noisy audio. After determining a fifth audio frame corresponding to the speech frame from the n-frame fourth audio frames, the audio processing apparatus further determines the first stability by performing the following step: determining the first stability based on the first variance of the second difference.
In such an embodiment, the first variance is the variance of the second differences obtained for the different speech frames, and the first variance is inversely related to the first stability. The smaller the first variance, the smaller the difference in the noise suppression effect of the target audio processing algorithm across different speech frames, which indicates that the target audio processing algorithm suppresses noise in the speech of the noisy audio with good stability. Conversely, the larger the first variance, the larger the difference in the noise suppression effect of the target audio processing algorithm across different speech frames, which indicates that the target audio processing algorithm suppresses noise in the speech of the noisy audio with poor stability.
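A minimal sketch of the first variance is given below. The application does not state whether a population or a sample variance is intended; the population variance (`statistics.pvariance`) is used here as an assumption, and the function name and input values are hypothetical.

```python
from statistics import pvariance


def first_variance(second_differences):
    """Population variance of the per-frame second differences.

    A smaller value indicates more uniform suppression across speech
    frames, i.e. a higher first stability.
    """
    if len(second_differences) < 2:
        raise ValueError("at least two speech frames are required")
    return pvariance(second_differences)


steady = first_variance([2.0, 2.0, 2.0])   # identical suppression in every frame
uneven = first_variance([1.0, 2.0, 3.0])   # suppression varies between frames
```

The same computation applied to the fourth differences of the non-speech frames would give the second variance used for the second stability.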
As an alternative embodiment, in the case that the energy of the third audio frame is smaller than the energy of the speech frame, the first noise reduction effect further includes a speech distortion degree, where the speech distortion degree is a degree of distortion of speech in the noisy audio after being processed by the target audio processing algorithm. In such an embodiment, the audio processing device further performs the steps of: and determining the speech distortion degree according to a third difference between the energy of the speech frame and the energy of the third audio frame when the energy of the third audio frame is smaller than the energy of the speech frame.
The larger the third difference, the more the energy of the speech frame is greater than the energy of the third audio frame, and the more the noise reduction through the target audio processing algorithm results in a reduction of speech in the noisy audio, that is, the greater the degree to which the processing of the target audio processing algorithm results in speech distortion in the noisy audio. Thus, the third difference is positively correlated with the degree of speech distortion.
In this embodiment, the audio processing apparatus determines the speech distortion degree according to a third difference between the energy of the speech frame and the energy of the third audio frame in a case where the energy of the third audio frame is smaller than the energy of the speech frame, so that the first noise reduction effect can be determined by the speech distortion degree.
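The determination of the speech distortion degree can be sketched as follows. Taking the third difference directly as the speech distortion degree is an assumption made here (the description only requires a positive correlation), and the function name is hypothetical.

```python
def speech_distortion_degree(e_speech, e_third):
    """Return the third difference when speech energy was lost, else None.

    e_speech is the energy of the speech frame (original audio) and
    e_third is the energy of the third audio frame (audio to be detected).
    """
    if e_third >= e_speech:
        return None  # the energy criterion does not indicate distortion
    return e_speech - e_third  # third difference


degree = speech_distortion_degree(e_speech=1.0, e_third=0.25)
```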
Alternatively, in the case that the number of speech frames is greater than 1, the audio processing apparatus may determine a speech distortion degree based on each frame of speech frame, respectively, to obtain at least one speech distortion degree. Determining an average value of the at least one speech distortion degree to obtain a final speech distortion degree. And determining a first noise reduction effect of the target audio processing algorithm through the final voice distortion degree.
Alternatively, in the case that the number of speech frames is greater than 1, the audio processing apparatus may determine a speech distortion degree based on each frame of speech frame, respectively, to obtain at least one speech distortion degree. Determining a maximum value of the at least one speech distortion degree to obtain a final speech distortion degree. And determining a first noise reduction effect of the target audio processing algorithm through the final voice distortion degree.
Alternatively, in the case that the number of speech frames is greater than 1, the audio processing apparatus may determine a speech distortion degree based on each frame of speech frame, respectively, to obtain at least one speech distortion degree. Determining a minimum value of the at least one speech distortion degree results in a final speech distortion degree. And determining a first noise reduction effect of the target audio processing algorithm through the final voice distortion degree.
As an alternative embodiment, the second noise reduction effect includes a second energy of a second residual noise, where the second residual noise is a residual noise of a non-speech noise in the noisy audio after the processing of the target audio processing algorithm. In this embodiment, the audio processing device performs the following steps in performing step 103:
501. Dividing the original audio into n frames of first audio frames.
The implementation of this step is the same as step 201 and will not be described here again.
502. Dividing the audio to be detected into n frames of second audio frames.
The implementation of this step is the same as step 202 and will not be described here again.
503. A non-speech frame is determined from the n-frame first audio frame.
In the embodiment of the present application, the non-speech frame is an audio frame other than a speech frame. In one possible implementation, the audio processing apparatus may determine the speech frame and the non-speech frame in the n-frame first audio frame by performing speech activity detection on the n-frame first audio frame.
504. And determining a sixth audio frame corresponding to the non-speech frame from the n-frame second audio frames.
For example, if the third frame of the n first audio frames is a non-speech frame, then the sixth audio frame is the third frame of the n second audio frames.
505. And obtaining the second energy according to the energy of the sixth audio frame.
Since the non-speech frames are considered as audio frames that do not include speech, the energy of the non-speech frames is the energy of the noise in the non-speech frames. Therefore, the audio processing device can determine the energy of the noise in the non-voice frame obtained by the processing of the target audio processing algorithm according to the energy of the sixth audio frame, namely the energy of the noise remained in the non-voice frame.
In one possible implementation, the audio processing device takes the energy of the sixth audio frame as the second energy. In another possible implementation, the audio processing apparatus takes the sum of the energy of the sixth audio frame and the fifth preset value as the second energy. In a further possible implementation, the audio processing device takes the product of the energy of the sixth audio frame and the sixth preset value as the second energy.
In this embodiment, the audio processing apparatus obtains the second energy from the energy of the sixth audio frame after determining the sixth audio frame corresponding to the non-speech frame from the n-frame second audio frames, whereby the second noise reduction effect can be determined by the second energy.
Alternatively, in the case where the number of non-speech frames is greater than 1, the audio processing apparatus may determine one second energy based on each frame of non-speech frames, respectively, to obtain at least one second energy. Determining an average of the at least one second energy results in a sixth energy. A second noise reduction effect of the target audio processing algorithm is determined by the sixth energy.
Alternatively, in the case where the number of non-speech frames is greater than 1, the audio processing apparatus may determine one second energy based on each frame of non-speech frames, respectively, to obtain at least one second energy. Determining a maximum of the at least one second energy results in a seventh energy. A second noise reduction effect of the target audio processing algorithm is determined by the seventh energy.
Alternatively, in the case where the number of non-speech frames is greater than 1, the audio processing apparatus may determine one second energy based on each frame of non-speech frames, respectively, to obtain at least one second energy. Determining a minimum of the at least one second energy results in an eighth energy. A second noise reduction effect of the target audio processing algorithm is determined by the eighth energy.
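Steps 504 and 505 and the aggregation options above can be sketched as follows. The sketch takes the energy of each sixth audio frame directly as the second energy and assumes that frame energy is the sum of squared samples; the function names and sample values are hypothetical.

```python
from statistics import mean


def frame_energy(frame):
    """Energy of a frame, taken here as the sum of squared samples."""
    return sum(x * x for x in frame)


def second_energies(second_frames, non_speech_indices):
    """Energy of each sixth audio frame, used directly as the second energy."""
    return [frame_energy(second_frames[i]) for i in non_speech_indices]


# Frames of the audio to be detected; frames 1 and 3 are non-speech frames.
second_frames = [[0.5, 0.5], [0.5, 0.0], [1.0, 0.0], [0.0, 1.0]]
residuals = second_energies(second_frames, non_speech_indices=[1, 3])

sixth_energy = mean(residuals)    # average residual noise energy
seventh_energy = max(residuals)   # largest residual noise energy
eighth_energy = min(residuals)    # smallest residual noise energy
```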
As an optional implementation manner, the second noise reduction effect further includes a second suppression effect, where the second suppression effect is an effect of suppressing non-speech noise in the noisy audio by the target audio processing algorithm. In case the second energy is derived, the audio processing device further performs the steps of:
601. Dividing the above noisy audio into n frames of fourth audio frames.
The implementation of this step may be referred to as step 401, and will not be described here.
602. And determining a seventh audio frame corresponding to the non-speech frame from the n-frame fourth audio frames.
For example, if the third frame of the n first audio frames is a non-speech frame, then the seventh audio frame is the third frame of the n fourth audio frames.
603. And determining the second suppression effect according to a fourth difference between the energy of the seventh audio frame and the second energy.
In the embodiment of the present application, the smaller the fourth difference, the worse the second suppression effect. In one possible implementation, the audio processing device takes the fourth difference as the second suppression effect. In another possible implementation, the audio processing apparatus takes the sum of the fourth difference and the seventh preset value as the second suppression effect. In a further possible implementation, the audio processing device takes the product of the fourth difference and the eighth preset value as the second suppression effect.
In this embodiment, when the second energy is obtained, the audio processing apparatus obtains the second suppression effect from the fourth difference between the energy of the seventh audio frame and the second energy, and thereby can determine the second noise reduction effect from the second suppression effect.
Alternatively, in the case where the number of non-speech frames is greater than 1, the audio processing apparatus may determine one second suppression effect based on each non-speech frame, respectively, to obtain at least one second suppression effect. Determining an average of the at least one second suppression effect results in a sixth suppression effect. And determining a second noise reduction effect of the target audio processing algorithm through the sixth suppression effect.
Alternatively, in the case where the number of non-speech frames is greater than 1, the audio processing apparatus may determine one second suppression effect based on each non-speech frame, respectively, to obtain at least one second suppression effect. Determining a maximum value of the at least one second suppression effect results in a seventh suppression effect. And determining a second noise reduction effect of the target audio processing algorithm through the seventh suppression effect.
Alternatively, in the case where the number of non-speech frames is greater than 1, the audio processing apparatus may determine one second suppression effect based on each non-speech frame, respectively, to obtain at least one second suppression effect. Determining a minimum value of the at least one second suppression effect results in an eighth suppression effect. And determining a second noise reduction effect of the target audio processing algorithm through the eighth suppression effect.
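Steps 602 and 603 can be sketched end to end from the framed signals. The sketch takes the fourth difference itself as the second suppression effect and assumes that frames with the same index correspond to each other; the function names and sample values are hypothetical.

```python
def frame_energy(frame):
    """Energy of a frame, taken here as the sum of squared samples."""
    return sum(x * x for x in frame)


def second_suppression_effects(fourth_frames, second_frames, non_speech_indices):
    """Fourth difference for each non-speech frame.

    fourth_frames are frames of the noisy audio and second_frames are
    frames of the audio to be detected, assumed to be index-aligned.
    """
    effects = []
    for i in non_speech_indices:
        e_seventh = frame_energy(fourth_frames[i])      # noise before processing
        second_energy = frame_energy(second_frames[i])  # residual noise after processing
        effects.append(e_seventh - second_energy)       # fourth difference
    return effects


noisy = [[1.0, 1.0], [2.0, 0.0]]
processed = [[0.5, 0.5], [1.0, 0.0]]
suppression = second_suppression_effects(noisy, processed, non_speech_indices=[0, 1])
```

A larger fourth difference means that more of the noise energy present in the noisy non-speech frame was removed.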
As an alternative embodiment, in the case that the number of non-speech frames is greater than 1, the second noise reduction effect further comprises a second stability, where the second stability is the stability with which the target audio processing algorithm suppresses non-speech noise in the noisy audio. After determining a seventh audio frame corresponding to a non-speech frame from the n frames of fourth audio frames, the audio processing device further determines the second stability by performing the following step: determining the second stability according to the second variance of the fourth difference.
In such an embodiment, the second variance is the variance of the fourth differences obtained for the different non-speech frames, and the second variance is inversely related to the second stability. The smaller the second variance, the smaller the difference in the noise suppression effect of the target audio processing algorithm across different non-speech frames, which indicates that the target audio processing algorithm suppresses non-speech noise in the noisy audio with good stability. Conversely, the larger the second variance, the larger the difference in the noise suppression effect of the target audio processing algorithm across different non-speech frames, which indicates that the target audio processing algorithm suppresses non-speech noise in the noisy audio with poor stability.
As an alternative embodiment, the third noise reduction effect comprises a third stability. The audio processing device performs the following steps in performing step 104: in the case where the first noise reduction effect includes a first suppression effect of noise in the speech frame and the second noise reduction effect includes a second suppression effect of noise in the non-speech frame, determining a third stability according to a difference between the first suppression effect and the second suppression effect.
In this embodiment of the present application, the third stability characterizes how stably the target audio processing algorithm suppresses noise in the audio. Specifically, because a non-speech frame does not include speech, noise reduction on a non-speech frame need not consider speech distortion, whereas a speech frame includes speech, so noise reduction on a speech frame must consider speech distortion. Thus, whether the difference between the suppression of noise in speech frames and the suppression of noise in non-speech frames is small can characterize the noise reduction effect of the target audio processing algorithm. This difference can be determined according to the first suppression effect and the second suppression effect, and it is the third stability. Therefore, after determining the third stability by this embodiment, the audio processing apparatus can determine the third noise reduction effect by the third stability.
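The third stability can be illustrated with a minimal sketch, assuming that the first suppression effect and the second suppression effect are each available as a single number in the same unit. Using the absolute value of the difference is a choice made here and is not stated in the application; the function name is hypothetical.

```python
def third_stability_gap(first_suppression, second_suppression):
    """Absolute gap between speech and non-speech noise suppression.

    A smaller gap means the algorithm suppresses noise about equally
    well in speech frames and in non-speech frames.
    """
    return abs(first_suppression - second_suppression)


gap = third_stability_gap(first_suppression=2.5, second_suppression=3.0)
```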
Referring to fig. 2, fig. 2 is a flow chart of another audio processing method according to an embodiment of the present application. As shown in fig. 2, the speech frames and the non-speech frames may be determined by performing speech activity detection on the original audio, specifically, the original audio is first divided into n frames of first audio frames (see step 201 for details), and then the speech frames and the non-speech frames are determined by performing speech activity detection on the n frames of first audio frames.
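The framing and voice activity detection described above can be sketched as follows. The text does not specify a particular voice activity detection method, so the fixed frame length and the simple energy threshold used here are assumptions, and all names are hypothetical.

```python
def frame_energy(frame):
    """Energy of one frame as the sum of squared samples (assumed definition)."""
    return sum(float(s) * float(s) for s in frame)


def split_into_frames(samples, frame_len):
    """Split samples into consecutive frames of frame_len samples.

    Trailing samples that do not fill a whole frame are dropped.
    """
    n = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n)]


def label_speech_frames(frames, energy_threshold):
    """Return one flag per frame: True for a speech frame, False otherwise.

    An energy threshold is a stand-in for a real voice activity detector.
    """
    return [frame_energy(frame) > energy_threshold for frame in frames]
```

Because the original audio, the noisy audio and the audio to be tested are all split into n frames, the flags computed on the original audio can be reused to select the corresponding frames in the other two signals.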
For the speech frame, based on the speech frame and the audio to be tested, it may be determined that the first noise reduction effect includes distortion of the speech in the noisy audio after processing by the target audio processing algorithm (see steps 202 to 204 and step 206). Based on the speech frame and the audio to be tested, it may also be determined that the first noise reduction effect includes the presence of first residual noise after the speech in the noisy audio is processed by the target audio processing algorithm (see steps 202 to 204 and step 205). Based on the noisy audio and the audio to be tested, a first suppression effect may be determined (see steps 401 to 403). In the case where it is determined that the first noise reduction effect includes distortion of the speech in the noisy audio after processing by the target audio processing algorithm, a speech distortion degree may be determined based on the speech frame and the audio to be tested (see the step of determining the speech distortion degree from a third difference between the energy of the speech frame and the energy of the third audio frame in the case where the energy of the third audio frame is smaller than the energy of the speech frame). In the case where it is determined that the first noise reduction effect includes the presence of first residual noise after the speech in the noisy audio is processed by the target audio processing algorithm, a first energy may be determined based on the speech frame and the audio to be tested (see step 301). In the case where the first suppression effect is determined, a first stability may be determined (see the step of determining the first stability from the first variance of the second difference).
For the non-speech frame, based on the non-speech frame and the audio to be tested, it may be determined that the second noise reduction effect includes the presence of second residual noise after the non-speech noise in the noisy audio is processed by the target audio processing algorithm (for the implementation of this step, see steps 202 to 204 and the implementation of determining the first residual noise in step 205). Based on the noisy audio and the audio to be tested, a second suppression effect may be determined (see steps 601 to 603). In the case where it is determined that the second noise reduction effect includes the presence of second residual noise after the non-speech noise in the noisy audio is processed by the target audio processing algorithm, a second energy may be determined based on the non-speech frame and the audio to be tested (see steps 501 to 505). In the case where the second suppression effect is determined, a second stability may be determined (see the step of determining the second stability from the second variance of the fourth difference).
Referring to fig. 3, fig. 3 is a flowchart of another audio processing method according to an embodiment of the present application. As shown in fig. 3, the speech frames and the non-speech frames may be determined by performing speech activity detection on the original audio, specifically, by framing the original audio, dividing the original audio into n frames of first audio frames (see step 201 for details), and then determining the speech frames and the non-speech frames by performing speech activity detection on the n frames of first audio frames.
For a speech frame, energy 2 is calculated based on the n frames of the second audio frame. A third audio frame corresponding to the speech frame is determined from the audio to be tested (see steps 202 to 204), and then the energy 1 of the third audio frame in the audio to be tested is calculated. It is then determined whether energy 1 is greater than energy 2. If so, speech distortion is determined, that is, the first noise reduction effect includes distortion of the speech in the noisy audio after processing by the target audio processing algorithm (see steps 202 to 204 and step 206). In the case where speech distortion is determined, the speech distortion degree of each speech frame is determined by counting all frames to obtain at least one speech distortion degree, and the mean and the maximum of the at least one speech distortion degree are then calculated to obtain speech distortion degree 1. If not, noise residue is determined, that is, the first noise reduction effect includes the presence of first residual noise after the speech in the noisy audio is processed by the target audio processing algorithm (see steps 202 to 204 and step 205). In the case where noise residue is determined, a first energy (see step 301), a first suppression effect (see steps 401 to 403), and a first stability (see the step of determining the first stability from the first variance of the second difference) are calculated.
For non-speech frames, energy is calculated based on the n frames of second audio frames, that is, the energy of each non-speech frame is calculated. By counting all frames, the second residual noise of each non-speech frame is determined to obtain at least one second residual noise. The mean and the maximum of the at least one second residual noise are then calculated to obtain the noise residual intensity.
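The mean and maximum statistics described above can be sketched as follows, assuming that the residual noise of a non-speech frame is measured by the energy of the matching frame in the audio to be tested and that energy is the sum of squared samples. Both are assumptions, and the function names are hypothetical.

```python
from statistics import mean


def frame_energy(frame):
    """Energy of one frame as the sum of squared samples (assumed definition)."""
    return sum(float(s) * float(s) for s in frame)


def residual_noise_intensity(tested_non_speech_frames):
    """Return (mean, maximum) of the residual energy over non-speech frames."""
    if not tested_non_speech_frames:
        raise ValueError("at least one non-speech frame is required")
    energies = [frame_energy(frame) for frame in tested_non_speech_frames]
    return mean(energies), max(energies)
```

Reporting both values separates the typical residual level (the mean) from the worst frame (the maximum).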
Meanwhile, energy is calculated for the fifth audio frame. A second suppression effect may be determined by calculating the difference between the energy of the non-speech frame and the energy of the fifth audio frame (see steps 601 to 603). By counting all frames, the second suppression effect of each non-speech frame is determined to obtain at least one second suppression effect. The second stability is then obtained by variance calculation (see the step of determining the second stability from the second variance of the fourth difference).
Finally, based on the first suppression effect and the second suppression effect, a third stability may be determined (see, in particular, the step of determining the third stability according to a difference between the first suppression effect and the second suppression effect in a case where the first noise reduction effect includes a first suppression effect on noise in a speech frame and the second noise reduction effect includes a second suppression effect on noise in a non-speech frame).
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
The foregoing details the method of embodiments of the present application, and the apparatus of embodiments of the present application is provided below.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application. The audio processing apparatus 1 includes an acquisition unit 11 and a determining unit 12. Optionally, the audio processing apparatus 1 further includes a processing unit 13 and a dividing unit 14. Specifically:
an acquisition unit 11, configured to acquire original audio, noisy audio, and audio to be tested, where the original audio includes speech, the noisy audio is obtained by adding noise to the original audio, and the audio to be tested is obtained by denoising the noisy audio using a target audio processing algorithm;
a determining unit 12, configured to determine, according to the original audio and the audio to be tested, a first noise reduction effect of the target audio processing algorithm on the speech in the noisy audio;
the determining unit 12 is further configured to determine, according to the noisy audio and the audio to be tested, a second noise reduction effect of the target audio processing algorithm on non-speech in the noisy audio;
the determining unit 12 is further configured to determine a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect.
In combination with any one of the embodiments of the present application, the determining unit 12 is configured to:
divide the original audio into n frames of first audio frames, where n is an integer greater than 1;
divide the audio to be tested into n frames of second audio frames;
determine a speech frame from the n frames of first audio frames, where the speech frame is an audio frame that includes speech;
determine a third audio frame corresponding to the speech frame from the n frames of second audio frames;
determine, in the case where the energy of the third audio frame is greater than the energy of the speech frame, that the first noise reduction effect includes the presence of first residual noise after the speech in the noisy audio is processed by the target audio processing algorithm;
and determine, in the case where the energy of the third audio frame is smaller than the energy of the speech frame, that the first noise reduction effect includes distortion of the speech in the noisy audio after processing by the target audio processing algorithm.
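The comparison performed by the determining unit 12 can be sketched as follows. The sketch assumes that energy is the sum of squared samples and reports the raw energy gap as the amount of residual noise or distortion; the text only requires a positive correlation, so this choice and all names are assumptions.

```python
def frame_energy(frame):
    """Energy of one frame as the sum of squared samples (assumed definition)."""
    return sum(float(s) * float(s) for s in frame)


def assess_speech_frames(speech_frames, third_frames):
    """Compare each speech frame with its matching frame in the tested audio.

    Returns two lists:
      residual_gaps   - energy excess where the tested frame has more energy
                        than the speech frame (residual noise);
      distortion_gaps - energy deficit where the tested frame has less energy
                        than the speech frame (speech distortion).
    Frames with equal energy contribute to neither list.
    """
    residual_gaps = []
    distortion_gaps = []
    for speech, tested in zip(speech_frames, third_frames):
        speech_energy = frame_energy(speech)
        tested_energy = frame_energy(tested)
        if tested_energy > speech_energy:
            residual_gaps.append(tested_energy - speech_energy)
        elif tested_energy < speech_energy:
            distortion_gaps.append(speech_energy - tested_energy)
    return residual_gaps, distortion_gaps
```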
In combination with any one of the embodiments of the present application, the audio processing device 1 further includes:
the processing unit 13 is configured to obtain the first energy according to a first difference between the energy of the third audio frame and the energy of the speech frame, where the first difference is positively related to the first energy.
In combination with any one of the embodiments of the present application, the first noise reduction effect further includes a first suppression effect, where the first suppression effect is the effect of the target audio processing algorithm in suppressing noise of speech in the noisy audio, and the audio processing apparatus 1 further includes: a dividing unit 14, configured to divide the noisy audio into n frames of fourth audio frames;
the determining unit 12 is further configured to determine a fifth audio frame corresponding to the speech frame from the n frames of fourth audio frames;
the determining unit 12 is further configured to determine the first suppression effect according to a second difference between the energy of the fifth audio frame and the first energy, where the smaller the second difference is, the worse the first suppression effect is.
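A worked sketch of the difference between the energy of the fifth audio frame and the first energy is given below. It assumes that energy is the sum of squared samples and that the first energy equals the energy excess of the third audio frame over the speech frame. Both are assumptions, and the function names are hypothetical.

```python
def frame_energy(frame):
    """Energy of one frame as the sum of squared samples (assumed definition)."""
    return sum(float(s) * float(s) for s in frame)


def first_suppression_difference(speech_frame, third_frame, fifth_frame):
    """Energy of the noisy (fifth) frame minus the first energy.

    Returns None when the tested (third) frame does not carry more energy than
    the speech frame, because no first residual noise is found in that case.
    """
    speech_energy = frame_energy(speech_frame)
    tested_energy = frame_energy(third_frame)
    if tested_energy <= speech_energy:
        return None
    first_energy = tested_energy - speech_energy
    return frame_energy(fifth_frame) - first_energy
```

A larger returned value means less residual noise relative to the energy of the noisy frame, which corresponds to a better first suppression effect.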
In combination with any one of the embodiments of the present application, in a case where the number of the speech frames is greater than 1, the first noise reduction effect further includes a first stability, where the first stability is stability of the target audio processing algorithm for suppressing noise of speech in the noisy audio;
the determining unit 12 is further configured to determine the first stability according to a first variance of the second difference.
In combination with any one of the embodiments of the present application, in the case where the energy of the third audio frame is smaller than the energy of the speech frame, the first noise reduction effect further includes a speech distortion degree, where the speech distortion degree is the degree to which the speech in the noisy audio is distorted after processing by the target audio processing algorithm. The determining unit 12 is further configured to determine, in the case where the energy of the third audio frame is smaller than the energy of the speech frame, the speech distortion degree according to a third difference between the energy of the speech frame and the energy of the third audio frame, where the third difference is positively correlated with the speech distortion degree.
In combination with any one of the embodiments of the present application, the determining unit 12 is configured to determine the speech frame in the n-frame first audio frame by performing speech activity detection on the n-frame first audio frame.
In combination with any one of the embodiments of the present application, the second noise reduction effect includes a second energy of a second residual noise, where the second residual noise is a residual noise after a noise in a non-speech in the noisy audio is processed by the target audio processing algorithm;
the determining unit 12 is configured to:
divide the original audio into n frames of first audio frames, where n is an integer greater than 1;
divide the audio to be tested into n frames of second audio frames;
determine a non-speech frame from the n frames of first audio frames, where the non-speech frame is an audio frame other than a speech frame;
determine a sixth audio frame corresponding to the non-speech frame from the n frames of second audio frames;
and obtain the second energy according to the energy of the sixth audio frame.
In combination with any one of the embodiments of the present application, the second noise reduction effect further includes a second suppression effect, where the second suppression effect is the effect of the target audio processing algorithm in suppressing non-speech noise in the noisy audio, and the audio processing apparatus 1 further includes: a dividing unit 14, configured to divide the noisy audio into n frames of fourth audio frames;
the determining unit 12 is further configured to determine a seventh audio frame corresponding to the non-speech frame from the n frames of fourth audio frames;
the determining unit 12 is further configured to determine the second suppression effect according to a fourth difference between the energy of the seventh audio frame and the second energy, where the smaller the fourth difference is, the worse the second suppression effect is.
In combination with any one of the embodiments of the present application, in a case where the number of non-speech frames is greater than 1, the second noise reduction effect further includes a second stability, where the second stability is stability of the target audio processing algorithm to suppress non-speech noise in the noisy audio;
the determining unit 12 is further configured to determine the second stability according to a second variance of the fourth difference.
In combination with any one of the embodiments of the present application, the determining, according to the first noise reduction effect and the second noise reduction effect, the third noise reduction effect of the target audio processing algorithm includes:
and taking the first noise reduction effect and the second noise reduction effect as the third noise reduction effect.
In combination with any one of the embodiments of the present application, the determining unit 12 is configured to: in the case where the first noise reduction effect comprises a first suppression effect on noise in the speech frame and the second noise reduction effect comprises a second suppression effect on noise in the non-speech frame, determining that the third noise reduction effect comprises a third stability that characterizes a stability of the target audio processing algorithm suppressing noise in audio based on a difference of the first suppression effect and the second suppression effect.
In the embodiment of the application, the original audio includes speech, the noisy audio is obtained by adding noise to the original audio, and the audio to be tested is obtained by denoising the noisy audio using a target audio processing algorithm. After acquiring the original audio, the noisy audio and the audio to be tested, the audio processing device determines a first noise reduction effect of the target audio processing algorithm on the speech in the noisy audio according to the original audio and the audio to be tested, and determines a second noise reduction effect of the target audio processing algorithm on the non-speech in the noisy audio according to the noisy audio and the audio to be tested. It then determines a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect, so that the noise reduction effect of the target audio processing algorithm can be evaluated from different angles and the accuracy of the third noise reduction effect can be improved.
In some embodiments, functions or modules included in the apparatus provided in the embodiments of the present application may be used to perform the methods described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
Fig. 5 is a schematic hardware structure of an electronic device according to an embodiment of the present application. The electronic device 2 comprises a processor 21 and a memory 22. Optionally, the electronic device 2 further comprises input means 23 and output means 24. The processor 21, memory 22, input device 23, and output device 24 are coupled by connectors, including various interfaces, transmission lines or buses, etc., as not limited in this application. It should be understood that in various embodiments of the present application, coupled is intended to mean interconnected by a particular means, including directly or indirectly through other devices, e.g., through various interfaces, transmission lines, buses, etc.
The processor 21 may comprise one or more processors, for example one or more central processing units (central processing unit, CPU), which in the case of a CPU may be a single core CPU or a multi core CPU. Alternatively, the processor 21 may be a processor group constituted by a plurality of CPUs, the plurality of processors being coupled to each other through one or more buses. In the alternative, the processor may be another type of processor, and the embodiment of the present application is not limited.
The memory 22 may be used to store computer program instructions as well as various types of computer program code for performing aspects of the present application. Optionally, the memory includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM), and the memory is used to store related instructions and data.
The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or may be an integral device.
It will be appreciated that in the embodiment of the present application, the memory 22 may be used to store not only related instructions but also related data. For example, the memory 22 may be used to store data obtained through the input device 23, which is not limited in the embodiment of the present application.
It will be appreciated that fig. 5 shows only a simplified design of an electronic device. In practical applications, the electronic device may further include other necessary elements, including but not limited to any number of input/output devices, processors, memories, etc., and all electronic devices that may implement the embodiments of the present application are within the scope of protection of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein. It will be further apparent to those skilled in the art that the descriptions of the various embodiments herein are provided with emphasis, and that the same or similar parts may not be explicitly described in different embodiments for the sake of convenience and brevity of description, and thus, parts not described in one embodiment or in detail may be referred to in the description of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that all or part of the flows of the above method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The aforementioned storage medium includes: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

Claims (16)

1. A method of audio processing, the method comprising:
acquiring original audio, noisy audio and audio to be tested, wherein the original audio comprises voice, the noisy audio is obtained by adding noise to the original audio, and the audio to be tested is obtained by utilizing a target audio processing algorithm to reduce noise of the noisy audio;
determining a first noise reduction effect of the target audio processing algorithm on the voice in the noisy audio according to the original audio and the audio to be detected;
determining a second noise reduction effect of the target audio processing algorithm on non-voice in the noisy audio according to the noisy audio and the audio to be detected;
and determining a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect.
2. The method of claim 1, wherein the determining a first noise reduction effect of the target audio processing algorithm on speech in the noisy audio from the original audio and the audio to be tested comprises:
dividing the original audio into n frames of first audio frames, wherein n is an integer greater than 1;
dividing the audio to be detected into n frames of second audio frames;
determining a voice frame from the n-frame first audio frames, wherein the voice frame is an audio frame comprising voice;
determining a third audio frame corresponding to the voice frame from the n frames of second audio frames;
determining that a first noise reduction effect comprises first residual noise after the voice in the noisy audio is processed by the target audio processing algorithm under the condition that the energy of the third audio frame is larger than the energy of the voice frame;
and under the condition that the energy of the third audio frame is smaller than the energy of the voice frame, determining that the first noise reduction effect comprises that the voice in the noisy audio is distorted after being processed by the target audio processing algorithm.
3. The method of claim 2, wherein the first noise reduction effect further comprises a first energy of the first residual noise if the energy of the third audio frame is greater than the energy of the speech frame, the method further comprising:
and obtaining the first energy according to a first difference between the energy of the third audio frame and the energy of the voice frame, wherein the first difference is positively correlated with the first energy.
4. A method according to claim 3, wherein the first noise reduction effect further comprises a first suppression effect, the first suppression effect being an effect of the target audio processing algorithm suppressing noise of speech in the noisy audio, and wherein, in case the first energy is obtained, the method further comprises:
dividing the noisy audio into n frames of fourth audio frames;
determining a fifth audio frame corresponding to the voice frame from the n-frame fourth audio frames;
and determining the first suppression effect according to a second difference between the energy of the fifth audio frame and the first energy, wherein the smaller the second difference is, the worse the first suppression effect is.
5. The method of claim 4, wherein in the case where the number of speech frames is greater than 1, the first noise reduction effect further comprises a first stability, the first stability being stability of the target audio processing algorithm suppressing noise of speech in the noisy audio;
after determining a fifth audio frame corresponding to the speech frame from the n-frame fourth audio frames, the method further comprises:
the first stability is determined from a first variance of the second difference.
6. The method of claim 2, wherein in the case where the energy of the third audio frame is less than the energy of the speech frame, the first noise reduction effect further comprises a degree of speech distortion, the degree of speech distortion being a degree of distortion of speech in the noisy audio after processing by the target audio processing algorithm, the method further comprising:
and under the condition that the energy of the third audio frame is smaller than that of the voice frame, determining the voice distortion degree according to a third difference between the energy of the voice frame and the energy of the third audio frame, wherein the third difference is positively correlated with the voice distortion degree.
7. The method according to any one of claims 2 to 6, wherein said determining a speech frame from said n frames of first audio frames comprises:
and determining the voice frame in the n-frame first audio frame by detecting the voice activity of the n-frame first audio frame.
8. The method of any one of claims 1 to 6, wherein the second noise reduction effect includes a second energy of a second residual noise, the second residual noise being a residual noise of a non-speech noise in the noisy audio after processing by the target audio processing algorithm;
the determining, according to the noisy audio and the audio to be detected, a second noise reduction effect of the target audio processing algorithm on non-speech in the noisy audio includes:
dividing the original audio into n frames of first audio frames, wherein n is an integer greater than 1;
dividing the audio to be detected into n frames of second audio frames;
determining a non-speech frame from the n-frame first audio frames, the non-speech frame being an audio frame other than the speech frame;
determining a sixth audio frame corresponding to the non-speech frame from the n-frame second audio frames;
and obtaining the second energy according to the energy of the sixth audio frame.
9. The method of claim 8, wherein the second noise reduction effect further comprises a second suppression effect, the second suppression effect being an effect of the target audio processing algorithm suppressing non-speech noise in the noisy audio, and wherein, if the second energy is obtained, the method further comprises:
dividing the noisy audio into n frames of fourth audio frames;
determining a seventh audio frame corresponding to the non-speech frame from the n-frame fourth audio frames;
and determining the second suppression effect according to a fourth difference between the energy of the seventh audio frame and the second energy, wherein the smaller the fourth difference is, the worse the second suppression effect is.
10. The method of claim 9, wherein in the case where the number of non-speech frames is greater than 1, the second noise reduction effect further comprises a second stability, the second stability being stability of the target audio processing algorithm suppressing non-speech noise in the noisy audio;
after determining a seventh audio frame corresponding to the non-speech frame from the n-frame fourth audio frames, the method further comprises:
and determining the second stability according to a second variance of the fourth difference.
11. The method of any of claims 1 to 6, wherein the determining a third noise reduction effect of the target audio processing algorithm from the first noise reduction effect and the second noise reduction effect comprises:
and taking the first noise reduction effect and the second noise reduction effect as the third noise reduction effect.
12. The method of any one of claims 1 to 6, wherein the third noise reduction effect comprises a third stability, the third stability characterizing the stability of the target audio processing algorithm in suppressing noise in audio;
the determining a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect includes:
in the case where the first noise reduction effect includes a first suppression effect on noise in the speech frame and the second noise reduction effect includes a second suppression effect on noise in the non-speech frame, determining the third stability according to a difference between the first suppression effect and the second suppression effect.
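One possible reading of this claim is sketched below. Taking the absolute value of the difference, and treating a smaller value as more uniform suppression across speech and non-speech frames, are assumptions; the claim only states that the third stability is determined from the difference. The function name is hypothetical.

```python
def third_stability(first_suppression, second_suppression):
    """Gap between suppression in speech frames and suppression in non-speech frames.

    Assumption: both inputs are expressed on the same scale, such as an energy difference.
    """
    return abs(first_suppression - second_suppression)
```

Under this reading, a value of 0 means noise is suppressed equally in speech and non-speech frames.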
13. An audio processing apparatus, characterized in that the audio processing apparatus comprises:
an acquisition unit, configured to acquire original audio, noisy audio and audio to be detected, wherein the original audio comprises voice, the noisy audio is obtained by adding noise to the original audio, and the audio to be detected is obtained by utilizing a target audio processing algorithm to reduce noise of the noisy audio;
a determining unit, configured to determine a first noise reduction effect of the target audio processing algorithm on the voice in the noisy audio according to the original audio and the audio to be detected;
the determining unit being further configured to determine a second noise reduction effect of the target audio processing algorithm on non-voice in the noisy audio according to the noisy audio and the audio to be detected;
the determining unit being further configured to determine a third noise reduction effect of the target audio processing algorithm according to the first noise reduction effect and the second noise reduction effect.
14. An electronic device, comprising: a processor and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any one of claims 1 to 12.
15. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 12.
16. A computer program product, characterized in that the computer program product comprises a computer program or instructions; when the computer program or instructions are run on a computer, the computer is caused to perform the method of any one of claims 1 to 12.
CN202311757605.0A 2023-12-20 2023-12-20 Audio processing method and device, electronic equipment and computer readable storage medium Pending CN117711434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311757605.0A CN117711434A (en) 2023-12-20 2023-12-20 Audio processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311757605.0A CN117711434A (en) 2023-12-20 2023-12-20 Audio processing method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117711434A true CN117711434A (en) 2024-03-15

Family

ID=90162052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311757605.0A Pending CN117711434A (en) 2023-12-20 2023-12-20 Audio processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117711434A (en)

Similar Documents

Publication Publication Date Title
JP6557786B2 (en) Echo delay tracking method, apparatus and computer storage medium
CN110782914B (en) Signal processing method and device, terminal equipment and storage medium
CN110706693B (en) Method and device for determining voice endpoint, storage medium and electronic device
CN112700786B (en) Speech enhancement method, device, electronic equipment and storage medium
CN110875059B (en) Method and device for judging reception end and storage device
CN112802486B (en) Noise suppression method and device and electronic equipment
CN109074814B (en) Noise detection method and terminal equipment
CN111883160A (en) Method and device for picking up and reducing noise of directional microphone array
CN113539285A (en) Audio signal noise reduction method, electronic device, and storage medium
CN111028855B (en) Echo suppression method, device, equipment and storage medium
CN110503973B (en) Audio signal transient noise suppression method, system and storage medium
JP2011527160A (en) Dynamic filtering for adjacent channel interference suppression
CN108605191B (en) Abnormal sound detection method and device
CN117711434A (en) Audio processing method and device, electronic equipment and computer readable storage medium
CN111081269A (en) Noise detection method and system in call process
CN113316075B (en) Howling detection method and device and electronic equipment
CN115190408A (en) Method, device and equipment for testing active noise reduction and bottom noise of earphone
CN110189763B (en) Sound wave configuration method and device and terminal equipment
CN115273880A (en) Voice noise reduction method, model training method, device, equipment, medium and product
CN113517000A (en) Echo cancellation test method, terminal and storage device
US11955132B2 (en) Identifying method of sound watermark and sound watermark identifying apparatus
CN117727311A (en) Audio processing method and device, electronic equipment and computer readable storage medium
CN114337908B (en) Method and device for generating interference signal of target voice signal
CN117711435A (en) Audio processing method and device, electronic equipment and computer readable storage medium
CN113316074B (en) Howling detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination