CN112309419B - Noise reduction and output method and system for multipath audio - Google Patents


Info

Publication number
CN112309419B
Authority
CN
China
Prior art keywords
audio
output
segment
background noise
audio segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011191091.3A
Other languages
Chinese (zh)
Other versions
CN112309419A (en)
Inventor
张新华
陈华锋
李兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lancoo Technology Co ltd
Original Assignee
Zhejiang Lancoo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lancoo Technology Co ltd filed Critical Zhejiang Lancoo Technology Co ltd
Priority to CN202011191091.3A
Publication of CN112309419A
Application granted
Publication of CN112309419B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/0208 Noise filtering (under G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • H04N5/278 Subtitling (under H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects)

Abstract

The application relates to audio processing and discloses a noise reduction and output method and system for multi-channel audio, which preserve the purity and completeness of the finally output audio signal and improve the speech recognition rate. The noise reduction method comprises the following steps: obtaining N channels of audio, and dividing each channel into a plurality of time-tagged audio segments according to speech pauses to obtain N audio segment sequences; determining the background noise segments among the audio segments and calculating their frequency components; for the audio segments sharing the same time tag across the N sequences, selecting the segment with the best speech effect as the segment to be output and calculating its frequency components; and, for each segment to be output, calculating a background noise frequency component estimate from the frequency components of the background noise segments that precede it in its sequence, and subtracting this estimate from the segment's frequency components to perform noise reduction on the segment to be output.

Description

Noise reduction and output method and system for multipath audio
Technical Field
The present application relates to audio processing, and in particular, to noise reduction and output techniques for multi-channel audio.
Background
When lectures or meetings are recorded or streamed in classrooms, meeting rooms and similar venues, usually only one pickup is arranged to collect the indoor audio. Because different positions in the room lie at different distances from the pickup, the collected audio is often unbalanced and noisy, the sound quality delivered to the receiving side is poor, and the audio-visual experience of users on the receiving side is seriously degraded and fails to meet their needs.
Disclosure of Invention
The purpose of the application is to provide a noise reduction and output method and a system for N paths of audio of N sound pickups in a room, which can keep the purity and completeness of the finally output audio signals and improve the speech recognition rate.
The application discloses a noise reduction method for N paths of audio frequencies of N sound pickups indoors, wherein the N sound pickups are arranged in different indoor directions; the method comprises the following steps:
the N paths of audios are obtained, each path of audio is divided into a plurality of audio fragments with time labels according to voice pauses, and corresponding N audio fragment sequences are obtained;
determining a background noise segment in the audio segment, and calculating frequency components of the background noise segment;
for each path of audio clips with the same time tag in the N audio clip sequences, selecting an audio clip with the optimal voice effect as an audio clip to be output, and calculating frequency components of the audio clip to be output;
for each audio segment to be output, calculating a background noise frequency component estimated value of the audio segment to be output according to the frequency components of the background noise segments arranged in front of the audio segment to be output in the sequence of the audio segment to be output, and subtracting the background noise frequency component estimated value from the frequency components of the audio segment to be output so as to perform noise reduction processing on the audio segment to be output.
In a preferred embodiment, for each audio segment with the same time tag in the N audio segment sequences, selecting an audio segment with the optimal speech effect as the audio segment to be output, further includes:
for each audio fragment with the same time label, calculating the signal-to-noise ratio and average amplitude of each audio fragment;
and calculating the voice effect score of each audio segment according to the signal-to-noise ratio and the average amplitude value, and selecting the audio segment with the highest score as the audio segment to be output.
In a preferred embodiment, for each audio segment to be output, the computing the background noise frequency component estimation value of the audio segment to be output according to the frequency components of the background noise segment arranged in front of the audio segment to be output in the sequence, further includes:
for each audio segment to be output, calculating the background noise frequency component estimation value of the audio segment to be output according to the geometric mean or arithmetic mean of all background noise segments arranged in front of the audio segment to be output in the sequence.
In a preferred embodiment, for each audio segment to be output, the computing the background noise frequency component estimation value of the audio segment to be output according to the frequency components of the background noise segment arranged in front of the audio segment to be output in the sequence, further includes:
for each audio segment to be output, a background noise frequency component estimate of the audio segment to be output is calculated from the geometric mean or arithmetic mean of the frequency components of the plurality of background noise segments nearest to it arranged in the sequence in which it is located.
In a preferred embodiment, the determining the background noise segment in the audio segment and calculating the frequency component of the background noise segment further comprises:
calculating the average amplitude of each audio segment, determining the audio segment with the average amplitude smaller than a first preset threshold value as a background noise segment, and calculating the frequency component of the background noise segment;
the step of obtaining the N paths of audios, which is to divide each path of audio into a plurality of audio fragments with time labels according to voice pauses, and further comprises the steps of:
and combining the adjacent audio fragments with the duration smaller than the preset time in sequence, wherein the duration of each combined voice fragment is within the preset duration range.
The application also discloses an output method of N paths of audios for the indoor N sound pickups, wherein the N sound pickups are arranged in different indoor directions; the method comprises the following steps:
acquiring the N paths of audios;
carrying out noise reduction processing according to the N paths of audios, and carrying out voice recognition on all audio fragments to be output after the noise reduction processing to generate subtitles with time labels;
mixing the N paths of audios;
and synchronously outputting the caption and the audio after the audio mixing processing to a terminal based on the time tag.
In a preferred embodiment, before the mixing processing is performed on the N paths of audio, the method further includes:
dividing each path of audio into a plurality of audio fragments with time labels according to voice pauses to obtain N audio fragment sequences;
determining background noise fragments in the audio fragments, and calculating frequency components of all the background noise fragments;
for each audio segment in each audio segment sequence, calculating a background noise frequency component estimated value of the audio segment according to the frequency components of the background noise segments arranged in front of the audio segment in the sequence, and subtracting the background noise frequency component estimated value from the frequency components of the audio segment to perform noise reduction processing on the audio segment;
re-synthesizing each audio fragment sequence after noise reduction treatment into one path of audio based on the time tag;
and mixing the re-synthesized N paths of audio.
In a preferred embodiment, before the acquiring the N paths of audio, the method further includes:
calculating the average amplitude and/or signal-to-noise ratio of each path of audio;
and eliminating the audio with the average amplitude smaller than the second preset threshold value and/or with the signal to noise ratio smaller than the third preset threshold value from the multipath audio.
The application also discloses a noise reduction system for N paths of audios of N sound pickups in the room, wherein the N sound pickups are arranged in different indoor directions; the system comprises:
the first acquisition module is used for acquiring the N paths of audio;
the segmentation module is used for segmenting each path of audio into a plurality of audio fragments with time labels according to voice pauses to obtain corresponding N audio fragment sequences;
a background noise segment determining module, configured to determine a background noise segment in the audio segment, and calculate a frequency component of the background noise segment;
the audio segment selection module to be output is used for selecting the audio segment with the optimal voice effect as the audio segment to be output for each channel of audio segments with the same time tag in the N audio segment sequences, and calculating the frequency component of the audio segment to be output;
the first noise reduction module is used for calculating a background noise frequency component estimated value of each audio fragment to be output according to the frequency components of the background noise fragments arranged in front of the audio fragment to be output in the sequence of the audio fragment to be output, and subtracting the background noise frequency component estimated value from the frequency components of the audio fragment to be output so as to perform noise reduction processing on the audio fragment to be output.
The application also discloses an output system of N paths of audios for N sound pickups in the room, wherein the N sound pickups are arranged in different indoor directions; the system comprises:
the second acquisition module is used for acquiring the N paths of audio;
the second noise reduction module is used for carrying out noise reduction processing according to the N paths of audio described above;
the subtitle generating module is used for carrying out voice recognition on all the audio clips to be output after the noise reduction processing to generate subtitles with time labels;
the audio mixing module is used for carrying out audio mixing processing on the N paths of audio;
and the synchronous output module is used for synchronously outputting the caption and the audio after the audio mixing processing to a terminal based on the time tag.
Compared with the prior art, the embodiment of the application at least comprises the following advantages and effects:
the method comprises the steps of respectively carrying out audio collection on sound collectors installed in different indoor directions, dividing collected multi-channel audio based on natural language pauses to obtain a plurality of corresponding audio fragment sequences, comparing and analyzing all audio fragments carrying the same time tag, selecting one audio fragment with the best effect as an audio fragment to be output corresponding to the time tag, estimating the background noise frequency component of the audio fragment to be output according to the frequency component of the background noise fragment arranged in front of each audio fragment to be output in the sequence, and carrying out noise reduction treatment on the audio fragment to be output based on the estimated value, so that the purity and completeness of finally output audio signals are maintained, and the speech recognition rate is improved.
Further, the audio segments to be output after noise reduction are passed through speech recognition to obtain the corresponding subtitle information, while the collected multi-channel audio is mixed to obtain the audio information; the audio information and the subtitle information are then output synchronously to a terminal. This satisfies both the fidelity requirements of speech-recognized subtitles and the fullness and layering of sound expected in a teleconference or class.
Further, before the collected multi-channel audio (or its segmentation result) is mixed into the audio information, the background noise frequency components of each audio segment in each channel's segmentation result can be estimated from the frequency components of the background noise segments that precede it in its sequence, and noise reduction can be performed on the segment based on that estimate. This improves the audio quality while preserving the fullness and layering of the mixed audio information.
In addition, the invention can be applied to scenes such as classrooms, conference rooms and studios, improving the quality of sound picked up for remote presentation by indoor microphones without requiring speakers to wear them.
In the present application, a large number of technical features are described in the specification and distributed among the technical solutions; listing all possible combinations of these features (i.e. all technical solutions) would make the specification excessively long. To avoid this problem, the technical features disclosed in the summary above, in the embodiments and examples below, and in the drawings may be freely combined with one another to form new technical solutions (all of which are regarded as described in this specification), unless such a combination is technically impossible. For example, suppose one example discloses the features A+B+C and another discloses A+B+D+E, where C and D are equivalent means performing the same function, only one of which would be used at a time so that they cannot be adopted simultaneously, while E can technically be combined with C. Then the solution A+B+C+D should not be regarded as described, because it is technically impossible, whereas the solution A+B+C+E should be regarded as described.
Drawings
Fig. 1 is a flowchart of a noise reduction method for N-channel audio of N microphones in a room according to a first embodiment of the present application.
Fig. 2 is a schematic diagram of a noise reduction system for N-channel audio of N microphones in a room according to a second embodiment of the present application.
Fig. 3 is a flowchart of an output method of N-channel audio for N microphones in a room according to a third embodiment of the present application.
Fig. 4 is a schematic diagram of an output system structure of N-channel audio for N microphones in a room according to a fourth embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, it will be understood by those skilled in the art that the claimed invention may be practiced without these specific details and with various changes and modifications from the embodiments that follow.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The first embodiment of the present application relates to a noise reduction method for N audio frequencies of N sound pickups in a room, where the N sound pickups are disposed in different indoor directions, the N audio frequencies are in one-to-one correspondence with the N sound pickups, and a flow of the noise reduction method is shown in fig. 1, and the method includes the following steps:
in step 101, the N audio channels are obtained, each audio channel is divided into a plurality of audio segments with time labels according to voice pauses, and corresponding N audio segment sequences are obtained, and each audio segment sequence corresponds to one audio channel.
Typically, the N channels of audio are partitioned based on the same speech-pause feature information. For example, but not limited to, a silence whose duration is at least t0 is treated as a natural speech pause, where t0 is a configurable parameter. For example, the time tag includes the start time point and end time point of the corresponding audio segment.
Optionally, after step 101, the following step may further be included: merging, in order, adjacent audio segments whose duration is less than a first preset time t1, so that the duration of each merged speech segment falls within a predetermined range (t2 to t2+Δt). For example, but not limited to, t1 = 3 s, t2 = 5 to 10 s, and Δt = 2 s.
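The pause-based segmentation of step 101 and the optional merging of short segments can be sketched as follows. This is a minimal illustration assuming mono PCM input; the frame size, amplitude floor, and function names are illustrative choices, not values fixed by the patent.

```python
import numpy as np

def split_on_pauses(samples, sr, t0=0.5, amp_floor=0.01):
    """Split a mono signal into segments at silences lasting at least t0
    seconds, returning (start_time, end_time) tuples in seconds."""
    frame = int(0.02 * sr)                  # 20 ms analysis frames
    n_frames = len(samples) // frame
    silent = [np.abs(samples[i*frame:(i+1)*frame]).mean() < amp_floor
              for i in range(n_frames)]
    min_run = int(t0 * sr / frame)          # silent frames that count as a pause
    segments, start, run = [], None, 0
    for i, s in enumerate(silent):
        if not s:
            if start is None:
                start = i                   # speech begins
            run = 0
        else:
            run += 1
            if start is not None and run >= min_run:
                # close the segment at the first frame of the pause
                segments.append((start*frame/sr, (i - run + 1)*frame/sr))
                start = None
    if start is not None:
        segments.append((start*frame/sr, n_frames*frame/sr))
    return segments

def merge_short(segments, t1=3.0, t2=5.0, dt=2.0):
    """Merge adjacent segments shorter than t1 seconds, keeping each
    merged segment within the range t2 to t2+dt where possible."""
    merged = []
    for start, end in segments:
        if merged and (merged[-1][1] - merged[-1][0]) < t1 \
                and (end - merged[-1][0]) <= t2 + dt:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

For instance, a three-second recording with one second of silence in the middle splits into two one-second segments, which `merge_short` would then combine if they fall under the t1 threshold.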
Thereafter, step 102 is entered to determine a background noise segment in the audio segment and calculate frequency components of the background noise segment.
The method of determining the background noise segment in step 102 is various. For example, by calculating the average amplitude of each of the audio segments, determining the audio segment having the average amplitude smaller than the first preset threshold as a background noise segment, and calculating the frequency component of the background noise segment. For another example, a wiener filtering method is used to separate a voice component S and a noise component N from each path of audio signal, so as to calculate the signal-to-noise ratio of each path of audio. When the signal-to-noise ratio is less than a certain set threshold, the audio segment is a background noise segment.
"Calculating the frequency components of the background noise segment" in step 102 refers to transforming the audio of the background noise segment from the time domain to the frequency domain and obtaining the amplitude of each frequency component in the frequency domain.
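One way to realise step 102 with the average-amplitude criterion, assuming each segment is an array of samples; the threshold value and function names here are illustrative:

```python
import numpy as np

def segment_frequency_components(segment):
    """Transform a segment from the time domain to the frequency domain
    and return the amplitude of each frequency component."""
    return np.abs(np.fft.rfft(segment))

def find_background_noise(segments, first_threshold):
    """Return {segment index: frequency components} for every segment whose
    average amplitude falls below the first preset threshold."""
    return {i: segment_frequency_components(s)
            for i, s in enumerate(segments)
            if np.abs(s).mean() < first_threshold}
```

The same `segment_frequency_components` transform would also serve substep 103c, since the patent computes frequency components of the segments to be output in the same way.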
Then, step 103 is entered, and for each audio segment with the same time tag in the N audio segment sequences, an audio segment with the best speech effect is selected as an audio segment to be output, and frequency components of the audio segment to be output are calculated.
Optionally, this step 103 further comprises the sub-steps of:
In a substep 103a, for the audio segments with the same time tag, the signal-to-noise ratio and average amplitude of each audio segment are calculated. In a substep 103b, the speech effect score of each audio segment is calculated from its signal-to-noise ratio and average amplitude, and the segment with the highest score is selected as the segment to be output. In a substep 103c, the frequency components of the segment to be output are calculated, i.e. the audio of the segment to be output is transformed from the time domain to the frequency domain and the amplitude of each frequency component is obtained.
For the above sub-step 103b, for example, an audio path with high signal-to-noise ratio and high average amplitude may be selected as the best audio segment corresponding to the time tag through a voice effect scoring function q=f (S/N, V), where S/N is the signal-to-noise ratio and V is the average amplitude of the audio path in the time period corresponding to the time tag.
Further, the specific implementation manner of "selecting the audio path with high signal-to-noise ratio and high average amplitude as the best audio segment of the time period corresponding to the time tag through the voice effect scoring function q=f (S/N, V)" described above may include, for example, the following "embodiment 1" and "embodiment 2".
Embodiment 1 specifically includes: when the signal-to-noise ratio of one or more channels satisfies S/N > 10, the average amplitudes V of the qualifying audio segments are compared, and the segment with the largest average amplitude is selected as the segment to be output; when all channels satisfy S/N ≤ 10, a Q value is calculated for each segment by the scoring formula (presented as a figure in the original publication), and the segment with the largest Q value has the best speech effect.
Embodiment 2 specifically includes: calculating scores Q1 and Q2 for S/N and V respectively, with the final score Q = Q1 × Q2; the audio segment with the highest score is selected as the best channel. The scores Q1 and Q2 are calculated as follows:
when S/N > 10, Q1 = 3;
when 4 < S/N ≤ 10, Q1 = 2;
when 2 < S/N ≤ 4, Q1 = 1;
when S/N ≤ 2, Q1 = 0.3;
when V > 100 dB, Q2 = 2;
when 80 dB < V ≤ 100 dB, Q2 = 1.5;
when 50 dB < V ≤ 80 dB, Q2 = 1.1;
when 0 < V ≤ 50 dB, Q2 = 0.8.
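The banded scoring of Embodiment 2 transcribes directly into code; the function names are illustrative, while the band boundaries and scores follow the text above (V in dB):

```python
def speech_effect_score(snr, avg_amp):
    """Score Q = Q1 * Q2, where Q1 is banded on the signal-to-noise
    ratio S/N and Q2 on the average amplitude V in dB."""
    if snr > 10:
        q1 = 3.0
    elif snr > 4:
        q1 = 2.0
    elif snr > 2:
        q1 = 1.0
    else:
        q1 = 0.3
    if avg_amp > 100:
        q2 = 2.0
    elif avg_amp > 80:
        q2 = 1.5
    elif avg_amp > 50:
        q2 = 1.1
    else:
        q2 = 0.8
    return q1 * q2

def best_channel(snrs, amps):
    """Return the index of the channel whose segment scores highest."""
    scores = [speech_effect_score(s, v) for s, v in zip(snrs, amps)]
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, a channel with S/N = 12 and V = 90 dB scores 3.0 × 1.5 = 4.5 and would be chosen over a channel with S/N = 3 and V = 60 dB, which scores 1.0 × 1.1 = 1.1.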
Then, step 104 is entered, for each audio segment to be output, calculating a background noise frequency component estimation value of the audio segment to be output according to the frequency components of the background noise segments arranged in front of the audio segment to be output in the sequence of the audio segment to be output, and subtracting the background noise frequency component estimation value from the frequency components of the audio segment to be output, so as to perform noise reduction processing on the audio segment to be output.
It will be appreciated that following this step 104, the following steps are also included: and converting the frequency components of the audio fragment to be output after the noise reduction treatment from the frequency domain to the time domain, and outputting or continuing the subsequent treatment process. The subsequent processing may be, for example, a voice recognition processing, for example, but not limited to, recognition and analysis of teaching contents of a teacher in a teaching video, recognition of subtitles in a teleconference, and the like.
There are various ways to implement step 104's calculation of the background noise frequency component estimate of each segment to be output from the frequency components of the background noise segments that precede it in its sequence. Optionally, it may be implemented as: for each segment to be output, calculating the estimate as the geometric mean or arithmetic mean of the frequency components of all background noise segments that precede it in its sequence. Optionally: taking the frequency components of the nearest background noise segment preceding it in its sequence as the estimate. Optionally: calculating the estimate as the geometric mean or arithmetic mean of the frequency components of the several background noise segments nearest to it in its sequence. Optionally: calculating the estimate as the geometric mean or arithmetic mean over all segments preceding it in its sequence (typically the frequency components of the background noise segments together with the frequency component estimates of the non-background-noise segments).
Optionally, the audio segments within a period of time immediately after each pickup is turned on default to the initial background noise segments in the corresponding sequence.
To illustrate the noise reduction of steps 102 to 104, consider a segment to be output that is preceded in its sequence by two background noise segments, a first and a second in order, whose frequency components are calculated as {N1, N2, N3, …, Nk} and {N1′, N2′, N3′, …, Nk′}. The background noise frequency components of the segment to be output are then estimated from these two sets, e.g. as the arithmetic mean

    N̂i = (Ni + Ni′) / 2, i = 1, …, k,

and subtracting each N̂i from the corresponding frequency component of the segment to be output yields its noise-reduced frequency components. Because the background noise estimate for a segment to be output can be accumulated from any number of, or all, preceding background noise segments, the estimate is iteratively updated for segments at different moments, ensuring a smooth transition of sound between audio segments during noise reduction.
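The subtraction step can be sketched as a standard magnitude spectral subtraction. This assumes equal-length segments so the magnitude spectra align bin by bin, and uses the arithmetic-mean estimator, which is one of the averaging options named above; the function name is illustrative:

```python
import numpy as np

def denoise_segment(segment, noise_spectra):
    """Estimate the background-noise spectrum as the arithmetic mean of the
    magnitude spectra of the preceding background noise segments, subtract
    it from the segment's magnitude spectrum (floored at zero so no bin
    goes negative), and return the reconstructed time-domain segment."""
    spec = np.fft.rfft(segment)
    noise_est = np.mean(noise_spectra, axis=0)     # e.g. (Ni + Ni') / 2
    mag = np.maximum(np.abs(spec) - noise_est, 0.0)
    # keep the original phase, transform back to the time domain
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(segment))
```

Flooring at zero is a common guard in spectral subtraction: when the noise estimate exceeds a bin's magnitude, the bin is silenced rather than inverted.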
The second embodiment of the application relates to a noise reduction system for N paths of audios of indoor N sound pick-ups, the N sound pick-ups are arranged in different indoor directions, the N paths of audios are in one-to-one correspondence with the N sound pick-ups, the system structure is as shown in fig. 2, and the system comprises a first acquisition module, a segmentation module, a background noise segment determination module, a to-be-output audio segment selection module and a first noise reduction module.
The first acquisition module is used for acquiring the N paths of audio; the segmentation module is used for segmenting each path of audio into a plurality of audio fragments with time labels according to voice pauses to obtain corresponding N audio fragment sequences; the background noise segment determining module is used for determining a background noise segment in the audio segment and calculating frequency components of the background noise segment; the audio fragment selection module to be output is used for selecting the audio fragment with the optimal voice effect as the audio fragment to be output for each path of audio fragments with the same time tag in the N audio fragment sequences, and calculating the frequency component of the audio fragment to be output; the first noise reduction module is used for calculating a background noise frequency component estimated value of each audio fragment to be output according to the frequency components of the background noise fragments arranged in front of the audio fragment to be output in the sequence of the audio fragment to be output, and subtracting the background noise frequency component estimated value from the frequency components of the audio fragment to be output so as to perform noise reduction processing on the audio fragment to be output.
The frequency components of each audio segment are calculated by transforming the segment's audio from the time domain to the frequency domain and taking the frequency amplitudes in the frequency domain.
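As a concrete sketch, the frequency components of a segment can be obtained with a real FFT and its bin magnitudes. NumPy is used for illustration only; the patent does not prescribe a specific transform:

```python
import numpy as np

def frequency_components(segment):
    """Transform a time-domain audio segment to the frequency domain
    and return the amplitude of each frequency bin."""
    return np.abs(np.fft.rfft(segment))
```

For a real-valued segment of M samples, `rfft` yields M//2 + 1 bins, whose magnitudes serve as the "frequency components" used throughout this description.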
It can be understood that the first noise reduction module is further configured to convert the frequency components of the noise-reduced audio segment to be output from the frequency domain back to the time domain, and then to output the segment or continue with subsequent processing. The subsequent processing may be, for example, speech recognition, such as (but not limited to) recognition and analysis of a teacher's teaching content in a teaching video, or recognition for subtitles in a teleconference.
The first noise reduction module can be configured in several ways. Optionally, it calculates, for each audio segment to be output, the background noise frequency component estimate from the geometric mean or arithmetic mean of the frequency components of all background noise segments that precede the segment in its sequence. Optionally, it takes the frequency components of a background noise segment preceding the segment in its sequence as the estimate. Optionally, it calculates the estimate from the geometric mean or arithmetic mean of the frequency components of several background noise segments preceding the segment in its sequence. Optionally, it calculates the estimate from the geometric mean or arithmetic mean of the frequency components of all audio segments preceding the segment in its sequence (typically the frequency components of the background noise segments together with the frequency component estimates of the non-background-noise segments).
Optionally, the system by default treats the audio captured during a period immediately after each pickup is turned on as the initial background noise segment of the corresponding sequence.
The first embodiment is the method embodiment corresponding to the present embodiment; the technical details of the first embodiment apply to the present embodiment, and vice versa.
A third embodiment of the present application relates to an output method for N channels of audio from N sound pickups in a room, where the N sound pickups are disposed at different positions in the room and the N channels of audio correspond one-to-one to the N sound pickups. The flow of the output method is shown in fig. 3, and the method comprises the following steps:
In step 301, the N channels of audio are acquired.
Step 301 may be preceded by a preliminary screening of the N channels of audio. For example, the average amplitude of each channel is calculated, and every channel whose average amplitude is smaller than a second preset threshold is removed from the N channels, because such a channel can be regarded as mute (for example, but not limited to, silence caused by a failure of the corresponding pickup). As another example, the signal-to-noise ratio of each channel is calculated, and every channel whose signal-to-noise ratio is smaller than a third preset threshold is removed, because such a channel can be considered too noisy to be of use. As yet another example, both the average amplitude and the signal-to-noise ratio of each channel are calculated, and every channel whose average amplitude is smaller than the second preset threshold and whose signal-to-noise ratio is smaller than the third preset threshold is removed.
For example, but not limited to, a Wiener filtering method may be used to separate a speech component S and a noise component N from each channel's signal, from which the signal-to-noise ratio of each channel is calculated.
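A minimal sketch of this pre-screening, assuming the speech and noise components have already been separated (the thresholds, data layout, and function names are illustrative assumptions):

```python
import numpy as np

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB from separated speech/noise components."""
    p_s = np.mean(np.asarray(speech, dtype=float) ** 2)
    p_n = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_s / p_n)

def screen_channels(channels, amp_threshold, snr_threshold):
    """Keep only channels that are neither effectively mute nor too noisy.
    `channels` is a list of (samples, speech, noise) triples."""
    kept = []
    for samples, speech, noise in channels:
        if np.mean(np.abs(samples)) < amp_threshold:
            continue  # regarded as mute (e.g. pickup failure)
        if snr_db(speech, noise) < snr_threshold:
            continue  # too noisy to be of use
        kept.append(samples)
    return kept
```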
Then step 302 is entered, where the N channels of audio are noise-reduced using the noise reduction method of the first embodiment of the present application, and speech recognition is performed on all noise-reduced audio segments to be output to generate subtitles with time labels. Since step 302 adopts the noise reduction method for N channels of audio from N pickups in a room according to the first embodiment, the technical details of the first embodiment apply to this embodiment.
It should be noted that although in the present embodiment all audio segments to be output after the noise reduction of step 302 undergo speech recognition and the recognition result is used to generate subtitles with time labels, in other embodiments the recognition result may be used to generate other information depending on the application scenario. In remote teaching, for example, the recognition result may be used to identify the key knowledge points of the current lesson and highlight them on the terminal side, e.g. in the PPT shown by the students' terminal-side teaching software.
Then step 303 is entered to mix the N channels of audio.
Optionally, before this step 303, the following steps a to e may be further included:
In step a, each channel of audio is divided into a plurality of audio segments with time labels according to voice pauses, obtaining N audio segment sequences. In step b, the background noise segments among the audio segments are determined, and the frequency components of all background noise segments are calculated. In step c, for each audio segment in each sequence, a background noise frequency component estimate is calculated from the frequency components of the background noise segments preceding it in the sequence, and that estimate is subtracted from the segment's frequency components to perform noise reduction on the segment. In step d, each noise-reduced audio segment sequence is re-synthesized into one channel of audio based on the time labels. In step e, the re-synthesized N channels of audio are mixed.
Alternatively, steps d and e may be replaced by the following: for the N noise-reduced audio segment sequences, the N audio segments sharing each time label are mixed, and the mixed segments are then re-synthesized into one channel of audio based on the time labels.
Then step 304 is entered: based on the time labels, the subtitles and the mixed audio are synchronously output to the terminal.
In one embodiment, the terminal may further include a display for showing video, and the subtitles, the mixed audio, and the video may be synchronously output to the terminal based on the time labels.
The fourth embodiment of the present application relates to an output system for N channels of audio from N sound pickups in a room, where the N sound pickups are disposed at different positions in the room and the N channels of audio correspond one-to-one to the N sound pickups. The structure of the output system is shown in fig. 4; the system comprises a second acquisition module, a second noise reduction module, a subtitle generation module, a mixing module, and a synchronous output module.
The second acquisition module is configured to acquire the N channels of audio. The second noise reduction module performs noise reduction on the N channels of audio according to the noise reduction method of the first embodiment. The subtitle generation module performs speech recognition on all noise-reduced audio segments to be output to generate subtitles with time labels. The mixing module mixes the N channels of audio. The synchronous output module synchronously outputs the subtitles and the mixed audio to the terminal based on the time labels.
It should be noted that the noise reduction performed here adopts the noise reduction method for N channels of audio from N pickups in a room according to the first embodiment, so the technical details of the first embodiment apply to this embodiment.
Optionally, the terminal may further include a display for showing video; in that case the subtitles, the mixed audio, and the video may be synchronously output to the terminal based on the time labels.
The third embodiment is the method embodiment corresponding to the present embodiment; the technical details of the third embodiment apply to the present embodiment, and vice versa.
It should be noted that the Wiener filtering method mentioned in the present application is prior art and is therefore not described here.
It should be noted that, as those skilled in the art will understand, the functions of the modules shown in the embodiments of the noise reduction/output system for N channels of audio from N pickups in a room can be understood with reference to the description of the corresponding noise reduction/output method. Those functions may be implemented by a program (executable instructions) running on a processor, or by specific logic circuits. If implemented in the form of software function modules and sold or used as an independent product, the system of the embodiments of the present application may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solutions of the embodiments of the present application that in essence contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or other media capable of storing program code. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present application also provide a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method embodiments of the present application. Computer-readable storage media, including both volatile and non-volatile, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable storage media do not include transitory computer-readable media (transmission media) such as modulated data signals and carrier waves.
In addition, the embodiment of the application also provides a noise reduction system for N channels of audio from N sound pickups in a room, which comprises a memory for storing computer-executable instructions and a processor; the processor is configured to implement the steps of the first embodiment described above when executing the computer-executable instructions in the memory. The processor may be a central processing unit (Central Processing Unit, abbreviated as "CPU"), another general-purpose processor, a digital signal processor (Digital Signal Processor, abbreviated as "DSP"), an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as "ASIC"), and the like. The aforementioned memory may be a read-only memory (ROM), a random access memory (random access memory, RAM), a Flash memory (Flash), a hard disk, a solid state disk, or the like. The steps of the method disclosed in the embodiments of the present invention may be directly embodied in a hardware processor for execution, or may be executed by a combination of hardware and software modules in the processor.
The embodiment of the application also provides an output system of N paths of audio for N sound pickups indoors, which comprises a memory for storing computer executable instructions and a processor; the processor is configured to implement the steps of the third embodiment described above when executing the computer-executable instructions in the memory. The processor may be a central processing unit (Central Processing Unit, abbreviated as "CPU"), other general purpose processors, digital signal processors (Digital Signal Processor, abbreviated as "DSP"), application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as "ASIC"), and the like. The aforementioned memory may be a read-only memory (ROM), a random access memory (random access memory, RAM), a Flash memory (Flash), a hard disk, a solid state disk, or the like. The steps of the method disclosed in the embodiments of the present invention may be directly embodied in a hardware processor for execution, or may be executed by a combination of hardware and software modules in the processor.
It should be noted that in the present patent application, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", and any variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises it. In the present patent application, performing an action "according to" an element means performing the action at least according to that element, covering both the case where the action is performed solely according to the element and the case where it is performed according to the element together with other elements. Expressions such as "multiple" cover two, twice, and two kinds, as well as more than two, more than twice, and more than two kinds.
All documents mentioned in the present application are considered to be included in the disclosure of the present application in their entirety, as if each were individually incorporated by reference. Furthermore, it should be understood that the foregoing description covers only the preferred embodiments of the present invention and is not intended to limit its scope. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of one or more embodiments of the present disclosure shall fall within the scope of protection of one or more embodiments of the present disclosure.

Claims (5)

1. An output method for N channels of audio from N sound pickups in a room, wherein the N sound pickups are arranged in different directions in the room and N is greater than 1; the method comprises the following steps:
acquiring the N channels of audio;
dividing each channel of audio into a plurality of audio segments with time labels according to voice pauses, to obtain N audio segment sequences;
determining background noise segments among the audio segments, and calculating the frequency components of all the background noise segments;
for each group of audio segments with the same time label in the N audio segment sequences, selecting the audio segment with the best voice effect as the audio segment to be output, and calculating the frequency components of the audio segment to be output;
for each audio segment in each audio segment sequence, calculating a background noise frequency component estimate of the audio segment according to the frequency components of the background noise segments arranged in front of it in the sequence, and subtracting the estimate from the frequency components of the audio segment to perform noise reduction on the audio segment;
performing speech recognition on all the noise-reduced audio segments to be output to generate subtitles with time labels;
for the N noise-reduced audio segment sequences, mixing the N audio segments with the same time label, and re-synthesizing the mixed audio segments into one channel of audio based on the time labels;
and synchronously outputting the subtitles and the mixed audio to a terminal based on the time labels.
2. The output method for N channels of audio from N sound pickups in a room according to claim 1, further comprising, before said acquiring the N channels of audio:
calculating the average amplitude and/or signal-to-noise ratio of each channel of audio;
and removing from the N channels of audio any channel whose average amplitude is smaller than a second preset threshold and/or whose signal-to-noise ratio is smaller than a third preset threshold.
3. The output method for N channels of audio from N sound pickups in a room according to claim 1, wherein selecting, for each group of audio segments with the same time label in the N audio segment sequences, the audio segment with the best voice effect as the audio segment to be output further comprises:
for each group of audio segments with the same time label, calculating the signal-to-noise ratio and average amplitude of each audio segment;
and calculating a voice effect score for each audio segment from its signal-to-noise ratio and average amplitude, and selecting the audio segment with the highest score as the audio segment to be output.
4. The output method for N channels of audio from N sound pickups in a room according to claim 1, wherein calculating, for each audio segment to be output, the background noise frequency component estimate of the audio segment to be output from the frequency components of the background noise segments arranged in front of it in its sequence further comprises:
for each audio segment to be output, calculating the background noise frequency component estimate of the audio segment to be output from the geometric mean or arithmetic mean of the frequency components of all the background noise segments arranged in front of it in the sequence.
5. The output method for N channels of audio from N sound pickups in a room according to claim 1, wherein
calculating, for each audio segment to be output, the background noise frequency component estimate of the audio segment to be output from the frequency components of the background noise segments arranged in front of it in the sequence further comprises:
for each audio segment to be output, calculating the background noise frequency component estimate of the audio segment to be output from the geometric mean or arithmetic mean of the frequency components of the several background noise segments nearest to it that are arranged in front of it in the sequence.
CN202011191091.3A 2020-10-30 2020-10-30 Noise reduction and output method and system for multipath audio Active CN112309419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011191091.3A CN112309419B (en) 2020-10-30 2020-10-30 Noise reduction and output method and system for multipath audio


Publications (2)

Publication Number Publication Date
CN112309419A CN112309419A (en) 2021-02-02
CN112309419B true CN112309419B (en) 2023-05-02

Family

ID=74332853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011191091.3A Active CN112309419B (en) 2020-10-30 2020-10-30 Noise reduction and output method and system for multipath audio

Country Status (1)

Country Link
CN (1) CN112309419B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005086139A1 (en) * 2004-03-01 2005-09-15 Dolby Laboratories Licensing Corporation Multichannel audio coding
CA2645915A1 (en) * 2007-02-14 2008-08-21 Lg Electronics Inc. Methods and apparatuses for encoding and decoding object-based audio signals
CN101583859A (en) * 2006-12-15 2009-11-18 诺基亚公司 Memory-efficient system and method for high-quality codebook-based voice conversion
CN103167360A (en) * 2013-02-21 2013-06-19 中国对外翻译出版有限公司 Method for achieving multilingual subtitle translation
CN104681030A (en) * 2006-02-07 2015-06-03 LG Electronics Inc. Apparatus and method for encoding/decoding signal
KR101741699B1 (en) * 2016-01-27 2017-05-31 서강대학교산학협력단 Method of producing panoramic video having reduced discontinuity for images
CN108712642A (en) * 2018-04-20 2018-10-26 天津大学 A kind of three-dimensional subtitle point of addition automatic selecting method suitable for three-dimensional video-frequency
CN110634497A (en) * 2019-10-28 2019-12-31 普联技术有限公司 Noise reduction method and device, terminal equipment and storage medium

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7133825B2 (en) * 2003-11-28 2006-11-07 Skyworks Solutions, Inc. Computationally efficient background noise suppressor for speech coding and speech recognition
US20060241937A1 (en) * 2005-04-21 2006-10-26 Ma Changxue C Method and apparatus for automatically discriminating information bearing audio segments and background noise audio segments
US7382933B2 (en) * 2005-08-24 2008-06-03 International Business Machines Corporation System and method for semantic video segmentation based on joint audiovisual and text analysis
CN101309390B (en) * 2007-05-17 2012-05-23 华为技术有限公司 Visual communication system, apparatus and subtitle displaying method
KR101035726B1 (en) * 2008-12-23 2011-05-19 한국전자통신연구원 Method for adjusting gain of audio output signal
US20110246172A1 (en) * 2010-03-30 2011-10-06 Polycom, Inc. Method and System for Adding Translation in a Videoconference
US8782268B2 (en) * 2010-07-20 2014-07-15 Microsoft Corporation Dynamic composition of media
KR101233272B1 (en) * 2011-03-08 2013-02-14 고려대학교 산학협력단 Apparatus and method for processing speech in noise environment
CN103646654B (en) * 2013-12-12 2017-03-15 深圳市金立通信设备有限公司 A kind of recording data sharing method and terminal
CN105118511A (en) * 2015-07-31 2015-12-02 国网电力科学研究院武汉南瑞有限责任公司 Thunder identification method
CN105931647B (en) * 2016-04-05 2020-01-14 Oppo广东移动通信有限公司 Method and device for noise suppression
CN106340291A (en) * 2016-09-27 2017-01-18 广东小天才科技有限公司 Bilingual subtitle production method and system
US20200051582A1 (en) * 2018-08-08 2020-02-13 Comcast Cable Communications, Llc Generating and/or Displaying Synchronized Captions
CN108847215B (en) * 2018-08-29 2020-07-17 北京云知声信息技术有限公司 Method and device for voice synthesis based on user timbre
CN109545242A (en) * 2018-12-07 2019-03-29 广州势必可赢网络科技有限公司 A kind of audio data processing method, system, device and readable storage medium storing program for executing
CN111128213B (en) * 2019-12-10 2022-09-27 展讯通信(上海)有限公司 Noise suppression method and system for processing in different frequency bands
CN111161751A (en) * 2019-12-25 2020-05-15 声耕智能科技(西安)研究院有限公司 Distributed microphone pickup system and method under complex scene


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yang Yulian; Xie Lei. Automatic story segmentation of Chinese broadcast news based on subword chains. Application Research of Computers, 2009, (02), full text. *
Su Xiaohan; Feng Hongcai; Wu Shiyao. A multimodal video scene segmentation algorithm based on deep networks. Journal of Wuhan University of Technology (Information & Management Engineering), 2020, (03), full text. *

Also Published As

Publication number Publication date
CN112309419A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
JP6801023B2 (en) Volume leveler controller and control method
EP2979359B1 (en) Equalizer controller and controlling method
EP3598448A1 (en) Apparatuses and methods for audio classifying and processing
US20120275625A1 (en) Signal processing device, method thereof, program, and data recording medium
JP2010014960A (en) Voice/music determining apparatus and method, and program for determining voice/music
CN104205212A (en) Talker collision in auditory scene
JP4709928B1 (en) Sound quality correction apparatus and sound quality correction method
JP2005530213A (en) Audio signal processing device
CN112309419B (en) Noise reduction and output method and system for multipath audio
CN111243618B (en) Method, device and electronic equipment for determining specific voice fragments in audio
CN115731943A (en) Plosive detection method, plosive detection system, storage medium and electronic equipment
Moinet et al. Audio time-scaling for slow motion sports videos
CN115567845A (en) Information processing method and device
JP2011013383A (en) Audio signal correction device and audio signal correction method
CN112750456A (en) Voice data processing method and device in instant messaging application and electronic equipment
JP2008060725A (en) Sound image localization-enhanced reproduction method, device thereof, program thereof, and storage medium therefor
JP2012027101A (en) Sound playback apparatus, sound playback method, program, and recording medium
JP2011211547A (en) Sound pickup apparatus and sound pickup system
Stokes Improving the perceptual quality of single-channel blind audio source separation
JP2009181044A (en) Voice signal processor, voice signal processing method, program and recording medium
JP6226465B2 (en) Audio signal processing apparatus, recording / reproducing apparatus, and program
CN114678038A (en) Audio noise detection method, computer device and computer program product
Koria Real-Time Adaptive Audio Mixing System Using Inter-Spectral Dependencies
JP2009192739A (en) Speech signal processing apparatus, speech signal processing method, program, and recording medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant