CN117133305B - Stereo noise reduction method, apparatus and storage medium
- Publication number: CN117133305B
- Application number: CN202310481612.6A
- Authority: CN (China)
- Prior art keywords: microphone, audio, noise, noise reduction, audio signal
- Legal status: Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K11/00—Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/16—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/175—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
- G10K11/1752—Masking
- G10K11/1754—Speech masking
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/02—Constructional features of telephone sets
- H04M1/19—Arrangements of transmitters, receivers, or complete sets to prevent eavesdropping, to attenuate local noise or to prevent undesired transmission; Mouthpieces or receivers specially adapted therefor
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/72—Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
- H04M1/724—User interfaces specially adapted for cordless or mobile telephones
- H04M1/72403—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
- H04M1/7243—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
- H04M1/72433—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/72—Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
- H04M1/724—User interfaces specially adapted for cordless or mobile telephones
- H04M1/72448—User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
- H04M1/72454—User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions according to context-related or environment-related conditions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R5/00—Stereophonic arrangements
- H04R5/04—Circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments
Abstract
The application provides a stereo noise reduction method, apparatus, and storage medium. After the audio noise reduction function is enabled, the method determines an amplitude mask applicable to both the first audio signal and the second audio signal using a neural network model trained on single-channel noisy audio training data, and then processes the two audio signals with the same amplitude mask. In this way, noise in the two channels of the stereo is reduced while the spatial characteristics of the stereo are preserved, improving the audio and video recording experience.
Description
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a stereo noise reduction method, device, and storage medium.
Background
With the development of stereo technology, terminal devices capable of stereo recording and playback are increasingly popular with users. Taking stereo in audio and video recording as an example, the stereo in this scenario is dual-channel audio captured by two or more microphones located at different positions. Compared with a mono recording scene, a stereo recording scene can present the spatial characteristics of the audio, yielding a richer listening experience.
However, in an audio or video recording scene, the recorded stereo is subject to environmental influences and may therefore contain not only the target speech signal but also noise signals other than the target speech signal.
Therefore, there is a need for a method of reducing noise in recorded stereo that can suppress noise while preserving the fidelity of the target speech signal and the spatial characteristics of the stereo, thereby improving the audio and video recording experience.
Disclosure of Invention
To solve the above technical problems, the application provides a stereo noise reduction method, a device, and a storage medium, which aim to perform noise reduction while ensuring the fidelity of the target speech signal in the stereo and preserving the spatial characteristics of the stereo, thereby improving the audio and video recording experience.
In a first aspect, the present application provides a stereo noise reduction method. The method is applied to a terminal device and comprises: displaying a first interface, wherein the first interface displays an audio noise reduction option; when the audio noise reduction option is in a first state and a click operation on the audio noise reduction option is received, switching the audio noise reduction option to a second state and enabling the audio noise reduction function; after the audio noise reduction function is enabled, determining noisy audio according to the microphone state of a first microphone, the microphone state of a second microphone, a first audio signal collected by the first microphone, and a second audio signal collected by the second microphone, wherein the first microphone and the second microphone are located at different positions; inputting the noisy audio into a neural network model to obtain an amplitude mask of the noisy audio, wherein the neural network model is trained on noisy audio training data obtained by fusing noise data onto single-channel noise-free audio; and masking the first audio signal and the second audio signal with the amplitude mask to obtain noise-reduced stereo.
In this way, after the audio noise reduction function is enabled, an amplitude mask applicable to both the first audio signal and the second audio signal is determined by a neural network model trained on single-channel noisy audio training data, and the two audio signals are then processed with the same amplitude mask. Noise in the two channels of the stereo is thereby reduced while the spatial characteristics of the stereo are preserved, improving the audio and video recording experience.
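As an illustration only, the following Python sketch outlines this masking flow for the case in which both microphones work normally; the names and parameters used here (stft, model, frame_len, hop) are assumptions for the example, not an implementation disclosed by the application.

```python
import numpy as np

def reduce_stereo_noise(x_left, x_right, model, frame_len=512, hop=256):
    """Apply a single amplitude mask, predicted from the noisy audio,
    to both channels so the inter-channel (spatial) cues are preserved."""
    win = np.hanning(frame_len)

    def stft(x):
        frames = [x[i:i + frame_len] * win
                  for i in range(0, len(x) - frame_len + 1, hop)]
        return np.fft.rfft(np.asarray(frames), axis=-1)

    def istft(spec, length):
        out = np.zeros(length)
        for k, frame in enumerate(np.fft.irfft(spec, n=frame_len, axis=-1)):
            out[k * hop:k * hop + frame_len] += frame
        return out

    spec_l, spec_r = stft(x_left), stft(x_right)

    # Noisy audio: mean of the two channels' frequency-domain features.
    noisy_spec = 0.5 * (spec_l + spec_r)

    # The neural network predicts one amplitude mask per time-frequency bin.
    mask = model(np.abs(noisy_spec))          # values in [0, 1]

    # The same mask scales the magnitudes of both channels; phases are kept.
    den_l = mask * np.abs(spec_l) * np.exp(1j * np.angle(spec_l))
    den_r = mask * np.abs(spec_r) * np.exp(1j * np.angle(spec_r))
    return istft(den_l, len(x_left)), istft(den_r, len(x_right))
```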
According to the first aspect, determining the noisy audio according to the microphone state of the first microphone, the microphone state of the second microphone, the first audio signal collected by the first microphone, and the second audio signal collected by the second microphone comprises: when the microphone state of the first microphone indicates that the first microphone works normally and the microphone state of the second microphone indicates that the second microphone works normally, determining the noisy audio according to the first audio signal and the second audio signal; when the microphone state of the first microphone indicates that the first microphone works normally and the microphone state of the second microphone indicates that the second microphone is abnormal, determining the first audio signal as the noisy audio; and when the microphone state of the first microphone indicates that the first microphone is abnormal and the microphone state of the second microphone indicates that the second microphone works normally, determining the second audio signal as the noisy audio.
In this way, the method adapts to a variety of practical microphone conditions, as sketched below.
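A minimal sketch of this branching, assuming hypothetical microphone state codes and precomputed frequency-domain features for each channel:

```python
MIC_NORMAL, MIC_ABNORMAL = 0, 1   # hypothetical state codes

def select_noisy_audio(state_1, state_2, feat_1, feat_2):
    """Choose the noisy-audio features fed to the neural network model
    according to the working state of the first and second microphones."""
    if state_1 == MIC_NORMAL and state_2 == MIC_NORMAL:
        return 0.5 * (feat_1 + feat_2)   # both normal: combine both channels
    if state_1 == MIC_NORMAL:
        return feat_1                    # only the first microphone is normal
    if state_2 == MIC_NORMAL:
        return feat_2                    # only the second microphone is normal
    raise RuntimeError("both microphones abnormal: prompt the user instead")
```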
According to the first aspect, or any implementation of the first aspect, when the microphone state of the first microphone indicates that the first microphone works normally and the microphone state of the second microphone indicates that the second microphone works normally, determining the noisy audio according to the first audio signal and the second audio signal comprises: averaging the frequency-domain features of the first audio signal and the frequency-domain features of the second audio signal; and taking the audio signal corresponding to the averaged frequency-domain features as the noisy audio.
In this way, the mean of the frequency-domain features of the two normal audio signals is used as the noisy audio, so that the amplitude mask determined by the neural network model better reflects the frequency-domain relationship between the clean audio signal (the target audio signal, i.e. the signal to be preserved) and the noise audio signal (the signal to be suppressed or removed) in the two channels. This ensures that the target speech signal in the noise-reduced stereo obtained by masking with this amplitude mask meets the required fidelity, while the spatial characteristics of the stereo are preserved.
According to the first aspect, or any implementation of the first aspect, when the microphone state of the first microphone indicates that the first microphone works normally and the microphone state of the second microphone indicates that the second microphone works normally, determining the noisy audio according to the first audio signal and the second audio signal comprises: selecting the audio signal with lower energy from the first audio signal and the second audio signal as the noisy audio.
Because the energy of the target audio signal is usually similar in the first and second audio signals collected by the two microphones, a large difference in signal-to-noise ratio is mainly caused by noise energy. Selecting the audio signal with lower overall energy as the noisy audio for determining the amplitude mask reduces harmonic distortion, so the target audio signal in the noise-reduced stereo obtained by masking with that amplitude mask is better preserved. A sketch of this selection follows.
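As a sketch only (the energy measure and the names used here are assumptions), the lower-energy channel could be selected as follows:

```python
import numpy as np

def pick_lower_energy_channel(x_1, x_2):
    """Return the channel with lower overall energy as the noisy audio,
    on the assumption that the louder channel mostly carries extra noise."""
    e_1 = float(np.sum(np.asarray(x_1, dtype=np.float64) ** 2))
    e_2 = float(np.sum(np.asarray(x_2, dtype=np.float64) ** 2))
    return x_1 if e_1 <= e_2 else x_2
```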
According to the first aspect, or any implementation manner of the first aspect, the method further includes: displaying a prompt window in the first interface when the microphone state of the first microphone indicates that the first microphone is abnormal and the microphone state of the second microphone indicates that the second microphone is abnormal; the prompt window displays prompt information of abnormality of the first microphone and the second microphone.
In this way, when both microphones are abnormal, relevant prompt information is displayed on the first interface. This prevents the user from unknowingly recording audio or video with abnormal microphones and ending up with a recorded file that is silent or contains only interference audio.
According to the first aspect, or any implementation of the first aspect, after the audio noise reduction function is enabled, the method further comprises: when a click operation on the audio noise reduction option is received, switching the audio noise reduction option to the first state and disabling the audio noise reduction function; and after the audio noise reduction function is disabled, synthesizing non-noise-reduced stereo from the first audio signal and the second audio signal.
In this way, the audio noise reduction option operable by the user is always displayed during audio and video recording, so the user can enable or disable the audio noise reduction function as needed during recording, better meeting the user's requirements in different scenarios.
According to the first aspect, or any implementation of the first aspect, the first interface further displays an end-recording option, and the method further comprises: after the audio noise reduction function is enabled, when a click operation on the end-recording option is received, restoring the audio noise reduction option to the first state and disabling the audio noise reduction function.
Thus, regardless of the state of the audio noise reduction option before recording ends, the option is restored to the first state by default after each recording. In this way, an implementation does not need to additionally maintain and manage a file that records the state information of the audio noise reduction option.
According to the first aspect, or any implementation of the first aspect, the first interface further displays an end-recording option, and the method further comprises: after the audio noise reduction function is enabled, when a click operation on the end-recording option is received, recording the second state; and when recording is triggered again, directly enabling the audio noise reduction function according to the recorded second state.
In this way, the audio noise reduction option keeps its current state after each recording ends, and that state is carried over the next time recording starts, which better matches users' actual usage habits.
According to the first aspect, or any implementation of the first aspect, the sampling frequency of the noisy audio training data is 16 kHz, and the bandwidth of the amplitude mask determined by the neural network model trained on this data is between 0 kHz and 8 kHz.
According to the first aspect, or any implementation of the first aspect, inputting the noisy audio into the neural network model to obtain the amplitude mask of the noisy audio comprises: when the sampling frequency of the noisy audio is 16 kHz, inputting the frequency-domain features corresponding to the noisy audio into the neural network model to obtain an amplitude mask with a bandwidth between 0 kHz and 8 kHz; and taking the obtained amplitude mask with a bandwidth between 0 kHz and 8 kHz as the amplitude mask of the noisy audio.
Because the neural network model is trained on noisy audio training data with a sampling frequency of 16 kHz, in a stereo noise reduction scenario with a 16 kHz sampling frequency, the amplitude mask determined by the neural network model is used directly as the amplitude mask of the noisy audio without further processing, which is convenient and fast.
According to the first aspect, or any implementation manner of the first aspect, the method further includes: when the sampling frequency of the noise-containing audio is 32kHz or 48kHz, inputting the frequency domain characteristics corresponding to the noise-containing frequency into a neural network model to obtain an amplitude mask with the bandwidth between 0kHz and 8 kHz; taking the average value of amplitude masks with bandwidths between 0kHz and 8kHz output by the neural network model as the amplitude mask corresponding to the bandwidths above 8 kHz; fusing an amplitude mask with the bandwidth between 0kHz and 8kHz output by the neural network model and an amplitude mask corresponding to the bandwidth above 8kHz according to a set gain proportion to obtain an amplitude mask corresponding to the bandwidth near 8 kHz; and taking an amplitude mask with the bandwidth between 0kHz and 8kHz, an amplitude mask corresponding to the bandwidth near 8kHz and an amplitude mask corresponding to the bandwidth above 8kHz as the amplitude mask of the noise-containing audio.
Although the neural network model is trained on noisy audio training data sampled at 16 kHz, human speech is concentrated in the low and middle frequencies, so the model remains applicable to stereo noise reduction scenarios with a sampling frequency of 32 kHz or 48 kHz. To cover the bandwidth above 8 kHz, the frequency band is extended based on the amplitude mask determined by the neural network model, ensuring spectral continuity, as sketched below.
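The following sketch, with assumed names and a hypothetical transition band, illustrates one way of extending the 0-8 kHz mask produced by the model to the full bandwidth of a 32 kHz or 48 kHz signal: the mean of the low-band mask is used above 8 kHz, and a gain-weighted blend around 8 kHz keeps the spectrum continuous.

```python
import numpy as np

def extend_mask(mask_low, n_bins_full, n_transition=8, gain=0.5):
    """mask_low: per-frame amplitude mask covering 0-8 kHz (frames x low_bins).
    n_bins_full: number of frequency bins at the actual sampling rate.
    Returns a mask of shape (frames, n_bins_full)."""
    frames, low_bins = mask_low.shape

    # Bins above 8 kHz: use the mean of each frame's low-band mask.
    high_val = mask_low.mean(axis=1, keepdims=True)
    mask_high = np.repeat(high_val, n_bins_full - low_bins, axis=1)

    full = np.concatenate([mask_low, mask_high], axis=1)

    # Blend a few bins around 8 kHz with a set gain ratio for continuity.
    lo, hi = low_bins - n_transition, low_bins + n_transition
    full[:, lo:hi] = gain * full[:, lo:hi] + (1.0 - gain) * high_val
    return full
```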
According to the first aspect, or any implementation of the first aspect, fusing the noise data onto the single-channel noise-free audio comprises: dividing the single-channel noise-free audio and the noise data into the same number of noise-free audio time frames and noise audio time frames according to the same time periods; and, for noise-free audio time frames and noise audio time frames at the same time, superposing the frequency-domain features of the noise audio time frame onto the frequency-domain features of the noise-free audio time frame.
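A simplified sketch of generating such training data, assuming the clean single-channel audio and the noise are framed over the same time periods and their frequency-domain features are superposed frame by frame (all names are illustrative):

```python
import numpy as np

def make_noisy_training_frames(clean, noise, frame_len=512, hop=256):
    """Split clean single-channel audio and noise into the same number of
    time frames and superpose their frequency-domain features per frame."""
    win = np.hanning(frame_len)
    n_frames = (min(len(clean), len(noise)) - frame_len) // hop + 1

    noisy_specs, clean_specs = [], []
    for k in range(n_frames):
        seg = slice(k * hop, k * hop + frame_len)
        clean_spec = np.fft.rfft(clean[seg] * win)
        noise_spec = np.fft.rfft(noise[seg] * win)
        noisy_specs.append(clean_spec + noise_spec)  # frequency-domain superposition
        clean_specs.append(clean_spec)

    # (noisy feature, clean target) pairs used to train the mask-predicting model
    return np.asarray(noisy_specs), np.asarray(clean_specs)
```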
According to the first aspect, or any implementation of the first aspect, the neural network model comprises a convolutional network layer, a long short-term memory network layer, and a fully connected network layer arranged in sequence; the fully connected network layer maps the local features of the frequency-domain features extracted by the convolutional network layer and the temporal features extracted by the long short-term memory network layer to the feature dimension corresponding to the amplitude mask.
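A minimal PyTorch sketch of this layer order (convolutional layer, long short-term memory layer, fully connected layer); the layer sizes here are placeholders rather than values disclosed by the application:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Conv layer for local spectral features, LSTM for temporal features,
    fully connected layer mapping to the amplitude-mask dimension."""
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.conv = nn.Conv1d(n_bins, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_bins)

    def forward(self, noisy_mag):               # (batch, frames, n_bins)
        x = self.conv(noisy_mag.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return torch.sigmoid(self.fc(x))        # amplitude mask in [0, 1]
```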
According to a first aspect, or any implementation of the first aspect above, the first microphone is located at a top of the terminal device and the second microphone is located at a bottom of the terminal device.
According to the first aspect, or any implementation of the first aspect, the terminal device establishes communication links with the left earphone and the right earphone of a true wireless stereo earphone, respectively; the first microphone is located in the left earphone and the second microphone is located in the right earphone.
According to a first aspect, or any implementation manner of the first aspect, the first interface is a video recording interface corresponding to a camera application.
According to the first aspect, or any implementation manner of the first aspect, the first interface is a recording interface corresponding to a recording application.
In a second aspect, the present application provides a terminal device. The terminal device comprises a memory and a processor, the memory being coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the terminal device to perform the method of the first aspect or any possible implementation of the first aspect.
The second aspect and any implementation of the second aspect correspond to the first aspect and any implementation of the first aspect, respectively. For the technical effects of the second aspect and any implementation of the second aspect, reference may be made to the technical effects of the first aspect and the corresponding implementations of the first aspect, which are not repeated here.
In a third aspect, the application provides a computer readable medium storing a computer program comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
The third aspect and any implementation of the third aspect correspond to the first aspect and any implementation of the first aspect, respectively. For the technical effects of the third aspect and any implementation of the third aspect, reference may be made to the technical effects of the first aspect and the corresponding implementations of the first aspect, which are not repeated here.
In a fourth aspect, the present application provides a computer program comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
The fourth aspect and any implementation of the fourth aspect correspond to the first aspect and any implementation of the first aspect, respectively. For the technical effects of the fourth aspect and any implementation of the fourth aspect, reference may be made to the technical effects of the first aspect and the corresponding implementations of the first aspect, which are not repeated here.
In a fifth aspect, the present application provides a chip comprising a processing circuit and transceiver pins. The transceiver pins and the processing circuit communicate with each other via an internal connection path, and the processing circuit performs the method of the first aspect or any possible implementation of the first aspect to control the receive pin to receive signals and the transmit pin to transmit signals.
The fifth aspect and any implementation of the fifth aspect correspond to the first aspect and any implementation of the first aspect, respectively. For the technical effects of the fifth aspect and any implementation of the fifth aspect, reference may be made to the technical effects of the first aspect and the corresponding implementations of the first aspect, which are not repeated here.
Drawings
Fig. 1 is a schematic diagram of a hardware structure of an exemplary terminal device;
FIGS. 2 to 5 are schematic diagrams of a user interface for turning the audio noise reduction function on or off in a stereo recording scenario;
FIG. 6 is a schematic diagram of another user interface for turning the audio noise reduction function on or off;
fig. 7 is a schematic flow chart of a stereo noise reduction method according to an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of stereo noise reduction processing logic provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of still another stereo noise reduction processing logic provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram illustrating interaction of a terminal device with a server;
Fig. 11 is a schematic diagram of a software structure of an exemplary terminal device;
Fig. 12 is a schematic diagram illustrating exemplary software and hardware interactions.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without creative effort shall fall within the protection scope of the application.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone.
The terms "first", "second", and the like in the description and claims of the embodiments of the application are used to distinguish between different objects, not to describe a particular order of the objects. For example, a first target object and a second target object are used to distinguish different target objects, not to describe a particular order of target objects.
In the embodiments of the application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, such words are intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise indicated, "a plurality" means two or more. For example, a plurality of processing units means two or more processing units, and a plurality of systems means two or more systems.
In order to better understand the technical solution provided by the embodiments of the present application, before describing the technical solution of the embodiments of the present application, a description is first given of a hardware structure of a terminal device (for example, a mobile phone, a tablet computer, etc.) applicable to the embodiments of the present application with reference to the accompanying drawings.
It should be noted that, the technical solution provided in the embodiment of the present application is particularly suitable for a terminal device capable of stereo recording, for example, a mobile phone, a tablet computer, etc. with at least 2 microphones located at different positions, or a mobile phone, a tablet computer, a smart watch, etc. connected with a stereo headset. For convenience of explanation, a mobile phone will be described below as an example.
For example, for a mobile phone with 2 microphones, one microphone is disposed at the bottom of the mobile phone, for example on one or both sides of the charging interface; the other microphone is disposed, for example, at the top of the mobile phone, such as near the rear camera or near the front of the handset.
For example, the stereo headset connected to the mobile phone may be a true wireless stereo headset (True Wireless Stereo, TWS headset). In particular, the following embodiments of the application take a TWS headset as an example of the two microphones used for collecting audio signals.
In addition, because the technical solution provided by the embodiments of the application is a noise reduction scheme for stereo, when a TWS headset is used as the two microphones for collecting audio signals, both the left earphone and the right earphone of the TWS headset must be successfully connected to the mobile phone in order to record stereo.
Referring to fig. 1, the mobile phone 100 may include: processor 110, external memory interface 120, internal memory 121, universal serial bus (universal serial bus, USB) interface 130, charge management module 140, power management module 141, battery 142, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, sensor module 180, keys 190, motor 191, indicator 192, camera 193, display 194, and subscriber identity module (subscriber identification module, SIM) card interface 195, among others.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor (Modem), a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc., which are not further enumerated herein, and the application is not limited in this regard.
The controller as the processing unit may be a neural center or a command center of the mobile phone 100. In practical application, the controller can generate operation control signals according to the instruction operation codes and the time sequence signals to complete instruction fetching and instruction execution control.
With respect to the modem processor described above, a modulator and demodulator may be included. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal and transmitting the low-frequency baseband signal obtained by demodulation to the baseband processor for processing.
The baseband processor is used for processing the low-frequency baseband signal transmitted by the demodulator and transmitting the processed low-frequency baseband signal to the application processor.
It should be noted that in some implementations, the baseband processor may be integrated within the modem, i.e., the modem may be provided with the functionality of the baseband processor.
With respect to the above-mentioned application processor, it is used to output sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or to display images or videos through the display screen 194.
The above-mentioned digital signal processor is used for processing digital signals. In addition to digital image signals, it can process other digital signals. For example, when the handset 100 selects a frequency point, the digital signal processor may be used to perform a Fourier transform on the frequency point energy. In particular, in the technical solution provided by the application, the digital signal processor can be used to perform a Fourier transform on a 48 kHz audio signal, thereby obtaining a spectrum extending to 24 kHz.
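For illustration only (not a disclosed implementation), a real FFT of a frame sampled at 48 kHz yields frequency bins extending up to the 24 kHz Nyquist limit:

```python
import numpy as np

fs = 48_000                                   # sampling rate in Hz
frame = np.random.randn(1024)                 # one frame of a 48 kHz audio signal
spectrum = np.fft.rfft(frame)
freqs = np.fft.rfftfreq(frame.size, d=1.0 / fs)
print(freqs[-1])                              # 24000.0 Hz, i.e. a 24 kHz spectrum
```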
The neural network processor described above, particularly in the technical solution provided in the present application, may be used to train the neural network model for performing noise reduction processing on stereo sound described in the embodiments of the present application. Understandably, in order to reduce the resource occupation of the mobile phone 100, the neural network model may be trained by a cloud server or other servers and issued to the mobile phone 100.
With respect to the video codec described above, it is used for compressing or decompressing digital video. Illustratively, the handset 100 may support one or more video codecs. In this way, the mobile phone 100 can play or record video in multiple coding formats, for example: moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, etc.
The ISP is used for outputting the digital image signal to the DSP processing. Specifically, the ISP is used to process data fed back by the camera 193. For example, when photographing and video recording, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some implementations, the ISP may be provided in the camera 193.
The DSP is used to convert digital image signals into standard RGB, YUV, and other image signals.
Furthermore, it should be noted that with respect to the processor 110 including the processing units described above, in some implementations, the different processing units may be separate devices. That is, each processing unit may be considered a processor. In other implementations, different processing units may also be integrated in one or more processors. For example, in some implementations, the modem processor may be a stand-alone device. In other implementations, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment.
Further, the processor 110 may also include one or more interfaces. The interfaces may include, but are not limited to, an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, among others.
Further, a memory may be provided in the processor 110 for storing instructions and data. In some implementations, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
With continued reference to FIG. 1, the external memory interface 120 may be used to interface with an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the handset 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
With continued reference to fig. 1, internal memory 121 may be used to store computer-executable program code, including instructions. The processor 110 executes various functional applications of the cellular phone 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, and a stereo recording function in the embodiment of the present application) required for at least one function, and the like. The data storage area may store data created during use of the mobile phone 100 (such as stereo audio data recorded based on the technical scheme provided in the embodiment of the present application), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
With continued reference to fig. 1, the charge management module 140 is operable to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging implementations, the charge management module 140 may receive a charging input of the wired charger through the USB interface 130. In some wireless charging implementations, the charge management module 140 may receive wireless charging input through a wireless charging coil of the cell phone 100. The charging management module 140 may also supply power to the terminal device through the power management module 141 while charging the battery 142.
With continued reference to fig. 1, the power management module 141 is configured to connect the battery 142, the charge management module 140, and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor battery capacity, battery cycle number, battery health (leakage, impedance) and other parameters. In other implementations, the power management module 141 may also be provided in the processor 110. In other implementations, the power management module 141 and the charge management module 140 may also be disposed in the same device.
With continued reference to fig. 1, the wireless communication function of the handset 100 may be implemented by an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the handset 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other implementations, the antenna may be used in conjunction with a tuning switch.
With continued reference to fig. 1, the mobile communication module 150 may provide a solution for wireless communications, including 2G/3G/4G/5G, applied to the handset 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some implementations, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some implementations, at least some of the functional modules of the mobile communication module 150 may be disposed in the same device as at least some of the modules of the processor 110.
With continued reference to fig. 1, the wireless communication module 160 may provide solutions for wireless communication applied to the handset 100, including wireless local area network (WLAN) (e.g., wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency-modulates and filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency-modulate and amplify it, and convert it into electromagnetic waves for radiation via the antenna 2.
It should be noted that, in the following implementation manner, the neural network model for stereo noise reduction may be obtained through training by a cloud server or other servers. For such an implementation scenario, the handset 100 may communicate with a cloud server or other server providing a neural network through the mobile communication module 150 or the wireless communication module 160. For example, the mobile phone 100 may send a request to the cloud server to obtain or update the neural network model through the mobile communication module 150. Accordingly, the cloud server may issue a trained neural network model to the mobile phone 100 according to the request of the mobile phone 100.
In addition, it should be further noted that, in the scenario where the neural network model is trained by the cloud server (or other servers), the cloud server may customize the neural network model suitable for different mobile phones 100 according to the customization requirements corresponding to the mobile phones 100 configured differently, and update and iterate the neural network model according to the noise reduction results fed back by different mobile phones 100.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment.
With continued reference to fig. 1, the audio module 170 may include a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, and the like. Illustratively, the handset 100 may implement audio functionality through a speaker 170A, a receiver 170B, a microphone 170C, an earpiece interface 170D, etc. in the application processor and audio module 170. Such as an audio and video recording function.
Wherein, in implementing the audio function by the application processor and the audio module 170, the audio module 170 may be used to convert digital audio information into an analog audio signal output, as well as to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some implementations, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
In particular, in the embodiment of the present application, the mobile phone 100 capable of implementing stereo recording needs to include at least 2 microphones 170C. The positions of the 2 microphones 170C may be referred to above, and will not be described here.
With continued reference to fig. 1, the sensor module 180 may include a pressure sensor, a gyroscope sensor, a barometric sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc., which are not further illustrated herein, but are not limiting.
With continued reference to fig. 1, the keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The handset 100 may receive key inputs, generating key signal inputs related to user settings and function control of the handset 100.
With continued reference to fig. 1, motor 191 may generate a vibration alert. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback.
With continued reference to fig. 1, the indicator 192 may be an indicator light, may be used to indicate a state of charge, a change in charge, an indication message, a missed call, a notification, or the like.
With continued reference to fig. 1, a camera 193 is used to capture still images or video. The mobile phone 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display 194, an application processor, and the like. Specifically, the object generates an optical image through a lens and projects the optical image onto a photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some implementations, the cell phone 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
With continued reference to FIG. 1, a display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. In some implementations, the cell phone 100 may include 1 or N display screens 194, N being a positive integer greater than 1. The cell phone 100 may implement display functions through a GPU, a display 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
As to the hardware architecture of the handset 100, it should be understood that the handset 100 shown in fig. 1 is only one example, and in a specific implementation, the handset 100 may have more or fewer components than shown in the figures, may combine two or more components, or may have a different configuration of components. The various components shown in fig. 1 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
Based on the mobile phone with the structure shown in fig. 1, when a user records audio or video with the mobile phone, dual-channel audio can be captured through two microphones located at different positions, or through a TWS headset connected to the mobile phone 100, i.e., stereo recording is realized.
However, in an audio or video recording scene, due to environmental influences, the recorded stereo may include not only the target speech signal but also noise signals other than the target speech signal. In order to improve the user experience, the application provides a stereo noise reduction method, which aims to perform noise reduction while ensuring the fidelity of the target speech signal in the stereo and preserving the spatial characteristics of the stereo, thereby improving the audio and video recording experience.
In an exemplary embodiment of the stereo noise reduction method provided by the embodiments of the application, in a recording scenario where the user wishes to retain speech (the target speech signal) and eliminate or suppress sounds other than speech (noise signals), such as an interview, a speech, or a conference, the user may enable the audio noise reduction function. After the audio noise reduction function is enabled, the terminal device determines, using a neural network model trained on single-channel noisy audio training data, an amplitude mask suitable for noise reduction of the two audio signals collected by the two microphones or by the left and right earphones of a TWS headset, and then processes the two audio signals with the same amplitude mask. Noise in the two channels of the stereo is thereby reduced while the spatial characteristics of the stereo are preserved, improving the audio and video recording experience.
In another exemplary embodiment of the stereo noise reduction method provided by the embodiments of the application, in a recording scenario where the user wishes to retain ambient sound, for example a concert or an outdoor scene where the user wants to keep the sound of musical instruments or of nature, the user may disable the audio noise reduction function, so that the terminal device synthesizes stereo directly from the recorded original audio.
Taking a mobile phone as the terminal device, the changes of the user interface in an audio/video recording scenario based on the stereo noise reduction method provided by the embodiments of the present application are shown, for example, in fig. 2 to 5.
Referring to (1) in fig. 2, an interface 10a of the mobile phone is exemplarily shown. One or more controls may be included on the interface 10a, such as a power icon, a network icon, and various application icons. The application icons include, for example, an icon S1 of a camera application, an icon S2 of a settings application, an icon of a recorder application, and the like.
Illustratively, when the user clicks the icon S1 of the camera application in the interface 10a, the mobile phone will start the camera application in response to the operation, and the interface will switch from the interface 10a shown in fig. 2 (1) to the interface 10b shown in fig. 2 (2).
Referring to fig. 2 (2), exemplarily, one or more controls may be included on interface 10b. These controls include, but are not limited to: a preview window S3, a shutter control S4, a front and rear camera switching control S5, a shooting mode list S6, a magnification selection list S7, a function option list (content displayed in the upper area of the preview window S3), a picture viewing control (the control on the left side of the shutter control S4), and the like.
The preview window S3 may display the image acquired by the camera in real time. The shutter control S4 may monitor a user operation triggering photographing, that is, when the mobile phone detects a user operation acting on the shutter control S4, photographing is performed in response to the operation, and the captured image is stored in the gallery application. The front and rear camera switching control S5 responds to the user's click operation; for example, when the front camera is currently in use and the user clicks the front and rear camera switching control S5, the mobile phone switches to the rear camera for shooting in response to the operation. Shown in the shooting mode list S6 are shooting modes selectable by the user, such as an aperture mode, a night view mode, a portrait mode, a photographing mode, a video mode, a smiling face mode, and the like. Shown in the magnification selection list S7 are magnification factors selectable by the user, such as 0.6 times (super wide angle), 1 times (main lens), 2.5 times (tele lens), 10 times (super tele lens), and the like.
Illustratively, in the interface 10b shown in (2) of fig. 2, taking the shooting mode selected in the shooting mode list S6 as "shooting", the magnification selected in the magnification selection list S7 is 1-fold as an example.
For example, when the user slides the shooting mode in the shooting mode list S6 to the left, slides "video" to the position where "shoot" is located in fig. 2 (2), or directly clicks the "video" option in the shooting mode list S6, the mobile phone switches the shooting mode from "shoot" mode to "video" mode in response to the operation behavior, as shown in fig. 3 (1).
It will be appreciated that the video recording is a continuous operation, and therefore, when the shooting mode is switched to the "video recording" mode, the shutter control S4 will be switched to the video recording control S4' shown in fig. 3 (1).
Illustratively, when the user clicks the record control S4', the mobile phone starts recording (recording the sound and picture of the current environment) in response to the operation, and the interface of the mobile phone is switched from the interface 10b to the record interface, such as the interface 10c shown in fig. 3 (2).
It should be noted that, in order to implement the stereo noise reduction method provided by the embodiment of the present application in the video mode, an option for the user to turn on or off the audio noise reduction function may be set in the interface 10 c. Referring to fig. 3 (2), in an exemplary interface 10c corresponding to the video recording mode, in addition to a preview window S3 for displaying a recording picture, a front-back camera switching control S5, and a multiple selection list S7, an audio noise reduction option S8, a recording time display control S9, a photographing control S10 for photographing a frame picture in video recording, a control S4″ operated in the video recording process, and the like may be included.
It will be appreciated that during recording, the user may pause the recording halfway or end the recording directly. Thus, to facilitate the user clicking on different controls as desired, control S4 "may include control S4-1" for pausing recording and control S4-2 "for ending recording.
In addition, it should be noted that, for the audio noise reduction option S8, different states may be set, so that the audio noise reduction function is turned on or turned off by one option. In particular, in this embodiment, the pattern shown in the audio noise reduction option S8 indicates that it is in the first state, and the pattern shown in the audio noise reduction option S8' indicates that it is in the second state.
For example, when the audio noise reduction option is in the first state, it may be agreed that the mobile phone turns off the audio noise reduction function; that is, during video recording, the mobile phone does not use the stereo noise reduction method provided by the embodiment of the present application to reduce the noise of the two audio signals collected by the two microphones, but synthesizes the stereo directly from the two audio signals collected by the two microphones (such a scene is referred to hereinafter as the original sound mode, also called the acoustic mode).
For example, when the audio noise reduction option is in the second state, it may be agreed that the mobile phone starts the audio noise reduction function, that is, in the video recording process, the mobile phone may adopt the stereo noise reduction method provided by the embodiment of the present application to reduce noise of two paths of audio signals collected by two microphones, so as to obtain a denoised stereo (such a scene is referred to as a noise reduction mode in the following).
Based on the above description of the states of the audio noise reduction option, when the corresponding interface is the interface 10c shown in fig. 3 (2) and the audio noise reduction option is in the style of S8, the mobile phone recognizes that the audio noise reduction option is in the first state, and therefore records in the original sound mode.
Illustratively, during recording, for example at 5 s (see the time "00:00:05" shown in S9 in fig. 4 (1)), the user clicks on the audio noise reduction option in the style of S8. Based on the above description, in response to the operation, the mobile phone switches the audio noise reduction option from the first state to the second state, that is, from the pattern shown in S8 to the pattern shown in S8' in fig. 4 (1), thereby turning on the audio noise reduction function. At this time, the mobile phone stops recording in the original sound mode and instead records in the noise reduction mode.
For example, in some implementations, when the audio noise reduction function is turned on, a pop-up window may prompt the user in the interface 10c that the audio noise reduction function is currently turned on, such as the prompt message "audio noise reduction is turned on" shown in fig. 4 (1).
In addition, in order not to affect the use of the user, a display duration of the prompt information, for example 2 s, may be set so as to avoid the prompt information blocking the video picture for a long time. Accordingly, after the prompt message is displayed in the interface 10c for 2 s, the prompt message automatically disappears from the interface 10c. As shown in fig. 4 (2), at the time "00:00:07" shown in S9, i.e. 2 s after the audio noise reduction function is turned on, the prompt message disappears.
Illustratively, during recording, for example at 5 minutes 25 s (see the time "00:05:25" shown in S9 in fig. 5 (1)), the user clicks on the audio noise reduction option in the style of S8'. Based on the above description, in response to the operation behavior, the mobile phone switches the audio noise reduction option from the second state to the first state, that is, from the pattern shown in S8' to the pattern of S8, thereby turning off the audio noise reduction function. At this time, the mobile phone stops recording in the noise reduction mode and instead records in the original sound mode.
For example, in some implementations, when the audio noise reduction function is turned off, a pop-up window may prompt the user in the interface 10c that the audio noise reduction function is currently turned off, such as the prompt for "audio noise reduction is turned off" shown in fig. 5 (1).
In addition, in order not to affect the use of the user, a display duration of the prompt information, for example 2 s, may be set so as to avoid the prompt information blocking the video picture for a long time. Accordingly, after the prompt message is displayed in the interface 10c for 2 s, the prompt message automatically disappears from the interface 10c. As shown in fig. 5 (2), at the time "00:05:27" shown in S9, i.e. 2 s after the audio noise reduction function is turned off, the prompt message disappears.
Therefore, by newly adding the audio noise reduction option for turning the audio noise reduction function on or off in the interface 10c, the user can conveniently turn the audio noise reduction function on or off at any time as required during video recording, and switching between recording in the original sound mode and recording in the noise reduction mode is realized based on the stereo noise reduction method provided by the embodiment of the application. That is, with the stereo noise reduction method provided by the embodiment of the application, the stereo corresponding to a complete audio/video may be entirely noise-reduced stereo, may be entirely stereo without noise reduction, or may be partly noise-reduced stereo and partly stereo without noise reduction, so that the method is better suited to actual stereo recording scenes.
In addition, it should be noted that when a click operation of the user on S4-2", that is, the end-recording option, is received during recording, the mobile phone ends the recording in response to the operation behavior, and the recorded content is saved in the gallery application of the mobile phone. When the video recording is finished, in one implementation, the mobile phone may directly restore the audio noise reduction option to the first state, i.e. turn off the audio noise reduction function. That is, the mobile phone does not save the state of the audio noise reduction function used during recording: no matter whether the audio noise reduction option is in the first state (the pattern of S8) or the second state (the pattern of S8') when the recording is finished, the mobile phone sets the audio noise reduction option to the first state when the recording ends. Based on this implementation, after the user clicks S4-2" in the interface 10c shown in fig. 4 (2) (the audio noise reduction option is in the second state) or in the interface 10c shown in fig. 5 (1) (the audio noise reduction option is in the first state) to return to the interface 10b shown in fig. 3 (1), when the user clicks S4' in the interface 10b again, the mobile phone enters the corresponding interface 10c for video recording in response to the operation behavior, and the audio noise reduction option is still in the style shown by S8.
In another implementation, the mobile phone may record (save) the state of the audio noise reduction function when the recording is finished. When recording is triggered again later (a new recording task is started), the mobile phone sets the state of the audio noise reduction option directly according to the state information recorded when the previous recording ended, and then records in the mode corresponding to that state.
For example, based on this implementation, after the user clicks S4-2 "to return to the interface 10b shown in (1) in fig. 3 in the interface 10c shown in (2) in fig. 4 (the audio noise reduction option is in the second state), when the user clicks S4 'in the interface 10b again, the mobile phone responds to the operation action and enters into the corresponding interface 10c for video capturing, where the audio noise reduction option is directly in the style shown in S8'. Thus, the mobile phone can directly record in the noise reduction mode.
Also for example, based on this implementation, after the user clicks S4-2 "to return to the interface 10b shown in (1) in fig. 3 in the interface 10c shown in (1) in fig. 5 (the audio noise reduction option is in the first state), when the user clicks S4' in the interface 10b again, the mobile phone responds to the operation behavior and enters into the corresponding interface 10c of the video capturing, where the audio noise reduction option is still in the style shown in S8. Thus, the mobile phone can directly record in the original sound mode.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment.
For example, in another implementation, the entry for turning on the audio noise reduction function may be integrated in the interface corresponding to the setting application. For this implementation, before recording starts, the user needs to go to the interface corresponding to the setting application to turn on the audio noise reduction function. For example, the user needs to click on the icon S2 of the setting application displayed in the interface 10a shown in fig. 2 (1).
Illustratively, after the user clicks S2, the handset will initiate a setup application in response to this action, switching from interface 10a to interface 10d shown in fig. 6 (1).
Referring to fig. 6 (1), one or more controls may be included on interface 10d. These controls include, but are not limited to, control options such as account center, flight mode, stereo noise reduction option 10d-1, WLAN, Bluetooth, notification, application, display and brightness, sound and vibration, system and update, battery, storage, and security.
Illustratively, in this embodiment, taking the example that the style shown by the stereo noise reduction option 10d-1 indicates that the audio noise reduction function is not turned on, when the user clicks the stereo noise reduction option 10d-1, the mobile phone will turn on the audio noise reduction function in response to the operation behavior, and the stereo noise reduction option 10d-1 will switch to the style of the stereo noise reduction option 10d-1' shown in (2) of fig. 6.
Illustratively, in the case of a stereo noise reduction option of the type shown as 10d-1', when the user is recording through the camera application, the recording will be in a noise reduction mode.
Illustratively, in the case of a stereo noise reduction option of the type shown as 10d-1, when the user is recording through the camera application, recording will be performed in the acoustic mode.
Illustratively, in one implementation, when the stereo noise reduction option provided in the setting application is used to turn the audio noise reduction function on or off, the audio noise reduction option described above (the style of S8 or S8') may not be displayed in the interface 10c during recording. Therefore, during recording, the user cannot dynamically switch between the noise reduction mode and the original sound mode. If switching between the noise reduction mode and the original sound mode is needed, the user pauses recording through S4-1", then enters the interface 10d and operates the stereo noise reduction option, thereby switching between the noise reduction mode and the original sound mode.
For example, in another implementation, the stereo noise reduction options provided in interface 10d and the audio noise reduction options provided in interface 10c described above may be bundled. That is, when the stereo noise reduction option in the interface 10d is the style of 10d-1, when the user records a video through the camera application, after entering the interface 10c by clicking S4', the audio noise reduction option displayed in the interface 10c will be in the style of S8. When the user clicks S8 to switch the audio noise reduction option to the pattern of S8', the stereo noise reduction option will also automatically switch to the pattern of 10 d-1'. That is, the states of the stereo noise reduction option and the audio noise reduction option remain synchronized.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment.
In order to better understand the stereo noise reduction method provided by the embodiment of the present application, the stereo noise reduction method provided by the embodiment of the present application is specifically described by taking the example that the user turns on or off the audio noise reduction function through the audio noise reduction option shown in the interface 10 c.
Referring to fig. 7, the stereo noise reduction method provided by the embodiment of the present application specifically includes:
101, displaying a first interface, which displays audio noise reduction options.
For example, in some implementations, the first interface may be, for example, a video recording interface corresponding to the camera application described in the above embodiments, that is, the interface 10c. Correspondingly, when the first interface is the interface 10c, the audio noise reduction options displayed in the first interface are the audio noise reduction options with the style S8 or S8' displayed in the interface 10c.
For example, in other implementations, the first interface may also be a recording interface corresponding to a recorder application. For the implementation scene in which the first interface is the recording interface corresponding to the recorder application, in order to implement the stereo noise reduction method provided by the embodiment of the application, an audio noise reduction option in the style of S8 or S8' may be set in the recording interface, and the audio noise reduction function is turned on or off by operating the audio noise reduction option. When the audio noise reduction function is turned on, recording is performed in the noise reduction mode based on the stereo noise reduction method provided by the embodiment of the application; otherwise, recording is performed in the original sound mode.
For convenience of explanation, in this embodiment, the first interface is taken as a video recording interface corresponding to the camera application, that is, the interface 10c is taken as an example.
102, When the audio noise reduction option is in the first state, and when a clicking operation on the audio noise reduction option is received, the audio noise reduction option is switched to the second state, and the audio noise reduction function is started.
As can be seen from the description of the above embodiment, when the audio noise reduction option is in the style of S8, that is, in the first state, the audio noise reduction function is not turned on; when the audio noise reduction option is in the style of S8', that is, in the second state, the audio noise reduction function is turned on. Based on this, when the audio noise reduction option is in the first state, i.e. the style of S8, and a click operation on the audio noise reduction option is received, the mobile phone responds to the operation behavior and the audio noise reduction option switches to the second state, i.e. from the style of S8 to the style of S8'. After the audio noise reduction option is switched to the second state, the audio noise reduction function is turned on, and the mobile phone can record in the noise reduction mode.
Specifically, regarding the stereo noise reduction method provided by the embodiment of the present application, the specific implementation process of recording in the noise reduction mode under the condition of starting the audio noise reduction function may be implemented through steps 103 to 105.
103, After the audio noise reduction function is started, determining noise-containing audio according to the microphone state of the first microphone, the microphone state of the second microphone, the first audio signal collected by the first microphone and the second audio signal collected by the second microphone.
In some implementations, the first microphone may be, for example, a microphone at the top of the terminal device currently performing video recording, such as a mobile phone, and the second microphone may be, for example, a microphone at the bottom of the mobile phone. For the specific positions of the microphones at the top and the bottom, reference may be made to the description of the above embodiments, which is not repeated here.
In other implementations, for example in the case where the terminal device currently performing video recording, such as a mobile phone, establishes communication links with the left earphone and the right earphone of a true wireless stereo (TWS) headset respectively, the first microphone may be, for example, the microphone located at the left earphone, and the second microphone may be, for example, the microphone located at the right earphone.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment. As long as it is ensured that the synthesized stereo audio signal comes from microphones in different positions.
In addition, it should be noted that the neural network model used in step 104 to determine the amplitude mask of the noise-containing audio is obtained by training on noisy training data, and the noisy training data is obtained by fusing noise data onto single-channel noise-free audio; that is, the noise-containing audio input to the neural network model should be single-channel audio. The currently acquired audio signals, however, are two channels. Therefore, in order to use the neural network model trained on single-channel audio data to obtain the amplitude mask, one single-channel audio signal needs to be determined from the two audio signals.
Illustratively, the above-mentioned fusing of noise data onto the single-channel noise-free audio specifically refers to dividing the single-channel noise-free audio and the noise data into the same number of noise-free audio time frames and noise audio time frames according to the same time period, and then, for the noise-free audio time frame and the noise audio time frame at the same moment, superposing the frequency domain features of the noise-free audio time frame and the frequency domain features of the noise audio time frame. In other words, the fusion referred to herein combines the different audio signals frame by frame along the time axis.
For convenience of explanation, the first audio signal collected by the first microphone is taken as the left channel input, and the second audio signal collected by the second microphone is taken as the right channel input. Referring to fig. 8, when the first audio signal input by the left channel and the second audio signal input by the right channel are received, the two audio signals are fused in combination with the microphone states of the microphones corresponding to the two audio signals, so as to determine the noise-containing audio that can be input into the neural network model. The neural network model then determines, based on the single-channel noise-containing audio input to it, an amplitude mask suitable for both the first audio signal and the second audio signal. The audio noise reduction module may perform masking processing on the first audio signal and the second audio signal based on this same amplitude mask determined by the neural network model. After masking processing with the same amplitude mask, the left channel output and the right channel output are both noise-reduced audio signals, and by combining the left channel and the right channel, the noise-reduced stereo can be obtained.
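As an illustrative aid to the flow in fig. 8, the following is a minimal per-frame sketch; the function and variable names (denoise_stereo_frame, mask_model, etc.) are assumptions introduced here and do not appear in the embodiment:

```python
import numpy as np

def denoise_stereo_frame(left_spec, right_spec, mask_model):
    """Sketch of the fig. 8 flow for one time frame.

    left_spec / right_spec: complex frequency-domain bins of one frame of
    the left-channel and right-channel inputs (same length).
    mask_model: callable mapping a single-channel noisy magnitude spectrum
    to an amplitude mask in [0, 1] (the trained neural network model).
    """
    # Fuse the two channels into one single-channel noisy magnitude
    # spectrum (here: the mean of the two magnitude spectra).
    noisy_mag = 0.5 * (np.abs(left_spec) + np.abs(right_spec))

    # One amplitude mask, predicted from the single-channel noisy audio.
    mask = mask_model(noisy_mag)

    # Mask BOTH channels with the same mask, so the inter-channel level
    # and phase relations (the spatial cues of the stereo) are preserved.
    return mask * left_spec, mask * right_spec
```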
It should be noted that the amplitude mask is specifically used to describe, in the frequency domain, the relation between clean speech, that is, noise-free audio (the target audio signal, such as the human voice), and the noise-containing audio (the audio containing various kinds of noise other than the target audio signal, such as wind noise, road noise, and music).
In addition, when the amplitude mask describes a relation in a frequency domain, masking based on the amplitude mask can be understood as spectrum masking. The spectrum masking means that the strong sound signal can mask the weak sound signal under the same frequency, so that part of the weak sound signal can be directly removed within a certain threshold value, and the audio noise reduction is realized.
For example, when the microphone state of the first microphone indicates that the first microphone is operating normally (e.g., is not blocked and can pick up sound normally), and the microphone state of the second microphone indicates that the second microphone is operating normally, i.e., when the two channel input audio streams are normal, the noisy audio may be determined according to the first audio signal and the second audio signal.
For example, in some implementations, the frequency domain features (amplitude spectrum) of the first audio signal and the frequency domain features of the second audio signal may be subjected to mean processing, and then the audio signal corresponding to the frequency domain features obtained after the mean processing is used as the noise-containing audio.
Furthermore, it will be appreciated that the noise energy difference between the left and right channels of a stereo recording is typically small, but in some scenarios it can be large, and the signal-to-noise ratio difference between the left and right channels is then mainly caused by the difference in noise energy. Using the lower-energy audio signal as the noise-containing audio and calculating the amplitude mask through the neural network model can therefore reduce harmonic distortion and obtain better fidelity. Thus, in other implementations, the energy of the two channels may be calculated, and the lower-energy audio signal may be selected from the first audio signal and the second audio signal as the noise-containing audio.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment.
For example, in the case where an audio stream input by one channel is normal and an audio stream input by another channel is abnormal, an audio signal corresponding to one channel in which the audio stream is normal may be selected as the noise-containing audio.
For example, when the microphone state of the first microphone indicates that the first microphone is operating normally and the microphone state of the second microphone indicates that the second microphone is abnormal (e.g., blocked or faulty), the first audio signal is determined to be the noise-containing audio.
For example, when the microphone state of the first microphone indicates that the first microphone is abnormal and the microphone state of the second microphone indicates that the second microphone is operating normally, the second audio signal is determined to be noisy audio.
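Consolidating the cases above, a minimal sketch of the noise-containing audio selection in step 103 could look as follows; the function name, the boolean microphone-state flags and the energy-based tie-break are illustrative assumptions:

```python
import numpy as np

def select_noisy_audio(mag_first, mag_second, first_ok, second_ok):
    """Pick the single-channel noisy magnitude spectrum fed to the model.

    mag_first / mag_second: magnitude spectra of the first / second
    microphone signals; first_ok / second_ok: results of the
    microphone-state check (True = working normally).
    """
    if first_ok and second_ok:
        # Both microphones normal: either average the two spectra, or (as
        # here) keep the lower-energy channel to reduce harmonic distortion.
        e1 = np.sum(mag_first ** 2)
        e2 = np.sum(mag_second ** 2)
        return mag_first if e1 <= e2 else mag_second
    if first_ok:       # second microphone blocked or faulty
        return mag_first
    if second_ok:      # first microphone blocked or faulty
        return mag_second
    # Both abnormal: fall back to the normal-case rule (a prompt to the
    # user may also be displayed, as described below).
    return 0.5 * (mag_first + mag_second)
```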
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment.
Illustratively, in one extreme case, such as when the audio streams of both channels are abnormal, in one implementation the noise-containing audio may still be determined in either of the two ways given above for the case where both audio streams are normal. In another implementation, a prompt window may be displayed in the current interface, such as the first interface, showing prompt information indicating that the first microphone and the second microphone are abnormal, so as to prevent the user from unknowingly recording audio and video with abnormal microphones and ending up with a recorded audio/video file that has no sound or contains only interference audio.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment.
104, Inputting the noise-containing audio into the neural network model to obtain the amplitude mask of the noise-containing audio.
Understandably, the neural network model is trained in advance based on the noisy training data described above and built into the terminal device. In this embodiment, the neural network model includes a convolutional neural network (Convolutional Neural Network, CNN) layer, a long short-term memory (Long Short-Term Memory, LSTM) layer, and a fully connected (Linear, LN) layer, which are arranged in sequence. That is, the neural network model in this embodiment has a CRNN (Convolutional Recurrent Neural Network) structure.
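A minimal sketch of such a CNN → LSTM → fully connected mask estimator is given below, assuming PyTorch; the layer sizes and class name are illustrative assumptions rather than the configuration of the embodiment:

```python
import torch
import torch.nn as nn

class MaskCRNN(nn.Module):
    """Sketch of a CRNN that maps a noisy magnitude spectrum to an
    amplitude mask in [0, 1] per time frame and frequency bin."""

    def __init__(self, n_freq=161, hidden=128):
        super().__init__()
        # CNN layer: local time-frequency feature extraction.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # LSTM layer: models temporal context across frames.
        self.lstm = nn.LSTM(16 * n_freq, hidden, batch_first=True)
        # Fully connected layer: one mask value per frequency bin.
        self.fc = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mag):                      # mag: (batch, frames, n_freq)
        x = self.cnn(mag.unsqueeze(1))           # (batch, 16, frames, n_freq)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.lstm(x)
        return self.fc(out)                      # amplitude mask per frame
```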
In order to facilitate the description of the neural network model construction process, the following description is made in terms of both the training phase and the testing phase.
Training phase:
Single-channel clean speech (noise-free audio) is collected and fused with noise data of different scenes (such as wind noise, road noise, and music without human voice) at different signal-to-noise ratios and different amplitudes to form the noisy training data. Then, the constructed model conforming to the above neural network structure is iteratively trained on this noisy training data, using a loss function that can embody the time-frequency domain relation, until the convergence condition is met.
For example, in the training process, the clean speech may be denoted as x(t) and the noise data as n(t); the noisy training data y(t) can then be expressed as y(t) = x(t) + n(t).
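As an illustrative sketch of this fusion (the function name and the target-SNR parameter are assumptions introduced here), the noise can be scaled to a desired signal-to-noise ratio before being added to the clean speech:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Fuse single-channel clean speech x(t) with noise n(t) at a target
    SNR (in dB) to obtain noisy training data y(t) = x(t) + n(t).
    The noise array is assumed to be at least as long as the clean array."""
    noise = noise[: len(clean)]                     # align the two signals
    clean_pow = np.mean(clean ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(clean_pow / (gain**2 * noise_pow))
    # equals the requested snr_db.
    gain = np.sqrt(clean_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```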
For example, when training the neural network model based on the noisy training data, each piece of noisy training data may be divided into time frames according to a preset time period, and frequency domain features (amplitude spectra) may then be extracted from each time frame. Then, the amplitude spectra corresponding to the noisy training data of the current frame and its consecutive frames are denoted as Y = [Y_1(f), …, Y_t(f), …, Y_T(f)]^H ∈ R^(T×F).
Where T represents the number of frames, F represents the number of frequency points, and Y_t(f) represents the frequency domain amplitude of the current frame.
Illustratively, after the sample Y is input into the neural network model and processed by the CNN layer, the LSTM layer, and the LN layer in the neural network model, the amplitude mask corresponding to the current frame Y_t(f), denoted here as M_t(f), is output.
Based on this, the loss function can be expressed, for example, as the error between the masked noisy spectrum and the clean spectrum:

Loss = Σ_t Σ_f ( M_t(f) · Y_t(f) − X_t(f) )²

Wherein M_t(f) is the amplitude mask output by the neural network model, X_t(f) is the amplitude spectrum of the clean speech, and the speech after noise reduction processing can be expressed as X̂_t(f) = M_t(f) · Y_t(f).
Based on the relation, in the process of carrying out iterative training on the model of the CRNN structure based on the noise-containing frequency training data, the neural network model meeting the iteration requirement can be obtained by minimizing the loss function.
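Assuming the MaskCRNN sketch above and a masked-spectrum mean-squared-error loss, one optimization step could look as follows (all names are illustrative assumptions):

```python
import torch

def train_step(model, optimizer, noisy_mag, clean_mag):
    """One iteration of minimizing the loss: the model predicts a mask from
    the noisy magnitude spectrum Y, and the masked spectrum M*Y is driven
    toward the clean spectrum X."""
    optimizer.zero_grad()
    mask = model(noisy_mag)                         # (batch, frames, n_freq)
    loss = torch.mean((mask * noisy_mag - clean_mag) ** 2)
    loss.backward()
    optimizer.step()
    return loss.item()
```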
For specific details on training neural network models, reference may be made to the relevant literature for models of CRNN structures, which are not described here in detail.
Testing phase:
When the sampling rate of the noisy training data and the sampling rate of the noisy test data do not match, frequency band expansion processing may be performed so that the neural network model can also output an amplitude mask suitable for the noisy test data. Specifically, the amplitude mask obtained from the neural network model is expanded to the full frequency band of the noisy test data.
It will be appreciated that since the sampling frequency of current noisy training data is mostly 16 kHz, the bandwidth of the amplitude mask determined by the neural network model trained on noisy training data with a sampling frequency of 16 kHz is typically 0 kHz to 8 kHz. Therefore, for noisy test data with a sampling frequency of 16 kHz, or, in the subsequent application scenario, when the sampling frequency of the audio signals acquired by the first microphone and the second microphone is also 16 kHz, a suitable amplitude mask can be determined directly by the neural network model trained on the 16 kHz noisy training data. That is, when the sampling frequency of the noise-containing audio is 16 kHz, the frequency domain features corresponding to the noise-containing audio are input into the neural network model to obtain an amplitude mask with a bandwidth between 0 kHz and 8 kHz, and this amplitude mask is then used as the amplitude mask of the noise-containing audio.
Based on the amplitude mask determined in the above manner, the noise reduction processing of the stereo can be expressed as:

X'_l(f) = mask_0-8k(f) · X_l(f)
X'_r(f) = mask_0-8k(f) · X_r(f)

Wherein mask_0-8k(f) is the amplitude mask with a bandwidth of 0 kHz to 8 kHz at frequency (frequency point) f, determined by the neural network model; X_l(f) is the audio signal input on the left channel, i.e. the first audio signal collected by the first microphone without noise reduction; X'_l(f) is the audio signal output on the left channel, i.e. the first audio signal after masking processing with the amplitude mask as described above; X_r(f) is the audio signal input on the right channel, i.e. the second audio signal collected by the second microphone without noise reduction; and X'_r(f) is the audio signal output on the right channel, i.e. the second audio signal after masking processing with the amplitude mask as described above.
For example, for noisy test data with a sampling frequency greater than 16 kHz, or, in the subsequent application scenario, when the sampling frequency of the audio signals collected by the first microphone and the second microphone is greater than 16 kHz, frequency domain expansion processing is required.
The above-mentioned sampling frequency greater than 16 kHz is, for example, 32 kHz or 48 kHz. Because speech is mainly concentrated in the medium and low frequencies, the neural network model trained on noisy training data with a sampling frequency of 16 kHz is still applicable, provided that the feature dimensions are matched. The frequency domain expansion processing is specifically described below taking a sampling frequency of 48 kHz as an example:
For example, in the training stage, noisy training data with a sampling frequency of 16 kHz may be up-sampled to a sampling frequency of 48 kHz, and frequency domain features may then be extracted according to the preset feature dimension; however, in the actual training, only the feature data below 8 kHz are input into the model of the CRNN structure for training.
It should be noted that upsampling is used here to increase the sampling frequency of the audio sequence. The upsampling operation can be divided into two steps: firstly, interpolation is carried out according to the sampling proportion, and a mirror image frequency spectrum is introduced at the moment; and secondly, filtering the image spectrum through a low-pass filtering algorithm to obtain an up-sampled signal.
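A minimal sketch of this up-sampling step, assuming SciPy is available (the function name is an assumption; resample_poly performs the interpolation and the low-pass filtering of the mirror spectrum in one call):

```python
from scipy.signal import resample_poly

def upsample_16k_to_48k(audio_16k):
    """Raise the sampling rate of an audio sequence from 16 kHz to 48 kHz:
    interpolate by a factor of 3 and low-pass filter the mirror (image)
    spectrum introduced by the interpolation."""
    return resample_poly(audio_16k, up=3, down=1)
```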
For specific details of upsampling noise-containing training data with a sampling frequency of 16kHz to a sampling frequency of 48kHz, reference may be made to the relevant literature of the upsampling method, which is not repeated here.
Illustratively, in the testing phase, the frequency domain expansion processing is specifically as follows: within 8 kHz, the amplitude mask determined by the neural network model is used directly; above 8 kHz, the mean value of the current amplitude mask is used; and near 8 kHz (for example 7 kHz to 8 kHz), the gains of the two are fused in a certain proportion to ensure the continuity of the frequency spectrum. That is, when the sampling frequency of the noise-containing audio is 32 kHz or 48 kHz, the frequency domain features corresponding to the noise-containing audio are input into the neural network model to obtain an amplitude mask with a bandwidth between 0 kHz and 8 kHz; the mean value of this amplitude mask output by the neural network model is used as the amplitude mask for the band above 8 kHz; the amplitude mask with a bandwidth between 0 kHz and 8 kHz and the amplitude mask for the band above 8 kHz are fused according to a set gain proportion to obtain the amplitude mask for the band near 8 kHz; and the amplitude mask with a bandwidth between 0 kHz and 8 kHz, the amplitude mask for the band near 8 kHz, and the amplitude mask for the band above 8 kHz together form the amplitude mask of the noise-containing audio.
Based on the amplitude mask of the noise-containing audio determined by this frequency domain expansion processing, the noise reduction processing of the stereo (taking the left channel as an example, the right channel being the same) can be expressed, for example, as:

X'_l(f) = mask_0-8k(f) · X_l(f), for f below the transition band;
X'_l(f) = [(1 − α) · mask_0-8k(f) + α · mean(mask_0-8k)] · X_l(f), for f in the transition band (e.g. 7 kHz to 8 kHz);
X'_l(f) = mean(mask_0-8k) · X_l(f), for f above 8 kHz.

Wherein mean(mask_0-8k) is the mean value of the amplitude mask with a bandwidth between 0 kHz and 8 kHz, and the weight α increases uniformly from 0 to 1 in the transition frequency band.
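A sketch of this band expansion of the mask is given below; the function name, the 7-8 kHz transition band and the bin layout are illustrative assumptions:

```python
import numpy as np

def extend_mask_fullband(mask_0_8k, freqs):
    """Extend a 0-8 kHz amplitude mask to the full band of a wide-band
    spectrum.

    mask_0_8k: mask values for the bins at or below 8 kHz (assumed to use
    the same bin spacing as the wide-band spectrum).
    freqs: center frequency in Hz of every bin of the wide-band spectrum.
    """
    mean_gain = float(np.mean(mask_0_8k))
    full = np.full(len(freqs), mean_gain)           # above 8 kHz: mean value
    low = freqs <= 8000.0
    full[low] = mask_0_8k[: np.count_nonzero(low)]  # below 8 kHz: model mask
    trans = (freqs >= 7000.0) & (freqs <= 8000.0)   # 7-8 kHz: cross-fade
    alpha = (freqs[trans] - 7000.0) / 1000.0        # rises uniformly 0 -> 1
    full[trans] = (1.0 - alpha) * full[trans] + alpha * mean_gain
    return full
```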
For better understanding, the following description is given with reference to fig. 9. For example, when the left channel input and the right channel input are audio signals with a sampling frequency of 48 kHz, before noise reduction with the stereo noise reduction method provided by the embodiment of the present application, the 48 kHz audio signals need to be converted, based on the fast Fourier transform (fast Fourier transform, FFT), into frequency-domain signals with a bandwidth of 24 kHz.

With continued reference to fig. 9, exemplarily, the spectrum of the FFT-transformed audio signal with a bandwidth of 24 kHz has not yet been denoised, i.e. it is an un-denoised audio signal.

With continued reference to fig. 9, exemplarily, after the audio signals with a bandwidth of 24 kHz for the left and right channel inputs are obtained, the noise-containing audio may be determined in the manner described above in step 103 and then input into the trained neural network model. Because the input noise-containing audio has a sampling frequency greater than 16 kHz, the amplitude masks corresponding to the different frequency ranges within the whole 0 kHz to 24 kHz band are determined by the frequency domain expansion processing described above. The audio signals input on the left channel and the right channel are then masked based on the amplitude masks corresponding to the different frequency ranges, so that noise-reduced audio signals with a bandwidth of 24 kHz are obtained for the left channel and the right channel. Finally, the audio signals with a bandwidth of 24 kHz are converted back into audio signals with a sampling frequency of 48 kHz by the inverse fast Fourier transform (inverse fast Fourier transform, IFFT).
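A minimal per-frame sketch of this 48 kHz processing chain (FFT, full-band mask, IFFT) follows; windowing and overlap-add are omitted, and the function name is an assumption. The freqs array needed by the extend_mask_fullband sketch above can be obtained with np.fft.rfftfreq(frame_len, d=1/48000).

```python
import numpy as np

def denoise_48k_frame(frame_l, frame_r, full_mask):
    """Apply the full-band amplitude mask to one 48 kHz frame per channel:
    FFT to a 0-24 kHz spectrum, multiply by the mask, IFFT back to the time
    domain. full_mask is assumed to have len(frame)//2 + 1 values."""
    spec_l = np.fft.rfft(frame_l)
    spec_r = np.fft.rfft(frame_r)
    out_l = np.fft.irfft(full_mask * spec_l, n=len(frame_l))
    out_r = np.fft.irfft(full_mask * spec_r, n=len(frame_r))
    return out_l, out_r
```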
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment.
In addition, it should be noted that, in the present embodiment, the neural network model may be obtained by training the terminal device itself, or may be obtained by training the server and transmitted to the terminal device.
In order to reduce the resource and power consumption of the terminal device, this embodiment takes training on a server as an example. Specifically, in this embodiment, in order to reduce the training pressure on the server as much as possible while making the trained neural network model suitable for most types and configurations of terminal devices, the server may first train an initial neural network model based on noisy training data obtained from a big data platform, and then push the trained initial neural network model to each terminal device, for example the mobile phone 1, the mobile phone 2 and the mobile phone 3 in fig. 10, respectively, or push the trained initial neural network model to the corresponding terminal device after receiving a request from that terminal device.
Further, to achieve the customization requirement, the server may also collect the single-channel clean audio generated by each terminal device using the initial neural network model, such as the 16kHz single-channel clean audio generated by the mobile phone 1, the 32kHz single-channel clean audio generated by the mobile phone 2, and the 48kHz single-channel clean audio generated by the mobile phone 3 in fig. 10. And then respectively carrying out optimization training on the initial neural network model according to different single-channel clean audios, further obtaining target neural network models aiming at different terminal devices, and respectively pushing the target neural network models of the different terminal devices to the corresponding terminal devices for use.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment.
And 105, masking the first audio signal and the second audio signal by adopting an amplitude mask to obtain the stereophonic sound after noise reduction.
Therefore, the amplitude mask suitable for the first audio signal and the second audio signal is determined by adopting the neural network model obtained by training the noise audio training data based on the single channel, and then the two paths of audio signals are processed by utilizing the same amplitude mask, so that the noise reduction of the two paths of audio signals in stereo is realized.
In addition, compared with the method for separately reducing the noise of the left and right channels, the stereo noise reduction method provided by the embodiment expands the single-channel noise reduction to the stereo noise reduction mode, so that the expenditure of a noise reduction algorithm can be obviously reduced, the spatial characteristics of the stereo can be saved, and the audio and video experience is improved.
When the stereo noise reduction method provided by the embodiment of the application is applied to the terminal equipment, the method not only needs to relate to a software part of the terminal equipment, but also relates to a hardware part of the terminal equipment. Taking the terminal device as an example of the mobile phone and taking the hardware structure as shown in fig. 1 as an example, in order to better understand the software structure of the mobile phone 100 shown in fig. 1, the software structure of the mobile phone 100 is described below. Before explaining the software structure of the mobile phone 100, an architecture that the software system of the mobile phone 100 can employ will be first described.
Specifically, in practical applications, the software system of the mobile phone 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
Furthermore, it is understood that software systems used by currently mainstream terminal devices include, but are not limited to, windows systems, android systems, and iOS systems. For convenience of explanation, the embodiment of the present application takes an Android system with a layered architecture as an example, and illustrates a software structure of the mobile phone 100. In a specific implementation, the stereo noise reduction method provided by the embodiment of the application is also applicable to other systems.
In addition, it should be appreciated that the layered architecture of current mobile phones divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. These layers may include, for example, an application layer, an application framework layer, the Android runtime (Android Runtime) and system libraries, a hardware abstraction layer, a kernel layer, and the like.
Referring to fig. 11, a software architecture diagram of a mobile phone 100 according to an embodiment of the present application is shown.
The application layer may include a series of application packages, among other things. The application framework layer provides an application programming interface (application programming interface, API) and programming framework for the application of the application layer. In some implementations, these programming interfaces and programming frameworks can be described as functions.
With continued reference to fig. 11, in particular, in the technical solution provided in the embodiment of the present application, the application program layer may include a camera application capable of recording stereo, a recorder application, a setting application integrated with a function of turning on or off audio noise reduction, and the like.
With continued reference to fig. 11, in particular, in the technical solution provided in the embodiment of the present application, the application framework layer may include an audio service, a camera service, a stereo noise reduction module, and so on.
With continued reference to FIG. 11, exemplarily, the Android Runtime includes a core library and a virtual machine. The Android Runtime is responsible for scheduling and management of the Android system.

The core library consists of two parts: one part is the function libraries that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application program layer and the application program framework layer as binary files. The virtual machine is used for executing the functions of object life cycle management, stack management, thread management, security and exception management, garbage collection and the like.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media Libraries (Media Libraries), three-dimensional (3D) graphics processing Libraries (e.g., openGL ES), two-dimensional (2D) graphics engines (e.g., SGL), etc.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
Media libraries support a variety of commonly used audio, video formats for playback and recording, still image files, and the like. The media library may support a variety of audio video encoding formats, such as: MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, etc.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
It will be appreciated that the 2D graphics engine described above is a drawing engine for 2D drawing.
When recording stereo sound through the video mode of camera application, the content related to the picture will be processed by the functional module related to the graphics in the system library.
Furthermore, it is understood that the kernel layer in the Android system is a layer between hardware and software. With continued reference to fig. 11, the core layer illustratively contains at least a display driver, a camera driver, an audio driver, and the like. For example, in a stereo recording scenario, the audio driver may drive a first microphone and a second microphone in the audio module to pick up audio signals. As to the software structure of the mobile phone 100, it will be understood that the layers and the components included in the layers in the software structure shown in fig. 11 do not constitute a specific limitation on the mobile phone 100. In other embodiments of the present application, the handset 100 may include more or fewer layers than shown and may include more or fewer components per layer, as the application is not limited.
Based on the hardware structure of the mobile phone shown in fig. 1 and the software structure shown in fig. 11, the following describes the related hardware and software structures when implementing the stereo noise reduction method provided by the embodiment of the present application.
Taking the application that records the stereo as the camera application as an example, referring to fig. 12, after the user clicks S4' shown in fig. 3 (1) in the above embodiment, the mobile phone responds to the operation, and the camera application issues a stereo recording instruction to the audio service, i.e. informs the audio service that stereo recording is about to be performed.
Understandably, when recording in the video mode of the camera application, the recording of pictures is also involved, i.e. the camera service, the camera driver, the camera, and the functional modules that process the video stream collected by the camera are also involved. Only the audio-related content is described here.
With continued reference to fig. 12, the exemplary audio service, upon receiving a stereo recording instruction, initiates a corresponding process to invoke an audio driver to drive an audio module, such as a first microphone and a second microphone in the audio module. Thus, during the recording process, the first microphone may collect the first audio signal (without noise reduction), and the second microphone may collect the second audio signal (without noise reduction).
The audio signals collected by the first microphone and the second microphone are sent to the Modem for processing to obtain the stereo, and the stereo thus obtained is stored in the memory when the recording is finished.

How the Modem processes the first audio signal (without noise reduction) and the second audio signal (without noise reduction) is determined by the instruction given by the stereo noise reduction module.
With continued reference to fig. 12, for example, when the user clicks the audio noise reduction option in the first state, such as the audio noise reduction option in the style S8 in the interface 10c in the above embodiment, the stereo noise reduction module determines that the user has turned on the audio noise reduction function, that is, needs the Modem to process the first audio signal (not noise reduced) and the second audio signal (not noise reduced) in the noise reduction mode. In this case, the stereo noise reduction module sends a noise reduction instruction to the Modem, so that the Modem will reduce the noise of the first audio signal (without noise reduction) and the second audio signal (without noise reduction) based on the stereo noise reduction method provided by the embodiment of the present application, specifically, the processing from step 103 to step 105 in the above embodiment may be performed on the first audio signal (without noise reduction) and the second audio signal (without noise reduction).
For example, when the user clicks the audio noise reduction option in the second state, such as the audio noise reduction option in the style S8' in the interface 10c in the above embodiment, the stereo noise reduction module determines that the user turns off the audio noise reduction function, that is, the Modem needs to process the first audio signal (not noise reduced) and the second audio signal (not noise reduced) in the acoustic mode. In this case, the stereo noise reduction module will send an acoustic instruction to the Modem, so that the Modem will not perform noise reduction processing on the first audio signal (without noise reduction) and the second audio signal (without noise reduction) in steps 103 to 105 in the above embodiment.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not the only limitation of the present embodiment. The above description of the various instructions is also merely for explaining when each functional module and hardware perform any operation, and in the specific implementation, the names of the instructions are not limited.
Therefore, the terminal equipment based on the software and hardware structures can realize noise reduction processing under the condition of ensuring the fidelity of the target voice signal in the stereo, simultaneously preserve the spatial characteristics of the stereo and improve the audio and video experience by the stereo noise reduction method provided by the embodiment of the application.
Furthermore, it is understood that the terminal device, in order to implement the above-mentioned functions, comprises corresponding hardware and/or software modules for performing the respective functions. The present application can be implemented in hardware or a combination of hardware and computer software, in conjunction with the example algorithm steps described in connection with the embodiments disclosed herein. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application in conjunction with the embodiments, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Furthermore, it should be noted that, in an actual application scenario, the stereo noise reduction method provided in the foregoing embodiments implemented by the terminal device may also be executed by a chip system included in the terminal device, where the chip system may include a processor. The chip system may be coupled to a memory such that the chip system, when running, invokes a computer program stored in the memory, implementing the steps performed by the terminal device. The processor in the chip system can be an application processor or a non-application processor.
In addition, an embodiment of the present application further provides a computer readable storage medium, where computer instructions are stored, which when executed on a terminal device, cause the terminal device to execute the related method steps to implement the stereo noise reduction method in the foregoing embodiment.
In addition, the embodiment of the application also provides a computer program product, which when being run on the terminal equipment, causes the terminal equipment to execute the related steps so as to realize the stereo noise reduction method in the embodiment.
In addition, embodiments of the present application also provide a chip (which may also be a component or module) that may include one or more processing circuits and one or more transceiver pins; wherein the transceiver pin and the processing circuit communicate with each other through an internal connection path, and the processing circuit executes the related method steps to implement the stereo noise reduction method in the above embodiment, so as to control the receiving pin to receive signals, and control the transmitting pin to transmit signals.
In addition, as can be seen from the above description, the terminal device, the computer-readable storage medium, the computer program product or the chip provided by the embodiments of the present application are used to execute the corresponding methods provided above, so that the advantages achieved by the method can refer to the advantages in the corresponding methods provided above, and are not repeated herein.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications and substitutions do not depart from the spirit of the application.
Claims (18)
1. A method of stereo noise reduction, applied to a terminal device, the method comprising:
displaying a first interface, wherein the first interface displays an audio noise reduction option;
when the audio noise reduction option is in a first state and a click operation on the audio noise reduction option is received, switching the audio noise reduction option to a second state and turning on an audio noise reduction function;
after the audio noise reduction function is started, determining noisy audio according to the microphone state of the first microphone, the microphone state of the second microphone, a first audio signal collected by the first microphone, and a second audio signal collected by the second microphone; wherein the first microphone and the second microphone are microphones located at different positions; the determining the noisy audio according to the microphone state of the first microphone, the microphone state of the second microphone, the first audio signal collected by the first microphone, and the second audio signal collected by the second microphone comprises: determining the noisy audio according to the first audio signal and the second audio signal when the microphone state of the first microphone indicates that the first microphone works normally and the microphone state of the second microphone indicates that the second microphone works normally; determining the first audio signal as the noisy audio when the microphone state of the first microphone indicates that the first microphone works normally and the microphone state of the second microphone indicates that the second microphone is abnormal; and determining the second audio signal as the noisy audio when the microphone state of the first microphone indicates that the first microphone is abnormal and the microphone state of the second microphone indicates that the second microphone works normally;
inputting the noisy audio into a neural network model to obtain an amplitude mask suitable for the first audio signal and the second audio signal; wherein the neural network model is obtained by training on noisy training data, and the noisy training data is obtained by fusing noise data onto single-channel noise-free audio;
masking the first audio signal and the second audio signal with the amplitude mask to obtain the noise-reduced stereo.
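As an illustration only, and not as part of the claims, the flow of claim 1 can be sketched in a few lines of Python. The use of scipy.signal for the STFT, the frame parameters, and the `model` callable that returns the amplitude mask are assumptions introduced purely for readability; the patented implementation is not limited to them.

```python
import numpy as np
from scipy.signal import stft, istft

def denoise_stereo(sig1, sig2, mic1_ok, mic2_ok, model, fs=16000, nperseg=512):
    """Sketch of claim 1: pick the noisy audio from the microphone states, predict one
    amplitude mask, and apply the same mask to both channels to keep the spatial cues."""
    _, _, spec1 = stft(sig1, fs=fs, nperseg=nperseg)   # complex spectrogram (freq, frames)
    _, _, spec2 = stft(sig2, fs=fs, nperseg=nperseg)

    if mic1_ok and mic2_ok:
        noisy_mag = 0.5 * (np.abs(spec1) + np.abs(spec2))   # both mics normal (see claims 2 and 3)
    elif mic1_ok:
        noisy_mag = np.abs(spec1)                           # second microphone abnormal
    elif mic2_ok:
        noisy_mag = np.abs(spec2)                           # first microphone abnormal
    else:
        raise RuntimeError("both microphones abnormal; prompt the user instead (claim 4)")

    mask = model(noisy_mag)                                 # assumed to return values in [0, 1]
    _, left = istft(spec1 * mask, fs=fs, nperseg=nperseg)   # same mask on both channels
    _, right = istft(spec2 * mask, fs=fs, nperseg=nperseg)
    return left, right
```

Because a single mask is shared by the two channels, the level and phase relation between them, and hence the stereo image, are left intact.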
2. The method of claim 1, wherein the determining the noisy audio from the first audio signal and the second audio signal when the microphone state of the first microphone indicates that the first microphone is operating normally and the microphone state of the second microphone indicates that the second microphone is operating normally comprises:
when the microphone state of the first microphone indicates that the first microphone works normally and the microphone state of the second microphone indicates that the second microphone works normally, averaging the frequency-domain features of the first audio signal and the frequency-domain features of the second audio signal;
and taking the audio signal corresponding to the averaged frequency-domain features as the noisy audio.
3. The method of claim 1, wherein the determining the noisy audio from the first audio signal and the second audio signal when the microphone state of the first microphone indicates that the first microphone is operating normally and the microphone state of the second microphone indicates that the second microphone is operating normally comprises:
when the microphone state of the first microphone indicates that the first microphone works normally and the microphone state of the second microphone indicates that the second microphone works normally, selecting the audio signal with the lower energy from the first audio signal and the second audio signal as the noisy audio.
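Claims 2 and 3 give two alternative ways of forming the noisy audio when both microphones work normally. A small sketch of both follows, assuming the "frequency-domain features" are magnitude spectra; the function name, strategy labels, and the energy criterion details are illustrative assumptions.

```python
import numpy as np

def noisy_from_two_mics(spec1: np.ndarray, spec2: np.ndarray, strategy: str = "mean") -> np.ndarray:
    """Derive the noisy reference when both microphones work normally.

    spec1, spec2: complex STFTs of the first and second audio signals, shape (freq, frames).
    """
    if strategy == "mean":
        # Claim 2: average the frequency-domain features of the two channels.
        return 0.5 * (np.abs(spec1) + np.abs(spec2))
    if strategy == "low_energy":
        # Claim 3: take the channel with the lower energy as the noisy audio.
        e1 = np.sum(np.abs(spec1) ** 2)
        e2 = np.sum(np.abs(spec2) ** 2)
        return np.abs(spec1) if e1 <= e2 else np.abs(spec2)
    raise ValueError(f"unknown strategy: {strategy}")
```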
4. The method according to claim 1, wherein the method further comprises:
Displaying a prompt window in the first interface when the microphone state of the first microphone indicates that the first microphone is abnormal and the microphone state of the second microphone indicates that the second microphone is abnormal;
wherein the prompt window displays prompt information indicating that the first microphone and the second microphone are abnormal.
5. The method of any one of claims 1 to 4, wherein after turning on the audio noise reduction function, the method further comprises:
when a click operation on the audio noise reduction option is received, switching the audio noise reduction option to the first state and turning off the audio noise reduction function;
After the audio noise reduction function is turned off, synthesizing the stereo without noise reduction according to the first audio signal and the second audio signal.
6. The method of any one of claims 1 to 4, wherein the first interface further displays an end recording option;
The method further comprises the steps of:
after the audio noise reduction function is started, when a click operation on the end recording option is received, restoring the audio noise reduction option to the first state and turning off the audio noise reduction function.
7. The method of any one of claims 1 to 4, wherein the first interface further displays an end recording option;
The method further comprises the steps of:
after the audio noise reduction function is started, when a click operation on the end recording option is received, recording the second state;
and when recording is triggered again, directly turning on the audio noise reduction function according to the recorded second state.
8. The method according to any one of claims 1 to 4, wherein the sampling frequency of the noisy training data is 16kHz, and the bandwidth of the amplitude mask output by the neural network model obtained by training on the noisy training data is between 0kHz and 8kHz.
9. The method of claim 8, wherein said inputting the noisy audio into a neural network model to obtain an amplitude mask for the noisy audio comprises:
when the sampling frequency of the noisy audio is 16kHz, inputting the frequency-domain features corresponding to the noisy audio into the neural network model to obtain an amplitude mask with a bandwidth between 0kHz and 8kHz;
and taking the obtained amplitude mask with the bandwidth between 0kHz and 8kHz as the amplitude mask of the noisy audio.
10. The method according to claim 9, wherein the method further comprises:
when the sampling frequency of the noisy audio is 32kHz or 48kHz, inputting the frequency-domain features corresponding to the noisy audio into the neural network model to obtain an amplitude mask with a bandwidth between 0kHz and 8kHz;
taking the average value of the amplitude mask with the bandwidth between 0kHz and 8kHz output by the neural network model as the amplitude mask corresponding to the bandwidth above 8kHz;
fusing the amplitude mask with the bandwidth between 0kHz and 8kHz output by the neural network model and the amplitude mask corresponding to the bandwidth above 8kHz according to a set gain ratio to obtain an amplitude mask corresponding to the bandwidth near 8kHz;
and taking the amplitude mask with the bandwidth between 0kHz and 8kHz, the amplitude mask corresponding to the bandwidth near 8kHz, and the amplitude mask corresponding to the bandwidth above 8kHz together as the amplitude mask of the noisy audio.
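Claims 8 to 10 together describe how a model trained on 16kHz data, whose mask covers 0kHz to 8kHz, is reused at higher sampling rates: at 16kHz the mask is used directly, while at 32kHz or 48kHz the mean of the low-band mask stands in for the band above 8kHz and the two are fused near 8kHz with a set gain ratio. The sketch below is one possible reading of that step; the transition width, the gain ratio, and the per-frame averaging are assumptions not fixed by the claims.

```python
import numpy as np

def extend_mask(low_mask: np.ndarray, n_bins_total: int,
                n_transition: int = 8, gain_ratio: float = 0.5) -> np.ndarray:
    """Build a full-band amplitude mask from the model's 0-8kHz mask (claim 10).

    low_mask: (n_low_bins, n_frames) mask predicted by the 16kHz model.
    n_bins_total: number of frequency bins at the actual sampling rate (32kHz or 48kHz).
    """
    n_low, n_frames = low_mask.shape
    # Band above 8kHz: use the per-frame mean of the low-band mask.
    high_value = low_mask.mean(axis=0, keepdims=True)            # (1, n_frames)
    high_mask = np.repeat(high_value, n_bins_total - n_low, axis=0)

    full = np.concatenate([low_mask, high_mask], axis=0)         # (n_bins_total, n_frames)
    # Near 8kHz: fuse the low-band mask and the high-band value with a set gain ratio
    # so there is no hard step at the band edge.
    for i in range(n_transition):
        idx = n_low - n_transition + i
        w = gain_ratio * (i + 1) / n_transition
        full[idx] = (1.0 - w) * low_mask[idx] + w * high_value[0]
    return full
```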
11. The method of any of claims 1 to 4, wherein the fusing noise data onto single-channel noise-free audio comprises:
dividing the single-channel noise-free audio and the noise data into the same number of noise-free audio time frames and noise audio time frames according to the same time period;
and for a noise-free audio time frame and a noise audio time frame at the same time instant, superposing the frequency-domain features in the noise-free audio time frame and the frequency-domain features in the noise audio time frame.
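Claim 11 explains how the noisy training data of claim 1 is produced: the clean single-channel audio and the noise are cut into time frames of equal duration and their frequency-domain features are superposed frame by frame. A minimal sketch, assuming complex STFT frames, simple additive superposition, and an ideal-ratio mask as the training target (the target choice is an assumption, not something stated in the claims):

```python
import numpy as np
from scipy.signal import stft, istft

def make_noisy_training_example(clean: np.ndarray, noise: np.ndarray,
                                fs: int = 16000, nperseg: int = 512):
    """Fuse noise data onto single-channel noise-free audio in the frequency domain (claim 11)."""
    n = min(len(clean), len(noise))                       # same duration -> same number of frames
    _, _, clean_spec = stft(clean[:n], fs=fs, nperseg=nperseg)
    _, _, noise_spec = stft(noise[:n], fs=fs, nperseg=nperseg)

    noisy_spec = clean_spec + noise_spec                  # superpose frames at the same time instant
    _, noisy = istft(noisy_spec, fs=fs, nperseg=nperseg)  # noisy waveform, e.g. for listening checks

    eps = 1e-8                                            # avoid division by zero
    target_mask = np.clip(np.abs(clean_spec) / (np.abs(noisy_spec) + eps), 0.0, 1.0)
    return np.abs(noisy_spec), target_mask, noisy
```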
12. The method of any one of claims 1 to 4, wherein the neural network model comprises a convolutional network layer, a long-short-term memory network layer, and a fully-connected network layer, which are sequentially arranged;
the convolutional network layer is used for acquiring local features of the frequency-domain features corresponding to the noisy audio, the long-short-term memory network layer is used for acquiring time sequence features among frames in the noisy audio, and the fully-connected network layer is used for mapping the local features acquired by the convolutional network layer and the time sequence features acquired by the long-short-term memory network layer to the feature dimension corresponding to the amplitude mask.
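For orientation, the layer order named in claim 12 (convolution for local features, long short-term memory for inter-frame timing, fully connected layer for the mask dimension) can be sketched as a small PyTorch module. All sizes (number of bins, channels, hidden units) are illustrative assumptions; the claim fixes only the order and role of the layers.

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Conv -> LSTM -> FC mask estimator sketching the structure named in claim 12."""

    def __init__(self, n_bins: int = 257, channels: int = 16, hidden: int = 256):
        super().__init__()
        # Convolutional layer: local features of the frequency-domain input.
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        # LSTM layer: time sequence features across the audio frames.
        self.lstm = nn.LSTM(channels * n_bins, hidden, batch_first=True)
        # Fully connected layer: map to the feature dimension of the amplitude mask.
        self.fc = nn.Linear(hidden, n_bins)

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, n_bins, n_frames) magnitude spectrogram of the noisy audio
        x = torch.relu(self.conv(noisy_mag.unsqueeze(1)))     # (batch, ch, n_bins, n_frames)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)        # one feature vector per frame
        x, _ = self.lstm(x)                                   # (batch, n_frames, hidden)
        mask = torch.sigmoid(self.fc(x))                      # values in [0, 1]
        return mask.transpose(1, 2)                           # (batch, n_bins, n_frames)
```

With a 512-point FFT at a 16kHz sampling frequency there are 257 frequency bins, matching the 0kHz to 8kHz mask bandwidth referred to in claim 8.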
13. The method of any of claims 1 to 4, wherein the first microphone is located at a top of the terminal device and the second microphone is located at a bottom of the terminal device.
14. The method according to any of claims 1 to 4, wherein the terminal device establishes communication links with a left earphone and a right earphone of a true wireless earphone, respectively;
the first microphone is located at the left earphone, and the second microphone is located at the right earphone.
15. The method of any one of claims 1 to 4, wherein the first interface is a video interface corresponding to a camera application.
16. The method of any one of claims 1 to 4, wherein the first interface is a recording interface corresponding to a recording application.
17. A terminal device, characterized in that the terminal device comprises: a memory and a processor, wherein the memory is coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the terminal device to perform the stereo noise reduction method as defined in any one of claims 1 to 16.
18. A computer readable storage medium comprising a computer program which, when run on a terminal device, causes the terminal device to perform the stereo noise reduction method as defined in any one of claims 1 to 16.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310481612.6A CN117133305B (en) | 2023-04-27 | 2023-04-27 | Stereo noise reduction method, apparatus and storage medium |
PCT/CN2023/131385 WO2024221844A1 (en) | | 2023-11-14 | Noise reduction method for stereo sound, and device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310481612.6A CN117133305B (en) | 2023-04-27 | 2023-04-27 | Stereo noise reduction method, apparatus and storage medium |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410931313.2A Division CN118899004A (en) | 2023-04-27 | Stereo noise reduction method, apparatus and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117133305A (en) | 2023-11-28 |
CN117133305B (en) | 2024-08-06 |
Family
ID=88861698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310481612.6A Active CN117133305B (en) | 2023-04-27 | 2023-04-27 | Stereo noise reduction method, apparatus and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117133305B (en) |
WO (1) | WO2024221844A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117133306A (en) * | 2023-04-27 | 2023-11-28 | 荣耀终端有限公司 | Stereo noise reduction method, apparatus and storage medium |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3863706B2 (en) * | 2000-07-04 | 2006-12-27 | 三洋電機株式会社 | Speech coding method |
US9094645B2 (en) * | 2009-07-17 | 2015-07-28 | Lg Electronics Inc. | Method for processing sound source in terminal and terminal using the same |
KR101585852B1 (en) * | 2011-09-29 | 2016-01-15 | 돌비 인터네셔널 에이비 | High quality detection in fm stereo radio signals |
US9966067B2 (en) * | 2012-06-08 | 2018-05-08 | Apple Inc. | Audio noise estimation and audio noise reduction using multiple microphones |
CA2959090C (en) * | 2014-12-12 | 2020-02-11 | Huawei Technologies Co., Ltd. | A signal processing apparatus for enhancing a voice component within a multi-channel audio signal |
CN111344778B (en) * | 2017-11-23 | 2024-05-28 | 哈曼国际工业有限公司 | Method and system for speech enhancement |
US10573301B2 (en) * | 2018-05-18 | 2020-02-25 | Intel Corporation | Neural network based time-frequency mask estimation and beamforming for speech pre-processing |
CN110660406A (en) * | 2019-09-30 | 2020-01-07 | 大象声科(深圳)科技有限公司 | Real-time voice noise reduction method of double-microphone mobile phone in close-range conversation scene |
CN111883091B (en) * | 2020-07-09 | 2024-07-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio noise reduction method and training method of audio noise reduction model |
CN112151059A (en) * | 2020-09-25 | 2020-12-29 | 南京工程学院 | Microphone array-oriented channel attention weighted speech enhancement method |
CN113726940B (en) * | 2021-06-15 | 2023-08-22 | 北京荣耀终端有限公司 | Recording method and device |
CN114999514A (en) * | 2022-05-31 | 2022-09-02 | 青岛信芯微电子科技股份有限公司 | Training method, device and equipment of speech enhancement model |
- 2023-04-27 CN CN202310481612.6A patent/CN117133305B/en active Active
- 2023-11-14 WO PCT/CN2023/131385 patent/WO2024221844A1/en unknown
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117133306A (en) * | 2023-04-27 | 2023-11-28 | 荣耀终端有限公司 | Stereo noise reduction method, apparatus and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2024221844A1 (en) | 2024-10-31 |
CN117133305A (en) | 2023-11-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||