CN115497500B - Audio processing method and device, storage medium and intelligent glasses - Google Patents

Audio processing method and device, storage medium and intelligent glasses

Info

Publication number
CN115497500B
CN115497500B CN202211417559.5A
Authority
CN
China
Prior art keywords
frequency domain
sound source
noise
calculating
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211417559.5A
Other languages
Chinese (zh)
Other versions
CN115497500A (en)
Inventor
李逸洋
张新科
崔潇潇
鲁勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Intengine Technology Co Ltd
Original Assignee
Beijing Intengine Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Intengine Technology Co Ltd filed Critical Beijing Intengine Technology Co Ltd
Priority to CN202211417559.5A priority Critical patent/CN115497500B/en
Publication of CN115497500A publication Critical patent/CN115497500A/en
Application granted granted Critical
Publication of CN115497500B publication Critical patent/CN115497500B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G02 OPTICS
    • G02B OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B27/00 Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B27/01 Head-up displays
    • G02B27/017 Head mounted
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using orthogonal transformation
    • G10L19/0216 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using orthogonal transformation using wavelet decomposition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G02 OPTICS
    • G02B OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B27/00 Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B27/01 Head-up displays
    • G02B27/017 Head mounted
    • G02B2027/0178 Eyeglass type
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L2021/065 Aids for the handicapped in understanding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The embodiment of the application discloses an audio processing method and device, a storage medium and intelligent glasses. The method comprises the following steps: calculating a sound source direction estimated value of a target sound source through sound source positioning in a preset search range, calculating a noise power estimated value corresponding to each frequency point in all channels of the air conduction microphone array, performing self-adaptive beam forming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal, calculating a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhancing the single-channel frequency domain signal according to the time-frequency mask. The embodiment of the application can enhance the single-channel frequency domain signal obtained by self-adaptive beam forming so as to improve the accuracy of audio data and the communication efficiency of hearing-impaired people.

Description

Audio processing method and device, storage medium and intelligent glasses
Technical Field
The application relates to the technical field of data processing, in particular to an audio processing method, an audio processing device, a storage medium and intelligent glasses.
Background
At present, China has nearly thirty million hearing-impaired people, and most of them can communicate with hearing people to some extent with the help of hearing aids. However, the effect of a hearing aid cannot be guaranteed across the different conditions of hearing-impaired users; for many of them the results are not ideal, and wearing a hearing aid for a long time may even cause ear diseases. With scientific and technological progress and social development, wearable devices have gradually entered people's daily life. Smart glasses bring convenience to users and also provide hearing-impaired people with a tool for communicating with hearing people. Existing schemes that assist hearing-impaired people to communicate through smart glasses mainly focus on speech recognition, brain wave recognition, sign language recognition and the like.
The applicant finds that in the prior art, the brain wave recognition scheme acquires and processes the brain wave signals of a user through a brain wave receiver on the smart glasses, converts them into text and image information, and displays that information on the outside of the glasses so that a hearing person can communicate with the user, but this is complex to implement. The sign language recognition scheme converts the sign language of a hearing person into speech or text through a radar or camera on the smart glasses and presents it by playback or near-eye display so that the user can communicate with the hearing person; however, not all hearing people can use sign language, so this scheme is difficult to popularize. The speech recognition scheme suffers from low recognition accuracy and poor user experience in noisy environments.
Disclosure of Invention
The embodiment of the application provides an audio processing method, an audio processing device, a storage medium and intelligent glasses, which can enhance a single-channel frequency domain signal obtained through adaptive beamforming so as to improve the accuracy of audio data and the communication efficiency of hearing-impaired people.
The embodiment of the application provides an audio processing method, which is applied to intelligent glasses, wherein the intelligent glasses comprise an air conduction microphone array, and the method comprises the following steps:
calculating a sound source direction estimation value of a target sound source through sound source positioning in a preset search range;
calculating a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array;
performing adaptive beam forming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal;
and calculating a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhancing the single-channel frequency domain signal according to the time-frequency mask.
In one embodiment, the calculating the sound source direction estimation value of the target sound source by sound source localization within the preset search range includes:
calculating an angle spectrum function of the air conduction microphone array according to the preset search range;
and traversing the angle spectrum function, and determining the sound source direction estimation value of the target sound source according to the local maximum value in the angle spectrum function.
In an embodiment, the calculating the noise power estimation value corresponding to each frequency point in all channels of the air conduction microphone array includes:
acquiring a signal frequency domain smoothed power spectrum corresponding to each frequency point in all channels of the air conduction microphone array;
updating the power minimum value of each frequency point in all channels according to the signal frequency domain smooth power spectrum;
calculating the voice existence probability of each frequency point in all channels according to the signal frequency domain smooth power spectrum and the power minimum value;
and updating the noise smoothing factor of each frequency point in all channels based on the voice existence probability, and calculating the noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array according to the noise smoothing factor.
In an embodiment, the performing adaptive beamforming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency-domain signal includes:
extracting the phase of frequency domain data corresponding to each frequency point in all channels of the air conduction microphone array;
determining noise frequency domain data of each frequency point in all channels according to the phase and the noise power estimation value, and calculating a noise covariance matrix based on the noise frequency domain data;
calculating a self-adaptive beam forming weight vector according to a guide vector of the sound source direction estimated value and the noise covariance matrix;
and performing frequency domain filtering on the current frame through the self-adaptive beam forming weight vector to obtain the single-channel frequency domain signal.
In an embodiment, the training process of the preset noise reduction model includes:
generating noisy audio by using noise audio and clean speech audio;
performing framing, windowing and Fourier transform on the noisy audio to extract frequency domain features of the noisy audio;
building a noise reduction network by adopting an encoder-decoder structure, inputting the frequency domain characteristics of the noisy audio frequency into the noise reduction network, and calculating a loss function between a first time-frequency mask predicted by a model and a second time-frequency mask of a clean voice audio frequency;
and training the noise reduction network through a back propagation method and a gradient descent algorithm based on the loss function.
In an embodiment, the obtaining of the preset search range includes:
acquiring eyeball characteristic information of a current user to determine a focusing direction of the current user;
and determining the preset search range according to the neighborhood range of the focusing direction.
In an embodiment, after enhancing the single channel frequency domain signal, the method further comprises:
and converting the enhanced audio signal into text information, and displaying the text information on the intelligent glasses.
The embodiment of the present application further provides an audio processing apparatus, applied to smart glasses, where the smart glasses include an air conduction microphone array. The apparatus includes:
the first calculation module is used for calculating a sound source direction estimation value of a target sound source through sound source positioning in a preset search range;
the second calculation module is used for calculating a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array;
the forming module is used for carrying out self-adaptive beam forming on the basis of the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal;
and the enhancement module is used for calculating a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model and enhancing the single-channel frequency domain signal according to the time-frequency mask.
Embodiments of the present application further provide a storage medium, where the storage medium stores a computer program, and the computer program is suitable for being loaded by a processor to perform the steps in the audio processing method according to any one of the above embodiments.
The embodiment of the present application further provides smart glasses, which include a memory and a processor; a computer program is stored in the memory, and the processor executes the steps in the audio processing method according to any one of the above embodiments by calling the computer program stored in the memory.
The audio processing method, the audio processing device, the storage medium and the smart glasses provided by the embodiment of the application can calculate the sound source direction estimated value of a target sound source through sound source positioning in a preset search range, calculate the noise power estimated value corresponding to each frequency point in all channels in the air conduction microphone array, perform self-adaptive beam forming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal, calculate the time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhance the single-channel frequency domain signal according to the time-frequency mask. The embodiment of the application can enhance the single-channel frequency domain signal obtained by self-adaptive beam forming so as to improve the accuracy of audio data and the communication efficiency of hearing-impaired people.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a system diagram of an audio processing apparatus according to an embodiment of the present disclosure.
Fig. 2 is a schematic flowchart of an audio processing method according to an embodiment of the present application.
Fig. 3 is a schematic flowchart of another audio processing method according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a noise reduction model training process provided in the embodiment of the present application.
Fig. 5 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of smart glasses provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides an audio processing method and device, a storage medium and intelligent glasses. Specifically, the audio processing method according to the embodiment of the present application may be executed by an electronic device, where the electronic device may be smart glasses, and the smart glasses include an air conduction microphone, and the air conduction microphone is used for acquiring a voice signal of another person.
For example, when the audio processing method runs on smart glasses, a sound source direction estimated value of the target sound source is calculated through sound source localization within a preset search range, a noise power estimated value corresponding to each frequency point in all channels of the air conduction microphone array is calculated, adaptive beamforming is performed based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal, a time-frequency mask of the single-channel frequency domain signal is calculated through a preset noise reduction model, and the single-channel frequency domain signal is enhanced according to the time-frequency mask. The smart glasses may interact with the user through a graphical user interface, which can be provided in several ways: for example, it may be rendered on a display screen embedded in the smart glasses lenses, or holographically projected onto the lenses. Accordingly, the smart glasses may include a display screen for presenting the graphical user interface and receiving user operation instructions generated by the user acting on the graphical user interface, and a processor.
Referring to fig. 1, fig. 1 is a system schematic diagram of an audio processing apparatus according to an embodiment of the present disclosure. The system may include smart glasses 1000, at least one server or personal computer 2000. The smart glasses 1000 held by the user may be connected to a server or a personal computer through a network. The smart glasses 1000 may be a terminal device having computing hardware capable of supporting and executing software products corresponding to multimedia, for example, capable of supporting voice recognition and text conversion. In addition, the smart glasses 1000 may also have a display screen or a projection device. In addition, the smart glasses 1000 may be interconnected with a server or a personal computer 2000 through a network. The network may be a wireless network or a wired network, such as a Wireless Local Area Network (WLAN), a Local Area Network (LAN), a cellular network, a 2G network, a 3G network, a 4G network, a 5G network, etc. In addition, different smart glasses 1000 may be connected to other smart glasses or to a server, a personal computer, and the like using their own bluetooth network or a hotspot network. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform.
The embodiment of the application provides an audio processing method which can be executed by intelligent glasses or a server. The embodiment of the present application is described by taking an example in which the audio processing method is executed by smart glasses. The intelligent glasses comprise an air conduction microphone array and a processor, wherein the processor is configured to calculate a sound source direction estimated value of a target sound source through sound source positioning in a preset search range, calculate a noise power estimated value corresponding to each frequency point in all channels in the air conduction microphone array, perform self-adaptive beam forming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal, calculate a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhance the single-channel frequency domain signal according to the time-frequency mask.
Referring to fig. 2, the specific process of the method may be as follows:
step 101, calculating a sound source direction estimation value of a target sound source through sound source positioning in a preset search range.
In the embodiment of the present application, the microphones of the smart glasses may include two types, an air conduction microphone array and a bone conduction microphone, wherein the air conduction microphone array is used for receiving signals from the external environment through air conduction, such as ambient sound and the speech of other people. The air conduction microphone array has at least two microphone channels with a known array configuration and microphone spacing; it may be a linear array, a planar array, another regular array, or an irregular array.
In one embodiment, sound source localization may be performed with the air conduction microphone array, which includes a plurality of microphones arranged in a known pattern. The microphones synchronously collect sound signals, and the phase differences between the microphone signals are used to determine the position from which the sound source signal is emitted.
In an embodiment, a preset search range may be determined according to the direction of the user's point of attention, and sound source localization may be performed within the preset search range. Specifically, the smart glasses can obtain prior information (θ0, ϕ0) on the direction of the user's gaze point through eye tracking, where θ0 represents the azimuth angle of the user's gaze point and ϕ0 represents the pitch angle of the user's gaze point. A sound source localization search range is then defined centered on this prior information (θ0, ϕ0): the azimuth search range is [θ0 - 3σ, θ0 + 3σ] with search interval Δθ, and the pitch search range is [ϕ0 - 3σ, ϕ0 + 3σ] with search interval Δϕ, where σ represents the standard deviation of the angle estimate and 3σ defines the neighborhood range of the search interval. The confidence of this interval is 99.74%, that is, the probability that the true sound source direction falls within the 3σ neighborhood of the user's gaze point is 99.74%; the standard deviation σ of the angle estimate is related to the beam width of the microphone array and the signal-to-noise ratio of the received signal. Finally, sound source localization is performed within the search range to determine the target sound source direction, which can be represented by a sound source direction estimated value.
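As an illustrative sketch only, the gaze-centered search grid described above could be built as follows; the function and parameter names (theta0, phi0, sigma, d_theta, d_phi) and the use of degrees are assumptions made for illustration and do not come from the patent text.

```python
import numpy as np

def gaze_centered_search_grid(theta0, phi0, sigma, d_theta, d_phi):
    """Candidate (azimuth, pitch) angles centered on the user's gaze direction.

    theta0, phi0   : azimuth / pitch of the gaze point from eye tracking (degrees)
    sigma          : standard deviation of the angle estimate (degrees)
    d_theta, d_phi : azimuth / pitch search intervals (degrees)
    The grid covers the 3-sigma neighborhood of the gaze direction.
    """
    azimuths = np.arange(theta0 - 3 * sigma, theta0 + 3 * sigma + d_theta, d_theta)
    pitches = np.arange(phi0 - 3 * sigma, phi0 + 3 * sigma + d_phi, d_phi)
    return azimuths, pitches

# Example: gaze at azimuth 10 deg, pitch 0 deg, sigma = 5 deg, 1 deg search step
azimuths, pitches = gaze_centered_search_grid(10.0, 0.0, 5.0, 1.0, 1.0)
```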
In an embodiment, the sound source may be located within a preset search range by a preset method, where the preset method may include a cross-correlation or super-resolution algorithm, and also include a deep learning algorithm implemented by a structure such as a convolutional neural network or a cyclic neural network. Further, the number of local peak values during sound source localization in a preset search range can be obtained in the sound source localization process, and if the number of local peak values is 1, the direction corresponding to the local peak value is determined as the target sound source direction; if the number of the local peak values is larger than 1, prompting the sound source directions corresponding to the local peak values on the intelligent glasses, and receiving a user instruction to confirm the target sound source direction from the sound source directions.
For example, if there are multiple speakers opposite the current user, then after the search interval is defined according to the preset search range, there may still be multiple local peaks found in that interval. In this case the smart glasses may prompt the user, through near-eye display or similar means, that multiple sound sources exist and let the user designate one of them; after the user confirms a sound source direction, it is taken as the target sound source direction, and the speech of that speaker is directionally enhanced to assist the user in communication.
And 102, calculating a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array.
In an embodiment, the signal frequency domain smoothed power spectrum of each frequency point of each channel in the air conduction microphone array is updated first, and the calculation formula is:

S_m(t,f) = α_s · S_m(t-1,f) + (1-α_s) · |Y_m(t,f)|²

wherein S_m(t,f) represents the frequency domain smoothed power spectrum of the mth microphone channel at the tth frame and the fth frequency point, S_m(t-1,f) represents that at the (t-1)th frame, |Y_m(t,f)|² is the frequency domain power spectrum of the mth microphone channel at the tth frame and the fth frequency point, α_s represents the frequency domain power spectrum smoothing factor, and |·| represents the modulo operation. Then, the power minimum of each frequency point of each channel is updated according to the obtained smoothed power spectrum:

S_m,min(t,f) = γ · S_m,min(t-1,f) + ((1-γ)/(1-β)) · (S_m(t,f) - β · S_m(t-1,f)), if S_m,min(t-1,f) < S_m(t,f)
S_m,min(t,f) = S_m(t,f), otherwise

wherein S_m,min(t,f) represents the power minimum of the mth microphone channel at the tth frame and the fth frequency point, and γ and β are empirical constants. Then, the speech presence probability of each frequency point of each channel can be calculated from the smoothed power spectrum and the power minimum:

I_m(t,f) = 1 if S_m(t,f) / S_m,min(t,f) > ξ(f), otherwise I_m(t,f) = 0
P_m(t,f) = α_p · P_m(t-1,f) + (1-α_p) · I_m(t,f)

wherein I_m(t,f) indicates whether speech of the mth microphone channel is present at the tth frame and the fth frequency point, and ξ(f) is the threshold of the fth frequency point: if the ratio of the smoothed power spectrum S_m(t,f) to the power minimum S_m,min(t,f) exceeds the threshold, speech is judged to be present, otherwise absent. P_m(t,f) represents the speech presence probability of the mth microphone channel at the tth frame and the fth frequency point, P_m(t-1,f) represents that at the (t-1)th frame, and α_p represents the speech presence probability smoothing factor. Then, the noise smoothing factor of each frequency point of each channel can be updated according to the speech presence probability:

α_m(t,f) = α_d + (1-α_d) · P_m(t,f)

wherein α_m(t,f) represents the noise smoothing factor of the mth microphone channel at the tth frame and the fth frequency point, and α_d represents the noise smoothing factor coefficient. Finally, the noise power estimate of each frequency point of each channel is obtained from the frequency domain power spectrum of the received signal and the noise smoothing factor:

λ_m(t,f) = α_m(t,f) · λ_m(t-1,f) + (1-α_m(t,f)) · |Y_m(t,f)|²

wherein λ_m(t,f) represents the noise power estimated value of the mth microphone channel at the tth frame and the fth frequency point, and λ_m(t-1,f) represents that at the (t-1)th frame. That is, the step of calculating the noise power estimation value corresponding to each frequency point in all channels of the air conduction microphone array may include: acquiring the signal frequency domain smoothed power spectrum corresponding to each frequency point in all channels of the air conduction microphone array; updating the power minimum of each frequency point in all channels according to the smoothed power spectrum; calculating the speech presence probability of each frequency point in all channels according to the smoothed power spectrum and the power minimum; and updating the noise smoothing factor of each frequency point in all channels based on the speech presence probability, and calculating the noise power estimation value corresponding to each frequency point in all channels according to the noise smoothing factor.
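The per-channel, per-bin recursions above can be summarized in a short sketch. This is a minimal illustration that follows the recursions described in this step; the smoothing constants (alpha_s, gamma, beta, xi, alpha_p, alpha_d) are placeholder values, not values specified by the patent.

```python
import numpy as np

def update_noise_estimate(Y, state, alpha_s=0.8, gamma=0.998, beta=0.96,
                          xi=5.0, alpha_p=0.2, alpha_d=0.85):
    """One-frame update of the noise power estimate for every channel and bin.

    Y     : (M, F) complex STFT of the current frame (M channels, F frequency bins)
    state : dict with the previous frame's S (smoothed power), S_min (tracked
            minimum), P (speech presence probability) and noise_psd
    """
    power = np.abs(Y) ** 2
    S = alpha_s * state["S"] + (1.0 - alpha_s) * power            # smoothed power spectrum
    # continuous minimum tracking with empirical constants gamma, beta
    S_min = np.where(state["S_min"] < S,
                     gamma * state["S_min"]
                     + (1.0 - gamma) / (1.0 - beta) * (S - beta * state["S"]),
                     S)
    I = (S / np.maximum(S_min, 1e-12) > xi).astype(float)          # per-bin speech indicator
    P = alpha_p * state["P"] + (1.0 - alpha_p) * I                 # speech presence probability
    alpha_m = alpha_d + (1.0 - alpha_d) * P                        # time-varying smoothing factor
    noise_psd = alpha_m * state["noise_psd"] + (1.0 - alpha_m) * power
    return {"S": S, "S_min": S_min, "P": P, "noise_psd": noise_psd}
```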
And 103, performing adaptive beam forming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal.
After the target sound source direction is determined, an audio signal, specifically a single-channel frequency domain beamformed signal, can be obtained through adaptive beamforming. The target sound source direction is represented by the sound source direction estimated value; because this estimate is accurate, performing adaptive beamforming with it directionally enhances the speech signal in the direction of the interactive object and ensures the quality of the beamformed audio signal. Although adaptive beamforming can directionally enhance the target speech, the beamformed signal still contains a certain amount of environmental noise; performing single-channel speech enhancement afterwards further improves the output signal-to-noise ratio, yielding a more accurate speech recognition result and a better user experience.
The adaptive beamforming method includes, but is not limited to, minimum variance undistorted response, generalized sidelobe cancellation, and the like. The noise estimation method in the adaptive beamforming includes, but is not limited to, traditional algorithms such as minimum tracking, recursive least squares, and the like, and further includes a deep learning algorithm implemented in a structure such as a convolutional neural network or a cyclic neural network. The single-channel speech enhancement method includes, but is not limited to, traditional algorithms such as wiener filtering, minimum mean square error estimation and the like, and also includes a deep learning algorithm implemented in a structure such as a convolutional neural network or a cyclic neural network.
Specifically, the phase of the frequency domain data of each frequency point of each channel in the air conduction microphone array can be extracted, and the calculation formula is:

Θ_m(t,f) = angle[Y_m(t,f)]

wherein Θ_m(t,f) represents the phase of the mth microphone channel at the tth frame and the fth frequency point, and angle(·) represents the phase-taking operation. The phase Θ_m(t,f) and the noise power estimate λ_m(t,f) can then be used to calculate the noise frequency domain data of each frequency point of each channel:

N_m(t,f) = sqrt(λ_m(t,f)) · e^{jΘ_m(t,f)}

wherein N_m(t,f) represents the noise frequency domain data of the mth microphone channel at the tth frame and the fth frequency point, and j represents the imaginary unit. The noise signal vector N(t,f) = [N_1(t,f), ..., N_M(t,f)]^T is then used to calculate the noise covariance matrix of each frequency point:

R_NN(t,f) = α_R · R_NN(t-1,f) + (1-α_R) · N(t,f)N(t,f)^H

wherein R_NN(t,f) represents the noise covariance matrix of the microphone array at the tth frame and the fth frequency point, α_R represents the noise covariance matrix smoothing factor, and (·)^H represents the conjugate transpose operation. Next, the adaptive beamforming weight vector can be calculated from the steering vector corresponding to the sound source direction estimated value obtained in step 101 and the noise covariance matrix:

ω(t,f) = R_NN(t,f)^(-1) a(f) / (a(f)^H R_NN(t,f)^(-1) a(f))

wherein ω(t,f) represents the adaptive beamforming weight vector corresponding to the tth frame and the fth frequency point, and a(f) represents the steering vector of the sound source direction estimated value at the fth frequency point, calculated as:

a(f) = exp(-j2πf · (x·cos ϕ̂·cos θ̂ + y·cos ϕ̂·sin θ̂ + z·sin ϕ̂) / c)

wherein θ̂ and ϕ̂ are the azimuth and pitch of the sound source direction estimated value, x, y and z respectively represent the vectors of x-, y- and z-coordinates of the microphone array elements, and c represents the speed of sound. After the adaptive beamforming weight vector is obtained, frequency domain filtering may be performed on the current frame frequency domain data using this weight vector to obtain the single-channel adaptive beamforming frequency domain signal:

Y_BF(t,f) = ω(t,f)^H Y(t,f)

wherein Y_BF(t,f) represents the single-channel adaptive beamforming frequency domain signal corresponding to the tth frame and the fth frequency point. That is, the step of performing adaptive beamforming based on the sound source direction estimated value and the noise power estimated value to obtain the single-channel frequency domain signal may include: extracting the phase of the frequency domain data corresponding to each frequency point in all channels of the air conduction microphone array; determining the noise frequency domain data of each frequency point in all channels according to the phase and the noise power estimated value, and calculating the noise covariance matrix based on the noise frequency domain data; calculating the adaptive beamforming weight vector according to the steering vector of the sound source direction estimated value and the noise covariance matrix; and performing frequency domain filtering on the current frame through the adaptive beamforming weight vector to obtain the single-channel frequency domain signal.
And step 104, calculating a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhancing the single-channel frequency domain signal according to the time-frequency mask.
In an embodiment, the single-channel adaptive beamforming frequency domain signal obtained in step 103 is input into a pre-trained noise reduction model to obtain the time-frequency mask predicted by the network:

mask(t,f) = E{Y_BF(t,f)}

wherein E{·} represents the noise reduction neural network model. Finally, the obtained time-frequency mask is applied to the single-channel frequency domain beamformed signal to obtain the enhanced single-channel frequency domain signal:

Y_enhanced(t,f) = mask(t,f) · Y_BF(t,f)
as can be seen from the above, the audio processing method provided in this embodiment of the present application may calculate a sound source direction estimation value of a target sound source through sound source positioning within a preset search range, calculate a noise power estimation value corresponding to each frequency point in all channels of an air conduction microphone array, perform adaptive beam forming based on the sound source direction estimation value and the noise power estimation value, obtain a single-channel frequency domain signal, calculate a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhance the single-channel frequency domain signal according to the time-frequency mask. The embodiment of the application can enhance the single-channel frequency domain signal obtained by self-adaptive beam forming so as to improve the accuracy of audio data and the communication efficiency of hearing-impaired people.
Please refer to fig. 3, which is a schematic flow chart of an audio processing method according to an embodiment of the present application. The specific process of the method can be as follows:
in step 201, eyeball characteristic information of a current user is acquired to determine a focusing direction of the current user.
In one embodiment, the eyes of the current user may be photographed by a camera on the smart glasses to obtain an eye image, and then the initial focusing direction of the current user is determined based on eyeball feature information in the image. Specifically, the tracking can be performed through the characteristic changes of the eyeball and the eyeball periphery, or the tracking can be performed according to the angle change of the iris, or the characteristics can be extracted by actively projecting light beams such as infrared rays to the iris, and then the tracking can be performed according to the characteristics. This embodiment does not further limit this.
Step 202, determining a preset search range according to the neighborhood range of the focusing direction.
The focusing direction is used as prior information, and due to the fact that the eyes have intrinsic blinking and shaking, the user or an interactive object may also move in the communication, so that the focusing direction of the sight of the user is not accurate, and if the self-adaptive beam forming is directly performed in the focusing direction, the subsequent voice enhancement performance is poor and the voice recognition result is inaccurate due to inaccurate positioning, and the user experience is influenced. On the other hand, if there is no prior information of the focusing direction of the user's sight, the air guide microphone array needs to search and position in the full airspace, the computation complexity is very high, the search interval cannot be divided too finely, the accuracy of sound source positioning is still not high, and the speech enhancement performance of beam forming is also affected. Therefore, in this embodiment, after obtaining the prior information, i.e., the focusing direction of the user's sight line, by using techniques such as eye tracking, a search interval of a certain neighborhood range may be defined as a preset search range with the prior information as a center, and sound source localization is further performed within the preset search range to determine a final target sound source and a target sound source direction.
Step 203, calculating an angle spectrum function of the air conduction microphone array according to a preset search range.
Specifically, non-repeating microphone pairing may be performed on the air conduction microphone array. Taking the pair formed by microphone m1 and microphone m2 as an example, the generalized cross-correlation function of the pair is calculated as:

R_m1m2(t,f) = Ψ_m1m2(f) Y_m1(t,f) Y_m2*(t,f)

wherein R_m1m2(t,f) represents the generalized cross-correlation function of microphone m1 and microphone m2 at the tth frame and the fth frequency point, Ψ_m1m2(f) is the weighting function of microphone m1 and microphone m2 at the fth frequency point, which may be the phase transform, the smoothed coherence transform, or the like, and * denotes the conjugate operation. The inverse Fourier transform of the generalized cross-correlation function of the pair is then calculated to obtain the angle spectrum function P_m1m2(θ, ϕ) of the pair. All microphone pairs are traversed, the above steps are repeated, and the angle spectrum functions of all pairs are accumulated to obtain the angle spectrum function P(θ, ϕ) of the array.
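For illustration, one way to evaluate such a pairwise generalized cross-correlation angle spectrum over a candidate grid is sketched below as a steered-response sum with PHAT weighting; the array geometry, grid, weighting choice, and sign conventions are assumptions, not details taken from the patent.

```python
import numpy as np

def gcc_phat_angle_spectrum(Y, mic_xyz, azimuths, pitches, freqs, c=343.0):
    """Accumulate a GCC-PHAT angle spectrum over all non-repeating microphone pairs.

    Y        : (M, F) complex STFT of the current frame
    mic_xyz  : (M, 3) microphone coordinates in meters
    azimuths, pitches : candidate angles in radians (the preset search range)
    freqs    : (F,) bin frequencies in Hz
    Assumes the signal model X_m(f) = S(f) * exp(-j*2*pi*f*tau_m), tau_m = r_m . u / c.
    """
    M = Y.shape[0]
    P = np.zeros((len(azimuths), len(pitches)))
    for i, az in enumerate(azimuths):
        for k, el in enumerate(pitches):
            unit = np.array([np.cos(el) * np.cos(az),
                             np.cos(el) * np.sin(az),
                             np.sin(el)])
            delays = mic_xyz @ unit / c
            for m1 in range(M):
                for m2 in range(m1 + 1, M):                    # non-repeating pairs
                    cross = Y[m1] * np.conj(Y[m2])
                    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT weighting
                    tau = delays[m1] - delays[m2]
                    # compensate the assumed propagation delay of this candidate direction
                    P[i, k] += np.real(np.sum(cross * np.exp(2j * np.pi * freqs * tau)))
    return P

# The direction estimate corresponds to the (local) maximum of P over the search grid:
# i, k = np.unravel_index(np.argmax(P), P.shape)
```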
And step 204, traversing the angle spectrum function, and determining the sound source direction estimation value of the target sound source according to the local maximum value in the angle spectrum function.
In one embodiment, the angle spectrum function P(θ, ϕ) is traversed over the preset search range to find its local maxima. If there is only one local maximum, the search range contains only one sound source, and the azimuth and pitch angles (θ̂, ϕ̂) corresponding to this local maximum are taken as the direction estimate of the interactive object, where θ̂ represents the estimated azimuth of the interactive object and ϕ̂ represents its estimated pitch angle. If the angle spectrum function P(θ, ϕ) contains multiple local maxima, there are other sound sources besides the interactive object within the search range; the user is then prompted through the near-eye display to designate one sound source direction, and the direction (θ̂, ϕ̂) designated by the user is taken as the direction estimate of the interactive object.
And step 205, calculating a noise power estimation value corresponding to each frequency point in all channels of the air conduction microphone array.
And step 206, performing adaptive beam forming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal.
The above-mentioned content of calculating the noise power estimation value and obtaining the single-channel frequency-domain signal through the adaptive beamforming may refer to the related description in the previous embodiment, which is not further limited in this embodiment.
And step 207, calculating a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhancing the single-channel frequency domain signal according to the time-frequency mask.
In an embodiment, a single-channel noise reduction model may be obtained by pre-training, and the training process is as shown in fig. 4, and includes the following steps:
a1: and constructing a model training data set, and specifically performing data amplification on clean voice audio by using noise audio to obtain a data set for noise reduction model training. Firstly, adding reverberation to clean audio to obtain reverberation audio; then, according to the appointed signal-to-noise ratio range, the reverberation audio energy and the noise audio energy are respectively calculated to obtain signal-to-noise ratio coefficients, and then the noise audio with the corresponding proportion is superposed on the reverberation audio to obtain the noise-carrying audio.
a2: and carrying out operations such as framing, windowing, fourier transform and the like on the noisy audio, and extracting the frequency domain characteristics of the noisy audio.
a3: building a network training noise reduction model: a noise reduction network is built by adopting an encoder-decoder structure, and the network comprises a plurality of nonlinear layers and can be realized by convolutional layers, long-time memory network layers, full connection layers and the like. And (c) inputting the frequency domain characteristics with the noise obtained in the step a2 into the built noise reduction network, and calculating a loss function between the time frequency mask predicted by the model and the clean voice time frequency mask, wherein the loss function can select a mean square error and the like. And training by using a loss function through a back propagation and gradient descent algorithm to obtain the noise reduction model.
That is, the training process of the preset noise reduction network may include: generating noisy audio from noise audio and clean speech audio; performing framing, windowing and Fourier transform on the noisy audio to extract its frequency domain features; building a noise reduction network with an encoder-decoder structure, inputting the frequency domain features of the noisy audio into the noise reduction network, and calculating a loss function between the first time-frequency mask predicted by the model and the second time-frequency mask of the clean speech audio; and training the noise reduction network through back propagation and gradient descent based on the loss function.
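A minimal training sketch is given below, assuming PyTorch and a simple fully connected encoder, an LSTM, and a sigmoid decoder; the layer sizes, feature shape, and optimizer settings are illustrative assumptions, not the architecture disclosed in the patent.

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Small encoder-decoder mask estimator (illustrative only).

    Input : (batch, frames, bins) magnitude features of noisy audio
    Output: (batch, frames, bins) time-frequency mask in [0, 1]
    """
    def __init__(self, bins=257, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(bins, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)     # temporal modeling
        self.decoder = nn.Sequential(nn.Linear(hidden, bins), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        h, _ = self.rnn(h)
        return self.decoder(h)

def train_step(model, optimizer, noisy_mag, target_mask):
    """One step: MSE loss between the predicted mask and the clean-speech mask."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy_mag), target_mask)
    loss.backward()        # back propagation
    optimizer.step()       # gradient descent update
    return loss.item()

# model = MaskNet(); optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```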
After the training is completed, the single-channel adaptive beamforming frequency domain signal obtained in step 206 is input into the trained noise reduction model to obtain the time-frequency mask mask(t,f) predicted by the network. Finally, the obtained time-frequency mask mask(t,f) is applied to the single-channel frequency domain beamformed signal to obtain the enhanced single-channel frequency domain signal.
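A short inference sketch follows, assuming the illustrative MaskNet above; it applies the predicted mask to the beamformed frequency domain signal, mirroring Y_enhanced(t,f) = mask(t,f) · Y_BF(t,f).

```python
import numpy as np
import torch

def enhance(model, Y_bf):
    """Predict a time-frequency mask for the beamformed STFT and apply it.

    model : trained mask estimator (e.g. the illustrative MaskNet above)
    Y_bf  : (frames, bins) complex single-channel frequency domain signal
    """
    with torch.no_grad():
        mag = torch.from_numpy(np.abs(Y_bf)).float().unsqueeze(0)   # (1, frames, bins)
        mask = model(mag).squeeze(0).numpy()
    return mask * Y_bf    # enhanced single-channel frequency domain signal
```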
And step 208, converting the enhanced audio signal into text information, and displaying the text information on the intelligent glasses.
In an embodiment, the enhanced audio signal is a single-channel frequency domain signal. Feature extraction is performed on this signal, the extracted feature parameters are input into a pre-trained recognition network to obtain a recognition result, and the recognition result is displayed on a display screen in the smart glasses lens or projected directly onto the lens, which is convenient for hearing-impaired people to communicate.
The text conversion mode of the intelligent glasses can be manually started through user operation, for example, a user clicks a key on the intelligent glasses or starts through a preset gesture. In another embodiment, the text conversion mode of the smart glasses may also be automatically turned on when the trigger condition is met, for example, when the air conduction microphone array receives a first voice signal including a preset keyword or receives a second voice signal through a bone conduction microphone in the smart glasses, the text conversion mode of the smart glasses is automatically turned on, which is not further described in this embodiment.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
As can be seen from the above, the audio processing method provided in this embodiment of the present application may obtain eyeball feature information of a current user to determine a focusing direction of the current user, determine a preset search range according to a neighborhood range of the focusing direction, calculate an angle spectral function of an air conduction microphone array according to the preset search range, traverse the angle spectral function, determine a sound source direction estimation value of a target sound source according to a local maximum value in the angle spectral function, calculate a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array, perform adaptive beam forming based on the sound source direction estimation value and the noise power estimation value, obtain a single-channel frequency domain signal, calculate a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, enhance the single-channel frequency domain signal according to the time-frequency mask, convert the enhanced audio signal into text information, and display the text information on intelligent glasses. The embodiment of the application can enhance the single-channel frequency domain signal obtained by self-adaptive beam forming to improve the accuracy of audio data, and can convert the enhanced signal into characters to be displayed for a user, thereby improving the communication efficiency of hearing-impaired people.
In order to better implement the audio processing method according to the embodiment of the present application, an embodiment of the present application further provides an audio processing apparatus. Referring to fig. 5, fig. 5 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present disclosure. The audio processing apparatus may include:
a first calculating module 301, configured to calculate a sound source direction estimation value of a target sound source through sound source localization within a preset search range;
a second calculating module 302, configured to calculate a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array;
a forming module 303, configured to perform adaptive beam forming based on the sound source direction estimated value and the noise power estimated value, to obtain a single-channel frequency domain signal;
and the enhancing module 304 is configured to calculate a time-frequency mask of the single-channel frequency-domain signal through a preset noise reduction model, and enhance the single-channel frequency-domain signal according to the time-frequency mask.
In an embodiment, please further refer to fig. 6, where fig. 6 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present disclosure. The first calculating module 301 may include:
the first calculating submodule 3011 is configured to calculate an angle spectral function of the air conduction microphone array according to the preset search range;
and the traversing submodule 3012 is configured to traverse the angle spectral function, and determine the sound source direction estimation value of the target sound source according to a local maximum in the angle spectral function.
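As a rough sketch of the first calculating module 301, the following Python code evaluates a steered-response-power (SRP-PHAT style) angle spectral function over the preset search range and takes its maximum as the sound source direction estimation value. The exact spectral function, the array geometry (a linear array is assumed here) and the constants are not disclosed by the embodiment and are assumptions of this sketch; the embodiment also traverses the function for local maxima rather than the single global maximum used below.

import numpy as np

def angle_spectrum(frame_fft, mic_x, angles_deg, fs, c=343.0):
    # frame_fft: (num_mics, num_bins) frequency domain data of the current frame
    # mic_x:     (num_mics,) microphone x-coordinates in metres (linear array assumed)
    num_mics, num_bins = frame_fft.shape
    freqs = np.fft.rfftfreq(2 * (num_bins - 1), d=1.0 / fs)
    phat = frame_fft / (np.abs(frame_fft) + 1e-12)             # keep phase only (PHAT weighting)
    spectrum = np.zeros(len(angles_deg))
    for i, ang in enumerate(np.deg2rad(angles_deg)):
        delays = mic_x * np.cos(ang) / c                       # per-channel delay for this candidate angle
        steer = np.exp(-2j * np.pi * np.outer(delays, freqs))  # steering vectors, (num_mics, num_bins)
        beam = np.sum(phat * np.conj(steer), axis=0)           # align and sum the channels
        spectrum[i] = np.sum(np.abs(beam) ** 2)                # steered response power at this angle
    return spectrum

def estimate_doa(frame_fft, mic_x, angles_deg, fs):
    angles_deg = np.asarray(angles_deg, dtype=float)
    spec = angle_spectrum(frame_fft, mic_x, angles_deg, fs)
    return float(angles_deg[int(np.argmax(spec))])             # sound source direction estimation value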
In an embodiment, the second calculating module 302 may include:
an obtaining submodule 3021, configured to obtain a signal frequency domain smoothed power spectrum corresponding to each frequency point in all channels in the air conduction microphone array;
the first updating submodule 3022 is configured to update the minimum power value of each frequency point in all channels according to the signal frequency domain smoothed power spectrum;
the second calculating submodule 3023 is configured to calculate the voice existence probability of each frequency point in all channels according to the signal frequency domain smoothed power spectrum and the power minimum value;
and the second updating sub-module 3024 is configured to update the noise smoothing factor of each frequency point in all channels based on the speech existence probability, and calculate a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array according to the noise smoothing factor.
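The submodules 3021 to 3024 follow the general pattern of minima-controlled recursive averaging. The Python sketch below is one possible reading of that pattern, applied independently to each channel; the smoothing constants and the speech presence decision rule are assumptions, as the embodiment does not disclose the exact values.

import numpy as np

class NoisePowerEstimator:
    # One instance per channel of the air conduction microphone array.
    def __init__(self, num_bins, alpha_s=0.8, alpha_d=0.95, ratio_thresh=5.0):
        self.smoothed = np.zeros(num_bins)       # signal frequency domain smoothed power spectrum
        self.p_min = np.full(num_bins, np.inf)   # power minimum value per frequency point
        self.noise = np.zeros(num_bins)          # noise power estimation value per frequency point
        self.alpha_s = alpha_s                   # power spectrum smoothing constant (assumed)
        self.alpha_d = alpha_d                   # base noise smoothing factor (assumed)
        self.ratio_thresh = ratio_thresh         # speech presence decision threshold (assumed)

    def update(self, frame_fft):
        power = np.abs(frame_fft) ** 2
        # 1) smooth the signal power spectrum over time
        self.smoothed = self.alpha_s * self.smoothed + (1.0 - self.alpha_s) * power
        # 2) update the per-bin power minimum value
        self.p_min = np.minimum(self.p_min, self.smoothed)
        # 3) speech existence probability from the ratio of smoothed power to its minimum
        speech_prob = (self.smoothed > self.ratio_thresh * self.p_min).astype(float)
        # 4) speech presence raises the noise smoothing factor so the noise estimate
        #    is updated more slowly while speech is present
        alpha = self.alpha_d + (1.0 - self.alpha_d) * speech_prob
        self.noise = alpha * self.noise + (1.0 - alpha) * power
        return self.noise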
In one embodiment, the forming module 303 may include:
an extraction submodule 3031, configured to extract a phase of frequency domain data corresponding to each frequency point in all channels in the air conduction microphone array;
a third calculating submodule 3032, configured to determine noise frequency domain data of each frequency point in all channels according to the phase and the noise power estimation value, and calculate a noise covariance matrix based on the noise frequency domain data;
a fourth calculating submodule 3033, configured to calculate an adaptive beamforming weight vector according to the steering vector of the sound source direction estimation value and the noise covariance matrix;
and a filtering submodule 3034, configured to perform frequency domain filtering on the current frame through the adaptive beamforming weight vector, so as to obtain the single-channel frequency domain signal.
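The forming module 303 can be read as an MVDR-style beamformer. The Python sketch below follows submodules 3031 to 3034 under simplifying assumptions: the noise covariance matrix is built from the current frame only (a rank-one estimate with diagonal regularization), whereas a practical implementation would normally average it over several frames.

import numpy as np

def adaptive_beamform(frame_fft, steering, noise_power, reg=1e-6):
    # frame_fft:   (num_mics, num_bins) frequency domain data of the current frame
    # steering:    (num_mics, num_bins) steering vectors for the sound source direction estimation value
    # noise_power: (num_mics, num_bins) noise power estimation value per channel and frequency point
    num_mics, num_bins = frame_fft.shape
    out = np.zeros(num_bins, dtype=complex)
    # Noise frequency domain data: phase of the observed data combined with the estimated noise magnitude
    noise_fd = np.sqrt(noise_power) * np.exp(1j * np.angle(frame_fft))
    for k in range(num_bins):
        n = noise_fd[:, k][:, None]
        R = n @ n.conj().T + reg * np.eye(num_mics)    # noise covariance matrix (regularized)
        d = steering[:, k][:, None]
        r_inv_d = np.linalg.solve(R, d)
        w = r_inv_d / (d.conj().T @ r_inv_d)           # adaptive beam forming weight vector
        out[k] = (w.conj().T @ frame_fft[:, k][:, None]).item()  # frequency domain filtering of the current frame
    return out                                         # single-channel frequency domain signal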
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
As can be seen from the above, in the audio processing apparatus provided in this embodiment of the application, the first calculating module 301 calculates a sound source direction estimation value of a target sound source through sound source localization within a preset search range, the second calculating module 302 calculates a noise power estimation value corresponding to each frequency point in all channels of the air conduction microphone array, the forming module 303 performs adaptive beam forming based on the sound source direction estimation value and the noise power estimation value to obtain a single-channel frequency domain signal, and the enhancing module 304 calculates a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model and enhances the single-channel frequency domain signal according to the time-frequency mask. The embodiment of the application can enhance the single-channel frequency domain signal obtained by adaptive beam forming so as to improve the accuracy of the audio data and the communication efficiency of hearing-impaired people.
Correspondingly, the embodiment of the present application further provides smart glasses, where the smart glasses may be a terminal or a server, and the terminal may be a terminal device such as a smart phone, a tablet computer, a notebook computer, a touch screen device, a game machine, a Personal Computer (PC), or a Personal Digital Assistant (PDA). As shown in fig. 7, fig. 7 is a schematic structural diagram of the smart glasses provided in the embodiment of the present application. The smart glasses 400 include a processor 401 having one or more processing cores, a memory 402 including one or more computer-readable storage media, and a computer program stored on the memory 402 and executable on the processor. The processor 401 is electrically connected to the memory 402. Those skilled in the art will appreciate that the smart glasses configuration shown in the figure does not constitute a limitation of the smart glasses, which may include more or fewer components than those shown, a combination of some components, or a different arrangement of components.
The processor 401 is the control center of the smart glasses 400; it connects the various parts of the entire smart glasses 400 using various interfaces and lines, and performs the various functions of the smart glasses 400 and processes data by running or loading software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the smart glasses 400 as a whole.
In the embodiment of the present application, the processor 401 in the smart glasses 400 loads instructions corresponding to processes of one or more application programs into the memory 402 according to the following steps, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions:
calculating a sound source direction estimation value of a target sound source through sound source localization in a preset search range;
calculating a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array;
performing adaptive beam forming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal;
and calculating a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhancing the single-channel frequency domain signal according to the time-frequency mask.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Optionally, as shown in fig. 7, the smart glasses 400 further include: a touch display screen 403, a radio frequency circuit 404, an audio circuit 405, an input unit 406 and a power supply 407. The processor 401 is electrically connected to the touch display screen 403, the radio frequency circuit 404, the audio circuit 405, the input unit 406, and the power supply 407. Those skilled in the art will appreciate that the smart glasses configuration shown in fig. 7 does not constitute a limitation of the smart glasses, which may include more or fewer components than those shown, a combination of some components, or a different arrangement of components.
The touch display screen 403 may be used for displaying a graphical user interface and receiving operation instructions generated by a user acting on the graphical user interface. The touch display screen 403 may include a display panel and a touch panel. The display panel may be used to display information input by or provided to the user as well as various graphical user interfaces of the smart glasses, which may be made up of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel may be used to collect touch operations of a user on or near the touch panel (for example, operations of the user on or near the touch panel using any suitable object or accessory such as a finger or a stylus pen) and generate corresponding operation instructions, according to which the corresponding programs are executed. Alternatively, the touch panel may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects a signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 401, and can receive and execute commands sent by the processor 401. The touch panel may overlay the display panel, and when the touch panel detects a touch operation on or near it, the touch panel transmits the touch operation to the processor 401 to determine the type of the touch event, and then the processor 401 provides a corresponding visual output on the display panel according to the type of the touch event. In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 403 to realize the input and output functions. However, in some embodiments, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions. That is, the touch display screen 403 may also be used as a part of the input unit 406 to implement an input function.
In the embodiment of the present application, an application program is executed by the processor 401 to generate a graphical user interface on the touch display screen 403. The touch display screen 403 is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface.
The radio frequency circuit 404 may be configured to transmit and receive radio frequency signals so as to establish wireless communication with a network device or other smart glasses, and to exchange signals with the network device or other electronic devices.
The audio circuit 405 may be used to provide an audio interface between the user and the smart glasses through a speaker and a microphone. On one hand, the audio circuit 405 may transmit the electrical signal converted from the received audio data to the speaker, which converts the electrical signal into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 405 and converted into audio data. The audio data is then output to the processor 401 for processing and transmitted to, for example, another electronic device via the radio frequency circuit 404, or output to the memory 402 for further processing. The audio circuit 405 may also include an earbud jack to provide communication between a peripheral headset and the electronic device.
The input unit 406 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 407 is used to power the various components of the smart glasses 400. Optionally, the power supply 407 may be logically connected to the processor 401 through a power management system, so as to implement functions such as managing charging, discharging, and power consumption through the power management system. The power supply 407 may also include one or more DC or AC power sources, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, or any other such component.
Although not shown in fig. 7, the smart glasses 400 may further include a camera, a sensor, a wireless fidelity (WiFi) module, a Bluetooth module, and the like, which are not described in detail herein.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As can be seen from the above, the smart glasses provided in this embodiment can calculate a sound source direction estimation value of a target sound source through sound source localization within a preset search range, calculate a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array, perform adaptive beam forming based on the sound source direction estimation value and the noise power estimation value to obtain a single-channel frequency domain signal, calculate a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhance the single-channel frequency domain signal according to the time-frequency mask. The embodiment of the application can enhance the single-channel frequency domain signal obtained by adaptive beam forming so as to improve the accuracy of the audio data and the communication efficiency of hearing-impaired people.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by instructions or by instructions controlling associated hardware, and the instructions may be stored in a storage medium and loaded and executed by a processor.
To this end, the present application provides a storage medium, in which a plurality of computer programs are stored, and the computer programs can be loaded by a processor to execute the steps in any one of the audio processing methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
calculating a sound source direction estimation value of a target sound source through sound source localization in a preset search range;
calculating a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array;
performing adaptive beam forming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal;
and calculating a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhancing the single-channel frequency domain signal according to the time-frequency mask.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
Since the computer program stored in the storage medium can execute the steps of any audio processing method provided in the embodiments of the present application, beneficial effects that can be achieved by any audio processing method provided in the embodiments of the present application can be achieved, for which details are given in the foregoing embodiments and are not described herein again.
The audio processing method, the audio processing device, the storage medium, and the smart glasses provided in the embodiments of the present application are described in detail above, and specific examples are applied in the present application to explain the principles and embodiments of the present application, and the description of the embodiments above is only used to help understand the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (9)

1. An audio processing method applied to smart glasses, the smart glasses comprising an air conduction microphone array, wherein the audio processing method is characterized by comprising the following steps:
calculating an angle spectral function of the air conduction microphone array according to a preset search range, traversing the angle spectral function, and determining a sound source direction estimation value of a target sound source according to a local maximum value in the angle spectral function;
calculating a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array;
performing adaptive beam forming based on the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal;
and calculating a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model, and enhancing the single-channel frequency domain signal according to the time-frequency mask.
2. The audio processing method of claim 1, wherein the calculating the noise power estimation value corresponding to each frequency point in all channels of the air conduction microphone array comprises:
acquiring a signal frequency domain smooth power spectrum corresponding to each frequency point in all channels in the air conduction microphone array;
updating the power minimum value of each frequency point in all channels according to the signal frequency domain smooth power spectrum;
calculating the voice existence probability of each frequency point in all channels according to the signal frequency domain smooth power spectrum and the power minimum value;
and updating the noise smoothing factor of each frequency point in all channels based on the voice existence probability, and calculating the noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array according to the noise smoothing factor.
3. The audio processing method of claim 1, wherein the performing adaptive beamforming based on the sound source direction estimate and the noise power estimate to obtain a single-channel frequency-domain signal comprises:
extracting the phase of frequency domain data corresponding to each frequency point in all channels in the air conduction microphone array;
determining noise frequency domain data of each frequency point in all channels according to the phase and the noise power estimation value, and calculating a noise covariance matrix based on the noise frequency domain data;
calculating an adaptive beam forming weight vector according to a steering vector of the sound source direction estimated value and the noise covariance matrix;
and carrying out frequency domain filtering on the current frame through the adaptive beam forming weight vector to obtain the single-channel frequency domain signal.
4. The audio processing method of claim 1, wherein the training process of the preset noise reduction model comprises:
generating noisy audio by using noise audio and clean speech audio;
performing framing, windowing and Fourier transform on the noisy audio to extract frequency domain features of the noisy audio;
building a noise reduction network by adopting an encoder-decoder structure, inputting the frequency domain features of the noisy audio into the noise reduction network, and calculating a loss function between a first time-frequency mask predicted by the model and a second time-frequency mask of the clean speech audio;
and training the noise reduction network by a back propagation method and a gradient descent algorithm based on the loss function.
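For readers less familiar with mask-based noise reduction, the following Python (PyTorch) sketch mirrors the training steps of this claim; the layer sizes, the use of an ideal ratio mask as the second time-frequency mask, and the optimizer settings are illustrative assumptions rather than the architecture actually claimed.

import torch
import torch.nn as nn

class MaskNet(nn.Module):
    # Minimal encoder-decoder mapping noisy magnitude spectra to a time-frequency mask.
    def __init__(self, num_bins=257, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_bins, hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden, num_bins), nn.Sigmoid())

    def forward(self, noisy_mag):                    # (batch, frames, num_bins)
        return self.decoder(self.encoder(noisy_mag))

def train_step(model, optimizer, noisy_mag, clean_mag):
    # Second time-frequency mask of the clean speech audio: an ideal ratio mask (assumption).
    target_mask = (clean_mag / (noisy_mag + 1e-8)).clamp(0.0, 1.0)
    pred_mask = model(noisy_mag)                     # first time-frequency mask predicted by the model
    loss = nn.functional.mse_loss(pred_mask, target_mask)
    optimizer.zero_grad()
    loss.backward()                                  # back propagation
    optimizer.step()                                 # gradient descent update
    return loss.item()

model = MaskNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)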
5. The audio processing method according to claim 1, wherein the obtaining of the preset search range comprises:
acquiring eyeball characteristic information of a current user to determine a focusing direction of the current user;
and determining the preset search range according to the neighborhood range of the focusing direction.
6. The audio processing method of any of claims 1-5, wherein after enhancing the single channel frequency domain signal, the method further comprises:
and converting the enhanced audio signal into text information, and displaying the text information on the smart glasses.
7. An audio processing apparatus applied to smart glasses, the smart glasses comprising an air conduction microphone array, characterized in that the apparatus comprises:
the first calculating module is used for calculating an angle spectral function of the air conduction microphone array according to a preset search range, traversing the angle spectral function and determining a sound source direction estimation value of a target sound source according to a local maximum value in the angle spectral function;
the second calculation module is used for calculating a noise power estimation value corresponding to each frequency point in all channels in the air conduction microphone array;
the forming module is used for carrying out adaptive beam forming on the basis of the sound source direction estimated value and the noise power estimated value to obtain a single-channel frequency domain signal;
and the enhancement module is used for calculating a time-frequency mask of the single-channel frequency domain signal through a preset noise reduction model and enhancing the single-channel frequency domain signal according to the time-frequency mask.
8. A storage medium, characterized in that the storage medium stores a computer program adapted to be loaded by a processor for performing the steps in the audio processing method according to any of claims 1-6.
9. Smart glasses, characterized in that they comprise a memory in which a computer program is stored and a processor which, by calling the computer program stored in the memory, performs the steps in the audio processing method according to any one of claims 1 to 6.
CN202211417559.5A (filed 2022-11-14) Audio processing method and device, storage medium and intelligent glasses, Active, granted as CN115497500B (en)

Priority Applications (1)

Application Number CN202211417559.5A, Priority Date 2022-11-14, Filing Date 2022-11-14, Title: Audio processing method and device, storage medium and intelligent glasses

Publications (2)

CN115497500A (en), published 2022-12-20
CN115497500B (en), published 2023-03-24

Family ID: 85115659

Country Status (1)

CN: CN115497500B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115775564B (en) * 2023-01-29 2023-07-21 北京探境科技有限公司 Audio processing method, device, storage medium and intelligent glasses
CN116030823B (en) * 2023-03-30 2023-06-16 北京探境科技有限公司 Voice signal processing method and device, computer equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4897519B2 (en) * 2007-03-05 2012-03-14 株式会社神戸製鋼所 Sound source separation device, sound source separation program, and sound source separation method
CN101644773B (en) * 2009-03-20 2011-12-28 中国科学院声学研究所 Real-time frequency domain super-resolution direction estimation method and device
JP6517124B2 (en) * 2015-10-26 2019-05-22 日本電信電話株式会社 Noise suppression device, noise suppression method, and program
CN111341339A (en) * 2019-12-31 2020-06-26 深圳海岸语音技术有限公司 Target voice enhancement method based on acoustic vector sensor adaptive beam forming and deep neural network technology
CN111239680B (en) * 2020-01-19 2022-09-16 西北工业大学太仓长三角研究院 Direction-of-arrival estimation method based on differential array
CN113889135A (en) * 2020-07-03 2022-01-04 华为技术有限公司 Method for estimating direction of arrival of sound source, electronic equipment and chip system
CN111866665B (en) * 2020-07-22 2022-01-28 海尔优家智能科技(北京)有限公司 Microphone array beam forming method and device
CN112951257A (en) * 2020-09-24 2021-06-11 上海译会信息科技有限公司 Audio image acquisition equipment and speaker positioning and voice separation method
CN114724574A (en) * 2022-02-21 2022-07-08 大连理工大学 Double-microphone noise reduction method with adjustable expected sound source direction


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant