CN115775564B - Audio processing method, device, storage medium and intelligent glasses


Info

Publication number: CN115775564B (granted publication of application CN115775564A)
Application number: CN202310043222.0A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 李逸洋, 张新科, 崔潇潇, 苏悦, 鲁勇
Assignee: Beijing Intengine Technology Co Ltd
Legal status: Active (granted)
Prior art keywords: microphone, sound source, target, beam forming, signal

Abstract

The embodiments of the application disclose an audio processing method, an audio processing device, a storage medium, and intelligent glasses. The method comprises the following steps: receiving a multi-channel audio signal through a microphone array; calculating, from the multi-channel audio signal, beam forming signals and their corresponding power values in a plurality of preset directions; determining a target observation area according to the power values; performing sound source localization within the target observation area to determine a target sound source direction; performing adaptive beam forming toward the target sound source direction to obtain a single-channel enhanced signal; performing sound event detection on the single-channel enhanced signal; and displaying the detection result and the target sound source direction on the intelligent glasses. By restricting sound source localization and sound event detection to the target observation area and displaying the detection results on the intelligent glasses, the embodiments improve the efficiency of event reminders for hearing-impaired users.

Description

Audio processing method, device, storage medium and intelligent glasses
Technical Field
The application relates to the technical field of data processing, in particular to an audio processing method, an audio processing device, a storage medium and intelligent glasses.
Background
At present, the hearing-impaired population in China is close to thirty million, and most hearing-impaired people can communicate with hearing people to some extent with the help of hearing aids. However, hearing aids cannot be guaranteed to work for every hearing-impaired person's condition: many users find their effect unsatisfactory, and wearing a hearing aid for a long time may even cause ear diseases. With technological progress and social development, wearable devices have gradually entered daily life; intelligent glasses not only bring convenience to users but also give hearing-impaired people a tool for communicating with hearing people. Conventional schemes that assist hearing-impaired communication through intelligent glasses mainly focus on sound source localization, speech recognition, and the like.
The applicant found that in the prior art, sound source localization through intelligent glasses can only determine the position of a sound source, not the meaning of the sound; speech recognition converts speech into text to assist the hearing impaired, but cannot handle non-speech signals such as doorbells, alarms, and vehicle horns. These limitations constrain the daily lives of hearing-impaired people.
Disclosure of Invention
The embodiments of the application provide an audio processing method, an audio processing device, a storage medium, and intelligent glasses, which can improve the efficiency of reminders for hearing-impaired people through sound source localization, sound event detection, and display within the intelligent glasses.
The embodiment of the application provides an audio processing method applied to intelligent glasses, where the intelligent glasses comprise a microphone array, and the method comprises the following steps:
receiving a multi-channel audio signal based on the microphone array, and respectively calculating beam forming signals and corresponding power values in a plurality of preset directions according to the multi-channel audio signal;
determining a target observation area according to the power value, and performing sound source positioning in the target observation area to determine a target sound source direction;
performing adaptive beam forming aiming at the target sound source direction to obtain a single-channel enhanced signal;
and detecting the sound event of the single-channel enhanced signal, and displaying the detection result and the target sound source direction on the intelligent glasses.
In an embodiment, the calculating the beam forming signals and the corresponding power values in a plurality of preset directions according to the multi-channel audio signal includes:
framing, windowing and Fourier transforming the multichannel audio signals to obtain processed frequency domain signals;
carrying out beam forming on the frequency domain signals in a plurality of preset directions to generate beam forming signals;
and calculating the frequency domain power values of the beam forming signals corresponding to the preset directions respectively.
In an embodiment, the determining the target observation area according to the power value includes:
comparing the frequency domain power values of the beam forming signals corresponding to the preset directions respectively;
and determining a target direction according to the comparison result, and determining a target observation area according to the target direction.
In an embodiment, the performing sound source localization in the target observation area to determine a target sound source direction includes:
selecting at least one target microphone from the microphone array according to the target observation area;
and forming the at least one target microphone into a microphone subarray, and performing sound source positioning based on the microphone subarray so as to determine a target sound source direction.
In an embodiment, the performing sound source localization based on the microphone subarray to determine a target sound source direction includes:
pairing the microphones of the microphone subarray without repetition, and calculating a generalized cross-correlation function for each pairing combination;
performing inverse Fourier transform on the generalized cross-correlation function to obtain an angle spectrum function of the pairing combination;
traversing all microphone pairing combinations to accumulate the angle spectrum functions of all microphone pairing combinations, obtaining the angle spectrum function of the microphone subarray;
and extracting at least one local maximum value of the angle spectrum function of the microphone subarray, and determining a sound source direction estimated value according to the azimuth angle and the pitch angle corresponding to the local maximum value meeting the preset condition.
In an embodiment, the adaptively beamforming for the target sound source direction to obtain a single channel enhancement signal includes:
acquiring a noise signal power spectrum in each channel of the microphone subarray;
and performing adaptive beam forming based on the noise signal power spectrum and the sound source direction estimated value to obtain the single-channel enhanced signal.
In an embodiment, the acquiring the noise signal power spectrum among the channels of the microphone subarray includes:
acquiring a signal frequency domain smooth power spectrum of each frequency point in the microphone subarray;
updating the power minimum value of each frequency point in the microphone subarray according to the signal frequency domain smooth power spectrum, and calculating the voice existence probability of each frequency point;
updating the noise smoothing factor of each frequency point in the microphone subarray according to the voice existence probability;
and obtaining a noise power estimated value of each frequency point in the microphone subarray according to the signal frequency domain power spectrum received by the microphone subarray and the noise smoothing factor.
The embodiment of the application also provides an audio processing device applied to intelligent glasses, where the intelligent glasses comprise a microphone array, and the device comprises:
the computing module is used for receiving the multichannel audio signals based on the microphone array and respectively computing beam forming signals and corresponding power values in a plurality of preset directions according to the multichannel audio signals;
the positioning module is used for determining a target observation area according to the power value, and performing sound source positioning on the target observation area to determine a target sound source direction;
the enhancement module is used for carrying out self-adaptive beam forming aiming at the target sound source direction so as to obtain a single-channel enhancement signal;
and the detection module is used for detecting the sound event of the single-channel enhanced signal and displaying the detection result and the target sound source direction on the intelligent glasses.
Embodiments of the present application also provide a storage medium storing a computer program adapted to be loaded by a processor to perform the steps of the audio processing method according to any of the embodiments above.
The embodiment of the application also provides a smart glasses, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the steps in the audio processing method according to any embodiment by calling the computer program stored in the memory.
With the audio processing method, audio processing device, storage medium, and intelligent glasses of the embodiments of the application, a multi-channel audio signal can be received through the microphone array; beam forming signals and their corresponding power values in a plurality of preset directions are calculated from it; a target observation area is determined according to the power values; sound source localization is performed within the target observation area to determine a target sound source direction; adaptive beam forming is performed toward that direction to obtain a single-channel enhanced signal; sound event detection is performed on the enhanced signal; and the detection result and the target sound source direction are displayed on the intelligent glasses. By restricting sound source localization and sound event detection to the target observation area and displaying the detection results on the intelligent glasses, the embodiments improve the efficiency of event reminders for hearing-impaired users.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic system diagram of an audio processing apparatus according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of an audio processing method according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a microphone array arrangement according to an embodiment of the present application.
Fig. 4 is a schematic view of a scenario of selecting a microphone sub-array according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a sound event detection model according to an embodiment of the present application.
Fig. 6 is another flow chart of an audio processing method according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application.
Fig. 8 is another schematic structural diagram of an audio processing apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of smart glasses according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The embodiment of the application provides an audio processing method, an audio processing device, a storage medium and intelligent glasses. Specifically, the audio processing method of the embodiment of the present application may be executed by an electronic device, where the electronic device may be smart glasses, and the smart glasses include a microphone array, where the microphone array is used to acquire an audio signal.
For example, when the audio processing method runs on the smart glasses, the microphone array is monitored. When the microphone array receives a multi-channel audio signal, the reminder mode of the smart glasses is activated: beam forming signals and their corresponding power values in a plurality of preset directions are calculated from the multi-channel audio signal, a target observation area is determined according to the power values, sound source localization is performed within the target observation area to determine a target sound source direction, adaptive beam forming is performed toward that direction to obtain a single-channel enhanced signal, sound event detection is performed on the enhanced signal, and the detection result and the target sound source direction are displayed on the smart glasses. The smart glasses can display text information through a graphical user interface and interact with the user. The graphical user interface may be presented, for example, on a display screen mounted on the smart glasses lens, or projected holographically onto the lens. The smart glasses may thus include a processor and a display screen for presenting the graphical user interface and receiving the operation instructions the user generates on it.
Referring to fig. 1, fig. 1 is a schematic system diagram of an audio processing apparatus according to an embodiment of the present application. The system may include smart glasses 1000 and at least one server or personal computer 2000. The smart glasses 1000 held by the user may be connected to a server or personal computer through a network. The smart glasses 1000 may be a terminal device with computing hardware capable of supporting and executing multimedia software products, for example for sound source localization and sound event detection. The smart glasses 1000 may also have a display screen or projection device for displaying text, and may be connected to the server or personal computer 2000 through a network. The network may be wireless or wired, such as a wireless local area network (WLAN), a local area network (LAN), a cellular network, or a 2G, 3G, 4G, or 5G network. In addition, different smart glasses 1000 may connect to other smart glasses, or to a server or personal computer, using their own Bluetooth network or hotspot network. The server may be an independent physical server, a server cluster or distributed system composed of several physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms.
The embodiment of the application provides an audio processing method which can be executed by intelligent glasses or a server. The embodiments of the present application will be described by taking an example in which an audio processing method is performed by smart glasses. The intelligent glasses comprise a display screen and a processor, wherein the processor is configured to start a reminding mode of the intelligent glasses when the microphone array receives multichannel audio signals, respectively calculate beam forming signals and corresponding power values in a plurality of preset directions according to the multichannel audio signals, determine a target observation area according to the power values, perform sound source positioning in the target observation area to determine a target sound source direction, perform self-adaptive beam forming on the target sound source direction to obtain single-channel enhanced signals, perform sound event detection on the single-channel enhanced signals, and finally display detection results and the target sound source direction on the intelligent glasses.
Referring to fig. 2, the specific flow of the method may be as follows:
step 101, receiving a multi-channel audio signal based on a microphone array, and calculating beam forming signals and corresponding power values in a plurality of preset directions according to the multi-channel audio signal.
In the embodiment of the application, the microphone array of the smart glasses contains at least three microphones, mounted on at least the two temples, so that at least both sides of the glasses carry microphones forming a planar or three-dimensional array. The microphone array can therefore perform 360° full-space sound source localization: it can localize sound sources in front of the wearer, such as a telephone ring or a household-appliance prompt tone, signaling an incoming call or a finished appliance and supporting normal daily life; and it can localize sound sources behind the wearer, such as a car horn or a bicycle bell, prompting the wearer to take evasive care and protecting travel safety.
In one embodiment, the dimensions of the smart glasses and the distances between the microphones are known. For brevity, microphones on the same temple are collectively called same-side microphones, a microphone on a temple and one on the frame are called different-side microphones, and microphones on the two temples are called opposite-side microphones. Because the frame is relatively wide, the opposite-side microphones should be mounted in parallel to keep the grating-lobe level low; that is, the spacing of the opposite-side microphones equals the frame width. On the other hand, if the maximum spacing of the different-side or same-side microphones is smaller than the opposite-side spacing, the main lobe should be kept as narrow as possible to preserve sound source localization accuracy, while the degree to which the microphone subarray is occluded by the wearer's head should stay as small as possible; the maximum spacing of the different-side or same-side microphones should be at least one half of the opposite-side spacing. If high localization accuracy is required for danger prompts such as vehicle horns, the spacing should be increased as much as possible to lower the grating-lobe level and thereby reduce the probability of selecting a grating-lobe angle.
In an embodiment, the full-space observation range is divided into a plurality of observation areas. Before sound source localization, beam forming is performed in the preset direction of each observation area; one or more small observation areas are then selected from the full space according to the beam forming signals in the preset directions and their corresponding power values; finally, sound source localization is performed only within those small areas, avoiding the heavy computational burden of localizing over the full space. The observation areas do not intersect, and their union is the full-space observation range, so performing beam forming only in the preset direction of each area is essentially equivalent to observing the full space. A preset direction can be set directly, such as the angular bisector of each observation area, or according to some rule, such as directly in front of or behind the wearer. With beam forming done in advance in the preset directions, the area containing a target can be judged preliminarily from the energy of each beam forming signal: a sound source is more likely to appear in an area with higher energy, so performing localization within that area reduces computational complexity and, thanks to the smaller observation interval, also improves localization accuracy. The beam forming algorithm for the preset directions includes, but is not limited to, fixed beam forming, superdirective beam forming, pattern-synthesis beam forming, and the like.
In an embodiment, the microphone array arrangement, the coordinate-system plan view, and the preset directions of the smart glasses are shown in fig. 3. The smart glasses have 4 microphones in total. With the center of the glasses as the origin of the coordinate system, the coordinates of microphone 1, microphone 2, microphone 3, and microphone 4 are $(x_1, y_1, z_1)$, $(x_2, y_2, z_2)$, $(x_3, y_3, z_3)$, and $(x_4, y_4, z_4)$ respectively, where $x_m$, $y_m$, and $z_m$ denote the abscissa, ordinate, and vertical coordinate of the m-th (m = 1, 2, 3, 4) microphone. The 360° full-space observation range is divided into four areas, [0°, 90°), [90°, 180°), [180°, 270°), and [270°, 360°), called the first to fourth observation areas respectively; the four preset directions are 45°, 135°, 225°, and 315°, i.e. the angular bisectors of the first to fourth observation areas, called the first to fourth preset directions. In this embodiment, the beam forming signals and their corresponding power values in the four preset directions are calculated from the multi-channel audio signal.
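To make the geometry concrete, here is a minimal Python sketch of the setup just described; the microphone coordinates are placeholders (the patent defines the symbols but gives no numeric positions), while the four regions and bisector directions follow the text:

```python
import numpy as np

# Hypothetical microphone coordinates (metres); placeholders only.
MIC_POSITIONS = np.array([
    [-0.07,  0.05, 0.0],   # microphone 1
    [-0.07, -0.05, 0.0],   # microphone 2
    [ 0.07, -0.05, 0.0],   # microphone 3
    [ 0.07,  0.05, 0.0],   # microphone 4
])

REGIONS = [(0, 90), (90, 180), (180, 270), (270, 360)]   # degrees, half-open
PRESET_DIRECTIONS_DEG = [45.0, 135.0, 225.0, 315.0]      # angular bisectors

def region_of(azimuth_deg: float) -> int:
    """Index (0-3) of the observation region containing a given azimuth."""
    return int(azimuth_deg % 360.0) // 90
```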
And 102, determining a target observation area according to the power value, and performing sound source positioning in the target observation area to determine the target sound source direction.
In an embodiment, while the user wears the smart glasses, some microphones are occluded by the user's head for sound sources in certain directions; the received-signal quality of those microphones is low, which affects localization accuracy to a certain extent. A target observation area can therefore be selected according to the beam forming signals and power values in the preset directions, and only the microphones with higher received-signal quality are chosen to form a microphone subarray for sound source localization and speech enhancement, leaving the lower-quality received signals unused; this improves both localization and sound event detection performance. Note that a microphone subarray needs at least three microphones arranged as a planar or stereo array, so if the smart glasses have only three microphones in total, no subarray selection is performed, so as to preserve full-space sound source localization.
In the embodiment of the application, the spacing between the microphones of the smart glasses is relatively large and grating lobes are correspondingly pronounced; selecting only part of the microphones to form a subarray raises the grating-lobe level further, even though the main-lobe width changes only slightly compared with the full microphone array. When sound source localization is confined to a small observation area, however, the grating lobes fall mainly in the other, non-observed areas, so compared with full-space localization the impact of grating lobes on localization performance is reduced and accuracy improves.
In an embodiment, when the above sound source positioning is performed, the plurality of microphones may collect the sound signals synchronously, and the signal phase difference between the plurality of microphones is used to obtain the emission position of the sound source signal. In other embodiments, the sound source localization algorithm includes, but is not limited to, cross-correlation, super-resolution, and the like.
For example, the target observation area is determined from the beam forming signal powers in the 4 preset directions: if

$$P_i \ge \alpha \max_{j=1,\dots,4} P_j$$

then a sound source is considered to exist in the i-th observation area, where $\alpha$ denotes a threshold coefficient. Depending on the actual threshold, there may be one target observation area or several, i.e. several sound sources located in different areas can be observed at the same time. That is, the step of determining the target observation area according to the power values may include: comparing the frequency domain power values of the beam forming signals in the preset directions, determining a target direction according to the comparison result, and determining the target observation area according to the target direction.
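A compact sketch of this selection rule, assuming the threshold coefficient (here `alpha`, a placeholder value) scales the maximum per-direction power:

```python
def select_target_regions(powers, alpha=0.5):
    """Return the indices of all regions whose beam power passes the threshold.

    powers: the four per-direction beam powers P_i; alpha: threshold
    coefficient (value is an assumption). Several regions may pass at once,
    i.e. several simultaneous sources in different areas can be observed.
    """
    p_max = max(powers)
    return [i for i, p in enumerate(powers) if p >= alpha * p_max]
```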
Then, a microphone subarray is selected according to the target observation area and the microphone layout of the smart glasses. If the sound source is located in the i-th observation area, the frequency-domain received signal of the microphone subarray is denoted $\mathbf{Y}(t,f)$ and the number of its microphones N (N < M). As shown in fig. 4, if the target sound source lies in the first observation area, microphone 3 is occluded to a high degree, so microphone 1, microphone 2, and microphone 4 are selected to form the subarray. Correspondingly, if the target sound source lies in the second observation area, microphones 1, 2, and 3 form the subarray; in the third observation area, microphones 2, 3, and 4; and in the fourth observation area, microphones 1, 3, and 4. Finally, sound source localization is performed within the target observation area to determine the target sound source direction. That is, the step of performing sound source localization within the target observation area to determine the target sound source direction may include: selecting at least one target microphone from the microphone array according to the target observation area, forming the selected target microphones into a microphone subarray, and performing sound source localization based on the subarray to determine the target sound source direction.
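The region-to-subarray mapping of fig. 4 can be written as a lookup table; indices below are 0-based, so microphone 1 is index 0 (a sketch, not the patent's code):

```python
# Observation region -> microphone subarray (0-based indices), per fig. 4:
# the microphone most occluded by the wearer's head in that region is dropped.
SUBARRAY_FOR_REGION = {
    0: [0, 1, 3],   # first area:  drop microphone 3
    1: [0, 1, 2],   # second area: drop microphone 4
    2: [1, 2, 3],   # third area:  drop microphone 1
    3: [0, 2, 3],   # fourth area: drop microphone 2
}

def subarray_signals(freq_signals, region):
    """freq_signals: (M, T, F) channel spectra; returns the (N, T, F) subarray."""
    return freq_signals[SUBARRAY_FOR_REGION[region]]
```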
Step 103, performing adaptive beam forming for the target sound source direction to obtain a single-channel enhanced signal.
After the target sound source and its corresponding direction are determined, a single-channel enhanced signal, specifically a single-channel frequency-domain beam formed signal, can be obtained by adaptive beam forming. In an embodiment, the target sound source direction is represented by a sound source direction estimate; since this estimate is relatively accurate, performing adaptive beam forming with it can steer enhancement toward the sound source direction and preserve the quality of the adaptively beam formed audio signal, which yields a more accurate sound event detection result and improves the user experience. Meanwhile, a beam formed signal buffer of a certain duration is maintained for sound event detection: buffers of different durations allow both short sounds, such as car horns and knocks, and long sounds, such as alarms and household-appliance prompt tones, to be detected. The buffer duration can be a default or set by the user as needed.
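A minimal sketch of such a buffer, assuming a fixed-length ring buffer of beam formed frames per sound source (the class and its names are illustrative):

```python
from collections import deque

class BeamBuffer:
    """Fixed-duration buffer of beamformed frames for one sound source.

    max_frames corresponds to the buffer duration T; a user-chosen duration
    in seconds would map to frames via the frame shift and sampling rate.
    """
    def __init__(self, max_frames: int):
        self.frames = deque(maxlen=max_frames)   # old frames drop off the front

    def push(self, frame):
        """Append one frame's complex beamformed spectrum."""
        self.frames.append(frame)

    def clear(self):
        """Called after a source has stayed silent for a period of time."""
        self.frames.clear()
```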
Although adaptive beam forming directionally enhances the target speech, the beam formed signal still contains a certain amount of environmental noise; applying single-channel speech enhancement afterwards can further raise the output signal-to-noise ratio, giving more accurate recognition results and a better user experience. Adaptive beam forming methods include, but are not limited to, minimum variance distortionless response and generalized sidelobe cancellation. Noise estimation within adaptive beam forming includes traditional algorithms such as minimum tracking and recursive least squares, as well as deep-learning algorithms realized with convolutional or recurrent neural networks. Single-channel speech enhancement likewise includes traditional algorithms such as Wiener filtering and minimum mean-square-error estimation, as well as deep-learning algorithms realized with convolutional or recurrent neural networks.
And 104, detecting sound events of the single-channel enhanced signals, and displaying detection results and the target sound source direction on the intelligent glasses.
In an embodiment, the enhanced audio signal is a single-channel frequency-domain signal. Features are extracted from it, the extracted feature parameters are input into a pre-trained sound event detection model to obtain a detection result, and finally the detection result and the target sound source direction are shown on the display screen of the smart glasses or projected directly onto the lenses.
In one embodiment, sound event detection means detecting non-speech signals that a hearing-impaired user has difficulty noticing and, by classifying the event and combining it with the sound source localization information, telling the user what happened in which direction. Outdoors, detecting car horns and the bells of non-motor vehicles such as bicycles and electric scooters reminds the user to keep clear; in enclosed places, detecting alarm bells reminds the user of personal safety; at home, detecting doorbells, telephone rings, and appliance prompt tones reminds the user of visitors, incoming calls, or appliances that have finished working and should be switched off in time. Other non-speech signals such as a child crying or a pet calling can also be detected. Through sound source localization and sound event detection, the user's safety can be protected and daily life assisted.
In an embodiment, the sound event detection model is further required to be trained in advance, and the training process may include the following steps:
step a: sound events to be detected are defined and classified. For example, whistling sounds and car ringing sounds are classified into one type, ringing sounds and prompt sounds are classified into one type, alarm sounds are classified into one type, knocking sounds are classified into one type, crying sounds and laughing sounds are classified into one type, and the types are respectively marked as one to five. That is, when sound events of any one of all the categories are detected, sound source localization and sound event detection results are fed back to the user.
Step b: construct the training data set of the sound event detection model and prepare labels. First, audio data of each class is collected in balanced amounts according to the defined sound event categories. Second, since several sound events, such as a doorbell and knocking, may occur simultaneously, training data for mixed events is created by randomly sampling clips from the data set and superimposing them according to which event types can co-occur. The training set is then further enlarged by augmentation such as time stretching and pitch shifting, and the augmented audio clips are randomly concatenated to obtain audio of different lengths. Finally, the sound event types of each clip are encoded; for instance, the five classes of step a can be represented by a five-bit binary code: 10000 if only class one is present, 01010 if classes two and four are both present, and so on (see the sketch below).
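The five-bit encoding of step b can be sketched as a multi-hot vector (a minimal illustration, not the patent's implementation):

```python
import numpy as np

NUM_CLASSES = 5   # the five event classes defined in step a

def encode_labels(present_classes):
    """Multi-hot label code: bit k is set when class k (1-indexed) is present.

    encode_labels([1])    -> [1, 0, 0, 0, 0]   i.e. "10000"
    encode_labels([2, 4]) -> [0, 1, 0, 1, 0]   i.e. "01010"
    """
    code = np.zeros(NUM_CLASSES, dtype=np.float32)
    for k in present_classes:
        code[k - 1] = 1.0
    return code
```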
Step c: build and train the sound event detection model. The model contains multiple nonlinear layers and may be constructed from convolutional layers, fully connected layers, attention layers, long short-term memory (LSTM) layers, and the like; its structure is shown in fig. 5.
During training, feature parameters are extracted from the sample audio by framing, windowing, Fourier transform, Mel filtering, and similar operations, and fed into the constructed sound event detection model. The network outputs predicted class codes through a classifier layer whose output units may be sigmoids, giving a probability value for each event class. The sigmoid cross-entropy between the output layer and the true labels serves as the loss function, and the final sound event detection model is trained via backpropagation and gradient descent.
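A minimal PyTorch sketch of such a detector, assuming log-Mel input features; the layer types follow the description (convolution, LSTM, sigmoid cross-entropy), the attention layer mentioned above is omitted for brevity, and all sizes are illustrative rather than the patent's:

```python
import torch
import torch.nn as nn

class SoundEventDetector(nn.Module):
    """Sketch of a multi-label sound event detector (sizes are assumptions)."""
    def __init__(self, n_mels=64, n_classes=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(32 * (n_mels // 4), 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)    # sigmoid is applied in the loss

    def forward(self, mel):                     # mel: (batch, 1, frames, n_mels)
        z = self.conv(mel)                      # (batch, 32, frames/4, n_mels/4)
        z = z.permute(0, 2, 1, 3).flatten(2)    # (batch, frames/4, features)
        out, _ = self.rnn(z)
        return self.head(out[:, -1])            # one logit per event class

loss_fn = nn.BCEWithLogitsLoss()                # sigmoid cross-entropy loss
```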
After training, feature parameters are extracted from each single-channel enhanced signal in the K beam formed signal buffers obtained in step 103 and input in turn to the sound event detection model, yielding the predicted probability of each sound event code in the K directions. Whether a sound event exists is judged against a set threshold; if it does, the code is converted to the corresponding sound event and associated with the sound source localization estimate. Finally, the sound event detection result and the localization estimate of the sound event are displayed on the smart glasses lenses to alert the user.
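Inference then reduces to thresholding the sigmoid outputs; a sketch continuing the model above, with an assumed threshold of 0.5:

```python
import torch

def detect_events(model, features, threshold=0.5):
    """Return the 1-indexed classes whose predicted probability passes threshold."""
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(features))          # (1, n_classes)
    return [k + 1 for k, p in enumerate(probs.squeeze(0)) if p >= threshold]
```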
As can be seen from the foregoing, the audio processing method provided in the embodiments of the application receives a multi-channel audio signal through the microphone array, calculates beam forming signals and their corresponding power values in a plurality of preset directions, determines a target observation area according to the power values, performs sound source localization within that area to determine the target sound source direction, adaptively beam forms toward that direction to obtain a single-channel enhanced signal, performs sound event detection on the enhanced signal, and displays the detection result and the target sound source direction on the smart glasses. Beam forming in preset directions and selecting a subarray from the microphone array reduce the computational complexity of sound source localization while preserving localization accuracy and adaptive beam forming performance, which improves the accuracy of sound event detection; through sound source localization and sound event detection, the user's safety can be protected and daily life assisted.
Fig. 6 is a schematic flow chart of an audio processing method according to an embodiment of the present application. The specific flow of the method can be as follows:
In step 201, a multi-channel audio signal is received through the microphone array, and framing, windowing, and Fourier transform processing are performed on it to obtain the processed frequency-domain signals.
In one embodiment, the 4-channel time-domain signals received by the smart-glasses microphone array are first framed, windowed, and Fourier transformed to convert them to the frequency domain:

$$X_m(t,f) = \sum_{n=0}^{N-1} w(n)\, x_m\!\left(\tfrac{tN}{2} + n\right) e^{-j 2\pi f n / N}, \qquad \mathbf{X}(t,f) = \left[X_1(t,f),\, X_2(t,f),\, X_3(t,f),\, X_4(t,f)\right]^{\mathrm{T}}$$

where $X_m(t,f)$ denotes the frequency-domain signal of the m-th microphone channel at the t-th frame and f-th frequency point, $x_m(\cdot)$ is the m-th channel's time-domain signal, and $w(n)$ is the analysis window. With sampling rate $f_s$, frame length N, frame shift N/2, window length N, and N Fourier transform points, the corresponding frequency-domain signal has F = N/2 + 1 frequency points with frequency spacing $\Delta f = f_s / N$; $(\cdot)^{\mathrm{T}}$ denotes the transpose operation. The window function can be a common one such as a Hamming or Hanning window.
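A NumPy sketch of this analysis stage for one channel, assuming a Hanning window and half-frame shift as in the text:

```python
import numpy as np

def stft_frames(x, frame_len):
    """Frame, window, and Fourier transform one channel's time signal.

    x: 1-D time-domain signal; frame_len: N. The frame shift is N/2 and the
    result has F = N/2 + 1 frequency points per frame, as described above.
    """
    window = np.hanning(frame_len)
    hop = frame_len // 2
    n_frames = (len(x) - frame_len) // hop + 1
    frames = np.stack([x[t * hop: t * hop + frame_len] * window
                       for t in range(n_frames)])
    return np.fft.rfft(frames, n=frame_len, axis=1)   # shape (T, N/2 + 1)
```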
Step 202, beam forming is performed on the frequency domain signals in a plurality of preset directions, so as to generate beam forming signals.
Further, beam forming is performed in each of the 4 preset directions to obtain the frequency-domain beam forming signals:

$$B_i(t,f) = \mathbf{w}_i^{\mathrm{H}}(f)\, \mathbf{X}(t,f)$$

where $B_i(t,f)$ denotes the frequency-domain beam forming signal of the i-th (i = 1, 2, 3, 4) preset direction at the t-th frame and f-th frequency point, and the beam forming weight vector of the f-th frequency point in the i-th preset direction is

$$\mathbf{w}_i(f) = \frac{1}{M} \exp\!\left(-j\, \frac{2\pi f}{c} \left(\mathbf{x} \cos\theta_i + \mathbf{y} \sin\theta_i\right)\right)$$

in which $\mathbf{x}$ denotes the column vector of microphone abscissas, $\mathbf{y}$ the column vector of microphone ordinates, $\theta_i$ the i-th preset direction, c the speed of sound, j the imaginary unit, and $(\cdot)^{\mathrm{H}}$ the conjugate-transpose operation.
In step 203, the frequency domain power values of the beam forming signals corresponding to the preset directions are calculated.
Finally, the total frequency-domain power of the beam forming signal in each of the 4 preset directions is calculated:

$$P_i = \sum_{f=1}^{F} \left| B_i(t,f) \right|^2$$

where $P_i$ denotes the total frequency-domain power in the i-th preset direction and $|\cdot|$ denotes taking the absolute value.
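Putting steps 202 and 203 together, a sketch of fixed delay-and-sum beams in the preset directions and their powers (far-field, planar-array assumption; function and argument names are illustrative):

```python
import numpy as np

def preset_beam_powers(X, mic_xy, thetas_deg, freqs, c=343.0):
    """Delay-and-sum beams in the preset directions, plus their total powers.

    X: (M, T, F) complex channel spectra; mic_xy: (M, 2) microphone x/y
    coordinates; thetas_deg: preset azimuths, e.g. [45, 135, 225, 315];
    freqs: (F,) bin frequencies in Hz. Returns beams (I, T, F), powers (I,).
    """
    M = X.shape[0]
    beams, powers = [], []
    for theta in np.deg2rad(thetas_deg):
        # projection of each microphone position onto the look direction
        proj = mic_xy[:, 0] * np.cos(theta) + mic_xy[:, 1] * np.sin(theta)
        w = np.exp(-1j * 2 * np.pi * freqs[None, :] * proj[:, None] / c) / M
        B = np.einsum('mf,mtf->tf', np.conj(w), X)   # w^H X at each (t, f)
        beams.append(B)
        powers.append(np.sum(np.abs(B) ** 2))        # total frequency-domain power
    return np.stack(beams), np.array(powers)
```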
And 204, determining a target observation area according to the power value, and performing sound source positioning in the target observation area to determine a sound source direction estimated value.
Specifically, a target observation area is determined according to the power value of the beam forming signal, then a microphone subarray is selected according to the target observation area, and finally sound source localization is carried out in the target observation area to obtain a sound source direction estimated value.
In one embodiment, the step of performing sound source localization within the target observation area may include:
a. first, the microphone subarrays in the region are subjected to non-repetition microphone pairing to obtain a microphone n 1 And microphone n 2 For example, the pairing combination of (a):
b. calculating a generalized cross-correlation function of the pairing combination:
wherein, the liquid crystal display device comprises a liquid crystal display device,representing microphone->And microphone->Generalized cross-correlation function at t frame, f frequency point, < >>Representing microphone->And microphone->The weighting function at the f-th frequency point may be a weighting function such as phase transformation, smooth coherent transformation, or the like, < >>Frequency domain signal representing nth microphone channel of microphone sub-array at t frame, f frequency point,/for>Representing a conjugation operation;
c. calculating the inverse Fourier transform of the pairing combination generalized cross-correlation function to obtain the angle spectrum function of the pairing combination
d. Then, traversing all microphone pairing combinations, repeating the steps b-c, and accumulating the angle spectrum functions of all microphone pairing combinations to obtain an angle spectrum function A (theta, ϕ) of the microphone subarray;
e. then, the angle spectrum function A (θ, ϕ) is traversed, and the local maxima of the angle spectrum function are extractedWherein Q represents the number of local maxima;
f. finally, by calculation, if
Then combine the azimuth and pitch angles corresponding to the maxima of the Q (q=1, 2, … Q) th angular spectrum functionAs an estimate of the direction of the sound source, wherein +.>Representing the threshold of the angular spectrum function,/->Estimation representing azimuth of sound source Value of->The estimated value representing the pitch angle of the sound source, that is, the sound source signal observed in any observation area may be one or more according to the actual threshold value, that is, sound sources in a plurality of directions may be observed at the same time.
Steps a-f are repeated to complete sound source localization in all target observation areas, yielding K sound source direction estimates.
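A compact sketch of steps a-f as a steered-response-power search with PHAT weighting over a coarse angle grid (the grid resolution, threshold value, and use of a simple global threshold in place of true local-maximum extraction are all simplifying assumptions):

```python
import numpy as np

def srp_localize(Y, mic_xyz, freqs, az_deg, el_deg, c=343.0, eta=0.8):
    """Accumulate pairwise GCCs into an angle spectrum and pick its peaks.

    Y: (N, F) one frame's subarray spectra; mic_xyz: (N, 3) positions;
    az_deg/el_deg: candidate azimuths/pitch angles in degrees; eta: threshold
    coefficient on the spectrum maximum. Returns passing (azimuth, pitch) pairs.
    """
    N = Y.shape[0]
    A = np.zeros((len(az_deg), len(el_deg)))
    for n1 in range(N):
        for n2 in range(n1 + 1, N):                 # pairings without repetition
            cross = Y[n1] * np.conj(Y[n2])
            R = cross / (np.abs(cross) + 1e-12)     # PHAT weighting
            d = mic_xyz[n1] - mic_xyz[n2]
            for i, az in enumerate(np.deg2rad(az_deg)):
                for j, el in enumerate(np.deg2rad(el_deg)):
                    u = np.array([np.cos(az) * np.cos(el),
                                  np.sin(az) * np.cos(el),
                                  np.sin(el)])      # unit look direction
                    tau = d @ u / c                 # inter-microphone delay
                    A[i, j] += np.real(np.sum(
                        R * np.exp(1j * 2 * np.pi * freqs * tau)))
    hits = np.argwhere(A >= eta * A.max())
    return [(az_deg[i], el_deg[j]) for i, j in hits]
```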
In step 205, a power spectrum of noise signals among channels of the microphone sub-array is obtained.
First, the noise-signal power spectrum of each channel in the microphone subarray corresponding to each sound source is estimated. Taking the k-th (k = 1, 2, ... K) sound source as an example, denote the frequency-domain received signal of the corresponding subarray $Y_n(t,f)$, with N microphones:

a. Update the smoothed frequency-domain power spectrum of each frequency point of each channel:

$$S_n(t,f) = \alpha_s\, S_n(t-1,f) + (1 - \alpha_s)\, \left|Y_n(t,f)\right|^2$$

where $S_n(t,f)$ denotes the frequency-domain smoothed power spectrum of the n-th microphone channel at the t-th frame and f-th frequency point, and $\alpha_s$ denotes the frequency-domain power-spectrum smoothing factor.

b. Next, update the power minimum of each frequency point of each channel according to the obtained smoothed power spectrum:

$$S_{\min,n}(t,f) = \begin{cases} \gamma\, S_{\min,n}(t-1,f) + \dfrac{1-\gamma}{1-\beta} \left( S_n(t,f) - \beta\, S_n(t-1,f) \right), & S_{\min,n}(t-1,f) < S_n(t,f) \\[2pt] S_n(t,f), & \text{otherwise} \end{cases}$$

where $S_{\min,n}(t,f)$ denotes the power minimum of the n-th microphone channel at the t-th frame and f-th frequency point, and γ and β both denote empirical constants.

c. Then, according to the obtained smoothed power spectrum and the power minimum, calculate the speech presence probability of each frequency point of each channel:

$$I_n(t,f) = \begin{cases} 1, & S_n(t,f) / S_{\min,n}(t,f) > \delta \\ 0, & \text{otherwise} \end{cases}, \qquad p_n(t,f) = \alpha_p\, p_n(t-1,f) + (1 - \alpha_p)\, I_n(t,f)$$

where $I_n(t,f)$ indicates whether speech is present in the n-th microphone channel at the t-th frame and f-th frequency point, $\delta$ denotes the threshold of the f-th frequency point, $p_n(t,f)$ denotes the speech presence probability of the n-th channel at the t-th frame and f-th frequency point, and $\alpha_p$ denotes the speech-presence-probability smoothing factor.

d. Then, according to the obtained speech presence probability, update the noise smoothing factor of each frequency point of each channel:

$$\tilde{\alpha}_n(t,f) = \alpha_d + (1 - \alpha_d)\, p_n(t,f)$$

where $\tilde{\alpha}_n(t,f)$ denotes the noise smoothing factor of the n-th microphone channel at the t-th frame and f-th frequency point, and $\alpha_d$ denotes the noise-smoothing-factor coefficient.

e. Finally, according to the received-signal frequency-domain power spectrum and the noise smoothing factor, obtain the noise power estimate of each frequency point of each channel:

$$\lambda_n(t,f) = \tilde{\alpha}_n(t,f)\, \lambda_n(t-1,f) + \left(1 - \tilde{\alpha}_n(t,f)\right) \left|Y_n(t,f)\right|^2$$

where $\lambda_n(t,f)$ denotes the noise power estimate of the n-th microphone channel at the t-th frame and f-th frequency point.
Note that since the microphone subarrays corresponding to the K sound sources may intersect, each shared parameter is calculated only once and shared among the subarrays.
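The recursion of steps a-e can be sketched per channel as follows (an MCRA-style tracker; every smoothing constant below is an assumed value standing in for the patent's empirical constants):

```python
import numpy as np

class NoiseTracker:
    """Per-bin noise PSD tracking for one channel, following steps a-e."""
    def __init__(self, n_bins, a_s=0.9, a_p=0.8, a_d=0.95,
                 delta=4.0, gamma=0.998, beta=0.96):
        self.S = np.zeros(n_bins)               # smoothed power spectrum
        self.S_min = np.full(n_bins, np.inf)    # tracked power minimum
        self.p = np.zeros(n_bins)               # speech presence probability
        self.lam = np.zeros(n_bins)             # noise power estimate
        self.a_s, self.a_p, self.a_d = a_s, a_p, a_d
        self.delta, self.gamma, self.beta = delta, gamma, beta

    def update(self, Y):
        """Y: (F,) complex spectrum of one frame; returns the noise estimate."""
        P = np.abs(Y) ** 2
        S_prev = self.S.copy()
        self.S = self.a_s * self.S + (1 - self.a_s) * P            # step a
        rising = self.S_min < self.S                               # step b
        self.S_min = np.where(
            rising,
            self.gamma * self.S_min
            + (1 - self.gamma) / (1 - self.beta) * (self.S - self.beta * S_prev),
            self.S)
        I = (self.S / np.maximum(self.S_min, 1e-12)) > self.delta  # step c
        self.p = self.a_p * self.p + (1 - self.a_p) * I
        a_n = self.a_d + (1 - self.a_d) * self.p                   # step d
        self.lam = a_n * self.lam + (1 - a_n) * P                  # step e
        return self.lam
```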
And 206, performing adaptive beam forming based on the noise signal power spectrum and the sound source direction estimated value to obtain a single-channel enhanced signal.
Taking the k-th sound source as an example:

a. First, extract the phase of the frequency-domain speech data at each frequency point of each channel:

$$\varphi_n(t,f) = \operatorname{angle}\!\left(Y_n(t,f)\right)$$

where $\varphi_n(t,f)$ denotes the phase of the n-th microphone channel at the t-th frame and f-th frequency point, and angle(·) denotes the phase-extraction operation.

b. Then, use the phase $\varphi_n(t,f)$ and the noise power estimate $\lambda_n(t,f)$ to obtain the frequency-domain noise data of each frequency point of each channel:

$$V_n(t,f) = \sqrt{\lambda_n(t,f)}\; e^{j \varphi_n(t,f)}$$

where $V_n(t,f)$ denotes the frequency-domain noise data of the n-th microphone channel at the t-th frame and f-th frequency point.

c. Next, use the noise signal to calculate the noise covariance matrix of each frequency point:

$$\boldsymbol{\Phi}(t,f) = \alpha_v\, \boldsymbol{\Phi}(t-1,f) + (1 - \alpha_v)\, \mathbf{V}(t,f)\, \mathbf{V}^{\mathrm{H}}(t,f)$$

where $\boldsymbol{\Phi}(t,f)$ denotes the noise covariance matrix of the microphone subarray at the t-th frame and f-th frequency point and $\alpha_v$ denotes the noise-covariance-matrix smoothing factor.

d. Then, calculate the adaptive beam forming weight vector from the noise covariance matrix and the steering vector corresponding to the sound source direction estimate obtained in step 204:

$$\mathbf{w}(t,f) = \frac{\boldsymbol{\Phi}^{-1}(t,f)\, \mathbf{d}(f)}{\mathbf{d}^{\mathrm{H}}(f)\, \boldsymbol{\Phi}^{-1}(t,f)\, \mathbf{d}(f)}$$

where $\mathbf{w}(t,f)$ denotes the adaptive beam forming weight vector at the t-th frame and f-th frequency point and $\mathbf{d}(f)$ denotes the steering vector of the f-th frequency point for the estimated direction, computed as

$$\mathbf{d}(f) = \exp\!\left(-j\, \frac{2\pi f}{c} \left( \mathbf{x} \cos\hat{\theta}_k \cos\hat{\phi}_k + \mathbf{y} \sin\hat{\theta}_k \cos\hat{\phi}_k + \mathbf{z} \sin\hat{\phi}_k \right)\right)$$

in which $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ denote the column vectors of the subarray microphones' abscissas, ordinates, and vertical coordinates, $\hat{\theta}_k$ denotes the azimuth estimate of the k-th sound source, and $\hat{\phi}_k$ its pitch-angle estimate.

e. Finally, perform frequency-domain filtering of the current frame's data with the adaptive beam forming weight vector to obtain the single-channel adaptively beam formed frequency-domain signal of the k-th sound source:

$$Z_k(t,f) = \mathbf{w}^{\mathrm{H}}(t,f)\, \mathbf{Y}(t,f)$$

where $Z_k(t,f)$ denotes the single-channel adaptively beam formed frequency-domain signal of the k-th sound source at the t-th frame and f-th frequency point.
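A per-frame sketch of steps d and e, the weight computation and filtering, under the steering-vector model above (the diagonal loading added to the covariance is a numerical safeguard, not from the patent):

```python
import numpy as np

def mvdr_enhance(Y, noise_cov, mic_xyz, az, el, freqs, c=343.0):
    """MVDR beam toward the estimated direction for one frame.

    Y: (N, F) subarray spectra; noise_cov: (F, N, N) smoothed noise
    covariances; az/el: estimated azimuth and pitch angle in radians.
    Returns the (F,) single-channel beamformed spectrum.
    """
    u = np.array([np.cos(az) * np.cos(el),
                  np.sin(az) * np.cos(el),
                  np.sin(el)])                           # unit look direction
    tau = mic_xyz @ u / c                                # (N,) channel delays
    N = len(tau)
    Z = np.empty(Y.shape[1], dtype=complex)
    for f, freq in enumerate(freqs):
        d = np.exp(-1j * 2 * np.pi * freq * tau)         # steering vector d(f)
        Phi = noise_cov[f] + 1e-6 * np.eye(N)            # diagonal loading
        Phi_inv = np.linalg.inv(Phi)
        w = Phi_inv @ d / (np.conj(d) @ Phi_inv @ d)     # MVDR weights
        Z[f] = np.conj(w) @ Y[:, f]                      # Z = w^H Y
    return Z
```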
Finally, the K frequency-domain beam formed signals are written to the beam formed signal buffers corresponding to the K sound sources. If the user has not specified a buffer duration, T frames of beam formed signal are buffered by default for each sound source for sound event detection. If any sound source stops sounding, its sound source localization information and beam forming buffer are cleared after a period of time.
And 207, detecting the sound event of the single-channel enhanced signal, and displaying the detection result and the target sound source direction on the intelligent glasses.
For the single-channel enhanced signal obtained in step 206, feature parameters are extracted and input into the sound event detection model to obtain a sound event detection result; finally, the sound source localization estimate and the detection result are displayed on the smart glasses lenses. Outdoors, detecting car horns and the bells of non-motor vehicles such as bicycles and electric scooters reminds the user to keep clear; in enclosed places, detecting alarm bells reminds the user of personal safety; at home, detecting doorbells, telephone rings, and appliance prompt tones reminds the user of visitors, incoming calls, or appliances that have finished working and should be switched off in time. Other non-speech signals such as a child crying or a pet calling can also be detected. Through sound source localization and sound event detection, the user's safety can be ensured and daily life assisted.
All the above technical solutions may be combined to form an optional embodiment of the present application, which is not described here in detail.
As can be seen from the foregoing, the audio processing method provided in the embodiments of the application receives a multi-channel audio signal through a microphone array; frames, windows, and Fourier transforms it to obtain the processed frequency-domain signals; performs beam forming on those signals in a plurality of preset directions to generate beam forming signals; calculates the frequency-domain power value of each beam forming signal; determines a target observation area according to the power values; performs sound source localization within that area to obtain a sound source direction estimate; obtains the noise-signal power spectrum of each subarray channel; performs adaptive beam forming based on the noise power spectrum and the direction estimate to obtain a single-channel enhanced signal; performs sound event detection on the enhanced signal; and displays the detection result and the target sound source direction on the smart glasses. By restricting sound source localization and sound event detection to the target observation area and displaying the detection results on the smart glasses, the embodiments improve the efficiency of event reminders for hearing-impaired users.
In order to facilitate better implementation of the audio processing method of the embodiment of the application, the embodiment of the application also provides an audio processing device. Referring to fig. 7, fig. 7 is a schematic structural diagram of an audio processing device according to an embodiment of the present application. The audio processing apparatus may include:
a calculating module 301, configured to receive a multi-channel audio signal based on the microphone array, and calculate beam forming signals and corresponding power values in a plurality of preset directions according to the multi-channel audio signal, respectively;
the positioning module 302 is configured to determine a target observation area according to the power value, and perform sound source positioning in the target observation area to determine a target sound source direction;
an enhancing module 303, configured to perform adaptive beamforming for the target sound source direction, so as to obtain a single-channel enhanced signal;
and the detection module 304 is configured to detect a sound event of the single-channel enhanced signal, and display a detection result and the target sound source direction on the smart glasses.
In an embodiment, please further refer to fig. 8, fig. 8 is another schematic structural diagram of an audio processing apparatus according to an embodiment of the present application. Wherein, the positioning module 302 may include:
A selecting submodule 3021, configured to select at least one target microphone from the microphone array according to the target observation area;
the positioning sub-module 3022 is configured to form the at least one target microphone into a microphone sub-array, and perform sound source positioning based on the microphone sub-array to determine a target sound source direction.
In an embodiment, the enhancement module 303 may include:
an acquisition submodule 3031, configured to acquire a noise signal power spectrum among channels of the microphone subarray;
and an enhancement submodule 3032, configured to perform adaptive beamforming based on the noise signal power spectrum and the sound source direction estimated value to obtain the single-channel enhancement signal.
All the above technical solutions may be combined to form an optional embodiment of the present application, which is not described here in detail.
As can be seen from the foregoing, the audio processing apparatus provided in the embodiments of the present application may receive a multi-channel audio signal based on a microphone array, calculate beam forming signals and corresponding power values in a plurality of preset directions according to the multi-channel audio signal, determine a target observation area according to the power values, perform sound source positioning in the target observation area to determine a target sound source direction, perform adaptive beam forming for the target sound source direction, so as to obtain a single-channel enhanced signal, perform sound event detection on the single-channel enhanced signal, and display a detection result and the target sound source direction on the smart glasses. According to the embodiment of the application, the sound source positioning and the sound event detection are performed on the target observation area, and the sound event detection is displayed in the intelligent glasses, so that the event reminding efficiency of the hearing-impaired person is improved.
Correspondingly, an embodiment of the present application further provides smart glasses, which may act as a terminal or a server; the terminal may be a terminal device such as a smartphone, a tablet computer, a notebook computer, a touch screen, a game console, a personal computer (PC), or a personal digital assistant (PDA). Referring to fig. 9, fig. 9 is a schematic structural diagram of the smart glasses according to an embodiment of the present application. The smart glasses 400 include a processor 401 having one or more processing cores, a memory 402 having one or more storage media, and a computer program stored on the memory 402 and executable on the processor 401. The processor 401 is electrically connected to the memory 402. Those skilled in the art will appreciate that the smart glasses structure shown in the figure is not limiting; more or fewer components than shown may be included, certain components may be combined, or the components may be arranged differently.
The processor 401 is the control center of the smart glasses 400. It connects the various parts of the smart glasses 400 using various interfaces and lines, and performs the various functions of the smart glasses 400 and processes data by running or loading software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the smart glasses 400 as a whole.
In this embodiment of the present application, the processor 401 in the smart glasses 400 loads instructions corresponding to the processes of one or more application programs into the memory 402, and executes the application programs stored in the memory 402 so as to implement the following functions:
receiving a multi-channel audio signal based on the microphone array, and calculating beamforming signals and corresponding power values in a plurality of preset directions from the multi-channel audio signal;
determining a target observation area according to the power values, and performing sound source positioning in the target observation area to determine a target sound source direction;
performing adaptive beamforming for the target sound source direction to obtain a single-channel enhanced signal;
and performing sound event detection on the single-channel enhanced signal, and displaying the detection result and the target sound source direction on the smart glasses.
The specific implementation of each of the above operations may refer to the foregoing embodiments and is not repeated here.
Optionally, as shown in fig. 9, the smart glasses 400 further include: a touch display 403, a radio frequency circuit 404, an audio circuit 405, an input unit 406, and a power supply 407. The processor 401 is electrically connected to each of the touch display 403, the radio frequency circuit 404, the audio circuit 405, the input unit 406, and the power supply 407. Those skilled in the art will appreciate that the smart glasses structure shown in fig. 9 is not limiting; more or fewer components than shown may be included, certain components may be combined, or the components may be arranged differently.
The touch display 403 may be used to display a graphical user interface and to receive operation instructions generated by a user acting on the graphical user interface. The touch display 403 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to the user, as well as the various graphical user interfaces of the smart glasses, which may be composed of graphics, text, icons, video, and any combination thereof. Optionally, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. The touch panel may be used to collect touch operations by the user on or near it (such as operations by the user on or near the touch panel using a finger, a stylus, or any other suitable object or accessory) and to generate corresponding operation instructions that trigger the corresponding programs. Optionally, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 401; it can also receive and execute commands sent by the processor 401. The touch panel may overlay the display panel; when the touch panel detects a touch operation on or near it, it passes the operation to the processor 401 to determine the type of touch event, and the processor 401 then provides a corresponding visual output on the display panel according to the type of touch event. In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display 403 to implement the input and output functions. In some embodiments, however, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions; that is, the touch display 403 may also implement an input function as part of the input unit 406.
In the embodiment of the present application, a graphical user interface is generated on the touch display 403 when the processor 401 executes the application program. The touch display 403 is used to present the graphical user interface and to receive operation instructions generated by the user acting on it.
The radio frequency circuit 404 may be used to establish wireless communication with a network device or another electronic device, and to transmit and receive radio frequency signals to and from that device.
The audio circuit 405 may be used to provide an audio interface between the user and the smart glasses through a speaker and a microphone. The audio circuit 405 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; conversely, the microphone converts collected sound signals into electrical signals, which the audio circuit 405 receives and converts into audio data. The audio data are then output to the processor 401 for processing and sent, for example, via the radio frequency circuit 404 to another electronic device, or output to the memory 402 for further processing. The audio circuit 405 may also include an earphone jack to allow peripheral headphones to communicate with the electronic device.
The input unit 406 may be used to receive input numbers, character information, or user characteristic information (for example, a fingerprint, an iris, or facial information), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 407 is used to power the various components of the smart glasses 400. Optionally, the power supply 407 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption management functions are implemented through the power management system. The power supply 407 may also include one or more of a direct-current or alternating-current power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown in fig. 9, the smart glasses 400 may further include a camera, sensors, a wireless fidelity module, a Bluetooth module, and the like, which are not described here.
In the foregoing embodiments, each embodiment is described with its own emphasis. For parts of an embodiment that are not described in detail, reference may be made to the related descriptions of the other embodiments.
As can be seen from the foregoing, the smart glasses provided in this embodiment receive a multi-channel audio signal based on the microphone array, calculate beamforming signals and corresponding power values in a plurality of preset directions from the multi-channel audio signal, determine a target observation area according to the power values, perform sound source positioning in the target observation area to determine a target sound source direction, perform adaptive beamforming for the target sound source direction to obtain a single-channel enhanced signal, perform sound event detection on the single-channel enhanced signal, and display the detection result and the target sound source direction on the smart glasses. By performing sound source positioning and sound event detection on the target observation area and displaying the detection result on the smart glasses, the embodiments of the present application improve the efficiency of event alerts for hearing-impaired users.
Those of ordinary skill in the art will appreciate that all or some of the steps of the various methods of the above embodiments may be completed by instructions, or by instructions controlling the associated hardware; the instructions may be stored in a storage medium and loaded and executed by a processor.
To this end, the embodiments of the present application provide a storage medium in which a computer program is stored, the computer program being capable of being loaded by a processor to perform the steps of any of the audio processing methods provided by the embodiments of the present application. For example, the computer program may perform the following steps:
receiving a multi-channel audio signal based on the microphone array, and calculating beamforming signals and corresponding power values in a plurality of preset directions from the multi-channel audio signal;
determining a target observation area according to the power values, and performing sound source positioning in the target observation area to determine a target sound source direction;
performing adaptive beamforming for the target sound source direction to obtain a single-channel enhanced signal;
and performing sound event detection on the single-channel enhanced signal, and displaying the detection result and the target sound source direction on the smart glasses.
The specific implementation of each of the above operations may refer to the foregoing embodiments and is not repeated here.
The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Because the computer program stored in the storage medium can perform the steps of any audio processing method provided in the embodiments of the present application, it can achieve the beneficial effects achievable by any audio processing method provided in the embodiments of the present application; these are detailed in the foregoing embodiments and are not repeated here.
The audio processing method, apparatus, storage medium, and smart glasses provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific embodiments and application scope in light of the ideas of the present application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims (8)

1. An audio processing method applied to smart glasses, the smart glasses comprising a microphone array, the microphone array comprising opposite-side microphones respectively mounted on two temples, the opposite-side microphones being arranged in parallel, and the maximum spacing between same-side microphones located on the same temple being at least one half of the spacing between the opposite-side microphones, characterized in that the method comprises:
receiving a multi-channel audio signal based on the microphone array, and calculating beamforming signals and corresponding frequency domain power values in a plurality of preset directions from the multi-channel audio signal, wherein the plurality of preset directions are the angular bisector directions of a plurality of observation areas;
comparing the frequency domain power values of the beamforming signals corresponding to the preset directions, determining a target direction according to the comparison result, determining a target observation area according to the target direction, selecting at least one target microphone from the microphone array according to the target observation area, forming the at least one target microphone into a microphone subarray, and performing sound source positioning based on the microphone subarray to determine a target sound source direction;
performing adaptive beamforming for the target sound source direction to obtain a single-channel enhanced signal;
and performing sound event detection on the single-channel enhanced signal, and displaying type information of the detected sound event and the target sound source direction on the smart glasses.
2. The audio processing method of claim 1, wherein the calculating beamforming signals and corresponding power values in a plurality of preset directions from the multi-channel audio signal comprises:
framing, windowing, and Fourier transforming the multi-channel audio signal to obtain processed frequency domain signals;
beamforming the frequency domain signals in the plurality of preset directions to generate the beamforming signals;
and calculating the frequency domain power values of the beamforming signals corresponding to the preset directions.
3. The audio processing method of claim 1, wherein the performing sound source positioning based on the microphone subarray to determine the target sound source direction comprises:
performing non-repeating microphone pairing on the microphone subarray, and calculating a generalized cross-correlation function for each pairing combination;
performing an inverse Fourier transform on the generalized cross-correlation function to obtain an angle spectrum function of the pairing combination;
traversing all microphone pairing combinations and accumulating the angle spectrum functions of all microphone pairing combinations to obtain the angle spectrum function of the microphone subarray;
and extracting at least one local maximum of the angle spectrum function of the microphone subarray, and determining a sound source direction estimate from the azimuth angle and pitch angle corresponding to a local maximum satisfying a preset condition.
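Claim 3 describes pairwise generalized cross-correlation with accumulation of per-pair angle spectrum functions. The sketch below follows that structure for azimuth only (the claim also uses the pitch angle) and assumes PHAT weighting, which the claim does not specify; the geometry and angle grid are illustrative.

```python
import itertools
import numpy as np

def gcc_phat(spec_i, spec_j):
    """Generalized cross-correlation of two single-frame rFFT spectra with
    PHAT weighting (the weighting is an assumption; the claim leaves it open)."""
    cross = spec_i * np.conj(spec_j)
    cross /= np.abs(cross) + 1e-12
    return np.fft.irfft(cross)          # circular cross-correlation over lags

def subarray_angle_spectrum(spectra, mic_pos, azimuths, fs=16000, c=343.0):
    """Accumulate the per-pair angle spectrum over all non-repeating mic pairs."""
    n_fft = 2 * (spectra.shape[1] - 1)
    dirs = np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)   # (A, 2)
    acc = np.zeros(len(azimuths))
    for i, j in itertools.combinations(range(len(mic_pos)), 2):
        cc = gcc_phat(spectra[i], spectra[j])
        taus = (mic_pos[i] - mic_pos[j]) @ dirs.T / c               # expected TDOAs
        lags = np.round(taus * fs).astype(int) % n_fft              # circular index
        acc += cc[lags]                 # this pair's angle spectrum contribution
    return acc                          # local maxima give direction candidates

# Illustrative use with a 2-mic-per-temple layout (coordinates assumed, metres):
mic_pos = np.array([[0.00, 0.0], [0.07, 0.0], [0.00, 0.14], [0.07, 0.14]])
spectra = np.fft.rfft(np.random.randn(4, 512), axis=1)   # stand-in frame
azimuths = np.deg2rad(np.arange(0, 360, 5))
est = azimuths[np.argmax(subarray_angle_spectrum(spectra, mic_pos, azimuths))]
```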
4. The audio processing method of claim 3, wherein the performing adaptive beamforming for the target sound source direction to obtain a single-channel enhanced signal comprises:
acquiring the noise signal power spectrum of each channel of the microphone subarray;
and performing adaptive beamforming based on the noise signal power spectrum and the sound source direction estimate to obtain the single-channel enhanced signal.
5. The audio processing method of claim 4, wherein the acquiring the noise signal power spectrum of each channel of the microphone subarray comprises:
acquiring a frequency domain smoothed power spectrum of the signal at each frequency point of the microphone subarray;
updating the power minimum of each frequency point of the microphone subarray according to the frequency domain smoothed power spectrum, and calculating the speech presence probability of each frequency point;
updating the noise smoothing factor of each frequency point of the microphone subarray according to the speech presence probability;
and obtaining a noise power estimate for each frequency point of the microphone subarray according to the frequency domain power spectrum of the signal received by the microphone subarray and the noise smoothing factor.
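Claim 5 outlines a minimum-statistics noise tracker with a speech presence probability controlling the noise smoothing factor, in the spirit of MCRA. The sketch below follows that structure for one channel; the smoothing constants and the hard presence threshold are illustrative assumptions, and a practical tracker would also reset the tracked minimum over a finite window.

```python
import numpy as np

class McraNoiseTracker:
    """MCRA-style recursive noise estimator for one channel (sketch only)."""

    def __init__(self, n_bins, alpha_s=0.8, alpha_d=0.95, ratio_thresh=5.0):
        self.alpha_s, self.alpha_d, self.thresh = alpha_s, alpha_d, ratio_thresh
        self.smoothed = np.zeros(n_bins)          # frequency domain smoothed power
        self.minimum = np.full(n_bins, np.inf)    # per-bin power minimum
        self.noise = np.zeros(n_bins)             # noise power estimate

    def update(self, sig_psd):
        # 1) smooth the signal power spectrum over time
        self.smoothed = self.alpha_s * self.smoothed + (1 - self.alpha_s) * sig_psd
        # 2) update the per-bin power minimum
        self.minimum = np.minimum(self.minimum, self.smoothed)
        # 3) speech presence probability from the power-to-minimum ratio
        p_speech = (self.smoothed > self.thresh * self.minimum).astype(float)
        # 4) presence-dependent noise smoothing factor, then noise update
        alpha = self.alpha_d + (1 - self.alpha_d) * p_speech   # ~1 where speech
        self.noise = alpha * self.noise + (1 - alpha) * sig_psd
        return self.noise

# Usage sketch (one call per frame):
# tracker = McraNoiseTracker(n_bins=257)
# noise_psd = tracker.update(np.abs(frame_spectrum) ** 2)
```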
6. An audio processing apparatus applied to smart glasses, the smart glasses comprising a microphone array, the microphone array comprising opposite-side microphones respectively mounted on two temples, the opposite-side microphones being arranged in parallel, and the maximum spacing between same-side microphones located on the same temple being at least one half of the spacing between the opposite-side microphones, characterized in that the apparatus comprises:
a calculating module, configured to receive a multi-channel audio signal based on the microphone array and to calculate beamforming signals and corresponding frequency domain power values in a plurality of preset directions from the multi-channel audio signal, wherein the plurality of preset directions are the angular bisector directions of a plurality of observation areas;
a positioning module, configured to compare the frequency domain power values of the beamforming signals corresponding to the preset directions, determine a target direction according to the comparison result, determine a target observation area according to the target direction, select at least one target microphone from the microphone array according to the target observation area, form the at least one target microphone into a microphone subarray, and perform sound source positioning based on the microphone subarray to determine a target sound source direction;
an enhancement module, configured to perform adaptive beamforming for the target sound source direction to obtain a single-channel enhanced signal;
and a detection module, configured to perform sound event detection on the single-channel enhanced signal and to display type information of the detected sound event and the target sound source direction on the smart glasses.
7. A storage medium storing a computer program adapted to be loaded by a processor to perform the steps of the audio processing method according to any one of claims 1 to 5.
8. Smart glasses, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor performs the steps of the audio processing method according to any one of claims 1 to 5 by calling the computer program stored in the memory.
CN202310043222.0A 2023-01-29 2023-01-29 Audio processing method, device, storage medium and intelligent glasses Active CN115775564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310043222.0A CN115775564B (en) 2023-01-29 2023-01-29 Audio processing method, device, storage medium and intelligent glasses

Publications (2)

Publication Number Publication Date
CN115775564A 2023-03-10
CN115775564B 2023-07-21

Family

ID=85393725

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant