CN110556103A

CN110556103A - Audio signal processing method, apparatus, system, device and storage medium

Info

Publication number: CN110556103A
Application number: CN201810548055.4A
Authority: CN
Inventors: 田彪; 余涛; 刘勇; 万玉龙; 高杰
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-05-31
Filing date: 2018-05-31
Publication date: 2019-12-10
Anticipated expiration: 2038-05-31
Also published as: CN110556103B

Abstract

The invention discloses an audio signal processing method, apparatus, system, device, and storage medium. The method comprises: processing the audio signals received by a microphone array to obtain multiple beams aimed at multiple directions; performing sound source target detection on the audio signals in the multiple beams and determining a plurality of candidate sound source target positions; using a plurality of wake-up recognition components to perform wake-up recognition, in parallel, on the audio signals from the plurality of candidate sound source target positions; and determining the final wake-up state according to the wake-up recognition results. The audio signal processing method provided by the embodiments of the invention can monitor and perform wake-up recognition on multiple channels of audio signals simultaneously, improving processing efficiency and recognition efficiency.

Description

Audio signal processing method, apparatus, system, device and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular to an audio signal processing method, apparatus, system, device, and storage medium.

Background

With the continuous development of voice recognition technology, it has been rapidly adopted in fields such as automobile driving, smart homes, and intelligent business systems, where voice recognition allows corresponding functions to be executed quickly and accurately.

Existing voice recognition technology can only recognize the audio signal of a single user in order to execute corresponding functions. For mixed audio signals from different directions, the mixed signal is usually first separated and the sound source direction calculated, and only then is voice recognition applied, so both processing efficiency and recognition efficiency are low.

Disclosure of Invention

Embodiments of the present invention provide an audio signal processing method, apparatus, system, device, and storage medium, which can monitor and perform wake-up recognition on multiple channels of audio signals simultaneously, thereby improving audio processing efficiency and recognition efficiency.

According to an aspect of an embodiment of the present invention, there is provided an audio signal processing method including:

processing the audio signals received by a microphone array to obtain multiple beams aimed at multiple directions; performing sound source target detection on the audio signals in the multiple beams and determining a plurality of candidate sound source target positions; using a plurality of wake-up recognition components to perform wake-up recognition, in parallel, on the audio signals from the candidate sound source target positions; and determining the final wake-up state according to the wake-up recognition results.

According to another aspect of embodiments of the present invention, there is provided an audio signal processing apparatus including:

a beamforming module, configured to process the audio signals received by a microphone array to obtain multiple beams aimed at multiple directions; a sound source target detection module, configured to perform sound source target detection on the audio signals in the multiple beams and determine a plurality of candidate sound source target positions; a parallel wake-up recognition module, configured to use a plurality of wake-up recognition components to perform wake-up recognition, in parallel, on the audio signals from the candidate sound source target positions; and a wake-up result determining module, configured to determine the final wake-up state according to the wake-up recognition results.

According to still another aspect of embodiments of the present invention, there is provided an audio signal processing system including: a memory and a processor; the memory is used for storing programs; the processor is used for reading the executable program codes stored in the memory to execute the audio signal processing method.

According to still another aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the audio signal processing method of the above-described aspects.

According to still another aspect of the embodiments of the present invention, there is provided an audio interaction device, including: a microphone array and a processor; the microphone array is used for collecting mixed audio signals, and the mixed audio signals comprise audio signals from the driver's seat and audio signals from the front passenger seat; the processor is in communication connection with the microphone array and is used for processing the audio signals received by the microphone array to obtain multiple beams aimed at multiple directions; performing sound source target detection on the audio signals in the multiple beams and determining a plurality of candidate sound source target positions; using a plurality of wake-up recognition components to perform wake-up recognition, in parallel, on the audio signals from the candidate sound source target positions; and determining the final wake-up state according to the wake-up recognition results.

According to still another aspect of embodiments of the present invention, there is provided an audio interaction device, including: a microphone array and a processor; the microphone array is used for collecting mixed audio signals, and the mixed audio signals comprise audio signals of a plurality of conference participants; the processor is in communication connection with the microphone array and is used for processing the audio signals received by the microphone array to obtain multiple beams aimed at multiple directions; performing sound source target detection on the audio signals in the multiple beams and determining a plurality of candidate sound source target positions; using a plurality of wake-up recognition components to perform wake-up recognition, in parallel, on the audio signals from the candidate sound source target positions; and determining the final wake-up state according to the wake-up recognition results.

According to the audio signal processing method, apparatus, system, device, and storage medium of the embodiments of the invention, multiple channels of audio signals can be monitored and subjected to wake-up recognition simultaneously, improving processing efficiency and recognition efficiency.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic diagram illustrating an application scenario of an audio signal processing method according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating an audio signal processing method according to an embodiment of the present invention;

FIG. 3a is a schematic diagram illustrating wake up recognition in the prior art;

FIG. 3b is a schematic diagram illustrating wake up recognition according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating a structure of a wake up identification component according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating an application scenario of a vehicle-mounted voice control environment of an audio signal processing method according to an embodiment of the present invention;

Fig. 6 is a schematic view illustrating an application scenario of a smart home environment of an audio signal processing method according to an embodiment of the present application;

Fig. 7 is a schematic view illustrating an application scenario of a conference environment of an audio signal processing method according to an embodiment of the present application;

Fig. 8 is a flowchart illustrating an audio signal processing method according to another embodiment of the present invention;

Fig. 9 is a schematic structural diagram illustrating an audio signal processing apparatus provided according to an embodiment of the present invention;

Fig. 10 is a block diagram illustrating an exemplary hardware architecture of a computing device in which the audio signal processing method and apparatus according to the embodiments of the present invention may be implemented;

FIG. 11 is a schematic structural diagram illustrating an audio interaction device according to one embodiment of the present invention;

Fig. 12 is a schematic structural diagram showing an audio interaction device according to another embodiment of the present invention.

Detailed Description

Features and exemplary embodiments of various aspects of the present invention are described in detail below. To make the objects, technical solutions, and advantages of the present invention clearer, the invention is further described with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting it. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of it.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

In embodiments of the invention, a microphone array may be used to capture mixed audio signals from different directions. Here, an audio signal is understood to be a carrier of sound frequency information and sound amplitude variation information, and may comprise at least one of speech, music, and sound effects. The microphone array may consist of a certain number of sound collecting devices arranged in a predetermined topology, and performs signal sampling and signal processing on audio signals from spatially different directions. Each sound collecting device in the microphone array may be referred to as an array element, and each microphone array includes at least two array elements. As one example, the sound collecting device may be an acoustic sensor or a microphone.

Fig. 1 is a schematic diagram illustrating an application scenario of an audio signal processing method according to an embodiment of the present invention. As shown in Fig. 1, an audio environment in which audio signals from a plurality of sound source directions exist includes a plurality of speakers, such as speaker 01, speaker 02, and speaker 03, a microphone array A0, and an audio interaction device 10. The audio interaction device 10 may include an audio signal processing system 11 and a speech control system 12. In one embodiment, the microphone array A0 may also be integrated into the audio interaction device 10.

In one embodiment, the microphone array A0 may include a plurality of sound collecting devices such as sound collecting device A1, sound collecting device A2, and sound collecting device A3. The rule by which the sound collecting devices in the microphone array A0 are arranged may be referred to as the topology of the microphone array. According to this topology, microphone arrays may be divided into linear microphone arrays, planar microphone arrays, and stereo microphone arrays.

As an example, a linear microphone array may indicate that the centers of the array elements of the microphone array are located on the same straight line, such as a horizontal array; the planar microphone array can represent that the centers of array elements of the microphone array are distributed on a plane, such as a triangular array, a circular array, a T-shaped array, an L-shaped array, a square array and the like; the stereo microphone array may represent that the centers of the array elements of the microphone array are distributed in a stereo space, such as a polyhedral array, a spherical array, and the like.
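To make the topology distinction concrete, the following minimal sketch generates array-element coordinates for a linear (horizontal) array and a circular planar array. The function names, the element counts, and the spacing and radius values are illustrative assumptions and do not come from the embodiments.

```python
import math

def linear_array(num_elements, spacing_m):
    """Element centers on one straight line (a horizontal array)."""
    return [(i * spacing_m, 0.0) for i in range(num_elements)]

def circular_array(num_elements, radius_m):
    """Element centers spread evenly on a circle (one planar topology)."""
    step = 2 * math.pi / num_elements
    return [(radius_m * math.cos(i * step), radius_m * math.sin(i * step))
            for i in range(num_elements)]

# Hypothetical geometries: 3 elements 5 cm apart, and 6 elements on a 4 cm circle.
line = linear_array(3, 0.05)
circle = circular_array(6, 0.04)
print(line)
print(circle)
```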

The audio signal processing method according to the embodiment of the present invention does not particularly limit the form of the microphone array used. As one example, the microphone array may be a horizontal array, a T-shaped array, an L-shaped array, or a square array. For simplicity of description, Fig. 1 uses a horizontal array as an example to illustrate the acquisition of audio signals from multiple directions. This description is not to be interpreted as limiting the scope or implementation possibilities of the present solution; microphone arrays of other topologies are processed in the same way as horizontal arrays.

As shown in Fig. 1, speaker 01, speaker 02, and speaker 03 are in different directions relative to the microphone array A0, and they may or may not produce audio signals simultaneously. When more than one speaker is producing audio signals, the microphone array may receive audio signals from a plurality of different directions, which together form a mixed audio signal.

As an example, the mixed audio signal may include an audio signal from speaker 01, an audio signal from speaker 02, and an audio signal from speaker 03.

In the prior art, an audio signal processing system needs to perform positioning according to audio signals acquired by a microphone array, estimate a specific direction of target voice, perform waveform enhancement processing on the target audio signal in the specific direction, and perform wake-up recognition on the specific direction, so that the processing efficiency and the recognition efficiency are low. Moreover, the audio signal processing in the prior art can only monitor the audio signal of a single speaker, lacks the ability to simultaneously monitor voices in multiple directions, and cannot perform more intelligent multi-user voice interaction.

In view of this, an embodiment of the present invention provides an audio signal processing method that can simultaneously perform waveform enhancement on audio signals from multiple spatial directions at multiple preset angles to obtain multiple beams, and then perform wake-up recognition on the multiple beams at the same time to determine the final wake-up state of a voice control system, thereby improving the processing efficiency and recognition efficiency of the audio signals and enabling more intelligent multi-user voice interaction.

For a better understanding of the present invention, an audio signal processing method according to an embodiment of the present invention will be described in detail below with reference to the accompanying drawings. It is to be understood that these examples are not intended to limit the scope of the present disclosure.

Fig. 2 shows a schematic flow diagram of an audio signal processing method according to an embodiment of the invention. As shown in fig. 2, the audio signal processing method 100 in the embodiment of the present invention may include the following steps:

Step S110, preprocessing the acquired mixed audio signal.

In an embodiment of the present invention, the preprocessing of the mixed audio signal may include echo cancellation and conversion of the audio signal from a time domain signal to a frequency domain signal.

In one embodiment, when an audio signal is collected using a microphone array, sound played by a far-end loudspeaker may propagate and reflect in space to form a repeated sound signal. This repeated signal is picked up again by the microphone and forms noise interference, which is referred to as acoustic echo, or echo for short.

Therefore, in order to improve the quality of the audio signal and eliminate the signal interference caused by the acoustic echo, it is necessary to perform echo cancellation processing on the collected mixed audio signal, that is, to remove the acoustic echo in the collected audio signal.

In this embodiment, the sound signal emitted by the far-end loudspeaker, that is, the far-end signal, may be used as a reference signal. An Acoustic Echo Cancellation (AEC) technique uses an adaptive filter to identify and track the echo propagation path between the far-end loudspeaker and the microphone, builds a model of that path, and estimates the echo signal that the reference signal is likely to produce. Subtracting the estimated echo signal from the collected mixed audio signal (equivalently, adding a phase-inverted copy of the estimate) removes the echo from the mixed audio signal and improves the audio signal quality.

In another embodiment, the preprocessing of the audio signal may further comprise converting the received audio signal from a time domain signal into a frequency domain signal.
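The adaptive-filter idea described above can be sketched as follows. This is a minimal illustration that assumes a normalized least-mean-squares (NLMS) update, which is one common choice and is not named in the embodiment; the echo path, step size, tap count, and test signals are all hypothetical.

```python
import math
import random

def nlms_echo_cancel(mic, far_end, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of far_end from mic (NLMS, assumed)."""
    weights = [0.0] * taps     # running model of the echo propagation path
    history = [0.0] * taps     # most recent far-end samples, newest first
    output = []
    for mic_sample, ref_sample in zip(mic, far_end):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = mic_sample - echo_estimate           # echo-cancelled sample
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        output.append(error)
    return output

def power(samples):
    return sum(v * v for v in samples) / len(samples)

random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]   # reference signal
echo_path = [0.6, 0.3, -0.2]                                 # hypothetical room response
echo = [sum(echo_path[k] * far_end[n - k]
            for k in range(len(echo_path)) if n - k >= 0)
        for n in range(len(far_end))]
near = [0.05 * math.sin(2 * math.pi * 0.01 * n) for n in range(len(far_end))]
mic = [e + s for e, s in zip(echo, near)]                    # mixed signal with echo

cleaned = nlms_echo_cancel(mic, far_end)
residual = [c - s for c, s in zip(cleaned[-1000:], near[-1000:])]
print("echo power:", power(echo[-1000:]), "residual echo power:", power(residual))
```

The residual is measured only over the last 1000 samples, after the filter has had time to converge on the simulated path.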

In this embodiment, the time domain analysis and the frequency domain analysis of the audio signal are two different ways of analyzing the audio signal. In brief, the time domain can be used to describe the relationship between the audio signal and time, that is, the dynamic change of the audio signal along with time is analyzed by taking time as a variable; and the frequency domain can be used to describe the relationship between the audio signal and the frequency, i.e. the characteristics of the audio signal at different frequencies are analyzed with the frequency as a variable.

In the embodiment of the invention, time domain analysis of the audio signal gives intuitive access to information such as the period and amplitude of the audio signal. Converting the audio signal from a time domain signal into a frequency domain signal, and processing it in the frequency domain by analyzing its spectral characteristics, can achieve higher processing efficiency and performance.

As one example, the audio signal may be transformed from a time domain signal to a frequency domain signal by a Fourier transform algorithm. The basic principle of the Fourier transform can be understood as follows: the continuously measured audio signal is represented as an infinite superposition of sinusoidal signals of different frequencies. The Fourier transform therefore takes the directly measured audio signal as the original signal and calculates the frequency, amplitude, and phase of the different sinusoidal components superimposed in it, for use in subsequent audio signal processing.

As another example, a frequency domain signal of the audio signal may be converted into a time domain signal using an inverse Fourier transform algorithm.
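As an illustration of the forward and inverse transforms, the sketch below applies a direct discrete Fourier transform to a short test signal, reads off the frequency, amplitude, and phase of its strongest component, and converts the spectrum back to the time domain. The sample rate and the test tones are assumptions, and a practical system would typically use a fast Fourier transform rather than this direct O(n²) form.

```python
import cmath
import math

def dft(samples):
    """Direct discrete Fourier transform (time domain -> frequency domain)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform (frequency domain -> real time domain signal)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

sample_rate = 800   # Hz, hypothetical
n = 80              # bin width = sample_rate / n = 10 Hz
signal = [1.0 * math.sin(2 * math.pi * 50 * t / sample_rate)
          + 0.5 * math.cos(2 * math.pi * 120 * t / sample_rate)
          for t in range(n)]

spectrum = dft(signal)
half = spectrum[: n // 2]                       # non-negative frequencies only
peak_bin = max(range(len(half)), key=lambda k: abs(half[k]))
peak_hz = peak_bin * sample_rate / n            # frequency of the strongest component
peak_amplitude = 2 * abs(half[peak_bin]) / n    # its amplitude
peak_phase = cmath.phase(half[peak_bin])        # its phase, relative to a cosine
restored = idft(spectrum)                       # back to the time domain
print(peak_hz, peak_amplitude, peak_phase)
```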

Step S120, performing beamforming processing on the acquired mixed audio signal according to a plurality of preset directions to obtain a plurality of paths of beams for the plurality of directions.

In embodiments of the present invention, the microphone array may receive audio signals from various directions, which typically include interfering signals and a large amount of noise. The beamforming method performs a weighted summation of the audio signals output by each array element and, according to a preset angle, steers the receiving direction of the microphone array to the specified sound source direction. This makes the microphone array directionally selective, enhancing the audio signals from the specified direction and suppressing interfering sound sources and noise from other directions.

In one embodiment, the basic idea of the beamforming method is as follows: by setting a weighting coefficient for each array element in the microphone array and performing a weighted summation of the audio signals output by the array elements, the microphone array forms a beam in the desired sound source direction. Collecting the signals inside the beam and rejecting the noise outside it enhances the audio signal in the desired direction.

In the embodiment of the present invention, beamforming methods can be divided into two types: the fixed, or conventional, beamforming method (Conventional Beam Forming, CBF) and the adaptive beamforming method (Adaptive Beam Forming, ABF).

In one embodiment, with the CBF method, the transmission delay of the audio signal from the desired sound source to each array element is estimated; the transmission delays are compensated through delay control so that the audio signals output by the array elements are synchronized; and the synchronized audio signals are weighted and summed using the weighting coefficient of each array element's output signal to obtain the beam of the microphone array.

In the CBF method, the weighting coefficient of each array element output signal is fixed and unchangeable, so the whole processing process is simpler and has low computation amount, and the suppression of background noise in a reverberation environment can be realized.
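The delay compensation and fixed-weight summation described for the CBF method can be sketched as a delay-and-sum beamformer. The sketch assumes a far-field source, a uniform linear array, whole-sample delays, and equal weights; the array geometry, sample rate, source tone, and function names are illustrative choices that do not come from the embodiments.

```python
import math

def steering_delays(num_mics, spacing_m, angle_deg, sample_rate, speed_of_sound=343.0):
    """Arrival delay of each element in whole samples, relative to element 0."""
    sin_angle = math.sin(math.radians(angle_deg))
    return [round(m * spacing_m * sin_angle / speed_of_sound * sample_rate)
            for m in range(num_mics)]

def delay_and_sum(channels, delays, weights=None):
    """Compensate per-element delays, then apply fixed weights and sum."""
    num = len(channels)
    weights = weights or [1.0 / num] * num
    base = min(delays)
    shifts = [d - base for d in delays]            # non-negative sample shifts
    length = len(channels[0]) - max(shifts)
    return [sum(w * ch[t + s] for w, ch, s in zip(weights, channels, shifts))
            for t in range(length)]

def power(samples):
    return sum(v * v for v in samples) / len(samples)

sample_rate = 16000
def source(t):                                     # hypothetical 2 kHz source tone
    return math.sin(2 * math.pi * 2000 * t / sample_rate)

# Simulate a 4-element array, 10 cm spacing, with the source at 30 degrees.
true_delays = steering_delays(4, 0.1, 30.0, sample_rate)
channels = [[source(t - d) for t in range(400)] for d in true_delays]

steered = delay_and_sum(channels, steering_delays(4, 0.1, 30.0, sample_rate))
broadside = delay_and_sum(channels, steering_delays(4, 0.1, 0.0, sample_rate))
print(power(steered), power(broadside))
```

Steering toward the simulated source direction keeps the tone coherent across elements, whereas steering toward 0 degrees sums misaligned copies and attenuates it.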

In one embodiment, when the beam forming processing is performed in a desired direction according to a preset angle by using the ABF method, the weighting coefficient of the output signal of each array element is not fixed, but can be automatically adjusted by using an adaptive algorithm according to the change of the environment and the signal, so that the formed beam always points to the preset angle, thereby suppressing the interference source and the noise and enhancing the useful signal in the beam.

In this embodiment, the ABF method may use any of various adaptive algorithms, such as an algorithm based on the maximum signal-to-noise ratio (SNR) criterion, an algorithm based on the minimum mean square error (MMSE) criterion, or an algorithm based on the linearly constrained minimum variance (LCMV) criterion. The embodiment of the present invention does not limit the specific algorithm used in the ABF method.

As an example, an algorithm based on the LCMV criterion, referred to as the LCMV algorithm for short, may be understood as follows. Given that the desired sound source direction is known, that direction is taken as the target signal direction of beamforming. A vector constraint on the target signal direction points the beam at the desired sound source, and a variance-minimizing algorithm finds the optimal weighting coefficient for each array element's output signal, so that the power of the interference and noise output by the microphone array is minimized. The useful signal is thereby received to the maximum extent while noise and interference are suppressed.
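The LCMV criterion is commonly written as the constrained optimization below. The symbols are introduced here for illustration and do not appear in the embodiment: w is the vector of array-element weighting coefficients, R is the covariance matrix of the array output, C is the constraint matrix whose columns include the steering vector of the desired sound source direction, and f is the vector of desired responses.

```latex
\min_{\mathbf{w}} \; \mathbf{w}^{H}\mathbf{R}\,\mathbf{w}
\quad \text{subject to} \quad \mathbf{C}^{H}\mathbf{w} = \mathbf{f},
\qquad
\mathbf{w}_{\mathrm{LCMV}} = \mathbf{R}^{-1}\mathbf{C}\left(\mathbf{C}^{H}\mathbf{R}^{-1}\mathbf{C}\right)^{-1}\mathbf{f}
```

With a single constraint that fixes unit gain in the desired direction, this closed form reduces to the familiar minimum variance distortionless response (MVDR) weights.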

Step S130, according to the obtained multi-path beams, performing sound source target detection on the audio signals in the obtained multi-path beams, and determining candidate sound source target positions.

In the embodiment of the present invention, the Direction of Arrival (DOA) of the audio signal in the beam is the direction angle at which the audio signal in space reaches the reference array element of the microphone array, that is, the angle between the propagation direction of the audio signal and the normal direction of the microphone array at the reference array element. In some embodiments, this angle may also be referred to as the Angle of Arrival of the audio signal.

In the embodiment of the present invention, the beam direction of the beam, i.e. the direction of arrival of the audio signal in the beam, can be used to measure the positioning probability of the target position of the candidate sound source. Specifically, the smaller the angle of arrival of the audio signal in the beam is, the stronger the signal intensity of the audio signal in the beam is, the higher the probability that the sound source position corresponding to the audio signal in the beam is taken as the sound source target position is, and the higher the accuracy of the positioning result is.

In one embodiment, the localization of the sound source target location may be performed using a direction of arrival estimation method. Specifically, the direction-of-arrival line of the beam relative to the microphone arrays is determined through direction-of-arrival estimation, and triangulation is performed by using the direction-of-arrival lines determined by the receiving array elements of the plurality of microphone arrays, so that the estimated position of the sound source target position of the audio signals in the beam is obtained.

In this embodiment, the direction of arrival of the audio signal in each beam is estimated to obtain the angle of arrival of the audio signal in that beam, and the directions of the beams whose angle of arrival falls within the angle-of-arrival threshold range may be used as the plurality of candidate sound source target positions.

As an example, the angle-of-arrival threshold range may be set to 0 degrees or more and 30 degrees or less, so that only beams whose audio signal has an angle of arrival of at most 30 degrees are kept. When the angle of arrival of the audio signal in a beam is 0 degrees, the sound source position corresponding to that audio signal directly faces the microphone array.

In the embodiment of the invention, the arrival direction of each path of wave beam is estimated, so that the sound source positions of the multiple paths of audio signals received by the microphone array can be respectively detected, the directions of the wave beams of the audio signals meeting the threshold range of the arrival angle are used as the target positions of the candidate sound sources, and the efficiency and the accuracy of sound source positioning are improved.
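The threshold-based selection of candidate beams can be sketched as below. The 0 to 30 degree range follows the example in the text; the function name, the sample angle values, and the choice to rank the selected beams by ascending angle of arrival (since a smaller angle is described as indicating a more probable target) are illustrative assumptions.

```python
def select_candidate_beams(arrival_angles_deg, min_deg=0.0, max_deg=30.0):
    """Return indices of beams whose angle of arrival lies in the threshold range,
    ordered from the smallest angle (most probable target) to the largest."""
    in_range = [i for i, angle in enumerate(arrival_angles_deg)
                if min_deg <= angle <= max_deg]
    return sorted(in_range, key=lambda i: arrival_angles_deg[i])

# Hypothetical estimated angles of arrival, one per beam.
estimated_angles = [12.0, 47.5, 0.0, 30.0, 85.0]
candidates = select_candidate_beams(estimated_angles)
print(candidates)
```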

As shown in fig. 2, the noise reduction process may be performed on the audio signals within the beam from the candidate multiple sound source target positions through step S140.

In one embodiment, the noise reduction processing may include Noise Suppression (NS) processing of the beams from the plurality of candidate sound source target positions, followed by further noise reduction of the noise-suppressed audio signals of the candidate sound source target positions using a post filter.

As an example, the noise suppression processing may use a designated filter to filter the designated beams, that is, to remove noise and interference from the audio signals of the multiple beams and extract the useful audio signals.

As a specific example, a Wiener filter may be used to filter the multiple audio signals so as to minimize the mean square error between the output audio signal of the filter and the desired output audio signal, thereby suppressing noise to the greatest possible extent.
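In the frequency domain, a Wiener filter is often applied as a per-bin gain equal to the estimated speech power divided by the total (speech plus noise) power. The sketch below assumes that form and estimates the speech power by subtracting a known noise power from the noisy power, floored at zero; the power values and function names are hypothetical, and the embodiment does not specify how its filter is computed.

```python
def wiener_gains(noisy_power, noise_power):
    """Per-bin Wiener gain S / (S + N), with S estimated as max(noisy - noise, 0)."""
    gains = []
    for p_noisy, p_noise in zip(noisy_power, noise_power):
        p_speech = max(p_noisy - p_noise, 0.0)   # crude clean-speech power estimate
        gains.append(p_speech / p_noisy if p_noisy > 0 else 0.0)
    return gains

def apply_gains(spectrum, gains):
    """Scale each frequency bin of one frame by its gain."""
    return [g * x for g, x in zip(gains, spectrum)]

# Hypothetical per-bin powers for one frame of one beam.
noisy_power = [4.0, 1.0, 0.5, 10.0]
noise_power = [1.0, 1.0, 1.0, 1.0]
gains = wiener_gains(noisy_power, noise_power)
filtered = apply_gains([2.0, 1.0, 0.7, 3.2], gains)
print(gains, filtered)
```

Bins dominated by speech keep a gain close to 1, while bins at or below the noise floor are driven to 0.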

In this embodiment, the ambient noise may be further filtered using a noise suppression process to improve the fidelity of the audio signal while suppressing the noise.

In one embodiment, the noise reduction processing may further include performing additional noise reduction, using a post filter, on the noise-suppressed audio signals of the candidate sound source target positions.

In this embodiment, the post filter may further filter out background noise, for example residual musical noise. The weighting coefficients of the post filter may be adjusted according to whether the audio signal to be processed contains unvoiced speech and background noise, so that after post-filtering the speech parameters of the output audio signal are closer to those of the driver at the candidate sound source target position (for example, the pitch is not distorted), thereby enhancing the speech.

As shown in fig. 2, in an embodiment, the audio signal of one beam after the post-filtering process may obtain sounds of two different frequency bands, so that the audio signal after the post-filtering process may be subjected to speech synthesis, so as to perform subsequent processing such as wake-up recognition by using the synthesized audio signal.

Step S140, performing wake-up recognition, in parallel, on the audio signals at the plurality of candidate sound source target positions.

The principle of using a low frame rate (LFR) acoustic model for wake-up recognition in the embodiment of the present invention is described in detail below with reference to Figs. 3a and 3b. Fig. 3a shows a schematic diagram of wake-up recognition in the prior art; Fig. 3b shows a schematic diagram of wake-up recognition according to an embodiment of the invention.

As shown in Fig. 3a, in an existing intelligent voice control system, after an audio signal is collected it is divided into frames, wake-up recognition is performed on each frame of voice data in the audio signal, and the wake-up result for the audio signal is then determined from the recognition results of the individual frames.

For example, as shown in Fig. 3a, the framed audio signal includes 7 frames of voice data, that is, the 7 frames from time t-i-3 to time t-i+3. Wake-up recognition is performed on each of these 7 frames in the wake-up recognition stage, and the wake-up result for the audio signal is determined by combining the wake-up recognition results of the 7 frames.

The wake-up recognition stage of this process occupies a large amount of CPU resources. The CPU resources of an intelligent voice control system are often extremely limited, and those that can be allocated to the wake-up recognition stage are more limited still. As a result, existing intelligent voice control systems can only respond to the audio signal of a single user to execute corresponding functions, and they lack flexibility.

In the embodiment of the present invention, when a plurality of wake-up recognition components are started to perform wake-up recognition, a Low Frame Rate (LFR) acoustic model is used to perform wake-up recognition on the collected audio signals, which reduces the CPU resources and power consumption of the wake-up recognition stage.

In one embodiment, when performing wake-up recognition on the acquired audio signal by using the LFR acoustic model, the audio signal may be divided into a plurality of recognition units, wherein each recognition unit includes a plurality of consecutive frames.

In one example, the acquired audio signal is first divided into frames. One frame is then selected from every preset number of frames as target frame voice data, and the target frame voice data together with the multiple frames adjacent to it are used as a recognition unit for performing wake-up recognition on the target frame voice data. Adjacent recognition units may share some of the same voice data frames.

As shown in Fig. 3b, the framed audio signal may include N frames of voice data. The 7 frames of voice data from time t-i-3 to time t-i+3 are taken as an example below.

Specifically, when performing wake-up recognition on the audio signal, one frame in every 3 frames of voice data may be selected as target frame voice data. For example, among the 7 frames from time t-i-3 to time t-i+3, the voice data at time t-i-3, at time t-i, and at time t-i+3 may be selected as target frame voice data.

When performing wake-up recognition on the target frame at time t-i-3, the voice data at times t-i-6, t-i-5, t-i-4, t-i-3, t-i-2, t-i-1, and t-i are combined. Similarly, when performing wake-up recognition on the target frame at time t-i, the voice data at times t-i-3, t-i-2, t-i-1, t-i, t-i+1, t-i+2, and t-i+3 are combined.
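The selection of target frames and their surrounding recognition units can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `build_recognition_units`, the defaults `stride=3` and `context=3` (taken from the 7-frame example), and the clipping of windows at the signal boundaries are assumptions.

```python
from typing import List, Tuple


def build_recognition_units(
    num_frames: int, stride: int = 3, context: int = 3
) -> List[Tuple[int, List[int]]]:
    """Return (target_index, window_indices) pairs for a low-frame-rate pass.

    One target frame is chosen every `stride` frames. Its recognition unit is
    the target plus up to `context` neighbours on each side, clipped at the
    signal boundaries (the clipping is an assumption of this sketch).
    """
    units = []
    for target in range(0, num_frames, stride):
        lo = max(0, target - context)
        hi = min(num_frames - 1, target + context)
        units.append((target, list(range(lo, hi + 1))))
    return units


units = build_recognition_units(num_frames=10)
for target, window in units:
    print(target, window)
```

With 10 frames, only 4 recognition passes are needed instead of 10, and neighbouring units overlap (for instance, the units for targets 3 and 6 both contain frames 3 to 6).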

Compared with the prior art, in which every frame of voice data in the audio signal is recognized, the LFR-based wake-up recognition process shown in Fig. 3b significantly reduces the number (or frequency) of recognition passes and therefore has a lower statistical inference frequency, which reduces the CPU resources occupied by the wake-up recognition stage. Because fewer voice data frames are recognized, recognition efficiency also improves.

In addition, when performing wake-up recognition based on the LFR acoustic model, once the target frame voice data has been selected, it is recognized in units consisting of the target frame and its adjacent frames. The LFR-based wake-up recognition process therefore has a larger modeling unit and a lower statistical inference frequency, which can greatly reduce CPU occupation and power consumption and improve wake-up recognition efficiency while maintaining wake-up recognition accuracy.

In the embodiment of the present invention, when the mixed audio signal collected by the microphone array is processed, wake-up recognition refers to performing semantic recognition on the audio signal to identify the semantics it contains, and determining the final wake-up state of the intelligent voice control system according to the semantic recognition result. If the final wake-up state is a successful wake-up, the intelligent voice control system is switched from the dormant state to an active working state in which it can receive voice commands.

In one embodiment, the wake-up recognition may be wake-up recognition with a wake-up word or wake-up recognition without a wake-up word. The wake-up word is a password or command for activating the voice control system, and may be, for example, a predefined specific word, a specific sentence, or a specific signal.

As an example, in wake-up recognition with a wake-up word, the audio signal processing system performs semantic recognition on the received audio signal and determines from the semantic recognition result whether the audio signal contains the wake-up word. If it does, the voice control system is set to the working state.

As an example, in wake-up recognition without a wake-up word, the audio signal processing system performs semantic recognition on the received audio signal, searches a preset database for target data matching the semantic recognition result, and then controls the voice control system to execute the corresponding function according to the control instruction corresponding to the target data found.

In this embodiment, the voice control system may search for the target data matching the semantic recognition result in a database stored locally in the voice control system. Alternatively, the semantic recognition result may be uploaded to a cloud server or a cloud computing platform, and the matching target data searched for in the database of the cloud server or cloud computing platform. This application does not limit the choice.
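The lookup of a control instruction from a semantic recognition result might be sketched as below. The text leaves the choice between a local and a cloud database open, so the local-first, cloud-fallback order, the exact-match key normalization, the function name `find_control_instruction`, and the sample entries are all assumptions made for illustration; the cloud side is a plain stand-in function rather than a real service.

```python
from typing import Callable, Dict, Optional


def find_control_instruction(
    semantic_result: str,
    local_db: Dict[str, str],
    cloud_lookup: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return the control instruction matching a semantic recognition result.

    The local database is tried first; the cloud lookup (if provided) is used
    only when nothing matches locally. This ordering is an assumption.
    """
    key = semantic_result.strip().lower()
    instruction = local_db.get(key)
    if instruction is None and cloud_lookup is not None:
        instruction = cloud_lookup(key)
    return instruction


# Hypothetical sample data.
local_db = {"open the window": "WINDOW_OPEN"}
cloud_db = {"play some music": "MEDIA_PLAY"}

local_hit = find_control_instruction("Open the window", local_db, cloud_db.get)
cloud_hit = find_control_instruction("play some music", local_db, cloud_db.get)
no_hit = find_control_instruction("unknown request", local_db, cloud_db.get)
```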

The process of performing wake-up recognition on an audio signal by using a wake-up recognition component in an embodiment of the present invention is described below with reference to Fig. 4. Fig. 4 shows a schematic structural diagram of a wake-up recognition component according to an embodiment of the present invention.

In the embodiment of the present invention, in terms of the structure of a speech signal, a speech unit is a unit obtained by dividing speech at different granularities according to its pronunciation characteristics. A speech signal may comprise a plurality of consecutive speech units.

In one embodiment, the smallest speech unit in a speech signal may be referred to as a phoneme. As an example, when a syllable is analyzed by its articulatory actions, one articulatory action constitutes one phoneme. Phonemes may include, for example, vowels and consonants. Because consecutive speech units are co-articulated, speech recognition needs to consider the recognition states of a plurality of consecutive speech units in the audio signal.

As an example, a phoneme generally does not occur in isolation; several phonemes combine to form a pronunciation, so the semantic recognition process needs to consider the context of each speech unit.

The following describes a process of performing wake-up detection on a plurality of continuous speech units in an audio signal by using a wake-up recognition component in an embodiment of the present invention, taking a wake-up recognition component constructed based on a Context-Dependent Phone (CD-Phone) model as an example.

As shown in Fig. 4, a wake-up recognition component that performs recognition based on context-dependent phonemes may include a plurality of wake-up recognition units. Taking co-articulation between consecutive speech sounds into account, the component divides the audio signal in each beam into a plurality of consecutive phonemes, such as initials and finals, for recognition.

As an example, assume that the audio content of an audio signal is "Hello, Xiaoyun", pronounced "ni hao xiao yun", whose phonemes are n, i, h, ao, x, iao, y, and un. Taking co-articulation between consecutive speech sounds into account, the speech is split into a plurality of consecutive speech units, for example the 7 speech units "n+i", "n-i+h", "i-h+ao", "h-ao+x", "ao-x+iao", "x-iao+y", and "iao-y+un".

In this example, each speech unit may be represented in the form "l-c+r", where c represents the center primitive, l represents the left-related information, and r represents the right-related information.

As an example, the consecutive speech unit "h-ao+x" includes three phonemes: h, ao, and x. The phoneme ao is the center primitive, that is, the current phoneme; the phoneme h is the left-related information of ao, that is, the phoneme before the current phoneme; and the phoneme x is the right-related information of ao, that is, the phoneme after the current phoneme.
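The "l-c+r" labelling can be illustrated with a short sketch. The function name `to_context_units` and the optional `boundary` symbol are assumptions for illustration. Note that this sketch emits one unit per phoneme, so for the example phoneme sequence it also produces a final unit "y-un" in addition to the seven units listed above.

```python
from typing import List, Optional


def to_context_units(
    phonemes: List[str], boundary: Optional[str] = None
) -> List[str]:
    """Label each phoneme as "l-c+r" using its left and right neighbours.

    `boundary` (for example "sil") stands in for the missing neighbour at the
    edges; when it is None the missing side is simply omitted.
    """
    units = []
    for i, center in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else boundary
        right = phonemes[i + 1] if i < len(phonemes) - 1 else boundary
        label = center
        if left is not None:
            label = f"{left}-{label}"
        if right is not None:
            label = f"{label}+{right}"
        units.append(label)
    return units


phonemes = ["n", "i", "h", "ao", "x", "iao", "y", "un"]
units = to_context_units(phonemes)
padded_units = to_context_units(phonemes, boundary="sil")
```

Passing `boundary="sil"` marks silence at the utterance edges, so the first unit becomes "sil-n+i".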

In one embodiment, before wake-up recognition is performed on the audio signal, the method may further include filtering garbage words out of the audio signal.

In one embodiment, the audio signal received by the wake-up recognition component may contain silence, which may be a long pause or a short pause between two consecutive syllables. Silence detection may therefore be performed on the audio signal first and the silent frames removed, which eliminates the influence of silent frames on the recognition of the preceding and following frames, avoids unnecessary computation, and improves computational efficiency.

In one embodiment, a silence recognition unit may be used to perform silence detection on the audio signal, removing silent frames as well as short pauses between consecutive syllables, which improves the overall efficiency of speech recognition.
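The text does not specify how silence is detected. As one possible illustration, the sketch below drops frames whose mean energy falls below a fixed threshold; the energy criterion, the threshold value `1e-3`, and the function names are all assumptions rather than the method of the embodiment.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def drop_silent_frames(
    frames: List[Sequence[float]], threshold: float = 1e-3
) -> List[Sequence[float]]:
    """Keep only frames whose energy exceeds `threshold` (assumed criterion)."""
    return [f for f in frames if frame_energy(f) > threshold]


frames = [
    [0.0, 0.001, -0.001, 0.0],  # near-silent
    [0.4, -0.5, 0.3, -0.2],     # speech-like
    [0.0, 0.0, 0.0, 0.0],       # silent
    [0.1, 0.2, -0.1, -0.3],     # speech-like
]
voiced = drop_silent_frames(frames)
```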

As an example, when semantic recognition is performed after silence detection on an audio signal whose content is "Hello, Xiaoyun", the first consecutive speech unit corresponding to the wake-up word may be represented as "sil-n+i", where sil denotes the silent frames to be removed.

In the embodiment of the present invention, while each wake-up recognition unit recognizes the audio signal, the phonemes of a consecutive speech unit of the form "l-c+r" may correspond, in sequence, to a phoneme start stage, a phoneme stable stage, and a phoneme end stage. The recognition state of each phoneme may include an input state, an output state, a state of residing in the current stage, or a state of transitioning to the next stage. During wake-up recognition, a phoneme in a consecutive speech unit either resides in the current phoneme, that is, remains in that phoneme's state, or transitions to the state of the next phoneme.

In one embodiment, the recognition states of a plurality of consecutive speech units in the audio signal from each candidate sound source target position may be used to determine the probability that the audio signal is correctly recognized, and thereby the wake-up probability of the audio signal from each candidate sound source target position.

In the embodiment of the present invention, within the LFR-based wake-up recognition process, a wake-up recognition component that performs semantic recognition based on context-dependent phonemes can reduce the component's CPU occupation and power consumption, raise the detection rate of wake-up detection, and lower its false alarm rate.

In an embodiment of the present invention, each wake-up recognition component may include a plurality of wake-up recognition units, and each wake-up recognition unit may recognize a keyword in a speech signal. For a plurality of candidate sound source target positions, a plurality of wake-up recognition components may be started so that wake-up recognition is performed on all the candidate positions simultaneously, achieving wake-up detection in multiple sound source directions at the same time.

In step S250, the final wake-up state is determined according to the wake-up recognition result.

In this step, the wake-up result is determined to be a successful wake-up if the wake-up recognition result shows that the audio signal from one of the candidate sound source target positions contains the wake-up word, or if, during wake-up recognition, target data matching the semantic recognition result of the audio signal is found in the database.

In one embodiment, among the plurality of candidate sound source target positions, the position corresponding to an audio signal whose wake-up result is a successful wake-up may be taken as the effective sound source target position.

In an embodiment of the present invention, the final wake-up state for the audio signals from the plurality of candidate sound source target positions may be determined from the localization probabilities of the candidate positions and the wake-up probability of the audio signal from each candidate position.

In one embodiment, the candidate sound source target position with the highest localization probability may be selected as the effective sound source target position, and the final wake-up state of the voice control system may be determined according to the wake-up probability of the audio signal from the effective sound source target position. For example, if the wake-up probability of the audio signal from the effective sound source target position exceeds a predetermined probability threshold, the final wake-up state of the voice control system is determined to be a successful wake-up.
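The decision rule of this embodiment can be sketched in a few lines. The `Candidate` data class, the function name `decide_final_wake_state`, the sample probabilities, and the default threshold of 0.5 are hypothetical; the text only states that a predetermined probability threshold is used.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    position: str             # label of the candidate sound source position
    localization_prob: float  # probability that a source is at this position
    wake_prob: float          # wake-up probability of the audio from it


def decide_final_wake_state(
    candidates: List[Candidate], wake_threshold: float = 0.5
) -> Tuple[bool, Optional[Candidate]]:
    """Pick the best-localized candidate, then threshold its wake probability."""
    if not candidates:
        return False, None
    effective = max(candidates, key=lambda c: c.localization_prob)
    return effective.wake_prob > wake_threshold, effective


candidates = [
    Candidate("primary driving position", 0.7, 0.9),
    Candidate("secondary driving position", 0.3, 0.95),
]
awake, effective = decide_final_wake_state(candidates)
```

In this sample the primary driving position is chosen because it has the higher localization probability, even though the other candidate has a slightly higher wake-up probability.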

In one embodiment, among the audio signals from the plurality of candidate sound source target positions, those whose wake-up result is a successful wake-up may be sent to the voice control system, and the voice control system, once woken up, is controlled to execute the corresponding function according to the received audio signals.

According to the audio signal processing method provided by the embodiment of the present invention, directional speech enhancement can be performed on the audio signal in multiple directions by using the microphone array, so that multiple candidate sound source target positions are obtained. A plurality of wake-up recognition components then perform wake-up recognition on the audio signals from the candidate positions, and the effective sound source target position is determined from the wake-up recognition results and used to wake up the voice control system. Low-power simultaneous monitoring and wake-up of multi-channel audio signals is thereby realized, giving a better interaction between multiple people and the voice control system.

As described in the above embodiments, the audio signal processing method of the embodiments of the present invention may be applied to a speech environment containing a plurality of audio signals from spatially different directions. In one embodiment, the speech environment may be a vehicle-mounted voice control environment, a smart speaker environment, or a conference environment.

Fig. 5 is a schematic diagram illustrating an application scenario of the audio signal processing method in a vehicle-mounted voice control environment according to an embodiment of the present invention.

As shown in Fig. 5, the vehicle-mounted voice control environment includes a primary driver 51, a secondary driver 52, and an audio interaction device 53, and the audio interaction device 53 includes a microphone array.

As shown in Fig. 5, the microphone array in the audio interaction device 53 may capture, in real time, a mixed audio signal including the audio signal of the primary driver 51 and the audio signal of the secondary driver 52. The mixed audio signal may also include sound played by the car stereo, audio signals of rear-seat passengers, and environmental noise.

In one embodiment, the audio signal from the primary driver 51 and the audio signal from the secondary driver 52 may include voice signals.

When the microphone array captures the mixed audio signal, the primary driver 51 and the secondary driver 52 are in different orientations relative to the audio interaction device 53, so the audio signal of the primary driver 51 and the audio signal of the secondary driver 52 reach the microphone array from different directions. Because the position of the primary driver, the position of the secondary driver, and the position of the microphone array are relatively fixed, a plurality of beamforming directions can be set in advance according to the positions of the primary driver and the secondary driver relative to the microphone array. Beams are then formed in these directions from the mixed audio signal collected by the microphone array to select audio directionally, so that sound waves from these directions are enhanced and sound signals from other directions are suppressed.

As an example, when the primary driver 51, the secondary driver 52, and other passengers speak at the same time, beamforming is performed in the plurality of directions to obtain multiple beams for those directions. Specifically, the sound source position of each beam may be detected by direction-of-arrival estimation. The sound source position of a beam whose angle of arrival falls within a preset angle threshold range has a high probability of being a candidate sound source target position; as an example, the sound source position of a beam with an angle of arrival of 30 degrees or less is taken as a candidate sound source target position. Through these steps, multi-path directional speech enhancement of the mixed audio signal received by the microphone array is achieved, and the beams can be directed toward the candidate sound source directions, such as the primary driving position and the secondary driving position.
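The threshold test on the angle of arrival might look like the following sketch. The function name `select_candidate_positions`, the beam labels, and the sample angles are hypothetical, and the angles are assumed to have been produced already by a direction-of-arrival estimator that is not shown. Treating the threshold as a bound on the absolute angle is also an assumption.

```python
from typing import Dict, List


def select_candidate_positions(
    arrival_angles_deg: Dict[str, float], max_angle_deg: float = 30.0
) -> List[str]:
    """Return the beams whose angle of arrival is within the threshold."""
    return [
        beam
        for beam, angle in arrival_angles_deg.items()
        if abs(angle) <= max_angle_deg
    ]


# Hypothetical per-beam angles of arrival, in degrees.
arrival_angles_deg = {
    "primary driver beam": 12.0,
    "secondary driver beam": -25.0,
    "rear seat beam": 48.0,
}
candidates = select_candidate_positions(arrival_angles_deg)
```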

As an example, during wake-up recognition based on an LFR acoustic model, a plurality of wake-up recognition components that perform semantic recognition using context-dependent phonemes may be started simultaneously to perform wake-up detection on the sound source target positions. Whether each audio signal from a candidate sound source target position results in a wake-up is determined according to the wake-up recognition result.

As an example, when the plurality of wake-up recognition components in the audio interaction device 53 detect that the audio signal from the effective sound source target position, for example the audio signal of the primary driver or the secondary driver, contains a wake-up word or a voice control command, the voice control system is set to the working state, and voice control is performed according to the user's audio signal within a subsequent preset duration. The preset duration may be set according to an empirical value, for example 30 seconds.

In one embodiment, the plurality of wake-up recognition components in the audio interaction device 53 may be connected to the voice control system, and when the semantic recognition result of one of the wake-up recognition components is detected to include a wake-up word, that semantic recognition result is sent to the voice control system.

As an example, when the candidate sound source target positions include the primary driving position and the secondary driving position, semantic recognition is performed in parallel on the audio signal of the primary driver 51 at the primary driving position and the audio signal of the secondary driver 52 at the secondary driving position. The plurality of wake-up recognition components in the audio interaction device 53 may then check the semantic recognition result of the audio signal of the primary driver 51 and that of the secondary driver 52 in parallel.

For example, if the wake-up result of the audio signal from the primary driver 51 is detected to be a successful wake-up, the audio signal from the primary driver 51 is sent to the voice control system, which is then controlled by the voice of the primary driver 51. If the wake-up result of the audio signal from the secondary driver 52 is detected to be a successful wake-up, the audio signal from the secondary driver 52 is sent to the voice control system, which is then controlled by the voice of the secondary driver 52.

According to the audio signal processing method provided by the embodiment of the present invention, directional speech enhancement can be performed on multi-channel audio signals from multiple spatial directions to obtain multiple candidate sound source target positions. For these candidate positions, a plurality of wake-up recognition components are started to perform wake-up recognition in parallel, and the wake-up result is determined from the wake-up recognition results. Low-power simultaneous monitoring and wake-up of multi-channel audio signals is thereby realized, giving a better interaction between multiple people and the voice control system.

The audio signal processing method provided by the embodiment of the present application is described above with reference to a vehicle-mounted environment. The embodiment of the present application can also be used in other smart devices that include a voice control system. Such smart devices may include, but are not limited to, smart speakers, smart TVs, and automatic vending machines.

Fig. 6 shows a schematic diagram of an application scenario of the audio signal processing method in a smart home environment according to an embodiment of the present application. As shown in Fig. 6, the smart home environment includes a smart speaker 60, a user 61, and a user 62. The smart speaker 60 includes a microphone array for collecting audio signals, a wake-up recognition system, and a voice control system.

In use, both the user 61 and the user 62 are within the recognition range of the smart speaker 60 and send control commands to the smart speaker 60 by voice.

In one embodiment, the microphone array in the smart speaker 60 captures a mixed audio signal containing the audio signal of the user 61 and the audio signal of the user 62, which may include speech signals. Beamforming is performed in a plurality of directions using the microphone array to obtain multiple beams for those directions. Sound source target detection is performed on the audio signals in the multiple beams to determine a plurality of candidate sound source target positions, which may include at least one of the azimuth information of the user 61 and the azimuth information of the user 62. Wake-up detection is then performed on the audio signals from the plurality of candidate sound source target positions using a plurality of wake-up recognition components.

In one embodiment, the wake-up recognition system in the smart speaker 60 starts the wake-up engine for the candidate multiple sound source target positions after receiving the audio signals from the candidate multiple sound source target positions, and detects the wake-up recognition result of the audio signal of each candidate sound source target position in parallel. For example, the wake-up engine 1 and the wake-up engine 2 are started, the wake-up engine 1 and the wake-up engine 2 run in parallel, the wake-up engine 1 detects whether a wake-up word is included in the wake-up recognition result of the audio signal of the user 61, and the wake-up engine 2 detects whether a wake-up word or a voice control command is included in the wake-up recognition result of the audio signal from the user 62.
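Running two wake-up engines in parallel can be illustrated with Python's standard `concurrent.futures` module. This is only a schematic sketch: a real wake-up engine operates on audio, whereas here each engine is a stand-in that checks a text transcript for a hypothetical wake-up word, and the names `WAKE_WORD`, `wake_engine`, and `run_wake_engines` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

WAKE_WORD = "ni hao xiao yun"  # hypothetical wake-up word for this sketch


def wake_engine(transcript: str) -> bool:
    """Stand-in for a wake-up engine: report whether the wake word appears."""
    return WAKE_WORD in transcript.lower()


def run_wake_engines(transcripts: Dict[str, str]) -> Dict[str, bool]:
    """Start one engine per candidate source and collect the results."""
    workers = max(1, len(transcripts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            source: pool.submit(wake_engine, text)
            for source, text in transcripts.items()
        }
        return {source: fut.result() for source, fut in futures.items()}


results = run_wake_engines(
    {
        "user 61": "Ni hao xiao yun, play some music",
        "user 62": "what time is it",
    }
)
```

Each candidate source gets its own worker, so a slow recognition pass on one beam does not delay detection on another.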

If the wake-up engine 1 detects that the wake-up recognition result of the audio signal from the user 61 includes a wake-up word, it is determined that the voice control system is woken up by the audio signal from the user 61. If the wake-up engine 2 detects that the semantic recognition result of the audio signal from the user 62 includes a wake-up word, it is determined that the voice control system is woken up by the audio signal from the user 62.

After the voice control system in the smart speaker 60 is awakened, the matched target data can be searched in the database according to the voice recognition result of the received audio signal, and then the smart speaker 60 is controlled to execute the corresponding function according to the control instruction corresponding to the searched target data.

When searching for target data matching the semantic recognition result, the voice control system may search a database stored locally in the smart speaker 60. Alternatively, the semantic recognition result may be uploaded to a cloud server or a cloud computing platform, and the matching target data searched for in the database of the cloud server or cloud computing platform. This application does not limit the choice.

Fig. 7 is a schematic diagram illustrating an application scenario of a conference environment of an audio signal processing method according to an embodiment of the present application. As shown in fig. 7, in a conference environment, an audio interaction device 70 and conference participants participating in the conference, such as a speaker 71, a speaker 72, a speaker 74, a speaker 75, and a speaker 76, are included. The audio interaction device 70 includes a microphone array for collecting audio signals, a wake-up recognition system, and a voice control system.

In this application scenario, the conference participants are all within the recognition range of the audio interaction device 70 and send control commands to the audio interaction device 70 by voice.

As shown in Fig. 7, the microphone array in the audio interaction device 70 may be used to capture audio signals from participants in different orientations. Because the positions of the conference participants, such as the speaker 71, the speaker 72, the speaker 74, the speaker 75, and the speaker 76, are relatively fixed, a plurality of beamforming directions can be set in advance according to the position of each participant relative to the microphone array. Beams are then formed in these directions from the mixed audio signal collected by the microphone array to select audio directionally, so that sound waves from these directions are enhanced and sound signals from other directions are suppressed, and a plurality of candidate sound source target positions are obtained.

In one embodiment, the audio interaction device 70 may use a plurality of wake-up recognition components to perform wake-up detection on the audio signals from the plurality of candidate sound source target positions. If the wake-up result of one of the wake-up recognition components is a successful wake-up, which indicates that the voice control system is woken up by the audio signal from the corresponding conference participant, the voice control system is set to the working state.

After the voice control system in the audio interaction device 70 is woken up, matching target data can be searched for in the database according to the speech recognition result of the received audio signal, and the voice control system is then controlled to execute the corresponding function according to the control instruction corresponding to the target data found.

Fig. 8 is a flowchart illustrating an audio signal processing method according to another embodiment of the present invention. As shown in Fig. 8, the audio signal processing method 200 in the embodiment of the present invention may include the following steps:

Step S210, processing the audio signal received by the microphone array to obtain multiple beams in multiple directions.

Step S220, performing sound source target detection on the audio signals in the multiple beams, and determining a plurality of candidate sound source target positions.

Step S230, performing wake-up recognition in parallel on the audio signals from the plurality of candidate sound source target positions by using a plurality of wake-up recognition components.

Step S240, determining a final wake-up state according to the wake-up recognition result.

According to the processing method provided by the embodiment of the present invention, low-power simultaneous monitoring and wake-up of the multi-channel audio signals acquired by the microphone array can be realized, processing efficiency is improved, and a better multi-person voice interaction is achieved.

In an embodiment, step S210 may specifically include:

Step S211, performing echo cancellation processing on the received audio signal.

Step S212, converting the echo-cancelled audio signal from the time domain to the frequency domain.

Step S213, performing beamforming processing on the frequency-domain audio signal in multiple directions to obtain multiple beams for the multiple directions.
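The text does not name a specific beamforming algorithm. As an illustration of step S213 only, the sketch below applies delay-and-sum beamforming to a single frequency bin of a uniform linear array; echo cancellation and the time-to-frequency transform are not shown. The array geometry (4 microphones, 5 cm spacing), the 1 kHz bin, the speed of sound, and all function names are assumptions.

```python
import cmath
import math
from typing import List

SPEED_OF_SOUND = 343.0  # metres per second, assumed


def steering_vector(
    num_mics: int, spacing_m: float, freq_hz: float, angle_deg: float
) -> List[complex]:
    """Per-microphone phase of a plane wave arriving from `angle_deg`."""
    delay = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return [
        cmath.exp(-2j * math.pi * freq_hz * m * delay) for m in range(num_mics)
    ]


def beam_output(
    bin_values: List[complex], spacing_m: float, freq_hz: float, look_deg: float
) -> complex:
    """Delay-and-sum output for one frequency bin and one look direction."""
    weights = steering_vector(len(bin_values), spacing_m, freq_hz, look_deg)
    total = sum(x * w.conjugate() for x, w in zip(bin_values, weights))
    return total / len(bin_values)


# Simulated bin values for a unit-amplitude source at 40 degrees.
observed = steering_vector(4, 0.05, 1000.0, 40.0)
look_angles = [-40.0, 0.0, 40.0]
powers = {a: abs(beam_output(observed, 0.05, 1000.0, a)) ** 2 for a in look_angles}
strongest = max(powers, key=powers.get)
```

The beam steered toward the true source direction gives the largest output power, while the beams steered elsewhere attenuate the same signal.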

In an embodiment, step S210 may further include:

Step S214, performing noise suppression processing on the multiple beams.

Step S215, performing noise reduction processing on the noise-suppressed multiple beams by using a post filter to obtain signal-enhanced multiple beams.

In this embodiment, noise and interference are filtered out of the audio signals of the multiple beams by the noise suppression processing so as to extract the useful audio signal. Residual musical noise can be further filtered out by the post filter, yielding a high-quality audio signal with high fidelity.

In an embodiment, step S220 may specifically include:

Step S221, performing sound source target detection on the audio signal in each beam by direction-of-arrival estimation to obtain the angle of arrival of the audio signal in each beam.

Step S222, taking the direction of a beam whose audio signal has an angle of arrival within the angle-of-arrival threshold range as a candidate sound source target position.

In this embodiment, the direction of arrival estimation is performed using the array signals, the sound source position of each beam is detected, and the candidate sound source target position is determined.

In an embodiment, step S230 may specifically include:

the audio signal of each sound source target position is divided into a plurality of identification units for identification of audio signals from a plurality of sound source target positions candidate using a plurality of wake-up identification components, wherein each identification unit comprises a plurality of consecutive frames.

Specifically, the step of dividing the audio signal of each sound source target position into a plurality of identification units for identification may specifically include:

Step S231, respectively performing frame division processing on the audio signals of each candidate sound source target position to obtain multi-frame audio data

step S232, in the multi-frame audio data, one frame is selected from every preset number of frames of audio data as target frame audio data.

In step S233, the audio signal of each sound source target position candidate is identified by using the plurality of frames of audio data adjacent to the target frame of audio data and the target frame of audio data as the identification unit.

In the embodiment, based on the low frame rate acoustic model, the context-dependent phoneme is used for wake-up modeling, and by using the context-dependent wake-up model, the false wake-up of the system is reduced, the recall rate is improved, and therefore the user experience is improved.
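Steps S231 to S233 can be sketched as follows. The frame length, hop size, skip interval, and the number of neighbouring frames are not given in the embodiment and are hypothetical parameters here; repeating the edge frame at the start and end of the signal is likewise an assumption.

```python
def split_frames(samples, frame_len, hop):
    """Step S231: cut a signal into overlapping frames of frame_len samples."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def recognition_units(frames, skip=3, left=1, right=1):
    """Steps S232 and S233: take one target frame out of every `skip` frames
    and group it with `left` preceding and `right` following frames.
    Indices outside the signal are clamped to the first or last frame."""
    units = []
    last = len(frames) - 1
    for target in range(0, len(frames), skip):
        indices = [min(max(i, 0), last) for i in range(target - left, target + right + 1)]
        units.append([frames[i] for i in indices])
    return units
```

Because only one unit is emitted per `skip` frames, the recognizer runs at a fraction of the original frame rate while each unit still carries the context of its neighbouring frames.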

In an embodiment, step S240 may specifically include:

Step S241: determining localization probabilities of the plurality of sound source target positions according to the beam directions of the multiple beams.

Step S242: determining wake-up probabilities of the audio signals from the plurality of candidate sound source target positions using the recognition states of a plurality of consecutive speech units in the audio signal of each candidate sound source target position.

Step S243: determining the final wake-up state of the audio signals from the plurality of candidate sound source target positions from the localization probabilities and the wake-up probabilities.
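The embodiment does not state how the recognition states are turned into a wake-up probability or how the two probabilities are combined. The sketch below assumes, purely for illustration, that the wake-up probability is the geometric mean of the per-unit posteriors and that a position is woken when the product of its localization and wake-up probabilities reaches a threshold; the function names and the 0.5 threshold are hypothetical.

```python
import math


def wake_probability(unit_posteriors):
    """Step S242 (assumed form): geometric mean of the posteriors of the
    consecutive speech units that make up the wake-up word."""
    if not unit_posteriors or min(unit_posteriors) <= 0.0:
        return 0.0
    log_sum = sum(math.log(p) for p in unit_posteriors)
    return math.exp(log_sum / len(unit_posteriors))


def final_wake_state(localization_probs, wake_probs, threshold=0.5):
    """Step S243 (assumed form): a position is woken when the product of its
    localization probability and wake-up probability reaches the threshold."""
    return {
        position: localization_probs[position] * wake_probs[position] >= threshold
        for position in wake_probs
    }
```

With localization probabilities of 0.8 and 0.9 and wake-up probabilities of 0.9 and 0.3 for two positions, only the first position would be reported as woken under these assumptions.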

In one embodiment, step S240 may further include:

Determining a valid sound source target position among the plurality of candidate sound source target positions according to the wake-up recognition result.

In one embodiment, the audio signal processing method 200 may further include:

Step S240: sending, from among the audio signals of the plurality of candidate sound source target positions, the audio signals whose wake-up result is successful to the voice control system.

Step S250: controlling the voice control system, once woken, to execute the corresponding function according to the received audio signal.

In this embodiment, the voice control system performs wake-up detection according to the wake-up recognition result, which enables low-power simultaneous monitoring and wake-up on multiple channels and a better multi-user interaction experience.

An audio signal processing apparatus according to an embodiment of the present invention will be described in detail below with reference to the accompanying drawings.

Fig. 9 is a schematic structural diagram of an audio signal processing apparatus according to an embodiment of the present invention. As shown in fig. 9, the audio signal processing apparatus 600 includes:

A multi-path beamforming module 610, configured to process the audio signals received by the microphone array to obtain multiple beams for multiple directions.

A sound source target detection module 620, configured to perform sound source target detection on the audio signals in the multiple beams and determine a plurality of candidate sound source target positions.

A parallel wake-up recognition module 630, configured to perform wake-up recognition in parallel on the audio signals from the plurality of candidate sound source target positions using a plurality of wake-up recognition components.

A wake-up result determining module 640, configured to determine the final wake-up state according to the wake-up recognition result.
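The role of the parallel wake-up recognition module 630 can be sketched with one worker per candidate position. The embodiment does not say how the wake-up recognition components are scheduled; a thread pool is only one assumed option, and `toy_wake_recognizer` is a hypothetical placeholder for a real wake-up model, not the recognizer of the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_wake_recognizer(audio):
    """Placeholder for one wake-up recognition component: returns the mean
    absolute amplitude of the signal as a stand-in score."""
    return sum(abs(x) for x in audio) / len(audio) if audio else 0.0


def recognize_in_parallel(signals_by_position, recognizer=toy_wake_recognizer):
    """Run one recognizer per candidate sound source target position
    concurrently and return {position: score}."""
    positions = list(signals_by_position)
    with ThreadPoolExecutor(max_workers=max(1, len(positions))) as pool:
        scores = list(pool.map(recognizer, (signals_by_position[p] for p in positions)))
    return dict(zip(positions, scores))
```

The resulting per-position scores correspond to the wake-up recognition result that the wake-up result determining module 640 consumes.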

In one embodiment, the multi-path beamforming module 610 may specifically include:

An echo cancellation unit, configured to perform echo cancellation processing on the received audio signal.

A signal conversion unit, configured to convert the echo-cancelled audio signal from the time domain to the frequency domain.

The multi-path beamforming module 610 is further configured to perform beamforming processing on the frequency-domain audio signal in the multiple directions to obtain multiple beams for the multiple directions.

In one embodiment, the multi-path beamforming module 610 may further include:

A noise suppression unit, configured to perform noise suppression processing on the multiple beams.

A post-filter unit, configured to perform noise reduction processing on the noise-suppressed beams using a post-filter to obtain signal-enhanced beams.

In one embodiment, the sound source target detection module 620 may specifically include:

A direction-of-arrival estimation unit, configured to perform sound source target detection on the audio signal in each beam by direction-of-arrival estimation to obtain the angle of arrival of the audio signal in each beam; and

A beam direction determining unit, configured to take the direction of each beam whose audio signal has an angle of arrival within the angle-of-arrival threshold range as a candidate sound source target position.

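In an embodiment, the parallel wake-up recognition module 630 may be specifically configured to:

Divide the audio signal of each sound source target position into a plurality of recognition units, and perform recognition on the audio signals from the plurality of candidate sound source target positions using the plurality of wake-up recognition components, wherein each recognition unit comprises a plurality of consecutive frames.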

In one embodiment, the parallel wake-up recognition module 630 may include:

A signal framing unit, configured to perform framing on the audio signal of each candidate sound source target position to obtain multi-frame audio data.

A target frame selecting unit, configured to select, from the multi-frame audio data, one frame out of every preset number of frames as target frame audio data.

A signal recognition unit, configured to recognize the audio signal of each candidate sound source target position using, as a recognition unit, the target frame audio data together with the frames of audio data adjacent to it.

In an embodiment, the wake-up result determining module 640 may specifically include:

A localization probability determining unit, configured to determine localization probabilities of the plurality of sound source target positions according to the beam directions of the multiple beams.

A wake-up probability determining unit, configured to determine wake-up probabilities of the audio signals from the plurality of candidate sound source target positions using the recognition states of a plurality of consecutive speech units in the audio signal of each candidate sound source target position.

The wake-up result determining module 640 is further configured to determine the final wake-up state of the audio signals from the plurality of candidate sound source target positions according to the localization probabilities and the wake-up probabilities.

In one embodiment, the audio signal processing apparatus 600 may further include:

A sound source localization module, configured to determine a valid sound source target position among the plurality of candidate sound source target positions according to the wake-up recognition result.

A signal sending module, configured to send, from among the audio signals of the plurality of candidate sound source target positions, the audio signals whose wake-up result is successful to the voice control system.

A voice control module, configured to control the voice control system, once woken, to execute the corresponding function according to the received audio signal.

Other details of the audio signal processing apparatus according to the embodiment of the invention are similar to those of the audio signal processing method described above with reference to fig. 1 to 9 and are not repeated here.

Fig. 10 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing an audio signal processing method and apparatus according to an embodiment of the present invention.

As shown in fig. 10, the computing device 700 includes an input device 701, an input interface 702, a central processor 703, a memory 704, an output interface 705, and an output device 706. The input interface 702, the central processor 703, the memory 704, and the output interface 705 are connected to one another via a bus 710; the input device 701 and the output device 706 are connected to the bus 710 via the input interface 702 and the output interface 705, respectively, and thereby to the other components of the computing device 700. Specifically, the input device 701 receives input information from outside (e.g., from a microphone array) and transmits it to the central processor 703 through the input interface 702; the central processor 703 processes the input information based on computer-executable instructions stored in the memory 704 to generate output information, stores the output information temporarily or permanently in the memory 704, and then transmits it to the output device 706 through the output interface 705; the output device 706 outputs the output information outside the computing device 700 for use by a user.

That is, the computing device shown in fig. 10 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing computer executable instructions, may implement the audio signal processing method and apparatus described in connection with fig. 1-9. Here, the processor may communicate with the voice control system to execute computer-executable instructions based on relevant information from the voice control system and/or the microphone array to implement the audio signal processing methods and apparatus described in connection with fig. 1-9.

In one embodiment, the computing device 700 shown in fig. 10 may be implemented to include a memory 704 and a processor 703. The memory 704 is used for storing executable program code; the processor 703 is used for reading the executable program code stored in the memory to perform the audio signal processing method of the above-described embodiments.

Fig. 11 shows a schematic structural diagram of an audio interaction device according to an embodiment of the present invention. As shown in fig. 11, in one embodiment, an audio interaction device 1100 comprises a microphone array 1101 and a processor 1102.

The microphone array 1101 is configured to collect a mixed audio signal, the mixed audio signal comprising an audio signal from the driver's seat and an audio signal from the front passenger seat.

The processor 1102 is communicatively connected to the microphone array 1101 and is configured to: process the audio signals received by the microphone array to obtain multiple beams for multiple directions; perform sound source target detection on the audio signals in the multiple beams and determine a plurality of candidate sound source target positions; perform wake-up recognition in parallel on the audio signals from the plurality of candidate sound source target positions using a plurality of wake-up recognition components; and determine the final wake-up state according to the wake-up recognition result.

Fig. 12 shows a schematic structural diagram of an audio interaction device according to another embodiment of the present invention. As shown in fig. 12, in one embodiment, the audio interaction device 1200 may include: a microphone array 1201 and a processor 1202.

The microphone array 1201 is configured to collect a mixed audio signal, the mixed audio signal comprising the audio signals of a plurality of conference participants.

The processor 1202 is communicatively connected to the microphone array 1201 and is configured to: process the audio signals received by the microphone array to obtain multiple beams for multiple directions; perform sound source target detection on the audio signals in the multiple beams and determine a plurality of candidate sound source target positions; perform wake-up recognition in parallel on the audio signals from the plurality of candidate sound source target positions using a plurality of wake-up recognition components; and determine the final wake-up state according to the wake-up recognition result.

The audio interaction device of the embodiment of the invention can monitor and wake up on multiple channels of audio signals simultaneously, thereby improving processing efficiency and recognition efficiency.

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product or a computer-readable storage medium. The computer program product or computer-readable storage medium includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)), among others.

It is to be understood that the invention is not limited to the specific arrangements and processes described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications, and additions, or change the order of the steps, after comprehending the spirit of the present invention.

The above are only specific embodiments of the present invention. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not described here again. It should be understood that the scope of the present invention is not limited thereto; any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall fall within the scope of the present invention.

Claims

1. An audio signal processing method, comprising:

processing audio signals received by a microphone array to obtain multiple beams for multiple directions;

performing sound source target detection on the audio signals in the multiple beams, and determining a plurality of candidate sound source target positions;

performing wake-up recognition in parallel on the audio signals from the plurality of candidate sound source target positions using a plurality of wake-up recognition components; and

determining a final wake-up state according to a result of the wake-up recognition.

2. The audio signal processing method of claim 1, wherein the processing audio signals received by a microphone array to obtain multiple beams for multiple directions comprises:

performing echo cancellation processing on the received audio signal;

converting the echo-cancelled audio signal from a time domain to a frequency domain; and

performing beamforming processing on the frequency-domain audio signal in the multiple directions to obtain the multiple beams for the multiple directions.

3. The audio signal processing method of claim 2, further comprising:

performing noise suppression processing on the multiple beams; and

performing noise reduction processing on the noise-suppressed beams using a post-filter to obtain signal-enhanced beams.

4. The audio signal processing method of claim 1, wherein the performing sound source target detection on the audio signals in the multiple beams and determining a plurality of candidate sound source target positions comprises:

performing sound source target detection on the audio signal in each beam by direction-of-arrival estimation to obtain an angle of arrival of the audio signal in each beam; and

taking the direction of each beam whose audio signal has an angle of arrival within an angle-of-arrival threshold range as a candidate sound source target position.

5. The audio signal processing method of claim 1, wherein the performing wake-up recognition in parallel on the audio signals from the plurality of candidate sound source target positions using a plurality of wake-up recognition components comprises:

dividing the audio signal of each sound source target position into a plurality of recognition units, and performing recognition on the audio signals from the plurality of candidate sound source target positions using the plurality of wake-up recognition components, wherein each recognition unit comprises a plurality of consecutive frames.

6. The audio signal processing method according to claim 5, wherein the dividing the audio signal of each sound source target position into a plurality of recognition units for recognition comprises:

performing framing on the audio signal of each candidate sound source target position to obtain multi-frame audio data;

selecting, from the multi-frame audio data, one frame out of every preset number of frames as target frame audio data; and

recognizing the audio signal of each candidate sound source target position using, as a recognition unit, the target frame audio data together with the frames of audio data adjacent to it.

7. The audio signal processing method according to claim 1, wherein the determining a final wake-up state according to a result of the wake-up recognition comprises:

determining localization probabilities of the plurality of sound source target positions according to the beam directions of the multiple beams;

determining wake-up probabilities of the audio signals from the plurality of candidate sound source target positions using the recognition states of a plurality of consecutive speech units in the audio signal of each candidate sound source target position; and

determining the final wake-up state of the audio signals from the plurality of candidate sound source target positions from the localization probabilities and the wake-up probabilities.

8. The audio signal processing method of claim 1, wherein the determining a final wake state according to the result of the wake recognition further comprises:

9. The audio signal processing method of claim 1, further comprising:

sending, from among the audio signals of the plurality of candidate sound source target positions, the audio signals whose wake-up result is successful to a voice control system; and

controlling the voice control system, once woken, to execute a corresponding function according to the received audio signal.

10. An audio signal processing apparatus, comprising:

a multi-path beamforming module, configured to process audio signals received by a microphone array to obtain multiple beams for multiple directions;

a sound source target detection module, configured to perform sound source target detection on the audio signals in the multiple beams and determine a plurality of candidate sound source target positions;

a parallel wake-up recognition module, configured to perform wake-up recognition in parallel on the audio signals from the plurality of candidate sound source target positions using a plurality of wake-up recognition components; and

a wake-up result determining module, configured to determine a final wake-up state according to a result of the wake-up recognition.

11. The audio signal processing apparatus of claim 10, wherein the multi-path beamforming module comprises:

an echo cancellation unit, configured to perform echo cancellation processing on the received audio signal; and

a signal conversion unit, configured to convert the echo-cancelled audio signal from a time domain to a frequency domain;

wherein the multi-path beamforming module is further configured to perform beamforming processing on the frequency-domain audio signal in the multiple directions to obtain the multiple beams for the multiple directions.

12. The audio signal processing apparatus of claim 11, wherein the multi-path beamforming module further comprises:

a noise suppression unit, configured to perform noise suppression processing on the multiple beams; and

a post-filter unit, configured to perform noise reduction processing on the noise-suppressed beams using a post-filter to obtain signal-enhanced beams.

13. The audio signal processing apparatus of claim 10, wherein the sound source target detection module comprises:

a direction-of-arrival estimation unit, configured to perform sound source target detection on the audio signal in each beam by direction-of-arrival estimation to obtain an angle of arrival of the audio signal in each beam; and

a beam direction determining unit, configured to take the direction of each beam whose audio signal has an angle of arrival within an angle-of-arrival threshold range as a candidate sound source target position.

14. The audio signal processing device according to claim 10, wherein the parallel wake-up identification module is specifically configured to:

15. The audio signal processing apparatus of claim 14, wherein the parallel wake-up recognition module comprises:

a signal framing unit, configured to perform framing on the audio signal of each candidate sound source target position to obtain multi-frame audio data;

a target frame selecting unit, configured to select, from the multi-frame audio data, one frame out of every preset number of frames as target frame audio data;

16. The audio signal processing apparatus of claim 10, wherein the wake-up result determining module comprises:

a localization probability determining unit, configured to determine localization probabilities of the plurality of sound source target positions according to the beam directions of the multiple beams; and

a wake-up probability determining unit, configured to determine wake-up probabilities of the audio signals from the plurality of candidate sound source target positions using the recognition states of a plurality of consecutive speech units in the audio signal of each candidate sound source target position;

wherein the wake-up result determining module is further configured to determine the final wake-up state of the audio signals from the plurality of candidate sound source target positions according to the localization probabilities and the wake-up probabilities.

17. The audio signal processing apparatus of claim 10, further comprising:

18. The audio signal processing apparatus of claim 10, further comprising:

a signal sending module, configured to send, from among the audio signals of the plurality of candidate sound source target positions, the audio signals whose wake-up result is successful to a voice control system;

19. An audio signal processing system, comprising a memory and a processor, wherein:

the memory is configured to store executable program code; and

the processor is configured to read the executable program code stored in the memory to perform the audio signal processing method of any one of claims 1 to 9.

20. A computer-readable storage medium, wherein the computer-readable storage medium comprises instructions which, when run on a computer, cause the computer to perform the audio signal processing method of any one of claims 1-9.

21. An audio interaction device, comprising a microphone array and a processor, wherein:

the microphone array is configured to collect a mixed audio signal, the mixed audio signal comprising an audio signal from the driver's seat and an audio signal from the front passenger seat; and

the processor is communicatively connected to the microphone array and is configured to: process the audio signals received by the microphone array to obtain multiple beams for multiple directions; perform sound source target detection on the audio signals in the multiple beams and determine a plurality of candidate sound source target positions; perform wake-up recognition in parallel on the audio signals from the plurality of candidate sound source target positions using a plurality of wake-up recognition components; and determine a final wake-up state according to a result of the wake-up recognition.

22. An audio interaction device, comprising a microphone array and a processor, wherein:

The microphone array is used for acquiring mixed audio signals, and the mixed audio signals comprise audio signals of a plurality of conference participants;