CN113257269A - Beam forming method based on deep learning and storage device

Beam forming method based on deep learning and storage device

Info

Publication number
CN113257269A
CN113257269A
Authority
CN
China
Prior art keywords
voice
deep learning
noise
energy detection
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110431846.0A
Other languages
Chinese (zh)
Inventor
李茂发
江正梁
陈时钦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rockchip Electronics Co Ltd
Original Assignee
Rockchip Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rockchip Electronics Co Ltd filed Critical Rockchip Electronics Co Ltd
Priority to CN202110431846.0A priority Critical patent/CN113257269A/en
Publication of CN113257269A publication Critical patent/CN113257269A/en
Pending legal-status Critical Current

Classifications

    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention relates to the field of beam processing technologies, and in particular to a beam forming method and a storage device based on deep learning. The method comprises the following steps: the acquired voice data is processed with a deep learning technique to separate human voice from non-voice noise; compared with traditional adaptive beamforming algorithms, this recognizes and distinguishes voice and non-voice noise more accurately and intelligently. Signal energy detection is then performed in each identified voice direction, and the beam sizes are computed by weighted superposition according to the energy detection results, so that voice can be picked up from multiple directions simultaneously, meeting the requirement of picking up multiple speakers in a conference or any other scene.

Description

Beam forming method based on deep learning and storage device
Technical Field
The present invention relates to the field of beam processing technologies, and in particular, to a beam forming method and a storage device based on deep learning.
Background
In conventional adaptive microphone-array beamforming techniques, such as super-directive beamforming, scattered (diffuse) noise is minimized while the output in the direction of arrival is kept unchanged, thereby suppressing noise. However, such methods usually need to know the direction of arrival in advance, and correlated noise that resembles human voice often makes the direction-of-arrival estimate inaccurate, degrading the beam quality.
In an actual conference scene, multiple people often need to speak. With the existing adaptive microphone-array beamforming techniques, the directions of arrival cannot be known in advance, so noise cannot be removed well, the beam quality suffers, and the requirement of picking up multiple speakers in a conference or any other scene cannot be met.
Disclosure of Invention
Therefore, a beam forming method based on deep learning needs to be provided to solve the problems that existing adaptive microphone-array beamforming techniques remove non-voice noise poorly and cannot meet the pickup requirement of multiple simultaneous speakers. The specific technical scheme is as follows:
a beam forming method based on deep learning comprises the following steps:
processing the obtained voice data through a deep learning technology to obtain voice and non-voice noise;
and carrying out signal energy detection in the identified human voice direction, and carrying out weighted superposition calculation on the beam size according to an energy detection result.
Further, the "processing the acquired voice data through the deep learning technology to obtain the voice and the non-voice noise" specifically includes the following steps:
and calculating the voice existence probability of the acquired voice data through a preset algorithm, and obtaining the voice and the non-voice noise according to the calculation result of the voice existence probability.
Further, the "detecting signal energy in the identified voice direction, and performing weighted superposition calculation on the beam size according to the energy detection result" specifically includes the steps of:
calculating energy weighting coefficients for the output multiple beam directions;
and calculating a final beam weighting coefficient according to the voice existence probability and the energy weighting coefficient to obtain final beam output.
Further, the preset algorithm includes: a neural network trained through deep learning.
In order to solve the technical problem, the storage device is further provided, and the specific technical scheme is as follows:
a storage device having stored therein a set of instructions for performing:
processing the obtained voice data through a deep learning technology to obtain voice and non-voice noise;
and carrying out signal energy detection in the identified human voice direction, and carrying out weighted superposition calculation on the beam size according to an energy detection result.
Further, the set of instructions is further for performing:
the method comprises the following steps of processing the acquired voice data through a deep learning technology to obtain voice and non-voice noise, and specifically comprises the following steps:
and calculating the voice existence probability of the acquired voice data through a preset algorithm, and obtaining the voice and the non-voice noise according to the calculation result of the voice existence probability.
Further, the set of instructions is further for performing:
the method comprises the following steps of performing signal energy detection in the identified human voice direction, and performing weighted superposition calculation on the beam size according to an energy detection result, and specifically comprises the following steps:
calculating energy weighting coefficients for the output multiple beam directions;
and calculating a final beam weighting coefficient according to the voice existence probability and the energy weighting coefficient to obtain final beam output.
Further, the preset algorithm includes: a neural network trained through deep learning.
The invention has the beneficial effects that: the acquired voice data is processed with a deep learning technique to separate human voice from non-voice noise; compared with traditional adaptive beamforming algorithms, this recognizes and distinguishes voice and non-voice noise more accurately and intelligently. Signal energy detection is then performed in each identified voice direction, and the beam sizes are computed by weighted superposition according to the energy detection results, so that voice can be picked up from multiple directions simultaneously, meeting the requirement of picking up multiple speakers in a conference or any other scene.
Drawings
Fig. 1 is a flowchart illustrating a deep learning based beamforming method according to an embodiment;
FIG. 2 is a schematic diagram of a beam without deep learning processing according to an embodiment;
FIG. 3 is a schematic diagram of a beam processed by a deep learning technique to filter noise according to an embodiment;
FIG. 4 is a schematic diagram of the beams before the weighted superposition calculation according to an embodiment;
FIG. 5 is a schematic diagram of the beams after the weighted superposition calculation according to an embodiment;
fig. 6 is a block diagram of a storage device according to an embodiment.
Description of reference numerals:
600: storage device.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Referring to fig. 1 to 5, in the present embodiment, the deep learning based beamforming method can be applied to a storage device, including but not limited to: personal computers, servers, general-purpose computers, special-purpose computers, network devices, embedded devices, programmable devices, intelligent mobile terminals, etc. The specific implementation is as follows:
the technical idea of the present application is explained below for use in a conference:
when the application scene is a conference, the core technical idea of the application is as follows: because in the conference application scene, the voice is taken as the main voice, the beam forming should be preferentially pointed to the voice direction of the person, and meanwhile, the conference has the situation that when a plurality of persons speak in discussion, the beam cannot be a single beam. Therefore, the application mainly makes two improvements: one is to introduce deep learning techniques such as: training voice recognition of a person through a neural network, and enabling beam forming to recognize voice and non-voice noise; one is to detect the signal energy in the recognized voice direction, and to perform weighted superposition calculation on the beam size according to the strength of the voice signal, so as to pick up the voice in multiple directions.
It should be noted that, besides the conference scene, the core application scenario of the present application is any multi-person conversation, so it may also be an informal tea gathering, a book-club discussion, and the like, as long as multiple people converse in the scene.
The following detailed description is made with reference to fig. 1 to 5:
step S101: and processing the acquired voice data through a deep learning technology to obtain voice and non-voice noise.
Step S102: and carrying out signal energy detection in the identified human voice direction, and carrying out weighted superposition calculation on the beam size according to an energy detection result.
In the present embodiment, the processing can be performed with any array geometry, including but not limited to linear arrays and circular arrays. Steps S101 and S102 are explained below using a three-microphone (3-mic) circular array as an example:
suppose that θ is calculated1,θ2And theta3Beams in three directions, where the corresponding beam output is y1=ωbf1x,y2=ωbf2x and y3=ωbf2x。
Step S101 specifically further includes the steps of: and calculating the voice existence probability of the acquired voice data through a preset algorithm, and obtaining the voice and the non-voice noise according to the calculation result of the voice existence probability.
In this embodiment, a neural network trained through deep learning is taken as an example of the preset algorithm for determining the speech presence probability; the formula is:
ω_dnn1 = dnn_speech_probability_compute(ω1·x)
In this formula, ω1·x is the input speech (the neural network input) and ω_dnn1 is the probability output by the network.
dnn_speech_probability_compute denotes the whole network pipeline: audio input -> framing -> feature extraction -> neural network -> decoding -> decision -> output speech probability.
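That pipeline can be sketched as follows. The framing and log-energy feature are simplified stand-ins, and the single logistic unit is a hypothetical placeholder for the trained neural network, whose architecture the patent does not specify.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Framing step: split a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def extract_features(frames):
    """Toy feature: log energy per frame (a real system would use e.g. filterbanks)."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def dnn_speech_probability_compute(x):
    """audio input -> framing -> feature extraction -> 'network' -> speech probability."""
    feats = extract_features(frame_signal(x))
    logits = 0.8 * feats - 1.0             # hypothetical trained weight and bias
    probs = 1.0 / (1.0 + np.exp(-logits))  # per-frame speech probability
    return float(np.mean(probs))           # utterance-level omega_dnn

p = dnn_speech_probability_compute(np.sin(np.linspace(0.0, 100.0, 4000)))
```

A real implementation would replace the logistic unit with the trained network and add the decoding and decision steps the patent lists.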
The beam pattern without deep-learning processing is shown in fig. 2: the beam points at both the noise source and the speaker spk. The beam pattern after noise is filtered out by the deep-learning technique is shown in fig. 3: the beam points only at the speaker spk.
After denoising, executing step S102, wherein step S102 further includes:
calculating energy weighting coefficients for the output multiple beam directions;
and calculating a final beam weighting coefficient according to the voice existence probability and the energy weighting coefficient to obtain final beam output.
The description continues with the 3-mic circular array example mentioned above:
meanwhile, energy weighting coefficients are calculated for a plurality of output beam directions, and the calculation formula is as follows:
ωenergy1=energy_weight_compute(ω1x)
in this formula, ω1x is input speech, omegaenergy1Is the multi-beam speech segment energy ratio.
energy _ weight _ computer is a speech segment energy ratio calculation process.
The specific calculation process is as follows: 1. computing total energy y of multi-beam voice segmentenergy=ωbf1x+ωbf2*x+ωbf3X, 2, calculating the sub-beam energy fraction omegaenergy1=ωbf1x/yenergy
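A sketch of this energy-ratio step, interpreting each beam's speech-segment energy as the sum of its squared samples (an assumption; the patent writes the ratio directly in terms of the beam outputs):

```python
import numpy as np

def energy_weight_compute(beam_signals):
    """Per-beam energy fractions omega_energy_i = E_i / y_energy.

    beam_signals: (num_beams, num_samples) speech-segment samples of each
                  beam (the omega_bf_i * x outputs).
    Returns an array of fractions summing to 1.
    """
    energies = np.sum(np.asarray(beam_signals) ** 2, axis=1)  # E_i per beam
    y_energy = np.sum(energies)                               # total energy
    return energies / y_energy

# Hypothetical beam segments: energies 2, 4 and 1, so fractions 2/7, 4/7, 1/7.
w_energy = energy_weight_compute([[1.0, 1.0], [2.0, 0.0], [1.0, 0.0]])
```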
The final beam weighting coefficients are calculated from the speech presence probabilities and the energy weighting coefficients, giving the final beam output:
y = ω_dnn1·ω_energy1·ω_bf1·x + ω_dnn2·ω_energy2·ω_bf2·x + ω_dnn3·ω_energy3·ω_bf3·x.
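A sketch of this final weighted superposition; the inputs are hypothetical per-beam values of the kind produced in the previous steps.

```python
import numpy as np

def final_beam_output(omega_dnn, omega_energy, beam_signals):
    """y = sum_i omega_dnn_i * omega_energy_i * (omega_bf_i * x).

    omega_dnn:    per-beam speech presence probabilities.
    omega_energy: per-beam energy weighting coefficients.
    beam_signals: (num_beams, num_samples) beam outputs omega_bf_i * x.
    """
    p = np.asarray(omega_dnn)[:, None]     # broadcast weights over samples
    e = np.asarray(omega_energy)[:, None]
    return np.sum(p * e * np.asarray(beam_signals), axis=0)

# Hypothetical two-beam example: the confident, louder beam dominates.
y = final_beam_output([1.0, 0.5], [0.5, 0.5], [[2.0, 2.0], [4.0, 0.0]])
# 1.0*0.5*[2,2] + 0.5*0.5*[4,0] = [2.0, 1.0]
```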
Fig. 4 and fig. 5 show the beam effect after each directional beam is weighted in combination with the energy weighting method. FIG. 4 shows that, before weighting, the beams pointing at speaker spk1 and speaker spk2 have the same size; FIG. 5 shows the effect after weighting: because speaker spk1 is louder than speaker spk2, the beam pointing at spk1 is larger than the beam pointing at spk2.
The acquired voice data is processed with a deep learning technique to separate human voice from non-voice noise; compared with traditional adaptive beamforming algorithms, this recognizes and distinguishes voice and non-voice noise more accurately and intelligently. Signal energy detection is then performed in each identified voice direction, and the beam sizes are computed by weighted superposition according to the energy detection results, so that voice can be picked up from multiple directions simultaneously, meeting the requirement of picking up multiple speakers in a conference or any other scene.
Referring to fig. 2 to fig. 6, in the present embodiment, an embodiment of a memory device 600 is as follows:
a storage device 600 having stored therein a set of instructions for performing:
processing the obtained voice data through a deep learning technology to obtain voice and non-voice noise;
and carrying out signal energy detection in the identified human voice direction, and carrying out weighted superposition calculation on the beam size according to an energy detection result.
The scheme can be applied to any array geometry, including but not limited to linear arrays and circular arrays. In the present embodiment, the commands executed by the above instruction set are explained using such an array as an example:
Suppose the array computes beams in three directions θ1, θ2 and θ3, with corresponding beam outputs y1 = ω_bf1·x, y2 = ω_bf2·x and y3 = ω_bf3·x.
Further, the set of instructions is further for performing:
the method comprises the following steps of processing the acquired voice data through a deep learning technology to obtain voice and non-voice noise, and specifically comprises the following steps:
and calculating the voice existence probability of the acquired voice data through a preset algorithm, and obtaining the voice and the non-voice noise according to the calculation result of the voice existence probability.
In this embodiment, a neural network trained through deep learning is taken as an example of the preset algorithm for determining the speech presence probability; the formula is:
ω_dnn1 = dnn_speech_probability_compute(ω1·x)
In this formula, ω1·x is the input speech (the neural network input) and ω_dnn1 is the probability output by the network.
dnn_speech_probability_compute denotes the whole network pipeline: audio input -> framing -> feature extraction -> neural network -> decoding -> decision -> output speech probability.
The beam pattern without deep-learning processing is shown in fig. 2: the beam points at both the noise source and the speaker spk. The beam pattern after noise is filtered out by the deep-learning technique is shown in fig. 3: the beam points only at the speaker spk.
After denoising, the set of instructions is further configured to perform:
the method comprises the following steps of performing signal energy detection in the identified human voice direction, and performing weighted superposition calculation on the beam size according to an energy detection result, and specifically comprises the following steps:
calculating energy weighting coefficients for the output multiple beam directions;
and calculating a final beam weighting coefficient according to the voice existence probability and the energy weighting coefficient to obtain final beam output.
The description continues using the array mentioned above as an example:
Meanwhile, energy weighting coefficients are calculated for the multiple beam directions output by the array; the calculation formula is:
ω_energy1 = energy_weight_compute(ω1·x)
In this formula, ω1·x is the input speech and ω_energy1 is this beam's share of the multi-beam speech-segment energy.
energy_weight_compute is the speech-segment energy-ratio calculation. The specific process is: 1. compute the total energy of the multi-beam speech segment, y_energy = ω_bf1·x + ω_bf2·x + ω_bf3·x; 2. compute each sub-beam's energy fraction, ω_energy1 = ω_bf1·x / y_energy.
The final beam weighting coefficients are calculated from the speech presence probabilities and the energy weighting coefficients, giving the final beam output:
y = ω_dnn1·ω_energy1·ω_bf1·x + ω_dnn2·ω_energy2·ω_bf2·x + ω_dnn3·ω_energy3·ω_bf3·x.
Fig. 4 and fig. 5 show the beam effect after each directional beam is weighted in combination with the energy weighting method. FIG. 4 shows that, before weighting, the beams pointing at speaker spk1 and speaker spk2 have the same size; FIG. 5 shows the effect after weighting: because speaker spk1 is louder than speaker spk2, the beam pointing at spk1 is larger than the beam pointing at spk2.
Through the instruction set executed on the storage device 600: the acquired voice data is processed with a deep learning technique to separate human voice from non-voice noise; compared with traditional adaptive beamforming algorithms, this recognizes and distinguishes voice and non-voice noise more accurately and intelligently. Signal energy detection is then performed in each identified voice direction, and the beam sizes are computed by weighted superposition according to the energy detection results, so that voice can be picked up from multiple directions simultaneously, meeting the requirement of picking up multiple speakers in a conference or any other scene.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concept of the present invention, changes and modifications to the embodiments described herein, or equivalent structures or equivalent process transformations made using the contents of this specification and the drawings, applied directly or indirectly in other related technical fields, are all included in the scope of protection of the present invention.

Claims (8)

1. A beam forming method based on deep learning is characterized by comprising the following steps:
processing the obtained voice data through a deep learning technology to obtain voice and non-voice noise;
and carrying out signal energy detection in the identified human voice direction, and carrying out weighted superposition calculation on the beam size according to an energy detection result.
2. The method according to claim 1, wherein the step of processing the acquired voice data through a deep learning technology to obtain voice and non-voice noise further comprises the steps of:
and calculating the voice existence probability of the acquired voice data through a preset algorithm, and obtaining the voice and the non-voice noise according to the calculation result of the voice existence probability.
3. The method according to claim 2, wherein the step of performing signal energy detection in the identified human voice direction and performing weighted superposition calculation on the beam size according to the energy detection result further comprises the steps of:
calculating energy weighting coefficients for the output multiple beam directions;
and calculating a final beam weighting coefficient according to the voice existence probability and the energy weighting coefficient to obtain final beam output.
4. The deep learning based beamforming method according to claim 2 or 3, wherein
the preset algorithm comprises: a neural network trained through deep learning.
5. A storage device having a set of instructions stored therein, the set of instructions being operable to perform:
processing the obtained voice data through a deep learning technology to obtain voice and non-voice noise;
and carrying out signal energy detection in the identified human voice direction, and carrying out weighted superposition calculation on the beam size according to an energy detection result.
6. The storage device of claim 5, wherein the set of instructions is further configured to perform:
the method comprises the following steps of processing the acquired voice data through a deep learning technology to obtain voice and non-voice noise, and specifically comprises the following steps:
and calculating the voice existence probability of the acquired voice data through a preset algorithm, and obtaining the voice and the non-voice noise according to the calculation result of the voice existence probability.
7. The storage device of claim 6, wherein the set of instructions is further configured to perform:
the method comprises the following steps of performing signal energy detection in the identified human voice direction, and performing weighted superposition calculation on the beam size according to an energy detection result, and specifically comprises the following steps:
calculating energy weighting coefficients for the output multiple beam directions;
and calculating a final beam weighting coefficient according to the voice existence probability and the energy weighting coefficient to obtain final beam output.
8. The storage device according to claim 6 or 7, wherein the preset algorithm comprises: a neural network trained through deep learning.
CN202110431846.0A 2021-04-21 2021-04-21 Beam forming method based on deep learning and storage device Pending CN113257269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110431846.0A CN113257269A (en) 2021-04-21 2021-04-21 Beam forming method based on deep learning and storage device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110431846.0A CN113257269A (en) 2021-04-21 2021-04-21 Beam forming method based on deep learning and storage device

Publications (1)

Publication Number Publication Date
CN113257269A true CN113257269A (en) 2021-08-13

Family

ID=77221167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110431846.0A Pending CN113257269A (en) 2021-04-21 2021-04-21 Beam forming method based on deep learning and storage device

Country Status (1)

Country Link
CN (1) CN113257269A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113284505A (en) * 2021-04-21 2021-08-20 瑞芯微电子股份有限公司 Adaptive beam forming method and storage device

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080232607A1 (en) * 2007-03-22 2008-09-25 Microsoft Corporation Robust adaptive beamforming with enhanced noise suppression
CN101685638A (en) * 2008-09-25 2010-03-31 华为技术有限公司 Method and device for enhancing voice signals
KR20130126318A (en) * 2012-05-11 2013-11-20 엘지전자 주식회사 Apparatus and method for removing noise
CN109272989A (en) * 2018-08-29 2019-01-25 北京京东尚科信息技术有限公司 Voice awakening method, device and computer readable storage medium
CN110600051A (en) * 2019-11-12 2019-12-20 乐鑫信息科技(上海)股份有限公司 Method for selecting output beams of a microphone array
CN110648692A (en) * 2019-09-26 2020-01-03 苏州思必驰信息科技有限公司 Voice endpoint detection method and system
CN110740412A (en) * 2018-07-18 2020-01-31 奥迪康有限公司 Hearing device comprising a speech presence probability estimator
US10573321B1 (en) * 2018-09-25 2020-02-25 Sonos, Inc. Voice detection optimization based on selected voice assistant service
CN111025233A (en) * 2019-11-13 2020-04-17 阿里巴巴集团控股有限公司 Sound source direction positioning method and device, voice equipment and system
CN111640428A (en) * 2020-05-29 2020-09-08 北京百度网讯科技有限公司 Voice recognition method, device, equipment and medium
CN112652320A (en) * 2020-12-04 2021-04-13 深圳地平线机器人科技有限公司 Sound source positioning method and device, computer readable storage medium and electronic equipment
CN113284505A (en) * 2021-04-21 2021-08-20 瑞芯微电子股份有限公司 Adaptive beam forming method and storage device




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination