CN117935838A - Audio acquisition method and device, electronic equipment and storage medium - Google Patents
Audio acquisition method and device, electronic equipment and storage medium
- Publication number
- CN117935838A CN117935838A CN202410344779.2A CN202410344779A CN117935838A CN 117935838 A CN117935838 A CN 117935838A CN 202410344779 A CN202410344779 A CN 202410344779A CN 117935838 A CN117935838 A CN 117935838A
- Authority
- CN
- China
- Prior art keywords
- spectrum
- channel
- imaginary
- real
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The application discloses an audio acquisition method, an audio acquisition device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a sample audio fragment; acquiring masks corresponding to the channels respectively based on the corresponding real number spectrum and the corresponding imaginary number spectrum; updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum; inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of the neural network model to be trained, and obtaining a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment; and acquiring a target loss function based on the single-channel real number spectrum and the single-channel imaginary number spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model. The method improves the attention capability of the neural network model to key channels of the audio clips and improves the audio pick-up effect.
Description
Technical Field
The present application relates to the field of audio acquisition technologies, and in particular, to an audio acquisition method, an apparatus, an electronic device, and a storage medium.
Background
Directional pickup is one method of speech enhancement: a target signal is picked up from a mixed signal according to the direction of its sound source, that is, only sound transmitted from a specific direction is picked up, while noise and interference signals from other directions are attenuated or shielded rather than picked up, thereby enhancing the target speech. To reduce the implementation cost of directional pickup, a model may be trained, and pickup of the target audio signal accomplished by the resulting trained model. However, multiple reflections of the audio signal in an indoor sound field may form multi-channel reverberant audio, and during directional pickup the model may have difficulty accurately picking up the audio of the key channel, thereby degrading the directional pickup effect.
Disclosure of Invention
The application provides an audio acquisition method, an audio acquisition device, an electronic device and a storage medium, so as to address the above problem.
In a first aspect, the present application provides an audio acquisition method, the method comprising: acquiring a sample audio fragment, wherein the sample audio fragment comprises a plurality of channels, and each channel has a corresponding real number spectrum and an imaginary number spectrum; acquiring masks corresponding to the channels respectively based on the corresponding real spectrums and imaginary spectrums; updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum; inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of a neural network model to be trained so as to train the designated attention pooling layer, and acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment; and acquiring a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model.
In a second aspect, the present application provides an audio acquisition apparatus, the apparatus comprising: the system comprises a sample audio acquisition module, a sampling module and a sampling module, wherein the sample audio acquisition module is used for acquiring a sample audio fragment, the sample audio fragment comprises a plurality of channels, and each channel has a corresponding real number spectrum and an imaginary number spectrum; a mask acquisition module, configured to acquire masks corresponding to the multiple channels respectively based on the corresponding real number spectrum and imaginary number spectrum; the parameter updating module is used for updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum; the training module is used for inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of the neural network model to be trained so as to train the designated attention pooling layer and acquire a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment; and the audio acquisition module is used for acquiring a target loss function based on the single-channel real number spectrum and the single-channel imaginary number spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model.
In a third aspect, the present application provides an electronic device, comprising: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the audio acquisition method provided in the first aspect above.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein program code which is callable by a processor to perform the audio acquisition method provided in the first aspect described above.
The application provides an audio acquisition method, an audio acquisition device, electronic equipment and a storage medium, wherein a sample audio fragment is acquired, the sample audio fragment comprises a plurality of channels, and each channel has a corresponding real number spectrum and an imaginary number spectrum; acquiring masks corresponding to the channels respectively based on the corresponding real spectrums and imaginary spectrums; updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum; inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of a neural network model to be trained so as to train the designated attention pooling layer, and acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment; and acquiring a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model. Therefore, the multi-channel mask is introduced, the spectrum parameters of each channel are updated based on the mask, and then the appointed attention pooling layer of the neural network model to be trained is trained by using the updated spectrum parameters, so that the attention capability of the neural network model to key channels of the audio fragment is improved, and the effect of picking up the audio by the target neural network model obtained through training is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a flowchart of an audio acquisition method according to an embodiment of the present application.
Fig. 2 is a flowchart of an audio acquisition method according to another embodiment of the present application.
Fig. 3 shows a block diagram of an audio acquisition device according to an embodiment of the present application.
Fig. 4 shows a block diagram of an electronic device according to an embodiment of the present application.
Fig. 5 illustrates a storage unit for storing or carrying program codes for implementing an audio acquisition method according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present application with reference to the accompanying drawings.
Directional pickup is one method of speech enhancement: a target signal is picked up from a mixed signal according to the direction of its sound source, that is, only sound transmitted from a specific direction is picked up, while noise and interference signals from other directions are attenuated or shielded rather than picked up, thereby enhancing the target speech. To reduce the implementation cost of directional pickup, a model may be trained, and pickup of the target audio signal accomplished by the resulting trained model. However, multiple reflections of the audio signal in an indoor sound field may form multi-channel reverberant audio, and during directional pickup the model may have difficulty accurately picking up the audio of the key channel, thereby degrading the directional pickup effect.
The inventor finds through long-term research that a sample audio fragment can be obtained, wherein the sample audio fragment comprises a plurality of channels, and each channel has a corresponding real number spectrum and an imaginary number spectrum; acquiring masks corresponding to the channels respectively based on the corresponding real spectrums and imaginary spectrums; updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum; inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of a neural network model to be trained so as to train the designated attention pooling layer, and acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment; and acquiring a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model. Therefore, the multi-channel mask is introduced, the spectrum parameters of each channel are updated based on the mask, and then the appointed attention pooling layer of the neural network model to be trained is trained by using the updated spectrum parameters, so that the attention capability of the neural network model to key channels of the audio fragment is improved, and the effect of picking up the audio by the target neural network model obtained through training is further improved.
Therefore, in order to improve the above problems, the inventor proposes an audio acquisition method, an apparatus, an electronic device and a storage medium, which can improve the attention capability of the neural network model to the key channels of the audio clips, and further improve the audio pickup effect of the target neural network model obtained through training.
In order to facilitate a better understanding of the solution described by way of example of the application, the following brief description of the relevant terms involved in the implementation of the application will be given:
Hadamard product: a matrix operation. If A = (a_ij) and B = (b_ij) are two matrices of the same order, the matrix C = (c_ij) with c_ij = a_ij × b_ij is the Hadamard product of A and B, also called the element-wise or Schur product.
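As a small illustration (in NumPy, the `*` operator applied to two equal-shaped arrays is exactly this element-wise Hadamard product):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Hadamard product: c_ij = a_ij * b_ij
C = A * B
print(C)  # [[ 5 12]
          #  [21 32]]
```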
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, which shows a flowchart of an audio acquisition method according to an embodiment of the present application, this embodiment provides an audio acquisition method applicable to an electronic device. The electronic device in this embodiment may be a mobile communication device with a network connection function, such as a smart phone or a computer; the specific device type is not limited. The method comprises the following steps:
step S110: a sample audio segment is acquired, the sample audio segment comprising a plurality of channels, each channel having a corresponding real and imaginary spectrum.
The sample audio in the embodiment of the application is a piece of reverberant audio including human voice and environmental noise; the specific duration of the sample audio is not limited, and it may, for example, be a piece of audio with a duration of 30 minutes. The sample audio comprises a plurality of audio segments, each of which may comprise a plurality of channels (the specific number of channels is not limited), i.e., each audio segment is a multi-channel audio segment.
In a specific embodiment, assuming that the sample audio segment is an audio segment containing 9 channels, the sample audio segment may be represented as y = [y_1, y_2, …, y_9], where y denotes the sample audio segment, y_1 denotes the first-channel audio, y_2 denotes the second-channel audio, y_3 denotes the third-channel audio, and so on.
As an implementation manner, in order to facilitate training the neural network model to be trained with the sample audio segment, a Fourier transform may be performed on the sample audio segment so as to calculate the real spectrum and imaginary spectrum corresponding to each of its channels. In particular, taking channel y_k as an example, the corresponding real and imaginary spectra may be calculated as follows:

STFT{y_k}(m, ω) = Σ_{n=0}^{N−1} y_k(n + m) · w(n) · e^{−iωn}

where N denotes the window length of the Fourier transform (for example, the window length N may equal the number of Fourier transform points, nfft = 1024); n denotes the n-th sampling point within the window; w(n) denotes the window function, which may be, for example, a Hanning window; m denotes the position to which the window function slides along the time axis; ω denotes the frequency; i denotes the imaginary unit; and STFT (Short-Time Fourier Transform) denotes the short-time Fourier transform.
As an implementation mode, the exponential term in the above formula can be expanded using e^{−iωn} = cos(ωn) − i·sin(ωn) to obtain the real and imaginary spectra corresponding to channel y_k:

R_k(m, ω) = Σ_{n=0}^{N−1} y_k(n + m) · w(n) · cos(ωn)

I_k(m, ω) = −Σ_{n=0}^{N−1} y_k(n + m) · w(n) · sin(ωn)

where R_k denotes the real spectrum of channel y_k and I_k denotes its imaginary spectrum. Since the sample audio segment includes 9 channels, there are likewise 9 real spectra and 9 imaginary spectra, which may be expressed as R = [R_1, R_2, …, R_9] and I = [I_1, I_2, …, I_9].
The dimensions of the real spectrum and the imaginary spectrum are [ch, nfft/2, T], where ch denotes the number of channels and T denotes the number of audio frames of a channel (the specific value is not limited). For example, the dimensions of the real and imaginary spectra in this embodiment may be [9, 512, 8].
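The per-channel decomposition above can be sketched in NumPy as follows. The patent fixes only the window length (nfft = 1024) and a Hanning window; the hop length and the function name here are illustrative assumptions:

```python
import numpy as np

def real_imag_spectra(y, nfft=1024, hop=256):
    """Real/imaginary spectra of multi-channel audio via a
    Hanning-windowed short-time Fourier transform.

    y: [ch, num_samples] time-domain audio (e.g. ch = 9 channels).
    Returns R, I of shape [ch, nfft // 2, T].
    """
    window = np.hanning(nfft)
    ch, num_samples = y.shape
    T = 1 + (num_samples - nfft) // hop          # number of frames
    R = np.empty((ch, nfft // 2, T))
    I = np.empty((ch, nfft // 2, T))
    for k in range(ch):
        for t in range(T):
            frame = y[k, t * hop : t * hop + nfft] * window
            spec = np.fft.rfft(frame)[: nfft // 2]  # keep nfft/2 bins
            R[k, :, t] = spec.real                  # real spectrum
            I[k, :, t] = spec.imag                  # imaginary spectrum
    return R, I
```

With 9 channels and 8 frames this yields spectra of dimension [9, 512, 8], matching the example above.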
Step S120: and acquiring masks corresponding to the channels respectively based on the corresponding real spectrums and the corresponding imaginary spectrums.
In this embodiment, the real and imaginary spectra may differ from channel to channel depending on channel position. Therefore, to help the model accurately distinguish the voice to be picked up from the noise to be suppressed during subsequent training, the mask of each channel may be obtained based on that channel's real and imaginary spectra, thereby obtaining the masks corresponding to all channels of the sample audio segment.
In the process of obtaining the masks corresponding to the channels of the sample audio fragment, as an implementation manner, the amplitude spectrum parameters corresponding to the channels can be obtained based on the corresponding real spectrum and imaginary spectrum; and then inputting the amplitude spectrum parameters corresponding to the channels into the neural network model to be trained, and obtaining masks corresponding to the channels output by the neural network model to be trained. Specifically, for each channel, the amplitude spectrum parameter of the channel may be calculated based on the real spectrum and the imaginary spectrum of the channel, and the calculation formula may be expressed as follows:
M_k = sqrt(R_k² + I_k²)

where M_k denotes the amplitude spectrum parameter, R_k denotes the real spectrum, and I_k denotes the imaginary spectrum.
In the case of calculating the amplitude spectrum parameter of the channel, in order to facilitate reduction of the data processing amount of the model, the amplitude spectrum parameter may be subjected to compression processing. For example, the amplitude spectrum parameters may be compressed as follows:
M̃_k = (M_k)^c

where M̃_k denotes the compressed amplitude spectrum parameter (compressed amplitude spectrum for short) and c denotes the compression exponent. In this embodiment the value of c may range over (0.1, 1); the specific value is not limited, and c may, for example, be 0.5.
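The amplitude computation and power-law compression described above can be sketched as follows (c = 0.5 per the example; the function name is illustrative):

```python
import numpy as np

def compressed_amplitude(R, I, c=0.5):
    """Amplitude spectrum M = sqrt(R^2 + I^2), then power-law
    compression M**c with c in (0.1, 1), reducing the dynamic range
    the model has to process."""
    M = np.sqrt(R ** 2 + I ** 2)   # amplitude spectrum parameter
    return M ** c                  # compressed amplitude spectrum
```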
After all channels of the sample audio segment are processed, the compressed amplitude spectra of the 9 channels are obtained; these can then be input into the neural network model to be trained so as to train it, yielding the masks output by the model for each of the 9 channels. Optionally, the neural network model to be trained in this embodiment may be a U-Net-type model; its specific type is not limited.
Step S130: and updating the real number spectrum and the imaginary number spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real number spectrum and a new multi-channel imaginary number spectrum.
Under the condition that masks corresponding to a plurality of audio channels of a sample audio fragment are obtained, in order to facilitate optimization of a training process of a neural network model to be trained, a real number spectrum and an imaginary number spectrum corresponding to each channel can be updated based on the corresponding masks, and a new multi-channel real number spectrum and a new multi-channel imaginary number spectrum are obtained. As an embodiment, for each channel's corresponding real spectrum and imaginary spectrum, the product of the channel's corresponding mask and the real spectrum may be taken as the new real spectrum for the channel, and the product of the channel's corresponding mask and the imaginary spectrum may be taken as the new imaginary spectrum for the channel.
Specifically, assume the masks corresponding to the audio channels of the sample audio segment are mask_1, mask_2, …, mask_9. As an implementation manner, each mask may be multiplied element-wise with the real spectrum and the imaginary spectrum corresponding to its channel according to the following formulas:

R′_k = mask_k ⊙ R_k

I′_k = mask_k ⊙ I_k

where k denotes the k-th channel, with k ranging from 1 to 9; mask_k denotes the mask of the k-th channel; R_k denotes the real spectrum of the k-th channel; I_k denotes the imaginary spectrum of the k-th channel; R′_k denotes the new real spectrum of the k-th channel; I′_k denotes the new imaginary spectrum of the k-th channel; and ⊙ denotes the Hadamard product. Multiplying each channel's mask with its real spectrum and with its imaginary spectrum in this way yields the new multi-channel real spectrum and the new multi-channel imaginary spectrum.
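The per-channel mask update is a Hadamard product, which over the stacked [ch, nfft/2, T] arrays is plain element-wise multiplication (a sketch; names are illustrative):

```python
import numpy as np

def apply_masks(masks, R, I):
    """Element-wise (Hadamard) product of each channel's mask with its
    real and imaginary spectra, yielding the new multi-channel real
    spectrum and new multi-channel imaginary spectrum.

    masks, R, I: arrays of identical shape [ch, nfft // 2, T].
    """
    return masks * R, masks * I
```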
Step S140: inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of a neural network model to be trained so as to train the designated attention pooling layer, and acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment.
The neural network model to be trained in the embodiment of the application comprises a designated attention pooling layer, which can be understood as a multi-channel attention pooling layer. In order to facilitate improving the attention capability of the neural network model to be trained to the key channels of the audio fragment, the new multi-channel real number spectrum and the new multi-channel imaginary number spectrum can be input into a designated attention pooling layer of the neural network model to be trained so as to train the designated attention pooling layer.
The designated attention pooling layer in this embodiment can determine the key channels and the secondary channels among the multiple channels according to the new multi-channel real spectrum and the new multi-channel imaginary spectrum, and thus assign corresponding weights to channels of different importance. Based on the designated attention pooling layer, the neural network model to be trained can strengthen the contribution of key channels and weaken that of secondary channels, which enhances the robustness of the target neural network model obtained by subsequent training.
After the designated attention pooling layer assigns corresponding weights to channels of different importance, the multiple channels can be merged into a single channel by combining the channel weights, and the single-channel real spectrum and single-channel imaginary spectrum corresponding to the sample audio segment are output.
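The patent does not disclose the internal structure of the designated attention pooling layer. As a hedged illustration of the weighted merging described above, per-channel scores (learned in practice; passed in as fixed values here) can be softmax-normalized into weights and the channels combined by a weighted sum:

```python
import numpy as np

def attention_pool(R, I, scores):
    """Illustrative channel-attention pooling: softmax the per-channel
    scores into weights, then merge the multi-channel spectra into
    single-channel spectra by a weighted sum over the channel axis.

    R, I: [ch, nfft // 2, T]; scores: [ch] per-channel scores.
    """
    a = np.exp(scores - scores.max())
    a = a / a.sum()                         # attention weights, sum to 1
    R1 = np.tensordot(a, R, axes=(0, 0))    # [nfft // 2, T]
    I1 = np.tensordot(a, I, axes=(0, 0))
    return R1, I1
```

With equal scores this reduces to a plain mean over channels; unequal scores emphasize the key channels and attenuate secondary ones.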
Step S150: and acquiring a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model.
In order to evaluate the model training effect, the target loss function may be obtained based on a single-channel real spectrum and a single-channel imaginary spectrum. Specifically, the single-channel real spectrum and the single-channel imaginary spectrum can be converted into single-channel audio, and then the target loss function can be obtained based on the single-channel audio and the preset tag audio. The preset tag audio may be understood as a tag of the sample audio fragment, where the tag is an audio carrying an identifier of a sound source position, and optionally, for directional pickup, if the sound source position is a pickup area, the tag may be the sample audio fragment itself; and if the sound source location is a non-pickup area, the tag may be silent audio having a duration equal to that of the sample audio piece.
The single-channel audio can be obtained by performing an inverse Fourier transform on the single-channel real spectrum and the single-channel imaginary spectrum; the difference between the single-channel audio and the preset tag audio can then be calculated by means such as mean squared error or mean absolute error, or by other difference calculation methods (the specific calculation method is not limited here), so as to obtain the target loss function.
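The conversion-and-compare step above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the spectra are assumed to be [1, F, T] arrays of per-frame rFFT bins, the frames are simply concatenated (a real system would overlap-add), and all names and shapes are illustrative.

```python
import numpy as np

def target_loss(single_real, single_imag, label_audio, n_fft=512):
    # Recombine real and imaginary parts into a complex spectrum [F, T]
    spec = single_real[0] + 1j * single_imag[0]
    # Inverse FFT each frame back to the time domain: [n_fft, T]
    frames = np.fft.irfft(spec, n=n_fft, axis=0)
    # Naive concatenation of frames (a real system would overlap-add)
    audio = frames.T.reshape(-1)
    # Mean squared error against the equal-length preset tag audio
    return np.mean((audio - label_audio[: audio.size]) ** 2)

# Non-pickup-area case: the tag is silent audio, so all-zero spectra give zero loss
F, T = 257, 10
loss = target_loss(np.zeros((1, F, T)), np.zeros((1, F, T)), np.zeros(512 * T))
```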
Under the condition that the target loss function is obtained, the target loss function can be reversely transmitted to update the neural network model to be trained and the appointed attention pooling layer until the target loss function is converged, and then the neural network model to be trained when the target loss function is converged is used as the target neural network model.
As an implementation, audio can be picked up through the target neural network model obtained by the final training. Compared with the prior art, in this implementation, masks corresponding to the channels of the sample audio fragment are introduced, and the neural network model to be trained and its designated attention pooling layer are trained accordingly. This can improve training precision and increase the model's attention to the key channels of the sample audio fragment, which further improves the stability and robustness of the trained target neural network model and thereby improves its audio pickup effect.
According to the audio acquisition method provided by the embodiment, a sample audio fragment is acquired, wherein the sample audio fragment comprises a plurality of channels, and each channel has a corresponding real spectrum and an imaginary spectrum; acquiring masks corresponding to the channels respectively based on the corresponding real spectrums and imaginary spectrums; updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum; inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of a neural network model to be trained so as to train the designated attention pooling layer, and acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment; and acquiring a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model. Therefore, the multi-channel mask is introduced, the spectrum parameters of each channel are updated based on the mask, and then the appointed attention pooling layer of the neural network model to be trained is trained by using the updated spectrum parameters, so that the attention capability of the neural network model to key channels of the audio fragment is improved, and the effect of picking up the audio by the target neural network model obtained through training is further improved.
Referring to fig. 2, a flowchart of an audio acquisition method according to another embodiment of the present application is shown, where the embodiment provides an audio acquisition method applicable to an electronic device, and the method includes:
step S210: a sample audio segment is acquired, the sample audio segment comprising a plurality of channels, each channel having a corresponding real and imaginary spectrum.
The specific implementation of step S210 may refer to the related description of step S110 in the foregoing embodiment, which is not described herein.
Step S220: and acquiring masks corresponding to the channels respectively based on the corresponding real spectrums and the corresponding imaginary spectrums.
The specific implementation of step S220 may refer to the related description of step S120 in the foregoing embodiment, which is not described herein.
Step S230: and updating the real number spectrum and the imaginary number spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real number spectrum and a new multi-channel imaginary number spectrum.
The specific implementation of step S230 may refer to the related description of step S130 in the foregoing embodiment, which is not described herein.
Step S240: inputting the new multi-channel real number spectrum and the new multi-channel imaginary number spectrum into a designated attention pooling layer of a neural network model to be trained so as to train the designated attention pooling layer, and acquiring multi-channel attention weight parameters based on the new multi-channel real number spectrum and the new multi-channel imaginary number spectrum respectively through the designated attention pooling layer.
In this embodiment, in order to highlight the important audio channel and weaken the secondary audio channel, the specified attention pooling layer may perform the dimension reduction processing on the sample audio segment, i.e. convert the new multi-channel real spectrum and the new multi-channel imaginary spectrum of the sample audio segment into a single-channel real spectrum and a single-channel imaginary spectrum.
In the process of acquiring the single-channel real spectrum and the single-channel imaginary spectrum corresponding to the sample audio fragment that are output by the designated attention pooling layer, the designated attention pooling layer may first acquire the multichannel attention weight parameters based on the new multichannel real spectrum and the new multichannel imaginary spectrum respectively; that is, the designated attention pooling layer acquires multichannel attention weight parameters for the new multichannel real spectrum, and likewise acquires multichannel attention weight parameters for the new multichannel imaginary spectrum.
Specifically, the designated attention pooling layer may include a linear mapping layer L and a multi-head attention layer H, where L and H are parameter matrices; the mapping dimension of L may be set to nfft/2, and the number of attention heads of H may be set to 2 or 4 or another value by which nfft/2 is divisible. As an embodiment, taking the new multichannel real spectrum Xr as an example, the multichannel attention weight parameter may be obtained as follows:

e = H · tanh(L · Xr)

wherein e characterizes the multichannel attention weight parameter; tanh characterizes the hyperbolic tangent activation function; Xr characterizes the new multichannel real spectrum; L characterizes the linear mapping layer; H characterizes the multi-head attention layer. The size of e may be [ch, 1, T]. For the new multichannel imaginary spectrum, the calculation is the same as above and is not repeated here.
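The shape bookkeeping of this formula can be sketched in NumPy as follows, assuming (as reconstructed here) that L is an [nfft/2, nfft/2] matrix, H is a [1, nfft/2] matrix (a single attention head for simplicity), and the real spectrum has shape [ch, nfft/2, T]; the names and toy sizes are illustrative.

```python
import numpy as np

ch, F, T = 4, 8, 6                      # channels, nfft/2 bins, frames (toy sizes)
rng = np.random.default_rng(0)
X_r = rng.standard_normal((ch, F, T))   # new multichannel real spectrum
L = rng.standard_normal((F, F))         # linear mapping layer (parameter matrix)
H = rng.standard_normal((1, F))         # attention layer (single head here)

# e = H · tanh(L · Xr), applied channel by channel:
# (F,F)x(ch,F,T) -> (ch,F,T), then (1,F)x(ch,F,T) -> (ch,1,T)
e = np.einsum('hf,cft->cht', H, np.tanh(np.einsum('fg,cgt->cft', L, X_r)))
assert e.shape == (ch, 1, T)            # one weight per channel per frame
```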
Step S250: and acquiring a regular weight parameter based on the multi-channel attention weight parameter.
Further, the regular weight parameter may be obtained based on the multichannel attention weight parameter. Specifically, the regular weight parameter may be obtained according to the following formula:

α = softmax(e)

wherein α characterizes the regular weight parameter and softmax characterizes the softmax activation function; the size of α may be [ch, 1, T]. Optionally, the softmax normalization is performed in the channel-number dimension, by which the weights of the key channels of the sample audio fragment can be highlighted.
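A minimal sketch of this channel-dimension softmax, with illustrative sizes (the [ch, 1, T] layout is assumed from the surrounding description):

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.standard_normal((4, 1, 6))                 # attention weights [ch, 1, T]

# Softmax over the channel (first) dimension: for every time frame the
# channel weights are non-negative and sum to 1, highlighting key channels
exp_e = np.exp(e - e.max(axis=0, keepdims=True))   # subtract max for stability
alpha = exp_e / exp_e.sum(axis=0, keepdims=True)
```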
Step S260: and acquiring a multichannel attention weighting parameter based on the regular weighting parameter.
As an embodiment, to further highlight the weights of the key channels, the multichannel attention weighting parameter may be obtained based on the regular weight parameter. Specifically, continuing with the new multichannel real spectrum Xr as an example, the multichannel attention weighting parameter may be obtained as follows:

w = α ⊙ Xr

wherein w characterizes the multichannel attention weighting parameter, α characterizes the regular weight parameter, and Xr characterizes the new multichannel real spectrum. Optionally, w can also be understood as the Hadamard product (element-wise multiplication) of the regular weight parameter α and the masked result Xr input to the multichannel attention pooling layer, such that different channels have their respective weights.
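The Hadamard product can be sketched as follows; the broadcast of the [ch, 1, T] weights over the frequency axis is an assumption of this illustration, and the names and sizes are made up:

```python
import numpy as np

ch, F, T = 4, 8, 6
rng = np.random.default_rng(2)
X_r = rng.standard_normal((ch, F, T))   # new multichannel real spectrum
alpha = rng.random((ch, 1, T))          # regular weight parameter [ch, 1, T]

# Hadamard (element-wise) product; alpha broadcasts over the frequency axis,
# so every bin of a channel is scaled by that channel's per-frame weight
W_r = alpha * X_r
```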
Step S270: and acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum corresponding to the sample audio fragment based on the multi-channel attention weighting parameter.
Further, the single-channel real spectrum and the single-channel imaginary spectrum corresponding to the sample audio fragment may be obtained based on the multichannel attention weighting parameter. As an implementation, they may be obtained according to the following formula:

Y = Σc wc

wherein Y characterizes the result of summing the multichannel attention weighting parameter w over the channel dimension c, merging the multiple channels into a single channel, so that the single-channel real spectrum and the single-channel imaginary spectrum can be obtained.
It should be noted that for the new multichannel real spectrum Xr, the corresponding single-channel real spectrum Yr can be obtained after processing in the above manner; for the new multichannel imaginary spectrum Xi, processing in the above manner likewise yields the corresponding single-channel imaginary spectrum Yi. Optionally, the size of the single-channel real spectrum Yr may be [1, nfft/2, T], and the size of the single-channel imaginary spectrum Yi may also be [1, nfft/2, T].
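The channel summation that merges the weighted multichannel spectra into single-channel spectra can be sketched as follows, with illustrative toy sizes:

```python
import numpy as np

ch, F, T = 4, 8, 6
rng = np.random.default_rng(3)
W_r = rng.standard_normal((ch, F, T))   # weighted multichannel real spectrum
W_i = rng.standard_normal((ch, F, T))   # weighted multichannel imaginary spectrum

# Summing over the channel axis merges the weighted channels into one,
# giving single-channel spectra of size [1, F, T] (= [1, nfft/2, T])
Y_r = W_r.sum(axis=0, keepdims=True)
Y_i = W_i.sum(axis=0, keepdims=True)
```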
Step S280: and acquiring a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model.
The specific implementation of step S280 may refer to the related description of step S150 in the foregoing embodiment, which is not described herein.
According to the audio acquisition method provided by the embodiment, a sample audio fragment is acquired, wherein the sample audio fragment comprises a plurality of channels, and each channel has a corresponding real spectrum and an imaginary spectrum; acquiring masks corresponding to the channels respectively based on the corresponding real spectrums and imaginary spectrums; updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum; inputting the new multi-channel real number spectrum and the new multi-channel imaginary number spectrum into a designated attention pooling layer of a neural network model to be trained so as to train the designated attention pooling layer, and acquiring multi-channel attention weight parameters based on the new multi-channel real number spectrum and the new multi-channel imaginary number spectrum respectively through the designated attention pooling layer; acquiring a regular weight parameter based on the multi-channel attention weight parameter; acquiring a multichannel attention weighting parameter based on the regular weighting parameter; acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum corresponding to the sample audio fragment based on the multi-channel attention weighting parameters; and acquiring a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model. 
Therefore, the multi-channel mask is introduced, the spectrum parameters of each channel are updated based on the mask, and then the appointed attention pooling layer of the neural network model to be trained is trained by using the updated spectrum parameters, so that the attention capability of the neural network model to key channels of the audio fragment is improved, and the effect of picking up the audio by the target neural network model obtained through training is further improved.
By introducing the multichannel attention pooling layer, the attention effect on multichannel dimension can be realized, so that the model can be trained to have the capability of focusing on the key channel of the sample audio fragment, and the robustness of the trained model is improved.
Referring to fig. 3, a block diagram of an audio acquisition device according to an embodiment of the present application is provided. This embodiment provides an audio acquisition device 300 that can run in an electronic device, where the audio acquisition device 300 includes: a sample audio acquisition module 310, a mask acquisition module 320, a parameter update module 330, a training module 340, and an audio acquisition module 350.
a sample audio acquisition module 310 is configured to acquire a sample audio segment, where the sample audio segment includes a plurality of channels, and each channel has a real spectrum and an imaginary spectrum.
A mask obtaining module 320, configured to obtain masks corresponding to the channels respectively based on the corresponding real spectrum and imaginary spectrum.
As an embodiment, the mask obtaining module 320 may be configured to obtain amplitude spectrum parameters corresponding to each of the plurality of channels based on the corresponding real spectrum and imaginary spectrum; and inputting the amplitude spectrum parameters corresponding to the channels into the neural network model to be trained, and obtaining masks corresponding to the channels, which are output by the neural network model to be trained.
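A brief sketch of the amplitude-spectrum computation this module performs before the masks are predicted, bin by bin from the real and imaginary parts; the numbers are made up for illustration:

```python
import numpy as np

# Amplitude spectrum of one channel from its real and imaginary parts,
# computed bin by bin: |X| = sqrt(real^2 + imag^2)
real = np.array([[3.0, 0.0], [1.0, 2.0]])
imag = np.array([[4.0, 1.0], [0.0, 2.0]])
magnitude = np.sqrt(real ** 2 + imag ** 2)
```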
And the parameter updating module 330 is configured to update the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask, so as to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum.
For one embodiment, the parameter updating module 330 may be configured to, for each channel, use a product of the mask corresponding to the channel and the real spectrum as a new real spectrum of the channel and use a product of the mask corresponding to the channel and the imaginary spectrum as a new imaginary spectrum of the channel.
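A minimal sketch of this mask-based update, with illustrative shapes and names:

```python
import numpy as np

ch, F, T = 2, 4, 3
rng = np.random.default_rng(4)
real = rng.standard_normal((ch, F, T))  # per-channel real spectra
imag = rng.standard_normal((ch, F, T))  # per-channel imaginary spectra
mask = rng.random((ch, F, T))           # per-channel masks output by the model

# Product of each channel's mask with its real/imaginary spectrum gives the
# new multichannel real and imaginary spectra
new_real = mask * real
new_imag = mask * imag
```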
The training module 340 is configured to input the new multi-channel real spectrum and the new multi-channel imaginary spectrum into a designated attention pooling layer of the neural network model to be trained, so as to train the designated attention pooling layer, and obtain a single-channel real spectrum and a single-channel imaginary spectrum, which are output by the designated attention pooling layer and correspond to the sample audio segment.
As an implementation, the training module 340 may be configured to obtain, by the specified attention pooling layer, a multichannel attention weighting parameter based on the new multichannel real spectrum and multichannel imaginary spectrum, respectively; acquiring a regular weight parameter based on the multi-channel attention weight parameter; acquiring a multichannel attention weighting parameter based on the regular weighting parameter; and acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum corresponding to the sample audio fragment based on the multi-channel attention weighting parameter.
The audio acquisition module 350 is configured to acquire a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, and take a neural network model to be trained when the target loss function converges as a target neural network model, and perform audio pickup through the target neural network model.
As one implementation, the audio acquisition module 350 may be configured to acquire single-channel audio based on the single-channel real spectrum and single-channel imaginary spectrum; and acquiring a target loss function based on the single-channel audio and a preset tag audio.
Optionally, the audio obtaining apparatus 300 may further include a data processing module, configured to perform fourier transform on the sample audio segment to obtain a real spectrum and an imaginary spectrum corresponding to each channel of the sample audio segment. Similarly, the obtaining single-channel audio based on the single-channel real spectrum and the single-channel imaginary spectrum may include: and performing inverse Fourier transform on the single-channel real spectrum and the single-channel imaginary spectrum to obtain the single-channel audio.
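The transform pair used by this module can be sketched with NumPy's real FFT; the frame length and variable names are illustrative assumptions, and the round trip recovers the original frame:

```python
import numpy as np

n_fft = 512
rng = np.random.default_rng(5)
frame = rng.standard_normal(n_fft)     # one time-domain frame of one channel

# Forward FFT yields the real and imaginary spectra the model consumes ...
spec = np.fft.rfft(frame)
real, imag = spec.real, spec.imag

# ... and the inverse FFT recombines them back into single-channel audio
recovered = np.fft.irfft(real + 1j * imag, n=n_fft)
```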
As an embodiment, the audio acquisition module 350 may first counter-propagate the objective loss function to update the neural network model to be trained and the specified attention pooling layer until the objective loss function converges; and taking the neural network model to be trained when the target loss function converges as a target neural network model.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In several embodiments provided by the present application, the coupling of the modules to each other may be electrical, mechanical, or other.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
Referring to fig. 4, based on the above-mentioned audio acquisition method and apparatus, an embodiment of the present application further provides an electronic device 100 capable of executing the above-mentioned audio acquisition method. The electronic device 100 includes a memory 102 and one or more (only one is shown) processors 104 coupled to each other, with communication lines connecting the memory 102 and the processors 104. The memory 102 stores therein a program that can execute the contents of the foregoing embodiments, and the processor 104 can execute the program stored in the memory 102.
The processor 104 may include one or more processing cores. The processor 104 uses various interfaces and lines to connect the various parts of the electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 102 and invoking data stored in the memory 102. Optionally, the processor 104 may be implemented in hardware in at least one of a digital signal processor (DSP), a field-programmable gate array (FPGA), and a programmable logic array (PLA). The processor 104 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, applications, and the like; the GPU is responsible for rendering and drawing display content; the modem handles wireless communication. It will be appreciated that the modem may also not be integrated into the processor 104 and may instead be implemented by a separate communication chip.
Memory 102 may include random access Memory (Random Access Memory, RAM) or Read-Only Memory (ROM). Memory 102 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 102 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing embodiments, etc. The storage data area may also store data created by the electronic device 100 in use (e.g., phonebook, audiovisual data, chat log data), and the like.
Referring to fig. 5, a block diagram of a computer readable storage medium according to an embodiment of the application is shown. The computer readable storage medium 400 has stored therein program code that can be invoked by a processor to perform the methods described in the method embodiments described above.
The computer readable storage medium 400 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, computer readable storage medium 400 comprises a non-volatile computer readable storage medium (non-transitory computer-readable storage medium). The computer readable storage medium 400 has storage space for program code 410 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. Program code 410 may be compressed, for example, in a suitable form.
In summary, according to the audio acquisition method, the device, the electronic equipment and the storage medium provided by the embodiments of the present application, a sample audio segment is acquired, where the sample audio segment includes a plurality of channels, and each channel has a real spectrum and an imaginary spectrum corresponding to the sample audio segment; acquiring masks corresponding to the channels respectively based on the corresponding real spectrums and imaginary spectrums; updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum; inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of a neural network model to be trained so as to train the designated attention pooling layer, and acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment; and acquiring a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model. Therefore, the multi-channel mask is introduced, the spectrum parameters of each channel are updated based on the mask, and then the appointed attention pooling layer of the neural network model to be trained is trained by using the updated spectrum parameters, so that the attention capability of the neural network model to key channels of the audio fragment is improved, and the effect of picking up the audio by the target neural network model obtained through training is further improved.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present application, and are not limiting. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A method of audio acquisition, the method comprising:
Acquiring a sample audio fragment, wherein the sample audio fragment comprises a plurality of channels, and each channel has a corresponding real number spectrum and an imaginary number spectrum;
acquiring masks corresponding to the channels respectively based on the corresponding real spectrums and imaginary spectrums;
updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum;
Inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of a neural network model to be trained so as to train the designated attention pooling layer, and acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment;
And acquiring a target loss function based on the single-channel real spectrum and the single-channel imaginary spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model.
2. The method of claim 1, wherein the obtaining a single-channel real spectrum and a single-channel imaginary spectrum corresponding to the sample audio segment output by the specified attention pooling layer comprises:
acquiring a multichannel attention weight parameter based on the new multichannel real spectrum and the multichannel imaginary spectrum respectively through the appointed attention pooling layer;
Acquiring a regular weight parameter based on the multi-channel attention weight parameter;
Acquiring a multichannel attention weighting parameter based on the regular weighting parameter;
And acquiring a single-channel real number spectrum and a single-channel imaginary number spectrum corresponding to the sample audio fragment based on the multi-channel attention weighting parameter.
3. The method of claim 2, wherein the obtaining masks corresponding to each of the plurality of channels based on the corresponding real and imaginary spectrums comprises:
Acquiring amplitude spectrum parameters corresponding to the channels respectively based on the corresponding real spectrum and imaginary spectrum;
And inputting the amplitude spectrum parameters corresponding to the channels into the neural network model to be trained, and obtaining masks corresponding to the channels, which are output by the neural network model to be trained.
4. The method of claim 2, wherein updating the real and imaginary spectra corresponding to each channel based on the corresponding mask comprises:
For each channel corresponding real spectrum and imaginary spectrum, taking the product of the mask corresponding to the channel and the real spectrum as a new real spectrum of the channel, and taking the product of the mask corresponding to the channel and the imaginary spectrum as a new imaginary spectrum of the channel.
5. The method of any of claims 1-4, wherein the obtaining a target loss function based on the single-channel real spectrum and single-channel imaginary spectrum comprises:
Acquiring single-channel audio based on the single-channel real spectrum and the single-channel imaginary spectrum;
and acquiring a target loss function based on the single-channel audio and a preset tag audio.
6. The method according to claim 5, wherein the training neural network model with the target loss function converged as the target neural network model comprises:
counter-propagating the objective loss function to update the neural network model to be trained and the specified attention pooling layer until the objective loss function converges;
and taking the neural network model to be trained when the target loss function converges as a target neural network model.
7. The method of claim 5, wherein the method further comprises:
performing Fourier transform on the sample audio fragment to obtain a real number spectrum and an imaginary number spectrum corresponding to each channel of the sample audio fragment;
the obtaining single-channel audio based on the single-channel real spectrum and the single-channel imaginary spectrum comprises the following steps:
And performing inverse Fourier transform on the single-channel real spectrum and the single-channel imaginary spectrum to obtain the single-channel audio.
8. An audio acquisition device, the device comprising:
The system comprises a sample audio acquisition module, a sampling module and a sampling module, wherein the sample audio acquisition module is used for acquiring a sample audio fragment, the sample audio fragment comprises a plurality of channels, and each channel has a corresponding real number spectrum and an imaginary number spectrum;
a mask acquisition module, configured to acquire masks corresponding to the multiple channels respectively based on the corresponding real number spectrum and imaginary number spectrum;
the parameter updating module is used for updating the real spectrum and the imaginary spectrum corresponding to each channel based on the corresponding mask to obtain a new multi-channel real spectrum and a new multi-channel imaginary spectrum;
The training module is used for inputting the new multichannel real number spectrum and the new multichannel imaginary number spectrum into a designated attention pooling layer of the neural network model to be trained so as to train the designated attention pooling layer and acquire a single-channel real number spectrum and a single-channel imaginary number spectrum which are output by the designated attention pooling layer and correspond to the sample audio fragment;
And the audio acquisition module is used for acquiring a target loss function based on the single-channel real number spectrum and the single-channel imaginary number spectrum, taking the neural network model to be trained when the target loss function is converged as a target neural network model, and carrying out audio pickup through the target neural network model.
9. An electronic device comprising one or more processors and memory;
one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, wherein the program code, when being executed by a processor, performs the method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410344779.2A CN117935838B (en) | 2024-03-25 | 2024-03-25 | Audio acquisition method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117935838A true CN117935838A (en) | 2024-04-26 |
CN117935838B CN117935838B (en) | 2024-06-11 |
Family
ID=90768832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410344779.2A Active CN117935838B (en) | 2024-03-25 | 2024-03-25 | Audio acquisition method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117935838B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112151059A (en) * | 2020-09-25 | 2020-12-29 | 南京工程学院 | Microphone array-oriented channel attention weighted speech enhancement method |
CN113241088A (en) * | 2021-07-09 | 2021-08-10 | 北京达佳互联信息技术有限公司 | Training method and device of voice enhancement model and voice enhancement method and device |
CN113689878A (en) * | 2021-07-26 | 2021-11-23 | 浙江大华技术股份有限公司 | Echo cancellation method, echo cancellation device, and computer-readable storage medium |
CN114141267A (en) * | 2021-11-30 | 2022-03-04 | 江苏清微智能科技有限公司 | Speech enhancement method and device based on complex frequency spectrum characteristics |
CN114155872A (en) * | 2021-12-16 | 2022-03-08 | 云知声智能科技股份有限公司 | Single-channel voice noise reduction method and device, electronic equipment and storage medium |
CN114333811A (en) * | 2020-09-30 | 2022-04-12 | 中国移动通信有限公司研究院 | Voice recognition method, system and equipment |
CN114974292A (en) * | 2022-05-23 | 2022-08-30 | 维沃移动通信有限公司 | Audio enhancement method and device, electronic equipment and readable storage medium |
CN115472153A (en) * | 2021-06-11 | 2022-12-13 | 阿里巴巴新加坡控股有限公司 | Voice enhancement system, method, device and equipment |
WO2023102930A1 (en) * | 2021-12-10 | 2023-06-15 | 清华大学深圳国际研究生院 | Speech enhancement method, electronic device, program product, and storage medium |
DE102022209004B3 (en) * | 2022-08-30 | 2024-01-04 | Friedrich-Alexander-Universität Erlangen-Nürnberg, Körperschaft des öffentlichen Rechts | Device and method for processing an audio signal |
CN117461042A (en) * | 2021-06-11 | 2024-01-26 | 纽奥斯通讯有限公司 | System and method for self-attention based combining of multi-channel signals for speech processing |
CN117496990A (en) * | 2023-05-23 | 2024-02-02 | 马上消费金融股份有限公司 | Speech denoising method, device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
CHENG XUE et al.: "Real-Time Multi-Channel Speech Enhancement Based on Neural Network Masking with Attention Model", INTERSPEECH 2021, 3 September 2021 (2021-09-03), pages 1862 - 1866 * |
Also Published As
Publication number | Publication date |
---|---|
CN117935838B (en) | 2024-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491404B (en) | Voice processing method, device, terminal equipment and storage medium | |
US10839309B2 (en) | Data training in multi-sensor setups | |
US20180075864A1 (en) | Methods and systems for improved signal decomposition | |
CN110853663B (en) | Speech enhancement method based on artificial intelligence, server and storage medium | |
CN111863015B (en) | Audio processing method, device, electronic equipment and readable storage medium | |
US20220068288A1 (en) | Signal processing apparatus, signal processing method, and program | |
JP6987075B2 (en) | Audio source separation | |
CN110223708B (en) | Speech enhancement method based on speech processing and related equipment | |
CN110400572B (en) | Audio enhancement method and system | |
CN109119090A (en) | Method of speech processing, device, storage medium and electronic equipment | |
CN112767927A (en) | Method, device, terminal and storage medium for extracting voice features | |
US9997166B2 (en) | Method, terminal, system for audio encoding/decoding/codec | |
CN107895580B (en) | Audio signal reconstruction method and device | |
CN114242044A (en) | Voice quality evaluation method, voice quality evaluation model training method and device | |
CN115457975A (en) | Method and device for detecting baby crying and coughing, storage medium and terminal equipment | |
CN110808061B (en) | Voice separation method and device, mobile terminal and computer readable storage medium | |
CN117935838B (en) | Audio acquisition method and device, electronic equipment and storage medium | |
CN113889135A (en) | Method for estimating direction of arrival of sound source, electronic equipment and chip system | |
CN113496706A (en) | Audio processing method and device, electronic equipment and storage medium | |
US12020154B2 (en) | Data processing method, electronic device and computer-readable medium | |
CN112133279B (en) | Vehicle-mounted information broadcasting method and device and terminal equipment | |
CN112349277A (en) | Feature domain voice enhancement method combined with AI model and related product | |
CN116982111A (en) | Audio characteristic compensation method, audio identification method and related products | |
CN114512141B (en) | Method, apparatus, device, storage medium and program product for audio separation | |
CN113808606B (en) | Voice signal processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||