CN113921022A - Audio signal separation method, device, storage medium and electronic equipment - Google Patents

Audio signal separation method, device, storage medium and electronic equipment

Info

Publication number
CN113921022A
Authority
CN
China
Prior art keywords
audio signal
target
frequency spectrum
spectrum
mask
Prior art date
Legal status
Granted
Application number
CN202111517138.5A
Other languages
Chinese (zh)
Other versions
CN113921022B (en)
Inventor
智鹏鹏
陈昌滨
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202111517138.5A
Publication of CN113921022A
Application granted
Publication of CN113921022B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques using spectral analysis, using orthogonal transformation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The present disclosure relates to an audio signal separation method, apparatus, storage medium, and electronic device, wherein the method comprises: acquiring an original audio signal which comprises a target audio signal and a background audio signal; carrying out short-time Fourier transform processing on the original audio signal to obtain the frequency spectrum of the original audio signal; inputting an original audio signal into a preset separation model to obtain a first mask corresponding to a target audio signal, and generating an amplitude spectrum corresponding to the target audio signal based on the first mask and an amplitude spectrum in the frequency spectrum of the original audio signal; obtaining a first target frequency spectrum corresponding to the target audio signal based on the amplitude spectrum corresponding to the target audio signal and the phase spectrum in the frequency spectrum of the original audio signal; inputting the first target frequency spectrum into a voice enhancement model to obtain a second mask corresponding to the target audio signal, and determining a second target frequency spectrum corresponding to the target audio signal based on the second mask and the first target frequency spectrum; and carrying out short-time inverse Fourier transform processing on the second target frequency spectrum to obtain a target audio signal.

Description

Audio signal separation method, device, storage medium and electronic equipment
Technical Field
The disclosed embodiments relate to the field of audio signal processing technologies, and in particular, to an audio signal separation method, an audio signal separation apparatus, and a computer-readable storage medium and an electronic device for implementing the audio signal separation method.
Background
Music classroom teaching plays an important role in the field of education, but the music signal in a classroom is often a mixture of human voice, background music such as accompaniment, and noise. For more convenient teaching, a music classroom often needs to extract, for example, the human voice, so how to separate the human voice from a music signal becomes a problem to be solved.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, embodiments of the present disclosure provide an audio signal separation method, an audio signal separation apparatus, and a computer-readable storage medium and an electronic device implementing the audio signal separation method.
In a first aspect, an embodiment of the present disclosure provides an audio signal separation method, including:
acquiring an original audio signal to be separated, wherein the original audio signal comprises a target audio signal and a background audio signal;
carrying out short-time Fourier transform processing on the original audio signal to obtain a frequency spectrum of the original audio signal, wherein the frequency spectrum comprises a phase spectrum and an amplitude spectrum;
inputting the original audio signal into a preset separation model to obtain a first mask corresponding to the target audio signal, and generating an amplitude spectrum corresponding to the target audio signal based on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal;
obtaining a first target frequency spectrum corresponding to the target audio signal based on the amplitude spectrum corresponding to the target audio signal and the phase spectrum in the frequency spectrum of the original audio signal;
inputting the first target frequency spectrum into a voice enhancement model to obtain a second mask corresponding to the target audio signal, and determining a second target frequency spectrum corresponding to the target audio signal based on the second mask and the first target frequency spectrum;
and carrying out short-time inverse Fourier transform processing on the second target frequency spectrum to obtain a target audio signal.
In one embodiment, the speech enhancement model is a speech enhancement model with attention mechanism;
the inputting the first target frequency spectrum into a speech enhancement model to obtain a second mask corresponding to the target audio signal includes:
extracting characteristic information of the first target frequency spectrum;
extracting target feature information in the feature information based on the attention mechanism;
determining a second mask based on the feature information and the target feature information.
In one embodiment, before the short-time fourier transform processing on the original audio signal, the method includes:
adding noise signals with different preset signal-to-noise ratios to the original audio signal to obtain a mixed audio signal;
and taking the mixed audio signal as a new original audio signal, and returning to the step of performing short-time Fourier transform processing on the original audio signal.
In one embodiment, the performing short-time fourier transform processing on the original audio signal includes:
preprocessing the original audio signal to obtain a preprocessed audio signal; wherein the preprocessing comprises framing processing and windowing function processing;
and carrying out short-time Fourier transform processing on the preprocessed audio signal.
In one embodiment, the generating a corresponding magnitude spectrum of the target audio signal based on the first mask and a magnitude spectrum in a frequency spectrum of the original audio signal includes:
and carrying out Hadamard product processing on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal to obtain the amplitude spectrum corresponding to the target audio signal.
In one embodiment, the determining a second target spectrum corresponding to the target audio signal based on the second mask and the first target spectrum includes:
and performing dot multiplication processing on the second mask and the first target frequency spectrum to obtain a second target frequency spectrum corresponding to the target audio signal.
In one embodiment, the preset separation model comprises a GRU (Gated Recurrent Unit) neural network model.
In a second aspect, an embodiment of the present disclosure provides an audio signal separation apparatus, including:
the device comprises an acquisition module, a separation module and a processing module, wherein the acquisition module is used for acquiring an original audio signal to be separated, and the original audio signal comprises a target audio signal and a background audio signal;
the transformation module is used for carrying out short-time Fourier transformation processing on the original audio signal to obtain a frequency spectrum of the original audio signal, wherein the frequency spectrum comprises a phase spectrum and an amplitude spectrum;
the first processing module is used for inputting the original audio signal into a preset separation model to obtain a first mask corresponding to the target audio signal, and generating an amplitude spectrum corresponding to the target audio signal based on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal;
the second processing module is used for obtaining a first target frequency spectrum corresponding to the target audio signal based on the amplitude spectrum corresponding to the target audio signal and the phase spectrum in the frequency spectrum of the original audio signal;
the enhancement processing module is used for inputting the first target frequency spectrum into a voice enhancement model to obtain a second mask corresponding to the target audio signal, and determining a second target frequency spectrum corresponding to the target audio signal based on the second mask and the first target frequency spectrum;
and the inverse transformation module is used for carrying out inverse short-time Fourier transform processing on the second target frequency spectrum to obtain a target audio signal.
In a third aspect, the disclosed embodiments provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the audio signal separation method according to any one of the above embodiments.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including:
a processor; and
a memory for storing a computer program;
wherein the processor is configured to perform the steps of the audio signal separation method of any of the above embodiments via execution of the computer program.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
in the embodiment of the disclosure, an original audio signal to be separated is obtained, where the original audio signal includes a target audio signal and a background audio signal; carrying out short-time Fourier transform processing on the original audio signal to obtain a frequency spectrum of the original audio signal, wherein the frequency spectrum comprises a phase spectrum and an amplitude spectrum; inputting the original audio signal into a preset separation model to obtain a first mask corresponding to the target audio signal, and generating an amplitude spectrum corresponding to the target audio signal based on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal; obtaining a first target frequency spectrum corresponding to the target audio signal based on the amplitude spectrum corresponding to the target audio signal and the phase spectrum in the frequency spectrum of the original audio signal; inputting the first target frequency spectrum into a voice enhancement model to obtain a second mask corresponding to the target audio signal, and determining a second target frequency spectrum corresponding to the target audio signal based on the second mask and the first target frequency spectrum; and carrying out short-time inverse Fourier transform processing on the second target frequency spectrum to obtain a target audio signal. 
Thus, the frequency spectrum of the original audio signal, namely the phase spectrum and the amplitude spectrum, is obtained through short-time Fourier transform processing, and the first mask corresponding to the target audio signal is obtained through the preset separation model. An amplitude spectrum corresponding to the target audio signal is then generated based on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal, and a first target frequency spectrum corresponding to the target audio signal is obtained based on that amplitude spectrum and the phase spectrum in the frequency spectrum of the original audio signal. As a result, the signal input into the speech enhancement model, namely the first target frequency spectrum corresponding to the target audio signal, contains both the amplitude spectrum and the phase spectrum, so the speech enhancement stage takes phase information into account. The accuracy of separating an original audio signal such as a music signal is therefore improved, and the audio signal separation effect is better.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a flow chart of an audio signal separation method according to an embodiment of the disclosure;
FIG. 2 is a flowchart of an audio signal separation method according to another embodiment of the disclosure;
FIG. 3 is a schematic diagram of an audio signal separation apparatus according to an embodiment of the disclosure;
fig. 4 is a schematic diagram of an electronic device implementing an audio signal separation method according to an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
It is to be understood that, hereinafter, "at least one" means one or more, "a plurality" means two or more. "and/or" is used to describe the association relationship of the associated objects, meaning that there may be three relationships, for example, "a and/or B" may mean: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
Background music such as accompaniment in a music signal is composed of various musical instruments with different timbres mixed together; the sound signals of multiple instrument sound sources are usually superimposed, so compared with a common speech signal, background music is a complex audio signal. In the related art, human voice separation for music signals has been based on signal processing, such as robust principal component analysis, which improves separation performance, but the result is still not ideal. With the development of deep learning, neural networks have shown good nonlinear modeling capacity for music signals, further improving separation performance.
However, background music such as accompaniment is usually non-stationary and has a harmonic structure, which makes separation difficult. To mitigate this, the separation stage may select a GRU (Gated Recurrent Unit) network; the speech enhancement stage, however, is currently performed on the magnitude spectrum in the frequency domain, which discards phase information, so the separation accuracy for music signals is reduced.
In view of the above, the present disclosure provides an audio signal separation method, and fig. 1 is a flowchart of an audio signal separation method according to an embodiment of the present disclosure, where the audio signal separation method may be executed by an electronic device, such as a computer, an intelligent mobile device, and the like, and specifically includes the following steps:
step S101: original audio signals to be separated are obtained, and the original audio signals comprise target audio signals and background audio signals.
Illustratively, the original audio signal X may include a music signal of a classroom, and specifically may include a target audio signal X1, such as a human voice signal, and a background audio signal X2, such as a background music (e.g., accompaniment) signal. The original audio signal X may be recorded by an audio recording device, such as a recording application of a smart phone, but is not limited thereto.
Step S102: and carrying out short-time Fourier transform processing on the original audio signal to obtain a frequency spectrum of the original audio signal, wherein the frequency spectrum comprises a phase spectrum and an amplitude spectrum.
Specifically, the Short-Time Fourier Transform (STFT) is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a time-varying signal, as can be understood with reference to the prior art. In this embodiment, the original audio signal X is processed by a short-time Fourier transform to obtain its frequency spectrum, such as a phase spectrum and an amplitude spectrum. The frequency spectrum obtained by the short-time Fourier transform is usually a complex spectrum containing both phase and amplitude information, embodied, for example, by a phase spectrum matrix P and an amplitude spectrum matrix A.
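As a minimal illustrative sketch (not part of the patent) of how step S102 can be realized, assuming Python with NumPy and SciPy, the complex STFT spectrum is split into an amplitude spectrum matrix A and a phase spectrum matrix P; the sample rate and tone are arbitrary stand-ins:

```python
import numpy as np
from scipy.signal import stft

# Stand-in for the original audio signal X: one second of a 440 Hz tone.
fs = 16000                       # assumed sample rate
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

# STFT returns a complex spectrogram; magnitude and angle give A and P.
_, _, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
A = np.abs(Z)    # amplitude spectrum matrix A
P = np.angle(Z)  # phase spectrum matrix P

# The complex spectrum can be reassembled from A and P.
Z_rebuilt = A * np.exp(1j * P)
```

This makes explicit that the complex spectrum carries both pieces of information, which the later steps recombine.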
Step S103: inputting the original audio signal into a preset separation model to obtain a first mask corresponding to the target audio signal, and generating an amplitude spectrum corresponding to the target audio signal based on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal.
For example, the preset separation model may include, but is not limited to, a GRU (Gated Recurrent Unit) neural network model. In one specific example, a five-layer GRU neural network is adopted, where the hidden layer of each GRU layer may be provided with, for example, 512 neurons, and the network may end with, for example, two fully-connected layers, but it is not limited thereto.
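To make the GRU building block concrete, here is a minimal NumPy sketch (not the patent's five-layer network) of a single GRU cell step with the standard update/reset gating equations; the toy sizes and random weights are illustrative assumptions, and biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One step of a Gated Recurrent Unit (GRU) cell.

    z is the update gate, r the reset gate, h the hidden state.
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde           # new hidden state

in_dim, hid = 8, 16  # toy sizes; the patent's example uses 512 hidden units
params = [rng.standard_normal(s) * 0.1
          for s in [(hid, in_dim), (hid, hid)] * 3]
h = np.zeros(hid)
for _ in range(5):   # run a short toy input sequence
    x = rng.standard_normal(in_dim)
    h = gru_step(x, h, *params)
```

Because each new state is a convex combination of the old state and a tanh candidate, the hidden values stay bounded in (-1, 1), which is part of why GRUs handle non-stationary sequences stably.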
In one example, the original audio signal X is input into a preset separation model, such as a GRU neural network model, to obtain a first mask Y1 corresponding to the target audio signal X1 and a mask corresponding to the background audio signal X2; the mask corresponding to the background audio signal X2 is ignored and not processed further. Based on the first mask Y1 and the amplitude spectrum in the frequency spectrum of the original audio signal X, such as the amplitude spectrum matrix A, the amplitude spectrum corresponding to the target audio signal X1 is generated, such as the amplitude spectrum matrix A1. The first mask Y1 may be an Ideal Ratio Mask (IRM), but is not limited thereto.
Step S104: and obtaining a first target frequency spectrum corresponding to the target audio signal based on the amplitude spectrum corresponding to the target audio signal and the phase spectrum in the frequency spectrum of the original audio signal.
Illustratively, based on the amplitude spectrum corresponding to the target audio signal X1, such as the amplitude spectrum matrix A1, and the phase spectrum in the frequency spectrum of the original audio signal X, such as the phase spectrum matrix P, the first target frequency spectrum corresponding to the target audio signal X1 is determined, such as the first target spectrum matrix M1. At this point the first target spectrum, such as the matrix M1, contains both the corresponding phase information and amplitude information.
Step S105: inputting the first target frequency spectrum into a voice enhancement model to obtain a second mask corresponding to the target audio signal, and determining a second target frequency spectrum corresponding to the target audio signal based on the second mask and the first target frequency spectrum.
Illustratively, the first target spectrum, such as the first target spectrum matrix M1, is input into the speech enhancement model for speech enhancement processing to obtain a second mask Y2 corresponding to the target audio signal X1. Based on the second mask Y2 and the first target spectrum, such as the matrix M1, the second target spectrum corresponding to the target audio signal X1 is determined, such as the second target spectrum matrix M2.
It will be appreciated that the second target spectrum obtained here, such as the second target spectrum matrix M2, contains both amplitude and phase information; at this point both the separation and the enhancement of the target audio signal X1 have been performed in the frequency domain.
Step S106: and carrying out short-time inverse Fourier transform processing on the second target frequency spectrum to obtain a target audio signal.
Illustratively, the second target spectrum, such as the second target spectrum matrix M2, is then subjected to inverse short-time Fourier transform (ISTFT) processing, i.e., transformed back to the time domain, to obtain the target audio signal X1.
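As an illustrative sketch of the time-domain reconstruction in step S106 (assuming SciPy; the signal is a stand-in, not patent data), a forward STFT followed by an ISTFT with a COLA-satisfying window reconstructs the signal almost exactly:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)   # stand-in for a separated target signal

# Forward STFT, then inverse STFT back to the time domain.
_, _, M2 = stft(x, fs=fs, nperseg=512, noverlap=384)
_, x_rec = istft(M2, fs=fs, nperseg=512, noverlap=384)

# With a COLA-satisfying window (Hann at 75% overlap) the
# roundtrip reconstructs the signal up to floating-point error.
err = np.max(np.abs(x_rec[: len(x)] - x))
```

In the patent's pipeline the ISTFT input would be the enhanced spectrum M2 rather than an unmodified STFT, but the inverse transform itself is the same operation.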
The audio signal separation method of the disclosed embodiment obtains the frequency spectrum of the original audio signal, i.e. the phase spectrum and the amplitude spectrum, through short-time Fourier transform processing, and obtains the first mask corresponding to the target audio signal through the preset separation model. It then generates the amplitude spectrum corresponding to the target audio signal based on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal, and obtains the first target frequency spectrum corresponding to the target audio signal based on that amplitude spectrum and the phase spectrum in the frequency spectrum of the original audio signal. Consequently, the signal input into the speech enhancement model, i.e. the first target frequency spectrum corresponding to the target audio signal, contains both the amplitude spectrum and the phase spectrum, so phase information is considered in the speech enhancement stage. The separation accuracy for an original audio signal such as a music signal is thereby improved, and the audio signal separation effect is better.
Based on the above implementation, in one embodiment, the speech enhancement model is a speech enhancement model with an attention mechanism. Illustratively, the speech enhancement model may be a network with an attention mechanism obtained, for example, by adding an attention layer to a convolutional recurrent network (CRN). The attention layer is added to the skip-connection part, so that the detailed features extracted by the encoder part of the model can be passed to the decoder, and information contributing to speech enhancement, such as feature information corresponding to the target audio signal, can be retained. In an embodiment, inputting the first target spectrum into the speech enhancement model to obtain the second mask corresponding to the target audio signal may specifically include the following steps:
step S201: and extracting the characteristic information of the first target frequency spectrum.
Illustratively, the first target spectrum, such as the first target spectrum matrix M1, is input into the speech enhancement model for speech enhancement processing. An encoder in the speech enhancement model performs feature extraction on the first target spectrum matrix M1 to obtain feature information, namely feature information X' corresponding to the target audio signal X1, which may include noise feature information.
Step S202: and extracting target feature information in the feature information based on the attention mechanism.
Illustratively, the attention layer in the speech enhancement model extracts target feature information X'' from the feature information X' corresponding to the target audio signal X1 produced by the encoder, such as the remaining feature information in X' other than the noise feature information.
Step S203: determining a second mask based on the feature information and the target feature information.
Illustratively, the second mask Y2 corresponding to the target audio signal X1 may be obtained by fusing the feature information X' and the target feature information X''. The second mask Y2 may also be an ideal ratio mask.
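The patent does not specify the attention or fusion computation, so the following is a loose NumPy sketch of one common choice, scaled dot-product self-attention over encoder features followed by a sigmoid fusion into a mask-like output; all names and sizes here are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(a, axis=-1):
    # Numerically stable softmax.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d = 20, 32                          # toy sizes: time steps, feature dim
X_feat = rng.standard_normal((T, d))   # stand-in for encoder features X'

# Scaled dot-product self-attention: attended features emphasize the
# time steps most relevant to the target (stand-in for X'').
scores = (X_feat @ X_feat.T) / np.sqrt(d)
weights = softmax(scores, axis=-1)     # each row sums to 1
X_att = weights @ X_feat

# Fuse X' and X'' and squash to (0, 1) to obtain a mask-like output Y2.
Y2 = 1.0 / (1.0 + np.exp(-(X_feat + X_att)))
```

A ratio-mask output is conventionally constrained to (0, 1), which the sigmoid provides; a trained model would learn projection weights instead of using the raw features directly.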
The audio signal separation method of this embodiment ensures that the signal input into the speech enhancement model, i.e. the first target spectrum corresponding to the target audio signal, includes both the amplitude spectrum and the phase spectrum, so that phase information is considered in the speech enhancement stage. In addition, a speech enhancement model with an attention mechanism is adopted, which further improves the accuracy of separating the original audio signal, such as a music signal, and yields a better separation effect.
In one embodiment, before the short-time fourier transform processing on the original audio signal, the method includes:
step i): and adding noise signals with different preset signal-to-noise ratios to the original audio signal to obtain a mixed audio signal.
Illustratively, noise signals with different preset signal-to-noise ratios are added to the original audio signal, such as, but not limited to, signal-to-noise ratios of -5 dB, -4 dB, and -3 dB. This allows the resulting mixed audio signal to better simulate the noise components of a music teaching scene.
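Mixing noise at a target SNR amounts to scaling the noise so that the clean-to-noise power ratio matches the requested value; a small sketch (illustrative, assuming NumPy; signals are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Want p_clean / (scale^2 * p_noise) == 10^(snr_db/10).
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)   # stand-in original audio signal
noise = rng.standard_normal(fs)

mixed = mix_at_snr(clean, noise, -5)  # e.g. -5 dB, as in the text
# Verify the achieved SNR of the added noise component.
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((mixed - clean) ** 2))
```

At -5 dB the noise power is roughly three times the signal power, which is a deliberately harsh training condition.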
Step ii): and taking the mixed audio signal as a new original audio signal, and returning to the step of performing short-time Fourier transform processing on the original audio signal.
That is, the new original audio signal to which the noise signal is added is subjected to the short-time fourier transform processing in step S102, and then the processing procedures of steps S103 to S106 are performed.
In one embodiment, the short-time fourier transform processing on the original audio signal includes the following steps:
step a): preprocessing the original audio signal to obtain a preprocessed audio signal; wherein the preprocessing comprises framing processing and windowing function processing.
Illustratively, the framing process divides the original audio signal X into frames with, for example, a frame length of 25 ms and a frame shift of 6.25 ms, and the windowing process applies a Hamming window, such as a 2048-point Hamming window, but it is not limited thereto.
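A sketch of framing plus windowing with those parameters (illustrative, assuming NumPy and a 16 kHz sample rate, which makes 25 ms equal 400 samples; note the 2048-point window mentioned in the text would imply a different frame size or sample rate, so here the window length simply matches the frame):

```python
import numpy as np

fs = 16000
frame_len = int(0.025 * fs)     # 400 samples per 25 ms frame
hop = int(0.00625 * fs)         # 100-sample shift per 6.25 ms

t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)  # stand-in original audio signal

# Build an index matrix so each row selects one overlapping frame,
# then apply a Hamming window to every frame.
n_frames = 1 + (len(x) - frame_len) // hop
idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
frames = x[idx] * np.hamming(frame_len)
```

The high overlap (75%) keeps adjacent frames strongly correlated, which benefits the subsequent short-time Fourier analysis.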
Step b): and carrying out short-time Fourier transform processing on the preprocessed audio signal.
For example, after the framing processing and the windowing function processing, the short-time fourier transform processing is performed, and a specific short-time fourier transform processing process can be understood with reference to the prior art, and is not described herein again.
Optionally, in an embodiment, generating a magnitude spectrum corresponding to the target audio signal based on the first mask and the magnitude spectrum in the frequency spectrum of the original audio signal may specifically include: and carrying out Hadamard product processing on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal to obtain the amplitude spectrum corresponding to the target audio signal.
Illustratively, Hadamard product (element-wise product) processing is performed on the first mask Y1 and the amplitude spectrum in the frequency spectrum of the original audio signal X, such as the amplitude spectrum matrix A, to generate the amplitude spectrum corresponding to the target audio signal X1, such as the amplitude spectrum matrix A1. The specific procedure of the Hadamard product can be understood with reference to the prior art and is not described here again.
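A toy sketch of an ideal ratio mask applied via the Hadamard product (illustrative, assuming NumPy; it treats magnitudes as additive, which is an approximation that holds only in this constructed example):

```python
import numpy as np

rng = np.random.default_rng(4)
bins, frames = 257, 10

# Toy magnitudes for a target and a background source.
A_target = rng.random((bins, frames))
A_background = rng.random((bins, frames))
A = A_target + A_background          # mixture amplitude spectrum matrix A

# Ideal Ratio Mask for the target (epsilon guards against division by 0),
# then the Hadamard (element-wise) product with the mixture magnitude.
Y1 = A_target / (A_target + A_background + 1e-12)
A1 = Y1 * A                          # target amplitude spectrum A1
```

In this additive toy case the masked mixture recovers the target magnitude exactly; with a learned mask on real audio it is only an estimate.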
Optionally, in an embodiment, the determining a second target spectrum corresponding to the target audio signal based on the second mask and the first target spectrum may specifically include: and performing dot multiplication processing on the second mask and the first target frequency spectrum to obtain a second target frequency spectrum corresponding to the target audio signal.
For example, dot multiplication processing is performed on the second mask Y2 and the first target frequency spectrum, such as the first target spectrum matrix M1, to determine the second target frequency spectrum corresponding to the target audio signal X1, such as the second target spectrum matrix M2.
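Applying the second mask element-wise to the first target spectrum can likewise be sketched; here M1 is built, as described above, from a target amplitude spectrum and the original phase spectrum (all values are random stand-ins):

```python
import numpy as np

n_frames, n_bins = 157, 1025
A1 = np.abs(np.random.randn(n_frames, n_bins))                # target amplitude from the first stage
phase = np.random.uniform(-np.pi, np.pi, (n_frames, n_bins))  # phase of the original signal

M1 = A1 * np.exp(1j * phase)                         # first target spectrum: amplitude + original phase
Y2 = np.random.uniform(0, 1, (n_frames, n_bins))     # second mask from the enhancement model

M2 = Y2 * M1                                         # element-wise product -> second target spectrum
```

Note that a real-valued mask scales each bin's amplitude while leaving its phase unchanged, which is why phase information survives into the enhancement stage.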
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc. Additionally, it will also be readily appreciated that the steps may be performed synchronously or asynchronously, e.g., among multiple modules/processes/threads.
An embodiment of the present disclosure provides an audio signal separation apparatus, and the audio signal separation apparatus shown in fig. 3 may include:
an obtaining module 301, configured to obtain an original audio signal to be separated, where the original audio signal includes a target audio signal and a background audio signal;
a transform module 302, configured to perform short-time fourier transform processing on the original audio signal to obtain a frequency spectrum of the original audio signal, where the frequency spectrum includes a phase spectrum and an amplitude spectrum;
a first processing module 303, configured to input the original audio signal into a preset separation model to obtain a first mask corresponding to the target audio signal, and generate an amplitude spectrum corresponding to the target audio signal based on the first mask and an amplitude spectrum in a frequency spectrum of the original audio signal;
a second processing module 304, configured to obtain a first target frequency spectrum corresponding to the target audio signal based on an amplitude spectrum corresponding to the target audio signal and a phase spectrum in a frequency spectrum of the original audio signal;
an enhancement processing module 305, configured to input the first target frequency spectrum into a speech enhancement model to obtain a second mask corresponding to the target audio signal, and determine a second target frequency spectrum corresponding to the target audio signal based on the second mask and the first target frequency spectrum;
and the inverse transformation module 306 is configured to perform short-time inverse fourier transform processing on the second target frequency spectrum to obtain a target audio signal.
The audio signal separation device of the disclosed embodiment obtains the frequency spectrum of the original audio signal, i.e. the phase spectrum and the amplitude spectrum, through short-time Fourier transform processing, and obtains a first mask corresponding to the target audio signal through a preset separation model. It then generates the amplitude spectrum corresponding to the target audio signal based on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal, and obtains a first target frequency spectrum corresponding to the target audio signal based on that amplitude spectrum and the phase spectrum in the frequency spectrum of the original audio signal. In this way, the signal input into the speech enhancement model is the first target frequency spectrum, which carries both the amplitude spectrum and the phase spectrum corresponding to the target audio signal, so that phase information is taken into account in the speech enhancement stage. This improves the separation accuracy of the original audio signal, e.g. a music signal, and makes the audio signal separation effective.
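A minimal end-to-end sketch of the two-stage flow just summarized is given below, assuming NumPy and placeholder mask functions in place of the separation and enhancement models (whose architectures the disclosure leaves unspecified); the frame parameters follow the earlier example:

```python
import numpy as np

def stft(x, frame_len, hop, n_fft):
    w = np.hamming(frame_len)
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([np.fft.rfft(x[i*hop:i*hop+frame_len] * w, n=n_fft)
                     for i in range(n)])

def istft(spec, frame_len, hop, length):
    """Windowed overlap-add inverse STFT with window-squared normalization."""
    w = np.hamming(frame_len)
    x = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(spec):
        s = np.fft.irfft(frame)[:frame_len]       # drop the zero-padded tail
        x[i*hop:i*hop+frame_len] += s * w
        norm[i*hop:i*hop+frame_len] += w**2
    return x / np.maximum(norm, 1e-8)

def separate(x, sep_mask_fn, enh_mask_fn, frame_len=400, hop=100, n_fft=2048):
    spec = stft(x, frame_len, hop, n_fft)          # spectrum of the original signal
    A, phase = np.abs(spec), np.angle(spec)
    A1 = sep_mask_fn(x, A) * A                     # first mask -> target amplitude spectrum
    M1 = A1 * np.exp(1j * phase)                   # first target spectrum (with original phase)
    M2 = enh_mask_fn(M1) * M1                      # second mask -> second target spectrum
    return istft(M2, frame_len, hop, len(x))       # inverse STFT -> target audio signal

# With all-ones placeholder masks the pipeline reduces to an STFT/ISTFT round trip.
x = np.random.randn(16000)
y = separate(x, lambda x, A: np.ones_like(A), lambda M: np.ones_like(np.abs(M)))
```

The all-ones masks make a useful sanity check: the output should match the input up to floating-point error, confirming the transform/inverse-transform plumbing before real models are plugged in.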
In one embodiment, the speech enhancement model is a speech enhancement model with attention mechanism. The enhancement processing module 305 is further configured to: extracting characteristic information of the first target frequency spectrum; extracting target feature information in the feature information based on the attention mechanism; determining a second mask based on the feature information and the target feature information.
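The three steps just listed — extracting feature information, attending to it, and deriving the second mask — could look like the following single-head scaled dot-product attention sketch. The weight matrices, feature dimension, and sigmoid output layer are all assumptions; the disclosure does not specify the enhancement model's internal architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_mask(features, Wq, Wk, Wv, Wo):
    """features: (n_frames, d) feature information of the first target spectrum."""
    Q, K, V = features @ Wq, features @ Wk, features @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # attention over frames
    target = weights @ V                               # target feature information
    # Second mask determined from both the features and the attended target features.
    return sigmoid(np.concatenate([features, target], axis=1) @ Wo)

rng = np.random.default_rng(0)
d, n_bins, n_frames = 32, 1025, 157                    # hypothetical dimensions
feats = rng.standard_normal((n_frames, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Wo = rng.standard_normal((2 * d, n_bins)) * 0.1
Y2 = attention_mask(feats, Wq, Wk, Wv, Wo)             # second mask, one value per T-F bin
```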
In one embodiment, the apparatus further comprises a preprocessing module configured to: add noise signals with different preset signal-to-noise ratios to the original audio signal to obtain a mixed audio signal, take the mixed audio signal as a new original audio signal, and trigger the transform module 302 to perform the short-time Fourier transform processing on the new original audio signal.
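Mixing noise into the original signal at a preset signal-to-noise ratio can be sketched as follows (NumPy; the SNR values and white-noise source are hypothetical stand-ins):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the mixture has the requested signal-to-noise ratio in dB."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
# Mixtures at several preset SNRs, each usable as a new "original" signal.
mixtures = {snr: mix_at_snr(clean, noise, snr) for snr in (-5, 0, 5, 10)}
```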
In one embodiment, the transformation module 302 is further configured to: preprocessing the original audio signal to obtain a preprocessed audio signal; wherein the preprocessing comprises framing processing and windowing function processing; and carrying out short-time Fourier transform processing on the preprocessed audio signal.
In one embodiment, the first processing module 303 is further configured to: and carrying out Hadamard product processing on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal to obtain the amplitude spectrum corresponding to the target audio signal.
In one embodiment, the enhancement processing module 305 is further configured to: and performing dot multiplication processing on the second mask and the first target frequency spectrum to obtain a second target frequency spectrum corresponding to the target audio signal.
In one embodiment, the preset separation model includes, but is not limited to, a GRU (Gated Recurrent Unit) neural network model.
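Since the disclosure names a GRU but does not give its architecture, here is a minimal NumPy GRU cell processing magnitude frames; in a real separation model the hidden states would feed an output layer that predicts the first mask (the dimensions and initialization are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRUCell:
    """Minimal gated recurrent unit: update gate z, reset gate r, candidate state."""
    def __init__(self, input_size, hidden_size, rng):
        s = 1.0 / np.sqrt(hidden_size)
        self.Wz = rng.uniform(-s, s, (input_size + hidden_size, hidden_size))
        self.Wr = rng.uniform(-s, s, (input_size + hidden_size, hidden_size))
        self.Wh = rng.uniform(-s, s, (input_size + hidden_size, hidden_size))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.Wz)                                 # update gate
        r = sigmoid(xh @ self.Wr)                                 # reset gate
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ self.Wh)   # candidate state
        return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(2)
cell = GRUCell(input_size=1025, hidden_size=64, rng=rng)
h = np.zeros(64)
frames = rng.standard_normal((157, 1025))   # stand-in magnitude frames of the original signal
for frame in frames:
    h = cell.step(frame, h)                 # hidden state after consuming each frame
```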
The specific manner in which the respective modules of the above apparatus embodiments perform their operations, and the corresponding technical effects, have been described in detail in the embodiments related to the method and will not be described in detail here.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further embodied by a plurality of modules or units. The components shown as modules or units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed scheme. One of ordinary skill in the art can understand and implement this without inventive effort.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the electronic device to perform a method according to the above embodiments of the disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 4, a block diagram of an electronic device 800, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800, and the input unit 806 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. Output unit 807 can be any type of device capable of presenting information and can include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 808 may include, but is not limited to, a magnetic disk, an optical disk. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the respective methods and processes described above. For example, in some embodiments, the methods of the above embodiments may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. In some embodiments, the computing unit 801 may be configured to perform the methods of the embodiments described above in any other suitable manner (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An audio signal separation method, comprising:
acquiring an original audio signal to be separated, wherein the original audio signal comprises a target audio signal and a background audio signal;
carrying out short-time Fourier transform processing on the original audio signal to obtain a frequency spectrum of the original audio signal, wherein the frequency spectrum comprises a phase spectrum and an amplitude spectrum;
inputting the original audio signal into a preset separation model to obtain a first mask corresponding to the target audio signal, and generating an amplitude spectrum corresponding to the target audio signal based on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal;
obtaining a first target frequency spectrum corresponding to the target audio signal based on the amplitude spectrum corresponding to the target audio signal and the phase spectrum in the frequency spectrum of the original audio signal;
inputting the first target frequency spectrum into a voice enhancement model to obtain a second mask corresponding to the target audio signal, and determining a second target frequency spectrum corresponding to the target audio signal based on the second mask and the first target frequency spectrum;
and carrying out short-time inverse Fourier transform processing on the second target frequency spectrum to obtain a target audio signal.
2. The audio signal separation method according to claim 1, wherein the speech enhancement model is a speech enhancement model with attention mechanism;
the inputting the first target frequency spectrum into a speech enhancement model to obtain a second mask corresponding to the target audio signal includes:
extracting characteristic information of the first target frequency spectrum;
extracting target feature information in the feature information based on the attention mechanism;
determining a second mask based on the feature information and the target feature information.
3. The audio signal separation method according to claim 1 or 2, wherein before the short-time fourier transform processing of the original audio signal, the method comprises:
adding noise signals with different preset signal-to-noise ratios to the original audio signal to obtain a mixed audio signal;
and taking the mixed audio signal as a new original audio signal, and returning to the step of performing short-time Fourier transform processing on the original audio signal.
4. The audio signal separation method according to claim 3, wherein the performing short-time Fourier transform processing on the original audio signal comprises:
preprocessing the original audio signal to obtain a preprocessed audio signal; wherein the preprocessing comprises framing processing and windowing function processing;
and carrying out short-time Fourier transform processing on the preprocessed audio signal.
5. The audio signal separation method according to claim 1 or 2, wherein the generating a corresponding amplitude spectrum of the target audio signal based on the first mask and an amplitude spectrum in a frequency spectrum of the original audio signal comprises:
and carrying out Hadamard product processing on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal to obtain the amplitude spectrum corresponding to the target audio signal.
6. The audio signal separation method according to claim 1 or 2, wherein the determining a second target spectrum corresponding to the target audio signal based on the second mask and the first target spectrum comprises:
and performing dot multiplication processing on the second mask and the first target frequency spectrum to obtain a second target frequency spectrum corresponding to the target audio signal.
7. The audio signal separation method according to claim 1 or 2, wherein the preset separation model comprises a GRU (Gated Recurrent Unit) neural network model.
8. An audio signal separating apparatus, comprising:
the device comprises an acquisition module, a separation module and a processing module, wherein the acquisition module is used for acquiring an original audio signal to be separated, and the original audio signal comprises a target audio signal and a background audio signal;
the transformation module is used for carrying out short-time Fourier transformation processing on the original audio signal to obtain a frequency spectrum of the original audio signal, wherein the frequency spectrum comprises a phase spectrum and an amplitude spectrum;
the first processing module is used for inputting the original audio signal into a preset separation model to obtain a first mask corresponding to the target audio signal, and generating an amplitude spectrum corresponding to the target audio signal based on the first mask and the amplitude spectrum in the frequency spectrum of the original audio signal;
the second processing module is used for obtaining a first target frequency spectrum corresponding to the target audio signal based on the amplitude spectrum corresponding to the target audio signal and the phase spectrum in the frequency spectrum of the original audio signal;
the enhancement processing module is used for inputting the first target frequency spectrum into a voice enhancement model to obtain a second mask corresponding to the target audio signal, and determining a second target frequency spectrum corresponding to the target audio signal based on the second mask and the first target frequency spectrum;
and the inverse transformation module is used for carrying out short-time Fourier inverse transformation processing on the second target frequency spectrum to obtain a target audio signal.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the audio signal separation method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing a computer program;
wherein the processor is configured to perform the steps of the audio signal separation method of any one of claims 1 to 7 via execution of the computer program.
CN202111517138.5A 2021-12-13 2021-12-13 Audio signal separation method, device, storage medium and electronic equipment Active CN113921022B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111517138.5A CN113921022B (en) 2021-12-13 2021-12-13 Audio signal separation method, device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN113921022A true CN113921022A (en) 2022-01-11
CN113921022B CN113921022B (en) 2022-02-25

Family

ID=79248737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111517138.5A Active CN113921022B (en) 2021-12-13 2021-12-13 Audio signal separation method, device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113921022B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160196828A1 (en) * 2015-01-07 2016-07-07 Adobe Systems Incorporated Acoustic Matching and Splicing of Sound Tracks
CN110491407A (en) * 2019-08-15 2019-11-22 广州华多网络科技有限公司 Method, apparatus, electronic equipment and the storage medium of voice de-noising
CN111105809A (en) * 2019-12-31 2020-05-05 云知声智能科技股份有限公司 Noise reduction method and device
CN111899756A (en) * 2020-09-29 2020-11-06 北京清微智能科技有限公司 Single-channel voice separation method and device
CN113035221A (en) * 2021-02-26 2021-06-25 北京达佳互联信息技术有限公司 Training method and device of voice processing model and voice processing method and device
CN113470684A (en) * 2021-07-23 2021-10-01 平安科技(深圳)有限公司 Audio noise reduction method, device, equipment and storage medium


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114678037A (en) * 2022-04-13 2022-06-28 北京远鉴信息技术有限公司 Overlapped voice detection method and device, electronic equipment and storage medium
CN114678037B (en) * 2022-04-13 2022-10-25 北京远鉴信息技术有限公司 Overlapped voice detection method and device, electronic equipment and storage medium
CN114999508A (en) * 2022-07-29 2022-09-02 之江实验室 Universal speech enhancement method and device by using multi-source auxiliary information
CN115277935A (en) * 2022-07-29 2022-11-01 上海喜马拉雅科技有限公司 Background music volume adjusting method and device, electronic equipment and storage medium
CN114999508B (en) * 2022-07-29 2022-11-08 之江实验室 Universal voice enhancement method and device by utilizing multi-source auxiliary information
CN116030821A (en) * 2023-03-27 2023-04-28 北京探境科技有限公司 Audio processing method, device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113921022B (en) 2022-02-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant