CN112989107A - Audio classification and separation method and device, electronic equipment and storage medium

Info

Publication number
CN112989107A
Authority
CN
China
Prior art keywords: audio, layer, classification, inputting, separation
Prior art date
Legal status
Granted
Application number
CN202110537306.0A
Other languages
Chinese (zh)
Other versions
CN112989107B (en)
Inventor
马路
杨嵩
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110537306.0A
Publication of CN112989107A
Application granted
Publication of CN112989107B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/65 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses an audio classification and separation method, an audio classification and separation device, electronic equipment and a storage medium, wherein the method comprises the following steps: determining an audio signal to be processed; inputting the audio signal to be processed into an audio coding module, and transforming it from a one-dimensional time domain into a two-dimensional space, the two-dimensional space comprising the time-domain characteristics of the audio and the channel information of the audio coding module; inputting the audio signal of the two-dimensional space into a multi-scale mapping module, and extracting multi-scale features of the audio signal of the two-dimensional space; inputting the multi-scale features into a cascade layer to obtain spliced multi-scale features, and inputting the spliced multi-scale features into a classifier module to classify the audio signal to be processed; and inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, and inputting the superimposed multi-scale features into a separation network to separate the audio signal to be processed. The invention thereby performs audio classification at the same time as audio separation.

Description

Audio classification and separation method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of audio processing, in particular to an audio classifying and separating method, an audio classifying and separating device, electronic equipment and a storage medium.
Background
Speaker separation is the scientific formulation of the "cocktail party" problem, and aims to separate the voice of a desired speaker from a mixture of multiple speakers. The conventional approach mainly uses spatial information: a microphone array picks up sound from a specific direction with a beam-forming algorithm. However, sound sources at similar positions cannot be separated, and the beam-forming effect degrades severely as environmental noise and reverberation change; meanwhile, the microphone array places high requirements on the consistency between microphones, which raises the hardware cost.
In a voice interaction scene, the quality of voice separation directly influences the back-end speech recognition rate and the listening experience of the user, and is therefore a key core technology of speech processing. In recent years, the separation effect of deep learning methods has been clearly better than that of traditional methods, with deep clustering and permutation invariant training being the two typical network training frameworks. However, both methods only separate well when the audios are completely overlapped, and perform poorly when the number of mixed audios is indefinite or the overlap ratio is small. In actual use, the number of mixed audios is uncertain and the overlapped portions are few, so the performance of these methods degrades seriously and cannot meet practical requirements.
Aiming at the problems in the prior art that the number of mixed audios is uncertain, the separation effect on audio with few overlapped portions is poor, and audio classification cannot be performed at the same time as audio separation, no effective solution has yet been proposed.
Disclosure of Invention
In view of this, embodiments of the present invention provide an audio classification and separation method, an audio classification and separation apparatus, an electronic device, and a storage medium, so as to solve the problems in the prior art that the number of mixed audios is uncertain, the separation effect on audio with few overlapped portions is poor, and audio classification cannot be performed at the same time as audio separation.
Therefore, the embodiment of the invention provides the following technical scheme:
in a first aspect of the present invention, there is provided an audio classifying and separating method, including:
determining an audio signal to be processed;
inputting the audio signal to be processed to an audio coding module, and transforming the audio signal to be processed from a one-dimensional time domain to a two-dimensional space; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
inputting an audio signal of a two-dimensional space to a multi-scale mapping module, and extracting multi-scale features of the audio signal of the two-dimensional space;
inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to a classifier module, and classifying the audio signals to be processed;
and inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, inputting the superimposed multi-scale features into a separation network, and separating the audio signals to be processed.
Optionally, the multi-scale mapping module comprises: a layer normalization layer, a first one-dimensional convolution module and a plurality of groups of cascaded dilated convolution networks; wherein each group of the dilated convolution networks comprises a plurality of cascaded dilated convolution blocks;
inputting the audio signal of the two-dimensional space to the layer normalization layer to obtain a first audio signal;
inputting the first audio signal to the first one-dimensional convolution module, and outputting a second audio signal;
and inputting the second audio signal to the multi-group of the expansion convolution networks, and extracting the multi-scale features.
Optionally, the method further comprises:
the expansion rate of each expanded volume is exponentially multiplied by 2 to be 2i-1; wherein when the swelling volume block is causal, the number of padding 0 is:
(dilation*(kernel_size-1))/2,
in the non-causal case, the number of padding 0 is:
dilation*(kernel_size-1);
partition represents expansion ratio, kernel _ size represents convolution kernel size, X represents the number of expanded volume blocks in each set of the expanded network, i represents the ith volume block, wherein
Figure 744412DEST_PATH_IMAGE001
And the maximum value of i is X.
Optionally, the dilated convolution block comprises: a point-by-point convolution layer, a first PReLU activation function layer, a first normalization layer, a depth convolution layer, a second PReLU activation function layer, a second normalization layer, a second one-dimensional convolution layer and a third one-dimensional convolution layer;
sequentially processing audio signals input to the point-by-point convolution layer through the point-by-point convolution layer, the first PReLU activation function layer, the first normalization layer, the depth convolution layer, the second PReLU activation function layer and the second normalization layer to obtain third audio signals;
inputting the third audio signal to the second one-dimensional convolutional layer and the third one-dimensional convolutional layer respectively to obtain a fourth audio signal and a fifth audio signal respectively;
inputting the fourth audio signal to the cascade layer and the superimposer;
superposing the fifth audio signal with the audio signal input to the point-by-point convolution layer to obtain a sixth audio signal;
inputting the sixth audio signal to the next dilated convolution block.
Optionally, the layer normalization layer comprises:
in a non-real-time scene, the layer normalization is global layer normalization, and the expression is:
gLN(F) = ((F − E[F]) / sqrt(Var[F] + ε)) ⊙ γ + β,
E[F] = (1/(N·T)) · Σ_{N,T} F,
Var[F] = (1/(N·T)) · Σ_{N,T} (F − E[F])²,
where gLN(F) is the result of layer normalization of F, F represents an audio feature of the two-dimensional space, γ and β are trainable parameters, ε is a stability coefficient, E[F] is the mean of F, Var[F] is the variance of F, N represents the length of the channel dimension, and T represents the length of the time dimension;
in a real-time scene, the layer normalization is cumulative layer normalization, and the expression is:
cLN(f_k) = ((f_k − E[f_{t≤k}]) / sqrt(Var[f_{t≤k}] + ε)) ⊙ γ + β,
E[f_{t≤k}] = (1/(N·k)) · Σ_{N, t≤k} f_t,
Var[f_{t≤k}] = (1/(N·k)) · Σ_{N, t≤k} (f_t − E[f_{t≤k}])²,
where cLN(f_k) is the result of layer normalization performed in real time on the k-th frame, f_k represents the feature of the k-th frame, f_{t≤k} represents the features of the consecutive k frames, namely [f_1, f_2, ..., f_k], γ and β are trainable parameters, ε represents the stability coefficient, E[f_{t≤k}] is the mean of f_{t≤k} over the frame sequence k, and Var[f_{t≤k}] is the variance of f_{t≤k} over the consecutive k frames.
Optionally, the classifier module comprises: a long-time and short-time memory network layer, a linear layer and a Softmax layer;
inputting the spliced multi-scale features to the long-time and short-time memory network layer for acquiring time sequence memory features;
inputting the timing memory characteristic to the linear layer;
and inputting the audio signal processed by the linear layer into a Softmax layer for classifying the audio signal processed by the linear layer.
Optionally, the separation network comprises: a mask calculation module, a multiplier and a decoding module;
inputting the superimposed multi-scale features to the mask calculation module, calculating to obtain a mask value of each encoder output channel in the audio encoding module, and inputting the mask value and the audio signal of the two-dimensional space to the multiplier;
and inputting the result processed by the multiplier into the decoding module, and converting the audio signal of the two-dimensional space after mask processing into time domain characteristics to obtain an audio separation result.
Optionally, the method further comprises:
acquiring a plurality of audio frequencies through a data set;
cutting off a mute section in the middle of the plurality of audios, and arranging the plurality of audios from large to small according to the audio length;
acquiring the longest audio frequency in the multiple audio frequencies, supplementing a mute section at the front section of the longest audio frequency, wherein the length of the longest audio frequency after the mute section is supplemented is a specified length;
supplementing a mute section in the plurality of audios, wherein the length of each audio after the mute section is supplemented is the specified length; taking each audio after the mute section is supplemented as a reference audio;
superposing and mixing a plurality of reference audios to obtain mixed audio;
encoding the category of the mixed audio by one hot;
expanding the encoded class length to the length of the corresponding section for obtaining the classification label of each sampling point;
inputting the mixed audio into an audio classification and separation network to obtain an audio classification result and an audio separation result;
acquiring an audio classification network loss function through the audio classification result and the classification label, and acquiring an audio separation network loss function through the audio separation result and a reference audio;
and calculating a total loss function through the audio classification network loss function and the audio separation network loss function, and correcting the audio classification separation network model.
Optionally, the method further comprises:
obtaining the loss function of the audio separation network by a permutation invariance method, wherein the signal-to-noise ratio loss function in the separation network is as follows:
L_sep = min_{P} [ −(1/M) · Σ_{j=1}^{M} SI-SNR(ŝ_{P(j)}, s_j) ],
where L_sep is the loss function of the separation network, ŝ represents a sound source signal obtained by the audio separation network, i.e. an audio separation result, s represents a reference signal, M indicates that there are M sound source signals, P denotes a permutation of the M sound source signals, of which there are M! combinations, and the performance index between the permuted sound source signal and the reference signal is the SI-SNR index, SI-SNR being the maximized scale-invariant signal-to-noise ratio; (ŝ_{P(j)}, s_j) denotes the j-th audio separation result and its corresponding reference (label) signal, 1 ≤ j ≤ M, and the maximum value of j is M.
Optionally, the method further comprises:
SI-SNR = 10·log10( ‖s_target‖² / ‖e_noise‖² ),  s_target = (⟨ŝ, s⟩ / ‖s‖²)·s,  e_noise = ŝ − s_target,
where ‖x‖² represents the signal power of x (x being s_target, e_noise or s), e_noise is the interfering signal other than the desired audio signal, and ŝ is the audio separation result.
Optionally, the method further comprises:
the cross entropy loss function is:
L_CE = −Σ_{k=1}^{C} y_k·log(p_k),
where L_CE is the cross entropy loss function, p_k represents the class distribution probability obtained after the audio signal to be processed is input into the audio classification network for classification, y_k represents the distribution probability of the classification label, C represents the number of classes, k indexes the k-th classification category or classification label, 1 ≤ k ≤ C, and the maximum value of k is C.
Optionally, the method further comprises:
the total loss function of the audio classification and separation network is a weighted average of the cross entropy loss function in the audio classification network and the signal-to-noise ratio loss function in the audio separation network, and the expression is:
L_total = α·L_sep + (1 − α)·log(L_CE),
where α is a weight coefficient, 0 < α < 1, used for balancing the two tasks of classification and separation.
In a second aspect of the present invention, there is provided an audio classification and separation apparatus, comprising:
the determining module is used for determining the audio signal to be processed;
the audio coding module is used for inputting the audio signal to be processed into the audio coding module and transforming the audio signal to be processed into a two-dimensional space from a one-dimensional time domain; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
the multi-scale mapping module is used for inputting the audio signal of the two-dimensional space to the multi-scale mapping module and extracting the multi-scale features of the audio signal of the two-dimensional space;
the audio classification module is used for inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to the classifier module, and classifying the audio signals to be processed;
and the audio separation module is used for inputting the multi-scale features to the superimposer to obtain the superimposed multi-scale features, inputting the superimposed multi-scale features to a separation network, and separating the audio signals to be processed.
In a third aspect of the present invention, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the audio classification and separation method of any one of the first aspect.
In a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the audio classification and separation method according to any one of the first aspect.
The technical scheme of the embodiment of the invention has the following advantages:
the embodiment of the invention provides an audio classification and separation method, an audio classification and separation device, electronic equipment and a storage medium, wherein the method comprises the following steps: an audio signal to be processed is determined. And inputting the audio signal to be processed into an audio coding module, and transforming the audio signal to be processed into a two-dimensional space from a one-dimensional time domain. The two-dimensional space comprises time domain characteristics of the audio and channel information of the audio coding module. And inputting the audio signal of the two-dimensional space to a multi-scale mapping module, and extracting multi-scale features of the audio signal of the two-dimensional space. Inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to a classifier module, and classifying the audio signals to be processed. And inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, inputting the superimposed multi-scale features into a separation network, and separating the audio signals to be processed. The embodiment of the invention solves the problems that the mixed audio quantity is uncertain, the audio separation effect with less overlapped parts is poor and the audio classification can not be carried out simultaneously when the audio is separated in the prior art. The embodiment of the invention provides a multi-target learning method, namely, a classification learning task is added while a separation task is learned. The audio classification and separation network can classify the mixed audio according to the number of each section of mixed audio, thereby improving the capability of the audio classification and separation network for separating the mixed audio and meeting the actual use requirement.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow diagram of an audio classification and separation method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of an audio classification separation network architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a dilated convolution block structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of training audio acquisition according to an embodiment of the present invention;
FIG. 5 is an audio classification separation network configuration table according to an embodiment of the present invention;
fig. 6 is a block diagram of the structure of an audio classification apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be operated, and thus should not be considered as limiting the present application. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more features. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In this application, the word "exemplary" is used to mean "serving as an example, instance, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the application. In the following description, details are set forth for the purpose of explanation. It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes are not set forth in detail in order to avoid obscuring the description of the present application with unnecessary detail. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
In accordance with an embodiment of the present invention, there is provided an audio classification and separation method embodiment, it is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In this embodiment, an embodiment of an audio classification and separation method is provided, which can be used in a speech classification and separation system, and fig. 1 is a flowchart of an audio classification and separation method according to an embodiment of the present invention, as shown in fig. 1, where the flowchart includes the following steps:
step S101, determining an audio signal to be processed. The audio signal to be processed is a mixed audio signal, and can be processed by the audio classification and separation network in the embodiment of the invention.
Step S102, inputting the audio signal to be processed to an audio coding module (Encoder), and transforming the audio signal to be processed from a one-dimensional time domain to a two-dimensional space. The two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module. Specifically, the audio coding module is a one-dimensional convolution, whereby the audio signal to be processed is transformed from a one-dimensional time domain to a two-dimensional space.
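As a rough illustration of this encoding step, the sketch below maps a raw waveform to a two-dimensional time/channel representation with a strided one-dimensional convolution. It is a minimal sketch only: the class name AudioEncoder and the channel count, kernel size and stride are assumptions, not values fixed by this embodiment.
```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps a 1-D waveform [batch, samples] to a 2-D representation [batch, channels, frames]."""
    def __init__(self, out_channels=256, kernel_size=20, stride=10):
        super().__init__()
        # One-dimensional convolution: each output channel is one basis of the learned transform.
        self.conv = nn.Conv1d(1, out_channels, kernel_size, stride=stride, bias=False)

    def forward(self, waveform):
        # waveform: [batch, samples] -> add a channel dimension -> [batch, 1, samples]
        x = waveform.unsqueeze(1)
        # Output: [batch, out_channels, frames]; time on one axis, encoder channels on the other.
        return torch.relu(self.conv(x))

# Example: a 1-second, 8 kHz mixture becomes a 2-D time/channel representation.
enc = AudioEncoder()
mix = torch.randn(2, 8000)
features = enc(mix)          # shape: [2, 256, 799]
```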
Step S103, inputting the audio signal of the two-dimensional space to a Multi-Scale Mapping module (Multi-Scale Mapping), and extracting the Multi-Scale feature of the audio signal of the two-dimensional space. Specifically, the features of the audio signal input to the module are extracted through a plurality of groups of expansion convolution networks in the multi-scale mapping, and the extracted features are output.
And step S104, inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to a classifier module, and classifying the audio signals to be processed. Specifically, the audio features extracted from each convolution block in the expanded convolution network are respectively spliced through the cascade layers, and the audio features are used for classifying the spliced audio signals by the classifier module.
And S105, inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, inputting the superimposed multi-scale features into a separation network, and separating the audio signals to be processed. Specifically, the audio features extracted from each convolution block in the expanded convolution network are respectively subjected to superposition processing by a superimposer, and the superimposer is used for separating the audio signals after superposition by a separation network.
The audio processing networks in the prior art can only realize audio classification or audio separation independently. Different from the prior art, in the embodiment of the invention an audio classification task is added while the audio separation task is carried out. The embodiment of the invention thus solves the problems that the number of mixed audios is uncertain, the separation effect on audio with few overlapped portions is poor, and audio classification cannot be carried out at the same time as audio separation. In the embodiment of the invention, classification is carried out according to the number of mixed audios in each segment, so that separation tasks with different numbers of mixed audios can be learned adaptively, meeting practical usage requirements. Moreover, by stacking a plurality of convolution blocks with different dilation rates, multi-scale information is extracted; the model is simple, the real-time rate is high, and the accuracy of audio classification and audio separation is high.
To further illustrate the classifier, in an alternative embodiment, the classifier module includes: long and short time memory network Layers (LSTM Layers), Linear Layers (Linear), and Softmax Layers. And inputting the spliced multi-scale features into a long-time and short-time memory network layer for acquiring the time sequence memory features. Specifically, compared with the conventional neural network, the long-time and short-time memory network layer is very suitable for processing the audio signals highly related to the time sequence, so that the audio signals to be classified are processed by using the long-time and short-time memory network layer, the audio signals are easier to classify, and errors are avoided.
The timing memory characteristic is input to a linear layer. And full link connection is realized through the linear layer, and the spliced multi-scale features are mapped to the Softmax layer one by one.
The audio signal processed by the linear layer is input to the Softmax layer, which classifies it. Specifically, the Softmax layer classifies the input audio signal by using a Softmax function, which essentially computes the probability of each classification category; classification with the Softmax function is computationally simple and effective.
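For reference, a minimal sketch of such a classifier head (LSTM, linear layer, Softmax) is given below; the class name Classifier, the hidden size and the frame-wise output are assumptions for illustration.
```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """LSTM -> Linear -> Softmax over C mixing classes, evaluated per time step."""
    def __init__(self, in_channels, num_classes, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, batch_first=True)   # time-sequence memory features
        self.linear = nn.Linear(hidden, num_classes)                  # full connection to C classes

    def forward(self, concat_features):
        # concat_features: [batch, channels, frames] from the concatenation (cascade) layer.
        x = concat_features.transpose(1, 2)        # LSTM expects [batch, frames, channels]
        x, _ = self.lstm(x)
        logits = self.linear(x)                    # [batch, frames, num_classes]
        return torch.softmax(logits, dim=-1)       # class probabilities per frame

# Example: 1024 concatenated feature channels, 3 classes.
clf = Classifier(in_channels=1024, num_classes=3)
probs = clf(torch.randn(2, 1024, 799))             # shape: [2, 799, 3]
```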
The above step S105 relates to a splitting network, which in an alternative embodiment comprises: mask calculation module, multiplier and decoding module. And inputting the superposed multi-scale features into a mask calculation module (Masking) for calculating to obtain a mask value of each encoder output channel in the audio encoding module, and inputting the mask value and the audio signal of the two-dimensional space into a multiplier. Specifically, the multi-scale features are input to a superimposer, and the input multi-scale features are superimposed by the superimposer to form an audio signal to be separated. And calculating a mask value of each encoder output channel in the audio encoding module through a mask calculation module, and synthesizing the output of the Sigmoid activation function and the audio signal of the two-dimensional space. In addition, the mask calculation module comprises a PReLU activation function, a one-dimensional convolutional layer (1 x1 Conv) and a Sigmoid activation function, wherein the activation function plays a role of nonlinear processing to increase the nonlinear fitting capability.
The result processed by the multiplier is input to the decoding module (Decoder), which converts the mask-processed audio signal of the two-dimensional space back into time-domain features to obtain the audio separation result. Specifically, the decoding module is formed by a transposed convolution network and performs deconvolution on the audio signal input to it to obtain a time-domain signal. As shown in fig. 2, the audio classification and separation network in the embodiment of the present invention uses a stacked one-dimensional convolutional network to extract multi-scale features of the input audio, and these multi-scale features are used for both audio classification and audio separation. By combining audio classification with audio separation, the model trained for audio classification also benefits audio separation through the shared neural network features, so the separation performance is improved; at the same time, treating audio with different mixing degrees as a classification task assists voice separation, making the method suitable for different numbers of mixed voices and further improving separation performance.
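A minimal sketch of this mask-and-decode path is shown below: a PReLU + 1x1 convolution + Sigmoid mask per source, element-wise multiplication with the encoder output, and a transposed-convolution decoder. The class name SeparationHead and all layer sizes are assumptions.
```python
import torch
import torch.nn as nn

class SeparationHead(nn.Module):
    """Estimates one mask per source, applies it to the encoder output, and decodes to waveforms."""
    def __init__(self, feature_channels, enc_channels, num_sources, kernel_size=20, stride=10):
        super().__init__()
        self.num_sources = num_sources
        self.enc_channels = enc_channels
        self.mask_net = nn.Sequential(
            nn.PReLU(),
            nn.Conv1d(feature_channels, num_sources * enc_channels, 1),  # 1x1 Conv
            nn.Sigmoid(),                                                 # mask values in (0, 1)
        )
        # Transposed convolution maps the masked 2-D representation back to the time domain.
        self.decoder = nn.ConvTranspose1d(enc_channels, 1, kernel_size, stride=stride, bias=False)

    def forward(self, summed_features, enc_out):
        # summed_features: [batch, feature_channels, frames]; enc_out: [batch, enc_channels, frames]
        b, _, frames = enc_out.shape
        masks = self.mask_net(summed_features).view(b, self.num_sources, self.enc_channels, frames)
        masked = masks * enc_out.unsqueeze(1)                  # multiplier: one masked copy per source
        wavs = self.decoder(masked.reshape(-1, self.enc_channels, frames))
        return wavs.view(b, self.num_sources, -1)              # [batch, sources, samples]

# Example with two sources and 256 encoder channels.
head = SeparationHead(feature_channels=256, enc_channels=256, num_sources=2)
wavs = head(torch.randn(2, 256, 799), torch.randn(2, 256, 799))   # shape: [2, 2, 8000]
```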
To further illustrate the multi-scale mapping module, in an alternative embodiment, the multi-scale mapping module comprises: a layer normalization layer (LayerNorm), a first one-dimensional convolution module (1X 1 Conv), and a plurality of sets of cascaded dilated convolution networks. Wherein each set of the dilated convolutional networks comprises a plurality of cascaded dilated convolutional blocks (1-D Conv Block). And inputting the audio signal of the two-dimensional space to the normalization layer to obtain a first audio signal. To ensure that the separation network is insensitive to the amplitude of the input speech, the input audio signal is layer normalized before being multi-scale mapped.
And inputting the first audio signal to a first one-dimensional convolution module, and outputting a second audio signal. The calculation is simplified through the first one-dimensional convolution module, and the calculation efficiency of the audio classification and separation network is improved.
And inputting the second audio signal to the multi-group expansion convolution network, and extracting the multi-scale features. Compared with the traditional convolution, the characteristic extraction through the expansion convolution network reduces the processing process of the audio signal, namely, the pooling processing is not needed, but the purpose of increasing the receptive field can be realized.
To further illustrate the dilated convolution network, in an alternative embodiment, the dilation rate of each dilated convolution block grows exponentially with base 2, the i-th block having a dilation rate of 2^(i-1); when the dilated convolution block is causal, the number of padded zeros is:
(dilation*(kernel_size-1))/2,
and in the non-causal case, the number of padded zeros is:
dilation*(kernel_size-1);
where dilation represents the dilation rate, kernel_size represents the convolution kernel size, X represents the number of dilated convolution blocks in each group of the dilated convolution network, and i represents the i-th convolution block, with 1 ≤ i ≤ X and a maximum value of X. Specifically, each dilated convolution block performs a dilated convolution, and the dilated convolution network is formed by cascading a plurality of dilated convolution blocks. Dilated convolution inserts zeros ("holes") into a standard convolution in order to enlarge the receptive field. The dilated convolution therefore has one more hyper-parameter than the standard convolution, called the dilation rate, which refers to the number of intervals between elements of the convolution kernel. In the embodiment of the invention, different numbers of zeros are padded depending on whether the dilated convolution block is causal, thereby achieving the purpose of increasing the receptive field without increasing the amount of calculation.
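A small helper illustrating the dilation schedule 2^(i-1) and the zero-padding rule exactly as stated in this text (causal: dilation*(kernel_size-1)/2; non-causal: dilation*(kernel_size-1)); the function names are only illustrative.
```python
def dilation_rate(i):
    """Dilation of the i-th block in a group (i = 1..X), growing exponentially with base 2."""
    return 2 ** (i - 1)

def num_padded_zeros(dilation, kernel_size, causal):
    """Number of zeros padded for a dilated convolution block, following the rule in the text."""
    if causal:
        return (dilation * (kernel_size - 1)) // 2
    return dilation * (kernel_size - 1)

# Example: with kernel_size = 3 and X = 8 blocks, dilations are 1, 2, 4, ..., 128.
for i in range(1, 9):
    d = dilation_rate(i)
    print(i, d, num_padded_zeros(d, 3, causal=False))
```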
To further illustrate the dilated convolution block, in an alternative embodiment, as shown in FIG. 3, the dilated convolution block comprises: a point-by-point convolution layer, a first PReLU activation function layer, a first normalization layer, a depth convolution layer, a second PReLU activation function layer, a second normalization layer, a second one-dimensional convolution layer, and a third one-dimensional convolution layer. The audio signal input to the point-by-point convolution layer is sequentially processed through the point-by-point convolution layer, the first PReLU activation function layer, the first normalization layer, the depth convolution layer, the second PReLU activation function layer and the second normalization layer to obtain a third audio signal. Specifically, the audio signal input to the point-by-point convolution layer includes the second audio signal and the audio signal output by the previous dilated convolution block. In the embodiment of the invention, the dilated convolution block replaces the conventional convolution with a depthwise separable convolution, i.e. it is split into a point-by-point convolution, denoted by 1x1-Conv, and a depth convolution, denoted by D-Conv. The PReLU (Parametric Rectified Linear Unit) is adopted as the activation function to perform nonlinear processing on the audio, thereby increasing the nonlinear fitting capability. The PReLU function is expressed as:
PReLU(x) = x for x ≥ 0, and PReLU(x) = a·x for x < 0,
where x represents the audio signal input to the activation function and a is the slope of the negative portion.
The third audio signal is input to the second one-dimensional convolutional layer and the third one-dimensional convolutional layer respectively to obtain a fourth audio signal and a fifth audio signal respectively. Inputting the fourth audio signal to the cascade layer and superimposer. And processing the multi-scale features through the cascade layer and the superimposer so as to obtain audio signals input to the classifier and the separation network.
The fifth audio signal is superposed with the audio signal input to the point-by-point convolution layer to obtain a sixth audio signal, and the sixth audio signal is input to the next dilated convolution block. This residual superposition prevents the gradient from vanishing and allows the network depth to be increased.
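Putting these pieces together, a sketch of one such block is given below; the residual output and the skip output correspond to the fifth and fourth audio signals above. The class name DilatedConvBlock, the channel sizes and the symmetric padding choice are assumptions.
```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """1x1-Conv -> PReLU -> Norm -> D-Conv (depthwise, dilated) -> PReLU -> Norm -> two 1x1 outputs."""
    def __init__(self, in_channels, hidden_channels, skip_channels, kernel_size, dilation):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2          # symmetric padding that keeps the frame count
        self.pointwise = nn.Conv1d(in_channels, hidden_channels, 1)
        self.prelu1 = nn.PReLU()
        self.norm1 = nn.GroupNorm(1, hidden_channels)    # stand-in for the normalization layer
        self.depthwise = nn.Conv1d(hidden_channels, hidden_channels, kernel_size,
                                   dilation=dilation, padding=pad, groups=hidden_channels)
        self.prelu2 = nn.PReLU()
        self.norm2 = nn.GroupNorm(1, hidden_channels)
        self.res_out = nn.Conv1d(hidden_channels, in_channels, 1)    # "fifth" signal (residual path)
        self.skip_out = nn.Conv1d(hidden_channels, skip_channels, 1) # "fourth" signal (to cascade/adder)

    def forward(self, x):
        y = self.norm1(self.prelu1(self.pointwise(x)))
        y = self.norm2(self.prelu2(self.depthwise(y)))
        skip = self.skip_out(y)            # goes to the concatenation layer and the superimposer
        residual = self.res_out(y) + x     # superposed with the block input -> next block
        return residual, skip

# Example block with dilation 4.
block = DilatedConvBlock(256, 512, 256, kernel_size=3, dilation=4)
res, skip = block(torch.randn(2, 256, 799))
```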
To further illustrate the layer normalization layer, in an alternative embodiment, the layer normalization layer comprises:
in a non-real-time scene, layer normalization is global layer normalization, and the expression is:
gLN(F) = ((F − E[F]) / sqrt(Var[F] + ε)) ⊙ γ + β,
E[F] = (1/(N·T)) · Σ_{N,T} F,
Var[F] = (1/(N·T)) · Σ_{N,T} (F − E[F])²,
where gLN(F) is the result of layer normalization of F, F represents an audio feature of the two-dimensional space, γ and β are trainable parameters, ε is a stability coefficient, E[F] is the mean of F, Var[F] is the variance of F, N represents the length of the channel dimension, and T represents the length of the time dimension;
in a real-time scene, layer normalization is cumulative layer normalization, and the expression is:
cLN(f_k) = ((f_k − E[f_{t≤k}]) / sqrt(Var[f_{t≤k}] + ε)) ⊙ γ + β,
E[f_{t≤k}] = (1/(N·k)) · Σ_{N, t≤k} f_t,
Var[f_{t≤k}] = (1/(N·k)) · Σ_{N, t≤k} (f_t − E[f_{t≤k}])²,
where cLN(f_k) is the result of layer normalization performed in real time on the k-th frame, f_k represents the feature of the k-th frame, f_{t≤k} represents the features of the consecutive k frames, namely [f_1, f_2, ..., f_k], γ and β are trainable parameters, ε represents the stability coefficient, E[f_{t≤k}] is the mean of f_{t≤k} over the frame sequence k, and Var[f_{t≤k}] is the variance of f_{t≤k} over the consecutive k frames. In a real-time scene, the layer normalization is performed on the audio features continuously input to the layer normalization layer, so as to ensure that the audio classification and separation network is insensitive to the amplitude of the input audio signal.
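A sketch of the two normalization variants described above, written for a [batch, N, T] feature tensor, is given below; the class names, epsilon value and parameter shapes are assumptions.
```python
import torch
import torch.nn as nn

class GlobalLayerNorm(nn.Module):
    """Non-real-time variant: normalizes over both the channel (N) and time (T) dimensions."""
    def __init__(self, num_channels, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1))   # trainable scale
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1))   # trainable shift
        self.eps = eps                                               # stability coefficient

    def forward(self, f):
        mean = f.mean(dim=(1, 2), keepdim=True)
        var = f.var(dim=(1, 2), keepdim=True, unbiased=False)
        return (f - mean) / torch.sqrt(var + self.eps) * self.gamma + self.beta

class CumulativeLayerNorm(nn.Module):
    """Real-time variant: frame k is normalized with statistics of frames 1..k only."""
    def __init__(self, num_channels, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1))
        self.eps = eps

    def forward(self, f):
        b, n, t = f.shape
        count = n * torch.arange(1, t + 1, device=f.device, dtype=f.dtype).view(1, 1, t)
        cum_sum = f.sum(dim=1, keepdim=True).cumsum(dim=2)          # sum over channels, then over time
        cum_sq = (f ** 2).sum(dim=1, keepdim=True).cumsum(dim=2)
        mean = cum_sum / count
        var = cum_sq / count - mean ** 2
        return (f - mean) / torch.sqrt(var + self.eps) * self.gamma + self.beta

# Example: both keep the [batch, N, T] shape.
x = torch.randn(2, 256, 799)
print(GlobalLayerNorm(256)(x).shape, CumulativeLayerNorm(256)(x).shape)
```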
To illustrate training the audio classification separation network, in an alternative embodiment, multiple audios are acquired through a data set, as shown in FIG. 4. The audio is acquired through the data set, so that the quality of the acquired audio is high, and the pure audio is convenient to acquire because no more noise is interfered in the single audio.
And cutting off a mute section in the middle of the multiple audios, and arranging the multiple audios from large to small according to the audio length. The silence section in the middle of each audio frequency is intercepted, so that more excellent classification labels can be obtained, the classification result of the mixed audio signal after the audio classification separation network processing is easier and more convenient to compare with the classification labels, and the condition of classification errors is avoided.
And acquiring the longest audio frequency in the plurality of audio frequencies, supplementing a mute section at the front section of the longest audio frequency, wherein the length of the longest audio frequency after the mute section is supplemented is the designated length. Supplementing a mute section in the plurality of audios, wherein the length of each audio after the mute section is supplemented is the specified length; and taking each audio after the mute segment is supplemented as a reference audio. The silent segments are supplemented in such a way that the classification labels have a universality and the silent segments supplemented with the predetermined length are one third of the longest audio length. Meanwhile, all the audios are supplemented with the mute sections, so that the lengths of the audio signals obtained after all the mute sections are intercepted are consistent, mixing is facilitated, and reference audios and classification labels are obtained.
And mixing a plurality of reference audios in a superposition manner to obtain mixed audio. The categories of the mixed audio are one-hot encoded. And expanding the encoded class length to the length of the corresponding section for obtaining the classification label of each sampling point. The classification labels are digitized by a mixed audio class coding mode, so that accurate classification labels are obtained, and comparison with classification results is facilitated.
And inputting the mixed audio into the audio classification and separation network to obtain an audio classification result and an audio separation result. And acquiring an audio classification network loss function through the audio classification result and the classification label, and acquiring an audio separation network loss function through the audio separation result and the reference audio. And calculating a total loss function through the audio classification network loss function and the audio separation network loss function, and correcting the audio classification separation network model. Specifically, the audio classification and separation network is trained through pre-designed mixed audio, and a loss function is calculated through calculating the results of audio classification and audio separation, classification labels and reference audio, so that errors are obtained. And correcting the audio classification and separation network through the error of each training, further obtaining the audio classification and separation network with smaller error, and formally applying the audio classification and separation network.
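A simplified sketch of the mixture and per-sample label construction described above is shown below. The reading that the class of a sample is the number of simultaneously active reference audios, as well as the silence threshold and the helper name, are assumptions for illustration.
```python
import numpy as np

def build_mixture_and_labels(sources, num_classes, silence_thresh=1e-4):
    """sources: list of 1-D numpy arrays already padded to the same length (the reference audios).

    Returns the summed mixture and a per-sample one-hot label whose class is the number of
    reference audios active at that sample (0, 1, 2, ... overlapping sources).
    """
    refs = np.stack(sources)                      # [num_sources, samples]
    mixture = refs.sum(axis=0)                    # superimpose the reference audios
    active = np.abs(refs) > silence_thresh        # which sources are non-silent at each sample
    count = active.sum(axis=0)                    # class per sample = number of active sources
    labels = np.eye(num_classes)[np.clip(count, 0, num_classes - 1)]   # one-hot, expanded per sample
    return mixture, labels

# Example with two toy "utterances" padded with silence to the same length.
a = np.concatenate([np.zeros(100), np.random.randn(200), np.zeros(100)])
b = np.concatenate([np.random.randn(150), np.zeros(250)])
mix, y = build_mixture_and_labels([a, b], num_classes=3)
print(mix.shape, y.shape)     # (400,) (400, 3)
```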
In an optional embodiment, the audio separation network loss function is obtained by a permutation invariance method, and the signal-to-noise ratio loss function in the separation network is:
L_sep = min_{P} [ −(1/M) · Σ_{j=1}^{M} SI-SNR(ŝ_{P(j)}, s_j) ],
where L_sep is the loss function of the separation network, ŝ represents a sound source signal obtained through the audio separation network, i.e. an audio separation result, s represents a reference signal, M indicates that there are M sound source signals, P denotes a permutation of the M sound source signals, of which there are M! combinations, and the performance index between the permuted sound source signal and the reference signal is the SI-SNR index, SI-SNR being the maximized scale-invariant signal-to-noise ratio; (ŝ_{P(j)}, s_j) denotes the j-th audio separation result and its corresponding reference (label) signal, 1 ≤ j ≤ M, and the maximum value of j is M. Specifically, the separated audio signals and the reference signals are put in one-to-one correspondence, all permutation combinations are formed, the corresponding SI-SNR is calculated, and the minimum loss among all the combinations is taken as the final loss for back propagation, so that the separation network model is corrected.
To illustrate the way the SI-SNR is calculated, in an alternative embodiment, the maximized scale-invariant signal-to-noise ratio is expressed as:
SI-SNR = 10·log10( ‖s_target‖² / ‖e_noise‖² ),  s_target = (⟨ŝ, s⟩ / ‖s‖²)·s,  e_noise = ŝ − s_target,
where ‖x‖² represents the signal power of x (x being s_target, e_noise or s), e_noise is the interfering signal other than the desired audio signal, and ŝ is the audio separation result.
The SI-SNR is calculated by the above formula, so as to obtain the output of the separation network loss function.
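The sketch below implements the SI-SNR measure and the permutation-invariant loss from the preceding paragraphs for a small number of sources; the exhaustive search over permutations is practical because M! stays small. Function names and the zero-mean step are assumptions of this sketch.
```python
import itertools
import torch

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio, in dB, for 1-D tensors."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    s_target = (torch.dot(estimate, reference) / (torch.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    return 10 * torch.log10((s_target.pow(2).sum() + eps) / (e_noise.pow(2).sum() + eps))

def pit_si_snr_loss(estimates, references):
    """Permutation-invariant loss: try all M! pairings and keep the smallest loss.

    estimates, references: [M, samples]
    """
    m = estimates.shape[0]
    losses = []
    for perm in itertools.permutations(range(m)):
        pairs = [si_snr(estimates[perm[j]], references[j]) for j in range(m)]
        losses.append(-sum(pairs) / m)           # negative mean SI-SNR for this pairing
    return torch.stack(losses).min()             # minimum over all permutations is back-propagated

# Example with M = 2 sources.
est = torch.randn(2, 8000, requires_grad=True)
ref = torch.randn(2, 8000)
loss = pit_si_snr_loss(est, ref)
loss.backward()
```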
In an alternative embodiment, the cross entropy loss function is:
L_CE = −Σ_{k=1}^{C} y_k·log(p_k),
where L_CE is the cross entropy loss function, p_k represents the class distribution probability obtained after the audio signal to be processed is input into the audio classification network for classification, y_k represents the distribution probability of the classification label, C represents the number of classes, k indexes the k-th classification category or classification label, 1 ≤ k ≤ C, and the maximum value of k is C. Specifically, after the mixed audio used for training the network is input, the error between the classification signal output by the classification network and the known classification label is calculated through the cross entropy loss function. The output of the cross entropy loss function is used to reversely correct the network model, so that the error between the classification result and the known classification label becomes smaller.
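For the classification branch, the per-sample cross entropy above can be computed directly from the frame-wise class probabilities; the sketch assumes the classifier outputs probabilities (after Softmax) and that the labels are one-hot per frame.
```python
import torch

def cross_entropy_loss(probs, one_hot_labels, eps=1e-8):
    """probs, one_hot_labels: [frames, C]; returns the mean cross entropy over frames."""
    return -(one_hot_labels * torch.log(probs + eps)).sum(dim=-1).mean()

# Example: 400 frames, C = 3 classes.
probs = torch.softmax(torch.randn(400, 3), dim=-1)
labels = torch.eye(3)[torch.randint(0, 3, (400,))]
print(cross_entropy_loss(probs, labels))
```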
To illustrate the total loss function of the audio classification and separation network, in an alternative embodiment, the total loss function of the audio classification and separation network is a weighted average of the cross entropy loss function in the audio classification network and the signal-to-noise ratio loss function in the audio separation network, and the expression is:
L_total = α·L_sep + (1 − α)·log(L_CE),
where α is a weight coefficient, 0 < α < 1, used for balancing the two tasks of classification and separation. The classification cross entropy loss function is logged so that both loss functions remain at the same order of magnitude. The total loss of the audio classification and separation network is calculated in this weighted manner, the overall performance of the network is comprehensively considered, the network model is reversely corrected, and the tasks of audio classification and audio separation are thereby completed.
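One possible reading of this weighted combination is sketched below, with the separation loss weighted against the logarithm of the classification loss; the function name and the value of α are assumptions.
```python
import torch

def total_loss(separation_loss, classification_loss, alpha=0.5):
    """Weighted combination; log() keeps the cross-entropy term on a comparable scale."""
    return alpha * separation_loss + (1 - alpha) * torch.log(classification_loss)
```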
In an alternative embodiment, the audio classification and separation network requires a network configuration, and the network configuration table is shown in fig. 5. Here G represents the number of output channels of the Encoder; L represents the convolution kernel size of the Encoder; the bottleneck layer (Bottleneck) comprises the layer normalization layer and the first one-dimensional convolution module in the Multi-Scale Mapping module, and its number of output channels is B; the number of 1-D Conv blocks in each group of the Multi-Scale Mapping is X, and R groups are stacked together; the number of input channels of the Classifier is B × R, the number of channels after the multi-scale features are stacked, and the number of output channels is C, i.e. the audio is divided into C categories; the number of output channels of Masking is M × F, where M represents the number of mixed audios, i.e., the number of sound source signals.
In this embodiment, an audio classifying and separating device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and the description of the device already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The present embodiment provides an audio classifying and separating apparatus, as shown in fig. 6, including:
a determining module 61, configured to determine an audio signal to be processed;
the audio coding module 62 is configured to input the audio signal to be processed to the audio coding module, and transform the audio signal to be processed from a one-dimensional time domain to a two-dimensional space; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
the multi-scale mapping module 63 is configured to input the audio signal in the two-dimensional space to the multi-scale mapping module, and extract multi-scale features of the audio signal in the two-dimensional space;
the audio classification module 64 is configured to input the multi-scale features to the cascade layer to obtain spliced multi-scale features, input the spliced multi-scale features to the classifier module, and classify the audio signal to be processed;
and the audio separation module 65 is configured to input the multi-scale features to the superimposer to obtain superimposed multi-scale features, and input the superimposed multi-scale features to the separation network to separate the audio signals to be processed.
The audio classification and separation apparatus in this embodiment is presented in the form of a functional unit, where the unit refers to an ASIC circuit, a processor and memory executing one or more software or fixed programs, and/or other devices that may provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, which has the audio frequency classification and separation apparatus shown in fig. 6.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an alternative embodiment of the present invention, and as shown in fig. 7, the electronic device may include: at least one processor 701, such as a CPU (Central Processing Unit), at least one communication interface 703, a memory 707, and at least one communication bus 702. Wherein a communication bus 702 is used to enable connective communication between these components. The communication interface 703 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional communication interface 703 may also include a standard wired interface and a standard wireless interface. The Memory 707 may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The memory 707 may optionally be at least one storage device located remotely from the processor 701. Wherein the processor 701 may be in connection with the apparatus described in fig. 6, an application program is stored in the memory 707, and the processor 701 invokes the program code stored in the memory 707 for performing any of the method steps described above.
The communication bus 702 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 702 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
The memory 707 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 707 may also comprise a combination of the above types of memory.
The processor 701 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 701 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
Optionally, the memory 707 is also used for storing program instructions. The processor 701 may invoke the program instructions to implement the audio classification and separation method as shown in the embodiment of fig. 1 of the present application.
Embodiments of the present invention further provide a non-transitory computer storage medium, where computer-executable instructions are stored, and the computer-executable instructions may execute the audio classification and separation method in any of the above method embodiments. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (15)

1. A method of audio classification and separation, comprising:
determining an audio signal to be processed;
inputting the audio signal to be processed to an audio coding module, and transforming the audio signal to be processed from a one-dimensional time domain to a two-dimensional space; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
inputting an audio signal of a two-dimensional space to a multi-scale mapping module, and extracting multi-scale features of the audio signal of the two-dimensional space;
inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to a classifier module, and classifying the audio signals to be processed;
and inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, inputting the superimposed multi-scale features into a separation network, and separating the audio signals to be processed.
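For orientation, the following is a minimal, self-contained PyTorch sketch of the data flow recited in claim 1; the class name AudioClassifySeparate, every layer size, the sigmoid mask and the simple residual stand-in for the multi-scale mapping module are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class AudioClassifySeparate(nn.Module):
    """Toy end-to-end sketch of the claim-1 data flow; every size here is an assumption."""
    def __init__(self, n_ch=64, n_blocks=4, n_classes=4, n_sources=2, win=16, hop=8):
        super().__init__()
        # Audio coding module: 1-D waveform -> 2-D (channel x time) representation.
        self.encoder = nn.Conv1d(1, n_ch, win, stride=hop, bias=False)
        # Stand-in for the multi-scale mapping module: dilated convolutions with growing dilation.
        self.blocks = nn.ModuleList(
            [nn.Conv1d(n_ch, n_ch, 3, padding=2 ** i, dilation=2 ** i) for i in range(n_blocks)]
        )
        # Classification branch: operates on the concatenated (cascaded) multi-scale features.
        self.classifier = nn.Linear(n_blocks * n_ch, n_classes)
        # Separation branch: operates on the summed (superimposed) multi-scale features.
        self.mask = nn.Conv1d(n_ch, n_sources * n_ch, 1)
        self.decoder = nn.ConvTranspose1d(n_ch, 1, win, stride=hop, bias=False)
        self.n_sources, self.n_ch = n_sources, n_ch

    def forward(self, wav):                                   # wav: (B, 1, samples)
        enc = self.encoder(wav)                               # (B, n_ch, T)
        feats, x = [], enc
        for blk in self.blocks:
            x = torch.relu(blk(x)) + x                        # residual dilated convolution
            feats.append(x)                                   # one feature map per scale
        concat = torch.cat(feats, dim=1)                      # cascade layer (concatenation)
        summed = torch.stack(feats, dim=0).sum(dim=0)         # superimposer (element-wise sum)
        cls_logits = self.classifier(concat.mean(dim=-1))     # classify the mixture
        masks = torch.sigmoid(self.mask(summed))              # (B, n_sources * n_ch, T)
        masks = masks.view(-1, self.n_sources, self.n_ch, enc.shape[-1])
        sep = self.decoder((masks * enc.unsqueeze(1)).flatten(0, 1))
        return cls_logits, sep.view(wav.shape[0], self.n_sources, -1)
```

With these assumed sizes, a call such as `AudioClassifySeparate()(torch.randn(2, 1, 8000))` returns a (2, 4) tensor of class logits and a (2, 2, 8000) tensor of separated waveforms.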
2. The audio classification and separation method of claim 1, wherein the multi-scale mapping module comprises: a layer normalization layer, a first one-dimensional convolution module, and a plurality of groups of cascaded dilated convolutional networks; wherein each group of the dilated convolutional networks comprises a plurality of cascaded dilated convolution blocks;
inputting the audio signal of the two-dimensional space to the layer normalization layer to obtain a first audio signal;
inputting the first audio signal to the first one-dimensional convolution module, and outputting a second audio signal;
and inputting the second audio signal to the plurality of groups of dilated convolutional networks, and extracting the multi-scale features.
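A hedged sketch of this module layout (layer normalization, then a first one-dimensional bottleneck convolution, then groups of cascaded dilated convolution blocks) follows; the use of GroupNorm(1, C) as a channel-wise layer normalization, the plain dilated Conv1d stand-in for the claim-4 block, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleMapper(nn.Module):
    """Assumed layout: layer normalization -> 1x1 Conv1d -> R groups of X dilated conv blocks."""
    def __init__(self, n_ch=64, n_groups=3, blocks_per_group=4):
        super().__init__()
        self.norm = nn.GroupNorm(1, n_ch)              # channel-wise layer normalization stand-in
        self.bottleneck = nn.Conv1d(n_ch, n_ch, 1)     # the "first one-dimensional convolution module"
        self.blocks = nn.ModuleList([
            # Stand-in for the claim-4 dilated convolution block; dilation restarts in each group.
            nn.Conv1d(n_ch, n_ch, 3, dilation=2 ** i, padding=2 ** i)
            for _ in range(n_groups) for i in range(blocks_per_group)
        ])

    def forward(self, enc):                            # enc: (B, n_ch, T), the 2-D encoded audio
        x = self.bottleneck(self.norm(enc))
        multi_scale = []
        for blk in self.blocks:
            x = torch.relu(blk(x)) + x                 # cascaded blocks with residual connections
            multi_scale.append(x)                      # one multi-scale feature map per block
        return multi_scale
```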
3. The audio classification and separation method of claim 2, wherein the dilation rate of the dilated convolution blocks grows exponentially in powers of 2, the i-th block having a dilation rate of $2^{i-1}$; when the dilated convolution block is causal, the number of padded zeros is:
(dilation*(kernel_size-1))/2,
and in the non-causal case, the number of padded zeros is:
dilation*(kernel_size-1);
where dilation denotes the dilation rate, kernel_size denotes the convolution kernel size, X denotes the number of dilated convolution blocks in each group of the dilated convolutional network, and i denotes the i-th convolution block, with
$1 \le i \le X$,
i.e. the maximum value of i is X.
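The dilation and zero-padding rule above can be written as a small helper, taking the causal/non-causal split exactly as recited in this claim; the function name and defaults are illustrative.

```python
def dilation_and_padding(i, kernel_size=3, causal=False):
    """Dilation grows exponentially with the (1-based) block index i: 1, 2, 4, ..., 2**(i-1).
    Zero-padding follows the rule stated in claim 3."""
    dilation = 2 ** (i - 1)
    if causal:
        padding = (dilation * (kernel_size - 1)) // 2
    else:
        padding = dilation * (kernel_size - 1)
    return dilation, padding

# Example: the 4th block with a kernel size of 3 -> dilation 8, non-causal padding 16.
# dilation_and_padding(4) == (8, 16)
```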
4. The audio classification and separation method of claim 3, wherein the dilated convolution block comprises: a point-by-point convolution layer, a first PReLU activation function layer, a first normalization layer, a depth convolution layer, a second PReLU activation function layer, a second normalization layer, a second one-dimensional convolution layer and a third one-dimensional convolution layer;
sequentially processing audio signals input to the point-by-point convolution layer through the point-by-point convolution layer, the first PReLU activation function layer, the first normalization layer, the depth convolution layer, the second PReLU activation function layer and the second normalization layer to obtain third audio signals;
inputting the third audio signal to the second one-dimensional convolutional layer and the third one-dimensional convolutional layer respectively to obtain a fourth audio signal and a fifth audio signal respectively;
inputting the fourth audio signal to the cascade layer and the superimposer;
superposing the fifth audio signal with the audio signal input to the point-by-point convolution layer to obtain a sixth audio signal;
inputting the sixth audio signal to the next dilated convolution block.
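A possible PyTorch rendering of such a block, with the two one-dimensional output convolutions producing a skip path (the fourth audio signal, routed to the cascade layer and the superimposer) and a residual path (the fifth audio signal, added back to the block input); the hidden width, the GroupNorm normalization and the depthwise grouping are assumptions.

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Pointwise conv -> PReLU -> norm -> depthwise conv -> PReLU -> norm,
    followed by two 1x1 convolutions: a skip output and a residual output."""
    def __init__(self, in_ch=64, hidden_ch=128, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.pointwise = nn.Conv1d(in_ch, hidden_ch, 1)
        self.prelu1 = nn.PReLU()
        self.norm1 = nn.GroupNorm(1, hidden_ch)
        self.depthwise = nn.Conv1d(hidden_ch, hidden_ch, kernel_size,
                                   padding=pad, dilation=dilation, groups=hidden_ch)
        self.prelu2 = nn.PReLU()
        self.norm2 = nn.GroupNorm(1, hidden_ch)
        self.skip_conv = nn.Conv1d(hidden_ch, in_ch, 1)   # produces the "fourth audio signal"
        self.res_conv = nn.Conv1d(hidden_ch, in_ch, 1)    # produces the "fifth audio signal"

    def forward(self, x):                                  # x: (B, in_ch, T)
        h = self.norm1(self.prelu1(self.pointwise(x)))
        h = self.norm2(self.prelu2(self.depthwise(h)))
        skip = self.skip_conv(h)            # goes to the cascade layer and the superimposer
        residual = self.res_conv(h) + x     # the "sixth audio signal", fed to the next block
        return residual, skip
```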
5. The audio classification and separation method of claim 1, wherein the layer normalization layer comprises:
in a non-real-time scene, the layer normalization is global layer normalization, and the expression is:

$$gLN(F) = \frac{F - E[F]}{\sqrt{Var[F] + \epsilon}} \odot \gamma + \beta$$

where $gLN(F)$ is the result of the layer normalization for F, F represents the audio features of the two-dimensional space, $\gamma$ and $\beta$ are trainable parameters, $\epsilon$ is a stability coefficient, $E[F]$ is the mean of F, $Var[F]$ is the variance of F, N represents the length of the channel dimension, and T represents the length of the time dimension;
in a real-time scene, the layer normalization is cumulative layer normalization, and the expression is:

$$cLN(f_k) = \frac{f_k - E[f_{t \le k}]}{\sqrt{Var[f_{t \le k}] + \epsilon}} \odot \gamma + \beta$$

where $cLN(f_k)$ is the result of the layer normalization for the k-th frame in real time, $f_k$ represents the feature of the k-th frame, $f_{t \le k}$ represents the features of the consecutive k frames, namely $f_{t \le k} = [f_1, f_2, \dots, f_k]$, $\gamma$ and $\beta$ are trainable parameters, $\epsilon$ is the stability coefficient, $E[f_{t \le k}]$ is the mean over the frame sequence $t \le k$, and $Var[f_{t \le k}]$ is the variance over the consecutive k frames.
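Both normalization variants can be sketched directly from the expressions above; the (batch, N, T) tensor layout and the broadcastable shapes of gamma and beta are assumptions.

```python
import torch

def global_layer_norm(F, gamma, beta, eps=1e-8):
    """gLN (non-real-time): statistics over both the channel (N) and time (T) dimensions.
    F: (batch, N, T); gamma, beta: broadcastable, e.g. shape (1, N, 1)."""
    mean = F.mean(dim=(1, 2), keepdim=True)
    var = ((F - mean) ** 2).mean(dim=(1, 2), keepdim=True)
    return (F - mean) / torch.sqrt(var + eps) * gamma + beta

def cumulative_layer_norm(F, gamma, beta, eps=1e-8):
    """cLN (real-time): for frame k, statistics over the channels of frames 1..k only."""
    B, N, T = F.shape
    cum_sum = F.sum(dim=1, keepdim=True).cumsum(dim=2)           # (B, 1, T)
    cum_pow = F.pow(2).sum(dim=1, keepdim=True).cumsum(dim=2)    # (B, 1, T)
    count = N * torch.arange(1, T + 1, device=F.device, dtype=F.dtype).view(1, 1, T)
    mean = cum_sum / count
    var = cum_pow / count - mean.pow(2)
    return (F - mean) / torch.sqrt(var + eps) * gamma + beta
```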
6. The audio classification and separation method of claim 1, wherein the classifier module comprises: a long short-term memory (LSTM) network layer, a linear layer and a Softmax layer;
inputting the spliced multi-scale features to the LSTM network layer to acquire time-sequence memory features;
inputting the time-sequence memory features to the linear layer;
and inputting the output of the linear layer into the Softmax layer to classify the audio signal processed by the linear layer.
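A minimal classifier head along these lines; the hidden size, the single LSTM layer and the frame-wise softmax are assumptions.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """LSTM -> Linear -> Softmax over the concatenated multi-scale features."""
    def __init__(self, feat_dim, hidden=256, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.linear = nn.Linear(hidden, n_classes)

    def forward(self, concat_feats):              # (B, feat_dim, T) from the cascade layer
        x = concat_feats.transpose(1, 2)          # LSTM expects (B, T, feat_dim)
        x, _ = self.lstm(x)                       # time-sequence memory features
        logits = self.linear(x)                   # (B, T, n_classes)
        return torch.softmax(logits, dim=-1)      # per-frame class probabilities
```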
7. The audio classification and separation method according to claim 1, characterized in that the separation network comprises: a mask calculation module, a multiplier and a decoding module;
inputting the superimposed multi-scale features to the mask calculation module, calculating to obtain a mask value of each encoder output channel in the audio encoding module, and inputting the mask value and the audio signal of the two-dimensional space to the multiplier;
and inputting the result processed by the multiplier into the decoding module, and converting the audio signal of the two-dimensional space after mask processing into time domain characteristics to obtain an audio separation result.
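A hedged sketch of this separation branch; the sigmoid mask non-linearity, the 1x1 mask convolution and the transposed-convolution decoder are assumptions about one plausible realization.

```python
import torch
import torch.nn as nn

class SeparationHead(nn.Module):
    """Mask computation -> element-wise multiply with the encoder output -> decode to time domain."""
    def __init__(self, n_ch=64, n_sources=2, win=16, hop=8):
        super().__init__()
        self.mask_conv = nn.Conv1d(n_ch, n_sources * n_ch, 1)           # mask calculation module
        self.decoder = nn.ConvTranspose1d(n_ch, 1, win, stride=hop, bias=False)
        self.n_sources, self.n_ch = n_sources, n_ch

    def forward(self, summed_feats, enc):          # summed_feats, enc: (B, n_ch, T)
        B, _, T = enc.shape
        masks = torch.sigmoid(self.mask_conv(summed_feats))
        masks = masks.view(B, self.n_sources, self.n_ch, T)
        masked = masks * enc.unsqueeze(1)          # multiplier: one masked copy per sound source
        wav = self.decoder(masked.reshape(B * self.n_sources, self.n_ch, T))
        return wav.view(B, self.n_sources, -1)     # separated time-domain waveforms
```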
8. The audio classification and separation method of claim 1, further comprising:
acquiring a plurality of audio frequencies through a data set;
cutting off the mute sections in the middle of the plurality of audios, and arranging the plurality of audios in descending order of audio length;
acquiring the longest audio among the plurality of audios, and supplementing a mute section at the front of the longest audio, wherein the length of the longest audio after the mute section is supplemented is a specified length;
supplementing a mute section in each of the plurality of audios, wherein the length of each audio after the mute section is supplemented is the specified length; taking each audio after the mute section is supplemented as a reference audio;
superposing and mixing the plurality of reference audios to obtain a mixed audio;
encoding the category of the mixed audio by one-hot encoding;
expanding the encoded class to the length of the corresponding section to obtain the classification label of each sampling point;
inputting the mixed audio into an audio classification and separation network to obtain an audio classification result and an audio separation result;
acquiring an audio classification network loss function through the audio classification result and the classification label, and acquiring an audio separation network loss function through the audio separation result and a reference audio;
and calculating a total loss function through the audio classification network loss function and the audio separation network loss function, and correcting the audio classification separation network model.
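The mixture construction and per-sample-point labels described here might be prepared roughly as follows; the function name, the NumPy layout and the assumption that target_len is at least as long as every source are illustrative.

```python
import numpy as np

def make_training_example(wavs, class_ids, n_classes, target_len):
    """Pad each source audio with leading silence up to target_len, superimpose the padded
    sources into a mixture, and expand each one-hot class code to every sample point of the
    section it covers.  Assumes target_len >= len(wav) for every source."""
    refs, labels = [], []
    for wav, cls in zip(wavs, class_ids):
        pad = target_len - len(wav)
        ref = np.concatenate([np.zeros(pad, dtype=np.float32),
                              np.asarray(wav, dtype=np.float32)])
        label = np.zeros((target_len, n_classes), dtype=np.float32)
        label[pad:] = np.eye(n_classes, dtype=np.float32)[cls]   # class label per sample point
        refs.append(ref)
        labels.append(label)
    mixture = np.sum(refs, axis=0)          # superimposed mixed audio
    return mixture, np.stack(refs), np.stack(labels)
```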
9. The audio classification and separation method according to claim 8, wherein the audio separation network loss function is obtained by a permutation invariance method, and the signal-to-noise ratio loss function in the separation network is:

$$L_{sep} = -\max_{\pi \in \mathcal{P}} \frac{1}{M} \sum_{j=1}^{M} \phi\left(\hat{s}_j, s_{\pi(j)}\right)$$

where $L_{sep}$ is the separation network loss function, $\hat{s}$ represents the sound source signals obtained by the audio separation network, i.e. the audio separation result, $s$ represents the reference signals, M represents the number of sound source signals, $\mathcal{P}$ represents the permutation combinations of the M sound source signals, of which there are M! combinations, and $\phi(\hat{s}_j, s_{\pi(j)})$ represents the performance index between a permuted sound source signal and the corresponding reference signal; the index is:

$$\phi\left(\hat{s}_j, s_{\pi(j)}\right) = \text{SI-SNR}\left(\hat{s}_j, s_{\pi(j)}\right)$$

where SI-SNR is the maximized scale-invariant signal-to-noise ratio, $\hat{s}_j$ and $s_{\pi(j)}$ denote the j-th audio separation result and its assigned reference signal, and the maximum value of j is M.
10. The audio classification and separation method of claim 9, wherein the maximized scale-invariant signal-to-noise ratio expression is:

$$s_{target} = \frac{\langle \hat{s}, s \rangle \, s}{\|s\|^2}, \qquad e_{noise} = \hat{s} - s_{target}, \qquad \text{SI-SNR} = 10 \log_{10} \frac{\|s_{target}\|^2}{\|e_{noise}\|^2}$$

where $\|x\|^2$ represents the signal power of x (x being $\hat{s}$, $e_{noise}$ or s), $e_{noise}$ is the interfering signal other than the desired audio signal, $\hat{s}$ is the audio separation result, and s is the reference signal.
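Claims 9 and 10 together amount to an utterance-level permutation-invariant SI-SNR loss; the following is a sketch under the usual zero-mean SI-SNR convention (the mean removal and the eps constants are assumptions).

```python
import itertools
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio between one estimate and one reference, (B, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

def pit_si_snr_loss(est, ref):
    """Permutation-invariant training loss: best of the M! source orderings, negated.
    est, ref: (B, M, samples)."""
    M = est.shape[1]
    best = None
    for perm in itertools.permutations(range(M)):
        snr = torch.stack([si_snr(est[:, j], ref[:, p]) for j, p in enumerate(perm)]).mean(0)
        best = snr if best is None else torch.maximum(best, snr)
    return -best.mean()
```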
11. The audio classification and separation method of claim 8, wherein the cross-entropy loss function is:

$$L_{CE} = -\sum_{k=1}^{C} q_k \log\left(p_k\right)$$

where $L_{CE}$ is the cross-entropy loss function, $p_k$ represents the class distribution probability obtained after the audio signal to be processed is input into the audio classification network for classification, $q_k$ represents the distribution probability of the classification label, C represents the number of classes, k indexes the k-th classification category or classification label, and the maximum value of k is C.
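A direct rendering of this cross-entropy, averaged over batch and sample points; the tensor shapes and the eps guard are assumptions.

```python
import torch

def classification_cross_entropy(pred_probs, label_probs, eps=1e-8):
    """H = -sum_k q_k * log(p_k), averaged over batch and sample points.
    pred_probs, label_probs: (batch, points, n_classes) probability distributions."""
    return -(label_probs * torch.log(pred_probs + eps)).sum(dim=-1).mean()
```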
12. The audio classification and separation method according to any one of claims 9 to 11, wherein the total loss function of the audio classification and separation network is a weighted average of the cross-entropy loss function in the audio classification network and the signal-to-noise ratio loss function in the audio separation network, expressed as:

$$L_{total} = \alpha L_{CE} + (1 - \alpha) L_{sep}$$

where $\alpha$ is a weight coefficient used to balance the two tasks of classification and separation.
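Combining the two branch losses as described here; treating the weighted average as alpha*CE + (1-alpha)*separation with a single coefficient is an assumption consistent with the reconstruction above.

```python
def total_loss(ce_loss, sep_loss, alpha=0.5):
    """Weighted combination of the classification (cross-entropy) and separation (SI-SNR) losses;
    alpha balances the two tasks, with alpha=0.5 giving the plain average."""
    return alpha * ce_loss + (1.0 - alpha) * sep_loss
```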
13. An audio classification and separation apparatus, comprising:
the determining module is used for determining the audio signal to be processed;
the audio coding module is used for inputting the audio signal to be processed into the audio coding module and transforming the audio signal to be processed into a two-dimensional space from a one-dimensional time domain; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
the multi-scale mapping module is used for inputting the audio signal of the two-dimensional space to the multi-scale mapping module and extracting the multi-scale features of the audio signal of the two-dimensional space;
the audio classification module is used for inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to the classifier module, and classifying the audio signals to be processed;
and the audio separation module is used for inputting the multi-scale features to the superimposer to obtain the superimposed multi-scale features, inputting the superimposed multi-scale features to a separation network, and separating the audio signals to be processed.
14. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the audio classification and separation method of any of claims 1-12.
15. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, carry out the audio classification and separation method of any of the preceding claims 1-12.
CN202110537306.0A 2021-05-18 2021-05-18 Audio classification and separation method and device, electronic equipment and storage medium Active CN112989107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110537306.0A CN112989107B (en) 2021-05-18 2021-05-18 Audio classification and separation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110537306.0A CN112989107B (en) 2021-05-18 2021-05-18 Audio classification and separation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112989107A true CN112989107A (en) 2021-06-18
CN112989107B CN112989107B (en) 2021-07-30

Family

ID=76336665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110537306.0A Active CN112989107B (en) 2021-05-18 2021-05-18 Audio classification and separation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112989107B (en)


Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009026433A1 (en) * 2007-08-21 2009-02-26 Cortica, Ltd. Signature generation for multimedia deep-content-classification by a large-scale matching system and method thereof
US20110276849A1 (en) * 2010-05-10 2011-11-10 Periasamy Pradeep System, circuit, and device for asynchronously scan capturing multi-clock domains
US20190066713A1 (en) * 2016-06-14 2019-02-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN109935222A (en) * 2018-11-23 2019-06-25 咪咕文化科技有限公司 A kind of method, apparatus and computer readable storage medium constructing chord converting vector
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110648656A (en) * 2019-08-28 2020-01-03 北京达佳互联信息技术有限公司 Voice endpoint detection method and device, electronic equipment and storage medium
US10600184B2 (en) * 2017-01-27 2020-03-24 Arterys Inc. Automated segmentation utilizing fully convolutional networks
CN111179911A (en) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 Target voice extraction method, device, equipment, medium and joint training method
CN111326168A (en) * 2020-03-25 2020-06-23 合肥讯飞数码科技有限公司 Voice separation method and device, electronic equipment and storage medium
CN111524530A (en) * 2020-04-23 2020-08-11 广州清音智能科技有限公司 Voice noise reduction method based on expansion causal convolution
CN111783431A (en) * 2019-04-02 2020-10-16 北京地平线机器人技术研发有限公司 Method and device for predicting word occurrence probability by using language model and training language model
CN111860138A (en) * 2020-06-09 2020-10-30 中南民族大学 Three-dimensional point cloud semantic segmentation method and system based on full-fusion network
CN112185352A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Voice recognition method and device and electronic equipment
CN112200099A (en) * 2020-10-14 2021-01-08 浙江大学山东工业技术研究院 Video-based dynamic heart rate detection method
CN112735382A (en) * 2020-12-22 2021-04-30 北京声智科技有限公司 Audio data processing method and device, electronic equipment and readable storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CESHINE LEE: "Implementing Temporal Convolutional Networks", 《HTTPS://MEDIUM.COM/THE-ARTIFICIAL-IMPOSTOR/NOTES-UNDERSTANDING-TENSORFLOW-PART-3-7F6633FCC7C7》 *
EDUARDO CORPENO: "Comparing Binary, Gray, and One-Hot Encoding", 《HTTPS://WWW.ALLABOUTCIRCUITS.COM/TECHNICAL-ARTICLES/COMPARING-BINARY-GRAY-ONE-HOT-ENCODING/》 *
李丹艳: "Research on Speech Emotion Recognition Based on Deep Learning", 《China Masters' Theses Full-text Database, Information Science and Technology》 *
魏世超 et al.: "Dimensionality Reduction and Visualization Method for Mixed-Attribute Data Based on E-t-SNE", 《Computer Engineering and Applications》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506566A (en) * 2021-06-22 2021-10-15 荣耀终端有限公司 Sound detection model training method, data processing method and related device
CN113299306A (en) * 2021-07-27 2021-08-24 北京世纪好未来教育科技有限公司 Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium
CN113299306B (en) * 2021-07-27 2021-10-15 北京世纪好未来教育科技有限公司 Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium
CN113705715A (en) * 2021-09-04 2021-11-26 大连钜智信息科技有限公司 Time sequence classification method based on LSTM and multi-scale FCN
CN113705715B (en) * 2021-09-04 2024-04-19 大连钜智信息科技有限公司 Time sequence classification method based on LSTM and multi-scale FCN
WO2023207193A1 (en) * 2022-04-29 2023-11-02 哲库科技(上海)有限公司 Audio separation method and apparatus, training method and apparatus, and device, storage medium and product
CN115083435A (en) * 2022-07-28 2022-09-20 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN115083435B (en) * 2022-07-28 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN115206294A (en) * 2022-09-16 2022-10-18 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium

Also Published As

Publication number Publication date
CN112989107B (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN112989107B (en) Audio classification and separation method and device, electronic equipment and storage medium
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN111179911B (en) Target voice extraction method, device, equipment, medium and joint training method
CN111626300B (en) Image segmentation method and modeling method of image semantic segmentation model based on context perception
WO2020098256A1 (en) Speech enhancement method based on fully convolutional neural network, device, and storage medium
CN109658943B (en) Audio noise detection method and device, storage medium and mobile terminal
CN113113041B (en) Voice separation method based on time-frequency cross-domain feature selection
CN113240683A (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN110544482A (en) single-channel voice separation system
Hao et al. A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments.
CN115602165A (en) Digital staff intelligent system based on financial system
CN116994564A (en) Voice data processing method and processing device
Jiang et al. An Improved Unsupervised Single‐Channel Speech Separation Algorithm for Processing Speech Sensor Signals
CN113241092A (en) Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
CN112259086A (en) Speech conversion method based on spectrogram synthesis
CN112989106B (en) Audio classification method, electronic device and storage medium
CN113099374B (en) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
CN113936680B (en) Single-channel voice enhancement method based on multi-scale information perception convolutional neural network
CN115691539A (en) Two-stage voice separation method and system based on visual guidance
CN112951218B (en) Voice processing method and device based on neural network model and electronic equipment
CN114067785B (en) Voice deep neural network training method and device, storage medium and electronic device
CN111312215A (en) Natural speech emotion recognition method based on convolutional neural network and binaural representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant