CN112989107A - Audio classification and separation method and device, electronic equipment and storage medium

Info

Publication number
CN112989107A
Authority
CN
China
Prior art keywords: audio, layer, classification, inputting, separation
Prior art date
Legal status
Granted
Application number
CN202110537306.0A
Other languages
Chinese (zh)
Other versions
CN112989107B (en)
Inventor
马路
杨嵩
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110537306.0A
Publication of CN112989107A
Application granted
Publication of CN112989107B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/65 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses an audio classification and separation method, an audio classification and separation device, electronic equipment and a storage medium, wherein the method comprises the following steps: determining an audio signal to be processed; inputting the audio signal to be processed into an audio coding module, and transforming it from a one-dimensional time domain into a two-dimensional space, the two-dimensional space comprising the time-domain characteristics of the audio and the channel information of the audio coding module; inputting the audio signal of the two-dimensional space into a multi-scale mapping module, and extracting multi-scale features of the audio signal of the two-dimensional space; inputting the multi-scale features into a cascade layer to obtain spliced multi-scale features, and inputting the spliced multi-scale features into a classifier module to classify the audio signal to be processed; and inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, and inputting the superimposed multi-scale features into a separation network to separate the audio signal to be processed. The invention thereby performs audio classification at the same time as audio separation.

Description

Audio classification and separation method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of audio processing, in particular to an audio classifying and separating method, an audio classifying and separating device, electronic equipment and a storage medium.
Background
Speaker separation is the scientific formulation of the "cocktail party" problem, and aims to separate the voice of a desired speaker from a mixture of multiple speakers. The conventional approach mainly uses spatial information: a microphone array picks up sound from a specific direction with a beam-forming algorithm. However, sound sources at similar positions cannot be separated, and the beam-forming effect degrades severely as environmental noise and reverberation change; meanwhile, the microphone array places high requirements on the consistency between microphones, which raises the hardware cost.
In a voice interaction scene, the quality of voice separation directly influences the back-end speech recognition rate and the listening experience of the user, and is therefore a key core technology of speech processing. In recent years, the separation effect of deep learning methods has been clearly better than that of traditional methods, with deep clustering and permutation invariant training being the two typical network training frameworks. However, both methods only separate well when the audios are completely overlapped, and perform poorly when the number of mixed audios is indefinite or the overlap ratio is small. In actual use, the number of mixed audios is uncertain and the overlapped portions are few, so the performance of these methods degrades seriously and cannot meet practical requirements.
Aiming at the problems in the prior art that the number of mixed audios is uncertain, the separation effect on audio with few overlapped portions is poor, and audio classification cannot be performed at the same time as audio separation, no effective solution has yet been proposed.
Disclosure of Invention
In view of this, embodiments of the present invention provide an audio classification and separation method, an audio classification and separation apparatus, an electronic device, and a storage medium, so as to solve the problems in the prior art that the number of mixed audios is uncertain, the separation effect on audio with few overlapped portions is poor, and audio classification cannot be performed at the same time as audio separation.
Therefore, the embodiment of the invention provides the following technical scheme:
in a first aspect of the present invention, there is provided an audio classifying and separating method, including:
determining an audio signal to be processed;
inputting the audio signal to be processed to an audio coding module, and transforming the audio signal to be processed from a one-dimensional time domain to a two-dimensional space; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
inputting an audio signal of a two-dimensional space to a multi-scale mapping module, and extracting multi-scale features of the audio signal of the two-dimensional space;
inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to a classifier module, and classifying the audio signals to be processed;
and inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, inputting the superimposed multi-scale features into a separation network, and separating the audio signals to be processed.
Optionally, the multi-scale mapping module comprises: a layer normalization layer, a first one-dimensional convolution module and a plurality of groups of cascaded dilated convolution networks; wherein each group of the dilated convolution networks comprises a plurality of cascaded dilated convolution blocks;
inputting the audio signal of the two-dimensional space to the layer normalization layer to obtain a first audio signal;
inputting the first audio signal to the first one-dimensional convolution module, and outputting a second audio signal;
and inputting the second audio signal to the multi-group of the expansion convolution networks, and extracting the multi-scale features.
Optionally, the method further comprises:
the expansion rate of each expanded volume is exponentially multiplied by 2 to be 2i-1; wherein when the swelling volume block is causal, the number of padding 0 is:
(dilation*(kernel_size-1))/2,
in the non-causal case, the number of padding 0 is:
dilation*(kernel_size-1);
partition represents expansion ratio, kernel _ size represents convolution kernel size, X represents the number of expanded volume blocks in each set of the expanded network, i represents the ith volume block, wherein
Figure 744412DEST_PATH_IMAGE001
And the maximum value of i is X.
Optionally, the dilated convolution block comprises: a point-by-point convolution layer, a first PReLU activation function layer, a first normalization layer, a depth convolution layer, a second PReLU activation function layer, a second normalization layer, a second one-dimensional convolution layer and a third one-dimensional convolution layer;
sequentially processing audio signals input to the point-by-point convolution layer through the point-by-point convolution layer, the first PReLU activation function layer, the first normalization layer, the depth convolution layer, the second PReLU activation function layer and the second normalization layer to obtain third audio signals;
inputting the third audio signal to the second one-dimensional convolutional layer and the third one-dimensional convolutional layer respectively to obtain a fourth audio signal and a fifth audio signal respectively;
inputting the fourth audio signal to the cascade layer and the superimposer;
superposing the fifth audio signal with the audio signal input to the point-by-point convolution layer to obtain a sixth audio signal;
inputting the sixth audio signal to the next dilated convolution block.
Optionally, the layer normalization layer comprises:
in a non-real-time scene, the layer normalization is global layer normalization, and the expression is:
gLN(F) = ((F − E[F]) / sqrt(Var[F] + ε)) ⊙ γ + β,
E[F] = (1/(N·T)) · Σ_{N,T} F,
Var[F] = (1/(N·T)) · Σ_{N,T} (F − E[F])²,
where gLN(F) is the result of layer normalization of F, F represents an audio feature of the two-dimensional space, γ and β are trainable parameters, ε is a stability coefficient, E[F] is the mean of F, Var[F] is the variance of F, N represents the length of the channel dimension, and T represents the length of the time dimension;
in a real-time scene, the layer normalization is cumulative layer normalization, and the expression is:
cLN(f_k) = ((f_k − E[f_{t≤k}]) / sqrt(Var[f_{t≤k}] + ε)) ⊙ γ + β,
E[f_{t≤k}] = (1/(N·k)) · Σ_{N, t≤k} f_t,
Var[f_{t≤k}] = (1/(N·k)) · Σ_{N, t≤k} (f_t − E[f_{t≤k}])²,
where cLN(f_k) is the result of layer normalization performed in real time on the k-th frame, f_k represents the feature of the k-th frame, f_{t≤k} represents the features of the consecutive k frames, namely [f_1, f_2, ..., f_k], γ and β are trainable parameters, ε represents the stability coefficient, E[f_{t≤k}] is the mean of f_{t≤k} over the frame sequence k, and Var[f_{t≤k}] is the variance of f_{t≤k} over the consecutive k frames.
Optionally, the classifier module comprises: a long-time and short-time memory network layer, a linear layer and a Softmax layer;
inputting the spliced multi-scale features to the long-time and short-time memory network layer for acquiring time sequence memory features;
inputting the timing memory characteristic to the linear layer;
and inputting the audio signal processed by the linear layer into a Softmax layer for classifying the audio signal processed by the linear layer.
Optionally, the separation network comprises: a mask calculation module, a multiplier and a decoding module;
inputting the superimposed multi-scale features to the mask calculation module, calculating to obtain a mask value of each encoder output channel in the audio encoding module, and inputting the mask value and the audio signal of the two-dimensional space to the multiplier;
and inputting the result processed by the multiplier into the decoding module, and converting the audio signal of the two-dimensional space after mask processing into time domain characteristics to obtain an audio separation result.
Optionally, the method further comprises:
acquiring a plurality of audio frequencies through a data set;
cutting off a mute section in the middle of the plurality of audios, and arranging the plurality of audios from large to small according to the audio length;
acquiring the longest audio frequency in the multiple audio frequencies, supplementing a mute section at the front section of the longest audio frequency, wherein the length of the longest audio frequency after the mute section is supplemented is a specified length;
supplementing a mute section in the plurality of audios, wherein the length of each audio after the mute section is supplemented is the specified length; taking each audio after the mute section is supplemented as a reference audio;
superposing and mixing a plurality of reference audios to obtain mixed audio;
encoding the category of the mixed audio by one hot;
expanding the encoded class length to the length of the corresponding section for obtaining the classification label of each sampling point;
inputting the mixed audio into an audio classification and separation network to obtain an audio classification result and an audio separation result;
acquiring an audio classification network loss function through the audio classification result and the classification label, and acquiring an audio separation network loss function through the audio separation result and a reference audio;
and calculating a total loss function through the audio classification network loss function and the audio separation network loss function, and correcting the audio classification separation network model.
Optionally, the method further comprises:
obtaining the loss function of the audio separation network by a permutation invariance method, wherein the signal-to-noise ratio loss function in the separation network is as follows:
L_sep = min_{P} [ −(1/M) · Σ_{j=1}^{M} SI-SNR(ŝ_{P(j)}, s_j) ],
where L_sep is the loss function of the separation network, ŝ represents a sound source signal obtained by the audio separation network, i.e. an audio separation result, s represents a reference signal, M indicates that there are M sound source signals, P denotes a permutation of the M sound source signals, of which there are M! combinations, and the performance index between the permuted sound source signal and the reference signal is the SI-SNR index, SI-SNR being the maximized scale-invariant signal-to-noise ratio; (ŝ_{P(j)}, s_j) denotes the j-th audio separation result and its corresponding reference (label) signal, 1 ≤ j ≤ M, and the maximum value of j is M.
Optionally, the method further comprises:
SI-SNR = 10·log10( ‖s_target‖² / ‖e_noise‖² ),  s_target = (⟨ŝ, s⟩ / ‖s‖²)·s,  e_noise = ŝ − s_target,
where ‖x‖² represents the signal power of x (x being s_target, e_noise or s), e_noise is the interfering signal other than the desired audio signal, and ŝ is the audio separation result.
Optionally, the method further comprises:
the cross entropy loss function is:
L_CE = −Σ_{k=1}^{C} y_k·log(p_k),
where L_CE is the cross entropy loss function, p_k represents the class distribution probability obtained after the audio signal to be processed is input into the audio classification network for classification, y_k represents the distribution probability of the classification label, C represents the number of classes, k indexes the k-th classification category or classification label, 1 ≤ k ≤ C, and the maximum value of k is C.
Optionally, the method further comprises:
the total loss function of the audio classification and separation network is a weighted average of the cross entropy loss function in the audio classification network and the signal-to-noise ratio loss function in the audio separation network, and the expression is:
L_total = α·L_sep + (1 − α)·log(L_CE),
where α is a weight coefficient, 0 < α < 1, used for balancing the two tasks of classification and separation.
In a second aspect of the present invention, there is provided an audio classification and separation apparatus, comprising:
the determining module is used for determining the audio signal to be processed;
the audio coding module is used for inputting the audio signal to be processed into the audio coding module and transforming the audio signal to be processed into a two-dimensional space from a one-dimensional time domain; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
the multi-scale mapping module is used for inputting the audio signal of the two-dimensional space to the multi-scale mapping module and extracting the multi-scale features of the audio signal of the two-dimensional space;
the audio classification module is used for inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to the classifier module, and classifying the audio signals to be processed;
and the audio separation module is used for inputting the multi-scale features to the superimposer to obtain the superimposed multi-scale features, inputting the superimposed multi-scale features to a separation network, and separating the audio signals to be processed.
In a third aspect of the present invention, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the audio classification and separation method of any one of the first aspect.
In a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the audio classification and separation method according to any one of the first aspect.
The technical scheme of the embodiment of the invention has the following advantages:
the embodiment of the invention provides an audio classification and separation method, an audio classification and separation device, electronic equipment and a storage medium, wherein the method comprises the following steps: an audio signal to be processed is determined. And inputting the audio signal to be processed into an audio coding module, and transforming the audio signal to be processed into a two-dimensional space from a one-dimensional time domain. The two-dimensional space comprises time domain characteristics of the audio and channel information of the audio coding module. And inputting the audio signal of the two-dimensional space to a multi-scale mapping module, and extracting multi-scale features of the audio signal of the two-dimensional space. Inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to a classifier module, and classifying the audio signals to be processed. And inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, inputting the superimposed multi-scale features into a separation network, and separating the audio signals to be processed. The embodiment of the invention solves the problems that the mixed audio quantity is uncertain, the audio separation effect with less overlapped parts is poor and the audio classification can not be carried out simultaneously when the audio is separated in the prior art. The embodiment of the invention provides a multi-target learning method, namely, a classification learning task is added while a separation task is learned. The audio classification and separation network can classify the mixed audio according to the number of each section of mixed audio, thereby improving the capability of the audio classification and separation network for separating the mixed audio and meeting the actual use requirement.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow diagram of an audio classification and separation method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of an audio classification separation network architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a dilated convolution block structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of training audio acquisition according to an embodiment of the present invention;
FIG. 5 is an audio classification separation network configuration table according to an embodiment of the present invention;
fig. 6 is a block diagram of the structure of an audio classification apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be operated, and thus should not be considered as limiting the present application. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more features. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In this application, the word "exemplary" is used to mean "serving as an example, instance, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the application. In the following description, details are set forth for the purpose of explanation. It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes are not set forth in detail in order to avoid obscuring the description of the present application with unnecessary detail. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
In accordance with an embodiment of the present invention, there is provided an audio classification and separation method embodiment, it is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In this embodiment, an embodiment of an audio classification and separation method is provided, which can be used in a speech classification and separation system, and fig. 1 is a flowchart of an audio classification and separation method according to an embodiment of the present invention, as shown in fig. 1, where the flowchart includes the following steps:
step S101, determining an audio signal to be processed. The audio signal to be processed is a mixed audio signal, and can be processed by the audio classification and separation network in the embodiment of the invention.
Step S102, inputting the audio signal to be processed to an audio coding module (Encoder), and transforming the audio signal to be processed from a one-dimensional time domain to a two-dimensional space. The two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module. Specifically, the audio coding module is a one-dimensional convolution, whereby the audio signal to be processed is transformed from a one-dimensional time domain to a two-dimensional space.
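As a rough illustration of this encoding step, the sketch below maps a raw waveform to a two-dimensional time/channel representation with a strided one-dimensional convolution. It is a minimal sketch only: the class name AudioEncoder and the channel count, kernel size and stride are assumptions, not values fixed by this embodiment.
```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps a 1-D waveform [batch, samples] to a 2-D representation [batch, channels, frames]."""
    def __init__(self, out_channels=256, kernel_size=20, stride=10):
        super().__init__()
        # One-dimensional convolution: each output channel is one basis of the learned transform.
        self.conv = nn.Conv1d(1, out_channels, kernel_size, stride=stride, bias=False)

    def forward(self, waveform):
        # waveform: [batch, samples] -> add a channel dimension -> [batch, 1, samples]
        x = waveform.unsqueeze(1)
        # Output: [batch, out_channels, frames]; time on one axis, encoder channels on the other.
        return torch.relu(self.conv(x))

# Example: a 1-second, 8 kHz mixture becomes a 2-D time/channel representation.
enc = AudioEncoder()
mix = torch.randn(2, 8000)
features = enc(mix)          # shape: [2, 256, 799]
```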
Step S103, inputting the audio signal of the two-dimensional space to a Multi-Scale Mapping module (Multi-Scale Mapping), and extracting the Multi-Scale feature of the audio signal of the two-dimensional space. Specifically, the features of the audio signal input to the module are extracted through a plurality of groups of expansion convolution networks in the multi-scale mapping, and the extracted features are output.
And step S104, inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to a classifier module, and classifying the audio signals to be processed. Specifically, the audio features extracted from each convolution block in the expanded convolution network are respectively spliced through the cascade layers, and the audio features are used for classifying the spliced audio signals by the classifier module.
And S105, inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, inputting the superimposed multi-scale features into a separation network, and separating the audio signals to be processed. Specifically, the audio features extracted from each convolution block in the expanded convolution network are respectively subjected to superposition processing by a superimposer, and the superimposer is used for separating the audio signals after superposition by a separation network.
The audio processing networks in the prior art can only realize audio classification or audio separation independently. Different from the prior art, in the embodiment of the invention an audio classification task is added while the audio separation task is carried out. The embodiment of the invention thus solves the problems that the number of mixed audios is uncertain, the separation effect on audio with few overlapped portions is poor, and audio classification cannot be carried out at the same time as audio separation. In the embodiment of the invention, classification is carried out according to the number of mixed audios in each segment, so that separation tasks with different numbers of mixed audios can be learned adaptively, meeting practical usage requirements. Moreover, by stacking a plurality of convolution blocks with different dilation rates, multi-scale information is extracted; the model is simple, the real-time rate is high, and the accuracy of audio classification and audio separation is high.
To further illustrate the classifier, in an alternative embodiment, the classifier module includes: long and short time memory network Layers (LSTM Layers), Linear Layers (Linear), and Softmax Layers. And inputting the spliced multi-scale features into a long-time and short-time memory network layer for acquiring the time sequence memory features. Specifically, compared with the conventional neural network, the long-time and short-time memory network layer is very suitable for processing the audio signals highly related to the time sequence, so that the audio signals to be classified are processed by using the long-time and short-time memory network layer, the audio signals are easier to classify, and errors are avoided.
The timing memory characteristic is input to a linear layer. And full link connection is realized through the linear layer, and the spliced multi-scale features are mapped to the Softmax layer one by one.
The audio signal processed by the linear layer is input to the Softmax layer, which classifies it. Specifically, the Softmax layer classifies the input audio signal by using a Softmax function, which essentially computes the probability of each classification category; classification with the Softmax function is computationally simple and effective.
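For reference, a minimal sketch of such a classifier head (LSTM, linear layer, Softmax) is given below; the class name Classifier, the hidden size and the frame-wise output are assumptions for illustration.
```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """LSTM -> Linear -> Softmax over C mixing classes, evaluated per time step."""
    def __init__(self, in_channels, num_classes, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, batch_first=True)   # time-sequence memory features
        self.linear = nn.Linear(hidden, num_classes)                  # full connection to C classes

    def forward(self, concat_features):
        # concat_features: [batch, channels, frames] from the concatenation (cascade) layer.
        x = concat_features.transpose(1, 2)        # LSTM expects [batch, frames, channels]
        x, _ = self.lstm(x)
        logits = self.linear(x)                    # [batch, frames, num_classes]
        return torch.softmax(logits, dim=-1)       # class probabilities per frame

# Example: 1024 concatenated feature channels, 3 classes.
clf = Classifier(in_channels=1024, num_classes=3)
probs = clf(torch.randn(2, 1024, 799))             # shape: [2, 799, 3]
```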
The above step S105 relates to a splitting network, which in an alternative embodiment comprises: mask calculation module, multiplier and decoding module. And inputting the superposed multi-scale features into a mask calculation module (Masking) for calculating to obtain a mask value of each encoder output channel in the audio encoding module, and inputting the mask value and the audio signal of the two-dimensional space into a multiplier. Specifically, the multi-scale features are input to a superimposer, and the input multi-scale features are superimposed by the superimposer to form an audio signal to be separated. And calculating a mask value of each encoder output channel in the audio encoding module through a mask calculation module, and synthesizing the output of the Sigmoid activation function and the audio signal of the two-dimensional space. In addition, the mask calculation module comprises a PReLU activation function, a one-dimensional convolutional layer (1 x1 Conv) and a Sigmoid activation function, wherein the activation function plays a role of nonlinear processing to increase the nonlinear fitting capability.
The result processed by the multiplier is input to the decoding module (Decoder), which converts the mask-processed audio signal of the two-dimensional space back into time-domain features to obtain the audio separation result. Specifically, the decoding module is formed by a transposed convolution network and performs deconvolution on the audio signal input to it to obtain a time-domain signal. As shown in fig. 2, the audio classification and separation network in the embodiment of the present invention uses a stacked one-dimensional convolutional network to extract multi-scale features of the input audio, and these multi-scale features are used for both audio classification and audio separation. By combining audio classification with audio separation, the model trained for audio classification also benefits audio separation through the shared neural network features, so the separation performance is improved; at the same time, treating audio with different mixing degrees as a classification task assists voice separation, making the method suitable for different numbers of mixed voices and further improving separation performance.
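A minimal sketch of this mask-and-decode path is shown below: a PReLU + 1x1 convolution + Sigmoid mask per source, element-wise multiplication with the encoder output, and a transposed-convolution decoder. The class name SeparationHead and all layer sizes are assumptions.
```python
import torch
import torch.nn as nn

class SeparationHead(nn.Module):
    """Estimates one mask per source, applies it to the encoder output, and decodes to waveforms."""
    def __init__(self, feature_channels, enc_channels, num_sources, kernel_size=20, stride=10):
        super().__init__()
        self.num_sources = num_sources
        self.enc_channels = enc_channels
        self.mask_net = nn.Sequential(
            nn.PReLU(),
            nn.Conv1d(feature_channels, num_sources * enc_channels, 1),  # 1x1 Conv
            nn.Sigmoid(),                                                 # mask values in (0, 1)
        )
        # Transposed convolution maps the masked 2-D representation back to the time domain.
        self.decoder = nn.ConvTranspose1d(enc_channels, 1, kernel_size, stride=stride, bias=False)

    def forward(self, summed_features, enc_out):
        # summed_features: [batch, feature_channels, frames]; enc_out: [batch, enc_channels, frames]
        b, _, frames = enc_out.shape
        masks = self.mask_net(summed_features).view(b, self.num_sources, self.enc_channels, frames)
        masked = masks * enc_out.unsqueeze(1)                  # multiplier: one masked copy per source
        wavs = self.decoder(masked.reshape(-1, self.enc_channels, frames))
        return wavs.view(b, self.num_sources, -1)              # [batch, sources, samples]

# Example with two sources and 256 encoder channels.
head = SeparationHead(feature_channels=256, enc_channels=256, num_sources=2)
wavs = head(torch.randn(2, 256, 799), torch.randn(2, 256, 799))   # shape: [2, 2, 8000]
```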
To further illustrate the multi-scale mapping module, in an alternative embodiment, the multi-scale mapping module comprises: a layer normalization layer (LayerNorm), a first one-dimensional convolution module (1X 1 Conv), and a plurality of sets of cascaded dilated convolution networks. Wherein each set of the dilated convolutional networks comprises a plurality of cascaded dilated convolutional blocks (1-D Conv Block). And inputting the audio signal of the two-dimensional space to the normalization layer to obtain a first audio signal. To ensure that the separation network is insensitive to the amplitude of the input speech, the input audio signal is layer normalized before being multi-scale mapped.
And inputting the first audio signal to a first one-dimensional convolution module, and outputting a second audio signal. The calculation is simplified through the first one-dimensional convolution module, and the calculation efficiency of the audio classification and separation network is improved.
And inputting the second audio signal to the multi-group expansion convolution network, and extracting the multi-scale features. Compared with the traditional convolution, the characteristic extraction through the expansion convolution network reduces the processing process of the audio signal, namely, the pooling processing is not needed, but the purpose of increasing the receptive field can be realized.
To further illustrate the dilated convolution network, in an alternative embodiment, the dilation rate of each dilated convolution block grows exponentially with base 2, the i-th block having a dilation rate of 2^(i-1); when the dilated convolution block is causal, the number of padded zeros is:
(dilation*(kernel_size-1))/2,
and in the non-causal case, the number of padded zeros is:
dilation*(kernel_size-1);
where dilation represents the dilation rate, kernel_size represents the convolution kernel size, X represents the number of dilated convolution blocks in each group of the dilated convolution network, and i represents the i-th convolution block, with 1 ≤ i ≤ X and a maximum value of X. Specifically, each dilated convolution block performs a dilated convolution, and the dilated convolution network is formed by cascading a plurality of dilated convolution blocks. Dilated convolution inserts zeros ("holes") into a standard convolution in order to enlarge the receptive field. The dilated convolution therefore has one more hyper-parameter than the standard convolution, called the dilation rate, which refers to the number of intervals between elements of the convolution kernel. In the embodiment of the invention, different numbers of zeros are padded depending on whether the dilated convolution block is causal, thereby achieving the purpose of increasing the receptive field without increasing the amount of calculation.
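A small helper illustrating the dilation schedule 2^(i-1) and the zero-padding rule exactly as stated in this text (causal: dilation*(kernel_size-1)/2; non-causal: dilation*(kernel_size-1)); the function names are only illustrative.
```python
def dilation_rate(i):
    """Dilation of the i-th block in a group (i = 1..X), growing exponentially with base 2."""
    return 2 ** (i - 1)

def num_padded_zeros(dilation, kernel_size, causal):
    """Number of zeros padded for a dilated convolution block, following the rule in the text."""
    if causal:
        return (dilation * (kernel_size - 1)) // 2
    return dilation * (kernel_size - 1)

# Example: with kernel_size = 3 and X = 8 blocks, dilations are 1, 2, 4, ..., 128.
for i in range(1, 9):
    d = dilation_rate(i)
    print(i, d, num_padded_zeros(d, 3, causal=False))
```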
To further illustrate the dilated convolution block, in an alternative embodiment, as shown in FIG. 3, the dilated convolution block comprises: a point-by-point convolution layer, a first PReLU activation function layer, a first normalization layer, a depth convolution layer, a second PReLU activation function layer, a second normalization layer, a second one-dimensional convolution layer, and a third one-dimensional convolution layer. The audio signal input to the point-by-point convolution layer is sequentially processed through the point-by-point convolution layer, the first PReLU activation function layer, the first normalization layer, the depth convolution layer, the second PReLU activation function layer and the second normalization layer to obtain a third audio signal. Specifically, the audio signal input to the point-by-point convolution layer includes the second audio signal and the audio signal output by the previous dilated convolution block. In the embodiment of the invention, the dilated convolution block replaces the conventional convolution with a depthwise separable convolution, i.e. it is split into a point-by-point convolution, denoted by 1x1-Conv, and a depth convolution, denoted by D-Conv. The PReLU (Parametric Rectified Linear Unit) is adopted as the activation function to perform nonlinear processing on the audio, thereby increasing the nonlinear fitting capability. The PReLU function is expressed as:
PReLU(x) = x for x ≥ 0, and PReLU(x) = a·x for x < 0,
where x represents the audio signal input to the activation function and a is the slope of the negative portion.
The third audio signal is input to the second one-dimensional convolutional layer and the third one-dimensional convolutional layer respectively to obtain a fourth audio signal and a fifth audio signal respectively. Inputting the fourth audio signal to the cascade layer and superimposer. And processing the multi-scale features through the cascade layer and the superimposer so as to obtain audio signals input to the classifier and the separation network.
The fifth audio signal is superposed with the audio signal input to the point-by-point convolution layer to obtain a sixth audio signal, and the sixth audio signal is input to the next dilated convolution block. This residual superposition prevents the gradient from vanishing and allows the network depth to be increased.
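Putting these pieces together, a sketch of one such block is given below; the residual output and the skip output correspond to the fifth and fourth audio signals above. The class name DilatedConvBlock, the channel sizes and the symmetric padding choice are assumptions.
```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """1x1-Conv -> PReLU -> Norm -> D-Conv (depthwise, dilated) -> PReLU -> Norm -> two 1x1 outputs."""
    def __init__(self, in_channels, hidden_channels, skip_channels, kernel_size, dilation):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2          # symmetric padding that keeps the frame count
        self.pointwise = nn.Conv1d(in_channels, hidden_channels, 1)
        self.prelu1 = nn.PReLU()
        self.norm1 = nn.GroupNorm(1, hidden_channels)    # stand-in for the normalization layer
        self.depthwise = nn.Conv1d(hidden_channels, hidden_channels, kernel_size,
                                   dilation=dilation, padding=pad, groups=hidden_channels)
        self.prelu2 = nn.PReLU()
        self.norm2 = nn.GroupNorm(1, hidden_channels)
        self.res_out = nn.Conv1d(hidden_channels, in_channels, 1)    # "fifth" signal (residual path)
        self.skip_out = nn.Conv1d(hidden_channels, skip_channels, 1) # "fourth" signal (to cascade/adder)

    def forward(self, x):
        y = self.norm1(self.prelu1(self.pointwise(x)))
        y = self.norm2(self.prelu2(self.depthwise(y)))
        skip = self.skip_out(y)            # goes to the concatenation layer and the superimposer
        residual = self.res_out(y) + x     # superposed with the block input -> next block
        return residual, skip

# Example block with dilation 4.
block = DilatedConvBlock(256, 512, 256, kernel_size=3, dilation=4)
res, skip = block(torch.randn(2, 256, 799))
```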
To further illustrate the layer normalization layer, in an alternative embodiment, the layer normalization layer comprises:
in a non-real-time scene, layer normalization is global layer normalization, and the expression is:
gLN(F) = ((F − E[F]) / sqrt(Var[F] + ε)) ⊙ γ + β,
E[F] = (1/(N·T)) · Σ_{N,T} F,
Var[F] = (1/(N·T)) · Σ_{N,T} (F − E[F])²,
where gLN(F) is the result of layer normalization of F, F represents an audio feature of the two-dimensional space, γ and β are trainable parameters, ε is a stability coefficient, E[F] is the mean of F, Var[F] is the variance of F, N represents the length of the channel dimension, and T represents the length of the time dimension;
in a real-time scene, layer normalization is cumulative layer normalization, and the expression is:
cLN(f_k) = ((f_k − E[f_{t≤k}]) / sqrt(Var[f_{t≤k}] + ε)) ⊙ γ + β,
E[f_{t≤k}] = (1/(N·k)) · Σ_{N, t≤k} f_t,
Var[f_{t≤k}] = (1/(N·k)) · Σ_{N, t≤k} (f_t − E[f_{t≤k}])²,
where cLN(f_k) is the result of layer normalization performed in real time on the k-th frame, f_k represents the feature of the k-th frame, f_{t≤k} represents the features of the consecutive k frames, namely [f_1, f_2, ..., f_k], γ and β are trainable parameters, ε represents the stability coefficient, E[f_{t≤k}] is the mean of f_{t≤k} over the frame sequence k, and Var[f_{t≤k}] is the variance of f_{t≤k} over the consecutive k frames. In a real-time scene, the layer normalization is performed on the audio features continuously input to the layer normalization layer, so as to ensure that the audio classification and separation network is insensitive to the amplitude of the input audio signal.
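A sketch of the two normalization variants described above, written for a [batch, N, T] feature tensor, is given below; the class names, epsilon value and parameter shapes are assumptions.
```python
import torch
import torch.nn as nn

class GlobalLayerNorm(nn.Module):
    """Non-real-time variant: normalizes over both the channel (N) and time (T) dimensions."""
    def __init__(self, num_channels, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1))   # trainable scale
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1))   # trainable shift
        self.eps = eps                                               # stability coefficient

    def forward(self, f):
        mean = f.mean(dim=(1, 2), keepdim=True)
        var = f.var(dim=(1, 2), keepdim=True, unbiased=False)
        return (f - mean) / torch.sqrt(var + self.eps) * self.gamma + self.beta

class CumulativeLayerNorm(nn.Module):
    """Real-time variant: frame k is normalized with statistics of frames 1..k only."""
    def __init__(self, num_channels, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1))
        self.eps = eps

    def forward(self, f):
        b, n, t = f.shape
        count = n * torch.arange(1, t + 1, device=f.device, dtype=f.dtype).view(1, 1, t)
        cum_sum = f.sum(dim=1, keepdim=True).cumsum(dim=2)          # sum over channels, then over time
        cum_sq = (f ** 2).sum(dim=1, keepdim=True).cumsum(dim=2)
        mean = cum_sum / count
        var = cum_sq / count - mean ** 2
        return (f - mean) / torch.sqrt(var + self.eps) * self.gamma + self.beta

# Example: both keep the [batch, N, T] shape.
x = torch.randn(2, 256, 799)
print(GlobalLayerNorm(256)(x).shape, CumulativeLayerNorm(256)(x).shape)
```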
To illustrate training the audio classification separation network, in an alternative embodiment, multiple audios are acquired through a data set, as shown in FIG. 4. The audio is acquired through the data set, so that the quality of the acquired audio is high, and the pure audio is convenient to acquire because no more noise is interfered in the single audio.
And cutting off a mute section in the middle of the multiple audios, and arranging the multiple audios from large to small according to the audio length. The silence section in the middle of each audio frequency is intercepted, so that more excellent classification labels can be obtained, the classification result of the mixed audio signal after the audio classification separation network processing is easier and more convenient to compare with the classification labels, and the condition of classification errors is avoided.
And acquiring the longest audio frequency in the plurality of audio frequencies, supplementing a mute section at the front section of the longest audio frequency, wherein the length of the longest audio frequency after the mute section is supplemented is the designated length. Supplementing a mute section in the plurality of audios, wherein the length of each audio after the mute section is supplemented is the specified length; and taking each audio after the mute segment is supplemented as a reference audio. The silent segments are supplemented in such a way that the classification labels have a universality and the silent segments supplemented with the predetermined length are one third of the longest audio length. Meanwhile, all the audios are supplemented with the mute sections, so that the lengths of the audio signals obtained after all the mute sections are intercepted are consistent, mixing is facilitated, and reference audios and classification labels are obtained.
And mixing a plurality of reference audios in a superposition manner to obtain mixed audio. The categories of the mixed audio are one-hot encoded. And expanding the encoded class length to the length of the corresponding section for obtaining the classification label of each sampling point. The classification labels are digitized by a mixed audio class coding mode, so that accurate classification labels are obtained, and comparison with classification results is facilitated.
And inputting the mixed audio into the audio classification and separation network to obtain an audio classification result and an audio separation result. And acquiring an audio classification network loss function through the audio classification result and the classification label, and acquiring an audio separation network loss function through the audio separation result and the reference audio. And calculating a total loss function through the audio classification network loss function and the audio separation network loss function, and correcting the audio classification separation network model. Specifically, the audio classification and separation network is trained through pre-designed mixed audio, and a loss function is calculated through calculating the results of audio classification and audio separation, classification labels and reference audio, so that errors are obtained. And correcting the audio classification and separation network through the error of each training, further obtaining the audio classification and separation network with smaller error, and formally applying the audio classification and separation network.
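A simplified sketch of the mixture and per-sample label construction described above is shown below. The reading that the class of a sample is the number of simultaneously active reference audios, as well as the silence threshold and the helper name, are assumptions for illustration.
```python
import numpy as np

def build_mixture_and_labels(sources, num_classes, silence_thresh=1e-4):
    """sources: list of 1-D numpy arrays already padded to the same length (the reference audios).

    Returns the summed mixture and a per-sample one-hot label whose class is the number of
    reference audios active at that sample (0, 1, 2, ... overlapping sources).
    """
    refs = np.stack(sources)                      # [num_sources, samples]
    mixture = refs.sum(axis=0)                    # superimpose the reference audios
    active = np.abs(refs) > silence_thresh        # which sources are non-silent at each sample
    count = active.sum(axis=0)                    # class per sample = number of active sources
    labels = np.eye(num_classes)[np.clip(count, 0, num_classes - 1)]   # one-hot, expanded per sample
    return mixture, labels

# Example with two toy "utterances" padded with silence to the same length.
a = np.concatenate([np.zeros(100), np.random.randn(200), np.zeros(100)])
b = np.concatenate([np.random.randn(150), np.zeros(250)])
mix, y = build_mixture_and_labels([a, b], num_classes=3)
print(mix.shape, y.shape)     # (400,) (400, 3)
```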
In an optional embodiment, the audio separation network loss function is obtained by a permutation invariance method, and the signal-to-noise ratio loss function in the separation network is:
L_sep = min_{P} [ −(1/M) · Σ_{j=1}^{M} SI-SNR(ŝ_{P(j)}, s_j) ],
where L_sep is the loss function of the separation network, ŝ represents a sound source signal obtained through the audio separation network, i.e. an audio separation result, s represents a reference signal, M indicates that there are M sound source signals, P denotes a permutation of the M sound source signals, of which there are M! combinations, and the performance index between the permuted sound source signal and the reference signal is the SI-SNR index, SI-SNR being the maximized scale-invariant signal-to-noise ratio; (ŝ_{P(j)}, s_j) denotes the j-th audio separation result and its corresponding reference (label) signal, 1 ≤ j ≤ M, and the maximum value of j is M. Specifically, the separated audio signals and the reference signals are put in one-to-one correspondence, all permutation combinations are formed, the corresponding SI-SNR is calculated, and the minimum loss among all the combinations is taken as the final loss for back propagation, so that the separation network model is corrected.
To illustrate the way the SI-SNR is calculated, in an alternative embodiment, the maximized scale-invariant signal-to-noise ratio is expressed as:
SI-SNR = 10·log10( ‖s_target‖² / ‖e_noise‖² ),  s_target = (⟨ŝ, s⟩ / ‖s‖²)·s,  e_noise = ŝ − s_target,
where ‖x‖² represents the signal power of x (x being s_target, e_noise or s), e_noise is the interfering signal other than the desired audio signal, and ŝ is the audio separation result.
The SI-SNR is calculated by the above formula, so as to obtain the output of the separation network loss function.
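The sketch below implements the SI-SNR measure and the permutation-invariant loss from the preceding paragraphs for a small number of sources; the exhaustive search over permutations is practical because M! stays small. Function names and the zero-mean step are assumptions of this sketch.
```python
import itertools
import torch

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio, in dB, for 1-D tensors."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    s_target = (torch.dot(estimate, reference) / (torch.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    return 10 * torch.log10((s_target.pow(2).sum() + eps) / (e_noise.pow(2).sum() + eps))

def pit_si_snr_loss(estimates, references):
    """Permutation-invariant loss: try all M! pairings and keep the smallest loss.

    estimates, references: [M, samples]
    """
    m = estimates.shape[0]
    losses = []
    for perm in itertools.permutations(range(m)):
        pairs = [si_snr(estimates[perm[j]], references[j]) for j in range(m)]
        losses.append(-sum(pairs) / m)           # negative mean SI-SNR for this pairing
    return torch.stack(losses).min()             # minimum over all permutations is back-propagated

# Example with M = 2 sources.
est = torch.randn(2, 8000, requires_grad=True)
ref = torch.randn(2, 8000)
loss = pit_si_snr_loss(est, ref)
loss.backward()
```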
In an alternative embodiment, the cross entropy loss function is:
L_CE = −Σ_{k=1}^{C} y_k·log(p_k),
where L_CE is the cross entropy loss function, p_k represents the class distribution probability obtained after the audio signal to be processed is input into the audio classification network for classification, y_k represents the distribution probability of the classification label, C represents the number of classes, k indexes the k-th classification category or classification label, 1 ≤ k ≤ C, and the maximum value of k is C. Specifically, after the mixed audio used for training the network is input, the error between the classification signal output by the classification network and the known classification label is calculated through the cross entropy loss function. The output of the cross entropy loss function is used to reversely correct the network model, so that the error between the classification result and the known classification label becomes smaller.
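For the classification branch, the per-sample cross entropy above can be computed directly from the frame-wise class probabilities; the sketch assumes the classifier outputs probabilities (after Softmax) and that the labels are one-hot per frame.
```python
import torch

def cross_entropy_loss(probs, one_hot_labels, eps=1e-8):
    """probs, one_hot_labels: [frames, C]; returns the mean cross entropy over frames."""
    return -(one_hot_labels * torch.log(probs + eps)).sum(dim=-1).mean()

# Example: 400 frames, C = 3 classes.
probs = torch.softmax(torch.randn(400, 3), dim=-1)
labels = torch.eye(3)[torch.randint(0, 3, (400,))]
print(cross_entropy_loss(probs, labels))
```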
To illustrate the total loss function of the audio classification and separation network, in an alternative embodiment, the total loss function of the audio classification and separation network is a weighted average of the cross entropy loss function in the audio classification network and the signal-to-noise ratio loss function in the audio separation network, and the expression is:
L_total = α·L_sep + (1 − α)·log(L_CE),
where α is a weight coefficient, 0 < α < 1, used for balancing the two tasks of classification and separation. The classification cross entropy loss function is logged so that both loss functions remain at the same order of magnitude. The total loss of the audio classification and separation network is calculated in this weighted manner, the overall performance of the network is comprehensively considered, the network model is reversely corrected, and the tasks of audio classification and audio separation are thereby completed.
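One possible reading of this weighted combination is sketched below, with the separation loss weighted against the logarithm of the classification loss; the function name and the value of α are assumptions.
```python
import torch

def total_loss(separation_loss, classification_loss, alpha=0.5):
    """Weighted combination; log() keeps the cross-entropy term on a comparable scale."""
    return alpha * separation_loss + (1 - alpha) * torch.log(classification_loss)
```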
In an alternative embodiment, the audio classification and separation network requires a network configuration, and the network configuration table is shown in fig. 5. Here G represents the number of output channels of the Encoder; L represents the convolution kernel size of the Encoder; the bottleneck layer (Bottleneck) comprises the layer normalization layer and the first one-dimensional convolution module in the Multi-Scale Mapping module, and its number of output channels is B; the number of 1-D Conv blocks in each group of the Multi-Scale Mapping is X, and R groups are stacked together; the number of input channels of the Classifier is B × R, the number of channels after the multi-scale features are stacked, and the number of output channels is C, i.e. the audio is divided into C categories; the number of output channels of Masking is M × F, where M represents the number of mixed audios, i.e., the number of sound source signals.
In this embodiment, an audio classifying and separating device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and the description of the device already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The present embodiment provides an audio classifying and separating apparatus, as shown in fig. 6, including:
a determining module 61, configured to determine an audio signal to be processed;
the audio coding module 62 is configured to input the audio signal to be processed to the audio coding module, and transform the audio signal to be processed from a one-dimensional time domain to a two-dimensional space; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
the multi-scale mapping module 63 is configured to input the audio signal in the two-dimensional space to the multi-scale mapping module, and extract multi-scale features of the audio signal in the two-dimensional space;
the audio classification module 64 is configured to input the multi-scale features to the cascade layer to obtain spliced multi-scale features, input the spliced multi-scale features to the classifier module, and classify the audio signal to be processed;
and the audio separation module 65 is configured to input the multi-scale features to the superimposer to obtain superimposed multi-scale features, and input the superimposed multi-scale features to the separation network to separate the audio signals to be processed.
The audio classification and separation apparatus in this embodiment is presented in the form of a functional unit, where the unit refers to an ASIC circuit, a processor and memory executing one or more software or fixed programs, and/or other devices that may provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, which has the audio frequency classification and separation apparatus shown in fig. 6.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an alternative embodiment of the present invention, and as shown in fig. 7, the electronic device may include: at least one processor 701, such as a CPU (Central Processing Unit), at least one communication interface 703, a memory 707, and at least one communication bus 702. Wherein a communication bus 702 is used to enable connective communication between these components. The communication interface 703 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional communication interface 703 may also include a standard wired interface and a standard wireless interface. The Memory 707 may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The memory 707 may optionally be at least one storage device located remotely from the processor 701. Wherein the processor 701 may be in connection with the apparatus described in fig. 6, an application program is stored in the memory 707, and the processor 701 invokes the program code stored in the memory 707 for performing any of the method steps described above.
The communication bus 702 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 702 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
The memory 707 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 707 may also comprise a combination of the above types of memory.
The processor 701 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 701 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
Optionally, the memory 707 is also used for storing program instructions. The processor 701 may invoke the program instructions to implement the audio classification and separation method as shown in the embodiment of fig. 1 of the present application.
Embodiments of the present invention further provide a non-transitory computer storage medium, where computer-executable instructions are stored, and the computer-executable instructions may execute the audio classification and separation method in any of the above method embodiments. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (15)

1. A method of audio classification and separation, comprising:
determining an audio signal to be processed;
inputting the audio signal to be processed to an audio coding module, and transforming the audio signal to be processed from a one-dimensional time domain to a two-dimensional space; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
inputting an audio signal of a two-dimensional space to a multi-scale mapping module, and extracting multi-scale features of the audio signal of the two-dimensional space;
inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to a classifier module, and classifying the audio signals to be processed;
and inputting the multi-scale features into a superimposer to obtain superimposed multi-scale features, inputting the superimposed multi-scale features into a separation network, and separating the audio signals to be processed.
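For orientation, the following is a minimal, self-contained PyTorch sketch of the data flow recited in claim 1; the class name AudioClassifySeparate, every layer size, the sigmoid mask and the simple residual stand-in for the multi-scale mapping module are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class AudioClassifySeparate(nn.Module):
    """Toy end-to-end sketch of the claim-1 data flow; every size here is an assumption."""
    def __init__(self, n_ch=64, n_blocks=4, n_classes=4, n_sources=2, win=16, hop=8):
        super().__init__()
        # Audio coding module: 1-D waveform -> 2-D (channel x time) representation.
        self.encoder = nn.Conv1d(1, n_ch, win, stride=hop, bias=False)
        # Stand-in for the multi-scale mapping module: dilated convolutions with growing dilation.
        self.blocks = nn.ModuleList(
            [nn.Conv1d(n_ch, n_ch, 3, padding=2 ** i, dilation=2 ** i) for i in range(n_blocks)]
        )
        # Classification branch: operates on the concatenated (cascaded) multi-scale features.
        self.classifier = nn.Linear(n_blocks * n_ch, n_classes)
        # Separation branch: operates on the summed (superimposed) multi-scale features.
        self.mask = nn.Conv1d(n_ch, n_sources * n_ch, 1)
        self.decoder = nn.ConvTranspose1d(n_ch, 1, win, stride=hop, bias=False)
        self.n_sources, self.n_ch = n_sources, n_ch

    def forward(self, wav):                                   # wav: (B, 1, samples)
        enc = self.encoder(wav)                               # (B, n_ch, T)
        feats, x = [], enc
        for blk in self.blocks:
            x = torch.relu(blk(x)) + x                        # residual dilated convolution
            feats.append(x)                                   # one feature map per scale
        concat = torch.cat(feats, dim=1)                      # cascade layer (concatenation)
        summed = torch.stack(feats, dim=0).sum(dim=0)         # superimposer (element-wise sum)
        cls_logits = self.classifier(concat.mean(dim=-1))     # classify the mixture
        masks = torch.sigmoid(self.mask(summed))              # (B, n_sources * n_ch, T)
        masks = masks.view(-1, self.n_sources, self.n_ch, enc.shape[-1])
        sep = self.decoder((masks * enc.unsqueeze(1)).flatten(0, 1))
        return cls_logits, sep.view(wav.shape[0], self.n_sources, -1)
```

With these assumed sizes, a call such as `AudioClassifySeparate()(torch.randn(2, 1, 8000))` returns a (2, 4) tensor of class logits and a (2, 2, 8000) tensor of separated waveforms.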
2. The audio classification and separation method of claim 1, wherein the multi-scale mapping module comprises: a layer normalization layer, a first one-dimensional convolution module, and a plurality of groups of cascaded dilated convolutional networks; wherein each group of the dilated convolutional networks comprises a plurality of cascaded dilated convolution blocks;
inputting the audio signal of the two-dimensional space to the layer normalization layer to obtain a first audio signal;
inputting the first audio signal to the first one-dimensional convolution module, and outputting a second audio signal;
and inputting the second audio signal to the plurality of groups of dilated convolutional networks, and extracting the multi-scale features.
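A hedged sketch of this module layout (layer normalization, then a first one-dimensional bottleneck convolution, then groups of cascaded dilated convolution blocks) follows; the use of GroupNorm(1, C) as a channel-wise layer normalization, the plain dilated Conv1d stand-in for the claim-4 block, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleMapper(nn.Module):
    """Assumed layout: layer normalization -> 1x1 Conv1d -> R groups of X dilated conv blocks."""
    def __init__(self, n_ch=64, n_groups=3, blocks_per_group=4):
        super().__init__()
        self.norm = nn.GroupNorm(1, n_ch)              # channel-wise layer normalization stand-in
        self.bottleneck = nn.Conv1d(n_ch, n_ch, 1)     # the "first one-dimensional convolution module"
        self.blocks = nn.ModuleList([
            # Stand-in for the claim-4 dilated convolution block; dilation restarts in each group.
            nn.Conv1d(n_ch, n_ch, 3, dilation=2 ** i, padding=2 ** i)
            for _ in range(n_groups) for i in range(blocks_per_group)
        ])

    def forward(self, enc):                            # enc: (B, n_ch, T), the 2-D encoded audio
        x = self.bottleneck(self.norm(enc))
        multi_scale = []
        for blk in self.blocks:
            x = torch.relu(blk(x)) + x                 # cascaded blocks with residual connections
            multi_scale.append(x)                      # one multi-scale feature map per block
        return multi_scale
```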
3. The audio classification and separation method of claim 2, wherein the dilation rate of the dilated convolution blocks grows exponentially in powers of 2, the i-th block having a dilation rate of $2^{i-1}$; when the dilated convolution block is causal, the number of padded zeros is:
(dilation*(kernel_size-1))/2,
and in the non-causal case, the number of padded zeros is:
dilation*(kernel_size-1);
where dilation denotes the dilation rate, kernel_size denotes the convolution kernel size, X denotes the number of dilated convolution blocks in each group of the dilated convolutional network, and i denotes the i-th convolution block, with
$1 \le i \le X$,
i.e. the maximum value of i is X.
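The dilation and zero-padding rule above can be written as a small helper, taking the causal/non-causal split exactly as recited in this claim; the function name and defaults are illustrative.

```python
def dilation_and_padding(i, kernel_size=3, causal=False):
    """Dilation grows exponentially with the (1-based) block index i: 1, 2, 4, ..., 2**(i-1).
    Zero-padding follows the rule stated in claim 3."""
    dilation = 2 ** (i - 1)
    if causal:
        padding = (dilation * (kernel_size - 1)) // 2
    else:
        padding = dilation * (kernel_size - 1)
    return dilation, padding

# Example: the 4th block with a kernel size of 3 -> dilation 8, non-causal padding 16.
# dilation_and_padding(4) == (8, 16)
```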
4. The audio classification and separation method of claim 3, wherein the dilated convolution block comprises: a point-by-point convolution layer, a first PReLU activation function layer, a first normalization layer, a depth convolution layer, a second PReLU activation function layer, a second normalization layer, a second one-dimensional convolution layer and a third one-dimensional convolution layer;
sequentially processing audio signals input to the point-by-point convolution layer through the point-by-point convolution layer, the first PReLU activation function layer, the first normalization layer, the depth convolution layer, the second PReLU activation function layer and the second normalization layer to obtain third audio signals;
inputting the third audio signal to the second one-dimensional convolutional layer and the third one-dimensional convolutional layer respectively to obtain a fourth audio signal and a fifth audio signal respectively;
inputting the fourth audio signal to the cascade layer and the superimposer;
superposing the fifth audio signal with the audio signal input to the point-by-point convolution layer to obtain a sixth audio signal;
inputting the sixth audio signal to the next dilated convolution block.
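A possible PyTorch rendering of such a block, with the two one-dimensional output convolutions producing a skip path (the fourth audio signal, routed to the cascade layer and the superimposer) and a residual path (the fifth audio signal, added back to the block input); the hidden width, the GroupNorm normalization and the depthwise grouping are assumptions.

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Pointwise conv -> PReLU -> norm -> depthwise conv -> PReLU -> norm,
    followed by two 1x1 convolutions: a skip output and a residual output."""
    def __init__(self, in_ch=64, hidden_ch=128, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.pointwise = nn.Conv1d(in_ch, hidden_ch, 1)
        self.prelu1 = nn.PReLU()
        self.norm1 = nn.GroupNorm(1, hidden_ch)
        self.depthwise = nn.Conv1d(hidden_ch, hidden_ch, kernel_size,
                                   padding=pad, dilation=dilation, groups=hidden_ch)
        self.prelu2 = nn.PReLU()
        self.norm2 = nn.GroupNorm(1, hidden_ch)
        self.skip_conv = nn.Conv1d(hidden_ch, in_ch, 1)   # produces the "fourth audio signal"
        self.res_conv = nn.Conv1d(hidden_ch, in_ch, 1)    # produces the "fifth audio signal"

    def forward(self, x):                                  # x: (B, in_ch, T)
        h = self.norm1(self.prelu1(self.pointwise(x)))
        h = self.norm2(self.prelu2(self.depthwise(h)))
        skip = self.skip_conv(h)            # goes to the cascade layer and the superimposer
        residual = self.res_conv(h) + x     # the "sixth audio signal", fed to the next block
        return residual, skip
```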
5. The audio classification and separation method of claim 1, wherein the layer normalization layer comprises:
in a non-real-time scene, the layer normalization is global layer normalization, and the expression is:

$$gLN(F) = \frac{F - E[F]}{\sqrt{Var[F] + \epsilon}} \odot \gamma + \beta$$

where $gLN(F)$ is the result of the layer normalization for F, F represents the audio features of the two-dimensional space, $\gamma$ and $\beta$ are trainable parameters, $\epsilon$ is a stability coefficient, $E[F]$ is the mean of F, $Var[F]$ is the variance of F, N represents the length of the channel dimension, and T represents the length of the time dimension;
in a real-time scene, the layer normalization is cumulative layer normalization, and the expression is:

$$cLN(f_k) = \frac{f_k - E[f_{t \le k}]}{\sqrt{Var[f_{t \le k}] + \epsilon}} \odot \gamma + \beta$$

where $cLN(f_k)$ is the result of the layer normalization for the k-th frame in real time, $f_k$ represents the feature of the k-th frame, $f_{t \le k}$ represents the features of the consecutive k frames, namely $f_{t \le k} = [f_1, f_2, \dots, f_k]$, $\gamma$ and $\beta$ are trainable parameters, $\epsilon$ is the stability coefficient, $E[f_{t \le k}]$ is the mean over the frame sequence $t \le k$, and $Var[f_{t \le k}]$ is the variance over the consecutive k frames.
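Both normalization variants can be sketched directly from the expressions above; the (batch, N, T) tensor layout and the broadcastable shapes of gamma and beta are assumptions.

```python
import torch

def global_layer_norm(F, gamma, beta, eps=1e-8):
    """gLN (non-real-time): statistics over both the channel (N) and time (T) dimensions.
    F: (batch, N, T); gamma, beta: broadcastable, e.g. shape (1, N, 1)."""
    mean = F.mean(dim=(1, 2), keepdim=True)
    var = ((F - mean) ** 2).mean(dim=(1, 2), keepdim=True)
    return (F - mean) / torch.sqrt(var + eps) * gamma + beta

def cumulative_layer_norm(F, gamma, beta, eps=1e-8):
    """cLN (real-time): for frame k, statistics over the channels of frames 1..k only."""
    B, N, T = F.shape
    cum_sum = F.sum(dim=1, keepdim=True).cumsum(dim=2)           # (B, 1, T)
    cum_pow = F.pow(2).sum(dim=1, keepdim=True).cumsum(dim=2)    # (B, 1, T)
    count = N * torch.arange(1, T + 1, device=F.device, dtype=F.dtype).view(1, 1, T)
    mean = cum_sum / count
    var = cum_pow / count - mean.pow(2)
    return (F - mean) / torch.sqrt(var + eps) * gamma + beta
```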
6. The audio classification and separation method of claim 1, wherein the classifier module comprises: a long short-term memory (LSTM) network layer, a linear layer and a Softmax layer;
inputting the spliced multi-scale features to the LSTM network layer to acquire time-sequence memory features;
inputting the time-sequence memory features to the linear layer;
and inputting the output of the linear layer into the Softmax layer to classify the audio signal processed by the linear layer.
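A minimal classifier head along these lines; the hidden size, the single LSTM layer and the frame-wise softmax are assumptions.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """LSTM -> Linear -> Softmax over the concatenated multi-scale features."""
    def __init__(self, feat_dim, hidden=256, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.linear = nn.Linear(hidden, n_classes)

    def forward(self, concat_feats):              # (B, feat_dim, T) from the cascade layer
        x = concat_feats.transpose(1, 2)          # LSTM expects (B, T, feat_dim)
        x, _ = self.lstm(x)                       # time-sequence memory features
        logits = self.linear(x)                   # (B, T, n_classes)
        return torch.softmax(logits, dim=-1)      # per-frame class probabilities
```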
7. The audio classification and separation method according to claim 1, characterized in that the separation network comprises: a mask calculation module, a multiplier and a decoding module;
inputting the superimposed multi-scale features to the mask calculation module, calculating to obtain a mask value of each encoder output channel in the audio encoding module, and inputting the mask value and the audio signal of the two-dimensional space to the multiplier;
and inputting the result processed by the multiplier into the decoding module, and converting the audio signal of the two-dimensional space after mask processing into time domain characteristics to obtain an audio separation result.
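A hedged sketch of this separation branch; the sigmoid mask non-linearity, the 1x1 mask convolution and the transposed-convolution decoder are assumptions about one plausible realization.

```python
import torch
import torch.nn as nn

class SeparationHead(nn.Module):
    """Mask computation -> element-wise multiply with the encoder output -> decode to time domain."""
    def __init__(self, n_ch=64, n_sources=2, win=16, hop=8):
        super().__init__()
        self.mask_conv = nn.Conv1d(n_ch, n_sources * n_ch, 1)           # mask calculation module
        self.decoder = nn.ConvTranspose1d(n_ch, 1, win, stride=hop, bias=False)
        self.n_sources, self.n_ch = n_sources, n_ch

    def forward(self, summed_feats, enc):          # summed_feats, enc: (B, n_ch, T)
        B, _, T = enc.shape
        masks = torch.sigmoid(self.mask_conv(summed_feats))
        masks = masks.view(B, self.n_sources, self.n_ch, T)
        masked = masks * enc.unsqueeze(1)          # multiplier: one masked copy per sound source
        wav = self.decoder(masked.reshape(B * self.n_sources, self.n_ch, T))
        return wav.view(B, self.n_sources, -1)     # separated time-domain waveforms
```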
8. The audio classification and separation method of claim 1, further comprising:
acquiring a plurality of audio frequencies through a data set;
cutting off the mute sections in the middle of the plurality of audios, and arranging the plurality of audios in descending order of audio length;
acquiring the longest audio among the plurality of audios, and supplementing a mute section at the front of the longest audio, wherein the length of the longest audio after the mute section is supplemented is a specified length;
supplementing a mute section in each of the plurality of audios, wherein the length of each audio after the mute section is supplemented is the specified length; taking each audio after the mute section is supplemented as a reference audio;
superposing and mixing the plurality of reference audios to obtain a mixed audio;
encoding the category of the mixed audio by one-hot encoding;
expanding the encoded class to the length of the corresponding section to obtain the classification label of each sampling point;
inputting the mixed audio into an audio classification and separation network to obtain an audio classification result and an audio separation result;
acquiring an audio classification network loss function through the audio classification result and the classification label, and acquiring an audio separation network loss function through the audio separation result and a reference audio;
and calculating a total loss function through the audio classification network loss function and the audio separation network loss function, and correcting the audio classification separation network model.
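The mixture construction and per-sample-point labels described here might be prepared roughly as follows; the function name, the NumPy layout and the assumption that target_len is at least as long as every source are illustrative.

```python
import numpy as np

def make_training_example(wavs, class_ids, n_classes, target_len):
    """Pad each source audio with leading silence up to target_len, superimpose the padded
    sources into a mixture, and expand each one-hot class code to every sample point of the
    section it covers.  Assumes target_len >= len(wav) for every source."""
    refs, labels = [], []
    for wav, cls in zip(wavs, class_ids):
        pad = target_len - len(wav)
        ref = np.concatenate([np.zeros(pad, dtype=np.float32),
                              np.asarray(wav, dtype=np.float32)])
        label = np.zeros((target_len, n_classes), dtype=np.float32)
        label[pad:] = np.eye(n_classes, dtype=np.float32)[cls]   # class label per sample point
        refs.append(ref)
        labels.append(label)
    mixture = np.sum(refs, axis=0)          # superimposed mixed audio
    return mixture, np.stack(refs), np.stack(labels)
```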
9. The audio classification and separation method according to claim 8, wherein the audio separation network loss function is obtained by a permutation invariance method, and the signal-to-noise ratio loss function in the separation network is:

$$L_{sep} = -\max_{\pi \in \mathcal{P}} \frac{1}{M} \sum_{j=1}^{M} \phi\left(\hat{s}_j, s_{\pi(j)}\right)$$

where $L_{sep}$ is the separation network loss function, $\hat{s}$ represents the sound source signals obtained by the audio separation network, i.e. the audio separation result, $s$ represents the reference signals, M represents the number of sound source signals, $\mathcal{P}$ represents the permutation combinations of the M sound source signals, of which there are M! combinations, and $\phi(\hat{s}_j, s_{\pi(j)})$ represents the performance index between a permuted sound source signal and the corresponding reference signal; the index is:

$$\phi\left(\hat{s}_j, s_{\pi(j)}\right) = \text{SI-SNR}\left(\hat{s}_j, s_{\pi(j)}\right)$$

where SI-SNR is the maximized scale-invariant signal-to-noise ratio, $\hat{s}_j$ and $s_{\pi(j)}$ denote the j-th audio separation result and its assigned reference signal, and the maximum value of j is M.
10. The audio classification and separation method of claim 9, wherein the maximized scale-invariant signal-to-noise ratio expression is:

$$s_{target} = \frac{\langle \hat{s}, s \rangle \, s}{\|s\|^2}, \qquad e_{noise} = \hat{s} - s_{target}, \qquad \text{SI-SNR} = 10 \log_{10} \frac{\|s_{target}\|^2}{\|e_{noise}\|^2}$$

where $\|x\|^2$ represents the signal power of x (x being $\hat{s}$, $e_{noise}$ or s), $e_{noise}$ is the interfering signal other than the desired audio signal, $\hat{s}$ is the audio separation result, and s is the reference signal.
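Claims 9 and 10 together amount to an utterance-level permutation-invariant SI-SNR loss; the following is a sketch under the usual zero-mean SI-SNR convention (the mean removal and the eps constants are assumptions).

```python
import itertools
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio between one estimate and one reference, (B, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

def pit_si_snr_loss(est, ref):
    """Permutation-invariant training loss: best of the M! source orderings, negated.
    est, ref: (B, M, samples)."""
    M = est.shape[1]
    best = None
    for perm in itertools.permutations(range(M)):
        snr = torch.stack([si_snr(est[:, j], ref[:, p]) for j, p in enumerate(perm)]).mean(0)
        best = snr if best is None else torch.maximum(best, snr)
    return -best.mean()
```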
11. The audio classification and separation method of claim 8, wherein the cross-entropy loss function is:

$$L_{CE} = -\sum_{k=1}^{C} q_k \log\left(p_k\right)$$

where $L_{CE}$ is the cross-entropy loss function, $p_k$ represents the class distribution probability obtained after the audio signal to be processed is input into the audio classification network for classification, $q_k$ represents the distribution probability of the classification label, C represents the number of classes, k indexes the k-th classification category or classification label, and the maximum value of k is C.
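A direct rendering of this cross-entropy, averaged over batch and sample points; the tensor shapes and the eps guard are assumptions.

```python
import torch

def classification_cross_entropy(pred_probs, label_probs, eps=1e-8):
    """H = -sum_k q_k * log(p_k), averaged over batch and sample points.
    pred_probs, label_probs: (batch, points, n_classes) probability distributions."""
    return -(label_probs * torch.log(pred_probs + eps)).sum(dim=-1).mean()
```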
12. The audio classification and separation method according to any one of claims 9 to 11, wherein the total loss function of the audio classification and separation network is a weighted average of the cross-entropy loss function in the audio classification network and the signal-to-noise ratio loss function in the audio separation network, expressed as:

$$L_{total} = \alpha L_{CE} + (1 - \alpha) L_{sep}$$

where $\alpha$ is a weight coefficient used to balance the two tasks of classification and separation.
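Combining the two branch losses as described here; treating the weighted average as alpha*CE + (1-alpha)*separation with a single coefficient is an assumption consistent with the reconstruction above.

```python
def total_loss(ce_loss, sep_loss, alpha=0.5):
    """Weighted combination of the classification (cross-entropy) and separation (SI-SNR) losses;
    alpha balances the two tasks, with alpha=0.5 giving the plain average."""
    return alpha * ce_loss + (1.0 - alpha) * sep_loss
```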
13. An audio classification and separation apparatus, comprising:
the determining module is used for determining the audio signal to be processed;
the audio coding module is used for inputting the audio signal to be processed into the audio coding module and transforming the audio signal to be processed into a two-dimensional space from a one-dimensional time domain; the two-dimensional space comprises time domain characteristics of audio and channel information of the audio coding module;
the multi-scale mapping module is used for inputting the audio signal of the two-dimensional space to the multi-scale mapping module and extracting the multi-scale features of the audio signal of the two-dimensional space;
the audio classification module is used for inputting the multi-scale features to a cascade layer to obtain spliced multi-scale features, inputting the spliced multi-scale features to the classifier module, and classifying the audio signals to be processed;
and the audio separation module is used for inputting the multi-scale features to the superimposer to obtain the superimposed multi-scale features, inputting the superimposed multi-scale features to a separation network, and separating the audio signals to be processed.
14. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the audio classification and separation method of any of claims 1-12.
15. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, carry out the audio classification and separation method of any of the preceding claims 1-12.
CN202110537306.0A 2021-05-18 2021-05-18 Audio classification and separation method and device, electronic equipment and storage medium Active CN112989107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110537306.0A CN112989107B (en) 2021-05-18 2021-05-18 Audio classification and separation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110537306.0A CN112989107B (en) 2021-05-18 2021-05-18 Audio classification and separation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112989107A true CN112989107A (en) 2021-06-18
CN112989107B CN112989107B (en) 2021-07-30

Family

ID=76336665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110537306.0A Active CN112989107B (en) 2021-05-18 2021-05-18 Audio classification and separation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112989107B (en)


Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009026433A1 (en) * 2007-08-21 2009-02-26 Cortica, Ltd. Signature generation for multimedia deep-content-classification by a large-scale matching system and method thereof
US20110276849A1 (en) * 2010-05-10 2011-11-10 Periasamy Pradeep System, circuit, and device for asynchronously scan capturing multi-clock domains
US20190066713A1 (en) * 2016-06-14 2019-02-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN109935222A (en) * 2018-11-23 2019-06-25 咪咕文化科技有限公司 A kind of method, apparatus and computer readable storage medium constructing chord converting vector
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110648656A (en) * 2019-08-28 2020-01-03 北京达佳互联信息技术有限公司 Voice endpoint detection method and device, electronic equipment and storage medium
US10600184B2 (en) * 2017-01-27 2020-03-24 Arterys Inc. Automated segmentation utilizing fully convolutional networks
CN111179911A (en) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 Target voice extraction method, device, equipment, medium and joint training method
CN111326168A (en) * 2020-03-25 2020-06-23 合肥讯飞数码科技有限公司 Voice separation method and device, electronic equipment and storage medium
CN111524530A (en) * 2020-04-23 2020-08-11 广州清音智能科技有限公司 Voice noise reduction method based on expansion causal convolution
CN111783431A (en) * 2019-04-02 2020-10-16 北京地平线机器人技术研发有限公司 Method and device for predicting word occurrence probability by using language model and training language model
CN111860138A (en) * 2020-06-09 2020-10-30 中南民族大学 Three-dimensional point cloud semantic segmentation method and system based on full-fusion network
CN112185352A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Voice recognition method and device and electronic equipment
CN112200099A (en) * 2020-10-14 2021-01-08 浙江大学山东工业技术研究院 Video-based dynamic heart rate detection method
CN112735382A (en) * 2020-12-22 2021-04-30 北京声智科技有限公司 Audio data processing method and device, electronic equipment and readable storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CESHINE LEE: "Implementing Temporal Convolutional Networks", 《HTTPS://MEDIUM.COM/THE-ARTIFICIAL-IMPOSTOR/NOTES-UNDERSTANDING-TENSORFLOW-PART-3-7F6633FCC7C7》 *
EDUARDO CORPENO: "Comparing Binary, Gray, and One-Hot Encoding", 《HTTPS://WWW.ALLABOUTCIRCUITS.COM/TECHNICAL-ARTICLES/COMPARING-BINARY-GRAY-ONE-HOT-ENCODING/》 *
李丹艳: "Research on Speech Emotion Recognition Based on Deep Learning", 《China Masters' Theses Full-text Database, Information Science and Technology》 *
魏世超 et al.: "Dimensionality Reduction and Visualization Method for Mixed-Attribute Data Based on E-t-SNE", 《Computer Engineering and Applications》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506566A (en) * 2021-06-22 2021-10-15 荣耀终端有限公司 Sound detection model training method, data processing method and related device
CN113299306A (en) * 2021-07-27 2021-08-24 北京世纪好未来教育科技有限公司 Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium
CN113299306B (en) * 2021-07-27 2021-10-15 北京世纪好未来教育科技有限公司 Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium
CN113705715A (en) * 2021-09-04 2021-11-26 大连钜智信息科技有限公司 Time sequence classification method based on LSTM and multi-scale FCN
CN113705715B (en) * 2021-09-04 2024-04-19 大连钜智信息科技有限公司 Time sequence classification method based on LSTM and multi-scale FCN
WO2023207193A1 (en) * 2022-04-29 2023-11-02 哲库科技(上海)有限公司 Audio separation method and apparatus, training method and apparatus, and device, storage medium and product
CN115083435A (en) * 2022-07-28 2022-09-20 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN115083435B (en) * 2022-07-28 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN115206294A (en) * 2022-09-16 2022-10-18 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium

Also Published As

Publication number Publication date
CN112989107B (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN112989107B (en) Audio classification and separation method and device, electronic equipment and storage medium
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN111179911B (en) Target voice extraction method, device, equipment, medium and joint training method
CN111626300B (en) Image segmentation method and modeling method of image semantic segmentation model based on context perception
WO2020098256A1 (en) Speech enhancement method based on fully convolutional neural network, device, and storage medium
CN109658943B (en) Audio noise detection method and device, storage medium and mobile terminal
CN113113041B (en) Voice separation method based on time-frequency cross-domain feature selection
CN113240683A (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN110544482A (en) single-channel voice separation system
Hao et al. A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments.
CN115602165A (en) Digital staff intelligent system based on financial system
CN116994564A (en) Voice data processing method and processing device
Jiang et al. An Improved Unsupervised Single‐Channel Speech Separation Algorithm for Processing Speech Sensor Signals
CN113241092A (en) Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
CN112259086A (en) Speech conversion method based on spectrogram synthesis
CN112989106B (en) Audio classification method, electronic device and storage medium
CN113099374B (en) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
CN113936680B (en) Single-channel voice enhancement method based on multi-scale information perception convolutional neural network
CN115691539A (en) Two-stage voice separation method and system based on visual guidance
CN112951218B (en) Voice processing method and device based on neural network model and electronic equipment
CN114067785B (en) Voice deep neural network training method and device, storage medium and electronic device
CN111312215A (en) Natural speech emotion recognition method based on convolutional neural network and binaural representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant