CN114186094A - Audio scene classification method and device, terminal equipment and storage medium - Google Patents


Info

Publication number
CN114186094A
Authority
CN
China
Prior art keywords
feature
audio
characteristic
segment
scene classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111282304.8A
Other languages
Chinese (zh)
Inventor
高玉梅
刘涛
朱彪
王丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Horn Audio Co Ltd
Original Assignee
Shenzhen Horn Audio Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Horn Audio Co Ltd filed Critical Shenzhen Horn Audio Co Ltd
Priority to CN202111282304.8A priority Critical patent/CN114186094A/en
Publication of CN114186094A publication Critical patent/CN114186094A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/65 Clustering; Classification
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/02 Preprocessing
    • G06F 2218/08 Feature extraction

Abstract

The application is applicable to the field of computer technologies and provides an audio scene classification method and apparatus, a terminal device and a storage medium. The method includes: collecting audio data of a first preset duration in a target scene to obtain a first audio segment; extracting first feature information and second feature information of the first audio segment, where the first feature information represents time-referenced frequency feature information and the second feature information represents frequency-referenced time feature information; dividing the first feature information into N first feature segments and the second feature information into N second feature segments according to a preset rule to obtain N feature groups, each feature group including one first feature segment and one second feature segment; calculating the fusion feature of each of the N feature groups; and determining a first scene classification result of the first audio segment according to the fusion features of the N feature groups. The method can improve the accuracy of audio scene classification results.

Description

Audio scene classification method and device, terminal equipment and storage medium
Technical Field
The present application belongs to the field of computer technologies, and in particular, to an audio scene classification method, apparatus, terminal device, and storage medium.
Background
As society has progressed and public awareness has improved, the uncivil habit of playing audio out loud in public places has gradually declined, which in turn has raised the functional demands people place on earphones in daily life. For example, even in a noisy environment, a user may still expect the audio quality heard through the earphones not to be degraded.
At present, to prevent noisy external environmental sounds (such as vehicle noise, crowd noise and the sound of merchants' loudspeakers) from mixing into the audio and degrading the quality of the audio a user hears through earphones, the external audio signal generally needs to be processed. The usual approach is to classify the audio scene so that the environmental sound can be judged effectively, and then to process the environmental sound according to the different audio scene classification results. However, existing audio scene classification methods obtain the classification result only from the frequency characteristics of the audio; when facing audio data from similar scenes (such as different types of vehicles), the accuracy of the classification result obtained in this way is low.
Disclosure of Invention
The embodiment of the application provides an audio scene classification method, an audio scene classification device, terminal equipment and a storage medium, and the accuracy of an audio scene classification result can be effectively improved.
In a first aspect, an embodiment of the present application provides an audio scene classification method, including:
acquiring audio data of a first preset duration in a target scene to obtain a first audio clip;
extracting first characteristic information and second characteristic information of the first audio segment, wherein the first characteristic information represents frequency characteristic information based on time, and the second characteristic information represents time characteristic information based on frequency;
dividing the first feature information into N first feature segments and dividing the second feature information into N second feature segments according to a preset rule, wherein N is a positive integer greater than 1;
dividing the N first feature segments and the N second feature segments into N feature groups according to the preset rule, wherein each feature group comprises a first feature segment and a second feature segment;
calculating respective fusion features of the N feature groups;
and determining a first scene classification result of the first audio clip according to the respective fusion characteristics of the N characteristic groups.
In a second aspect, an embodiment of the present application provides an audio scene classification apparatus, including:
the first acquisition module is used for acquiring audio data with a first preset duration in a target scene to obtain a first audio clip;
the first processing module is used for extracting first characteristic information and second characteristic information of the first audio clip, wherein the first characteristic information represents frequency characteristic information with time as a reference, and the second characteristic information represents time characteristic information with frequency as a reference;
the second processing module is used for dividing the first characteristic information into N first characteristic segments and dividing the second characteristic information into N second characteristic segments according to a preset rule, wherein N is a positive integer larger than 1;
a third processing module, configured to divide the N first feature segments and the N second feature segments into N feature groups according to the preset rule, where each feature group includes one first feature segment and one second feature segment;
a calculation module for calculating respective fusion features of the N feature groups;
and the first classification processing module is used for determining a first scene classification result of the first audio clip according to the respective fusion characteristics of the N characteristic groups.
In a third aspect, an embodiment of the present application provides a terminal device, including: memory, processor and computer program stored in the memory and executable on the processor, characterized in that the processor implements the audio scene classification method according to any of the above first aspects when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program is executed by a processor to implement the audio scene classification method according to any one of the above first aspects.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the audio scene classification method according to any one of the above first aspects.
Compared with the prior art, the embodiment of the first aspect of the application has the following beneficial effects. First, audio data of a first preset duration in a target scene is collected to obtain a first audio segment. First feature information and second feature information of the first audio segment are then extracted, where the first feature information represents time-referenced frequency feature information and the second feature information represents frequency-referenced time feature information. The first feature information is divided into N first feature segments and the second feature information into N second feature segments according to a preset rule, where N is a positive integer greater than 1, and the N first feature segments and N second feature segments are divided into N feature groups according to the same rule, each feature group including one first feature segment and one second feature segment. The fusion feature of each of the N feature groups is then calculated. Because each feature group fuses the frequency-domain features and the time-domain features of the first audio segment, a set of feature expressions with joint time-frequency characteristics is obtained, the information of the audio signal along the time axis is not lost, and the utilization of the time-domain features is improved. Finally, the first scene classification result of the first audio segment is determined from the fusion features of the N feature groups, so that features in both the frequency-domain and time-domain dimensions jointly determine the classification result, which improves the accuracy of the audio scene classification.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic flow chart of an audio scene classification method provided by an embodiment of the present application;
FIG. 2 is a diagram illustrating an implementation of S105 in FIG. 1;
fig. 3 is a schematic structural diagram of a spectral feature information extraction model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a timing characteristic information extraction model according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of an audio scene classification method according to another embodiment of the present application;
fig. 6 is a schematic structural diagram of an audio scene classification apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used in the specification of this application and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, to mean "upon determining" or "in response to determining" or "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
Fig. 1 shows a schematic flow chart of an audio scene classification method provided in the present application, and with reference to fig. 1, the audio scene classification method is described in detail as follows:
s101, collecting audio data of a first preset duration in a target scene to obtain a first audio clip.
In this step, audio data of a first preset duration (for example, 10s) in the target scene may be acquired through an audio acquisition module (for example, a microphone) disposed on the terminal device. Here, the target scene is an environment scene in which the user is located when the terminal device outputs audio to the ear of the user through the headphone device.
The details of the scheme in this application can be found in the description of example one below.
S102, extracting first characteristic information and second characteristic information of the first audio clip, wherein the first characteristic information represents frequency characteristic information with time as a reference, and the second characteristic information represents time characteristic information with frequency as a reference.
Optionally, the implementation process of extracting the first feature information of the first audio segment may include:
the method comprises the steps of firstly carrying out short-time Fourier transform on a first audio clip to obtain a time-frequency signal of the first audio clip, then carrying out filtering processing (such as Mel filtering) on the time-frequency signal of the first audio clip, and finally carrying out logarithm calculation on the filtered time-frequency signal to obtain a spectrogram of the first audio clip changing along with time, namely first characteristic information of the first audio clip, wherein the first characteristic information represents frequency characteristic information taking time as a reference.
It should be understood that the frequency feature information based on time is a frequency feature corresponding to different times in statistics. For example, the frequency values corresponding to different times are counted, that is, the horizontal axis represents time, and the vertical axis represents frequency values.
Here, the second feature information indicates time feature information with reference to a frequency, and specifically, the second feature information may indicate time change information with reference to a frequency amplitude.
It should be understood that the time characteristic information based on frequency is a time characteristic corresponding to different statistical frequencies, for example, time corresponding to different statistical frequencies is counted, i.e. the horizontal axis is frequency and the vertical axis is time.
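A minimal sketch of this extraction step is given below, assuming a mono signal, the librosa library, and illustrative filter-bank parameters (n_fft, hop length and sample rate are not specified by this embodiment and are assumptions here); the reading of the second feature information as the frequency-indexed counterpart (the transpose) of the log-Mel spectrogram is likewise an assumption.

```python
import numpy as np
import librosa

def extract_features(first_audio_clip, sr=16000, n_mels=128):
    """Sketch of S102: compute time-referenced frequency features (first
    feature information) and frequency-referenced time features (second
    feature information). Parameter values are illustrative assumptions."""
    # Short-time Fourier transform -> Mel filtering -> logarithm
    mel = librosa.feature.melspectrogram(
        y=first_audio_clip, sr=sr, n_fft=1024, hop_length=512, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)      # shape: (n_mels, n_frames)

    first_feature_info = log_mel.T          # time on axis 0, frequency on axis 1
    # One plausible reading of "time feature information with frequency as a
    # reference": the same time-frequency representation indexed by frequency
    # first, i.e. the transpose of the first feature information.
    second_feature_info = log_mel           # frequency on axis 0, time on axis 1
    return first_feature_info, second_feature_info
```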
S103, dividing the first feature information into N first feature segments and dividing the second feature information into N second feature segments according to a preset rule, wherein N is a positive integer greater than 1;
in this step, the preset rule may be a preset time interval or a preset frequency interval.
Continuing with the above example, if the first feature information and the second feature information are respectively divided according to the preset time interval, the first feature segment is equivalent to a feature segment obtained by dividing the horizontal axis of the first feature information; the second feature segment corresponds to a feature segment obtained by dividing the vertical axis of the second feature information.
If the first characteristic information and the second characteristic information are respectively divided according to the preset frequency interval, the first characteristic segment is equivalent to a characteristic segment obtained by dividing the longitudinal axis of the first characteristic information; the second feature segment corresponds to a feature segment obtained by dividing the horizontal axis of the second feature information.
S104, dividing the N first feature segments and the N second feature segments into N feature groups according to the preset rule, wherein each feature group comprises a first feature segment and a second feature segment.
Here, taking a preset rule as an example of a preset time interval, according to the preset time interval, the first feature segment and the second feature segment belonging to the same time interval are divided into one feature group, so that N first feature segments and N second feature segments are divided into N feature groups.
Taking a preset rule as an example of a preset frequency interval, according to the preset frequency interval, dividing the first feature segments and the second feature segments belonging to the same frequency range into one feature group, so that the N first feature segments and the N second feature segments are divided into N feature groups.
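A sketch of S103 and S104 under the preset-time-interval reading follows, using the axis layout of the extraction sketch above; the number of segments N and the pairing by index are taken from the description, everything else is an assumption.

```python
import numpy as np

def split_into_feature_groups(first_info, second_info, n_segments):
    """Sketch of S103/S104: split both feature maps into N non-overlapping
    segments along the time axis and pair the segments covering the same
    time interval into feature groups. Axis layout follows the earlier
    sketch: first_info is (time, freq), second_info is (freq, time)."""
    first_segments = np.array_split(first_info, n_segments, axis=0)    # split along time
    second_segments = np.array_split(second_info, n_segments, axis=1)  # split along time
    # Each feature group holds one first feature segment and one second
    # feature segment belonging to the same time interval.
    feature_groups = list(zip(first_segments, second_segments))
    return feature_groups
```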
And S105, calculating the respective fusion characteristics of the N characteristic groups.
Here, the fusion features of each of the N feature groups are calculated, respectively, to obtain the fusion features of each of the N feature groups.
Specifically, for each feature group of the N feature groups, the first feature segment and the second feature segment in the feature group are subjected to weighted fusion calculation to obtain the fusion feature of the feature group.
It should be noted that, by extracting more representative feature information, each feature group contains a first feature segment representing the frequency characteristics and a second feature segment representing the time characteristics; fusing the two segments yields a set of feature expressions that carry both time and frequency characteristics, taking into account both the frequency content of the audio signal and its information along the time axis, which improves the utilization of the time-domain features. Increasing the parameter dimension in this way (i.e., adding the time-domain features) improves the accuracy of the scene classification result.
Details of the scheme in this application can be found in the description of example two below.
S106, determining a first scene classification result of the first audio clip according to the respective fusion characteristics of the N characteristic groups.
In this step, a first scene classification result of the first audio segment may be determined by respectively inputting the fusion features of each of the N feature groups into the trained classification model.
Here, the classification model may be a Gradient Boosting model such as an XGBoost (Extreme Gradient Boosting) model. The classification model may be trained in advance to obtain a trained classification model. When the method is applied, the trained classification model is used for calculating the scene classification result of the audio clip, so that the calculation accuracy can be improved.
The details of the scheme in this application can be found in the description of example three below.
Example one
In a possible implementation manner, the implementation process of step S101 may include:
s1011, collecting original audio of a first preset time length in a target scene.
Here, the original audio of the first preset duration in the target scene may be collected through a microphone on the terminal device. The volume of the original audio can be obtained by volume detection.
And S1012, under the condition that the volume of the original audio is greater than a preset threshold, judging whether a target sound exists in the original audio.
If the volume of the original audio is smaller than a preset threshold (for example, 50dB), it is determined that the user is currently in a quiet environment, and in order to better reduce power consumption and increase the standby time of the device, the subsequent operations of the method of the present application do not need to be performed in this case.
If the volume of the original audio is greater than the preset threshold, the user is judged to be in a noisy environment and the scene classification result for that scene needs to be determined. This provides an accurate basis for the subsequent judgment of the environmental sound, so that the device can process the environmental sound according to the different audio scene classification results and the quality of the audio the user hears through the earphone is not degraded.
In this step, the target sound specifically refers to a prominent human voice, namely the user's own voice. Whether the target sound exists in the original audio is judged because, during use, the user is very close to the microphone, so the voice uttered by the user and picked up by the microphone is strong and may even mask the representative sound characteristics of the environment. Therefore, if a strong voice is present, the voice and the environmental sound in the audio need to be separated, the voice filtered out, and the environmental sound retained.
It should be noted that signals such as bone conduction and accelerometer data may be used to determine whether the audio contains the target sound.
S1013, if the target sound exists in the original audio, filtering the target sound in the original audio to obtain the first audio clip.
This case corresponds to a target sound, i.e., a prominent human voice, being present in original audio whose volume is greater than the preset threshold; the target sound is filtered out of the original audio to obtain the first audio segment.
S1014, if the target sound does not exist in the original audio, determining the original audio as the first audio segment.
This case corresponds to no target sound being present in original audio whose volume is greater than the preset threshold, i.e., the microphone has not picked up a prominent voice from the user, and the original audio is directly taken as the first audio segment.
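A hedged sketch of Example one is given below; the loudness estimate is a rough RMS-based stand-in (not the embodiment's exact volume-detection method), and `detect_target_voice` / `remove_target_voice` are hypothetical callables standing in for the bone-conduction/accelerometer-based voice detection and the voice separation described above.

```python
import numpy as np

VOLUME_THRESHOLD_DB = 50.0  # example threshold mentioned in the description

def estimate_volume_db(audio, eps=1e-12):
    """Rough loudness estimate (illustrative assumption, not a calibrated SPL)."""
    rms = np.sqrt(np.mean(np.square(audio)) + eps)
    return 20.0 * np.log10(rms + eps)

def build_first_audio_clip(raw_audio, detect_target_voice, remove_target_voice):
    """Sketch of S1011-S1014: skip quiet audio, filter out the wearer's own
    voice when it is present, otherwise use the raw audio directly."""
    if estimate_volume_db(raw_audio) <= VOLUME_THRESHOLD_DB:
        return None                      # quiet environment: no classification needed
    if detect_target_voice(raw_audio):   # prominent voice of the wearer present
        return remove_target_voice(raw_audio)
    return raw_audio
```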
Example two
Referring to fig. 2, in a possible implementation manner, the implementation process of step S105 may include:
step S1051, inputting the first feature segment in each feature group of the N feature groups into a spectrum feature information extraction model, and obtaining the frequency domain feature of the first feature segment in each feature group.
Optionally, the spectral feature information extraction model is obtained by training a depthwise separable convolutional neural network.
Specifically, the spectral feature information extraction model comprises a group of consecutive depthwise separable convolution layers, which effectively reduces the parameter count and complexity of the neural network model; therefore, in practical engineering applications, the running speed can be increased and the power consumption reduced.
In an example, referring to fig. 3, the spectral feature information extraction model includes an input layer, a convolutional layer Conv1, a Dropout layer Dropout1, a separable convolutional layer dsConv1, a separable convolutional layer dsConv2, a Dropout layer Dropout2, a separable convolutional layer dsConv3, a separable convolutional layer dsConv4, a Dropout layer Dropout3, a separable convolutional layer dsConv5, a separable convolutional layer dsConv6, and an average pooling layer pool 1.
The following describes the role of each component in the spectral feature information extraction model with reference to fig. 3, in combination with the first feature segment in the feature set of the first audio segment of the present application.
(1) An input layer: the first feature fragment in each feature set of a single audio fragment is input to the input layer as input to the model.
Specifically, the first feature information extracted from the first audio segment is segmented into continuous non-overlapping feature segments, the feature matrix dimension of a single feature segment is 128 × 128, and the feature segment is used as the input of the network model.
(2) Convolutional layer Conv 1: the convolutional layer is a 2D convolutional layer, the convolutional layer comprises 16 filters, the size of each filter is 2 x 2, and the adopted activation function is a relu function, so that sparsity can be introduced into a network, and the training speed is improved.
Specifically, the output of convolutional layer Conv1 has a feature matrix dimension of 128 x 128 x 16, each filter producing a 128 x 128 = 16384-value feature map.
(3) Dropout layer Dropout1: to prevent overfitting of the model, a dropout layer is added and some neurons in the network are randomly "inactivated", i.e. assigned zero weights; the dropout ratio is set to 0.2 and the output matrix size is still 128 x 128 x 16.
(4) Separable convolutional layer dsConv1: the separable convolutional layer is a 2D depthwise separable convolutional layer, comprising a depthwise convolution operation and a pointwise convolution operation.
The depthwise convolution operation is known as channel-by-channel convolution: each convolution kernel is responsible for one channel, each channel is convolved by only one convolution kernel, and the number of feature maps after convolution equals the number of input channels. Because the depthwise convolution operates on each input channel independently and cannot effectively exploit spatial information across different channels, a pointwise convolution operation is then needed to process the feature maps generated by the depthwise convolution.
Here, the pointwise convolution operation is known as point-by-point convolution, with the convolution kernel size fixed at 1 x 1. The result of convolutional layer Conv1 is input to separable convolutional layer dsConv1; in dsConv1 the number of convolution kernels is 16, the depthwise convolution kernel size is 3 x 3, and the output matrix size is 64 x 64 x 16.
(5) Separable convolutional layer dsConv2: this is a 2D depthwise separable convolutional layer. The result of dsConv1 is input to dsConv2; in dsConv2 the number of convolution kernels is updated to 32, the remaining settings follow the same logic as dsConv1, and the output matrix size is 32 x 32 x 32.
(6) Dropout layer Dropout2: following the same logic as Dropout layer Dropout1, the output matrix size is still 32 x 32 x 32.
(7) Separable convolutional layer dsConv3: this is a 2D depthwise separable convolutional layer. The result of Dropout2 is input to dsConv3; in dsConv3 the number of convolution kernels is updated to 64, the remaining settings follow the same logic as dsConv1, and the output matrix size is 16 x 16 x 64.
(8) Separable convolutional layer dsConv4: this is a 2D depthwise separable convolutional layer. The result of dsConv3 is input to dsConv4; in dsConv4 the number of convolution kernels is updated to 128, the remaining settings follow the same logic as dsConv1, and the output matrix size is 8 x 8 x 128.
(9) Dropout layer Dropout3: following the same logic as Dropout layer Dropout1, the output matrix size is still 8 x 8 x 128.
(10) Separable convolutional layer dsConv5: this is a 2D depthwise separable convolutional layer. The result of Dropout3 is input to dsConv5; in dsConv5 the number of convolution kernels is updated to 256, the remaining settings follow the same logic as dsConv1, and the output matrix size is 4 x 4 x 256.
(11) Separable convolutional layer dsConv6: this is a 2D depthwise separable convolutional layer. The result of dsConv5 is input to dsConv6; in dsConv6 the number of convolution kernels is updated to 512, the remaining settings follow the same logic as dsConv1, and the output matrix size is 2 x 2 x 512.
(12) Mean pooling layer pool1: to reduce the amount of computation and prevent overfitting, the feature maps after convolution are processed by mean pooling with a pooling size of 2 x 2, giving an output matrix size of 1 x 1 x 512.
(13) Flatten layer: the result of mean pooling layer pool1 is input into a Flatten layer (not shown), which flattens the 1 x 1 x 512 matrix into a vector of length 512.
Here, the 512-dimensional vector is subsequently fused with the second feature segment in each feature group of the single audio segment extracted by the time-series feature information extraction model.
It should be noted that the general process of model training is as follows: firstly, initializing a convolution layer and a Dropout layer respectively, initializing all zeros for bias, then inputting a first feature segment in each feature group of a single audio segment into a separable convolution layer, updating a weight value, and training a network model.
The model training adopts a target function of cross entropy, network parameters are updated through a back propagation algorithm, and when the error of the verification set is smaller than or equal to an expected value or the maximum training cycle number is reached, the training is finished.
In the application, the frequency-domain features of the first feature segment of each feature group of the first audio segment are extracted through consecutive depthwise separable convolution layers, which significantly reduces the number of parameters and increases the running speed. Meanwhile, in practical engineering applications, under the same power-consumption budget this allows a deeper network than an ordinary convolution structure, further improving the classification effect.
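A tf.keras sketch of the spectral feature information extraction model follows. The strides and padding are assumptions chosen so that the layer output sizes match those listed above, and the activation of the separable layers is assumed to be relu like Conv1.

```python
from tensorflow.keras import layers, models

def build_spectral_model(input_shape=(128, 128, 1)):
    """Sketch of Conv1 -> Dropout1 -> dsConv1..dsConv6 (with Dropout2/3)
    -> pool1 -> Flatten. Strides/padding are assumptions that reproduce
    the stated output sizes."""
    def ds(x, filters):
        # 2D depthwise separable convolution, 3x3 depthwise kernel,
        # stride 2 to halve the spatial size at every layer (assumption).
        return layers.SeparableConv2D(filters, 3, strides=2, padding="same",
                                      activation="relu")(x)

    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(16, 2, padding="same", activation="relu")(inp)  # 128x128x16
    x = layers.Dropout(0.2)(x)
    x = ds(x, 16)    # 64x64x16
    x = ds(x, 32)    # 32x32x32
    x = layers.Dropout(0.2)(x)
    x = ds(x, 64)    # 16x16x64
    x = ds(x, 128)   # 8x8x128
    x = layers.Dropout(0.2)(x)
    x = ds(x, 256)   # 4x4x256
    x = ds(x, 512)   # 2x2x512
    x = layers.AveragePooling2D(pool_size=2)(x)  # 1x1x512
    out = layers.Flatten()(x)                    # 512-dimensional frequency-domain feature
    return models.Model(inp, out)
```

For training, a classification head would be appended and the network fitted with a cross-entropy objective as described above; the 512-dimensional output is then reused as the frequency-domain feature of the first feature segment.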
Step S1052, inputting the second feature segment in each feature group of the N feature groups to the time sequence feature information extraction model, and obtaining the time domain feature of the second feature segment in each feature group.
Optionally, the time sequence feature information extraction model is obtained by training a recurrent neural network. Specifically, the recurrent neural network may be an LSTM (Long Short-Term Memory) neural network.
Specifically, the time sequence feature information extraction model includes a set of LSTM layers with different hidden nodes, which can extract temporal internal relations of the audio segment to obtain a time sequence expression of the audio segment, that is, a time-domain feature of the second feature segment.
In an example, referring to fig. 4, the temporal feature extraction model includes an input layer, an LSTM layer L1, an LSTM layer L2, an LSTM layer L3, and a Flatten layer (not shown in the figure).
The role of each component in the temporal feature extraction model is specifically described below with reference to the second feature segment in the feature set of the first audio segment in the present application.
(1) An input layer: the second feature segment in each feature set of a single audio segment is input to the input layer as input to the model.
Specifically, the second feature information extracted from the first audio segment is segmented into continuous non-overlapping feature segments; the feature matrix dimension of a single feature segment is 128 x 128, and the feature segment is used as the input of the network model, i.e., the output matrix of the input layer is 128 x 128.
(2) LSTM layer L1: the output results of the input layer will be input into the LSTM layer L1.
Specifically, referring to fig. 4, the LSTM structure is composed of an input gate, a forget gate and an output gate, and can retain the state at the previous time step to capture the relationship of the features in the time domain. The LSTM layer L1 is defined with 32 hidden nodes, the activation function is the relu function, returning the full sequence is selected, and the dropout ratio is set to 0.2. Each output gate corresponds to an output, and the size of the output matrix is 128 x 32.
(3) LSTM layer L2: the output result of the LSTM layer L1 will be input into the LSTM layer L2, the hidden node of the LSTM layer L2 is updated to 8, and the remaining settings are in the same logic as the LSTM layer L1, and the size of the output matrix is 128 × 8.
(4) LSTM layer L3: the output result of the LSTM layer L2 will be input into the LSTM layer L3, the hidden node of the LSTM layer L3 is updated to 1, and the remaining settings are in the same logic as the LSTM layer L1, and the size of the output matrix is 128 × 1.
(5) Flatten layer: the output of the LSTM layer L3 is input into the Flatten layer, which flattens the 128 x 1 two-dimensional matrix into a vector of length 128.
Here, the 128-dimensional vector is subsequently fused with the first feature segment in each feature group of the single audio segment extracted by the spectral feature information extraction model.
It should be noted that the general process of model training is as follows: the bias is initialized to all zeros, and the data of each frequency bin at the different time steps is input into the LSTM layers for training; in this example there are 128 frequency bins in total, covering a frequency range of 20 Hz to 8 kHz. The model is trained with a cross-entropy objective function, the network parameters are updated through the back-propagation algorithm, and training ends when the validation-set error is less than or equal to the expected value or the maximum number of training epochs is reached.
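A matching tf.keras sketch of the timing feature information extraction model is given below. The interpretation of the 0.2 ratio as the LSTM's input dropout, and of the input rows as the 128 frequency bins stepped over in sequence, are assumptions.

```python
from tensorflow.keras import layers, models

def build_timing_model(input_shape=(128, 128)):
    """Sketch of L1 -> L2 -> L3 -> Flatten. Each of the 128 sequence steps is
    assumed to correspond to one frequency bin of a second feature segment."""
    inp = layers.Input(shape=input_shape)
    x = layers.LSTM(32, activation="relu", return_sequences=True, dropout=0.2)(inp)  # 128x32
    x = layers.LSTM(8, activation="relu", return_sequences=True, dropout=0.2)(x)     # 128x8
    x = layers.LSTM(1, activation="relu", return_sequences=True, dropout=0.2)(x)     # 128x1
    out = layers.Flatten()(x)   # 128-dimensional time-domain feature
    return models.Model(inp, out)
```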
Step S1053, performing fusion processing on the frequency domain feature of the first feature segment and the time domain feature of the second feature segment of each of the N feature groups to obtain a fusion feature of each of the N feature groups.
Optionally, the implementation process of step S1053 may include:
and carrying out vector splicing on the frequency domain characteristic of the first characteristic segment and the time domain characteristic of the second characteristic segment in each of the N characteristic groups according to a preset weight value to obtain respective fusion characteristics of the N characteristic groups.
Specifically, firstly, the frequency domain feature of a first feature segment in each feature group of the N feature groups is multiplied by a first preset weight, and the time domain feature of a second feature segment in each feature group of the N feature groups is multiplied by a second preset weight; and then, carrying out vector splicing on the frequency domain feature of the first feature segment and the time domain feature of the second feature segment in each feature group after the processing to obtain respective fusion features of the N feature groups, wherein the sum of the first preset weight and the second preset weight is 1.
Optionally, the implementation process of this step is as follows:
For each of the N feature groups, the fused feature of the feature group is obtained by F = concatenate(α · FP, (1 − α) · FT), where F represents the fusion feature of the feature group, concatenate(·) represents the vector splicing operation, α represents the first preset weight, FP represents the frequency-domain feature of the first feature segment in the feature group, and FT represents the time-domain feature of the second feature segment in the feature group.
It should be noted that the value of α is empirically selected, and optionally, α is 0.2.
Further, according to the calculated fusion characteristics of each characteristic group, calculating the frequency of the fusion characteristics in each characteristic group; and obtaining the fusion feature distribution of each feature group according to the frequency of the fusion features in each feature group. Such as obtaining a fused feature distribution histogram for each feature group.
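A numpy sketch of the weighted splicing and of the fused-feature distribution described above follows; α = 0.2 follows the example, while the number of histogram bins is an assumption.

```python
import numpy as np

def fuse_features(freq_feature, time_feature, alpha=0.2):
    """F = concatenate(alpha * FP, (1 - alpha) * FT): weighted vector splicing
    of the 512-d frequency-domain and 128-d time-domain features."""
    return np.concatenate([alpha * freq_feature, (1.0 - alpha) * time_feature])

def fused_feature_distribution(fused, n_bins=32):
    """Frequency of occurrence (histogram) of the fused feature values within
    one feature group; the bin count is an illustrative assumption."""
    hist, _ = np.histogram(fused, bins=n_bins)
    return hist / hist.sum()
```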
EXAMPLE III
In a possible implementation manner, the implementation procedure of step S106 may include:
s1061, inputting the respective fusion characteristics of the N characteristic groups into the trained classification model respectively to obtain N probability matrixes.
When the classification model is the XGBoost model, the objective function is optimized through the tree structure and the leaf nodes are used to output the first scene classification result of the first audio segment. The XGBoost model is briefly described below. The predicted value of the XGBoost model for a given sample is:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(X_i)$$
where $f_k$ is a base learner and the final model is a combination of multiple base learners. The initial objective function of the model is:
$$Obj^{(t)} = \sum_{i} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(X_i)\right) + \Omega(f_t)$$
where $\hat{y}_i^{(t-1)}$ represents the predicted value of the first t−1 ensemble learners for the sample, $f_t(X_i)$ is the predicted value of the current learner for the sample, and $\Omega(f_t)$ is the regularization term of the t-th learner. A second-order Taylor expansion is then applied to the objective function, and the regularization term is expressed as:
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
where T represents the number of leaf nodes; the number of leaf nodes of the model is taken as the L1 regular term and the leaf-node weights as the L2 regular term, the leaf-node weights being the actual predicted values. After simplification and derivation, the objective function is finally determined as:
$$Obj^{(t)} = \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T$$
where $g_i$ is the first derivative of the loss $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ with respect to $\hat{y}_i^{(t-1)}$, $h_i$ is its second derivative, and $I_j = \{\, i \mid q(X_i) = j \,\}$, where $q(X_i) = j$ denotes that sample $X_i$ is assigned to the leaf node numbered j.
It should be noted that the maximum depth of the XGBoost model may be preset, the objective may be preset, for example to binary (binary logistic regression), and the learning rate eta may be preset, for example to 0.1.
Here, the sample described in the above model refers to the fusion feature of a feature group of an audio sample segment in this application, and specifically to the fusion feature distribution of that feature group; that is, the fusion feature distributions of the feature groups of the audio sample segments are input into the model for training, and the trained model is finally obtained. The output of the trained model is a probability matrix for a feature group.
S1062, calculating the mean value of the N probability matrixes to obtain a final probability matrix.
In step S1062, calculating the mean of the N probability matrices specifically includes: and summing and averaging the numerical values of the same positions corresponding to the probability matrixes of the N probability matrixes to obtain a final probability matrix.
S1063, determining the class label to which the element with the largest value in the final probability matrix belongs as the first scene classification result of the first audio segment.
Here, the class label to which the element with the largest value in the final probability matrix belongs is the class label with the highest probability for the target scene, i.e., the scene classification result.
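A sketch of S1061 to S1063 with the xgboost package follows. The hyper-parameter values are illustrative, the multi-class soft-probability objective is an assumption made so that one probability matrix per feature group is produced, and the classifier is assumed to have been fitted beforehand on fused-feature distributions of labelled audio sample segments.

```python
import numpy as np
from xgboost import XGBClassifier

# Illustrative instantiation; to be fitted beforehand on fused-feature
# distributions of labelled audio sample segments, e.g. clf.fit(X_train, y_train).
clf = XGBClassifier(max_depth=6, learning_rate=0.1, objective="multi:softprob")

def classify_first_audio_clip(clf, group_distributions):
    """S1061-S1063: one probability matrix per feature group, then their
    element-wise mean, then the class label of the largest element.
    `group_distributions` is the (N, n_bins) array of fused-feature
    distributions of the N feature groups of one audio segment."""
    prob_matrices = clf.predict_proba(np.asarray(group_distributions))  # (N, n_classes)
    final_prob = prob_matrices.mean(axis=0)                             # final probability matrix
    return int(np.argmax(final_prob))   # index of the class label with the highest probability
```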
In application, the audio data in the target scene is collected in real time. Because of the continuity of real-time recording, individual mutations may exist in the continuous classification results, but such mutations are negligible. Therefore, to ensure the robustness of the audio scene classification result, in a possible implementation manner the method of the embodiment of the present application further includes, after step S106:
s107, continuously collecting audio data of a second preset duration in the target scene to obtain a second audio clip.
For a specific implementation manner corresponding to step S107, reference may be made to the description of step S101, which is not described herein again.
S108, determining a second scene classification result of the second audio clip according to the fusion characteristic of the second audio clip;
the specific implementation manner of obtaining the fusion feature of the second audio segment in step S108 may refer to the descriptions of steps S102 to S105, and the specific implementation manner corresponding to this step may refer to the description of step S106, which is not described herein again.
S109, determining a final scene classification result of the target scene according to a first scene classification result of the first audio clip and a second scene classification result of the second audio clip.
The implementation process of step S109 may specifically include:
and determining the classification result with the largest occurrence frequency in all the scene classification results as the target final scene classification result, wherein all the scene classification results comprise a first scene classification result and a second scene classification result.
Here, if the classification result with the largest occurrence number includes two different classification results, the class label to which the element corresponding to the larger value belongs may be determined as the final scene classification result by comparing the magnitudes of the largest values in the respective final probability matrices.
It should be noted that, by combining this implementation with the implementation shown in fig. 1, the final scene classification result of the target scene is determined according to the scene classification results of audio segments collected multiple times. That is, the final scene classification result of the target scene is obtained by a double decision, and the specific number of times audio segments are collected can be set according to the actual situation, which is not limited in this embodiment.
It should be understood that the first re-decision of the double decision represents an implementation as shown in fig. 1, and the second re-decision represents this implementation.
For example, referring to fig. 5, a feature group (corresponding to an audio segment with a recording length of 1.28 s) is first used as an input feature of the scene classification system, i.e., each feature group yields one probability matrix. The first decision is made by selecting eight consecutive feature groups (i.e., 8 audio segments of length 1.28 s; taken as a whole, these eight segments can be understood as the first audio segment described in the above embodiment, corresponding to wav_index = 8 in fig. 5), averaging the output probability matrices, and selecting the category label corresponding to the index with the highest probability as the output result of the current overall audio segment. The second decision retains the class that appears most frequently within a processing window of one minute (corresponding to wav_time = 6 in fig. 5) as the final classification result.
In practical application, the audio data M currently recorded by the microphone, of one minute duration, is first divided into continuous non-overlapping audio segments m1, m2, …, m6, each of duration 10 s. The time-frequency features of a single audio segment mi are then extracted and divided according to the preset time interval to obtain 8 feature groups, which correspond to continuous non-overlapping sub-segments mi1, mi2, …, mi8 of duration 1.28 s. The fusion features of the 8 feature groups are then input into the scene classification system (i.e., the trained classification model), and 8 probability matrices are output. Specifically, this can be expressed as follows.
the current classification system is defined as
Figure BDA0003331565280000171
The relationship between the system output and the system input is as follows:
Figure BDA0003331565280000172
the output result of the single 10s audio is then:
Figure BDA0003331565280000173
wherein labels [. cndot.) represents the classification label matrix, and argmax (. cndot.) represents the index for obtaining the maximum value of the probability matrix.
Finally, the scene classification results of the audio segments m1, m2, …, m6 are calculated separately, and the result that occurs most frequently is the final scene classification result of the audio data M.
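Putting the two decisions together, a sketch is given below; the 10 s / 1.28 s / one-minute figures follow the worked example above, `classify_first_audio_clip` is the sketch from Example three, and the tie-breaking by comparing the maxima of the final probability matrices is omitted for brevity.

```python
from collections import Counter

def classify_target_scene(ten_second_results):
    """Second decision: keep the scene class that appears most often among the
    per-10 s results (m1..m6) collected within one minute of recording."""
    counts = Counter(ten_second_results)
    best_label, _ = counts.most_common(1)[0]
    return best_label

# Illustrative usage (names are assumptions): each 10 s segment is first reduced
# to a label by the first decision, i.e. the mean of its eight per-feature-group
# probability matrices followed by argmax.
# labels_10s = [classify_first_audio_clip(clf, groups) for groups in per_segment_group_distributions]
# final_label = classify_target_scene(labels_10s)
```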
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 6 shows a block diagram of an audio scene classification apparatus provided in an embodiment of the present application, corresponding to the method described in the foregoing embodiment, and only shows portions related to the embodiment of the present application for convenience of description.
Referring to fig. 6, the apparatus 200 may include: a first acquisition module 210, a first processing module 220, a second processing module 230, a third processing module 240, a calculation module 250, and a first classification processing module 260.
The first acquisition module 210 is configured to acquire audio data of a first preset duration in a target scene to obtain a first audio clip; a first processing module 220, configured to extract first feature information and second feature information of the first audio segment, where the first feature information represents frequency feature information based on time, and the second feature information represents time feature information based on frequency; the second processing module 230 is configured to divide the first feature information into N first feature segments and divide the second feature information into N second feature segments according to a preset rule, where N is a positive integer greater than 1; a third processing module 240, configured to divide the N first feature segments and the N second feature segments into N feature groups according to the preset rule, where each feature group includes one first feature segment and one second feature segment; a calculating module 250, configured to calculate respective fusion features of the N feature groups; a first classification processing module 260, configured to determine a first scene classification result of the first audio segment according to the respective fusion features of the N feature groups.
In a possible implementation manner, the acquisition module 210 may specifically be configured to:
acquiring an original audio of a first preset duration in a target scene; under the condition that the volume of the original audio is larger than a preset threshold value, judging whether a target sound exists in the original audio; if the target sound exists in the original audio, filtering the target sound in the original audio to obtain the first audio clip; and if the target sound does not exist in the original audio, determining the original audio as the first audio fragment.
In a possible implementation manner, the calculation module 250 may specifically include:
the first calculation unit is used for inputting the first feature segment in each feature group of the N feature groups into a spectrum feature information extraction model to obtain the frequency domain feature of the first feature segment in each feature group;
the second calculation unit is used for inputting the second feature segment in each feature group of the N feature groups into the time sequence feature information extraction model to obtain the time domain feature of the second feature segment in each feature group;
and the third calculating unit is used for carrying out fusion processing on the frequency domain characteristics of the first characteristic segments and the time domain characteristics of the second characteristic segments of the N characteristic groups to obtain the fusion characteristics of the N characteristic groups.
Optionally, the spectral feature information extraction model is obtained by training a depthwise separable convolutional neural network; the timing feature information extraction model is obtained by training a recurrent neural network.
In a possible implementation manner, the third computing unit is specifically configured to:
and carrying out vector splicing on the frequency domain characteristic of the first characteristic segment and the time domain characteristic of the second characteristic segment in each of the N characteristic groups according to a preset weight value to obtain respective fusion characteristics of the N characteristic groups.
In a possible implementation manner, the first classification processing module 260 may specifically be configured to:
respectively inputting the respective fusion characteristics of the N characteristic groups into the trained classification model to obtain N probability matrixes;
calculating the mean value of the N probability matrixes to obtain a final probability matrix;
and determining the class label to which the element with the largest value in the final probability matrix belongs as a first scene classification result of the first audio segment.
In one possible implementation, the apparatus 200 further includes:
the second acquisition module is used for continuously acquiring audio data with a second preset duration in the target scene to obtain a second audio clip;
the second classification processing module is used for determining a second scene classification result of the second audio clip according to the fusion characteristic of the second audio clip;
and the third classification processing module is used for determining a final scene classification result of the target scene according to the first scene classification result of the first audio clip and the second scene classification result of the second audio clip.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
An embodiment of the present application further provides a terminal device, and referring to fig. 7, the terminal device 300 may include: at least one processor 310, a memory 320, and a computer program stored in the memory 320 and operable on the at least one processor 310, wherein the processor 310, when executing the computer program, implements the steps of any of the above-mentioned method embodiments, such as the steps S101 to S106 in the embodiment shown in fig. 1. Alternatively, the processor 310, when executing the computer program, implements the functions of the modules/units in the above-described device embodiments, such as the functions of the modules 210 to 260 shown in fig. 6.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 320 and executed by the processor 310 to accomplish the present application. The one or more modules/units may be a series of computer program segments capable of performing specific functions, which are used to describe the execution of the computer program in the terminal device 300.
Those skilled in the art will appreciate that fig. 7 is merely an example of the terminal device and does not constitute a limitation thereon; the terminal device may include more or fewer components than shown, combine certain components, or include different components, such as input/output devices, network access devices, and buses.
The processor 310 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 320 may be an internal storage unit of the terminal device, or an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card). The memory 320 is used to store the computer program as well as other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the buses in the figures of the present application are not limited to a single bus or a single type of bus.
The audio scene classification method provided by the embodiment of the application can be applied to terminal devices such as computers, tablet computers, notebook computers, netbooks and Personal Digital Assistants (PDAs), and the embodiment of the application does not limit the specific types of the terminal devices.
In the above embodiments, each embodiment is described with its own emphasis; for parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed terminal device, apparatus and method may be implemented in other ways. For example, the terminal device embodiments described above are merely illustrative: the division into modules or units is only one kind of logical functional division, and other divisions are possible in actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the method embodiments described above when the computer program is executed by one or more processors.
The embodiments of the present application also provide a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in the above-mentioned method embodiments.
The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for audio scene classification, comprising:
acquiring audio data of a first preset duration in a target scene to obtain a first audio clip;
extracting first characteristic information and second characteristic information of the first audio segment, wherein the first characteristic information represents frequency characteristic information based on time, and the second characteristic information represents time characteristic information based on frequency;
dividing the first feature information into N first feature segments and dividing the second feature information into N second feature segments according to a preset rule, wherein N is a positive integer greater than 1;
dividing the N first feature segments and the N second feature segments into N feature groups according to the preset rule, wherein each feature group comprises a first feature segment and a second feature segment;
calculating respective fusion features of the N feature groups;
and determining a first scene classification result of the first audio clip according to the respective fusion characteristics of the N characteristic groups.
2. The audio scene classification method according to claim 1, wherein the acquiring audio data of a first preset duration in a target scene to obtain a first audio clip comprises:
collecting original audio of the first preset duration in the target scene;
when the volume of the original audio is greater than a preset threshold, determining whether a target sound exists in the original audio;
if the target sound exists in the original audio, filtering the target sound out of the original audio to obtain the first audio clip;
and if the target sound does not exist in the original audio, determining the original audio as the first audio clip.
3. The audio scene classification method according to claim 1, wherein the calculating respective fusion features of the N feature groups comprises:
inputting the first feature segment in each of the N feature groups into a spectral feature information extraction model to obtain the frequency domain feature of the first feature segment in each feature group;
inputting the second feature segment in each of the N feature groups into a time sequence feature information extraction model to obtain the time domain feature of the second feature segment in each feature group;
and performing fusion processing on the frequency domain feature of the first feature segment and the time domain feature of the second feature segment in each of the N feature groups to obtain the respective fusion features of the N feature groups.
4. The audio scene classification method according to claim 3, wherein the spectral feature information extraction model is obtained by training a depthwise separable convolutional neural network, and the time sequence feature information extraction model is obtained by training a recurrent neural network.
5. The audio scene classification method according to claim 3, wherein the performing fusion processing on the frequency domain feature of the first feature segment and the time domain feature of the second feature segment in each of the N feature groups to obtain the respective fusion features of the N feature groups comprises:
and performing, according to preset weights, vector concatenation of the frequency domain feature of the first feature segment and the time domain feature of the second feature segment in each of the N feature groups to obtain the respective fusion features of the N feature groups.
6. The audio scene classification method according to claim 1, wherein the determining a first scene classification result of the first audio clip according to the respective fusion features of the N feature groups comprises:
respectively inputting the fusion features of the N feature groups into a trained classification model to obtain N probability matrices;
calculating the mean value of the N probability matrices to obtain a final probability matrix;
and determining the class label to which the element with the largest value in the final probability matrix belongs as the first scene classification result of the first audio clip.
7. The audio scene classification method according to claim 1, wherein after the determining a first scene classification result of the first audio clip according to the respective fusion features of the N feature groups, the method further comprises:
continuously acquiring audio data with a second preset duration in the target scene to obtain a second audio clip;
determining a second scene classification result of the second audio clip according to the fusion characteristic of the second audio clip;
and determining a final scene classification result of the target scene according to a first scene classification result of the first audio clip and a second scene classification result of the second audio clip.
8. An audio scene classification apparatus, comprising:
the first acquisition module is used for acquiring audio data with a first preset duration in a target scene to obtain a first audio clip;
the first processing module is used for extracting first characteristic information and second characteristic information of the first audio clip, wherein the first characteristic information represents frequency characteristic information with time as a reference, and the second characteristic information represents time characteristic information with frequency as a reference;
the second processing module is used for dividing the first characteristic information into N first characteristic segments and dividing the second characteristic information into N second characteristic segments according to a preset rule, wherein N is a positive integer larger than 1;
a third processing module, configured to divide the N first feature segments and the N second feature segments into N feature groups according to the preset rule, where each feature group includes one first feature segment and one second feature segment;
a calculation module for calculating respective fusion features of the N feature groups;
and the first classification processing module is used for determining a first scene classification result of the first audio clip according to the respective fusion characteristics of the N characteristic groups.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the audio scene classification method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the audio scene classification method according to any one of claims 1 to 7.
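To make the pipeline recited in claims 1, 5 and 6 easier to follow, the sketch below walks through it end to end: the first feature information (a time-frequency representation indexed by time) and the second feature information (its transpose, indexed by frequency) are each divided into N segments, paired into N feature groups, fused by weighted vector concatenation of a frequency domain feature and a time domain feature, and classified by averaging the N probability vectors. The mean-pooling feature extractors and the random classifier weights are placeholders standing in for the trained spectral and time sequence feature information extraction models of claims 3 and 4 and for the trained classification model; they are assumptions made for illustration only and are not part of the claimed method.

import numpy as np

rng = np.random.default_rng(0)

def classify_first_audio_clip(spectrogram, n_groups, class_count, w_freq=0.5, w_time=0.5):
    # spectrogram: 2-D array of shape (time_frames, frequency_bins), e.g. a log-mel spectrogram
    first_info = spectrogram        # frequency feature information with time as the reference
    second_info = spectrogram.T     # time feature information with frequency as the reference

    # Divide each kind of feature information into N segments and pair them into N feature groups
    first_segments = np.array_split(first_info, n_groups, axis=0)
    second_segments = np.array_split(second_info, n_groups, axis=0)

    # Placeholder classifier: a random linear layer followed by a softmax
    fused_dim = spectrogram.shape[1] + spectrogram.shape[0]
    weights = rng.standard_normal((fused_dim, class_count))

    prob_matrices = []
    for first_seg, second_seg in zip(first_segments, second_segments):
        freq_feature = first_seg.mean(axis=0)    # stand-in for the spectral feature model output
        time_feature = second_seg.mean(axis=0)   # stand-in for the time sequence feature model output
        fused = np.concatenate([w_freq * freq_feature, w_time * time_feature])  # weighted concatenation (claim 5)
        logits = fused @ weights
        probs = np.exp(logits - logits.max())
        prob_matrices.append(probs / probs.sum())

    final_probs = np.mean(prob_matrices, axis=0)   # mean of the N probability matrices (claim 6)
    return int(np.argmax(final_probs)), final_probs

# Example: a clip represented by 500 frames x 64 mel bins, N = 5 feature groups, 3 scene classes
label_index, probabilities = classify_first_audio_clip(rng.standard_normal((500, 64)), 5, 3)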
CN202111282304.8A 2021-11-01 2021-11-01 Audio scene classification method and device, terminal equipment and storage medium Pending CN114186094A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111282304.8A CN114186094A (en) 2021-11-01 2021-11-01 Audio scene classification method and device, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111282304.8A CN114186094A (en) 2021-11-01 2021-11-01 Audio scene classification method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114186094A true CN114186094A (en) 2022-03-15

Family

ID=80601762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111282304.8A Pending CN114186094A (en) 2021-11-01 2021-11-01 Audio scene classification method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114186094A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882906A (en) * 2022-06-30 2022-08-09 广州伏羲智能科技有限公司 Novel environmental noise identification method and system
CN117133311A (en) * 2023-02-09 2023-11-28 荣耀终端有限公司 Audio scene recognition method and electronic equipment
CN117133311B (en) * 2023-02-09 2024-05-10 荣耀终端有限公司 Audio scene recognition method and electronic equipment
CN117373488A (en) * 2023-12-08 2024-01-09 富迪科技(南京)有限公司 Audio real-time scene recognition system
CN117373488B (en) * 2023-12-08 2024-02-13 富迪科技(南京)有限公司 Audio real-time scene recognition system

Similar Documents

Publication Publication Date Title
Sailor et al. Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification.
CN110246490B (en) Voice keyword detection method and related device
Stöter et al. Classification vs. regression in supervised learning for single channel speaker count estimation
Sharan et al. Acoustic event recognition using cochleagram image and convolutional neural networks
CN110600059B (en) Acoustic event detection method and device, electronic equipment and storage medium
CN114186094A (en) Audio scene classification method and device, terminal equipment and storage medium
CN110047519B (en) Voice endpoint detection method, device and equipment
CN109872720B (en) Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network
Wang et al. Shallow and deep feature fusion for digital audio tampering detection
Zeng et al. Spatial and temporal learning representation for end-to-end recording device identification
CN110880329A (en) Audio identification method and equipment and storage medium
CN111540342B (en) Energy threshold adjusting method, device, equipment and medium
CN109766476B (en) Video content emotion analysis method and device, computer equipment and storage medium
CN109147816B (en) Method and equipment for adjusting volume of music
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
CN106548786A (en) A kind of detection method and system of voice data
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
Koluguri et al. Spectrogram enhancement using multiple window Savitzky-Golay (MWSG) filter for robust bird sound detection
CN114155876A (en) Traffic flow identification method and device based on audio signal and storage medium
CN110600038A (en) Audio fingerprint dimension reduction method based on discrete kini coefficient
Xu et al. Acoustic scene classification using reduced MobileNet architecture
Wang et al. What affects the performance of convolutional neural networks for audio event classification
CN114863905A (en) Voice category acquisition method and device, electronic equipment and storage medium
Qu et al. Acoustic scene classification based on three-dimensional multi-channel feature-correlated deep learning networks
Jiang et al. An audio data representation for traffic acoustic scene recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination