CN114627895A - Acoustic scene classification model training method and device, intelligent terminal and storage medium - Google Patents

Acoustic scene classification model training method and device, intelligent terminal and storage medium

Info

Publication number
CN114627895A
CN114627895A
Authority
CN
China
Prior art keywords
acoustic scene
classification model
scene classification
signal
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210319713.9A
Other languages
Chinese (zh)
Inventor
谭钦
王佳旭
苗健彰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Elevoc Technology Co ltd
Original Assignee
Elevoc Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Elevoc Technology Co ltd filed Critical Elevoc Technology Co ltd
Priority to CN202210319713.9A priority Critical patent/CN114627895A/en
Publication of CN114627895A publication Critical patent/CN114627895A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an acoustic scene classification model training method and device, an intelligent terminal and a storage medium. The method comprises the following steps: acquiring training data; extracting a first characteristic signal of the sample audio, and slicing and expanding the first characteristic signal to obtain a second characteristic signal; inputting the second characteristic signal into an acoustic scene classification model and outputting a predicted acoustic scene category, the acoustic scene classification model being obtained by improving a residual neural network; and training the acoustic scene classification model according to the predicted acoustic scene category and the real label to obtain a trained acoustic scene classification model. In the embodiment of the invention, the characteristic signal of the sample audio is sliced and then expanded, which reduces the size of the input sample signal and increases the response speed; and the sample signals are fed into the improved residual neural network for training, so the classification results of the model are more accurate.

Description

Acoustic scene classification model training method and device, intelligent terminal and storage medium
Technical Field
The invention relates to the technical field of acoustics, in particular to an acoustic scene classification model training method and device, an intelligent terminal and a storage medium.
Background
Acoustic Scene Classification (ASC) is the process of analyzing the acoustic content contained in audio and identifying the acoustic scene to which the audio corresponds.
Existing acoustic scene classification methods are prone to over-fitting during classification, classify from a single perspective, and perform poorly; they cannot classify effectively in complex living and working environments, their accuracy is low, their models are large, and their classification latency is high.
Thus, there is a need for improvement and development of the prior art.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an acoustic scene classification model training method that addresses the above-mentioned defects of the prior art, namely the low accuracy, large model size and high classification latency of existing acoustic scene classification methods.
The technical solution adopted by the present invention to solve the above problems is as follows:
in a first aspect, an embodiment of the present invention provides an acoustic scene classification model training method, where the method includes:
acquiring training data, wherein the training data comprises a sample audio and a real label corresponding to the sample audio;
extracting a first characteristic signal of the sample audio, and slicing and expanding the first characteristic signal to obtain a second characteristic signal; inputting the second characteristic signal into an acoustic scene classification model, and outputting a predicted acoustic scene category through the acoustic scene classification model; wherein the acoustic scene classification model is obtained by improving a residual neural network;
and training the acoustic scene classification model according to the predicted acoustic scene category and the real label to obtain a trained acoustic scene classification model.
In one implementation, before the obtaining of training data, the method comprises:
acquiring acoustic scene categories; wherein the acoustic scene categories are divided based on an audio frequency domain energy distribution state.
In one implementation, the extracting the first feature signal of the sample audio includes:
resampling the sample audio to obtain a first audio;
converting the first audio into a first digital signal, and carrying out normalization processing on the first digital signal to obtain a second digital signal;
acquiring the frequency domain characteristics of the second digital signal based on a logarithmic Mel frequency spectrum mode to obtain a first characteristic signal of the sample audio; wherein the logarithmic Mel frequency spectrum employs a plurality of Mel filter banks.
In one implementation, the obtaining the second characteristic signal after slicing and expanding the first characteristic signal includes:
slicing the first characteristic signal to obtain a plurality of sub-characteristic signals;
dividing a plurality of sub-feature signals into a plurality of groups to obtain a plurality of first signal groups;
performing secondary grouping on the plurality of first signal groups to obtain a plurality of second signal groups; wherein adjacent second signal groups overlap by signals of a preset length;
labeling the plurality of second signal groups based on the real label;
randomly disordering the sequence of the marked second signal groups to obtain a plurality of third signal groups;
and batching the plurality of third signal groups to obtain the second characteristic signal.
In one implementation, the acoustic scene classification model includes: a first convolution layer, a maximum pooling layer, a plurality of improved residual modules, a second convolution layer, a third convolution layer and a fourth convolution layer;
the inputting the second feature signal into an acoustic scene classification model, and outputting a predicted acoustic scene category through the acoustic scene classification model includes:
inputting the second characteristic signal into the first convolution layer to obtain a first convolution map;
inputting the first convolution map into the maximum pooling layer to obtain a second convolution map;
sequentially inputting the second convolution map into each improved residual module to obtain a third convolution map;
and sequentially inputting the third convolution map into a second convolution layer, a third convolution layer and a fourth convolution layer to obtain the predicted acoustic scene category.
In one implementation, the improved residual module includes an improved residual layer and a first residual layer; wherein the improved residual layer comprises a second residual layer and a linear processing module; the linear processing module is composed of a convolution layer and a normalization layer and is used for performing a linear transformation on the input of the improved residual layer.
In a second aspect, an embodiment of the present invention further provides an acoustic scene classification model training apparatus, where the apparatus includes:
the training data acquisition module is used for acquiring training data, wherein the training data comprises a sample audio and a real label corresponding to the sample audio;
the predicted acoustic scene category acquisition module is used for extracting a first characteristic signal of the sample audio, and slicing and expanding the first characteristic signal to obtain a second characteristic signal; inputting the second characteristic signal into an acoustic scene classification model, and outputting a predicted acoustic scene category through the acoustic scene classification model;
and the acoustic scene classification model acquisition module is used for adjusting model parameters of the acoustic scene classification model according to the predicted acoustic scene category and the real label, and continuously executing the step of inputting the second characteristic signal into the acoustic scene classification model until a preset training condition is met so as to obtain the trained acoustic scene classification model.
In a third aspect, an embodiment of the present invention further provides an acoustic scene classification method, where the method includes:
acquiring audio to be classified, and extracting first characteristics of the audio;
inputting the first characteristic into a trained acoustic scene classification model to obtain an acoustic scene category corresponding to the audio to be classified; wherein the trained acoustic scene classification model is the acoustic scene classification model of any one of claims 1-6.
In a fourth aspect, an embodiment of the present invention further provides an intelligent terminal, including a memory, and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, where the one or more programs include instructions for executing the method for training an acoustic scene classification model according to any one of the above items.
In a fifth aspect, embodiments of the present invention also provide a non-transitory computer-readable storage medium, where instructions of the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the acoustic scene classification model training method according to any one of the above.
The invention has the following beneficial effects: training data is first acquired, the training data comprising a sample audio and a real label corresponding to the sample audio; a first characteristic signal of the sample audio is then extracted, and the first characteristic signal is sliced and expanded to obtain a second characteristic signal; the second characteristic signal is input into an acoustic scene classification model, which outputs a predicted acoustic scene category; the acoustic scene classification model is obtained by improving a residual neural network; finally, the acoustic scene classification model is trained according to the predicted acoustic scene category and the real label to obtain a trained acoustic scene classification model. In the embodiment of the invention, the characteristic signal of the sample audio is sliced and then expanded, which reduces the size of the input sample signal and improves the response speed; and the sample signals are fed into the improved residual neural network for training, so the classification results of the model are more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a training method of an acoustic scene classification model according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a model structure of Simplify-ResNet according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an improved residual unit according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of the network error rate after training and validation with the residual network according to an embodiment of the present invention.
Fig. 5 is a diagram illustrating a network training effect according to an embodiment of the present invention.
Fig. 6 is a schematic block diagram of an acoustic scene classification model apparatus according to an embodiment of the present invention.
Fig. 7 is a schematic block diagram of an internal structure of an intelligent terminal according to an embodiment of the present invention.
Detailed Description
The invention discloses an acoustic scene classification model training method and device, an intelligent terminal and a storage medium. In order to make the purpose, technical solution and effect of the invention clearer, the invention is further described in detail with reference to the attached drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and do not limit it.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the prior art, the classification of acoustic scenes is mainly realized in the following two ways. The first is acoustic scene classification based on conventional machine learning: a conventional machine learning model, such as a Support Vector Machine (SVM), a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM), is used to fit the acoustic features of the audio and obtain the acoustic scene corresponding to the audio. The second is acoustic scene classification based on deep learning: a deep neural network model, such as a Recurrent Neural Network (RNN) or a Convolutional Neural Network (CNN), is used to classify the acoustic scene of the audio. Practice shows that the existing acoustic scene classification approaches depend on the existing models and are prone to over-fitting during classification (over-fitting means that the gap between training error and test error is too large, i.e. the complexity of the model exceeds that of the actual problem, so the model performs well on the training set but poorly on the test set).
In order to solve the problems in the prior art, this embodiment provides an acoustic scene classification model training method and device, an intelligent terminal and a storage medium. By slicing and then expanding the characteristic signal of the sample audio, the size of the input sample signal is reduced and the response speed is increased; and the sample signals are fed into the improved residual neural network for training, so the classification results of the model are more accurate. In a specific implementation, training data is obtained first, the training data comprising a sample audio and a real label corresponding to the sample audio; a first characteristic signal of the sample audio is then extracted, and the first characteristic signal is sliced and expanded to obtain a second characteristic signal; the second characteristic signal is input into an acoustic scene classification model, and a predicted acoustic scene category is output through the acoustic scene classification model; the acoustic scene classification model is obtained by improving a residual neural network; finally, the acoustic scene classification model is trained according to the predicted acoustic scene category and the real label to obtain a trained acoustic scene classification model.
Exemplary method
The embodiment provides an acoustic scene classification model training method which can be applied to an acoustic intelligent terminal. As shown in fig. 1 in detail, the method includes:
step S100, acquiring training data, wherein the training data comprises a sample audio and a real label corresponding to the sample audio;
specifically, the training data is from an open source data set, or a batch of data may be collected according to actual needs, and then the data is further screened and optimized, for example, data that does not conform to the characteristics of the scene is removed, data that has particularly obvious human voice characteristics is also removed, and then the data is categorized again according to a defined classification scene. The real label is a real acoustic scene category corresponding to the sample audio.
In one implementation, the acoustic scenes are classified as preparation work before the training data is acquired. Existing classification schemes use categories such as airport, shopping mall, subway station, street sidewalk, square, street with little traffic, tram, bus, subway and park; other schemes divide scenes into "quiet environment", "noisy environment", "music environment" and "speech environment". The classification in the present invention refines and regroups these according to specific working and living scenes, dividing the scene categories into 6 classes: rail vehicle interior environment, engine vehicle interior environment, outdoor crowd-dense environment, traffic-dense environment, indoor crowd-dense environment, and relatively quiet environment; the categories can be extended according to actual requirements. The 6 classes of scenes are divided on the basis of the audio frequency-domain energy distribution state, or the range in which the audio frequency-domain energy is concentrated. This division groups similar acoustic expressions together; since acoustic scenes are mainly concerned with acoustic features, distinguishing scene classes in environments with prominent acoustic features has practical significance, so the method can be applied to real-life scenes and improves classification accuracy. This classification also matches everyday environments better than event detection: event detection only reflects a single, transient event (such as breaking glass or a dog barking) and cannot represent the acoustic characteristics of the current environment as a whole. The 6 basic environment classes of the invention essentially cover the working, living and learning environments that most people encounter, so the classification is universal and can be implemented in actual engineering.
Thus, each sample audio corresponds to one of six types of acoustic scenes.
After the training data is obtained, the following step can be performed as shown in fig. 1: S200, extracting a first characteristic signal of the sample audio, and slicing and expanding the first characteristic signal to obtain a second characteristic signal; inputting the second characteristic signal into an acoustic scene classification model, and outputting a predicted acoustic scene category through the acoustic scene classification model; wherein the acoustic scene classification model is obtained by improving a residual neural network.
specifically, the sample audio is a video signal, a model cannot be directly input, a first characteristic signal in the sample audio needs to be extracted, the size of the first characteristic signal extracted once is too large in practice, and the response speed of the model is low. In this embodiment, the acoustic scene classification model is obtained by improving a residual error neural network, and may be a model obtained by deleting a conventional residual error neural network or a model obtained by adjusting a conventional residual error neural network, which is not particularly limited.
In one implementation, the extracting the first feature signal of the sample audio includes the following steps: resampling the sample audio to obtain a first audio; converting the first audio into a first digital signal, and carrying out normalization processing on the first digital signal to obtain a second digital signal; acquiring the frequency domain characteristics of the second digital signal based on a logarithmic Mel frequency spectrum acquisition mode to obtain a first characteristic signal of the sample audio; wherein the logarithmic Mel frequency spectrum employs a plurality of Mel filter banks.
Specifically, the resampling rate is 16 kHz: the sample audio is resampled at 16 kHz to obtain a first audio, the data of the WAV file of the first audio is then converted into a first digital signal that can serve as the network model input, and the first digital signal is normalized to obtain a second digital signal, where normalization scales each value to a decimal in (0, 1). After the second digital signal is obtained, the frequency-domain characteristics of the audio signal, that is, the first characteristic signal of the sample audio, are obtained with the existing Log Mel Filter Bank (LMFB) method; the logarithmic Mel spectrum is prior art and is not described again here. The logarithmic Mel spectrum of the sample audio represents the acoustic signal characteristics of the target audio and is built from a Mel filter bank. The Mel filter bank is a set of nonlinearly distributed filters, densely spaced at low frequencies and sparsely spaced at high frequencies; this difference in distribution between high and low frequencies better matches the auditory characteristics of the human ear. The specific parameters used to obtain the frequency-domain characteristics of the speech signal from the logarithmic Mel spectrum are: a short-time Fourier window length of 1024 samples, a frame shift of 500 samples, a window length equal to the FFT length, 40 filter banks, a maximum frequency of 8 kHz, a minimum frequency of 50 Hz, a magnitude-spectrum exponent of 2 (power spectrum), and a centered Hamming (hann) window. In this embodiment, 40 Mel filter banks are used for the logarithmic Mel spectrum, which reduces the size of the extracted first characteristic signal, further reduces the model parameters, shrinks the device footprint and lowers the latency.
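The feature-extraction step above can be summarized in a short sketch. The snippet below is a minimal illustration only, assuming librosa as the signal-processing library (the patent does not name one); parameter values follow the text (16 kHz resampling, 1024-sample window, 500-sample hop, 40 Mel filters between 50 Hz and 8 kHz, power spectrum, centered Hann window), and the min-max normalization is one possible reading of "a decimal in (0, 1)".

```python
# Minimal sketch of the feature extraction described above (librosa is an assumed dependency).
import librosa
import numpy as np

def extract_log_mel(wav_path: str) -> np.ndarray:
    # Resample the sample audio to 16 kHz (first audio) and read it as a float signal.
    signal, sr = librosa.load(wav_path, sr=16000, mono=True)
    # Normalize the first digital signal to (0, 1) -> second digital signal.
    signal = (signal - signal.min()) / (signal.max() - signal.min() + 1e-8)
    # Log-Mel filter bank: 1024-point window, 500-sample hop, 40 Mel filters,
    # 50 Hz - 8 kHz, power spectrum (exponent 2), centered Hann window.
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=1024, hop_length=500, win_length=1024,
        n_mels=40, fmin=50, fmax=8000, power=2.0, window="hann", center=True)
    log_mel = librosa.power_to_db(mel)   # first characteristic signal
    return log_mel.T                     # shape: (frames, 40)
```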
In one implementation, slicing and expanding the first characteristic signal to obtain the second characteristic signal includes the following steps: slicing the first characteristic signal to obtain a plurality of sub-characteristic signals; dividing the sub-characteristic signals into groups to obtain a plurality of first signal groups; performing a secondary grouping on the first signal groups to obtain a plurality of second signal groups, where adjacent second signal groups overlap by signals of a preset length; labeling the second signal groups based on the real label; randomly shuffling the order of the labeled second signal groups to obtain a plurality of third signal groups; and batching the third signal groups to obtain the second characteristic signal.
Specifically, the first characteristic signal is sliced; for example, each 10-second audio segment is processed into 320 groups of digital signals, so each sliced sub-characteristic signal is small. The sub-characteristic signals are then divided into groups to obtain a plurality of first signal groups: if every 20 arrays form one group, the signals are divided into 16 first signal groups. Next, the first signal groups are grouped a second time to obtain the second signal groups, and adjacent second signal groups share an overlapping signal of preset length. If the first signal groups are numbered group 1, group 2, group 3, and so on in chronological order, then group 1 and group 2 are spliced in order into the second signal group with sequence number 1, group 2 and group 3 are spliced into the second signal group with sequence number 2, and so on, yielding 15 independent second signal groups. The second signal groups with sequence numbers 1 and 2 share the overlapping group 2, those with sequence numbers 2 and 3 share the overlapping group 3, and so on. Since the real label is the real acoustic scene category corresponding to the sample audio, the six real acoustic scene categories can be marked, for example: "rail vehicle interior environment" as 0, "engine vehicle interior environment" as 1, "outdoor crowd-dense environment" as 2, "traffic-dense environment" as 3, "indoor crowd-dense environment" as 4, and "relatively quiet environment" as 5. The second signal groups are derived from the sample audio, so their labels are the same as the label of the sample audio. The labeled second signal groups are then randomly shuffled to obtain a plurality of third signal groups; finally, the third signal groups are batched to obtain the second characteristic signals, which are stored and later loaded with a data loader for use. Randomly shuffling and augmenting the data set further improves the accuracy and generalization ability of the classification method.
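The slice-and-expand procedure can be sketched as follows, using the worked numbers from the text (320 frames per 10-second clip, first signal groups of 20 frames, overlapping pairs of adjacent groups, labels 0-5, batches of 64). The function names, the NumPy/standard-library usage and the exact batching format are illustrative assumptions, not taken from the patent.

```python
# Illustrative sketch of slicing, secondary grouping, labeling, shuffling and batching.
import random
import numpy as np

LABELS = {"rail vehicle interior environment": 0, "engine vehicle interior environment": 1,
          "outdoor crowd-dense environment": 2, "traffic-dense environment": 3,
          "indoor crowd-dense environment": 4, "relatively quiet environment": 5}

def slice_and_expand(log_mel: np.ndarray, label: int, group_size: int = 20):
    # Slice the first characteristic signal (e.g. 320 frames for 10 s) into
    # first signal groups of 20 frames each -> 16 groups.
    n = log_mel.shape[0] // group_size
    first_groups = [log_mel[i * group_size:(i + 1) * group_size] for i in range(n)]
    # Secondary grouping: splice adjacent groups so that consecutive second
    # signal groups overlap by one first signal group (15 groups from 16).
    second_groups = [np.concatenate([first_groups[i], first_groups[i + 1]])
                     for i in range(n - 1)]
    # Each second signal group inherits the label of its sample audio.
    return [(g, label) for g in second_groups]

def make_batches(labelled_groups, batch_size: int = 64, seed: int = 0):
    # Randomly shuffle the labeled groups (third signal groups), then batch them.
    random.Random(seed).shuffle(labelled_groups)
    return [labelled_groups[i:i + batch_size]
            for i in range(0, len(labelled_groups), batch_size)]
```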
In one implementation, the acoustic scene classification model includes a first convolution layer, a maximum pooling layer, a plurality of improved residual modules, a second convolution layer, a third convolution layer and a fourth convolution layer. Inputting the second characteristic signal into the acoustic scene classification model and outputting a predicted acoustic scene category through the model includes the following steps: inputting the second characteristic signal into the first convolution layer to obtain a first convolution map; inputting the first convolution map into the maximum pooling layer to obtain a second convolution map; sequentially inputting the second convolution map into each improved residual module to obtain a third convolution map; and sequentially inputting the third convolution map into the second, third and fourth convolution layers to obtain the predicted acoustic scene category.
In this embodiment, the acoustic scene classification model is obtained by improving the classical residual neural network (ResNet) among convolutional neural networks (CNNs), yielding a simplified ResNet model referred to in the present invention as Simplify-ResNet-CNN. As shown in fig. 2, the input of the acoustic scene classification model has 1 channel; W may be chosen according to actual requirements, and W = 40 is used in one example of the invention. The acoustic scene classification model comprises a first convolution layer, a maximum pooling layer, a plurality of improved residual modules, a second convolution layer, a third convolution layer and a fourth convolution layer; the concrete structure of the model can be adjusted, and such variants also fall within the scope of protection. The first, second, third and fourth convolution layers are two-dimensional convolution layers. The convolution feature extraction layer in the first convolution layer is a two-dimensional convolution with a 1 × 1 kernel, a zero-padding (pad) of 2 (i.e. two rings of zeros are padded around the input feature), a sliding stride of 1, 1 input channel and 4 output channels. The first convolution layer is followed by a normalization layer and an activation layer, then a two-dimensional max pooling with a kernel of 3, a stride of 1 and a zero-padding of 1, and a further normalization after the first convolution and pooling layers. The second convolution layer has a 3 × 3 kernel, a zero-padding of (0, 3), 32 input channels and 64 output channels. The third convolution layer has a 1 × 1 kernel, 64 input channels and 128 output channels. The fourth convolution layer has a 1 × 1 kernel, 128 input channels and 6 output channels. In addition, the acoustic scene classification model includes a Softmax step, which maps the outputs of the neurons into the (0, 1) interval as probabilities for multi-class output. In practice there are four improved residual modules (ResBlockPlus modules): the 1st ResBlockPlus has 4 input channels and 8 output channels, the 2nd has 8 input and 16 output channels, the 3rd has 16 input and 32 output channels, and the 4th has 32 input and 32 output channels. When the second characteristic signal is fed into the acoustic scene classification model, it passes in sequence through the first convolution layer, the max pooling layer, each improved residual module, the second, third and fourth convolution layers and the Softmax layer, and six acoustic scene category probabilities are output for each second characteristic signal; the category with the largest probability is the predicted acoustic scene category of that signal. This improves the size, computational performance and accuracy of the acoustic scene classification model.
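A PyTorch sketch of this topology is given below for illustration. Kernel sizes, paddings and channel counts follow the description above; the interpretation of the pooling parameters (kernel 3, stride 1, padding 1), the global averaging before Softmax and all class names are assumptions, and the ResBlockPlus class it relies on is sketched after the improved residual module description below. The forward pass returns raw class scores, and predict() applies the Softmax described in the text, since the cross-entropy loss used later already applies Softmax internally during training.

```python
# Illustrative PyTorch sketch of the Simplify-ResNet-CNN topology (not the patented implementation).
import torch
import torch.nn as nn
# ResBlockPlus: improved residual module, sketched after the module description below.

class SimplifyResNetCNN(nn.Module):
    def __init__(self, n_classes: int = 6):
        super().__init__()
        # First convolution layer: 1x1 kernel, padding 2, stride 1, 1 -> 4 channels,
        # followed by normalization, activation, max pooling and normalization.
        self.stem = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=1, stride=1, padding=2),
            nn.BatchNorm2d(4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),  # assumed pooling parameters
            nn.BatchNorm2d(4))
        # Four improved residual modules: 4 -> 8 -> 16 -> 32 -> 32 channels.
        self.blocks = nn.Sequential(ResBlockPlus(4, 8), ResBlockPlus(8, 16),
                                    ResBlockPlus(16, 32), ResBlockPlus(32, 32))
        # Second, third and fourth convolution layers.
        self.head = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=(0, 3)),
            nn.Conv2d(64, 128, kernel_size=1),
            nn.Conv2d(128, n_classes, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.head(self.blocks(self.stem(x)))
        return x.mean(dim=(2, 3))        # one raw score per acoustic scene category

    def predict(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax maps the scores to probabilities in (0, 1), as described in the text.
        return torch.softmax(self.forward(x), dim=1)
```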
In one implementation, the improved residual module includes an improved residual layer and a first residual layer; the improved residual layer comprises a second residual layer and a linear processing module; the linear processing module consists of a convolution layer and a normalization layer and performs a linear transformation on the input of the improved residual layer.
Specifically, the classical residual block has two layers, as expressed by the following expression:
F(x) = W2 σ(W1 x)
where σ represents the nonlinear function ReLU; the output y is then obtained through a shortcut connection and a second ReLU:
y = σ(F(x) + x)
the residual block often needs more than two layers, and a single layer of the residual block (y ═ W1x + x) cannot play a role in lifting. In consideration of the need to change the input and output dimensions (e.g., change the number of channels), the present invention can perform a linear transformation Ws on x during shortcut, as follows:
y = σ(F(x) + Ws x)
Based on the above theory, the embodiment of the present invention improves the residual module as shown in fig. 3. The improved residual module includes an improved residual layer and a first residual layer, and the first residual layer includes a first convolution layer and a second convolution layer. The convolution feature extraction layer in the first convolution layer is a two-dimensional convolution with a 3 × 3 kernel, a zero-padding (pad) of 1 (i.e. one ring of zeros is padded around the input feature) and a sliding stride of 1, the number of channels being an input parameter; the first convolution layer is followed by a normalization layer and an activation layer. The convolution feature extraction layer in the second convolution layer is likewise a two-dimensional convolution with a 3 × 3 kernel, a zero-padding of 1 and a stride of 1, the number of channels being an input parameter, and the second convolution layer is followed by one normalization layer. The first and second convolution layers are then connected in series, and the result is added to the input to obtain the final output. Compared with the first residual layer, the improved residual layer has an additional linear processing module, which applies a linear transformation to the input before the addition; its processing flow includes, but is not limited to, the following operations: the input passes through a two-dimensional convolution with a 1 × 1 kernel and a stride of 1 (the other convolution parameters are set as required), followed by one normalization. The signal flow of the improved residual module is that the input is processed by the improved residual layer, and the result is then processed by the first residual layer. In practice, the input/output parameters and the number of residual blocks in the improved residual module can be increased or decreased flexibly, which is not limited here. The network error rate after training and validation of the residual network is shown in fig. 4.
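The improved residual module described above can be sketched in PyTorch as follows. The layer parameters (3 × 3 body convolutions with padding 1, a 1 × 1 convolution plus normalization on the shortcut) follow the text; the class names, the placement of the ReLU after the addition and any detail the text leaves open are assumptions for illustration.

```python
# Illustrative sketch of the improved residual module (ResBlockPlus); names are assumed.
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """First residual layer: two 3x3 convolutions with an identity shortcut, y = sigma(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)

class ImprovedResidualLayer(nn.Module):
    """Second residual layer plus a linear processing module on the shortcut, y = sigma(F(x) + Ws x)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch))
        # Linear processing module: 1x1 convolution followed by normalization.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1),
            nn.BatchNorm2d(out_ch))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + self.shortcut(x))

class ResBlockPlus(nn.Module):
    """Improved residual module: an improved residual layer followed by a first residual layer."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(ImprovedResidualLayer(in_ch, out_ch), ResidualLayer(out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)
```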
Having obtained the predicted acoustic scene category, the following steps can be performed as shown in fig. 1: s300, training the acoustic scene classification model according to the predicted acoustic scene category and the real label to obtain a trained acoustic scene classification model.
Specifically, for each sample audio the objective is to bring the predicted acoustic scene category output by the acoustic scene classification model close to the real label. During this process, the parameters of the acoustic scene classification model are continuously adjusted and the step of inputting the second characteristic signal into the model is repeated until a preset training condition is met, for example the loss function reaching a preset requirement or the number of training iterations reaching a preset count, so as to obtain a trained acoustic scene classification model. In this example, the batch size of the input data is 64, the optimizer is stochastic gradient descent (SGD) with a learning rate of 0.01, the loss function is cross-entropy (CrossEntropyLoss), and training iterates for 30 epochs. The training effect is shown in fig. 5.
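A minimal training-loop sketch with these hyper-parameters is shown below; the dataset handling, the accuracy bookkeeping and the assumption that the model returns raw class scores (so that nn.CrossEntropyLoss can apply Softmax internally) are illustrative choices, not the patented procedure.

```python
# Minimal training sketch: batch size 64, SGD, lr 0.01, cross-entropy, 30 epochs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

def train(model: nn.Module, dataset: Dataset, epochs: int = 30) -> nn.Module:
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        correct, total = 0, 0
        for features, labels in loader:
            optimizer.zero_grad()
            scores = model(features)                 # raw scores for the 6 categories
            loss = criterion(scores, labels)
            loss.backward()
            optimizer.step()
            correct += (scores.argmax(dim=1) == labels).sum().item()
            total += labels.numel()
        # accuracy = number of correct inferences / total number x 100%
        print(f"epoch {epoch + 1}: accuracy = {100.0 * correct / total:.2f}%")
    return model
```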
The Simplify-ResNet inference model achieves an accuracy of 96.28% on the training set and 84.35% on the test set. The accuracy is calculated as follows:
the accuracy rate is X100 percent of the number with correct inference/total number
How the Simplify-ResNet model is used, and on which devices or apparatuses:
the trained model is smaller, the computational reasoning speed is higher, and model compression quantization, model pruning and knowledge distillation can be performed.
The model can also be used on devices or equipment with relatively large memory and computing power, including but not limited to mobile phones, personal PCs and servers. It can be deployed directly with the current mainstream PyTorch Lightning and TensorFlow Lite inference frameworks. After the digital signal of the target audio is acquired, it is resampled and normalized, the logarithmic Mel spectrum features are extracted, and a 1 × 40 dimension is used as the input (the processing steps and parameters are consistent with those used for the training data). If the model has been quantized, the input data must be quantized as well. The model is then run on the inference framework, and finally one of the 6 classification indices is obtained.
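The inference flow just described can be sketched as follows, reusing the extract_log_mel and SimplifyResNetCNN sketches from earlier sections; the (batch, channel, frames, 40) input shape is one interpretation of the 1 × 40 input dimension, and the scene-name list is only illustrative.

```python
# Illustrative inference sketch, reusing the earlier extract_log_mel and model sketches.
import torch

SCENES = ["rail vehicle interior environment", "engine vehicle interior environment",
          "outdoor crowd-dense environment", "traffic-dense environment",
          "indoor crowd-dense environment", "relatively quiet environment"]

def classify(wav_path: str, model: torch.nn.Module) -> str:
    log_mel = extract_log_mel(wav_path)              # (frames, 40), as in training
    x = torch.tensor(log_mel, dtype=torch.float32)
    x = x.unsqueeze(0).unsqueeze(0)                  # (batch=1, channel=1, frames, 40)
    model.eval()
    with torch.no_grad():
        probs = model.predict(x)                     # six category probabilities
    return SCENES[int(probs.argmax(dim=1))]
```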
The invention is small and has low computational requirements, so it is particularly suitable for deployment on a microprocessor; the specific flow is as follows:
First, an inference engine is built in C or assembly code according to the specific model structure; the inference engine needs to be optimized with the arithmetic instructions of the platform chip, and the model parameter file is loaded into the inference engine in binary form at run time. The target audio is processed in the same way as in the deployment above to obtain the data to be input; these data are passed to the inference engine as parameters, the code is then compiled to generate an executable file, and the executable file is flashed onto the corresponding microcontroller platform to run.
The acoustic scene classification model training method has the advantages of low latency and small size, can distinguish acoustic scenes in complex environments, and its classification categories can be extended.
The comparison results between this acoustic scene classification model and other acoustic scene classification models are shown in a table (provided as an image in the original specification).
the invention has the advantages of very small requirement on internal memory computing power, improvement on time delay reduction, higher accuracy and more suitability for being used on hearing aids and wearable equipment.
Exemplary device
As shown in fig. 6, an acoustic scene classification model training apparatus provided in an embodiment of the present invention includes a training data obtaining module 401, a predicted acoustic scene category obtaining module 402, and an acoustic scene classification model obtaining module 403: a training data obtaining module 401, configured to obtain training data, where the training data includes a sample audio and a real label corresponding to the sample audio;
a predicted acoustic scene category obtaining module 402, configured to extract a first feature signal of the sample audio, and slice and expand the first feature signal to obtain a second feature signal; inputting the second characteristic signal into an acoustic scene classification model, and outputting a predicted acoustic scene category through the acoustic scene classification model;
an acoustic scene classification model obtaining module 403, configured to adjust a model parameter of the acoustic scene classification model according to the predicted acoustic scene category and the real label, and continue to perform the step of inputting the second feature signal into the acoustic scene classification model until a preset training condition is met, so as to obtain a trained acoustic scene classification model.
The embodiment of the invention also provides an acoustic scene classification method, which is characterized by comprising the following steps:
H100, obtaining audio to be classified, and extracting first features of the audio;
H200, inputting the first features into a trained acoustic scene classification model to obtain an acoustic scene category corresponding to the audio to be classified;
specifically, the audio to be classified may be acquired first, then the first feature of the audio is obtained in a logarithmic mel-frequency spectrum manner, and then the first feature is input into a trained acoustic scene classification model.
The existing acoustic scene classification technology is applied to a plurality of fields such as ecological environment monitoring, public safety intelligent monitoring, voice recognition technology, automatic auxiliary driving, voice communication, hearing aids and the like.
Based on the above embodiment, the present invention further provides an intelligent terminal, and a schematic block diagram thereof may be as shown in fig. 7. The intelligent terminal comprises a processor, a memory, a network interface, a display screen and a temperature sensor which are connected through a system bus. Wherein, the processor of the intelligent terminal is used for providing calculation and control capability. The memory of the intelligent terminal comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the intelligent terminal is used for being connected and communicated with an external terminal through a network. The computer program is executed by a processor to implement a method of training an acoustic scene classification model. The display screen of the intelligent terminal can be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the intelligent terminal is arranged inside the intelligent terminal in advance and used for detecting the operating temperature of internal equipment.
It will be understood by those skilled in the art that the schematic diagram of fig. 7 is only a block diagram of a part of the structure related to the solution of the present invention, and does not constitute a limitation to the intelligent terminal to which the solution of the present invention is applied, and a specific intelligent terminal may include more or less components than those shown in the figure, or combine some components, or have different arrangements of components.
In one embodiment, an intelligent terminal is provided that includes a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
acquiring training data, wherein the training data comprises sample audio and real labels corresponding to the sample audio;
extracting a first characteristic signal of the sample audio, and slicing and expanding the first characteristic signal to obtain a second characteristic signal; inputting the second characteristic signal into an acoustic scene classification model, and outputting a predicted acoustic scene category through the acoustic scene classification model; wherein the acoustic scene classification model is obtained by improving a residual neural network;
and training the acoustic scene classification model according to the predicted acoustic scene category and the real label to obtain a trained acoustic scene classification model.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, databases or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
In summary, the invention discloses an acoustic scene classification model training method and device, an intelligent terminal and a storage medium, wherein the method comprises: acquiring training data; extracting a first characteristic signal of the sample audio, and slicing and expanding the first characteristic signal to obtain a second characteristic signal; inputting the second characteristic signal into an acoustic scene classification model and outputting a predicted acoustic scene category, the acoustic scene classification model being obtained by improving a residual neural network; and training the acoustic scene classification model according to the predicted acoustic scene category and the real label to obtain a trained acoustic scene classification model. In the embodiment of the invention, the characteristic signal of the sample audio is sliced and then expanded, which reduces the size of the input sample signal and increases the response speed; and the sample signals are fed into the improved residual neural network for training, so the classification results of the model are more accurate.
Based on the above embodiments, the present invention discloses a method for training an acoustic scene classification model, and it should be understood that the application of the present invention is not limited to the above examples, and it will be obvious to those skilled in the art that modifications and variations can be made in the light of the above description, and all such modifications and variations should fall within the scope of the appended claims.

Claims (10)

1. A method for training an acoustic scene classification model, the method comprising:
acquiring training data, wherein the training data comprises a sample audio and a real label corresponding to the sample audio;
extracting a first characteristic signal of the sample audio, and slicing and expanding the first characteristic signal to obtain a second characteristic signal; inputting the second characteristic signal into an acoustic scene classification model, and outputting a predicted acoustic scene category through the acoustic scene classification model; wherein the acoustic scene classification model is obtained by improving a residual neural network;
and training the acoustic scene classification model according to the predicted acoustic scene category and the real label to obtain a trained acoustic scene classification model.
2. The method of claim 1, wherein the obtaining training data comprises:
acquiring acoustic scene categories; wherein the acoustic scene categories are divided based on an audio frequency domain energy distribution state.
3. The method of claim 1, wherein the extracting the first feature signal of the sample audio comprises:
resampling the sample audio to obtain a first audio;
converting the first audio into a first digital signal, and carrying out normalization processing on the first digital signal to obtain a second digital signal;
acquiring the frequency domain characteristics of the second digital signal based on a logarithmic Mel frequency spectrum mode to obtain a first characteristic signal of the sample audio; wherein the logarithmic Mel frequency spectrum employs a plurality of Mel filter banks.
4. The method for training an acoustic scene classification model according to claim 1, wherein the step of slicing and expanding the first feature signal to obtain a second feature signal comprises:
slicing the first characteristic signal to obtain a plurality of sub-characteristic signals;
dividing a plurality of sub-feature signals into a plurality of groups to obtain a plurality of first signal groups;
performing secondary grouping on the plurality of first signal groups to obtain a plurality of second signal groups; wherein adjacent second signal groups overlap by signals of a preset length;
labeling the plurality of second signal groups based on the real label;
randomly disordering the sequence of the marked second signal groups to obtain a plurality of third signal groups;
and batching the plurality of third signal groups to obtain the second characteristic signal.
5. The method of claim 1, wherein the acoustic scene classification model comprises: a first convolution layer, a maximum pooling layer, a plurality of improved residual modules, a second convolution layer, a third convolution layer and a fourth convolution layer;
the inputting the second feature signal into an acoustic scene classification model, and the outputting a predicted acoustic scene category through the acoustic scene classification model includes:
inputting the second characteristic signal into the first convolution layer to obtain a first convolution map;
inputting the first convolution map into the maximum pooling layer to obtain a second convolution map;
sequentially inputting the second convolution map into each improved residual module to obtain a third convolution map;
and sequentially inputting the third convolution map into a second convolution layer, a third convolution layer and a fourth convolution layer to obtain the predicted acoustic scene category.
6. The acoustic scene classification model training method according to claim 5, wherein the improved residual module comprises an improved residual layer and a first residual layer; wherein the improved residual layer comprises a second residual layer and a linear processing module; the linear processing module is composed of a convolution layer and a normalization layer and is used for performing a linear transformation on the input of the improved residual layer.
7. An acoustic scene classification model training apparatus, the apparatus comprising:
the training data acquisition module is used for acquiring training data, wherein the training data comprises a sample audio and a real label corresponding to the sample audio;
the predicted acoustic scene category acquisition module is used for extracting a first characteristic signal of the sample audio, and slicing and expanding the first characteristic signal to obtain a second characteristic signal; inputting the second characteristic signal into an acoustic scene classification model, and outputting a predicted acoustic scene category through the acoustic scene classification model;
and the acoustic scene classification model acquisition module is used for adjusting model parameters of the acoustic scene classification model according to the predicted acoustic scene category and the real label, and continuously executing the step of inputting the second characteristic signal into the acoustic scene classification model until a preset training condition is met so as to obtain the trained acoustic scene classification model.
8. A method of acoustic scene classification, the method comprising:
acquiring audio to be classified, and extracting a first feature of the audio;
inputting the first feature into a trained acoustic scene classification model to obtain the acoustic scene category corresponding to the audio to be classified; wherein the trained acoustic scene classification model is obtained by the acoustic scene classification model training method of any one of claims 1-6.
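For the classification method of claim 8, inference reduces to extracting the same log-Mel feature used in training and taking the arg-max of the trained model's output. The sketch below assumes a model with the interface of the architecture sketch above; `classify_scene`, `scene_names`, and the feature settings are illustrative assumptions.

```python
import librosa
import numpy as np
import torch

def classify_scene(model, wav_path, scene_names, sr=16000, n_mels=64):
    """Run a trained acoustic scene classification model on one audio clip."""
    # Load the audio to be classified and extract its first feature (log-Mel).
    waveform, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=1024,
                                         hop_length=320, n_mels=n_mels)
    feature = librosa.power_to_db(mel, ref=np.max)

    # Input the feature into the trained model and read out the category.
    x = torch.from_numpy(feature).float()[None, None]   # (1, 1, mels, frames)
    model.eval()
    with torch.no_grad():
        scores = model(x)                                # (1, num_classes)
    return scene_names[int(scores.argmax(dim=1))]
```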
9. An intelligent terminal comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method of any one of claims 1-6.
10. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-6.
CN202210319713.9A 2022-03-29 2022-03-29 Acoustic scene classification model training method and device, intelligent terminal and storage medium Pending CN114627895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210319713.9A CN114627895A (en) 2022-03-29 2022-03-29 Acoustic scene classification model training method and device, intelligent terminal and storage medium

Publications (1)

Publication Number Publication Date
CN114627895A true CN114627895A (en) 2022-06-14

Family

ID=81903947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210319713.9A Pending CN114627895A (en) 2022-03-29 2022-03-29 Acoustic scene classification model training method and device, intelligent terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114627895A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140046878A1 (en) * 2012-08-10 2014-02-13 Thales Method and system for detecting sound events in a given environment
US20160284346A1 (en) * 2015-03-27 2016-09-29 Qualcomm Incorporated Deep neural net based filter prediction for audio event classification and extraction
US20200312321A1 (en) * 2017-10-27 2020-10-01 Ecole De Technologie Superieure In-ear nonverbal audio events classification system and method
CN111279414A (en) * 2017-11-02 2020-06-12 华为技术有限公司 Segmentation-based feature extraction for sound scene classification
US10847137B1 (en) * 2017-12-12 2020-11-24 Amazon Technologies, Inc. Trigger word detection using neural network waveform processing

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MD ASIF JALAL: "Spatio-Temporal Context Modelling For Speech Emotion Classification", 2019 ASRU, 20 February 2020 (2020-02-20) *
STAMMERS, JON: "Audio Event Classification for Urban Soundscape Analysis", PhD thesis, University of York, 2 January 2018 (2018-01-02) *
WANG JIAXU et al.: "Experimental study of time-frequency characteristics of acoustic emission key signals during diorite fracture", IOP Conference Series: Earth and Environmental Science, 26 October 2020 (2020-10-26) *
LIU WEI: "Content-based retrieval of homologous audio and video", China Master's Theses Full-text Database (Information Science and Technology), 15 September 2011 (2011-09-15) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713945A (en) * 2022-11-10 2023-02-24 杭州爱华仪器有限公司 Audio data processing method and prediction method
CN117524252A (en) * 2023-11-13 2024-02-06 北方工业大学 Light-weight acoustic scene perception method based on drunken model
CN117524252B (en) * 2023-11-13 2024-04-05 北方工业大学 Light-weight acoustic scene perception method based on drunken model

Similar Documents

Publication Publication Date Title
CN110120224B (en) Method and device for constructing bird sound recognition model, computer equipment and storage medium
CN108877775B (en) Voice data processing method and device, computer equipment and storage medium
CN114627895A (en) Acoustic scene classification model training method and device, intelligent terminal and storage medium
CN109036382B (en) Audio feature extraction method based on KL divergence
CN112885372A (en) Intelligent diagnosis method, system, terminal and medium for power equipment fault sound
Sultana et al. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks
CN103456301A (en) Ambient sound based scene recognition method and device and mobile terminal
CN109065027A (en) Speech differentiation model training method, device, computer equipment and storage medium
CN109919295B (en) Embedded audio event detection method based on lightweight convolutional neural network
Passricha et al. A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR
CN112183107A (en) Audio processing method and device
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
CN110415697A (en) A kind of vehicle-mounted voice control method and its system based on deep learning
CN114373452A (en) Voice abnormity identification and evaluation method and system based on deep learning
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN112183582A (en) Multi-feature fusion underwater target identification method
Kumar et al. An audio classification approach using feature extraction neural network classification approach
Cheng et al. Spectrogram-based classification on vehicles with modified loud exhausts via convolutional neural networks
Zhang et al. Machine hearing for industrial fault diagnosis
CN116153337B (en) Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
Rituerto-González et al. End-to-end recurrent denoising autoencoder embeddings for speaker identification
CN115938364A (en) Intelligent identification control method, terminal equipment and readable storage medium
Wang et al. A hierarchical birdsong feature extraction architecture combining static and dynamic modeling
Wickramasinghe et al. DNN controlled adaptive front-end for replay attack detection systems
Estrebou et al. Voice recognition based on probabilistic SOM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination