CN115713945A - Audio data processing method and prediction method - Google Patents
- Publication number: CN115713945A (application CN202211406445.0A)
- Authority
- CN
- China
- Prior art keywords
- audio
- audio data
- frame
- neural network
- test
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to an audio data processing method and a prediction method in the technical field of audio processing, comprising the following steps: acquiring an audio data set and preprocessing it to obtain a preprocessed audio set; extracting each frame of audio signal of each audio segment in the preprocessed audio set, and extracting a feature spectrogram set of each frame of audio signal, wherein the feature spectrogram set comprises two or more feature spectrograms; normalizing the feature spectrogram set and generating multi-channel features; and generating a neural network model and performing neural network training with the multi-channel features as input, thereby solving the problem of deep learning over multiple audio features.
Description
Technical Field
The invention relates to the technical field of audio processing, in particular to an audio data processing method and a prediction method.
Background
Currently, developers of audio classification algorithms train audio classification models by extracting the Mel-Frequency Cepstral Coefficient (MFCC) spectrogram of the audio and then using a recurrent neural network or a convolutional neural network.
The single MFCC spectrogram extracted in this way is too simple and contains little audio information, so deep features of the audio are difficult to learn during neural network training.
A recurrent neural network has a memory function and suits tasks with a temporal order, such as predicting text context, whereas a noise classification model tends to treat its targets as a whole. A convolutional neural network has the three characteristics of local receptive fields, weight sharing and down-sampling, which reduce the parameters and complexity of the model; however, as the number of network layers increases the model becomes difficult to train, and a deep network may fail to learn deeper content.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an audio data processing method and a prediction method, which solve the problem of deep learning over multiple audio features.
To solve the above technical problems, the invention adopts the following technical scheme:
an audio data processing method, comprising the steps of:
acquiring an audio data set, and preprocessing the audio data set to obtain a preprocessed audio set;
extracting each frame of audio signal of each audio segment in the preprocessed audio set, and extracting a feature spectrogram set of each frame of the audio signal, wherein the feature spectrogram set comprises two or more feature spectrograms;
normalizing the feature spectrogram set and generating multi-channel features;
and generating a neural network model, and performing neural network training by taking the multichannel characteristics as input.
Optionally, the audio data set is preprocessed, including the following steps:
filtering useless audio in the audio data set, and unifying the audio duration of each audio segment in the audio data set;
and performing framing and windowing processing on each section of audio data of the filtered audio data set to obtain a preprocessed audio set.
Optionally, filtering the useless audio in the audio data set comprises the following steps:
deleting the audio data which cannot be judged in the audio data set;
setting a first audio length threshold value and a frequency threshold value, and deleting the audio data of which the audio length is shorter than the first audio length threshold value or the frequency is lower than the frequency threshold value in the audio data set.
Optionally, unifying the audio duration of each segment of audio in the audio data set, includes the following steps:
setting a second audio length threshold, and comparing the audio length of the audio data in the audio data set with the second audio length threshold;
if the audio length of the audio data is greater than or equal to a second audio length threshold value, continuously intercepting the audio data with standard duration;
and if the audio length of the audio data is smaller than a second audio length threshold, obtaining the audio data with standard duration by adopting an intercepting or filling method.
Optionally, extracting a feature spectrogram set of each frame of the audio signal comprises the following steps:
sequentially acquiring the power-normalized chromagram, the Mel cepstral coefficients, the Mel spectrum and the constant-Q chromagram of the audio signal.
Optionally, generating the multi-channel feature includes the following steps:
setting the audio frame length, frame shift and maximum audio duration of each section of audio data in the preprocessed audio set, and calculating the audio frame number;
and generating multi-channel features based on the audio frame length, the number of audio frames and the feature spectrogram set, wherein the channels of the multi-channel features correspond one-to-one to the feature spectrograms in the feature spectrogram set.
Optionally, the neural network training is performed by using the multichannel features as inputs, and includes the following steps:
inputting the multi-channel features into a multi-channel input layer;
and the deep residual convolution layers of the neural network model train the input multi-channel features according to a residual method to obtain neural network models for different audio classifications and generate a model library.
Optionally, the method further comprises the following steps:
and acquiring an audio verification set, and optimizing the learning rate of the neural network model based on the audio verification set.
A method for predicting audio data, comprising obtaining a trained neural network model using the audio data processing method as described in any one of the above, further comprising the steps of:
acquiring an audio test set, and preprocessing the audio test set to obtain a preprocessed audio test set;
extracting each frame of test audio signal of each audio segment in the preprocessed audio test set, and extracting a test feature spectrogram set of each frame of the test audio signal, wherein the test feature spectrogram set comprises two or more feature spectrograms;
normalizing the test feature spectrogram set and generating multi-channel test features;
calling the trained neural network model, and taking the multi-channel test characteristics as input to obtain an audio classification result;
and modifying the format of the audio classification result into a display format.
A computer-readable storage medium storing a computer program which, when executed by a processor, performs the audio data processing method of any one of the above.
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
the method has the advantages that the multi-channel characteristics are obtained by constructing the spectrogram of different audio characteristics of the audio and performing characteristic fusion, so that the network can learn more audio characteristics, and the accuracy of the model is improved; the problem that the convolution network is degraded and difficult to train is solved by modifying the input of the convolution layer by using a residual error method; the method has the advantages of flexible model application capability, model sharing and rapid deployment realized by a model library mode, and the interference among models is reduced by selecting corresponding models according to different actual scenes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or in the description of the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of an audio data processing method and a prediction method according to the first and second embodiments;
FIG. 2 is a diagram illustrating an example of framing an audio signal according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of overlapping between two consecutive frames after windowing an audio signal according to an embodiment;
fig. 4 is a multi-channel feature map after audio feature fusion according to the embodiment;
fig. 5 is a diagram of a convolution pooling layer structure according to the embodiment;
fig. 6 is a structural diagram of a multi-channel residual convolution network according to the present embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples, which are illustrative of the present invention and are not to be construed as being limited thereto.
Example one
As shown in fig. 1, an audio data processing method includes the following steps. First, an audio data set — i.e. a training set — is acquired by collecting audio data on site with a sound level meter, and the audio data set is preprocessed to obtain a preprocessed audio set. The preprocessing comprises the following steps: filtering useless audio in the audio data set and unifying the audio duration of each audio segment; and performing framing and windowing on each audio segment of the filtered audio data set to obtain the preprocessed audio set.
In particular, filtering the useless audio in the audio data set comprises the following steps. First, delete the audio data that cannot be judged, i.e. recordings that are too noisy or whose content cannot be identified. Then standardize: set a first audio length threshold and a frequency threshold, and delete audio data whose length is shorter than the first audio length threshold or whose sampling frequency is lower than the frequency threshold, so that every audio datum has the same feature dimension when generating spectrograms. By the sampling theorem, the sampling frequency should be 2.56 to 4 times the highest frequency of the signal; a sampling frequency that is too low distorts the audio, while a 48 kHz sampling rate gives DVD sound quality. The highest frequency that can be captured at 48 kHz is 48000/2.56 = 18750 Hz, which covers most of the audible range of 20 Hz to 20 kHz. Therefore, in this embodiment the first audio length threshold may be set to 3 seconds and the frequency threshold to 48 kHz; audio data whose duration is less than 3 seconds or whose sampling frequency is below 48 kHz is then deleted.
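The filtering rule above can be sketched as follows (a minimal sketch: the 3-second and 48 kHz thresholds and the 2.56 factor are the embodiment's example values; the function names are hypothetical):

```python
def passes_filter(duration_s, sample_rate_hz,
                  min_duration_s=3.0, min_rate_hz=48_000):
    """Keep only audio at least as long as the first length threshold
    and sampled at least at the frequency threshold."""
    return duration_s >= min_duration_s and sample_rate_hz >= min_rate_hz

def max_analyzable_freq(sample_rate_hz, factor=2.56):
    """Highest signal frequency capturable under the 2.56x rule."""
    return sample_rate_hz / factor

print(max_analyzable_freq(48_000))  # 18750.0
```

This reproduces the 48000/2.56 = 18750 Hz figure used to justify the 48 kHz threshold.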
Further, unifying the audio duration of each audio segment in the audio data set comprises the following steps: setting a second audio length threshold, and comparing the audio length of the audio data in the audio data set with the second audio length threshold; if the audio length is greater than or equal to the second audio length threshold, continuously intercepting audio data of the standard duration; and if it is smaller than the second audio length threshold, obtaining audio data of the standard duration by interception or filling.
Specifically, for audio data of different durations, the signal length is unified by intercepting or filling at the beginning and end of the audio. In this embodiment the second audio length threshold may be set to 15 seconds: when the audio length is greater than or equal to 15 seconds, audio of 10 seconds is continuously intercepted; otherwise, audio data of the standard duration is obtained by interception or filling. Specifically, the maximum audio duration T may be set to 10 seconds and the sampling frequency RF to 48 kHz; the difference padding between the maximum audio length and the actual audio length is then calculated as: padding = RF × T − RF × T_wav.
Here T_wav is the actual audio duration. If padding > 0, a boundary signal of length padding/2 is filled in at the head and at the tail of the signal respectively; boundary filling is used because plain truncation could lose effective audio at the edges, and the influence of the added boundary on the audio result is low. If padding < 0, signals of length |padding|/2 are intercepted at the head and tail of the audio respectively. If padding is odd, the head portion is rounded up after padding is divided by 2.
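The pad-or-truncate rule can be sketched with NumPy (a sketch assuming edge-mode filling for the boundary signal and the round-up-at-the-head convention; the function name is hypothetical):

```python
import numpy as np

def to_standard_length(y, rf=48_000, t=10):
    """Pad or truncate a 1-D signal to the standard duration T.
    padding = RF*T - RF*T_wav; if positive, fill both ends with the
    boundary (edge) value; if negative, cut |padding|/2 from each end.
    Odd amounts are rounded up at the head."""
    target = rf * t
    padding = target - len(y)
    if padding > 0:
        head = (padding + 1) // 2          # round up at the head if odd
        tail = padding - head
        return np.pad(y, (head, tail), mode="edge")
    if padding < 0:
        cut = -padding
        head = (cut + 1) // 2
        tail = cut - head
        return y[head:len(y) - tail]
    return y
```

With `rf=1, t=10` for readability, a 5-sample signal is edge-padded to 10 samples and a 15-sample signal is trimmed to 10.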
Since the audio signal is only short-time stationary, it must be framed and windowed. Specifically, in this embodiment the frame length L_frame may be set to 25 ms and the frame shift S_frame to 10 ms, and the number of frames N_frame of an audio segment of duration T is calculated by the following formula:
N_frame = [(T − (L_frame − S_frame)) / S_frame] + 1
where L_frame − S_frame is the overlap duration T_overlap of two adjacent frames and [·] denotes rounding down, for example [4.1] = 4, [4.8] = 4. As shown in fig. 2, taking an audio duration T = 10 seconds with L_frame = 3 seconds and S_frame = 2 seconds as an example, the overlap of two frames lasts 1 second and the number of frames is N_frame = [(10 − (3 − 2)) / 2] + 1 = 5, i.e. 10 seconds of audio is divided into 5 frames. As shown in fig. 2, if the last frame is incomplete it can be ignored or zero-padded; this can be adjusted according to the actual situation.
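The frame-count formula can be checked directly (function name hypothetical):

```python
import math

def frame_count(t, l_frame, s_frame):
    """N_frame = floor((T - (L_frame - S_frame)) / S_frame) + 1,
    where L_frame - S_frame is the overlap of two adjacent frames."""
    return math.floor((t - (l_frame - s_frame)) / s_frame) + 1

print(frame_count(10, 3, 2))  # 5, as in the worked example
```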
As shown in fig. 3, to prevent spectrum leakage at the boundary of two consecutive frames, a Hamming window is applied to each frame. After windowing, the signal at both ends of a frame is weakened, which is why an overlapping portion is placed between two consecutive frames; the overlap can be adjusted according to the actual situation and is not specifically limited here.
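A minimal framing-plus-windowing sketch, assuming a Hamming window and ignoring an incomplete last frame as the embodiment allows:

```python
import numpy as np

def frame_and_window(y, frame_len, hop):
    """Slice y into overlapping frames and apply a Hamming window to
    each, weakening the signal at both ends of every frame."""
    frames = []
    start = 0
    while start + frame_len <= len(y):   # incomplete last frame ignored
        frames.append(y[start:start + frame_len] * np.hamming(frame_len))
        start += hop
    return np.stack(frames)
```

For a 10-sample signal with frame length 3 and hop 2 this yields 4 full frames; the window leaves the frame center untouched and attenuates the edges.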
Further, each frame of audio signal of each audio segment in the preprocessed audio set is extracted, and a feature spectrogram set of each frame of audio signal is extracted to realize audio feature fusion, wherein the feature spectrogram set comprises two or more feature spectrograms. Extracting the feature spectrogram set of each frame of audio signal comprises the following steps: sequentially acquiring the power-normalized chromagram, the Mel cepstral coefficients, the Mel spectrum and the constant-Q chromagram of the audio signal.
The power-normalized chromagram is typically used to identify similarities between different interpretations of a given piece of music, for audio matching and similarity tasks. The constant-Q chromagram transforms the time series to the frequency domain and is related to the Fourier transform, but its output amplitude is plotted against logarithmic frequency; the entire spectrum is projected onto 12 bins representing the 12 semitones (chroma) of the musical octave. The Mel spectrum uses a Mel-scale filter bank whose resolution is high in the low-frequency part, consistent with the auditory characteristics of the human ear, and thus simulates human hearing. Since human perception of sound is not linear and is better described by a nonlinear log relation, the Mel cepstral coefficients analyze the audio features on a logarithmic Mel scale. The different spectrograms therefore capture the diversity of the audio features.
After the multiple spectrograms are generated, all of them — i.e. the feature spectrogram set — must be normalized in order to generate the multi-channel features. Generating the multi-channel features comprises the following steps: setting the audio frame length, frame shift and maximum audio duration of each audio segment in the preprocessed audio set, and calculating the number of audio frames; and generating the multi-channel features based on the audio frame length, the number of audio frames and the feature spectrogram set, wherein the channels of the multi-channel features correspond one-to-one to the feature spectrograms in the feature spectrogram set.
Specifically, the normalization is computed as y = 1 / (1 + e^(−x)), where x is the value at each point of each spectrogram (the input value) and y is the normalized output value, which lies in (0, 1). For example, X = [0, 4, 7], where 0, 4 and 7 are three values of x, becomes [1/2, 1/(1 + e^(−4)), 1/(1 + e^(−7))] after normalization. All spectrograms are therefore measured on a uniform scale; the sigmoid function is an S-shaped squashing function whose shape is preserved after compression.
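The sigmoid normalization can be sketched as:

```python
import numpy as np

def normalize(x):
    """Pointwise sigmoid squashing y = 1 / (1 + e^(-x)), so all
    spectrograms share one scale in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

print(normalize([0, 4, 7]))  # [0.5, 1/(1+e^-4), 1/(1+e^-7)]
```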
Furthermore, in order to mine deep signal features, this embodiment does not simply splice all the features transversely; instead, borrowing the idea of channels in images, it combines the above features longitudinally into N_channel channels. In this embodiment the frame length L_frame of each frame may be set to 30 milliseconds (the frame length is typically set in the range 20–40 ms and can be adjusted to the actual data). The two-dimensional matrix of each channel is expressed as (N_frame, L_frame) and the resulting multi-channel feature as (N_frame, L_frame, N_channel). In this embodiment, as shown in fig. 4, the power-normalized chromagram obtained above forms one channel, the Mel cepstral coefficients one channel, the Mel spectrum one channel and the constant-Q chromagram one channel, so a four-channel multi-channel feature is obtained.
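Stacking the four normalized spectrograms along a channel axis, as described above, can be sketched with NumPy (the sizes here are illustrative example values):

```python
import numpy as np

# Hypothetical sizes: N_frame frames of L_frame samples per channel.
N_frame, L_frame = 5, 30
# Four channels: pnc, mfcc, mel, cqt (placeholder data).
channels = [np.random.rand(N_frame, L_frame) for _ in range(4)]

multi = np.stack(channels, axis=-1)   # longitudinal channel combination
print(multi.shape)                    # (N_frame, L_frame, N_channel)
```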
Furthermore, a neural network model is generated and neural network training is performed with the multi-channel features as input, specifically comprising the following steps: inputting the multi-channel features into a multi-channel input layer; and training the input multi-channel features in the deep residual convolution layers of the neural network model according to a residual method.
As shown in fig. 5, in this experiment each convolution layer uses edge padding and multiple sets of 3 × 3 convolution kernels, with ReLU as the activation function and 2 × 2 max pooling layers; the input layer and each convolution kernel have the same number of channels, and the multiple convolution kernels output multiple channels.
Compared with a traditional network, a convolutional network learns locally optimal features and greatly reduces the parameter count through local connection and weight sharing; however, as the network depth increases training becomes harder and the problems of gradient vanishing and gradient explosion grow more obvious. Therefore, as shown in fig. 6, this embodiment introduces a residual method to correct and reduce the strong correlation between adjacent layers: the output of layer (n−2) and the output of layer (n) are spliced and then used as the input of the next layer (n+1).
Specifically, in this embodiment, taking a network structure of 12 layers as an example — 8 convolutional layers, 2 max pooling layers and 2 fully connected layers — the network structure formed by combining the multi-channel features with residual convolution is shown in fig. 6.
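A toy illustration of the residual idea in plain NumPy. Note the embodiment splices (concatenates) layer outputs, whereas this sketch shows the classic additive skip connection; kernels and sizes are hypothetical:

```python
import numpy as np

def conv3x3_same(x, k):
    """Naive 3x3 'same' convolution (zero edge padding), one channel."""
    p = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * k)
    return out

def residual_block(x, k1, k2):
    """Two convolutions with ReLU, plus the identity shortcut: the
    earlier output is added back so gradients can bypass the block."""
    h = np.maximum(conv3x3_same(x, k1), 0.0)
    h = conv3x3_same(h, k2)
    return np.maximum(h + x, 0.0)     # skip connection
```

With zero kernels the block reduces to the identity mapping, which is exactly the degenerate case that lets a deep residual network avoid degradation.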
Further, to optimize the learning rate of the neural network model, an audio verification set is obtained and the learning rate is optimized based on it. Specifically, during training the learning rate is optimized, a dropout function is set, and an early-stop mechanism is set to improve the model and prevent overfitting. The loss value on the verification set decreases continuously during training; when the value increases for the first time, the learning rate is reduced to 0.1 times its original value, and when 5 consecutive loss values are larger than the value before the increase, training stops. The model stored at that moment is the optimal model, and the loss function used for computing the loss value is categorical cross-entropy.
For example, with an initial learning rate of 0.001, when the verification-set loss increases for the first time the learning rate is adjusted to 0.0001, i.e. multiplied by 0.1. If the learning rate is too small, the model learns and converges slowly; if it is too large, the loss value oscillates or even increases. The learning rate can be compared to a step size: too small and progress is slow, too large and progress is fast but unstable.
Specifically, the neural network model training may be set as follows: train in a loop of 500 epochs (epochs = 500), with one batch per 200 pieces of data (batch_size = 200). Assuming a total of 1000 samples, 1000/200 = 5, i.e. 5 batches of 200 make exactly one epoch over the 1000 data. After each epoch the verification set participates in prediction, and the loss value and accuracy are calculated; during training the loss value keeps decreasing and the accuracy keeps improving. When the loss after the i-th epoch is larger than that of the (i−1)-th, the learning rate is adjusted and training continues; when 5 consecutive loss values fail to improve, training stops and the early-stop mechanism is triggered. In that case epochs may, for example, equal 34, and the stored optimal model is the one from the 29th epoch.
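The learning-rate decay and early-stop rule can be replayed as a simplified controller (a sketch under assumptions: it compares each validation loss to the best seen so far, decays the rate by 0.1 on the first rise of a streak, and stops after 5 consecutive non-improving epochs; the function name is hypothetical):

```python
def train_controller(val_losses, lr0=0.001, factor=0.1, patience=5):
    """Replay the embodiment's rule over a list of validation losses.
    Returns the final learning rate and the epoch at which training
    stopped (or the last epoch if early stopping never triggered)."""
    lr, best, bad = lr0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss >= best:
            bad += 1
            if bad == 1:
                lr *= factor          # first rise: decay the rate
            if bad >= patience:
                return lr, epoch      # early stop triggered
        else:
            best, bad = loss, 0
    return lr, len(val_losses)
```

For the loss trace `[1.0, 0.9, 0.8, 0.85, 0.86, 0.87, 0.88, 0.9]` the best value 0.8 is set at epoch 3, the rate decays at epoch 4, and the fifth consecutive non-improving epoch stops training at epoch 8.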
During the training of each batch (batch_size) of the training data set, the optimal solution is sought from the loss value by back propagation, i.e. the gradient descent method, and the network parameters are updated in sequence from output to input.
The verification set does not participate in back propagation, i.e. it does not train the model parameters; it only verifies the accuracy and loss of the current model and informs manual parameter tuning, such as changing the number of network layers or setting batch_size and the initial learning rate.
The neural network models trained by this method differ in their audio classifications, so multiple classification neural network models can be obtained; all these different models can then be collected into a model library for use in subsequent audio prediction.
Example two
As shown in fig. 1, an audio data prediction method includes obtaining a trained neural network model using the audio data processing method of the first embodiment, and further includes the following steps: acquiring an audio test set and preprocessing it to obtain a preprocessed audio test set; extracting each frame of test audio signal of each audio segment in the preprocessed audio test set, and extracting a test feature spectrogram set of each frame of test audio signal, wherein the test feature spectrogram set comprises two or more feature spectrograms; and normalizing the test feature spectrogram set and generating multi-channel test features. When predicting audio data, the preprocessing, framing, frame feature extraction and fusion into the test feature spectrogram set are the same as in the first embodiment; the difference is that the trained neural network model can be called directly, with the multi-channel test features as input, to obtain the audio classification result.
Specifically, different neural network classification models are selected from a model library, corresponding API interfaces are respectively developed for each neural network model to call the designated models to provide real-time online service, then audio data in a test set are input into the neural network models to obtain audio classification results, and then the formats of the audio classification results are modified into formats for display.
A computer-readable storage medium storing a computer program which, when executed by a processor, performs the audio data processing method of any one of the embodiments.
More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present application, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a propagated data signal with computer-readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules, modules or units is only one type of logical function division, and other division manners may be available in actual implementation, for example, multiple units, modules or components may be combined or integrated into another device, or some features may be omitted, or not executed.
The units may or may not be physically separate, and components displayed as units may be one physical unit or a plurality of physical units; that is, they may be located in one place or distributed across a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions within the technical scope of the present invention are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (10)
1. A method of audio data processing, comprising the steps of:
acquiring an audio data set, and preprocessing the audio data set to obtain a preprocessed audio set;
extracting each frame of the audio signal of each audio segment in the preprocessed audio set, and extracting a feature spectrogram set of each frame of the audio signal, wherein the feature spectrogram set comprises two or more feature spectrograms;
normalizing the feature spectrogram set and generating multi-channel features;
and generating a neural network model, and performing neural network training by taking the multi-channel features as input.
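The normalization and multi-channel assembly steps of claim 1 can be sketched as follows. This is a minimal illustration, assuming min-max normalization per spectrogram and equally shaped 2-D feature arrays; the function and variable names are illustrative, not from the patent.

```python
import numpy as np

def build_multichannel_features(spectrograms):
    """Min-max normalize each feature spectrogram and stack them as channels.

    `spectrograms` is a list of 2-D arrays (features x frames) of equal shape.
    """
    channels = []
    for spec in spectrograms:
        lo, hi = spec.min(), spec.max()
        # Scale each spectrogram to [0, 1]; a flat spectrogram maps to zeros.
        norm = (spec - lo) / (hi - lo) if hi > lo else np.zeros_like(spec)
        channels.append(norm)
    # Shape: (n_channels, n_features, n_frames) -- one channel per spectrogram.
    return np.stack(channels, axis=0)
```

The resulting array can then be fed to a multi-channel input layer, with each feature spectrogram occupying its own channel.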
2. The audio data processing method of claim 1, wherein the audio data set is preprocessed, comprising the steps of:
filtering out invalid audio in the audio data set, and unifying the audio duration of each audio segment in the audio data set;
and performing framing and windowing on each segment of audio data of the filtered audio data set to obtain a preprocessed audio set.
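The framing and windowing step of claim 2 can be sketched as below. This assumes a Hamming window and frame parameters given in samples; the patent does not fix the window type or sizes, so these are illustrative choices.

```python
import numpy as np

def frame_and_window(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    `frame_len` and `hop` (frame shift) are in samples.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)
```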
4. The audio data processing method of claim 2, wherein filtering out the invalid audio in the audio data set comprises:
deleting audio data in the audio data set whose content cannot be determined;
setting a first audio length threshold and a frequency threshold, and deleting audio data in the audio data set whose audio length is shorter than the first audio length threshold or whose frequency is lower than the frequency threshold.
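The threshold-based filtering of claim 3 amounts to a simple predicate over the data set. A minimal sketch, assuming each record carries its duration and a frequency measure; the record layout and field meanings here are illustrative, not specified by the patent.

```python
def filter_audio_set(records, min_len_s, min_freq_hz):
    """Keep only records meeting both thresholds.

    `records` is a list of (duration_s, freq_hz, data) tuples; a record is
    dropped when its duration is below the first length threshold or its
    frequency is below the frequency threshold.
    """
    return [r for r in records
            if r[0] >= min_len_s and r[1] >= min_freq_hz]
```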
4. The audio data processing method of claim 2, wherein unifying the audio duration of each piece of audio in the audio data set comprises:
setting a second audio length threshold, and comparing the audio length of the audio data in the audio data set with the second audio length threshold;
if the audio length of the audio data is greater than or equal to the second audio length threshold, successively clipping segments of the standard duration from the audio data;
and if the audio length of the audio data is smaller than the second audio length threshold, obtaining audio data of the standard duration by clipping or padding.
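The duration-unification branch of claim 4 can be sketched for a single clip as below. This assumes zero-padding for short audio; the patent does not fix the padding value, and the function name is illustrative.

```python
import numpy as np

def to_standard_duration(samples, target_len):
    """Clip or zero-pad a 1-D sample array to the standard length."""
    if len(samples) >= target_len:
        return samples[:target_len]       # clip long audio
    pad = target_len - len(samples)
    return np.pad(samples, (0, pad))      # zero-pad short audio
```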
5. The audio data processing method of claim 1, wherein extracting the feature spectrogram set of the audio signal of each frame comprises:
and sequentially acquiring the power-normalized chromagram, the Mel-frequency cepstral coefficients, the Mel spectrum and the constant-Q chromagram of the audio signal.
6. The audio data processing method of claim 1, wherein generating the multi-channel feature comprises:
setting the audio frame length, frame shift and maximum audio duration of each segment of audio data in the preprocessed audio set, and calculating the number of audio frames;
and generating the multi-channel features based on the audio frame length, the number of audio frames and the feature spectrogram set, wherein the channels of the multi-channel features correspond one-to-one to the feature spectrograms in the feature spectrogram set.
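The frame-count calculation implied by claim 6 follows directly from the frame length and frame shift. A sketch with illustrative parameter values (the sample rate, frame length and hop below are common speech-processing defaults, not values from the patent):

```python
def frame_count(duration_s, sr, frame_len, hop):
    """Number of full frames for a clip of `duration_s` seconds at sample
    rate `sr`, given frame length and frame shift (hop) in samples."""
    n_samples = int(duration_s * sr)
    return 1 + (n_samples - frame_len) // hop
```

For example, a 1-second clip at 16 kHz with a 400-sample frame and 160-sample shift yields 98 frames.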
7. The audio data processing method of claim 1, wherein performing neural network training by taking the multi-channel features as input comprises:
inputting the multi-channel features into a multi-channel input layer;
and training the input multi-channel features in the deep residual convolutional layers of the neural network model by a residual method, so as to obtain neural network models for different audio classifications and generate a model library.
8. The audio data processing method according to claim 1, further comprising the steps of:
and acquiring an audio verification set, and optimizing the learning rate of the neural network model based on the audio verification set.
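One common way to optimize the learning rate against a verification set, as claim 8 describes, is a reduce-on-plateau heuristic. The patent does not specify which schedule is used, so the following is only an assumed example; the patience and decay factor are arbitrary illustrative defaults.

```python
def adjust_learning_rate(lr, val_losses, patience=3, factor=0.5):
    """Halve the learning rate when the validation loss has not improved
    for `patience` consecutive epochs."""
    if len(val_losses) > patience:
        recent = val_losses[-patience:]
        best_before = min(val_losses[:-patience])
        if min(recent) >= best_before:   # no improvement: plateau reached
            return lr * factor
    return lr
```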
9. A method for predicting audio data, comprising obtaining a trained neural network model using the audio data processing method according to any one of claims 1 to 8, and further comprising the steps of:
acquiring an audio test set, and preprocessing the audio test set to obtain a preprocessed audio test set;
extracting each frame of the test audio signal of each audio segment in the preprocessed audio test set, and extracting a test feature spectrogram set of each frame of the test audio signal, wherein the test feature spectrogram set comprises two or more feature spectrograms;
normalizing the test feature spectrogram set and generating multi-channel test features;
calling the trained neural network model, and taking the multi-channel test characteristics as input to obtain an audio classification result;
and converting the audio classification result into a display format.
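The final prediction and display-formatting steps of claim 9 can be sketched as a softmax over the model's raw outputs followed by a readable label string. The label names below are hypothetical examples, not from the patent.

```python
import numpy as np

def format_prediction(logits, labels):
    """Turn raw model outputs into a display string: softmax over logits,
    then report the top label and its probability."""
    e = np.exp(logits - np.max(logits))   # numerically stable softmax
    probs = e / e.sum()
    i = int(np.argmax(probs))
    return f"{labels[i]} ({probs[i]:.1%})"
```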
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the audio data processing method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211406445.0A CN115713945A (en) | 2022-11-10 | 2022-11-10 | Audio data processing method and prediction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211406445.0A CN115713945A (en) | 2022-11-10 | 2022-11-10 | Audio data processing method and prediction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115713945A true CN115713945A (en) | 2023-02-24 |
Family
ID=85232744
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211406445.0A Pending CN115713945A (en) | 2022-11-10 | 2022-11-10 | Audio data processing method and prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115713945A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109166593A (en) * | 2018-08-17 | 2019-01-08 | 腾讯音乐娱乐科技(深圳)有限公司 | audio data processing method, device and storage medium |
CN110111773A (en) * | 2019-04-01 | 2019-08-09 | 华南理工大学 | The more New Method for Instrument Recognition of music signal based on convolutional neural networks |
CN110600054A (en) * | 2019-09-06 | 2019-12-20 | 南京工程学院 | Sound scene classification method based on network model fusion |
CN112802484A (en) * | 2021-04-12 | 2021-05-14 | 四川大学 | Panda sound event detection method and system under mixed audio frequency |
US20210319321A1 (en) * | 2020-04-14 | 2021-10-14 | Sony Interactive Entertainment Inc. | Self-supervised ai-assisted sound effect recommendation for silent video |
KR20210125366A (en) * | 2020-04-08 | 2021-10-18 | 주식회사 케이티 | Method for detecting recording device failure using neural network classifier, server and smart device implementing the same |
CN113539283A (en) * | 2020-12-03 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Audio processing method and device based on artificial intelligence, electronic equipment and storage medium |
CN114352486A (en) * | 2021-12-31 | 2022-04-15 | 西安翔迅科技有限责任公司 | Wind turbine generator blade audio fault detection method based on classification |
CN114627895A (en) * | 2022-03-29 | 2022-06-14 | 大象声科(深圳)科技有限公司 | Acoustic scene classification model training method and device, intelligent terminal and storage medium |
CN114974302A (en) * | 2022-05-06 | 2022-08-30 | 珠海高凌信息科技股份有限公司 | Ambient sound event detection method, apparatus and medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116863957A (en) * | 2023-09-05 | 2023-10-10 | 硕橙(厦门)科技有限公司 | Method, device, equipment and storage medium for identifying operation state of industrial equipment |
CN116863957B (en) * | 2023-09-05 | 2023-12-12 | 硕橙(厦门)科技有限公司 | Method, device, equipment and storage medium for identifying operation state of industrial equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11004461B2 (en) | Real-time vocal features extraction for automated emotional or mental state assessment | |
US11062725B2 (en) | Multichannel speech recognition using neural networks | |
US20220093111A1 (en) | Analysing speech signals | |
US10878823B2 (en) | Voiceprint recognition method, device, terminal apparatus and storage medium | |
CN107680597B (en) | Audio recognition method, device, equipment and computer readable storage medium | |
US20220230651A1 (en) | Voice signal dereverberation processing method and apparatus, computer device and storage medium | |
CN108564963B (en) | Method and apparatus for enhancing voice | |
CN110459241B (en) | Method and system for extracting voice features | |
CN108922513B (en) | Voice distinguishing method and device, computer equipment and storage medium | |
US20130024191A1 (en) | Audio communication device, method for outputting an audio signal, and communication system | |
CN110379412A (en) | Method, apparatus, electronic equipment and the computer readable storage medium of speech processes | |
US7809560B2 (en) | Method and system for identifying speech sound and non-speech sound in an environment | |
EP1995723A1 (en) | Neuroevolution training system | |
CN108962231B (en) | Voice classification method, device, server and storage medium | |
US20210142815A1 (en) | Generating synthetic acoustic impulse responses from an acoustic impulse response | |
Dubey et al. | Non-intrusive speech quality assessment using several combinations of auditory features | |
US11961504B2 (en) | System and method for data augmentation of feature-based voice data | |
CN102214464A (en) | Transient state detecting method of audio signals and duration adjusting method based on same | |
Rammo et al. | Detecting the speaker language using CNN deep learning algorithm | |
CN114203163A (en) | Audio signal processing method and device | |
CN110047478A (en) | Multicenter voice based on space characteristics compensation identifies Acoustic Modeling method and device | |
US20170365256A1 (en) | Speech processing system and speech processing method | |
CN115713945A (en) | Audio data processing method and prediction method | |
WO2019232833A1 (en) | Speech differentiating method and device, computer device and storage medium | |
CN108806725A (en) | Speech differentiation method, apparatus, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||