CN115713945A - Audio data processing method and prediction method - Google Patents

Audio data processing method and prediction method

Info

Publication number
CN115713945A
CN115713945A
Authority
CN
China
Prior art keywords
audio
audio data
frame
neural network
test
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211406445.0A
Other languages
Chinese (zh)
Inventor
张凯帆
张静
毛志德
郑红
王双杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Aihua Instruments Co ltd
Original Assignee
Hangzhou Aihua Instruments Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Aihua Instruments Co ltd filed Critical Hangzhou Aihua Instruments Co ltd
Priority to CN202211406445.0A
Publication of CN115713945A
Legal status: Pending


Abstract

The invention relates to an audio data processing method and a prediction method in the technical field of audio processing. The method comprises the following steps: acquiring an audio data set, and preprocessing the audio data set to obtain a preprocessed audio set; extracting each frame of audio signal of each segment of audio in the preprocessed audio set, and extracting a feature spectrogram set of each frame of audio signal, wherein the feature spectrogram set comprises more than two feature spectrograms; normalizing the feature spectrogram set and generating multi-channel features; and generating a neural network model and training it with the multi-channel features as input, thereby solving the problem of deep learning over multiple audio features.

Description

Audio data processing method and prediction method
Technical Field
The invention relates to the technical field of audio processing, in particular to an audio data processing method and a prediction method.
Background
Currently, technicians working on audio classification algorithms train audio classification models by extracting the Mel-Frequency Cepstral Coefficient (MFCC) spectrogram of the audio and then using a recurrent neural network or a convolutional neural network.
The single MFCC spectrogram extracted in this way is too simple: it contains little audio information, and deep features of the audio are difficult to learn during neural network training.
A recurrent neural network has a memory function and is suited to tasks related to time sequences, such as predicting text context, but in a noise classification model it tends to consider the target as a whole. A convolutional neural network has the three characteristics of local receptive fields, weight sharing and down-sampling, which reduce the parameters and complexity of the model; however, as the number of network layers increases, the model becomes difficult to train, and a deep network may fail to learn deeper content.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an audio data processing method and a prediction method, solving the problem of deep learning over multiple audio features.
To solve the above technical problems, the invention adopts the following technical scheme:
an audio data processing method, comprising the steps of:
acquiring an audio data set, and preprocessing the audio data set to obtain a preprocessed audio set;
extracting each frame of audio signal of each segment of audio in the preprocessed audio set, and extracting a feature spectrogram set of each frame of audio signal, wherein the feature spectrogram set comprises more than two feature spectrograms;
normalizing the feature spectrogram set and generating multi-channel features;
and generating a neural network model, and performing neural network training with the multi-channel features as input.
Optionally, the audio data set is preprocessed, including the following steps:
filtering useless audio in the audio data set, and unifying the audio duration of each segment of audio in the audio data set;
and performing framing and windowing on each segment of audio data of the filtered audio data set to obtain the preprocessed audio set.
Optionally, filtering the useless audio in the audio data set comprises the following steps:
deleting the audio data in the audio data set that cannot be identified;
setting a first audio length threshold value and a frequency threshold value, and deleting the audio data of which the audio length is shorter than the first audio length threshold value or the frequency is lower than the frequency threshold value in the audio data set.
Optionally, unifying the audio duration of each segment of audio in the audio data set, includes the following steps:
setting a second audio length threshold value, and judging the audio length of the audio data in the audio data set and the second audio length threshold value;
if the audio length of the audio data is greater than or equal to the second audio length threshold, continuously truncating audio data of the standard duration from it;
and if the audio length of the audio data is smaller than the second audio length threshold, obtaining audio data of the standard duration by truncating or padding.
Optionally, extracting the feature spectrogram set of each frame of the audio signal comprises the following steps:
sequentially acquiring the power-normalized chromagram, Mel cepstral coefficients, Mel spectrum and constant-Q chromagram of the audio signal.
Optionally, generating the multi-channel feature includes the following steps:
setting the audio frame length, frame shift and maximum audio duration of each segment of audio data in the preprocessed audio set, and calculating the number of audio frames;
and generating the multi-channel features based on the audio frame length, the number of audio frames and the feature spectrogram set, wherein the channels of the multi-channel features correspond one-to-one to the feature spectrograms in the feature spectrogram set.
Optionally, performing neural network training with the multi-channel features as input comprises the following steps:
inputting the multi-channel features into a multi-channel input layer;
and training the input multi-channel features in the deep residual convolutional layers of the neural network model according to a residual method, obtaining neural network models for different audio classifications and generating a model library.
Optionally, the method further comprises the following steps:
and acquiring an audio verification set, and optimizing the learning rate of the neural network model based on the audio verification set.
A method for predicting audio data, comprising obtaining a trained neural network model using the audio data processing method as described in any one of the above, further comprising the steps of:
acquiring an audio test set, and preprocessing the audio test set to obtain a preprocessed audio test set;
extracting each frame of test audio signal of each segment of audio in the preprocessed audio test set, and extracting a test feature spectrogram set of each frame of the test audio signal, wherein the test feature spectrogram set comprises more than two feature spectrograms;
normalizing the test feature spectrogram set and generating multi-channel test features;
calling the trained neural network model, with the multi-channel test features as input, to obtain an audio classification result;
and modifying the format of the audio classification result into a format for display.
A computer-readable storage medium storing a computer program which, when executed by a processor, performs the audio data processing method of any one of the above.
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
the method has the advantages that the multi-channel characteristics are obtained by constructing the spectrogram of different audio characteristics of the audio and performing characteristic fusion, so that the network can learn more audio characteristics, and the accuracy of the model is improved; the problem that the convolution network is degraded and difficult to train is solved by modifying the input of the convolution layer by using a residual error method; the method has the advantages of flexible model application capability, model sharing and rapid deployment realized by a model library mode, and the interference among models is reduced by selecting corresponding models according to different actual scenes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of an audio data processing method and a prediction method according to the first and second embodiments;
FIG. 2 is a diagram illustrating an example of framing an audio signal according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of overlapping between two consecutive frames after windowing an audio signal according to an embodiment;
fig. 4 is a multi-channel feature map after audio feature fusion according to the embodiment;
fig. 5 is a diagram of a convolution pooling layer structure according to the embodiment;
fig. 6 is a structural diagram of a multi-channel residual convolution network according to the present embodiment.
Detailed Description
The present invention will be described in further detail below with reference to examples, which are illustrative of the present invention and are not to be construed as limiting it.
Example one
As shown in fig. 1, an audio data processing method includes the following steps. First, an audio data set, i.e. a training set, is acquired and preprocessed to obtain a preprocessed audio set; in practice, the audio data may be collected on site with a sound level meter. Preprocessing the audio data set includes the following steps: filtering useless audio in the audio data set, and unifying the audio duration of each segment of audio in the audio data set; and performing framing and windowing on each segment of audio data of the filtered audio data set to obtain the preprocessed audio set.
Specifically, filtering the useless audio in the audio data set comprises the following steps. First, audio data that cannot be identified, i.e. recordings that are too noisy or whose content cannot be determined, is deleted from the audio data set. The remaining data is then standardized: a first audio length threshold and a frequency threshold are set, and audio data whose length is shorter than the first audio length threshold or whose sampling frequency is lower than the frequency threshold is deleted, so that every piece of audio data has the same feature dimensions when the spectrograms are generated. According to the sampling theorem, the sampling frequency should be 2.56 to 4 times the highest frequency of the signal. The sampling frequency affects the quality of the audio: too low a sampling frequency distorts the audio, while a 48 kHz sampling rate gives DVD sound quality, and the highest frequency that can be captured at 48 kHz is 48000/2.56 = 18750 Hz, covering most of the audible range of 20 Hz to 20 kHz. Therefore, in this embodiment the first audio length threshold may be set to 3 seconds and the frequency threshold to 48 kHz, and audio data shorter than 3 seconds or sampled below 48 kHz is deleted.
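As an illustrative sketch of this filtering step (not code from the patent; the soundfile library and the file names are assumptions), the two thresholds might be applied as follows:

```python
import soundfile as sf  # assumed third-party library for reading audio metadata

MIN_DURATION_S = 3.0      # first audio length threshold (seconds)
MIN_SAMPLE_RATE = 48000   # frequency threshold (Hz)

def keep_audio(path: str) -> bool:
    """Return True if the recording passes the duration and sampling-rate filters."""
    info = sf.info(path)
    return info.duration >= MIN_DURATION_S and info.samplerate >= MIN_SAMPLE_RATE

candidate_paths = ["rec001.wav", "rec002.wav"]  # hypothetical file list
dataset = [p for p in candidate_paths if keep_audio(p)]
```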
Further, unifying the audio duration of each segment of audio in the audio data set comprises the following steps: setting a second audio length threshold, and comparing the audio length of the audio data in the audio data set with the second audio length threshold; if the audio length of the audio data is greater than or equal to the second audio length threshold, continuously truncating audio data of the standard duration from it; and if the audio length of the audio data is smaller than the second audio length threshold, obtaining audio data of the standard duration by truncating or padding.
Specifically, for audio data of different durations, the signal length is unified by truncating or padding uniformly at the beginning and end of the audio. In this embodiment the second audio length threshold may be set to 15 seconds: when the audio length of the audio data is greater than or equal to 15 seconds, a 10-second stretch of audio is continuously truncated from it; otherwise, audio data of the standard duration is obtained by truncating or padding. Specifically, the maximum audio duration T may be set to 10 seconds and the sampling frequency RF to 48 kHz; the difference padding between the maximum audio sample length and the actual audio sample length is then calculated as:

padding = RF × T - RF × T_wav

where T_wav is the actual audio duration. If padding > 0, boundary signals of length padding/2 are filled in at the head and tail of the signal respectively; boundary filling only adds a boundary delay, whose influence on the audio result is low. If padding < 0, signals of length |padding|/2 are truncated from the head and tail of the audio respectively, although truncation may lose some effective audio at the edges. If padding is odd, the result of dividing padding by 2 is rounded up at the head of the audio.
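A minimal numpy sketch of this length-unification step, under the reading above (padding > 0 means the audio is too short; the edge-repeat padding mode as the "boundary filling" is an assumption):

```python
import numpy as np

def unify_length(signal: np.ndarray, rf: int = 48000, t: float = 10.0) -> np.ndarray:
    """Pad or truncate `signal` to the standard duration T at sampling rate RF."""
    padding = int(rf * t) - len(signal)     # padding = RF * T - RF * T_wav
    if padding > 0:                         # too short: fill at head and tail
        head = (padding + 1) // 2           # round up at the head if padding is odd
        return np.pad(signal, (head, padding - head), mode="edge")
    if padding < 0:                         # too long: cut |padding|/2 from each end
        head = (-padding + 1) // 2
        return signal[head: head + int(rf * t)]
    return signal
```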
Since the audio signal is only short-time stationary, the audio must be framed and windowed. Specifically, in this embodiment the frame length L_frame may be set to 25 ms and the frame shift S_frame to 10 ms, and the number of audio frames N_frame can be calculated by the following formula:

N_frame = [(T - (L_frame - S_frame)) / S_frame] + 1

where L_frame - S_frame is the overlap duration T_overlap of two adjacent frames, and [·] denotes rounding down, for example [4.1] = 4 and [4.8] = 4. As shown in fig. 2, taking an audio duration T = 10 seconds, a frame length L_frame = 3 seconds and a frame shift S_frame = 2 seconds as an example, the overlapping part of two adjacent frames lasts 1 second and the number of frames is N_frame = [(10 - (3 - 2)) / 2] + 1 = 5, i.e. 10 seconds of audio is divided into 5 frames. As shown in fig. 2, if the last frame is incomplete it can be ignored or zero-padded, as adjusted to the actual situation.
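A sketch of this framing formula in numpy (an incomplete last frame is zero-padded here, one of the two options named above; the parameter names are illustrative):

```python
import numpy as np

def frame_signal(signal: np.ndarray, rf: int, l_frame: float, s_frame: float) -> np.ndarray:
    """Split `signal` into frames using N_frame = [(T - (L_frame - S_frame)) / S_frame] + 1."""
    frame_len, hop = int(l_frame * rf), int(s_frame * rf)
    n_frames = (len(signal) - (frame_len - hop)) // hop + 1
    frames = np.zeros((n_frames, frame_len))
    for i in range(n_frames):
        chunk = signal[i * hop: i * hop + frame_len]
        frames[i, :len(chunk)] = chunk   # zero-pad an incomplete last frame
    return frames
```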
As shown in fig. 3, in order to prevent spectrum leakage at the boundary between two consecutive frames, a Hamming window is applied to each frame. After windowing, the signal at the two ends of a frame is weakened, so an overlapping portion is kept between consecutive frames; the overlap can be adjusted according to the actual situation and is not specifically limited here.
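Continuing the framing sketch above, the windowing is one line with numpy's built-in Hamming window (assuming `frames` comes from frame_signal):

```python
windowed = frames * np.hamming(frames.shape[1])  # taper both ends of every frame
```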
Further, each frame of audio signal of each segment of audio in the preprocessed audio set is extracted, and a feature spectrogram set of each frame of audio signal is extracted to realize audio feature fusion, the feature spectrogram set comprising more than two feature spectrograms. Extracting the feature spectrogram set of each frame of audio signal comprises the following steps: sequentially acquiring the power-normalized chromagram, Mel cepstral coefficients, Mel spectrum and constant-Q chromagram of the audio signal.
The power-normalized chromagram is typically used to identify similarity between different interpretations of a given piece of music, for audio matching and similarity tasks. The constant-Q chromagram transforms the time series to the frequency domain and is related to the Fourier transform, its output amplitude being mapped against logarithmic frequency; the entire spectrum is projected into 12 bins representing the 12 distinct semitones, or chroma, of the musical octave. The Mel spectrum reflects the fact that a Mel-scaled filter bank has high resolution in the low-frequency part, consistent with the auditory characteristics of the human ear, and therefore simulates human hearing. Since the human ear's perception of sound is not linear and is better described by the nonlinear log relation, the Mel cepstral coefficients analyze audio features on the log-Mel scale. The different spectrograms thus capture the diversity of the audio features.
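A hedged sketch of extracting the four spectrograms per frame with the librosa library; the exact function choices are assumptions (chroma_stft stands in for the power-normalized chromagram, chroma_cqt for the constant-Q chromagram, and the frames are assumed long enough for the constant-Q transform):

```python
import librosa
import numpy as np

def frame_spectrograms(frame: np.ndarray, rf: int = 48000) -> dict:
    """Extract the four feature spectrograms described above for one frame."""
    return {
        "pn_chroma": librosa.feature.chroma_stft(y=frame, sr=rf),   # power-normalized chromagram (assumed)
        "mfcc": librosa.feature.mfcc(y=frame, sr=rf, n_mfcc=20),    # Mel cepstral coefficients
        "mel": librosa.feature.melspectrogram(y=frame, sr=rf),      # Mel spectrum
        "cq_chroma": librosa.feature.chroma_cqt(y=frame, sr=rf),    # constant-Q chromagram
    }
```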
After the spectrograms are generated, all of them, i.e. the feature spectrogram set, must be normalized so as to generate the multi-channel features. Generating the multi-channel features includes the following steps: setting the audio frame length, frame shift and maximum audio duration of each segment of audio data in the preprocessed audio set, and calculating the number of audio frames; and generating the multi-channel features based on the audio frame length, the number of audio frames and the feature spectrogram set, wherein the channels of the multi-channel features correspond one-to-one to the feature spectrograms in the feature spectrogram set.
Specifically, the normalization is calculated as follows:

y = 1 / (1 + e^(-x))

where x is the value of a point on each spectrogram, taken as the input value, and y is the output value after normalization, with range [0, 1]. For example, X = [0, 4, 7], where 0, 4 and 7 are three values of x, becomes [1/2, 1/(1+e^-4), 1/(1+e^-7)] after normalization. All spectrograms are thus measured on a uniform scale; the sigmoid is an S-shaped squashing function, and the relative amplitudes are preserved after compression.
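The normalization itself is a direct translation of the formula (a sketch, using numpy):

```python
import numpy as np

def normalize(spectrogram: np.ndarray) -> np.ndarray:
    """Sigmoid normalization y = 1 / (1 + e^(-x)), mapping every point into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-spectrogram))

print(normalize(np.array([0.0, 4.0, 7.0])))  # [0.5  0.98201379  0.99908895]
```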
Furthermore, in order to mine deep signal features, this embodiment does not simply splice all the features transversely but, borrowing the idea of channels from images, combines the above features longitudinally into N_channel channels. In this embodiment the frame length L_frame of each frame may likewise be set to 30 milliseconds (the frame length is generally set in the range of 20 to 40 milliseconds and can be adjusted to the actual data). The two-dimensional matrix of each channel is expressed as (N_frame, L_frame), and the resulting multi-channel feature as (N_frame, L_frame, N_channel). As shown in fig. 4, in this embodiment the power-normalized chromagram obtained above is taken as one channel, the Mel cepstral coefficients as one channel, the Mel spectrum as one channel and the constant-Q chromagram as one channel, so that a four-channel multi-channel feature is obtained.
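Assuming the four normalized maps have been brought to a common (N_frame, L_frame) shape, e.g. by resizing, the channel fusion is a single stack (a sketch, not the patent's code):

```python
import numpy as np

def fuse_channels(pn_chroma, mfcc, mel, cq_chroma) -> np.ndarray:
    """Stack four (N_frame, L_frame) feature maps into (N_frame, L_frame, N_channel=4)."""
    return np.stack([pn_chroma, mfcc, mel, cq_chroma], axis=-1)
```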
Furthermore, a neural network model is generated and neural network training is performed with the multi-channel features as input, specifically comprising the following steps: inputting the multi-channel features into a multi-channel input layer; and training the input multi-channel features in the deep residual convolutional layers of the neural network model according to a residual method.
As shown in fig. 5, in this experiment each convolutional layer uses edge padding and multiple groups of 3 × 3 convolution kernels, with ReLU as the activation function and 2 × 2 max pooling layers, where the input layer and each convolution kernel have the same number of channels and multiple convolution kernels output multiple channels.
Through local connection and weight sharing, a traditional convolutional network can learn locally optimal features while greatly reducing the number of parameters; however, training becomes more difficult as the network deepens, and the problems of vanishing and exploding gradients become more obvious. Therefore, as shown in fig. 6, this embodiment introduces a residual method to correct and reduce the strong correlation between adjacent layers: the output of layer (n-2) and the output of layer n are spliced and then used as the input of the following layer (n+1).
Specifically, this embodiment takes a network structure of 12 layers as an example, namely 8 convolutional layers, 2 max pooling layers and 2 fully connected layers; the network structure formed by combining the multi-channel features with residual convolution is shown in fig. 6.
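A minimal Keras sketch of this structure; the channel counts, the class count and the use of concatenation (following the "spliced" wording rather than the addition of classic ResNets) are assumptions, and the sketch is shallower than the 12-layer network of fig. 6:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x):
    """Two 3x3 conv layers; the block input (layer n-2 output) is spliced
    with the layer-n output and fed to the next layer, as in fig. 6."""
    shortcut = x
    y = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(32, 3, padding="same", activation="relu")(y)
    return layers.Concatenate()([shortcut, y])   # splice, then next layer

inputs = tf.keras.Input(shape=(None, None, 4))   # 4-channel fused features
x = residual_block(inputs)
x = layers.MaxPooling2D(2)(x)
x = residual_block(x)
x = layers.MaxPooling2D(2)(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)  # 10 classes, assumed
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
```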
Further, in order to optimize the learning rate of the neural network model, an audio verification set is obtained and the learning rate is optimized based on it. Specifically, during training of the neural network model, the learning rate is tuned, a dropout function is set, and an early-stop mechanism is set to improve the model and prevent overfitting. The loss value on the verification set keeps decreasing during training; when it increases for the first time, the learning rate is reduced to 0.1 times its previous value, and when 5 consecutive loss values are all larger than the value before the increase, training stops, the stored model at that point being the optimal model. The loss function used to calculate the loss value is categorical_crossentropy.
For example, with an initial learning rate of 0.001, when the loss value of the verification set increases for the first time the learning rate is adjusted to 0.0001, i.e. multiplied by 0.1 on the basis of the original learning rate. A learning rate that is too small makes the model learn and converge slowly; one that is too large makes the loss value oscillate or even grow. The learning rate can be compared to a step size: too small a step is slow, while too large a step is unstable and changes too fast.
Specifically, during training of the neural network model the following may be set: 500 training cycles, i.e. epochs = 500, with every 200 pieces of data trained as one group, i.e. batch_size = 200. Assuming a total of 1000 samples, 1000/200 = 5, i.e. one epoch is completed after 5 batches, and exactly 1000 samples are trained after 5 batches of 200. After each epoch the verification set takes part in prediction and the loss value and accuracy are calculated; during training the loss value keeps decreasing and the accuracy keeps improving. When the loss value after the i-th epoch is larger than that of the (i-1)-th, the learning rate is adjusted and training continues; when the loss values of 5 consecutive epochs are all larger than the earlier minimum, training stops and the early-stop mechanism is triggered. In that case epochs may equal 34, and the stored optimal model is the one from the 29th epoch.
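This schedule resembles standard Keras callbacks, sketched below; the patent names no framework, `model` reuses the earlier sketch, and x_train, y_train, x_val and y_val stand for the fused training and verification features and labels:

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # multiply the learning rate by 0.1 the first time the verification loss rises
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=1),
    # stop after 5 epochs without improvement and keep the best weights
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]
model.fit(x_train, y_train, epochs=500, batch_size=200,
          validation_data=(x_val, y_val), callbacks=callbacks)
```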
During the training of each batch of batch_size samples on the training data set, the optimal solution is sought by back propagation, i.e. gradient descent, according to the loss value, and the network parameters are updated in turn from output to input.
The verification set does not take part in back propagation, i.e. it does not train the model parameters; it only verifies the accuracy and loss of the current model and informs the manual tuning of parameters such as the number of network layers, the batch_size and the initial learning rate.
The neural network models trained in this way differ in their audio classification tasks, so multiple classification models are obtained; all these different neural networks can then be collected into a model library for use in subsequent audio prediction.
Example two
As shown in fig. 1, an audio data prediction method includes obtaining a trained neural network model by the audio data processing method of the first embodiment, and further includes the following steps: acquiring an audio test set and preprocessing it to obtain a preprocessed audio test set; extracting each frame of test audio signal of each segment of audio in the preprocessed audio test set, and extracting a test feature spectrogram set of each frame of test audio signal, the test feature spectrogram set comprising more than two feature spectrograms; and normalizing the test feature spectrogram set and generating multi-channel test features. When predicting audio data, the preprocessing, framing, frame-feature extraction and fusion into the test feature spectrogram set are the same as in the first embodiment; the difference is that the trained neural network model is called directly, with the multi-channel test features as input, to obtain the audio classification result.
Specifically, different neural network classification models are selected from the model library, and a corresponding API interface is developed for each neural network model so that the designated model can be called to provide a real-time online service; the audio data in the test set is then input into the neural network model to obtain the audio classification result, and the format of the audio classification result is finally modified into a format for display.
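One possible shape of such a model-library service, sketched with Flask and Keras; the endpoint, the file layout and the scene names are purely illustrative assumptions:

```python
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)
MODELS = {  # hypothetical model library: one model per acoustic scene
    "traffic": tf.keras.models.load_model("models/traffic.h5"),
    "factory": tf.keras.models.load_model("models/factory.h5"),
}

@app.route("/classify/<scene>", methods=["POST"])
def classify(scene: str):
    """Run the designated scene model on fused multi-channel test features."""
    features = np.asarray(request.json["features"])[None, ...]  # add batch dim
    probs = MODELS[scene].predict(features)[0]
    return jsonify({"class": int(np.argmax(probs)),
                    "confidence": float(np.max(probs))})
```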
A computer-readable storage medium storing a computer program which, when executed by a processor, performs the audio data processing method of any one of the embodiments.
More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium other than a computer readable storage medium that can send, propagate or transport a program for use by or in connection with an instruction execution system, apparatus or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into modules or units is only one kind of logical function division, and other divisions are possible in actual implementation: multiple units, modules or components may be combined or integrated into another device, and some features may be omitted or not executed.
The units may or may not be physically separate, and components displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions within the technical scope of the present invention are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method of audio data processing, comprising the steps of:
acquiring an audio data set, and preprocessing the audio data set to obtain a preprocessed audio set;
extracting each frame of audio signal of each segment of audio in the preprocessed audio set, and extracting a feature spectrogram set of each frame of the audio signal, wherein the feature spectrogram set comprises more than two feature spectrograms;
normalizing the feature spectrogram set and generating multi-channel features;
and generating a neural network model, and performing neural network training with the multi-channel features as input.
2. The audio data processing method of claim 1, wherein the audio data set is preprocessed, comprising the steps of:
filtering useless audio in the audio data set, and unifying the audio duration of each segment of audio in the audio data set;
and performing framing and windowing on each segment of audio data of the filtered audio data set to obtain the preprocessed audio set.
3. The audio data processing method of claim 2, wherein filtering the useless audio in the audio data set comprises:
deleting the audio data in the audio data set that cannot be identified;
setting a first audio length threshold value and a frequency threshold value, and deleting the audio data of which the audio length is shorter than the first audio length threshold value or the frequency is lower than the frequency threshold value in the audio data set.
4. The audio data processing method of claim 2, wherein unifying the audio duration of each piece of audio in the audio data set comprises:
setting a second audio length threshold value, and judging the audio length of the audio data in the audio data set and the second audio length threshold value;
if the audio length of the audio data is greater than or equal to the second audio length threshold, continuously truncating audio data of the standard duration from it;
and if the audio length of the audio data is smaller than the second audio length threshold, obtaining audio data of the standard duration by truncating or padding.
5. The audio data processing method of claim 1, wherein extracting the feature spectrogram set of each frame of the audio signal comprises:
sequentially acquiring the power-normalized chromagram, Mel cepstral coefficients, Mel spectrum and constant-Q chromagram of the audio signal.
6. The audio data processing method of claim 1, wherein generating the multi-channel feature comprises:
setting the audio frame length, frame shift and maximum audio duration of each segment of audio data in the preprocessed audio set, and calculating the number of audio frames;
and generating the multi-channel features based on the audio frame length, the number of audio frames and the feature spectrogram set, wherein the channels of the multi-channel features correspond one-to-one to the feature spectrograms in the feature spectrogram set.
7. The audio data processing method of claim 1, wherein performing neural network training with the multi-channel features as input comprises:
inputting the multi-channel features into a multi-channel input layer;
and training the input multi-channel features in the deep residual convolutional layers of the neural network model according to a residual method, obtaining neural network models for different audio classifications and generating a model library.
8. The audio data processing method according to claim 1, further comprising the steps of:
and acquiring an audio verification set, and optimizing the learning rate of the neural network model based on the audio verification set.
9. A method for predicting audio data, comprising obtaining a trained neural network model using the audio data processing method according to any one of claims 1 to 8, and further comprising the steps of:
acquiring an audio test set, and preprocessing the audio test set to obtain a preprocessed audio test set;
extracting each frame of test audio signal of each segment of audio in the preprocessed audio test set, and extracting a test feature spectrogram set of each frame of the test audio signal, wherein the test feature spectrogram set comprises more than two feature spectrograms;
normalizing the test feature spectrogram set and generating multi-channel test features;
calling the trained neural network model, and taking the multi-channel test characteristics as input to obtain an audio classification result;
and modifying the format of the audio classification result into a format for display.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the audio data processing method of any one of claims 1 to 8.
CN202211406445.0A 2022-11-10 2022-11-10 Audio data processing method and prediction method Pending CN115713945A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211406445.0A CN115713945A (en) 2022-11-10 2022-11-10 Audio data processing method and prediction method

Publications (1)

Publication Number Publication Date
CN115713945A (en) 2023-02-24

Family

ID=85232744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211406445.0A Pending CN115713945A (en) 2022-11-10 2022-11-10 Audio data processing method and prediction method

Country Status (1)

Country Link
CN (1) CN115713945A (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109166593A (en) * 2018-08-17 2019-01-08 腾讯音乐娱乐科技(深圳)有限公司 audio data processing method, device and storage medium
CN110111773A (en) * 2019-04-01 2019-08-09 华南理工大学 The more New Method for Instrument Recognition of music signal based on convolutional neural networks
CN110600054A (en) * 2019-09-06 2019-12-20 南京工程学院 Sound scene classification method based on network model fusion
KR20210125366A (en) * 2020-04-08 2021-10-18 주식회사 케이티 Method for detecting recording device failure using neural network classifier, server and smart device implementing the same
US20210319321A1 (en) * 2020-04-14 2021-10-14 Sony Interactive Entertainment Inc. Self-supervised ai-assisted sound effect recommendation for silent video
CN113539283A (en) * 2020-12-03 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method and device based on artificial intelligence, electronic equipment and storage medium
CN112802484A (en) * 2021-04-12 2021-05-14 四川大学 Panda sound event detection method and system under mixed audio frequency
CN114352486A (en) * 2021-12-31 2022-04-15 西安翔迅科技有限责任公司 Wind turbine generator blade audio fault detection method based on classification
CN114627895A (en) * 2022-03-29 2022-06-14 大象声科(深圳)科技有限公司 Acoustic scene classification model training method and device, intelligent terminal and storage medium
CN114974302A (en) * 2022-05-06 2022-08-30 珠海高凌信息科技股份有限公司 Ambient sound event detection method, apparatus and medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863957A (en) * 2023-09-05 2023-10-10 硕橙(厦门)科技有限公司 Method, device, equipment and storage medium for identifying operation state of industrial equipment
CN116863957B (en) * 2023-09-05 2023-12-12 硕橙(厦门)科技有限公司 Method, device, equipment and storage medium for identifying operation state of industrial equipment

Similar Documents

Publication Publication Date Title
US11004461B2 (en) Real-time vocal features extraction for automated emotional or mental state assessment
US11062725B2 (en) Multichannel speech recognition using neural networks
US20220093111A1 (en) Analysing speech signals
US10878823B2 (en) Voiceprint recognition method, device, terminal apparatus and storage medium
CN107680597B (en) Audio recognition method, device, equipment and computer readable storage medium
US20220230651A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
CN108564963B (en) Method and apparatus for enhancing voice
CN110459241B (en) Method and system for extracting voice features
CN108922513B (en) Voice distinguishing method and device, computer equipment and storage medium
US20130024191A1 (en) Audio communication device, method for outputting an audio signal, and communication system
CN110379412A (en) Method, apparatus, electronic equipment and the computer readable storage medium of speech processes
US7809560B2 (en) Method and system for identifying speech sound and non-speech sound in an environment
EP1995723A1 (en) Neuroevolution training system
CN108962231B (en) Voice classification method, device, server and storage medium
US20210142815A1 (en) Generating synthetic acoustic impulse responses from an acoustic impulse response
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
US11961504B2 (en) System and method for data augmentation of feature-based voice data
CN102214464A (en) Transient state detecting method of audio signals and duration adjusting method based on same
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN114203163A (en) Audio signal processing method and device
CN110047478A (en) Multicenter voice based on space characteristics compensation identifies Acoustic Modeling method and device
US20170365256A1 (en) Speech processing system and speech processing method
CN115713945A (en) Audio data processing method and prediction method
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination