CN114067788A - Guangdong opera vocal cavity classification method based on combination of CNN and LSTM


Info

Publication number
CN114067788A
Authority
CN
China
Prior art keywords
opera
network
guangdong
audio
cnn
Prior art date
2021-11-08
Legal status
Pending
Application number
CN202111313774.6A
Other languages
Chinese (zh)
Inventor
赵文锋
陈侨
王琴
赵娅雯
刘韵锋
林暖晨
刘小玲
刘易迪
梁升濠
Current Assignee
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date
2021-11-08
Filing date
2021-11-08
Publication date
2022-02-18
Application filed by South China Agricultural University
Priority to CN202111313774.6A
Publication of CN114067788A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The invention provides a Guangdong opera vocal cavity classification method based on the combination of CNN and LSTM, which comprises the following steps. Constructing a classification network model: a cascaded convolutional neural network (Inception-CNN) is set as the primary network, a CNN-double-layer-LSTM neural network is set as the secondary network, and the output of the primary network is connected to the input of the secondary network. Classifying Guangdong opera vocal cavities with the classification network model: MFCC features are extracted through feature engineering from the Mel spectrum of each audio signal of a Guangdong opera audio segment; the MFCC features are input into the primary network to fuse the shallow and deep features of each audio signal of the segment; the secondary network learns the internal features among the audio signals of the Guangdong opera audio segments, extracts the context-associated semantics, and predicts the Guangdong opera vocal cavity category. This Guangdong opera vocal cavity classification method, based on the combination of a multilayer feature-cascade convolutional neural network and long short-term memory units, is better suited to the characteristics of the Guangdong opera vocal cavity, fully extracts the features of Guangdong opera, achieves accurate classification, is comprehensive in expression, and better meets the requirements of practical application.

Description

Guangdong opera vocal cavity classification method based on combination of CNN and LSTM
Technical Field
The invention relates to the technical field of Guangdong opera vocal cavity classification, and in particular to a Guangdong opera vocal cavity classification method based on the combination of a convolutional neural network (CNN) and long short-term memory (LSTM).
Background
As a world-class intangible cultural heritage, Guangdong opera carries a profound traditional cultural background. Making full use of the advantages of artificial intelligence technology to rescue, excavate, organize, protect and transmit traditional culture has become the direction of current intangible-cultural-heritage protection. Combining modern science and technology with the traditional culture industry is therefore of real significance for the scientific continuation of Chinese traditional culture.
Processing, organizing, analyzing and utilizing massive audio information is an important part of the information-processing field, and audio classification is one of its key technologies. In recent years, research on audio and music style classification based on deep learning has made breakthrough progress. Since Guangdong opera vocal cavity classification is a branch of music style classification, research on the feature analysis of Guangdong opera vocal cavities can draw on existing music style classification techniques and then, on that basis, take into account the characteristics that distinguish Guangdong opera vocal cavities from ordinary music before being put into practice.
In recent years, music style classification has grown into a booming research area, with more and more activity driven by both scientific challenges and industrial needs. Nevertheless, opera vocal cavity classification still faces the following technical problems:
because the rhythms, accompaniments and instrumentation of different Guangdong opera vocal cavities are similar, existing music classification methods have difficulty classifying them accurately;
existing music style and audio classification techniques can only extract the shallow features within Guangdong opera audio segments, while the associated information between different audio segments of the same piece is neither analyzed nor studied; at the same time, it is difficult for them to extract multilayer cascaded deep and shallow features in combination with the characteristics of the Guangdong opera vocal cavity, so a certain amount of information may be lost.
Therefore, how to provide a CNN-and-LSTM-combined Guangdong opera vocal cavity classification method that is comprehensive in expression and accurate in classification is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a Guangdong opera vocal cavity classification method based on the combination of a multilayer feature-cascade convolutional neural network (CNN) and long short-term memory (LSTM) units, aimed at the problem that the musical rhythm features of different Guangdong opera vocal cavities are similar and traditional music classification methods struggle to classify them accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
a classification method of Yue-Opera singal cavity based on combination of CNN and LSTM comprises the following steps:
constructing a classification network model:
setting a cascaded convolutional neural network (Inception-CNN) as the primary network, setting a CNN-double-layer-LSTM neural network as the secondary network, and connecting the output of the primary network to the input of the secondary network;
classifying Guangdong opera vocal cavities with the classification network model, comprising the following steps:
extracting MFCC features through feature engineering from the Mel spectrum of each audio signal of a Guangdong opera audio segment;
inputting the MFCC features into the primary network to fuse the shallow and deep features of each audio signal of the Guangdong opera segment, and feeding the output signals into the secondary network;
the secondary network learns the internal features among the audio signals of the Guangdong opera audio segments, extracts the context-associated semantics, and predicts the Guangdong opera vocal cavity category.
Preferably, the primary network comprises a two-dimensional-convolution CRMD block and a multi-scale feature extractor;
the two-dimensional-convolution CRMD block converts the audio signal from the time domain to the frequency domain, and extracts multi-level features along both the time and frequency directions;
the multi-scale feature extractor uses a lightweight Inception structure with a symmetric double-layer stacked-convolution-kernel structure to split the multi-level features and extract features of different levels;
the features of each level are connected by a Concat layer and input to the secondary network.
Preferably, the secondary network comprises a feature-cascade stacking network formed by stacking a double-layer LSTM and a multi-convolution-layer CNN; the double-layer LSTM stack extracts deep separable features layer by layer, and each layer outputs a feature vector as the input of the subsequent layer.
Preferably, the step of classifying Guangdong opera vocal cavities with the classification network model further includes a step of preprocessing the Guangdong opera audio data sample set to obtain the Mel spectrum, comprising:
slicing the Guangdong opera audio data sample set into a plurality of audio files of equal duration;
performing a framing operation on each audio file, and windowing every two adjacent frame signals with a preset overlap rate;
applying a fast Fourier transform to each frame signal to obtain a spectrogram;
and processing the spectrogram with a Mel filter bank to obtain the Mel spectrum.
Preferably, a confusion matrix is used to visually assess the prediction results output by the classification network model, and the Guangdong opera vocal cavity categories are predicted by a majority-voting algorithm.
Through the above technical scheme, compared with the prior art, the invention has the following beneficial effects:
the method takes the tone cavity musical notation corresponding to the Guangdong-play segments as model input, and obtains the fusion characteristic expression of the shallow layer and the deep layer of each Guangdong-play segment through a primary network (inclusion-CNN) consisting of a 2d-CRMD block and a multi-scale characteristic extractor. And then through a secondary network (CNN-2LSTM) consisting of CNN and stacked LSTM, learning the context association semantics among the record segments and the importance among the record segments. Compared with the existing music style and audio frequency classification algorithm, the Guangdong drama vocal cavity classification method based on the combination of the multilayer characteristic cascade Convolution Neural Network (CNN) and the long-time and short-time memory unit (LSTM) disclosed by the invention is more suitable for the characteristics of the Guangdong drama vocal cavity, fully extracts the Guangdong drama characteristics, achieves the effect of accurate classification, is comprehensively expressed and more meets the requirements of practical application.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a flow chart of the Guangdong opera vocal cavity classification method based on the combination of CNN and LSTM according to the present invention;
FIG. 2 is a schematic diagram of the framing of an audio signal according to the present invention;
FIG. 3 is a mapping of an audio signal from a frequency scale to a Mel scale provided by the present invention;
FIG. 4 is a spectrum diagram of an audio signal provided by the present invention;
FIG. 5 is a diagram illustrating a process for obtaining a Mel frequency spectrum diagram according to the present invention;
FIG. 6 is a Mel frequency spectrum diagram of an audio signal provided by the present invention;
FIG. 7 is a network architecture diagram of a classification network model provided by the present invention;
FIG. 8 is a graph showing the variation of the training set accuracy and the test set accuracy of the classification network model according to the present invention;
FIG. 9 is a graph showing the variation of the training set error and the test set error of the classification network model provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without creative effort, shall fall within the protection scope of the present invention.
Currently, mainstream music classification technology mainly targets the style characteristics of music. Guangdong opera classification can draw on these methods for reference, but multi-dimensional high-level features must be extracted in combination with the characteristics of the Guangdong opera vocal cavity. Referring to FIG. 1, the flow of the Guangdong opera vocal cavity classification method based on CNN and LSTM in this embodiment includes the following steps:
s1, extracting Mel frequency spectrum MFCC characteristics from the Mel frequency spectrum of each audio signal of the audio episode Guangdong opera through characteristic engineering.
And S2, inputting the Mel frequency spectrum MFCC characteristics into a primary network to fuse the shallow layer characteristics and the deep layer characteristics of the audio signals of the audio of the singe segmentations Guangdong opera, and inputting output signals of the characteristics into a secondary network.
S3, learning the internal characteristics among audio signals of the Guangdong opera by the secondary network, extracting context associated semantics and predicting the types of the Guangdong opera cavities.
In this embodiment, a Convolutional Neural Network (CNN) is a kind of feed forward Neural network (fed forward Neural network) including convolution calculation and having a deep structure, and is one of typical algorithms of deep learning (deep learning). The convolutional neural network has a representation learning (representation learning) capability, and can perform shift-invariant classification (shift-invariant classification) on input information according to a hierarchical structure of the convolutional neural network. Supervised learning and unsupervised learning can be performed, and convolution kernel parameter sharing in an implicit layer and sparsity of connection among layers enable a convolutional neural network to perform grid-like topology (grid-like topology) feature with small calculation amount.
The specific implementation process is as follows:
step one, acquiring Guangdong opera audio segment data, and splitting the acquired audio data into a training set and a test set;
step two, preprocessing the audio data, including screening, tuning, slicing, framing and windowing;
step three, acquiring the Mel spectrogram of the audio data;
step four, establishing a Guangdong opera vocal cavity classification network model combining a multilayer feature-cascade convolutional neural network (CNN) and long short-term memory (LSTM) units;
and step five, predicting the Guangdong opera vocal cavity categories with the trained optimal classification network model, and deciding the final result with a majority-voting algorithm.
In one embodiment, the step of preprocessing the Guangdong opera audio data sample set to obtain the Mel spectrum comprises:
slicing the Guangdong opera audio data sample set into a plurality of audio files of equal duration;
performing a framing operation on each audio file, and windowing every two adjacent frame signals with a preset overlap rate;
applying a fast Fourier transform to each frame signal to obtain a spectrogram;
and processing the spectrogram with a Mel filter bank to obtain the Mel spectrum.
In this embodiment, the data come from internet platforms such as Chinese opera websites. The Guangdong opera vocal cavities serve as the labels of the data set; five types of vocal cavity are used as labels, covering 165 Guangdong operas and 500 vocal-cavity segments in WAV format in total, where each segment is 30 s long with a sampling rate of 44100 Hz.
Because the vocal cavities of different Guangdong opera genres share many commonalities, such as the accompanying instruments, rhythms and temperaments, audio files with sufficient distinguishing power must be screened out for accurate classification. Considering that the data sets of the various genres come from different sources, the original files are slightly adjusted: the network's sensitivity to sound intensity is appropriately reduced without changing the vocal cavity of the segment, which improves the model's adaptability to audio of different loudness. It is therefore very necessary to preprocess the original signals of the Guangdong opera audio segments. The processing of the audio signals includes screening, tuning, slicing, framing and so on.
Slicing: a Python program is used to slice the audio into files of equal duration (30 s).
Framing: a speech signal is non-stationary (it is produced by the vibration of the vocal organs, whose motion state is unpredictable and affected by the environment). Research shows that within 10-30 ms the spectral characteristics of a speech signal can be regarded as essentially unchanged (because the vibration frequency of the vocal-tract muscles is limited). In view of the internal continuity of the Guangdong opera audio-segment signal, overlapping framing is generally used to avoid losing information through hard physical separation. As shown in FIG. 2, the audio-signal framing diagram shows the framing of a 30-second Guangdong opera audio sample, with a frame duration of 1.5 seconds and an overlap rate of 50% (adjacent frames overlap by 50%); the trailing portion of the sample shorter than one frame duration is discarded.
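For concreteness, the following is a minimal Python sketch of the slicing and overlapping framing described above (30 s clips, 1.5 s frames, 50% overlap, trailing remainder discarded). The patent does not publish its preprocessing code, so the function and variable names here are illustrative assumptions.

```python
# Hedged sketch of the slicing and framing steps; not the patent's own code.
import numpy as np
import librosa

SR = 44100        # sampling rate stated in the embodiment
CLIP_S = 30       # clip duration in seconds
FRAME_S = 1.5     # frame duration in seconds
OVERLAP = 0.5     # 50% overlap between adjacent frames

def slice_and_frame(path):
    y, _ = librosa.load(path, sr=SR, mono=True)
    clip_len = SR * CLIP_S
    frame_len = int(SR * FRAME_S)
    hop = int(frame_len * (1 - OVERLAP))
    # Slice into 30 s clips; a trailing remainder shorter than one clip is dropped.
    clips = [y[i:i + clip_len] for i in range(0, len(y) - clip_len + 1, clip_len)]
    framed = []
    for clip in clips:
        n = (len(clip) - frame_len) // hop + 1   # frames per clip; remainder discarded
        framed.append(np.stack([clip[j * hop : j * hop + frame_len] for j in range(n)]))
    return framed  # one (n_frames, frame_len) array per 30 s clip
```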
FFT calculation: an FFT with a sampling-window length of 1024 is used to obtain the spectral information of the whole file, sliding forward 128 samples for each calculation so that the sampling windows overlap one another.
Audio Mel-spectrogram acquisition: the sound signal is originally a one-dimensional time-domain signal, from which the law of frequency variation is difficult to see intuitively. The Mel spectrogram shows the characteristics of the audio more intuitively and concretely. The Mel spectrum is the spectrum on the Mel scale, obtained by multiplying the ordinary spectrum by a bank of Mel filters. The mapping between the ordinary (Hertz) frequency scale and the Mel frequency scale is given by formula (1); taking a passage sung by the Guangdong opera actress Hong Xiannü (红线女) as an example, the Hz-to-Mel mapping of the passage is shown in FIG. 3;
mel(f) = 2595 × log10(1 + f / 700)    (1)
the audio file is framed (long-term signals are framed as short-term signals), windowed, subjected to fourier transform (FFT) on each frame to map the audio signal from the time domain to the frequency domain (the overlapping window portion of the audio signal is performed), and finally the results of each frame are stacked along another dimension to obtain a spectrogram. As shown in fig. 4.
Obtaining a mel spectrum from an audio signal: the time domain signal is subjected to Fourier transform and converted into a frequency domain, and then the frequency domain signal corresponding to the filter bank with the Mel frequency scale is processed to obtain the Mel frequency spectrum. The process is shown in figure 5. In the meier spectrum diagram of the yue-opera song, the abscissa represents time, the ordinate represents frequency, the color depth represents the energy, and the frequency scale of the meier spectrum is in exponential distribution, as shown in fig. 6.
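The Mel-spectrum step above can be sketched with librosa, using the STFT parameters from the embodiment (window length 1024, hop 128, 44100 Hz); the Mel-band and MFCC counts are illustrative assumptions, since the patent does not state them.

```python
# Hedged sketch of Mel-spectrogram and MFCC extraction; parameter choices
# beyond n_fft/hop_length/sr are assumptions, not taken from the patent.
import librosa

def mel_features(frame, sr=44100, n_mels=128, n_mfcc=20):
    mel = librosa.feature.melspectrogram(y=frame, sr=sr, n_fft=1024,
                                         hop_length=128, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)     # log-compressed Mel spectrum (cf. FIG. 6)
    mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=n_mfcc)
    return log_mel, mfcc
```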
In one embodiment, the classification network model is constructed as follows:
implemented in Tensorflow2.2 using Keras API, the model network structure includes convolutional layers, ReLU layers, Max-Pooling layers, Concat layers, LSTM layers, Flatten layers, Dense layers, Dropout layers, Batch Normalization layers, etc. The model consists of a two-level neural network, as shown in FIG. 7. The method is characterized in that a tone cavity musical tone spectrum corresponding to the Guangdong play segments is used as model input, and the fusion feature expression of the shallow layer and the deep layer of each Guangdong play segment can be obtained through a primary network (inclusion-CNN) and composed of a 2d-CRMD block and a multi-scale feature extractor. And then through a secondary network (CNN-2LSTM), the content of the CNN and the stacked LSTM is used for learning the context association semantics among the record segments and the importance among the record segments. Meanwhile, the effective data set of the Guangdong drama resources is considered to be small, and overfitting is possibly caused compared with the expression capability and the fitting capability of the network. The neural network model used by the invention is connected with the dropout layer behind the convolution layer, so that the CNN overfitting can be well lightened.
Primary network: the two-dimensional convolution CRMD block converts the audio signal from a time domain to a frequency domain, and then the two-dimensional convolution is utilized to respectively extract the characteristics of higher layers from two directions of time and frequency. The multiscale feature extractor uses a lightweight inclusion structure to split a larger convolution through a symmetrical double-layer stacked convolution kernel structure, extracts features of different levels to be fused, realizes multi-dimensional feature expression, and finally connects all dimensional features through a Concat layer, so that the generalization capability of a model to a feature range is enhanced while certain feature space information is maintained.
Secondary network: a feature cascade stack network is composed of a stack of two layers of LSTM and convolutional layers. And extracting deep separability features layer by adopting a stacked network of double-layer LSTM, from basic conceptual features to more abstract deep features. The stacked convolution kernels expand the learning range of the frequency spectrum characteristics of the vocal segments, so that the network has stronger identification capability on potential patterns in the sound spectrum. And outputting a feature vector sequence from each layer as the input of a subsequent layer, observing in blocks along with the passage of time, and expressing features on different time scales, thereby enhancing the characterization capability of the model. Meanwhile, the problem of gradient disappearance caused by the fact that the shallow layer weight cannot be updated iteratively due to multi-layer LSTM stacking is avoided.
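Since FIG. 7 is not reproduced here, the following Keras sketch illustrates one plausible reading of the two-level architecture just described: a CRMD block (Conv, ReLU, Max-Pooling, Dropout), a lightweight Inception module with symmetric 1x3/3x1 stacked kernels fused by a Concat layer, and a double-layer LSTM stack applied across the segment sequence. All filter counts, kernel sizes and layer widths are assumptions; the patent does not disclose them.

```python
# Illustrative sketch of the Inception-CNN + CNN-2LSTM model, not the patented
# architecture itself; hyperparameters are assumed for demonstration.
from tensorflow.keras import layers, models

def crmd_block(x, filters):
    # One reading of the "2d-CRMD" block: Conv2D -> ReLU -> Max-Pooling -> Dropout.
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    return layers.Dropout(0.3)(x)   # dropout after convolution to curb overfitting

def inception_module(x, filters):
    # Lightweight Inception: a larger kernel split into a symmetric double-layer
    # stack of 1x3 and 3x1 convolutions, plus a 1x1 branch; outputs concatenated.
    b1 = layers.Conv2D(filters, (1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, (1, 3), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, (3, 1), padding="same", activation="relu")(b2)
    return layers.Concatenate()([b1, b2])   # the Concat layer fusing the scales

def build_model(time_steps, n_mels, frames, n_classes=5):
    # Primary network (Inception-CNN), applied to one Mel-spectrogram frame.
    seg_in = layers.Input(shape=(n_mels, frames, 1))
    x = crmd_block(seg_in, 32)
    x = crmd_block(x, 64)
    x = inception_module(x, 32)
    x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)   # compact per-frame feature vector
    primary = models.Model(seg_in, x, name="inception_cnn")

    # Secondary network (CNN-2LSTM): the primary network runs on every frame of
    # the sequence, then a double-layer LSTM stack learns cross-frame context.
    seq_in = layers.Input(shape=(time_steps, n_mels, frames, 1))
    y = layers.TimeDistributed(primary)(seq_in)
    y = layers.LSTM(128, return_sequences=True)(y)   # first LSTM layer
    y = layers.LSTM(64)(y)                           # second, stacked LSTM layer
    y = layers.Dense(64, activation="relu")(y)
    y = layers.Dropout(0.5)(y)
    out = layers.Dense(n_classes, activation="softmax")(y)
    return models.Model(seq_in, out, name="cnn_2lstm")
```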
In one embodiment, the Mel spectrum features are input into the classification network model for classification prediction, the obtained prediction results are visually assessed with a confusion matrix, and the Guangdong opera vocal cavity categories are predicted by a majority-voting algorithm.
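A minimal sketch of the majority-voting decision and confusion-matrix check might look as follows; treating each 30 s clip as one vote is an assumption, since the patent does not specify the voting unit.

```python
# Hedged sketch: majority voting over per-clip predictions, then a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

def predict_recording(model, clips, n_classes=5):
    # clips: (n_clips, time_steps, n_mels, frames, 1), the framed Mel
    # spectrograms of all 30 s clips cut from one recording.
    probs = model.predict(clips)          # per-clip class probabilities
    votes = probs.argmax(axis=1)          # hard label for each clip
    return np.bincount(votes, minlength=n_classes).argmax()   # majority vote

# With recording-level labels gathered over the test set (y_true, y_pred assumed):
# cm = confusion_matrix(y_true, y_pred)   # rows: true cavity, columns: predicted
```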
The MFCCs of the original audio signal are used as the input feature vectors to build the Guangdong opera vocal cavity classification model combining the multilayer feature-cascade convolutional neural network (CNN) and long short-term memory (LSTM) units. The accuracy and error curves of the model on the training and test sets are shown in FIGS. 8-9; the accuracies of the optimal model obtained after training exceed 90% on both the training and test sets, meeting the requirement of accurate classification while effectively retaining the characteristics of the Guangdong opera vocal cavity.
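A training loop consistent with the curves of FIGS. 8-9 could be sketched as below, reusing build_model from the sketch above; the optimizer, learning rate, batch size and epoch count are assumptions, as the patent does not state them.

```python
# Hedged training sketch; x_train/x_test are framed Mel-spectrogram tensors and
# y_train/y_test integer cavity labels (0-4), assumed to come from the
# preprocessing sketches above. 39 frames per 30 s clip follows from 1.5 s
# frames at 50% overlap; ~517 STFT columns per frame follows from hop 128.
model = build_model(time_steps=39, n_mels=128, frames=517, n_classes=5)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=50, batch_size=16)
# history.history["accuracy"] / ["val_accuracy"] trace the accuracy curves of
# FIG. 8; history.history["loss"] / ["val_loss"] trace the error curves of FIG. 9.
```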
The invention plays an important role in the field of Guangdong opera vocal cavity identification and classification:
learning the audio with a deep learning method avoids manual feature selection and, to a certain extent, spares those engaged in Guangdong opera work unnecessary waste of resources. The method can provide a more scientific, reasonable and efficient data management and classification scheme for Guangdong opera databases, and provide data support for research on Guangdong opera performance forms, vocal genres and music. It can also be applied in music software, scientifically and conveniently bringing a better and more convenient listening experience to Guangdong opera enthusiasts and audiences, and further promoting the transmission of Guangdong opera culture.
The Guangdong opera vocal cavity classification method based on the combination of CNN and LSTM provided by the present invention is described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A Guangdong opera vocal cavity classification method based on the combination of CNN and LSTM, characterized by comprising the following steps:
constructing a classification network model:
setting a cascaded convolutional neural network (Inception-CNN) as the primary network, setting a CNN-double-layer-LSTM neural network as the secondary network, and connecting the output of the primary network to the input of the secondary network;
classifying Guangdong opera vocal cavities with the classification network model, comprising the following steps:
extracting MFCC features through feature engineering from the Mel spectrum of each audio signal of a Guangdong opera audio segment;
inputting the MFCC features into the primary network to fuse the shallow and deep features of each audio signal of the Guangdong opera segment, and feeding the output signals into the secondary network;
the secondary network learns the internal features among the audio signals of the Guangdong opera audio segments, extracts the context-associated semantics, and predicts the Guangdong opera vocal cavity category.
2. The Guangdong opera vocal cavity classification method based on the combination of CNN and LSTM according to claim 1, wherein the primary network comprises a two-dimensional-convolution CRMD block and a multi-scale feature extractor;
the two-dimensional-convolution CRMD block converts the audio signal from the time domain to the frequency domain, and extracts multi-level features along both the time and frequency directions;
the multi-scale feature extractor uses a lightweight Inception structure with a symmetric double-layer stacked-convolution-kernel structure to split the multi-level features and extract features of different levels;
the features of each level are connected by a Concat layer and input to the secondary network.
3. The Guangdong opera vocal cavity classification method based on the combination of CNN and LSTM according to claim 1, wherein the secondary network comprises a feature-cascade stacking network formed by stacking a double-layer LSTM and a multi-convolution-layer CNN; the double-layer LSTM stack extracts deep separable features layer by layer, and each layer outputs a feature vector as the input of the subsequent layer.
4. The Guangdong opera vocal cavity classification method based on the combination of CNN and LSTM according to claim 1, wherein the step of classifying Guangdong opera vocal cavities with the classification network model further includes a step of preprocessing the Guangdong opera audio data sample set to obtain the Mel spectrum, comprising:
slicing the Guangdong opera audio data sample set into a plurality of audio files of equal duration;
performing a framing operation on each audio file, and windowing every two adjacent frame signals with a preset overlap rate;
applying a fast Fourier transform to each frame signal to obtain a spectrogram;
and processing the spectrogram with a Mel filter bank to obtain the Mel spectrum.
5. The Guangdong opera vocal cavity classification method based on the combination of CNN and LSTM according to claim 1, wherein a confusion matrix is used to visually assess the prediction results output by the classification network model, and the Guangdong opera vocal cavity categories are predicted by majority voting.
CN202111313774.6A 2021-11-08 2021-11-08 Guangdong opera vocal cavity classification method based on combination of CNN and LSTM Pending CN114067788A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111313774.6A (CN114067788A) | 2021-11-08 | 2021-11-08 | Guangdong opera vocal cavity classification method based on combination of CNN and LSTM

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111313774.6A (CN114067788A) | 2021-11-08 | 2021-11-08 | Guangdong opera vocal cavity classification method based on combination of CNN and LSTM

Publications (1)

Publication Number | Publication Date
CN114067788A | 2022-02-18

Family

ID=80275220

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111313774.6A (CN114067788A, pending) | Guangdong opera vocal cavity classification method based on combination of CNN and LSTM | 2021-11-08 | 2021-11-08

Country Status (1)

Country | Link
CN | CN114067788A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2023201635A1 * | 2022-04-21 | 2023-10-26 | 中国科学院深圳理工大学(筹) | Audio classification method and apparatus, terminal device, and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination