CN107068167A - Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures - Google Patents

Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures

Info

Publication number
CN107068167A
CN107068167A (application CN201710146957.0A)
Authority
CN
China
Prior art keywords
network
speaker
neural network
layer
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710146957.0A
Other languages
Chinese (zh)
Inventor
李明 (Li Ming)
倪志东 (Ni Zhidong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SYSU CMU Shunde International Joint Research Institute
National Sun Yat Sen University
Original Assignee
SYSU CMU Shunde International Joint Research Institute
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SYSU CMU Shunde International Joint Research Institute and National Sun Yat Sen University
Priority to CN201710146957.0A (filed 2017-03-13)
Publication of CN107068167A (2017-08-18)
Priority to PCT/CN2018/076272 (filed 2018-02-11)
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a speaker cold-symptom recognition method that fuses multiple end-to-end neural network structures, comprising the following steps: S1. build and train end-to-end neural network A, whose input is the raw speech waveform and whose recognition network combines a convolutional neural network with a long short-term memory (LSTM) network; S2. build and train end-to-end neural network B, whose input is the speech spectrogram and whose recognition network combines a convolutional neural network with an LSTM network; S3. build and train end-to-end neural network C, whose input is the speech spectrogram and whose recognition network combines a convolutional neural network with a fully connected network; S4. build and train end-to-end neural network D, whose input is the speech MFCC or CQCC features and whose recognition network is an LSTM network; S5. fuse the four trained end-to-end neural networks to perform speaker cold-symptom recognition.

Description

Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures
Technical field
The present invention relates to the field of voiceprint recognition, and more particularly to a speaker cold-symptom recognition method that fuses multiple end-to-end neural network structures.
Background art
Speaker recognition, also known as voiceprint recognition, is the technology of automatically identifying a speaker using pattern recognition techniques. Current speaker recognition technology achieves good performance under laboratory conditions, but in practice the speech to be recognized is affected by environmental noise and by the speaker's state of health, which reduces the robustness of existing speaker recognition technology. Existing speaker recognition methods are mainly used for determining a speaker's identity; so far there is no recognition method dedicated to speaker cold symptoms.
In speech technology research, researchers have long sought features that characterize the target type, that is, characteristics that clearly distinguish the target speech from normal speech. Speech feature extraction extracts the speaker's phonetic and vocal-tract features. At present, the mainstream characteristic parameters, including MFCC, LPCC and CQCC, are all based on a single feature and do not carry enough information to characterize a speaker's cold symptoms, which limits recognition accuracy. Recognition also requires extensive knowledge for distinguishing the target class of speech. Among recognition algorithms, early methods based on vocal-tract and speech-production models did not achieve good practical results because of model complexity, whereas model-matching techniques such as dynamic time warping, hidden Markov models and vector quantization began to deliver good recognition performance. Studying feature extraction and pattern classification separately is the common approach in recognition research, but this classical framework suffers from mismatches between features and models, difficult training, and features that are hard to find.
Recently, with the development of deep learning, deep neural networks have shown great power in image and speech recognition, and a series of neural network structures have been proposed, such as autoencoder networks, convolutional neural networks and recurrent neural networks. Many researchers have found that learning from speech with neural networks yields hidden structural features that describe the speech better. End-to-end recognition methods use as little prior knowledge as possible while handling feature learning and feature recognition jointly, and they achieve good recognition results.
Summary of the invention
To solve the problems of prior-art recognition techniques, in which separating feature extraction from pattern classification leads to mismatched features and models, difficult training, and features that are hard to find, the present invention provides a speaker cold-symptom recognition method that fuses multiple end-to-end neural network structures. By unifying feature learning and pattern classification, the method makes the whole speaker cold-symptom recognition process simpler and faster, and it has broad application prospects.
To achieve the above objective, the adopted technical scheme is as follows:
A speaker cold-symptom recognition method fusing multiple end-to-end neural network structures comprises the following steps:
S1. Build and train end-to-end neural network A, whose input is the raw speech waveform and whose recognition network combines a convolutional neural network with a long short-term memory (LSTM) network;
S2. Build and train end-to-end neural network B, whose input is the speech spectrogram and whose recognition network combines a convolutional neural network with an LSTM network;
S3. Build and train end-to-end neural network C, whose input is the speech spectrogram and whose recognition network combines a convolutional neural network with a fully connected network;
S4. Build and train end-to-end neural network D, whose input is the speech MFCC or CQCC features and whose recognition network is an LSTM network;
S5. Fuse the four trained end-to-end neural networks to perform speaker cold-symptom recognition.
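The fusion rule of step S5 is not detailed above; one common realization is score-level fusion, sketched below in Python. The `fuse_scores` helper, the equal default weights and the 0.5 decision threshold are illustrative assumptions, not values taken from the invention.

```python
# Hypothetical sketch of step S5: score-level fusion of the posteriors
# produced by the four end-to-end networks A, B, C and D for one utterance.
# The fusion rule (a weighted average) is an assumption; the patent does
# not specify how the four networks are combined.
import numpy as np

def fuse_scores(scores, weights=None):
    """Weighted average of per-network cold-symptom posteriors."""
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        # Equal weights when no per-network reliability is known.
        weights = np.full(len(scores), 1.0 / len(scores))
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights, scores))

# Posteriors from networks A, B, C and D (illustrative values only).
fused = fuse_scores([0.8, 0.6, 0.7, 0.9])
has_cold_symptoms = fused >= 0.5
```

Weighted averaging also allows a validation set to assign more weight to whichever of the four networks proves more reliable.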
Preferably, the convolutional neural network of end-to-end neural network A comprises 8 modules, each containing a one-dimensional convolutional layer, a ReLU activation layer and a one-dimensional max-pooling layer; the kernel size of each one-dimensional convolutional layer is 32, and each one-dimensional max-pooling layer uses a pooling kernel of size 2 with a pooling stride of 2.
Preferably, the convolutional neural network of end-to-end neural network B comprises 6 modules, each containing a two-dimensional convolutional layer, a ReLU activation layer and a two-dimensional max-pooling layer; the first convolutional layer uses a 7*7 kernel, the second layer a 5*5 kernel, and the remaining 4 layers 3*3 kernels; all max-pooling layers use a 3*3 pooling kernel with a pooling stride of 2.
Preferably, the convolutional neural network of end-to-end neural network C comprises 6 modules, each containing a two-dimensional convolutional layer, a ReLU activation layer and a two-dimensional max-pooling layer; the first convolutional layer uses a 7*7 kernel, the second layer a 5*5 kernel, and the remaining 4 layers 3*3 kernels; all max-pooling layers use a 3*3 pooling kernel with a pooling stride of 2.
Compared with the prior art, the beneficial effects of the invention are as follows:
Existing recognition techniques all study features and pattern classification separately, which leads to mismatched features and models, difficult training, and features that are hard to find. The method provided by the present invention fuses four different end-to-end neural networks to unify feature learning and pattern classification, so that the whole speaker cold-symptom recognition process is simpler and faster, and it has broad application prospects.
Brief description of the drawings
Fig. 1 is a schematic diagram of the specific implementation of the method.
Fig. 2 is a flow chart of extracting mel-frequency cepstral coefficients (MFCC) from speech.
Fig. 3 is a flow chart of extracting constant-Q cepstral coefficients (CQCC) from speech.
Fig. 4 is a schematic diagram of end-to-end neural network A.
Fig. 5 is a schematic diagram of end-to-end neural network B.
Fig. 6 is a schematic diagram of end-to-end neural network C.
Fig. 7 is a schematic diagram of end-to-end neural network D.
Detailed description of the embodiments
The accompanying drawings are for illustrative purposes only and shall not be construed as limiting this patent.
The present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
Fig. 1 shows the implementation flow of the method provided by the present invention. As shown in Fig. 1, the speaker cold-symptom recognition method fusing multiple end-to-end neural network structures provided by the present invention comprises the following steps:
S1. Build and train end-to-end neural network A, whose input is the raw speech waveform and whose recognition network combines a convolutional neural network with a long short-term memory (LSTM) network;
S2. Build and train end-to-end neural network B, whose input is the speech spectrogram and whose recognition network combines a convolutional neural network with an LSTM network;
S3. Build and train end-to-end neural network C, whose input is the speech spectrogram and whose recognition network combines a convolutional neural network with a fully connected network;
S4. Build and train end-to-end neural network D, whose input is the speech MFCC or CQCC features and whose recognition network is an LSTM network, as shown in Fig. 7;
S5. Fuse the four trained end-to-end neural networks to perform speaker cold-symptom recognition.
As shown in Figs. 2 and 3, the MFCC features in step S4 are obtained by applying pre-emphasis, windowed framing, a fast Fourier transform, power spectral density computation, mel-scale triangular filter-bank filtering, a logarithm operation and a discrete cosine transform to the speech, while the CQCC features are obtained by applying a constant-Q transform, power spectral density computation, a logarithm operation and a discrete cosine transform to the speech.
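The MFCC chain of Fig. 2 (pre-emphasis, windowed framing, FFT, power spectrum, mel filter bank, logarithm, DCT) can be sketched in NumPy as follows. All numeric parameters (sample rate, frame length, hop, filter and coefficient counts) are illustrative defaults rather than values from the patent, and an unnormalized DCT-II is written out directly to keep the sketch self-contained.

```python
# Sketch of the MFCC extraction chain of Fig. 2; parameter values are
# illustrative assumptions, not specified by the patent.
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=13, preemph=0.97):
    # 1. Pre-emphasis filter y[t] = x[t] - preemph * x[t-1].
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # 2. Framing with a Hamming window (400-sample frame, 160-sample hop).
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # 3. FFT and power spectral density.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Mel-scale triangular filter bank.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, cen, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:cen] = (np.arange(lo, cen) - lo) / max(cen - lo, 1)
        fbank[m - 1, cen:hi] = (hi - np.arange(cen, hi)) / max(hi - cen, 1)
    # 5. Logarithm of the filter-bank energies.
    logfb = np.log(power @ fbank.T + 1e-10)
    # 6. DCT-II across filters, keeping the first n_ceps coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None] / n_mels)
    return logfb @ basis.T
```

The CQCC chain of Fig. 3 is analogous, with the FFT-based power spectrum replaced by a constant-Q transform and the mel filter bank omitted.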
In a specific implementation, as shown in Fig. 4, the convolutional neural network of end-to-end neural network A comprises 8 modules, each containing a one-dimensional convolutional layer, a ReLU activation layer and a one-dimensional max-pooling layer; the kernel size of each one-dimensional convolutional layer is 32, and each one-dimensional max-pooling layer uses a pooling kernel of size 2 with a pooling stride of 2.
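To make the effect of these eight modules concrete, the sketch below traces how the temporal dimension shrinks through network A's convolutional front end. Unpadded ("valid") convolution is an assumption, since the patent does not state the padding scheme, so the exact lengths are illustrative.

```python
# Sketch: temporal length after network A's 8 modules of
# 1-D convolution (kernel size 32) + max pooling (kernel 2, stride 2).
# 'Valid' (unpadded) convolution is an assumption; the patent does not
# specify padding.
def net_a_output_length(n_samples, n_modules=8, kernel=32, pool=2):
    length = n_samples
    for _ in range(n_modules):
        length -= kernel - 1   # valid 1-D convolution
        length //= pool        # non-overlapping max pooling, stride 2
    return length

# For a 1-second utterance at 16 kHz, the LSTM that follows sees a far
# shorter sequence of convolutional feature vectors than raw samples.
seq_len = net_a_output_length(16000)
```

Under these assumptions the pooling halves the sequence at every module, which is what lets the LSTM operate on a manageable number of time steps.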
In a specific implementation, as shown in Fig. 5, the convolutional neural network of end-to-end neural network B comprises 6 modules, each containing a two-dimensional convolutional layer, a ReLU activation layer and a two-dimensional max-pooling layer; the first convolutional layer uses a 7*7 kernel, the second layer a 5*5 kernel, and the remaining 4 layers 3*3 kernels; all max-pooling layers use a 3*3 pooling kernel with a pooling stride of 2.
In a specific implementation, as shown in Fig. 6, the convolutional neural network of end-to-end neural network C comprises 6 modules, each containing a two-dimensional convolutional layer, a ReLU activation layer and a two-dimensional max-pooling layer; the first convolutional layer uses a 7*7 kernel, the second layer a 5*5 kernel, and the remaining 4 layers 3*3 kernels; all max-pooling layers use a 3*3 pooling kernel with a pooling stride of 2.
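The spatial shrinkage through networks B and C can be traced the same way. The sketch below assumes "same"-padded convolutions (so only the 3*3/stride-2 pooling reduces the feature map) and an illustrative 257*300 spectrogram input; neither assumption comes from the patent.

```python
# Sketch: feature-map size after the 6 modules of networks B/C:
# 2-D convolution (7*7, 5*5, then four 3*3 kernels) + 3*3 max pooling
# with stride 2. 'Same'-padded convolution is an assumption, so only
# the pooling layers change the spatial size.
def net_bc_output_shape(height, width, n_modules=6, pool=3, stride=2):
    for _ in range(n_modules):
        height = (height - pool) // stride + 1
        width = (width - pool) // stride + 1
    return height, width

# E.g. a 257*300 spectrogram (frequency bins * frames) collapses to a
# small map that network B feeds to an LSTM and network C flattens into
# a fully connected layer.
shape = net_bc_output_shape(257, 300)
```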
Obviously, the above embodiments are merely examples given to clearly illustrate the present invention and are not intended to limit its embodiments. On the basis of the above description, those of ordinary skill in the art may make other changes in different forms; it is neither necessary nor possible to exhaust all embodiments here. Any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (4)

1. A speaker cold-symptom recognition method fusing multiple end-to-end neural network structures, characterized by comprising the following steps:
S1. building and training end-to-end neural network A, whose input is the raw speech waveform and whose recognition network combines a convolutional neural network with a long short-term memory (LSTM) network;
S2. building and training end-to-end neural network B, whose input is the speech spectrogram and whose recognition network combines a convolutional neural network with an LSTM network;
S3. building and training end-to-end neural network C, whose input is the speech spectrogram and whose recognition network combines a convolutional neural network with a fully connected network;
S4. building and training end-to-end neural network D, whose input is the speech MFCC or CQCC features and whose recognition network is an LSTM network;
S5. fusing the four trained end-to-end neural networks to perform speaker cold-symptom recognition.
2. The speaker cold-symptom recognition method fusing multiple end-to-end neural network structures according to claim 1, characterized in that the convolutional neural network of end-to-end neural network A comprises 8 modules, each containing a one-dimensional convolutional layer, a ReLU activation layer and a one-dimensional max-pooling layer; the kernel size of each one-dimensional convolutional layer is 32, and each one-dimensional max-pooling layer uses a pooling kernel of size 2 with a pooling stride of 2.
3. The speaker cold-symptom recognition method fusing multiple end-to-end neural network structures according to claim 1, characterized in that the convolutional neural network of end-to-end neural network B comprises 6 modules, each containing a two-dimensional convolutional layer, a ReLU activation layer and a two-dimensional max-pooling layer; the first convolutional layer uses a 7*7 kernel, the second layer a 5*5 kernel, and the remaining 4 layers 3*3 kernels; all max-pooling layers use a 3*3 pooling kernel with a pooling stride of 2.
4. The speaker cold-symptom recognition method fusing multiple end-to-end neural network structures according to claim 1, characterized in that the convolutional neural network of end-to-end neural network C comprises 6 modules, each containing a two-dimensional convolutional layer, a ReLU activation layer and a two-dimensional max-pooling layer; the first convolutional layer uses a 7*7 kernel, the second layer a 5*5 kernel, and the remaining 4 layers 3*3 kernels; all max-pooling layers use a 3*3 pooling kernel with a pooling stride of 2.
CN201710146957.0A 2017-03-13 2017-03-13 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures Pending CN107068167A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710146957.0A CN107068167A (en) 2017-03-13 2017-03-13 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures
PCT/CN2018/076272 WO2018166316A1 (en) 2017-03-13 2018-02-11 Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710146957.0A CN107068167A (en) 2017-03-13 2017-03-13 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures

Publications (1)

Publication Number Publication Date
CN107068167A 2017-08-18

Family

ID=59621946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710146957.0A Pending CN107068167A (en) 2017-03-13 2017-03-13 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures

Country Status (2)

Country Link
CN (1) CN107068167A (en)
WO (1) WO2018166316A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053841A * 2017-10-23 2018-05-18 平安科技(深圳)有限公司 Method and application server for disease prediction using voice
WO2018166316A1 * 2017-03-13 2018-09-20 佛山市顺德区中山大学研究院 Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures
CN108899051A * 2018-06-26 2018-11-27 北京大学深圳研究生院 A speech emotion recognition model and recognition method based on joint feature representation
CN109086892A * 2018-06-15 2018-12-25 中山大学 A visual question reasoning model and system based on general dependency trees
CN109192226A * 2018-06-26 2019-01-11 深圳大学 A signal processing method and device
CN109256118A * 2018-10-22 2019-01-22 江苏师范大学 End-to-end Chinese dialect identification system and method based on a generative auditory model
CN109282837A * 2018-10-24 2019-01-29 福州大学 Demodulation method for interleaved Bragg grating spectra based on an LSTM network
CN109960910A * 2017-12-14 2019-07-02 广东欧珀移动通信有限公司 Speech processing method, device, storage medium and terminal device
CN111028859A * 2019-12-15 2020-04-17 中北大学 Hybrid neural network vehicle type identification method based on audio feature fusion
CN116110437A * 2023-04-14 2023-05-12 天津大学 Pathological voice quality evaluation method based on fusion of voice features and speaker features

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018226844B2 (en) 2017-03-03 2021-11-18 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5214743A (en) * 1989-10-25 1993-05-25 Hitachi, Ltd. Information processing apparatus
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107068167A (en) * 2017-03-13 2017-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5214743A (en) * 1989-10-25 1993-05-25 Hitachi, Ltd. Information processing apparatus
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TARA N. SAINATH et al.: "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks", 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
杜朦旭 (Du Mengxu): "Research on Feature Extraction and Recognition of the Voice of Cold Patients" (感冒病人嗓音的特征提取与识别研究), China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018166316A1 (en) * 2017-03-13 2018-09-20 佛山市顺德区中山大学研究院 Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures
CN108053841A (en) * 2017-10-23 2018-05-18 平安科技(深圳)有限公司 The method and application server of disease forecasting are carried out using voice
CN109960910A (en) * 2017-12-14 2019-07-02 广东欧珀移动通信有限公司 Method of speech processing, device, storage medium and terminal device
CN109960910B (en) * 2017-12-14 2021-06-08 Oppo广东移动通信有限公司 Voice processing method, device, storage medium and terminal equipment
CN109086892B (en) * 2018-06-15 2022-02-18 中山大学 General dependency tree-based visual problem reasoning model and system
CN109086892A (en) * 2018-06-15 2018-12-25 中山大学 It is a kind of based on the visual problem inference pattern and system that typically rely on tree
CN108899051B (en) * 2018-06-26 2020-06-16 北京大学深圳研究生院 Speech emotion recognition model and recognition method based on joint feature representation
CN109192226A (en) * 2018-06-26 2019-01-11 深圳大学 A kind of signal processing method and device
CN108899051A (en) * 2018-06-26 2018-11-27 北京大学深圳研究生院 A kind of speech emotion recognition model and recognition methods based on union feature expression
CN109256118A (en) * 2018-10-22 2019-01-22 江苏师范大学 End-to-end Chinese dialects identifying system and method based on production auditory model
CN109256118B (en) * 2018-10-22 2021-06-25 江苏师范大学 End-to-end Chinese dialect identification system and method based on generative auditory model
CN109282837A (en) * 2018-10-24 2019-01-29 福州大学 Bragg grating based on LSTM network interlocks the demodulation method of spectrum
CN111028859A (en) * 2019-12-15 2020-04-17 中北大学 Hybrid neural network vehicle type identification method based on audio feature fusion
CN116110437A (en) * 2023-04-14 2023-05-12 天津大学 Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics

Also Published As

Publication number Publication date
WO2018166316A1 (en) 2018-09-20

Similar Documents

Publication Publication Date Title
CN107068167A Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures
Qian et al. Very deep convolutional neural networks for noise robust speech recognition
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
CN104732978B Text-dependent speaker recognition method based on joint deep learning
CN106847309A A speech emotion recognition method
CN105321525B A system and method for reducing VoIP communication resource overhead
CN106952649A Speaker recognition method based on convolutional neural networks and spectrograms
CN107146601A A back-end i-vector enhancement method for speaker recognition systems
CN108766419A An abnormal speech detection method based on deep learning
CN106952643A A recording device clustering method based on Gaussian mean supervectors and spectral clustering
CN109243467A Voiceprint model construction method, voiceprint recognition method and system
CN108847244A Voiceprint recognition method and system based on MFCC and an improved BP neural network
CN111048097B Twin-network voiceprint recognition method based on 3D convolution
CN109036460A Speech processing method and device based on a multi-model neural network
CN109785852A A method and system for enhancing a speaker's voice
CN109346084A Speaker recognition method based on deep stacked autoencoder networks
CN112017682A Single-channel speech system for simultaneous denoising and dereverberation
CN107039036A A high-quality speaker recognition method based on autoencoding deep belief networks
CN109559755A A speech enhancement method based on DNN noise classification
CN110544482A Single-channel speech separation system
CN106898355A A speaker recognition method based on two-stage modeling
Sukhwal et al. Comparative study of different classifiers based speaker recognition system using modified MFCC for noisy environment
CN113763965A Speaker identification method fusing multiple attention features
CN110189766A A neural-network-based voice style transfer method
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170818