CN111091809B - Regional accent recognition method and device based on depth feature fusion

Publication number: CN111091809B (other version: CN111091809A)
Application number: CN201911051663.5A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Active (granted)
Prior art keywords: voice, regional accent, recognized, category, SDC
Inventors: 计哲, 黄远, 高圣翔, 孙晓晨, 戚梦苑, 宁珊, 徐艳云
Assignees: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Priority: CN201911051663.5A, filed 2019-10-31

Classifications

    • G10L 15/005: Speech recognition; Language recognition
    • G06F 18/214: Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/253: Fusion techniques of extracted features
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to transportation)


Abstract

The invention provides a regional accent recognition method and device based on depth feature fusion. The method extracts bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized, then inputs both into a pre-trained support vector machine (SVM) classifier, which outputs the category of the speech. By extracting deep features of the speech in a multi-feature-fusion language identification system, fusing them with traditional SDC features and feeding the result to an SVM classifier, the invention achieves a more robust language identification function and a better classification effect on regional-accent Mandarin.

Description

Regional accent recognition method and device based on depth feature fusion
Technical Field
The invention belongs to the technical field of speech recognition, and in particular relates to a regional accent recognition method and device based on depth feature fusion.
Background
Speech recognition engines for Chinese continuous speech recognition, spoken keyword retrieval, speech-to-text conversion and similar tasks have matured over years of training, and achieve good recognition results for standard Mandarin over telephone channels.
In practice, however, a large volume of telephone speech carries obvious regional characteristics, such as Guangdong and Fujian accents. When such speech is processed by existing engines trained on standard Mandarin, recognition performance is comparatively poor and accuracy is low, seriously affecting both the recognition results and the discrimination of intent in the transcribed content. A language identification technique for regional accent classification is therefore needed to pre-classify and screen speech, so as to improve the efficiency and accuracy of downstream tasks such as speech recognition.
Disclosure of Invention
In order to overcome the above problems, or at least partially solve them, embodiments of the present invention provide a regional accent recognition method and device based on depth feature fusion.
According to a first aspect of the embodiments of the present invention, there is provided a regional accent recognition method based on depth feature fusion, including:
extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
On the basis of the above technical solution, the invention can be further improved as follows.
Further, extracting the BN features of the speech to be recognized includes:
inputting the speech to be recognized into a preset deep belief network (DBN) to obtain the BN features of the speech;
wherein the preset DBN is trained on samples containing each category of regional-accent Mandarin data and the corresponding extracted BN features.
Further, the preset DBN is obtained by training a deep belief network as follows:
training the DBN on a speech training set using the restricted Boltzmann machine (RBM) stacking method, where the training set contains each category of regional-accent Mandarin data and the corresponding extracted BN features;
after the RBM-stacked DBN has been trained, removing the network parameters after the bottleneck layer (the layer whose number of nodes is below a threshold), yielding the preset DBN.
Further, extracting the SDC features of the speech to be recognized includes:
extracting the mel-frequency cepstral coefficient (MFCC) feature vectors of the speech to be recognized;
obtaining the SDC features of the speech according to its MFCC feature vectors.
Further, obtaining the SDC features of the speech according to its MFCC features includes:
concatenating each MFCC feature vector of the speech with the corresponding delta vectors to form one feature vector of the SDC features, where the number of delta vectors equals the dimension of the MFCC feature vector;
each delta vector being the difference between two MFCC vectors: starting from the current frame, slide forward by a first set number of frames, then subtract the MFCC vector offset backward by a second set number of frames from the MFCC vector offset forward by the same number of frames.
Further, the Gaussian supervector (GSV) of each category of regional-accent Mandarin data is labeled as follows:
inputting the BN features and SDC features of each category of regional-accent Mandarin data into a preset Gaussian mixture model-universal background model (GMM-UBM), and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
labeling the corresponding category of regional-accent Mandarin data with the obtained GSV;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
According to a second aspect of the embodiments of the present invention, there is provided a regional accent recognition device based on depth feature fusion, including:
an extraction module for extracting the BN features and SDC features of the speech to be recognized;
an output module for inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
According to a third aspect of the embodiments of the present invention, there is also provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor, when executing the program, performing the regional accent recognition method of depth feature fusion provided by any of the possible implementations of the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is also provided a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the regional accent recognition method of depth feature fusion provided by any of the possible implementations of the first aspect.
Embodiments of the present invention provide a regional accent recognition method and device based on depth feature fusion: deep features of the speech are extracted and fused with traditional SDC features before classification, realizing a more robust language identification function and a better classification effect on regional-accent Mandarin.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a regional accent recognition method based on depth feature fusion according to an embodiment of the present invention;
Fig. 2 is a flow chart of MFCC feature extraction in an embodiment of the present invention;
Fig. 3 is a flow chart of SDC feature extraction in an embodiment of the present invention;
Fig. 4 is a flow chart of a training method for the GMM-UBM model according to an embodiment of the present invention;
Fig. 5 is a flow chart of the method for extracting the GSV of each category of regional-accent Mandarin data in an embodiment of the present invention;
Fig. 6 is an overall flow chart of the regional accent recognition method based on depth feature fusion according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a regional accent recognition device based on depth feature fusion according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of the overall structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In one embodiment of the present invention, a regional accent recognition method based on depth feature fusion is provided. Fig. 1 is a schematic overall flow chart of the method, which includes:
extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
It can be understood that the speech to be recognized in the embodiments of the invention is regional-accent Mandarin. To recognize its category more accurately, the BN features and the SDC features of the speech are extracted, fused, and input into a trained SVM classifier, which recognizes the category of the speech.
Embodiments of the invention adopt a language identification system that fuses multiple features (BN features and SDC features): deep features of the speech are extracted, fused with traditional SDC features, and input into an SVM classifier, realizing a more robust language identification function and a better classification effect on regional-accent Mandarin.
On the basis of the above embodiment, extracting the BN features of the speech to be recognized includes:
inputting the speech to be recognized into a preset deep belief network (DBN) to obtain the BN features of the speech;
wherein the preset DBN is trained on samples containing each category of regional-accent Mandarin data and the corresponding extracted BN features.
On the basis of the above embodiments, the preset DBN is obtained by training a deep belief network as follows:
training the DBN on a speech training set using the restricted Boltzmann machine (RBM) stacking method, where the training set contains each category of regional-accent Mandarin data and the corresponding extracted BN features;
after the RBM-stacked DBN has been trained, removing the network parameters after the bottleneck layer (the layer whose number of nodes is below a threshold), yielding the preset DBN.
It can be understood that the BN features of the speech to be recognized are extracted with a trained DBN. Training the DBN begins with constructing a speech training set, i.e., collecting the required regional-accent Mandarin data of each category and building a training set for each language model. Because the data come from the telecommunications network, the proportion of regional-accent Mandarin data that meets the requirements is very small; purely manual selection would involve too much redundant work to be feasible. Therefore, several computer-assisted measures are combined with manual labeling (annotating the category of each piece of regional-accent Mandarin data): a mature language identification system first screens and filters the data, and the model is repeatedly updated as data accumulate until the required data set size is reached.
Before the DBN is trained on the speech training set, voice activity detection is performed on each piece of speech in the training set to identify and filter out invalid segments mixed into the call audio, such as DTMF signal tones, ringback tones, music and other kinds of noise; the remaining valid speech is kept, and the BN features of each utterance are extracted. A simple sketch of such a detection step is given below.
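The patent does not fix a particular voice activity detection algorithm, so the following is only a minimal energy-threshold sketch of the screening idea; the frame length (25 ms at 8 kHz) and the threshold are assumptions, and reliably rejecting DTMF tones, ringback and music would need dedicated detectors.

```python
# Hedged VAD sketch: keep only frames whose energy lies within 35 dB of the
# loudest frame. Frame length and threshold are illustrative assumptions.
import numpy as np

def energy_vad(signal, frame_len=200, rel_thresh_db=35.0):
    """signal: 1-D float array of samples -> samples of the retained frames."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    energy_db = 10.0 * np.log10((frames ** 2).mean(axis=1) + 1e-12)
    keep = energy_db > energy_db.max() - rel_thresh_db
    return frames[keep].ravel()

valid = energy_vad(np.random.randn(16000))   # e.g. 2 s of 8 kHz audio
```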
After the valid speech is obtained, the DBN is trained on each category of regional-accent Mandarin data in the training set. In the embodiment of the invention, the DBN is trained by the restricted Boltzmann machine (RBM) stacking method: unsupervised learning proceeds layer by layer from bottom to top, and after all RBMs are trained, supervised fine-tuning is performed from top to bottom. Once DBN training is complete, the network parameters after the narrow bottleneck layer are removed, compressing the language information into a low dimension and yielding BN features suitable for language identification. The DBN contains several hidden layers, each with a number of nodes; the bottleneck layer is the narrow one. A minimal sketch of such a truncated extractor follows.
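As a concrete illustration, here is a minimal PyTorch sketch of such a bottleneck feature extractor. The layer sizes, input dimension and number of accent classes are assumptions, and the RBM layer-wise pretraining and supervised fine-tuning described above are omitted; only the post-training truncation at the bottleneck is shown.

```python
# Bottleneck (BN) feature extractor: a deep network with one narrow layer,
# truncated after training so the bottleneck activations serve as features.
import torch
import torch.nn as nn

class BottleneckDBN(nn.Module):
    def __init__(self, input_dim=56, bottleneck_dim=40, num_classes=8):
        super().__init__()
        # Hidden layers up to and including the narrow bottleneck layer.
        self.front = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, bottleneck_dim), nn.Sigmoid(),  # bottleneck
        )
        # Layers after the bottleneck, discarded once training is done.
        self.back = nn.Sequential(
            nn.Linear(bottleneck_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        return self.back(self.front(x))

    def extract_bn(self, x):
        # After fine-tuning, only self.front is kept: its output is the
        # low-dimensional BN feature used for language identification.
        with torch.no_grad():
            return self.front(x)

model = BottleneckDBN()
frames = torch.randn(100, 56)          # 100 frames of 56-dim acoustic input
bn_feats = model.extract_bn(frames)    # -> (100, 40) BN features
```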
On the basis of the foregoing embodiments, extracting the SDC features of the speech to be recognized includes:
extracting the mel-frequency cepstral coefficient (MFCC) feature vectors of the speech to be recognized;
obtaining the SDC features of the speech according to its MFCC feature vectors.
On the basis of the above embodiments, obtaining the SDC features of the speech according to its MFCC features includes:
concatenating each MFCC feature vector of the speech with the corresponding delta vectors to form one feature vector of the SDC features, where the number of delta vectors equals the dimension of the MFCC feature vector;
each delta vector being the difference between two MFCC vectors: starting from the current frame, slide forward by a first set number of frames, then subtract the MFCC vector offset backward by a second set number of frames from the MFCC vector offset forward by the same number of frames.
It can be understood that, for the speech to be recognized, both the BN features and the SDC features are extracted, where the SDC features are computed from the MFCC feature vectors.
Fig. 2 shows the flow of extracting MFCC features from speech data. Language identification is a typical classification problem: the language type is distinguished by extracting features of the speech at different levels. The most widely used features are acoustic-level ones, usually obtained from frame-segmented speech through a series of mathematical transformations; they reflect different time-frequency information of the speech signal, for example mel-frequency cepstral coefficients (Mel-frequency cepstral coefficient, MFCC) and shifted delta cepstrum (shifted delta cepstrum, SDC).
Cepstral analysis takes the inverse Fourier transform of the natural logarithm of the signal spectrum. Mel cepstral coefficients differ in that they are designed to reflect the auditory characteristics of the human ear. The relationship between the mel scale and the actual frequency can be expressed as:
Mel(f) = 2595 lg(1 + f/700);
where f is the actual frequency in Hz and Mel(f) is the corresponding mel-scale frequency used when warping the spectrum for MFCC computation. An illustrative extraction sketch follows.
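For illustration, the mel mapping above plus one possible MFCC extraction using the librosa library (an implementation choice, not named in the patent); the sampling rate and frame settings are assumptions chosen for telephone speech.

```python
# The mel-scale mapping above, plus MFCC extraction via librosa as one
# possible implementation. The 7 coefficients match the N = 7 used for SDC
# below; sample rate and frame settings are illustrative assumptions.
import numpy as np
import librosa

def hz_to_mel(f):
    """Mel(f) = 2595 * lg(1 + f / 700), the formula given above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

y, sr = librosa.load("utterance.wav", sr=8000)         # telephone-band speech
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=7,
                            n_fft=200, hop_length=80)  # 25 ms window, 10 ms shift
frames = mfcc.T                                        # (T, 7), layout used below
```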
Unlike speech recognition and voiceprint recognition, language identification, because of its particular nature, commonly uses shifted delta cepstrum (SDC) features derived from shifted differences of mel-frequency cepstral coefficients. On top of the MFCC features, SDC extraction is determined by a set of parameters: the MFCC dimension N of each frame, the number of frames d by which vectors are offset forward and backward in the difference operation, the number of frames P by which the window slides forward, and the number of delta vectors k.
In the embodiment of the present invention, {N, d, P, k} is typically {7, 1, 3, 7}; the SDC extraction process is shown in Fig. 3. Each SDC feature vector consists of the basic 7-dimensional MFCC feature vector concatenated with 7 delta vectors, giving a 56-dimensional (7 + 7×7) SDC feature vector, where each delta vector is the difference of two MFCC vectors offset by ±1 frame around a position reached by sliding forward in steps of 3 frames. To reduce channel and noise effects, and to facilitate Gaussian modeling, mean-variance normalization is also applied to the SDC features extracted from each piece of speech data. A NumPy sketch of this computation follows.
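A minimal NumPy sketch of the SDC computation with {N, d, P, k} = {7, 1, 3, 7}; the edge handling (clamping indices at utterance boundaries) is an implementation assumption.

```python
# SDC: for frame t, append k delta blocks c(t + i*P + d) - c(t + i*P - d),
# i = 0..k-1, to the base MFCC vector, giving N + k*N = 7 + 7*7 = 56 dims,
# then apply per-utterance mean-variance normalization as described above.
import numpy as np

def sdc(mfcc, d=1, p=3, k=7):
    """mfcc: (T, N) per-frame MFCC matrix -> (T, N + k*N) SDC features."""
    t_idx = np.arange(mfcc.shape[0])
    last = mfcc.shape[0] - 1
    blocks = [mfcc]
    for i in range(k):
        plus = np.clip(t_idx + i * p + d, 0, last)   # frames t + i*P + d
        minus = np.clip(t_idx + i * p - d, 0, last)  # frames t + i*P - d
        blocks.append(mfcc[plus] - mfcc[minus])
    feats = np.hstack(blocks)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

sdc_feats = sdc(np.random.randn(200, 7))             # -> shape (200, 56)
```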
On the basis of the above embodiments, the Gaussian supervector (GSV) of each category of regional-accent Mandarin data is labeled as follows:
inputting the BN features and SDC features of each category of regional-accent Mandarin data into a preset Gaussian mixture model-universal background model (GMM-UBM), and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
labeling the corresponding category of regional-accent Mandarin data with the obtained GSV;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
It can be understood that, having extracted the BN features and SDC features of the speech to be recognized as in the above embodiments, these features are input into the trained support vector machine (SVM) classifier, which outputs the category of the speech.
The Gaussian supervector-support vector machine (GSV-SVM) system is another common method of modeling acoustic features. The SVM is a classifier that separates samples with a maximum-margin hyperplane. Its advantage is that samples which are not linearly separable in a low-dimensional space can be mapped to a high-dimensional space, where an optimal hyperplane completes the classification; this gives the SVM good robustness. The mapping from the low-dimensional to the high-dimensional space rests on a key theoretical component of the SVM, the kernel function. The GSV-SVM system adopted in the embodiment of the invention uses a kernel based on Gaussian supervectors.
Another issue in using an SVM for language classification is that an SVM finds the optimal boundary between two sample spaces, while language classification is a multi-class task; we therefore take one language model as the positive class and the remaining languages as the negative class.
This modeling technique combines GMM modeling with SVM modeling: the support vectors obtained from SVM training are mapped back into the GMM model, so that the final language model exploits the discriminative information of the SVM and achieves good language identification performance.
Deep learning constructs and trains deep nonlinear network structures, combining low-level features into abstract high-level representations to discover distributed representations of the data. Its essence is to build a machine learning model with many hidden layers from massive training data, so as to learn more useful features and reach a classification accuracy unattainable by traditional linear methods.
The SVM classifier is trained on the Gaussian supervector (GSV) of each category of regional-accent Mandarin data in the speech training set.
Referring to Fig. 4, when the SVM classifier is trained, for each category of regional-accent Mandarin, the GSVs of that category's training data are taken as positive samples and the GSVs of the other categories' training data as negative samples. The standard SVM classification algorithm then yields a language model for each category of regional-accent Mandarin data, and the trained SVM can recognize the category of the speech to be recognized.
At recognition time, the BN features and SDC features of the speech to be recognized are extracted and input into the trained SVM classifier, which outputs the category of the speech. A scikit-learn sketch of this one-versus-rest scheme follows.
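A sketch of the one-versus-rest training in Python with scikit-learn. A linear kernel stands in for the Gaussian-supervector kernel, and the accent category names, sample counts and GSV dimensionality are illustrative assumptions.

```python
# One model per accent category: its GSVs are positives, all others negatives;
# at recognition time the category with the highest decision score wins.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
classes = ["guangdong", "fujian", "other"]       # assumed category names
gsv_dim, per_class = 64 * 56, 20                 # e.g. 64 Gaussians x 56 dims
X = rng.normal(size=(per_class * len(classes), gsv_dim))   # stand-in GSVs
y = np.repeat(classes, per_class)

models = {c: SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes}

def classify(test_gsv):
    """Return the accent category whose SVM scores the test GSV highest."""
    return max(classes, key=lambda c: models[c].decision_function(test_gsv[None])[0])

print(classify(rng.normal(size=gsv_dim)))
```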
The Gaussian mixture model-universal background model (GMM-UBM) system made a significant breakthrough in the field of speaker recognition and was then introduced into language identification systems. The Gaussian mixture model is a statistical model of probability density distributions, widely applied in fields such as image recognition and natural language understanding: complex data that fit no single distribution can often be well described by a weighted combination of Gaussian distributions. The parameters of a Gaussian mixture model are typically estimated with the expectation-maximization (EM) algorithm. After each language is modeled, the language category is decided by computing the likelihood of the features of the speech to be identified.
The universal background model is a Gaussian mixture model trained on speech drawn from every language and therefore independent of any specific language. Its significance is twofold. First, in practice a Gaussian mixture model needs many components (such as 256 or 1024) to describe complex speech features well, and the data volume of a single category is insufficient to train such a high-order mixture. Second, training each language's mixture model independently easily leaves the Gaussian components out of correspondence across languages, which harms subsequent decisions. With a universal background model, each language uses its own limited training data and obtains its mixture model by maximum a posteriori (Maximum A Posteriori, MAP) adaptation, so the components of the language-specific models stay aligned while training time is greatly reduced.
For the constructed SVM classifier, training is based on the GSV of each category of regional-accent Mandarin data. Referring to Fig. 5, the GSV of each category is extracted as follows: the Gaussian mixture model-universal background model (GMM-UBM) is first trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories;
then the BN features and SDC features of each category of regional-accent Mandarin data are input into the GMM-UBM, and the GSV of each category is obtained by maximum a posteriori (MAP) adaptation, as sketched below.
Referring to Fig. 6, the overall flow of the regional accent recognition method based on depth feature fusion divides into a training phase and a recognition phase. The training phase mainly collects regional accent data of the various categories to form the speech training data; the bottleneck BN features, shifted delta cepstrum SDC features, and Gaussian supervector GSV of each piece of training speech are extracted, and these features are used to train the SVM classifier.
In the recognition phase, the BN features and SDC features of the speech to be recognized are extracted and input into the trained SVM classifier, which recognizes the category of the speech.
In another embodiment of the present invention, a regional accent recognition device based on depth feature fusion is provided, which implements the method of the foregoing embodiments; the descriptions and definitions in those embodiments therefore apply to each module below. Fig. 7 is a schematic overall structure diagram of the device, which includes an extraction module 71 and an output module 72.
The extraction module 71 extracts the bottleneck (BN) features and shifted delta cepstrum (SDC) features of the speech to be recognized;
the output module 72 inputs the BN features and the SDC features into a pre-trained support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with the Gaussian supervector (GSV).
The device further includes:
a labeling module 73 for labeling the corresponding category of regional-accent Mandarin data based on the obtained GSV;
where the BN features and SDC features of each category of regional-accent Mandarin data are input into a preset Gaussian mixture model-universal background model (GMM-UBM) and the GSV of each category is obtained by maximum a posteriori (MAP) adaptation;
and the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
Fig. 8 illustrates the physical structure of an electronic device. As shown in Fig. 8, the device may include: a processor 01, a communications interface 02, a memory 03 and a communication bus 04, where the processor 01, communications interface 02 and memory 03 communicate with each other through the bus 04. The processor 01 may call logic instructions in the memory 03 to perform the following method:
extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
Further, the logic instructions in the memory 03 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a storage medium, comprising several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example: extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
In the regional accent recognition method and device based on depth feature fusion, a multi-feature-fusion language identification system extracts the deep features of the speech, fuses them with traditional SDC features, and inputs the result into an SVM classifier, realizing a more robust language identification function and a better classification effect on regional-accent Mandarin.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware associated with program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, those skilled in the art can clearly understand that the embodiments may be implemented by software plus the necessary general hardware platform, or of course by hardware. Based on this understanding, the above technical solution, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, including several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to execute the method of each embodiment or certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A regional accent recognition method based on depth feature fusion, characterized by comprising:
extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with the Gaussian supervector (GSV),
and the GSV of each category of regional-accent Mandarin data is labeled as follows:
inputting the BN features and SDC features of each category of regional-accent Mandarin data into a preset Gaussian mixture model-universal background model (GMM-UBM), and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
labeling the corresponding category of regional-accent Mandarin data with the obtained GSV;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
2. The regional accent recognition method according to claim 1, wherein extracting the BN features of the speech to be recognized comprises:
inputting the speech to be recognized into a preset deep belief network (DBN) to obtain the BN features of the speech;
wherein the preset DBN is trained on samples containing each category of regional-accent Mandarin data and the corresponding extracted BN features.
3. The regional accent recognition method according to claim 2, wherein the preset DBN is obtained by training a deep belief network as follows:
training the DBN on a speech training set using the restricted Boltzmann machine (RBM) stacking method, where the training set contains each category of regional-accent Mandarin data and the corresponding extracted BN features;
after the RBM-stacked DBN has been trained, removing the network parameters after the bottleneck layer (the layer whose number of nodes is below a threshold), yielding the preset DBN.
4. The regional accent recognition method according to claim 1, wherein extracting the SDC features of the speech to be recognized comprises:
extracting the mel-frequency cepstral coefficient (MFCC) feature vectors of the speech to be recognized;
obtaining the SDC features of the speech according to its MFCC feature vectors.
5. The regional accent recognition method according to claim 4, wherein obtaining the SDC features according to the MFCC features comprises:
concatenating each MFCC feature vector of the speech with the corresponding delta vectors to form one feature vector of the SDC features, where the number of delta vectors equals the dimension of the MFCC feature vector;
each delta vector being the difference between two MFCC vectors: starting from the current frame, sliding forward by a first set number of frames, then subtracting the MFCC vector offset backward by a second set number of frames from the MFCC vector offset forward by the same number of frames.
6. A regional accent recognition device based on depth feature fusion, characterized by comprising:
an extraction module for extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
an output module for inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with the Gaussian supervector (GSV);
and the GSV of each category of regional-accent Mandarin data is labeled as follows:
inputting the BN features and SDC features of each category of regional-accent Mandarin data into a preset Gaussian mixture model-universal background model (GMM-UBM), and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
labeling the corresponding category of regional-accent Mandarin data with the obtained GSV;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
7. The regional accent recognition device according to claim 6, further comprising:
a labeling module for labeling the corresponding category of regional-accent Mandarin data based on the obtained GSV;
the labeling module inputting the BN features and SDC features of each category of regional-accent Mandarin data into the preset GMM-UBM and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the regional accent recognition method based on depth feature fusion according to any one of claims 1 to 5.
9. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the regional accent recognition method based on depth feature fusion according to any one of claims 1 to 5.
Priority application: CN201911051663.5A, filed 2019-10-31 (priority date 2019-10-31), by the Institute of Information Engineering of CAS and the National Computer Network and Information Security Management Center.

Publications: CN111091809A, published 2020-05-01; CN111091809B, granted 2023-05-23. Status: Active.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant