CN111091809A - Regional accent recognition method and device based on depth feature fusion - Google Patents


Info

Publication number
CN111091809A
CN111091809A (application CN201911051663.5A)
Authority
CN
China
Prior art keywords
recognized
voice
regional accent
sdc
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911051663.5A
Other languages
Chinese (zh)
Other versions
CN111091809B (en)
Inventor
计哲
黄远
高圣翔
孙晓晨
戚梦苑
宁珊
徐艳云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, National Computer Network and Information Security Management Center filed Critical Institute of Information Engineering of CAS
Priority to CN201911051663.5A
Publication of CN111091809A
Application granted
Publication of CN111091809B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a regional accent recognition method and device based on depth feature fusion, wherein the method comprises the following steps: extracting the bottleneck BN feature and the sliding difference cepstrum SDC feature of the speech to be recognized; and inputting the bottleneck BN feature and the sliding difference cepstrum SDC feature into a pre-trained SVM classifier to obtain the output speech category of the speech to be recognized. The method adopts a multi-feature-fusion language identification system: it extracts the deep feature of the speech, fuses it with the traditional SDC feature, and inputs the result into an SVM classifier, realizing a more robust language identification function and obtaining a better classification effect on regional accent Mandarin.

Description

Regional accent recognition method and device based on depth feature fusion
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a regional accent recognition method and device with depth feature fusion.
Background
At present, after years of training, speech recognition engines for Chinese continuous speech recognition, spoken keyword retrieval, speech-to-text conversion and the like achieve good recognition results on standard Mandarin over telephone channels.
However, in practice a large amount of telephone speech carries obvious regional characteristics, such as Guangdong- and Fujian-accented Mandarin. When an existing speech recognition engine trained on standard Mandarin processes such speech, the recognition effect is relatively poor and the accuracy is low, seriously affecting both the transcription and the intent judgment of the transcribed content. A language recognition technology for regional spoken-language classification is therefore needed to pre-classify and screen the speech, so as to improve the efficiency and accuracy of subsequent tasks such as speech recognition.
Disclosure of Invention
To overcome the existing problems or at least partially solve the problems, embodiments of the present invention provide a regional accent recognition method and apparatus with depth feature fusion.
According to a first aspect of the embodiments of the present invention, there is provided a regional accent recognition method with depth feature fusion, including:
extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the output speech category of the speech to be recognized;
the preset SVM classifier is trained on samples of each category of regional accent Mandarin data labeled with Gaussian supervectors GSV.
On the basis of the technical scheme, the invention can be improved as follows.
Further, the extracting the bottleneck BN characteristics of the speech to be recognized includes:
inputting the speech to be recognized into a preset deep belief network DBN to obtain the output bottleneck BN characteristic of the speech to be recognized;
the preset deep belief network DBN is obtained by training a training sample containing regional accent mandarin data of each category and the extracted bottleneck BN characteristics.
Further, the preset deep belief network DBN is obtained by training the deep belief network DBN as follows:
learning and training the deep belief network DBN by utilizing a voice training set based on a restricted Boltzmann machine RBM stacking method, wherein the voice training set comprises regional accent Mandarin data of each category and extracted bottleneck BN characteristics;
and after the deep belief network DBN is trained based on the restricted Boltzmann machine RBM stacking method, removing the network parameters behind the bottleneck layer whose number of nodes is smaller than a threshold in the deep belief network DBN, to obtain the preset deep belief network DBN.
Further, the extracting the sliding difference cepstrum SDC features of the speech to be recognized includes:
extracting a Mel cepstrum coefficient MFCC feature vector of the voice to be recognized;
and obtaining sliding difference cepstrum SDC characteristics of the voice to be recognized according to the MFCC characteristic vectors of the voice to be recognized.
Further, the obtaining, according to the MFCC feature of the speech to be recognized, a sliding differential cepstrum SDC feature of the speech to be recognized includes:
splicing MFCC feature vectors of the speech to be recognized and corresponding differential vectors to form each feature vector of the SDC features, wherein the number of the differential vectors is the same as the dimension of the MFCC feature vectors;
each differential vector is obtained by subtracting a first vector and a second vector, wherein the first vector is obtained by shifting the MFCC feature vector forward by a first set number of frames and then shifting the MFCC feature vector forward by a second set number of frames, and the second vector is obtained by shifting the MFCC feature vector forward by the first set number of frames and then shifting the MFCC feature vector backward by the second set number of frames.
Further, the gaussian supervectors GSV of the regional accent mandarin data of each category are labeled as follows:
inputting the BN features and SDC features of each category of regional accent Mandarin data into a preset Gaussian mixture model-general background model GMM-UBM, and obtaining the Gaussian supervector GSV of each category of regional accent Mandarin data by the maximum posterior probability MAP self-adaptive method;
labeling corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
the preset Gaussian mixture model-general background model GMM-UBM is obtained by training through an expectation maximization EM algorithm based on different types of regional accent Mandarin data.
According to a second aspect of the embodiments of the present invention, there is provided a regional accent recognition apparatus with depth feature fusion, including:
the extraction module is used for extracting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic of the voice to be recognized;
the output module is used for inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the output voice category of the voice to be recognized;
the preset SVM classifier is trained on samples of each category of regional accent Mandarin data labeled with Gaussian supervectors GSV.
According to a third aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor calls the program instructions to perform the depth feature-fused regional accent recognition method provided in any one of the various possible implementations of the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is further provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method for depth feature-fused regional accent recognition provided in any one of the various possible implementations of the first aspect.
The embodiment of the invention provides a regional accent recognition method and device with depth feature fusion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a regional accent recognition method with depth feature fusion according to an embodiment of the present invention;
FIG. 2 is a flowchart of MFCC feature extraction in an embodiment of the present invention;
fig. 3 is a flow chart of SDC feature extraction according to an embodiment of the present invention;
FIG. 4 is a flow chart of a method for training a GMM-UBM model according to an embodiment of the present invention;
FIG. 5 is a flow chart of a GSV extraction method for each category of regional accent Mandarin data according to an embodiment of the present invention;
fig. 6 is a schematic overall flow chart of a regional accent recognition method with depth feature fusion according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a regional accent recognition apparatus with depth feature fusion according to an embodiment of the present invention;
fig. 8 is a schematic view of an overall structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In an embodiment of the present invention, a depth-feature-fused regional accent recognition method is provided, and fig. 1 is a schematic overall flow chart of the depth-feature-fused regional accent recognition method provided in the embodiment of the present invention, where the method includes:
extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the output speech category of the speech to be recognized;
the preset SVM classifier is trained on samples of each category of regional accent Mandarin data labeled with Gaussian supervectors GSV.
It can be understood that the speech to be recognized in the embodiment of the present invention is regional accent Mandarin. To recognize the class of the speech more accurately, the embodiment of the present invention extracts the bottleneck BN feature and the sliding difference cepstrum SDC feature of the speech to be recognized, inputs both features into the trained SVM classifier, and recognizes the class of the speech to be recognized through the SVM classifier.
The embodiment of the invention adopts a language identification system with multi-feature (BN feature and SDC feature) fusion: it extracts the deep feature of the speech, fuses it with the traditional SDC feature, and inputs the result into an SVM classifier, realizing a more robust language identification function and obtaining a better classification effect on regional accent Mandarin.
On the basis of the above embodiment, in the embodiment of the present invention, extracting the bottleneck BN feature of the speech to be recognized includes:
inputting the speech to be recognized into a preset deep belief network DBN to obtain the output bottleneck BN characteristic of the speech to be recognized;
the preset deep belief network DBN is obtained by training a training sample containing regional accent mandarin data of each category and the extracted bottleneck BN characteristics.
On the basis of the above embodiments, in the embodiments of the present invention, the preset deep belief network DBN is obtained by training the deep belief network DBN in the following manner:
learning and training the deep belief network DBN by utilizing a voice training set based on a restricted Boltzmann machine RBM stacking method, wherein the voice training set comprises regional accent Mandarin data of each category and extracted bottleneck BN characteristics;
and after the deep belief network DBN is trained based on the restricted Boltzmann machine RBM stacking method, removing the network parameters behind the bottleneck layer whose number of nodes is smaller than a threshold in the deep belief network DBN, to obtain the preset deep belief network DBN.
It can be understood that, in the embodiment of the present invention, the bottleneck BN feature of the speech to be recognized is extracted by a trained deep belief network DBN. To train the DBN, a speech training set is first constructed, that is, the required data of each category of regional accent Mandarin are collected to build a training set for each language model. Because the data come from the international telecommunication network, the proportion of regional accent Mandarin data that meets the requirements is very small; purely manual selection involves too much redundant work and is not feasible. Therefore, various computer-assisted measures are combined with manual labeling (marking the category of the regional accent Mandarin data): a mature language identification system is first used for screening and filtering, and after a certain amount of data has been accumulated, the model is updated repeatedly until the required scale of the data set is reached.
Before the deep belief network DBN is trained with the speech training set, voice activity detection is performed on each piece of speech data in the training set: invalid parts such as dual-tone multi-frequency (DTMF) signal tones, polyphonic ringtones, music and various other kinds of noise mixed into the conversational speech are identified and filtered out to obtain the effective speech, and then the BN feature of each piece of speech is extracted.
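The energy-based part of such voice activity detection can be illustrated with a minimal numpy sketch. The frame length and threshold below are assumed values, and the DTMF-tone, ringtone and music filtering described above is not modeled here; only silence removal by relative log-energy is shown:

```python
import numpy as np

def energy_vad(signal: np.ndarray, frame_len: int = 160, threshold_db: float = -35.0) -> np.ndarray:
    """Crude energy-based VAD: return a boolean mask of frames whose
    log-energy, relative to the loudest frame, exceeds a floor."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1) + 1e-12   # avoid log(0)
    rel_db = 10.0 * np.log10(energy / energy.max())
    return rel_db > threshold_db                   # True = keep as speech
```

Frames marked False would be dropped before feature extraction; a production system would add tone and music detectors on top of this.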
After the effective speech is obtained, the deep belief network DBN is trained on each category of regional accent Mandarin data in the speech training set. After the DBN training is completed, the network parameters behind the bottleneck layer (the layer with fewer nodes) are removed, yielding BN features that compress the language information into a low dimension and are suitable for language identification. The deep belief network DBN comprises a plurality of bottleneck layers, and each bottleneck layer comprises a plurality of nodes.
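The truncation step can be pictured as a forward pass that stops at the first layer narrower than the node threshold. The following numpy sketch is purely illustrative: the layer sizes, the 100-node threshold, and the random untrained weights are all assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 56-dim input, wide hidden layers, a narrow 40-node
# bottleneck, then layers that are discarded after training.
layer_sizes = [56, 512, 512, 40, 512, 8]
weights = [rng.standard_normal((a, b)) * 0.01
           for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(b) for b in layer_sizes[1:]]

BOTTLENECK_THRESHOLD = 100  # layers narrower than this count as the bottleneck

def bottleneck_features(x: np.ndarray) -> np.ndarray:
    """Forward-propagate up to and including the bottleneck layer only;
    everything after it is effectively removed."""
    h = x
    for w, b in zip(weights, biases):
        h = sigmoid(h @ w + b)
        if h.shape[-1] < BOTTLENECK_THRESHOLD:
            return h  # stop here: this is the BN feature
    return h
```

In a real system the weights would come from RBM-stacked pretraining plus fine-tuning; only the truncation logic is the point of this sketch.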
On the basis of the foregoing embodiments, in the embodiments of the present invention, the extracting a sliding differential cepstrum SDC feature of a speech to be recognized includes:
extracting a Mel cepstrum coefficient MFCC feature vector of the voice to be recognized;
and obtaining sliding difference cepstrum SDC characteristics of the voice to be recognized according to the MFCC characteristic vectors of the voice to be recognized.
On the basis of the foregoing embodiments, in an embodiment of the present invention, obtaining a sliding differential cepstrum SDC feature of a speech to be recognized according to an MFCC feature of the speech to be recognized includes:
splicing MFCC feature vectors of the speech to be recognized and corresponding differential vectors to form each feature vector of the SDC features, wherein the number of the differential vectors is the same as the dimension of the MFCC feature vectors;
each differential vector is obtained by subtracting a first vector and a second vector, wherein the first vector is obtained by shifting the MFCC feature vector forward by a first set number of frames and then shifting the MFCC feature vector forward by a second set number of frames, and the second vector is obtained by shifting the MFCC feature vector forward by the first set number of frames and then shifting the MFCC feature vector backward by the second set number of frames.
It will be appreciated that, for speech to be recognized, the BN features are extracted simultaneously with the sliding differential cepstral SDC features, which are computed from mel cepstral coefficient MFCC feature vectors.
Referring to fig. 2, which shows the flow of extracting MFCC features from speech data: language identification is a typical classification problem, and the language type is distinguished by extracting features of the speech to be recognized at different levels. The most widely used features are at the acoustic level; they are usually obtained through a series of mathematical transformations of the frame-segmented speech and reflect different time-frequency information of the speech signal, such as Mel-frequency cepstral coefficients (MFCC) and the sliding difference cepstrum (SDC).
Cepstrum analysis refers to the inverse Fourier transform of the natural logarithm of the signal spectrum; the mel-frequency cepstrum differs in that it focuses more on the auditory properties of the human ear. The relationship between the mel frequency and the actual frequency can be represented by the following formula:
Mel(f) = 2595 lg(1 + f/700);
where Mel(f) is the frequency on the mel scale and f is the actual frequency in Hz.
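A direct transcription of this formula (Python, with `lg` read as the base-10 logarithm):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Mel(f) = 2595 * lg(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

For example, 1000 Hz maps to roughly 1000 mel, which is how the constant 2595 anchors the scale.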
Unlike speech recognition and voiceprint recognition, because of the particularities of language recognition, the sliding difference cepstrum SDC feature, derived from differences of shifted mel cepstrum coefficients, is commonly used. SDC extraction is determined by a group of parameters on top of the MFCC feature: the MFCC dimension N of each frame of speech, the number of frames d shifted forwards and backwards in the difference operation, the forward sliding step P in frames, and the number of difference vectors k.
In the embodiment of the present invention, {N, d, P, k} is typically {7, 1, 3, 7}, and the process of extracting the SDC feature is shown in fig. 3. Each feature vector of the SDC feature is 56-dimensional (7 + 7 × 7), obtained by concatenating the basic 7-dimensional MFCC feature vector with 7 difference vectors, where each difference vector is the difference of the two MFCC vectors offset by ±1 frame around positions that slide forward in steps of 3 frames. To reduce the influence of channel and noise and to facilitate Gaussian modeling, mean-variance normalization is also performed on the SDC features extracted from each piece of speech data.
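The {N, d, P, k} scheme can be sketched as follows. This is an illustrative numpy implementation; edge-padding at utterance boundaries is an assumed convention, not specified by the patent:

```python
import numpy as np

def sdc_features(mfcc: np.ndarray, d: int = 1, p: int = 3, k: int = 7) -> np.ndarray:
    """Build SDC vectors (T, N + k*N) from MFCC frames (T, N).

    Each SDC vector concatenates the frame's N-dim MFCC with k difference
    vectors; the i-th difference is c[t + i*p + d] - c[t + i*p - d].
    Frames whose window runs past an end are handled by edge padding."""
    t_frames, n = mfcc.shape
    pad = (k - 1) * p + d                       # furthest forward index needed
    padded = np.pad(mfcc, ((d, pad), (0, 0)), mode="edge")
    out = np.empty((t_frames, n * (k + 1)))
    for t in range(t_frames):
        base = t + d                            # frame t's position after padding
        blocks = [padded[base]]                 # the basic MFCC vector
        for i in range(k):
            blocks.append(padded[base + i * p + d] - padded[base + i * p - d])
        out[t] = np.concatenate(blocks)
    return out
```

With the patent's {7, 1, 3, 7} defaults each output frame is 7 + 7 × 7 = 56-dimensional; mean-variance normalization would follow in practice.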
On the basis of the above embodiments, in the embodiments of the present invention, the gaussian supervectors GSV of the regional accent mandarin data of each category are labeled as follows:
inputting the BN features and SDC features of each category of regional accent Mandarin data into a preset Gaussian mixture model-general background model GMM-UBM, and obtaining the Gaussian supervector GSV of each category of regional accent Mandarin data by the maximum posterior probability MAP self-adaptive method;
labeling corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
the preset Gaussian mixture model-general background model GMM-UBM is obtained by training through an expectation maximization EM algorithm based on different types of regional accent Mandarin data.
It can be understood that, through the above embodiments, the bottleneck BN feature and the sliding difference cepstrum SDC feature of the speech to be recognized are respectively extracted, the BN feature and the SDC feature of the speech to be recognized are input into the trained support vector machine SVM classifier, and the type of the speech to be recognized is output by the SVM classifier.
A Gaussian supervector-support vector machine (GMM supervector-support vector machine, GSV-SVM) system is another common method for modeling acoustic features. The support vector machine SVM is a classifier that completes the sample classification task through a maximum-margin hyperplane; its advantage is that samples which cannot be linearly separated in a low-dimensional space can be mapped to a high-dimensional space, where the classification task is completed by an optimal hyperplane with better robustness. The mapping function that maps samples from the low-dimensional space to the high-dimensional space is the kernel function, one of the main theoretical foundations of the support vector machine. The GSV-SVM system adopted in the embodiment of the present invention uses a kernel function based on the Gaussian supervector.
Another issue in using an SVM for language classification is that the function of an SVM is to find the optimal decision surface between two sample spaces, while language classification is a multi-class task. Therefore, when training the model for a given language, the speech of that language is defined as positive samples and the speech of all other languages is set as negative samples.
The model-pushing technique combines GMM modeling with SVM modeling: the support vectors obtained from SVM training are pushed back into a GMM model, so that the final language model exploits the discriminative information of the SVM classification and achieves good language identification performance.
Deep learning builds and trains deep nonlinear network structures in which combinations of low-level features form abstract high-level feature representations, thereby discovering distributed feature representations of the data. Its essence is to construct machine learning models with many hidden layers and, from massive training data, learn more useful features, achieving classification accuracy that traditional linear methods cannot reach.
The SVM classifier is trained on the basis of the Gaussian super vector GSV of the regional accent Mandarin data of each category in the voice training set.
Referring to fig. 4, when the SVM classifier is trained, for each category of regional accent Mandarin data, the Gaussian supervectors GSV of the training data of that category are set as positive samples and the Gaussian supervectors GSV of the training data of the other categories as negative samples; language models for the different categories of regional accent Mandarin data are obtained through a standard SVM classification algorithm, and the category of the speech to be recognized can then be recognized by the trained SVM.
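The one-vs-rest scheme can be sketched with a minimal linear SVM trained by sub-gradient descent on the hinge loss. This is a toy stand-in for the "standard SVM classification algorithm" mentioned above; the patent's kernel is based on Gaussian supervectors and is not reproduced here:

```python
import numpy as np

def train_linear_svm(x, y, lam=0.01, lr=0.1, epochs=300):
    """Minimal linear SVM via sub-gradient descent on the regularized
    hinge loss; y must contain +1 / -1 labels. Returns (w, b)."""
    n, dim = x.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        viol = y * (x @ w + b) < 1.0                          # margin violators
        grad_w = lam * w - (y[viol, None] * x[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train_one_vs_rest(gsvs, labels, classes):
    """One SVM per accent category: that category's supervectors are the
    positive samples, every other category's are the negatives."""
    models = {}
    for c in classes:
        y = np.where(labels == c, 1.0, -1.0)
        models[c] = train_linear_svm(gsvs, y)
    return models

def classify(models, gsv):
    # pick the category whose model yields the largest decision score
    return max(models, key=lambda c: gsv @ models[c][0] + models[c][1])
```

At recognition time, a single supervector is scored against every category model and the highest-scoring category is returned.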
When the speech to be recognized is actually recognized, the BN feature and the SDC feature of the speech to be recognized are extracted and input into the trained SVM classifier, which outputs the category of the speech to be recognized.
Among them, the Gaussian mixture model-general background model (GMM-UBM) system made a major breakthrough in the field of speech recognition and was soon introduced into language recognition systems. The Gaussian mixture model is a statistical model of probability density distributions and is widely applied in fields such as image recognition and natural language understanding. Complex data that cannot be fitted by a single distribution can often be well described by a weighted combination of a series of Gaussian distributions. The parameters of a Gaussian mixture model are typically estimated by the expectation-maximization (EM) algorithm. After a model is established for each language, the language category is decided by computing the likelihood of the features of the speech to be recognized. The general background model is a Gaussian mixture model, independent of any specific language, trained on a portion of speech selected from every language; it serves two main purposes. First, in practice a large number of Gaussian components (e.g., 256 or 1024) is needed to describe complex speech features well, and the data volume of a single class is not enough to train a mixture of such high order. Second, if the Gaussian mixture model of each language is trained independently, the Gaussian components easily fail to correspond across languages, which affects the subsequent decision. With the general background model, each language can obtain its own Gaussian mixture model from its limited training data through the maximum a posteriori (MAP) self-adaptive method, which both aligns the components of each language's Gaussian model and greatly saves training time.
For the constructed SVM classifier, the SVM classifier is trained based on the Gaussian supervectors GSV of the regional accent Mandarin data of each category. Referring to fig. 5, the method of extracting the gaussian supervectors GSV of the regional accent mandarin data of each category is: training a Gaussian mixture model-general background model GMM-UBM through an expectation maximization EM algorithm based on different types of regional accent Mandarin data;
and inputting the BN characteristic and the SDC characteristic of the regional accent Mandarin data of each category into the GMM-UBM, and obtaining the Gaussian supervectors GSV of the regional accent Mandarin data of each category by a maximum posterior probability MAP self-adaptive method.
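A minimal numpy sketch of this MAP mean-adaptation step follows (diagonal-covariance UBM; the relevance factor r = 16 is a conventional assumed default, not a value from the patent). The adapted means are stacked into the Gaussian supervector:

```python
import numpy as np

def map_adapt_supervector(x, weights, means, variances, r=16.0):
    """MAP-adapt UBM means to utterance frames x (T, D) and stack the
    adapted means into a Gaussian supervector of shape (C*D,)."""
    c, d = means.shape
    # log-likelihood of each frame under each diagonal-covariance Gaussian
    log_det = np.sum(np.log(variances), axis=1)                  # (C,)
    diff = x[:, None, :] - means[None, :, :]                     # (T, C, D)
    log_p = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                    + log_det + d * np.log(2.0 * np.pi))         # (T, C)
    log_p += np.log(weights)
    # posterior responsibilities gamma[t, c] (log-sum-exp normalization)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    n_c = gamma.sum(axis=0)                                      # soft counts
    ex = (gamma.T @ x) / np.maximum(n_c[:, None], 1e-10)         # 1st-order stats
    alpha = n_c / (n_c + r)                                      # adaptation weight
    adapted = alpha[:, None] * ex + (1.0 - alpha[:, None]) * means
    return adapted.ravel()
```

Components with many responsible frames move toward the utterance statistics; rarely-used components stay near the UBM, which keeps supervectors of different utterances comparable dimension by dimension.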
Referring to fig. 6, the overall flowchart of the regional accent recognition method based on depth feature fusion is divided into a training stage and a recognition stage, where the training stage mainly collects various different types of regional accent data to form speech training data. And extracting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic of each piece of voice training data and the Gaussian super vector GSV of each piece of voice training data, and training the SVM classifier by adopting the characteristics.
And in the recognition stage, extracting the BN characteristic and the SDC characteristic of the voice to be recognized, inputting the BN characteristic and the SDC characteristic of the voice to be recognized into the trained SVM classifier, and recognizing the category of the voice to be recognized.
In another embodiment of the present invention, a regional accent recognition apparatus based on depth feature fusion is provided, which is used to implement the methods in the foregoing embodiments. Therefore, the descriptions and definitions in the embodiments of the regional accent recognition method based on depth feature fusion may be used for understanding the execution modules in the embodiments of the present invention. Fig. 7 is a schematic diagram of the overall structure of the regional accent recognition apparatus based on depth feature fusion according to an embodiment of the present invention; the apparatus includes an extraction module 71 and an output module 72.
The extraction module 71 is configured to extract a bottleneck BN feature and a sliding difference cepstrum SDC feature of the speech to be recognized;
the output module 72 is configured to input the bottleneck BN feature and the sliding difference cepstrum SDC feature into a pre-trained support vector machine SVM classifier to obtain the speech category of the speech to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
The device further comprises:
a labeling module 73, configured to label the corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
wherein the BN characteristic and the SDC characteristic of the regional accent Mandarin data of each category are input into a preset Gaussian mixture model-universal background model GMM-UBM, and the Gaussian supervector GSV of the regional accent Mandarin data of each category is obtained through a maximum a posteriori MAP adaptation method;
and the preset Gaussian mixture model-universal background model GMM-UBM is obtained by training with the expectation-maximization EM algorithm on regional accent Mandarin data of different categories.
Fig. 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 8, the electronic device may include: a processor (processor) 01, a communication interface (Communications Interface) 02, a memory (memory) 03 and a communication bus 04, wherein the processor 01, the communication interface 02 and the memory 03 communicate with one another through the communication bus 04. The processor 01 may call logic instructions in the memory 03 to perform the following method:
extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the voice category of the voice to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
In addition, the logic instructions in the memory 03 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example, including: extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the voice category of the voice to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
The regional accent recognition method and apparatus based on depth feature fusion provided by the embodiments of the present invention adopt a multi-feature-fusion language recognition system: the deep features of the speech are extracted, fused with the traditional SDC features, and input into an SVM classifier, achieving more robust language recognition and a better classification effect on regional-dialect Mandarin.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A regional accent recognition method with depth feature fusion is characterized by comprising the following steps:
extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the voice category of the voice to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
2. The regional accent recognition method of claim 1, wherein the extracting bottleneck BN features of the speech to be recognized comprises:
inputting the voice to be recognized into a preset Deep Belief Network (DBN) to obtain the bottleneck BN characteristic of the voice to be recognized;
the preset deep belief network DBN is obtained by training a training sample containing regional accent mandarin data of each category and the extracted bottleneck BN characteristics.
3. The regional accent recognition method of claim 2, wherein the preset deep belief network DBN is obtained by training the deep belief network DBN as follows:
learning and training the deep belief network DBN by utilizing a voice training set based on a restricted Boltzmann machine RBM stacking method, wherein the voice training set comprises regional accent Mandarin data of each category and extracted bottleneck BN characteristics;
and after the deep belief network DBN is trained based on the restricted Boltzmann machine RBM stacking method, removing the network parameters behind the bottleneck layer, i.e. the layer whose number of nodes is smaller than a threshold, in the deep belief network DBN, to obtain the preset deep belief network DBN.
4. The regional accent recognition method of claim 1, wherein the extracting sliding differential cepstrum SDC features of the speech to be recognized comprises:
extracting a Mel cepstrum coefficient MFCC feature vector of the voice to be recognized;
and obtaining sliding difference cepstrum SDC characteristics of the voice to be recognized according to the MFCC characteristic vectors of the voice to be recognized.
5. The regional accent recognition method of claim 4, wherein the obtaining of the sliding differential cepstrum SDC features of the speech to be recognized according to the MFCC feature vectors of the speech to be recognized comprises:
splicing MFCC feature vectors of the speech to be recognized and corresponding differential vectors to form each feature vector of the SDC features, wherein the number of the differential vectors is the same as the dimension of the MFCC feature vectors;
each differential vector is obtained by subtracting a second vector from a first vector, wherein the first vector is obtained by shifting the MFCC feature vector forward by a first set number of frames and then further forward by a second set number of frames, and the second vector is obtained by shifting the MFCC feature vector forward by the first set number of frames and then backward by the second set number of frames.
6. The regional accent recognition method of claim 1, wherein the Gaussian supervectors GSV of the regional accent Mandarin data of each category are labeled as follows:
inputting the BN characteristic and the SDC characteristic of the regional accent Mandarin data of each category into a preset Gaussian mixture model-universal background model GMM-UBM, and obtaining the Gaussian supervector GSV of the regional accent Mandarin data of each category by a maximum a posteriori MAP adaptation method;
labeling corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
wherein the preset Gaussian mixture model-universal background model GMM-UBM is obtained by training with the expectation-maximization EM algorithm on regional accent Mandarin data of different categories.
7. A regional accent recognition apparatus based on depth feature fusion, characterized by comprising:
the extraction module is used for extracting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic of the voice to be recognized;
the output module is configured to input the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the voice category of the voice to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
8. The regional accent recognition device of claim 7, further comprising:
the marking module is used for marking the corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
wherein the BN characteristic and the SDC characteristic of the regional accent Mandarin data of each category are input into a preset Gaussian mixture model-universal background model GMM-UBM, and the Gaussian supervector GSV of the regional accent Mandarin data of each category is obtained through a maximum a posteriori MAP adaptation method;
and the preset Gaussian mixture model-universal background model GMM-UBM is obtained by training with the expectation-maximization EM algorithm on regional accent Mandarin data of different categories.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the method for regional accent recognition with depth feature fusion according to any of claims 1 to 6.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, performs the steps of the depth feature fused regional accent recognition method of any one of claims 1 to 6.
CN201911051663.5A 2019-10-31 2019-10-31 Regional accent recognition method and device based on depth feature fusion Active CN111091809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911051663.5A CN111091809B (en) 2019-10-31 2019-10-31 Regional accent recognition method and device based on depth feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911051663.5A CN111091809B (en) 2019-10-31 2019-10-31 Regional accent recognition method and device based on depth feature fusion

Publications (2)

Publication Number Publication Date
CN111091809A true CN111091809A (en) 2020-05-01
CN111091809B CN111091809B (en) 2023-05-23

Family

ID=70393476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911051663.5A Active CN111091809B (en) 2019-10-31 2019-10-31 Regional accent recognition method and device based on depth feature fusion

Country Status (1)

Country Link
CN (1) CN111091809B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640419A (en) * 2020-05-26 2020-09-08 合肥讯飞数码科技有限公司 Language identification method, system, electronic equipment and storage medium
CN112233651A (en) * 2020-10-10 2021-01-15 深圳前海微众银行股份有限公司 Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112908295A (en) * 2021-02-02 2021-06-04 睿云联(厦门)网络通讯技术有限公司 Method and device for generating regional offline accent voice recognition system
CN112233651B (en) * 2020-10-10 2024-06-04 深圳前海微众银行股份有限公司 Dialect type determining method, device, equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012036934A1 (en) * 2010-09-15 2012-03-22 Microsoft Corporation Deep belief network for large vocabulary continuous speech recognition
CN103065622A (en) * 2012-12-20 2013-04-24 中国科学院声学研究所 Language model practicing method and system thereof for language recognition
CN103077709A (en) * 2012-12-28 2013-05-01 中国科学院声学研究所 Method and device for identifying languages based on common identification subspace mapping
US8583432B1 (en) * 2012-07-18 2013-11-12 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
US20170286407A1 (en) * 2016-04-01 2017-10-05 Samsung Electronics Co., Ltd. Device and method for voice translation
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN109817220A (en) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Audio recognition method, apparatus and system
CN109979432A (en) * 2019-04-02 2019-07-05 科大讯飞股份有限公司 A kind of dialect translation method and device
CN110164417A (en) * 2019-05-31 2019-08-23 科大讯飞股份有限公司 A kind of languages vector obtains, languages know method for distinguishing and relevant apparatus
US20190266998A1 (en) * 2017-06-12 2019-08-29 Ping An Technology(Shenzhen) Co., Ltd. Speech recognition method and device, computer device and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012036934A1 (en) * 2010-09-15 2012-03-22 Microsoft Corporation Deep belief network for large vocabulary continuous speech recognition
US8583432B1 (en) * 2012-07-18 2013-11-12 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
CN103065622A (en) * 2012-12-20 2013-04-24 中国科学院声学研究所 Language model practicing method and system thereof for language recognition
CN103077709A (en) * 2012-12-28 2013-05-01 中国科学院声学研究所 Method and device for identifying languages based on common identification subspace mapping
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
US20170286407A1 (en) * 2016-04-01 2017-10-05 Samsung Electronics Co., Ltd. Device and method for voice translation
US20190266998A1 (en) * 2017-06-12 2019-08-29 Ping An Technology(Shenzhen) Co., Ltd. Speech recognition method and device, computer device and storage medium
CN109817220A (en) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Audio recognition method, apparatus and system
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN109979432A (en) * 2019-04-02 2019-07-05 科大讯飞股份有限公司 A kind of dialect translation method and device
CN110164417A (en) * 2019-05-31 2019-08-23 科大讯飞股份有限公司 A kind of languages vector obtains, languages know method for distinguishing and relevant apparatus

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MAN-HUNG SIU, et al.: "Discriminatively Trained GMMs for Language Classification Using Boosting Methods", 《IEEE Transactions on Audio, Speech, and Language Processing》 *
崔瑞莲 et al.: "Language identification based on deep neural networks" (基于深度神经网络的语种识别) *
李晋徽 et al.: "A new bottleneck deep belief network based feature extraction method and its application in language identification" (一种新的基于瓶颈深度信念网络的特征提取方法及其在语种识别中的应用), 《计算机科学》 (Computer Science) *
王烨 et al.: "GSV-SVM dialect identification based on subspace mapping and score normalization" (基于子空间映射和得分规整的GSV-SVM方言识别), 《计算机工程与设计》 (Computer Engineering and Design) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640419A (en) * 2020-05-26 2020-09-08 合肥讯飞数码科技有限公司 Language identification method, system, electronic equipment and storage medium
CN111640419B (en) * 2020-05-26 2023-04-07 合肥讯飞数码科技有限公司 Language identification method, system, electronic equipment and storage medium
CN112233651A (en) * 2020-10-10 2021-01-15 深圳前海微众银行股份有限公司 Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112233651B (en) * 2020-10-10 2024-06-04 深圳前海微众银行股份有限公司 Dialect type determining method, device, equipment and storage medium
CN112908295A (en) * 2021-02-02 2021-06-04 睿云联(厦门)网络通讯技术有限公司 Method and device for generating regional offline accent voice recognition system
CN112908295B (en) * 2021-02-02 2023-05-16 睿云联(厦门)网络通讯技术有限公司 Generation method and device of regional offline accent voice recognition system

Also Published As

Publication number Publication date
CN111091809B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN109256150B (en) Speech emotion recognition system and method based on machine learning
CN108281137A (en) A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN106875936B (en) Voice recognition method and device
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
WO2014029099A1 (en) I-vector based clustering training data in speech recognition
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN107093422B (en) Voice recognition method and voice recognition system
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN108735200A (en) A kind of speaker's automatic marking method
CN111583906A (en) Role recognition method, device and terminal for voice conversation
CN116665676B (en) Semantic recognition method for intelligent voice outbound system
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN114678030A (en) Voiceprint identification method and device based on depth residual error network and attention mechanism
CN110706710A (en) Voice recognition method and device, electronic equipment and storage medium
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN106710588B (en) Speech data sentence recognition method, device and system
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
Ling An acoustic model for English speech recognition based on deep learning
CN112614510B (en) Audio quality assessment method and device
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN113744727A (en) Model training method, system, terminal device and storage medium
CN111326161B (en) Voiceprint determining method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant