CN111091809B - Regional accent recognition method and device based on depth feature fusion

Publication number: CN111091809B (other version: CN111091809A)
Application number: CN201911051663.5A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Active (granted)
Prior art keywords: voice, regional accent, recognized, category, SDC
Inventors: 计哲, 黄远, 高圣翔, 孙晓晨, 戚梦苑, 宁珊, 徐艳云
Assignees: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Priority: CN201911051663.5A, filed 2019-10-31

Classifications

    • G10L 15/005: Speech recognition; Language recognition
    • G06F 18/214: Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/253: Fusion techniques of extracted features
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to transportation)


Abstract

The invention provides a regional accent recognition method and device based on depth feature fusion. The method extracts bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized, then inputs both into a pre-trained support vector machine (SVM) classifier, which outputs the category of the speech. By extracting deep features of the speech in a multi-feature-fusion language identification system, fusing them with traditional SDC features and feeding the result to an SVM classifier, the invention achieves a more robust language identification function and a better classification effect on regional-accent Mandarin.

Description

Regional accent recognition method and device based on depth feature fusion
Technical Field
The invention belongs to the technical field of speech recognition, and in particular relates to a regional accent recognition method and device based on depth feature fusion.
Background
Speech recognition engines for Chinese continuous speech recognition, spoken keyword retrieval, speech-to-text conversion and similar tasks have matured over years of training, and achieve good recognition results for standard Mandarin over telephone channels.
In practice, however, a large volume of telephone speech carries obvious regional characteristics, such as Guangdong and Fujian accents. When such speech is processed by existing engines trained on standard Mandarin, recognition performance is comparatively poor and accuracy is low, seriously affecting both the recognition results and the discrimination of intent in the transcribed content. A language identification technique for regional accent classification is therefore needed to pre-classify and screen speech, so as to improve the efficiency and accuracy of downstream tasks such as speech recognition.
Disclosure of Invention
In order to overcome the above problems, or at least partially solve them, embodiments of the present invention provide a regional accent recognition method and device based on depth feature fusion.
According to a first aspect of the embodiments of the present invention, there is provided a regional accent recognition method based on depth feature fusion, including:
extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
On the basis of the above technical solution, the invention can be further improved as follows.
Further, extracting the BN features of the speech to be recognized includes:
inputting the speech to be recognized into a preset deep belief network (DBN) to obtain the BN features of the speech;
wherein the preset DBN is trained on samples containing each category of regional-accent Mandarin data and the corresponding extracted BN features.
Further, the preset DBN is obtained by training a deep belief network as follows:
training the DBN on a speech training set using the restricted Boltzmann machine (RBM) stacking method, where the training set contains each category of regional-accent Mandarin data and the corresponding extracted BN features;
after the RBM-stacked DBN has been trained, removing the network parameters after the bottleneck layer (the layer whose number of nodes is below a threshold), yielding the preset DBN.
Further, extracting the SDC features of the speech to be recognized includes:
extracting the mel-frequency cepstral coefficient (MFCC) feature vectors of the speech to be recognized;
obtaining the SDC features of the speech according to its MFCC feature vectors.
Further, obtaining the SDC features of the speech according to its MFCC features includes:
concatenating each MFCC feature vector of the speech with the corresponding delta vectors to form one feature vector of the SDC features, where the number of delta vectors equals the dimension of the MFCC feature vector;
each delta vector being the difference between two MFCC vectors: starting from the current frame, slide forward by a first set number of frames, then subtract the MFCC vector offset backward by a second set number of frames from the MFCC vector offset forward by the same number of frames.
Further, the Gaussian supervector (GSV) of each category of regional-accent Mandarin data is labeled as follows:
inputting the BN features and SDC features of each category of regional-accent Mandarin data into a preset Gaussian mixture model-universal background model (GMM-UBM), and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
labeling the corresponding category of regional-accent Mandarin data with the obtained GSV;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
According to a second aspect of the embodiments of the present invention, there is provided a regional accent recognition device based on depth feature fusion, including:
an extraction module for extracting the BN features and SDC features of the speech to be recognized;
an output module for inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
According to a third aspect of the embodiments of the present invention, there is also provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor, when executing the program, performing the regional accent recognition method of depth feature fusion provided by any of the possible implementations of the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is also provided a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the regional accent recognition method of depth feature fusion provided by any of the possible implementations of the first aspect.
Embodiments of the present invention provide a regional accent recognition method and device based on depth feature fusion: deep features of the speech are extracted and fused with traditional SDC features before classification, realizing a more robust language identification function and a better classification effect on regional-accent Mandarin.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a regional accent recognition method based on depth feature fusion according to an embodiment of the present invention;
Fig. 2 is a flow chart of MFCC feature extraction in an embodiment of the present invention;
Fig. 3 is a flow chart of SDC feature extraction in an embodiment of the present invention;
Fig. 4 is a flow chart of a training method for the GMM-UBM model according to an embodiment of the present invention;
Fig. 5 is a flow chart of the method for extracting the GSV of each category of regional-accent Mandarin data in an embodiment of the present invention;
Fig. 6 is an overall flow chart of the regional accent recognition method based on depth feature fusion according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a regional accent recognition device based on depth feature fusion according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of the overall structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In one embodiment of the present invention, a regional accent recognition method based on depth feature fusion is provided. Fig. 1 is a schematic overall flow chart of the method, which includes:
extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
It can be understood that the speech to be recognized in the embodiments of the invention is regional-accent Mandarin. To recognize its category more accurately, the BN features and the SDC features of the speech are extracted, fused, and input into a trained SVM classifier, which recognizes the category of the speech.
Embodiments of the invention adopt a language identification system that fuses multiple features (BN features and SDC features): deep features of the speech are extracted, fused with traditional SDC features, and input into an SVM classifier, realizing a more robust language identification function and a better classification effect on regional-accent Mandarin.
On the basis of the above embodiment, extracting the BN features of the speech to be recognized includes:
inputting the speech to be recognized into a preset deep belief network (DBN) to obtain the BN features of the speech;
wherein the preset DBN is trained on samples containing each category of regional-accent Mandarin data and the corresponding extracted BN features.
On the basis of the above embodiments, the preset DBN is obtained by training a deep belief network as follows:
training the DBN on a speech training set using the restricted Boltzmann machine (RBM) stacking method, where the training set contains each category of regional-accent Mandarin data and the corresponding extracted BN features;
after the RBM-stacked DBN has been trained, removing the network parameters after the bottleneck layer (the layer whose number of nodes is below a threshold), yielding the preset DBN.
It can be understood that the BN features of the speech to be recognized are extracted with a trained DBN. Training the DBN begins with constructing a speech training set, i.e., collecting the required regional-accent Mandarin data of each category and building a training set for each language model. Because the data come from the telecommunications network, the proportion of regional-accent Mandarin data that meets the requirements is very small; purely manual selection would involve too much redundant work to be feasible. Therefore, several computer-assisted measures are combined with manual labeling (annotating the category of each piece of regional-accent Mandarin data): a mature language identification system first screens and filters the data, and the model is repeatedly updated as data accumulate until the required data set size is reached.
Before the DBN is trained on the speech training set, voice activity detection is performed on each piece of speech in the training set to identify and filter out invalid segments mixed into the call audio, such as DTMF signal tones, ringback tones, music and other kinds of noise; the remaining valid speech is kept, and the BN features of each utterance are extracted. A simple sketch of such a detection step is given below.
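The patent does not fix a particular voice activity detection algorithm, so the following is only a minimal energy-threshold sketch of the screening idea; the frame length (25 ms at 8 kHz) and the threshold are assumptions, and reliably rejecting DTMF tones, ringback and music would need dedicated detectors.

```python
# Hedged VAD sketch: keep only frames whose energy lies within 35 dB of the
# loudest frame. Frame length and threshold are illustrative assumptions.
import numpy as np

def energy_vad(signal, frame_len=200, rel_thresh_db=35.0):
    """signal: 1-D float array of samples -> samples of the retained frames."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    energy_db = 10.0 * np.log10((frames ** 2).mean(axis=1) + 1e-12)
    keep = energy_db > energy_db.max() - rel_thresh_db
    return frames[keep].ravel()

valid = energy_vad(np.random.randn(16000))   # e.g. 2 s of 8 kHz audio
```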
After the valid speech is obtained, the DBN is trained on each category of regional-accent Mandarin data in the training set. In the embodiment of the invention, the DBN is trained by the restricted Boltzmann machine (RBM) stacking method: unsupervised learning proceeds layer by layer from bottom to top, and after all RBMs are trained, supervised fine-tuning is performed from top to bottom. Once DBN training is complete, the network parameters after the narrow bottleneck layer are removed, compressing the language information into a low dimension and yielding BN features suitable for language identification. The DBN contains several hidden layers, each with a number of nodes; the bottleneck layer is the narrow one. A minimal sketch of such a truncated extractor follows.
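As a concrete illustration, here is a minimal PyTorch sketch of such a bottleneck feature extractor. The layer sizes, input dimension and number of accent classes are assumptions, and the RBM layer-wise pretraining and supervised fine-tuning described above are omitted; only the post-training truncation at the bottleneck is shown.

```python
# Bottleneck (BN) feature extractor: a deep network with one narrow layer,
# truncated after training so the bottleneck activations serve as features.
import torch
import torch.nn as nn

class BottleneckDBN(nn.Module):
    def __init__(self, input_dim=56, bottleneck_dim=40, num_classes=8):
        super().__init__()
        # Hidden layers up to and including the narrow bottleneck layer.
        self.front = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, bottleneck_dim), nn.Sigmoid(),  # bottleneck
        )
        # Layers after the bottleneck, discarded once training is done.
        self.back = nn.Sequential(
            nn.Linear(bottleneck_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        return self.back(self.front(x))

    def extract_bn(self, x):
        # After fine-tuning, only self.front is kept: its output is the
        # low-dimensional BN feature used for language identification.
        with torch.no_grad():
            return self.front(x)

model = BottleneckDBN()
frames = torch.randn(100, 56)          # 100 frames of 56-dim acoustic input
bn_feats = model.extract_bn(frames)    # -> (100, 40) BN features
```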
On the basis of the foregoing embodiments, extracting the SDC features of the speech to be recognized includes:
extracting the mel-frequency cepstral coefficient (MFCC) feature vectors of the speech to be recognized;
obtaining the SDC features of the speech according to its MFCC feature vectors.
On the basis of the above embodiments, obtaining the SDC features of the speech according to its MFCC features includes:
concatenating each MFCC feature vector of the speech with the corresponding delta vectors to form one feature vector of the SDC features, where the number of delta vectors equals the dimension of the MFCC feature vector;
each delta vector being the difference between two MFCC vectors: starting from the current frame, slide forward by a first set number of frames, then subtract the MFCC vector offset backward by a second set number of frames from the MFCC vector offset forward by the same number of frames.
It can be understood that, for the speech to be recognized, both the BN features and the SDC features are extracted, where the SDC features are computed from the MFCC feature vectors.
Fig. 2 shows the flow of extracting MFCC features from speech data. Language identification is a typical classification problem: the language type is distinguished by extracting features of the speech at different levels. The most widely used features are acoustic-level ones, usually obtained from frame-segmented speech through a series of mathematical transformations; they reflect different time-frequency information of the speech signal, for example mel-frequency cepstral coefficients (Mel-frequency cepstral coefficient, MFCC) and shifted delta cepstrum (shifted delta cepstrum, SDC).
Cepstral analysis takes the inverse Fourier transform of the natural logarithm of the signal spectrum. Mel cepstral coefficients differ in that they are designed to reflect the auditory characteristics of the human ear. The relationship between the mel scale and the actual frequency can be expressed as:
Mel(f) = 2595 lg(1 + f/700);
where f is the actual frequency in Hz and Mel(f) is the corresponding mel-scale frequency used when warping the spectrum for MFCC computation. An illustrative extraction sketch follows.
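For illustration, the mel mapping above plus one possible MFCC extraction using the librosa library (an implementation choice, not named in the patent); the sampling rate and frame settings are assumptions chosen for telephone speech.

```python
# The mel-scale mapping above, plus MFCC extraction via librosa as one
# possible implementation. The 7 coefficients match the N = 7 used for SDC
# below; sample rate and frame settings are illustrative assumptions.
import numpy as np
import librosa

def hz_to_mel(f):
    """Mel(f) = 2595 * lg(1 + f / 700), the formula given above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

y, sr = librosa.load("utterance.wav", sr=8000)         # telephone-band speech
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=7,
                            n_fft=200, hop_length=80)  # 25 ms window, 10 ms shift
frames = mfcc.T                                        # (T, 7), layout used below
```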
Unlike speech recognition and voiceprint recognition, language identification, because of its particular nature, commonly uses shifted delta cepstrum (SDC) features derived from shifted differences of mel-frequency cepstral coefficients. On top of the MFCC features, SDC extraction is determined by a set of parameters: the MFCC dimension N of each frame, the number of frames d by which vectors are offset forward and backward in the difference operation, the number of frames P by which the window slides forward, and the number of delta vectors k.
In the embodiment of the present invention, {N, d, P, k} is typically {7, 1, 3, 7}; the SDC extraction process is shown in Fig. 3. Each SDC feature vector consists of the basic 7-dimensional MFCC feature vector concatenated with 7 delta vectors, giving a 56-dimensional (7 + 7×7) SDC feature vector, where each delta vector is the difference of two MFCC vectors offset by ±1 frame around a position reached by sliding forward in steps of 3 frames. To reduce channel and noise effects, and to facilitate Gaussian modeling, mean-variance normalization is also applied to the SDC features extracted from each piece of speech data. A NumPy sketch of this computation follows.
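A minimal NumPy sketch of the SDC computation with {N, d, P, k} = {7, 1, 3, 7}; the edge handling (clamping indices at utterance boundaries) is an implementation assumption.

```python
# SDC: for frame t, append k delta blocks c(t + i*P + d) - c(t + i*P - d),
# i = 0..k-1, to the base MFCC vector, giving N + k*N = 7 + 7*7 = 56 dims,
# then apply per-utterance mean-variance normalization as described above.
import numpy as np

def sdc(mfcc, d=1, p=3, k=7):
    """mfcc: (T, N) per-frame MFCC matrix -> (T, N + k*N) SDC features."""
    t_idx = np.arange(mfcc.shape[0])
    last = mfcc.shape[0] - 1
    blocks = [mfcc]
    for i in range(k):
        plus = np.clip(t_idx + i * p + d, 0, last)   # frames t + i*P + d
        minus = np.clip(t_idx + i * p - d, 0, last)  # frames t + i*P - d
        blocks.append(mfcc[plus] - mfcc[minus])
    feats = np.hstack(blocks)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

sdc_feats = sdc(np.random.randn(200, 7))             # -> shape (200, 56)
```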
On the basis of the above embodiments, the Gaussian supervector (GSV) of each category of regional-accent Mandarin data is labeled as follows:
inputting the BN features and SDC features of each category of regional-accent Mandarin data into a preset Gaussian mixture model-universal background model (GMM-UBM), and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
labeling the corresponding category of regional-accent Mandarin data with the obtained GSV;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
It can be understood that, having extracted the BN features and SDC features of the speech to be recognized as in the above embodiments, these features are input into the trained support vector machine (SVM) classifier, which outputs the category of the speech.
The Gaussian supervector-support vector machine (GSV-SVM) system is another common method of modeling acoustic features. The SVM is a classifier that separates samples with a maximum-margin hyperplane. Its advantage is that samples which are not linearly separable in a low-dimensional space can be mapped to a high-dimensional space, where an optimal hyperplane completes the classification; this gives the SVM good robustness. The mapping from the low-dimensional to the high-dimensional space rests on a key theoretical component of the SVM, the kernel function. The GSV-SVM system adopted in the embodiment of the invention uses a kernel based on Gaussian supervectors.
Another issue in using an SVM for language classification is that an SVM finds the optimal boundary between two sample spaces, while language classification is a multi-class task; we therefore take one language model as the positive class and the remaining languages as the negative class.
This modeling technique combines GMM modeling with SVM modeling: the support vectors obtained from SVM training are mapped back into the GMM model, so that the final language model exploits the discriminative information of the SVM and achieves good language identification performance.
Deep learning constructs and trains deep nonlinear network structures, combining low-level features into abstract high-level representations to discover distributed representations of the data. Its essence is to build a machine learning model with many hidden layers from massive training data, so as to learn more useful features and reach a classification accuracy unattainable by traditional linear methods.
The SVM classifier is trained on the Gaussian supervector (GSV) of each category of regional-accent Mandarin data in the speech training set.
Referring to Fig. 4, when the SVM classifier is trained, for each category of regional-accent Mandarin, the GSVs of that category's training data are taken as positive samples and the GSVs of the other categories' training data as negative samples. The standard SVM classification algorithm then yields a language model for each category of regional-accent Mandarin data, and the trained SVM can recognize the category of the speech to be recognized.
At recognition time, the BN features and SDC features of the speech to be recognized are extracted and input into the trained SVM classifier, which outputs the category of the speech. A scikit-learn sketch of this one-versus-rest scheme follows.
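A sketch of the one-versus-rest training in Python with scikit-learn. A linear kernel stands in for the Gaussian-supervector kernel, and the accent category names, sample counts and GSV dimensionality are illustrative assumptions.

```python
# One model per accent category: its GSVs are positives, all others negatives;
# at recognition time the category with the highest decision score wins.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
classes = ["guangdong", "fujian", "other"]       # assumed category names
gsv_dim, per_class = 64 * 56, 20                 # e.g. 64 Gaussians x 56 dims
X = rng.normal(size=(per_class * len(classes), gsv_dim))   # stand-in GSVs
y = np.repeat(classes, per_class)

models = {c: SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes}

def classify(test_gsv):
    """Return the accent category whose SVM scores the test GSV highest."""
    return max(classes, key=lambda c: models[c].decision_function(test_gsv[None])[0])

print(classify(rng.normal(size=gsv_dim)))
```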
The Gaussian mixture model-universal background model (GMM-UBM) system made a significant breakthrough in the field of speaker recognition and was then introduced into language identification systems. The Gaussian mixture model is a statistical model of probability density distributions, widely applied in fields such as image recognition and natural language understanding: complex data that fit no single distribution can often be well described by a weighted combination of Gaussian distributions. The parameters of a Gaussian mixture model are typically estimated with the expectation-maximization (EM) algorithm. After each language is modeled, the language category is decided by computing the likelihood of the features of the speech to be identified.
The universal background model is a Gaussian mixture model trained on speech drawn from every language and therefore independent of any specific language. Its significance is twofold. First, in practice a Gaussian mixture model needs many components (such as 256 or 1024) to describe complex speech features well, and the data volume of a single category is insufficient to train such a high-order mixture. Second, training each language's mixture model independently easily leaves the Gaussian components out of correspondence across languages, which harms subsequent decisions. With a universal background model, each language uses its own limited training data and obtains its mixture model by maximum a posteriori (Maximum A Posteriori, MAP) adaptation, so the components of the language-specific models stay aligned while training time is greatly reduced.
For the constructed SVM classifier, training is based on the GSV of each category of regional-accent Mandarin data. Referring to Fig. 5, the GSV of each category is extracted as follows: the Gaussian mixture model-universal background model (GMM-UBM) is first trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories;
then the BN features and SDC features of each category of regional-accent Mandarin data are input into the GMM-UBM, and the GSV of each category is obtained by maximum a posteriori (MAP) adaptation, as sketched below.
Referring to Fig. 6, the overall flow of the regional accent recognition method based on depth feature fusion divides into a training phase and a recognition phase. The training phase mainly collects regional accent data of the various categories to form the speech training data; the bottleneck BN features, shifted delta cepstrum SDC features, and Gaussian supervector GSV of each piece of training speech are extracted, and these features are used to train the SVM classifier.
In the recognition phase, the BN features and SDC features of the speech to be recognized are extracted and input into the trained SVM classifier, which recognizes the category of the speech.
In another embodiment of the present invention, a regional accent recognition device based on depth feature fusion is provided, which implements the method of the foregoing embodiments; the descriptions and definitions in those embodiments therefore apply to each module below. Fig. 7 is a schematic overall structure diagram of the device, which includes an extraction module 71 and an output module 72.
The extraction module 71 extracts the bottleneck (BN) features and shifted delta cepstrum (SDC) features of the speech to be recognized;
the output module 72 inputs the BN features and the SDC features into a pre-trained support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with the Gaussian supervector (GSV).
The device further includes:
a labeling module 73 for labeling the corresponding category of regional-accent Mandarin data based on the obtained GSV;
where the BN features and SDC features of each category of regional-accent Mandarin data are input into a preset Gaussian mixture model-universal background model (GMM-UBM) and the GSV of each category is obtained by maximum a posteriori (MAP) adaptation;
and the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
Fig. 8 illustrates the physical structure of an electronic device. As shown in Fig. 8, the device may include: a processor 01, a communications interface 02, a memory 03 and a communication bus 04, where the processor 01, communications interface 02 and memory 03 communicate with each other through the bus 04. The processor 01 may call logic instructions in the memory 03 to perform the following method:
extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
Further, the logic instructions in the memory 03 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a storage medium, comprising several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example: extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with Gaussian supervectors (GSV).
In the regional accent recognition method and device based on depth feature fusion, a multi-feature-fusion language identification system extracts the deep features of the speech, fuses them with traditional SDC features, and inputs the result into an SVM classifier, realizing a more robust language identification function and a better classification effect on regional-accent Mandarin.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware associated with program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, those skilled in the art can clearly understand that the embodiments may be implemented by software plus the necessary general hardware platform, or of course by hardware. Based on this understanding, the above technical solution, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, including several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to execute the method of each embodiment or certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A regional accent recognition method based on depth feature fusion, characterized by comprising:
extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with the Gaussian supervector (GSV),
and the GSV of each category of regional-accent Mandarin data is labeled as follows:
inputting the BN features and SDC features of each category of regional-accent Mandarin data into a preset Gaussian mixture model-universal background model (GMM-UBM), and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
labeling the corresponding category of regional-accent Mandarin data with the obtained GSV;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
2. The regional accent recognition method according to claim 1, wherein extracting the BN features of the speech to be recognized comprises:
inputting the speech to be recognized into a preset deep belief network (DBN) to obtain the BN features of the speech;
wherein the preset DBN is trained on samples containing each category of regional-accent Mandarin data and the corresponding extracted BN features.
3. The regional accent recognition method according to claim 2, wherein the preset DBN is obtained by training a deep belief network as follows:
training the DBN on a speech training set using the restricted Boltzmann machine (RBM) stacking method, where the training set contains each category of regional-accent Mandarin data and the corresponding extracted BN features;
after the RBM-stacked DBN has been trained, removing the network parameters after the bottleneck layer (the layer whose number of nodes is below a threshold), yielding the preset DBN.
4. The regional accent recognition method according to claim 1, wherein extracting the SDC features of the speech to be recognized comprises:
extracting the mel-frequency cepstral coefficient (MFCC) feature vectors of the speech to be recognized;
obtaining the SDC features of the speech according to its MFCC feature vectors.
5. The regional accent recognition method according to claim 4, wherein obtaining the SDC features according to the MFCC features comprises:
concatenating each MFCC feature vector of the speech with the corresponding delta vectors to form one feature vector of the SDC features, where the number of delta vectors equals the dimension of the MFCC feature vector;
each delta vector being the difference between two MFCC vectors: starting from the current frame, sliding forward by a first set number of frames, then subtracting the MFCC vector offset backward by a second set number of frames from the MFCC vector offset forward by the same number of frames.
6. A regional accent recognition device based on depth feature fusion, characterized by comprising:
an extraction module for extracting bottleneck (BN) features and shifted delta cepstrum (SDC) features from the speech to be recognized;
an output module for inputting the BN features and the SDC features into a preset support vector machine (SVM) classifier to obtain the category of the speech to be recognized;
wherein the preset SVM classifier is trained on samples of each category of regional-accent Mandarin data labeled with the Gaussian supervector (GSV);
and the GSV of each category of regional-accent Mandarin data is labeled as follows:
inputting the BN features and SDC features of each category of regional-accent Mandarin data into a preset Gaussian mixture model-universal background model (GMM-UBM), and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
labeling the corresponding category of regional-accent Mandarin data with the obtained GSV;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
7. The regional accent recognition device according to claim 6, further comprising:
a labeling module for labeling the corresponding category of regional-accent Mandarin data based on the obtained GSV;
the labeling module inputting the BN features and SDC features of each category of regional-accent Mandarin data into the preset GMM-UBM and obtaining the GSV of each category through maximum a posteriori (MAP) adaptation;
wherein the preset GMM-UBM is trained with the expectation-maximization (EM) algorithm on regional-accent Mandarin data of the different categories.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the regional accent recognition method based on depth feature fusion according to any one of claims 1 to 5.
9. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the regional accent recognition method based on depth feature fusion according to any one of claims 1 to 5.
Priority application: CN201911051663.5A, filed 2019-10-31 (priority date 2019-10-31), by the Institute of Information Engineering of CAS and the National Computer Network and Information Security Management Center.

Publications: CN111091809A, published 2020-05-01; CN111091809B, granted 2023-05-23. Status: Active.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant