CN111091809A - Regional accent recognition method and device based on depth feature fusion - Google Patents


Info

Publication number
CN111091809A
CN111091809A (application CN201911051663.5A)
Authority
CN
China
Prior art keywords
recognized
voice
regional accent
sdc
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911051663.5A
Other languages
Chinese (zh)
Other versions
CN111091809B (en)
Inventor
计哲
黄远
高圣翔
孙晓晨
戚梦苑
宁珊
徐艳云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, National Computer Network and Information Security Management Center filed Critical Institute of Information Engineering of CAS
Priority to CN201911051663.5A
Publication of CN111091809A
Application granted
Publication of CN111091809B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a regional accent recognition method and device based on depth feature fusion, wherein the method comprises the following steps: extracting the bottleneck BN feature and the sliding difference cepstrum SDC feature of the speech to be recognized; and inputting the bottleneck BN feature and the sliding difference cepstrum SDC feature into a pre-trained SVM classifier to obtain the output speech category of the speech to be recognized. The method adopts a multi-feature-fusion language identification system: it extracts the deep feature of the speech, fuses it with the traditional SDC feature, and inputs the result into an SVM classifier, realizing a more robust language identification function and obtaining a better classification effect on regional accent Mandarin.

Description

Regional accent recognition method and device based on depth feature fusion
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a regional accent recognition method and device with depth feature fusion.
Background
At present, after years of training, speech recognition engines for Chinese continuous speech recognition, spoken keyword retrieval, speech-to-text conversion and the like achieve good recognition results on standard Mandarin over telephone channels.
However, in practice a large amount of telephone speech carries obvious regional characteristics, such as Guangdong- and Fujian-accented Mandarin. When an existing speech recognition engine trained on standard Mandarin processes such speech, the recognition effect is relatively poor and the accuracy is low, seriously affecting both the transcription and the intent judgment of the transcribed content. A language recognition technology for regional spoken-language classification is therefore needed to pre-classify and screen the speech, so as to improve the efficiency and accuracy of subsequent tasks such as speech recognition.
Disclosure of Invention
To overcome the existing problems or at least partially solve the problems, embodiments of the present invention provide a regional accent recognition method and apparatus with depth feature fusion.
According to a first aspect of the embodiments of the present invention, there is provided a regional accent recognition method with depth feature fusion, including:
extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the output speech category of the speech to be recognized;
the preset SVM classifier is trained on samples of each category of regional accent Mandarin data labeled with Gaussian supervectors GSV.
On the basis of the technical scheme, the invention can be improved as follows.
Further, the extracting the bottleneck BN characteristics of the speech to be recognized includes:
inputting the speech to be recognized into a preset deep belief network DBN to obtain the output bottleneck BN characteristic of the speech to be recognized;
the preset deep belief network DBN is obtained by training a training sample containing regional accent mandarin data of each category and the extracted bottleneck BN characteristics.
Further, the preset deep belief network DBN is obtained by training the deep belief network DBN as follows:
learning and training the deep belief network DBN by utilizing a voice training set based on a restricted Boltzmann machine RBM stacking method, wherein the voice training set comprises regional accent Mandarin data of each category and extracted bottleneck BN characteristics;
and after the deep belief network DBN is trained based on the restricted Boltzmann machine RBM stacking method, removing the network parameters behind the bottleneck layer whose number of nodes is smaller than a threshold in the deep belief network DBN, to obtain the preset deep belief network DBN.
Further, the extracting the sliding difference cepstrum SDC features of the speech to be recognized includes:
extracting a Mel cepstrum coefficient MFCC feature vector of the voice to be recognized;
and obtaining sliding difference cepstrum SDC characteristics of the voice to be recognized according to the MFCC characteristic vectors of the voice to be recognized.
Further, the obtaining, according to the MFCC feature of the speech to be recognized, a sliding differential cepstrum SDC feature of the speech to be recognized includes:
splicing MFCC feature vectors of the speech to be recognized and corresponding differential vectors to form each feature vector of the SDC features, wherein the number of the differential vectors is the same as the dimension of the MFCC feature vectors;
each differential vector is obtained by subtracting a first vector and a second vector, wherein the first vector is obtained by shifting the MFCC feature vector forward by a first set number of frames and then shifting the MFCC feature vector forward by a second set number of frames, and the second vector is obtained by shifting the MFCC feature vector forward by the first set number of frames and then shifting the MFCC feature vector backward by the second set number of frames.
Further, the gaussian supervectors GSV of the regional accent mandarin data of each category are labeled as follows:
inputting the BN features and SDC features of each category of regional accent Mandarin data into a preset Gaussian mixture model-general background model GMM-UBM, and obtaining the Gaussian supervector GSV of each category of regional accent Mandarin data by the maximum posterior probability MAP self-adaptive method;
labeling corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
the preset Gaussian mixture model-general background model GMM-UBM is obtained by training through an expectation maximization EM algorithm based on different types of regional accent Mandarin data.
According to a second aspect of the embodiments of the present invention, there is provided a regional accent recognition apparatus with depth feature fusion, including:
the extraction module is used for extracting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic of the voice to be recognized;
the output module is used for inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the output voice category of the voice to be recognized;
the preset SVM classifier is trained on samples of each category of regional accent Mandarin data labeled with Gaussian supervectors GSV.
According to a third aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor calls the program instructions to perform the depth feature-fused regional accent recognition method provided in any one of the various possible implementations of the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is further provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method for depth feature-fused regional accent recognition provided in any one of the various possible implementations of the first aspect.
The embodiment of the invention provides a regional accent recognition method and device with depth feature fusion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a regional accent recognition method with depth feature fusion according to an embodiment of the present invention;
FIG. 2 is a flowchart of MFCC feature extraction in an embodiment of the present invention;
fig. 3 is a flow chart of SDC feature extraction according to an embodiment of the present invention;
FIG. 4 is a flow chart of a method for training a GMM-UBM model according to an embodiment of the present invention;
FIG. 5 is a flow chart of a GSV extraction method for each category of regional accent Mandarin data according to an embodiment of the present invention;
fig. 6 is a schematic overall flow chart of a regional accent recognition method with depth feature fusion according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a regional accent recognition apparatus with depth feature fusion according to an embodiment of the present invention;
fig. 8 is a schematic view of an overall structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In an embodiment of the present invention, a depth-feature-fused regional accent recognition method is provided, and fig. 1 is a schematic overall flow chart of the depth-feature-fused regional accent recognition method provided in the embodiment of the present invention, where the method includes:
extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the output speech category of the speech to be recognized;
the preset SVM classifier is trained on samples of each category of regional accent Mandarin data labeled with Gaussian supervectors GSV.
It can be understood that the speech to be recognized in the embodiment of the present invention is regional accent Mandarin. To recognize the class of the speech more accurately, the embodiment of the present invention extracts the bottleneck BN feature and the sliding difference cepstrum SDC feature of the speech to be recognized, inputs both features into the trained SVM classifier, and recognizes the class of the speech to be recognized through the SVM classifier.
The embodiment of the invention adopts a language identification system with multi-feature (BN feature and SDC feature) fusion: it extracts the deep feature of the speech, fuses it with the traditional SDC feature, and inputs the result into an SVM classifier, realizing a more robust language identification function and obtaining a better classification effect on regional accent Mandarin.
On the basis of the above embodiment, in the embodiment of the present invention, extracting the bottleneck BN feature of the speech to be recognized includes:
inputting the speech to be recognized into a preset deep belief network DBN to obtain the output bottleneck BN characteristic of the speech to be recognized;
the preset deep belief network DBN is obtained by training a training sample containing regional accent mandarin data of each category and the extracted bottleneck BN characteristics.
On the basis of the above embodiments, in the embodiments of the present invention, the preset deep belief network DBN is obtained by training the deep belief network DBN in the following manner:
learning and training the deep belief network DBN by utilizing a voice training set based on a restricted Boltzmann machine RBM stacking method, wherein the voice training set comprises regional accent Mandarin data of each category and extracted bottleneck BN characteristics;
and after the deep belief network DBN is trained based on the restricted Boltzmann machine RBM stacking method, removing the network parameters behind the bottleneck layer whose number of nodes is smaller than a threshold in the deep belief network DBN, to obtain the preset deep belief network DBN.
It can be understood that, in the embodiment of the present invention, the bottleneck BN feature of the speech to be recognized is extracted by a trained deep belief network DBN. To train the DBN, a speech training set is first constructed, that is, the required data of each category of regional accent Mandarin are collected to build a training set for each language model. Because the data come from the international telecommunication network, the proportion of regional accent Mandarin data that meets the requirements is very small; purely manual selection involves too much redundant work and is not feasible. Therefore, various computer-assisted measures are combined with manual labeling (marking the category of the regional accent Mandarin data): a mature language identification system is first used for screening and filtering, and after a certain amount of data has been accumulated, the model is updated repeatedly until the required scale of the data set is reached.
Before the deep belief network DBN is trained with the speech training set, voice activity detection is performed on each piece of speech data in the training set: invalid parts such as dual-tone multi-frequency (DTMF) signal tones, polyphonic ringtones, music and various other kinds of noise mixed into the conversational speech are identified and filtered out to obtain the effective speech, and then the BN feature of each piece of speech is extracted.
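The energy-based part of such voice activity detection can be illustrated with a minimal numpy sketch. The frame length and threshold below are assumed values, and the DTMF-tone, ringtone and music filtering described above is not modeled here; only silence removal by relative log-energy is shown:

```python
import numpy as np

def energy_vad(signal: np.ndarray, frame_len: int = 160, threshold_db: float = -35.0) -> np.ndarray:
    """Crude energy-based VAD: return a boolean mask of frames whose
    log-energy, relative to the loudest frame, exceeds a floor."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1) + 1e-12   # avoid log(0)
    rel_db = 10.0 * np.log10(energy / energy.max())
    return rel_db > threshold_db                   # True = keep as speech
```

Frames marked False would be dropped before feature extraction; a production system would add tone and music detectors on top of this.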
After the effective speech is obtained, the deep belief network DBN is trained on each category of regional accent Mandarin data in the speech training set. After the DBN training is completed, the network parameters behind the bottleneck layer (the layer with fewer nodes) are removed, yielding BN features that compress the language information into a low dimension and are suitable for language identification. The deep belief network DBN comprises a plurality of bottleneck layers, and each bottleneck layer comprises a plurality of nodes.
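The truncation step can be pictured as a forward pass that stops at the first layer narrower than the node threshold. The following numpy sketch is purely illustrative: the layer sizes, the 100-node threshold, and the random untrained weights are all assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 56-dim input, wide hidden layers, a narrow 40-node
# bottleneck, then layers that are discarded after training.
layer_sizes = [56, 512, 512, 40, 512, 8]
weights = [rng.standard_normal((a, b)) * 0.01
           for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(b) for b in layer_sizes[1:]]

BOTTLENECK_THRESHOLD = 100  # layers narrower than this count as the bottleneck

def bottleneck_features(x: np.ndarray) -> np.ndarray:
    """Forward-propagate up to and including the bottleneck layer only;
    everything after it is effectively removed."""
    h = x
    for w, b in zip(weights, biases):
        h = sigmoid(h @ w + b)
        if h.shape[-1] < BOTTLENECK_THRESHOLD:
            return h  # stop here: this is the BN feature
    return h
```

In a real system the weights would come from RBM-stacked pretraining plus fine-tuning; only the truncation logic is the point of this sketch.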
On the basis of the foregoing embodiments, in the embodiments of the present invention, the extracting a sliding differential cepstrum SDC feature of a speech to be recognized includes:
extracting a Mel cepstrum coefficient MFCC feature vector of the voice to be recognized;
and obtaining sliding difference cepstrum SDC characteristics of the voice to be recognized according to the MFCC characteristic vectors of the voice to be recognized.
On the basis of the foregoing embodiments, in an embodiment of the present invention, obtaining a sliding differential cepstrum SDC feature of a speech to be recognized according to an MFCC feature of the speech to be recognized includes:
splicing MFCC feature vectors of the speech to be recognized and corresponding differential vectors to form each feature vector of the SDC features, wherein the number of the differential vectors is the same as the dimension of the MFCC feature vectors;
each differential vector is obtained by subtracting a first vector and a second vector, wherein the first vector is obtained by shifting the MFCC feature vector forward by a first set number of frames and then shifting the MFCC feature vector forward by a second set number of frames, and the second vector is obtained by shifting the MFCC feature vector forward by the first set number of frames and then shifting the MFCC feature vector backward by the second set number of frames.
It will be appreciated that, for speech to be recognized, the BN features are extracted simultaneously with the sliding differential cepstral SDC features, which are computed from mel cepstral coefficient MFCC feature vectors.
Referring to fig. 2, which shows the flow of extracting MFCC features from speech data: language identification is a typical classification problem, and the language type is distinguished by extracting features of the speech to be recognized at different levels. The most widely used features are at the acoustic level; they are usually obtained through a series of mathematical transformations of the frame-segmented speech and reflect different time-frequency information of the speech signal, such as Mel-frequency cepstral coefficients (MFCC) and the sliding difference cepstrum (SDC).
Cepstrum analysis refers to the inverse Fourier transform of the natural logarithm of the signal spectrum; the mel-frequency cepstrum differs in that it focuses more on the auditory properties of the human ear. The relationship between the mel frequency and the actual frequency can be represented by the following formula:
Mel(f) = 2595 lg(1 + f/700);
where Mel(f) is the frequency on the mel scale and f is the actual frequency in Hz.
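A direct transcription of this formula (Python, with `lg` read as the base-10 logarithm):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Mel(f) = 2595 * lg(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

For example, 1000 Hz maps to roughly 1000 mel, which is how the constant 2595 anchors the scale.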
Unlike speech recognition and voiceprint recognition, because of the particularities of language recognition, the sliding difference cepstrum SDC feature, derived from differences of shifted mel cepstrum coefficients, is commonly used. SDC extraction is determined by a group of parameters on top of the MFCC feature: the MFCC dimension N of each frame of speech, the number of frames d shifted forwards and backwards in the difference operation, the forward sliding step P in frames, and the number of difference vectors k.
In the embodiment of the present invention, {N, d, P, k} is typically {7, 1, 3, 7}, and the process of extracting the SDC feature is shown in fig. 3. Each feature vector of the SDC feature is 56-dimensional (7 + 7 × 7), obtained by concatenating the basic 7-dimensional MFCC feature vector with 7 difference vectors, where each difference vector is the difference of the two MFCC vectors offset by ±1 frame around positions that slide forward in steps of 3 frames. To reduce the influence of channel and noise and to facilitate Gaussian modeling, mean-variance normalization is also performed on the SDC features extracted from each piece of speech data.
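The {N, d, P, k} scheme can be sketched as follows. This is an illustrative numpy implementation; edge-padding at utterance boundaries is an assumed convention, not specified by the patent:

```python
import numpy as np

def sdc_features(mfcc: np.ndarray, d: int = 1, p: int = 3, k: int = 7) -> np.ndarray:
    """Build SDC vectors (T, N + k*N) from MFCC frames (T, N).

    Each SDC vector concatenates the frame's N-dim MFCC with k difference
    vectors; the i-th difference is c[t + i*p + d] - c[t + i*p - d].
    Frames whose window runs past an end are handled by edge padding."""
    t_frames, n = mfcc.shape
    pad = (k - 1) * p + d                       # furthest forward index needed
    padded = np.pad(mfcc, ((d, pad), (0, 0)), mode="edge")
    out = np.empty((t_frames, n * (k + 1)))
    for t in range(t_frames):
        base = t + d                            # frame t's position after padding
        blocks = [padded[base]]                 # the basic MFCC vector
        for i in range(k):
            blocks.append(padded[base + i * p + d] - padded[base + i * p - d])
        out[t] = np.concatenate(blocks)
    return out
```

With the patent's {7, 1, 3, 7} defaults each output frame is 7 + 7 × 7 = 56-dimensional; mean-variance normalization would follow in practice.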
On the basis of the above embodiments, in the embodiments of the present invention, the gaussian supervectors GSV of the regional accent mandarin data of each category are labeled as follows:
inputting the BN features and SDC features of each category of regional accent Mandarin data into a preset Gaussian mixture model-general background model GMM-UBM, and obtaining the Gaussian supervector GSV of each category of regional accent Mandarin data by the maximum posterior probability MAP self-adaptive method;
labeling corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
the preset Gaussian mixture model-general background model GMM-UBM is obtained by training through an expectation maximization EM algorithm based on different types of regional accent Mandarin data.
It can be understood that, through the above embodiments, the bottleneck BN feature and the sliding difference cepstrum SDC feature of the speech to be recognized are respectively extracted, the BN feature and the SDC feature of the speech to be recognized are input into the trained support vector machine SVM classifier, and the type of the speech to be recognized is output by the SVM classifier.
A Gaussian supervector-support vector machine (GMM supervector-support vector machine, GSV-SVM) system is another common method for modeling acoustic features. The support vector machine SVM is a classifier that completes the sample classification task through a maximum-margin hyperplane; its advantage is that samples which cannot be linearly separated in a low-dimensional space can be mapped to a high-dimensional space, where the classification task is completed by an optimal hyperplane with better robustness. The mapping function that maps samples from the low-dimensional space to the high-dimensional space is the kernel function, one of the main theoretical foundations of the support vector machine. The GSV-SVM system adopted in the embodiment of the present invention uses a kernel function based on the Gaussian supervector.
Another issue in using an SVM for language classification is that the function of an SVM is to find the optimal decision surface between two sample spaces, while language classification is a multi-class task. Therefore, when training the model for a given language, the speech of that language is defined as positive samples and the speech of all other languages is set as negative samples.
The model-pushing technique combines GMM modeling with SVM modeling: the support vectors obtained from SVM training are pushed back into a GMM model, so that the final language model exploits the discriminative information of the SVM classification and achieves good language identification performance.
Deep learning builds and trains deep nonlinear network structures in which combinations of low-level features form abstract high-level feature representations, thereby discovering distributed feature representations of the data. Its essence is to construct machine learning models with many hidden layers and, from massive training data, learn more useful features, achieving classification accuracy that traditional linear methods cannot reach.
The SVM classifier is trained on the basis of the Gaussian super vector GSV of the regional accent Mandarin data of each category in the voice training set.
Referring to fig. 4, when the SVM classifier is trained, for each category of regional accent Mandarin data, the Gaussian supervectors GSV of the training data of that category are set as positive samples and the Gaussian supervectors GSV of the training data of the other categories as negative samples; language models for the different categories of regional accent Mandarin data are obtained through a standard SVM classification algorithm, and the category of the speech to be recognized can then be recognized by the trained SVM.
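The one-vs-rest scheme can be sketched with a minimal linear SVM trained by sub-gradient descent on the hinge loss. This is a toy stand-in for the "standard SVM classification algorithm" mentioned above; the patent's kernel is based on Gaussian supervectors and is not reproduced here:

```python
import numpy as np

def train_linear_svm(x, y, lam=0.01, lr=0.1, epochs=300):
    """Minimal linear SVM via sub-gradient descent on the regularized
    hinge loss; y must contain +1 / -1 labels. Returns (w, b)."""
    n, dim = x.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        viol = y * (x @ w + b) < 1.0                          # margin violators
        grad_w = lam * w - (y[viol, None] * x[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train_one_vs_rest(gsvs, labels, classes):
    """One SVM per accent category: that category's supervectors are the
    positive samples, every other category's are the negatives."""
    models = {}
    for c in classes:
        y = np.where(labels == c, 1.0, -1.0)
        models[c] = train_linear_svm(gsvs, y)
    return models

def classify(models, gsv):
    # pick the category whose model yields the largest decision score
    return max(models, key=lambda c: gsv @ models[c][0] + models[c][1])
```

At recognition time, a single supervector is scored against every category model and the highest-scoring category is returned.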
When the speech to be recognized is actually recognized, the BN feature and the SDC feature of the speech to be recognized are extracted and input into the trained SVM classifier, which outputs the category of the speech to be recognized.
Among them, the Gaussian mixture model-general background model (GMM-UBM) system made a major breakthrough in the field of speech recognition and was soon introduced into language recognition systems. The Gaussian mixture model is a statistical model of probability density distributions and is widely applied in fields such as image recognition and natural language understanding. Complex data that cannot be fitted by a single distribution can often be well described by a weighted combination of a series of Gaussian distributions. The parameters of a Gaussian mixture model are typically estimated by the expectation-maximization (EM) algorithm. After a model is established for each language, the language category is decided by computing the likelihood of the features of the speech to be recognized. The general background model is a Gaussian mixture model, independent of any specific language, trained on a portion of speech selected from every language; it serves two main purposes. First, in practice a large number of Gaussian components (e.g., 256 or 1024) is needed to describe complex speech features well, and the data volume of a single class is not enough to train a mixture of such high order. Second, if the Gaussian mixture model of each language is trained independently, the Gaussian components easily fail to correspond across languages, which affects the subsequent decision. With the general background model, each language can obtain its own Gaussian mixture model from its limited training data through the maximum a posteriori (MAP) self-adaptive method, which both aligns the components of each language's Gaussian model and greatly saves training time.
For the constructed SVM classifier, the SVM classifier is trained based on the Gaussian supervectors GSV of the regional accent Mandarin data of each category. Referring to fig. 5, the method of extracting the gaussian supervectors GSV of the regional accent mandarin data of each category is: training a Gaussian mixture model-general background model GMM-UBM through an expectation maximization EM algorithm based on different types of regional accent Mandarin data;
and inputting the BN characteristic and the SDC characteristic of the regional accent Mandarin data of each category into the GMM-UBM, and obtaining the Gaussian supervectors GSV of the regional accent Mandarin data of each category by a maximum posterior probability MAP self-adaptive method.
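A minimal numpy sketch of this MAP mean-adaptation step follows (diagonal-covariance UBM; the relevance factor r = 16 is a conventional assumed default, not a value from the patent). The adapted means are stacked into the Gaussian supervector:

```python
import numpy as np

def map_adapt_supervector(x, weights, means, variances, r=16.0):
    """MAP-adapt UBM means to utterance frames x (T, D) and stack the
    adapted means into a Gaussian supervector of shape (C*D,)."""
    c, d = means.shape
    # log-likelihood of each frame under each diagonal-covariance Gaussian
    log_det = np.sum(np.log(variances), axis=1)                  # (C,)
    diff = x[:, None, :] - means[None, :, :]                     # (T, C, D)
    log_p = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                    + log_det + d * np.log(2.0 * np.pi))         # (T, C)
    log_p += np.log(weights)
    # posterior responsibilities gamma[t, c] (log-sum-exp normalization)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    n_c = gamma.sum(axis=0)                                      # soft counts
    ex = (gamma.T @ x) / np.maximum(n_c[:, None], 1e-10)         # 1st-order stats
    alpha = n_c / (n_c + r)                                      # adaptation weight
    adapted = alpha[:, None] * ex + (1.0 - alpha[:, None]) * means
    return adapted.ravel()
```

Components with many responsible frames move toward the utterance statistics; rarely-used components stay near the UBM, which keeps supervectors of different utterances comparable dimension by dimension.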
Referring to fig. 6, the overall flowchart of the regional accent recognition method based on depth feature fusion is divided into a training stage and a recognition stage, where the training stage mainly collects various different types of regional accent data to form speech training data. And extracting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic of each piece of voice training data and the Gaussian super vector GSV of each piece of voice training data, and training the SVM classifier by adopting the characteristics.
And in the recognition stage, extracting the BN characteristic and the SDC characteristic of the voice to be recognized, inputting the BN characteristic and the SDC characteristic of the voice to be recognized into the trained SVM classifier, and recognizing the category of the voice to be recognized.
In another embodiment of the present invention, a regional accent recognition apparatus based on depth feature fusion is provided, which is used to implement the methods in the foregoing embodiments. Therefore, the descriptions and definitions in the embodiments of the regional accent recognition method based on depth feature fusion may be used for understanding the execution modules in the embodiments of the present invention. Fig. 7 is a schematic diagram of the overall structure of the regional accent recognition apparatus based on depth feature fusion according to an embodiment of the present invention; the apparatus includes an extraction module 71 and an output module 72.
The extraction module 71 is configured to extract a bottleneck BN feature and a sliding difference cepstrum SDC feature of the speech to be recognized;
the output module 72 is configured to input the bottleneck BN feature and the sliding difference cepstrum SDC feature into a pre-trained support vector machine SVM classifier to obtain the speech category of the speech to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
The device further comprises:
a labeling module 73, configured to label the corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
wherein the BN characteristic and the SDC characteristic of the regional accent Mandarin data of each category are input into a preset Gaussian mixture model-universal background model GMM-UBM, and the Gaussian supervector GSV of the regional accent Mandarin data of each category is obtained through a maximum a posteriori MAP adaptation method;
and the preset Gaussian mixture model-universal background model GMM-UBM is obtained by training with the expectation-maximization EM algorithm on regional accent Mandarin data of different categories.
Fig. 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 8, the electronic device may include: a processor (processor) 01, a communication interface (Communications Interface) 02, a memory (memory) 03 and a communication bus 04, wherein the processor 01, the communication interface 02 and the memory 03 communicate with one another through the communication bus 04. The processor 01 may call logic instructions in the memory 03 to perform the following method:
extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the voice category of the voice to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
In addition, the logic instructions in the memory 03 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example, including: extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the voice category of the voice to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
The regional accent recognition method and apparatus based on depth feature fusion provided by the embodiments of the present invention adopt a multi-feature-fusion language recognition system: the deep features of the speech are extracted, fused with the traditional SDC features, and input into an SVM classifier, achieving more robust language recognition and a better classification effect on regional-dialect Mandarin.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A regional accent recognition method with depth feature fusion is characterized by comprising the following steps:
extracting bottleneck BN characteristics and sliding difference cepstrum SDC characteristics of the voice to be recognized;
inputting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the voice category of the voice to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
2. The regional accent recognition method of claim 1, wherein the extracting bottleneck BN features of the speech to be recognized comprises:
inputting the voice to be recognized into a preset Deep Belief Network (DBN) to obtain the bottleneck BN characteristic of the voice to be recognized;
the preset deep belief network DBN is obtained by training a training sample containing regional accent mandarin data of each category and the extracted bottleneck BN characteristics.
3. The regional accent recognition method of claim 2, wherein the preset deep belief network DBN is obtained by training the deep belief network DBN as follows:
learning and training the deep belief network DBN by utilizing a voice training set based on a restricted Boltzmann machine RBM stacking method, wherein the voice training set comprises regional accent Mandarin data of each category and extracted bottleneck BN characteristics;
and after the deep belief network DBN is trained based on the restricted Boltzmann machine RBM stacking method, removing the network parameters behind the bottleneck layer, i.e. the layer whose number of nodes is smaller than a threshold, in the deep belief network DBN, to obtain the preset deep belief network DBN.
4. The regional accent recognition method of claim 1, wherein the extracting sliding differential cepstrum SDC features of the speech to be recognized comprises:
extracting a Mel cepstrum coefficient MFCC feature vector of the voice to be recognized;
and obtaining sliding difference cepstrum SDC characteristics of the voice to be recognized according to the MFCC characteristic vectors of the voice to be recognized.
5. The regional accent recognition method of claim 4, wherein the obtaining of the sliding differential cepstrum SDC features of the speech to be recognized according to the MFCC feature vectors of the speech to be recognized comprises:
splicing MFCC feature vectors of the speech to be recognized and corresponding differential vectors to form each feature vector of the SDC features, wherein the number of the differential vectors is the same as the dimension of the MFCC feature vectors;
each differential vector is obtained by subtracting a second vector from a first vector, wherein the first vector is obtained by shifting the MFCC feature vector forward by a first set number of frames and then further forward by a second set number of frames, and the second vector is obtained by shifting the MFCC feature vector forward by the first set number of frames and then backward by the second set number of frames.
6. The regional accent recognition method of claim 1, wherein the Gaussian supervectors GSV of the regional accent Mandarin data of each category are labeled as follows:
inputting the BN characteristic and the SDC characteristic of the regional accent Mandarin data of each category into a preset Gaussian mixture model-universal background model GMM-UBM, and obtaining the Gaussian supervector GSV of the regional accent Mandarin data of each category by a maximum a posteriori MAP adaptation method;
labeling corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
wherein the preset Gaussian mixture model-universal background model GMM-UBM is obtained by training with the expectation-maximization EM algorithm on regional accent Mandarin data of different categories.
7. A regional accent recognition apparatus based on depth feature fusion, characterized by comprising:
the extraction module is used for extracting the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic of the voice to be recognized;
the output module is configured to input the bottleneck BN characteristic and the sliding difference cepstrum SDC characteristic into a preset SVM classifier to obtain the voice category of the voice to be recognized;
the preset SVM classifier is obtained by training on training samples of regional accent Mandarin data of each category labeled with Gaussian supervectors GSV.
8. The regional accent recognition device of claim 7, further comprising:
the marking module is used for marking the corresponding regional accent Mandarin data of each category based on the obtained Gaussian supervectors GSV;
wherein the BN characteristic and the SDC characteristic of the regional accent Mandarin data of each category are input into a preset Gaussian mixture model-universal background model GMM-UBM, and the Gaussian supervector GSV of the regional accent Mandarin data of each category is obtained through a maximum a posteriori MAP adaptation method;
and the preset Gaussian mixture model-universal background model GMM-UBM is obtained by training with the expectation-maximization EM algorithm on regional accent Mandarin data of different categories.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the method for regional accent recognition with depth feature fusion according to any of claims 1 to 6.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, performs the steps of the depth feature fused regional accent recognition method of any one of claims 1 to 6.
CN201911051663.5A 2019-10-31 2019-10-31 Regional accent recognition method and device based on depth feature fusion Active CN111091809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911051663.5A CN111091809B (en) 2019-10-31 2019-10-31 Regional accent recognition method and device based on depth feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911051663.5A CN111091809B (en) 2019-10-31 2019-10-31 Regional accent recognition method and device based on depth feature fusion

Publications (2)

Publication Number Publication Date
CN111091809A true CN111091809A (en) 2020-05-01
CN111091809B CN111091809B (en) 2023-05-23

Family

ID=70393476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911051663.5A Active CN111091809B (en) 2019-10-31 2019-10-31 Regional accent recognition method and device based on depth feature fusion

Country Status (1)

Country Link
CN (1) CN111091809B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640419A (en) * 2020-05-26 2020-09-08 合肥讯飞数码科技有限公司 Language identification method, system, electronic equipment and storage medium
CN112233651A (en) * 2020-10-10 2021-01-15 深圳前海微众银行股份有限公司 Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112908295A (en) * 2021-02-02 2021-06-04 睿云联(厦门)网络通讯技术有限公司 Method and device for generating regional offline accent voice recognition system
CN112233651B (en) * 2020-10-10 2024-06-04 深圳前海微众银行股份有限公司 Dialect type determining method, device, equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012036934A1 (en) * 2010-09-15 2012-03-22 Microsoft Corporation Deep belief network for large vocabulary continuous speech recognition
CN103065622A (en) * 2012-12-20 2013-04-24 中国科学院声学研究所 Language model practicing method and system thereof for language recognition
CN103077709A (en) * 2012-12-28 2013-05-01 中国科学院声学研究所 Method and device for identifying languages based on common identification subspace mapping
US8583432B1 (en) * 2012-07-18 2013-11-12 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
US20170286407A1 (en) * 2016-04-01 2017-10-05 Samsung Electronics Co., Ltd. Device and method for voice translation
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN109817220A (en) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Audio recognition method, apparatus and system
CN109979432A (en) * 2019-04-02 2019-07-05 科大讯飞股份有限公司 A kind of dialect translation method and device
CN110164417A (en) * 2019-05-31 2019-08-23 科大讯飞股份有限公司 A kind of languages vector obtains, languages know method for distinguishing and relevant apparatus
US20190266998A1 (en) * 2017-06-12 2019-08-29 Ping An Technology(Shenzhen) Co., Ltd. Speech recognition method and device, computer device and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012036934A1 (en) * 2010-09-15 2012-03-22 Microsoft Corporation Deep belief network for large vocabulary continuous speech recognition
US8583432B1 (en) * 2012-07-18 2013-11-12 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
CN103065622A (en) * 2012-12-20 2013-04-24 中国科学院声学研究所 Language model practicing method and system thereof for language recognition
CN103077709A (en) * 2012-12-28 2013-05-01 中国科学院声学研究所 Method and device for identifying languages based on common identification subspace mapping
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
US20170286407A1 (en) * 2016-04-01 2017-10-05 Samsung Electronics Co., Ltd. Device and method for voice translation
US20190266998A1 (en) * 2017-06-12 2019-08-29 Ping An Technology(Shenzhen) Co., Ltd. Speech recognition method and device, computer device and storage medium
CN109817220A (en) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Audio recognition method, apparatus and system
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN109979432A (en) * 2019-04-02 2019-07-05 科大讯飞股份有限公司 A kind of dialect translation method and device
CN110164417A (en) * 2019-05-31 2019-08-23 科大讯飞股份有限公司 A kind of languages vector obtains, languages know method for distinguishing and relevant apparatus

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MAN-HUNG SIU, et al.: "Discriminatively Trained GMMs for Language Classification Using Boosting Methods", 《IEEE Transactions on Audio, Speech, and Language Processing》 *
崔瑞莲 et al.: "Language identification based on deep neural networks" (基于深度神经网络的语种识别) *
李晋徽 et al.: "A new bottleneck deep belief network based feature extraction method and its application in language identification" (一种新的基于瓶颈深度信念网络的特征提取方法及其在语种识别中的应用), 《计算机科学》 (Computer Science) *
王烨 et al.: "GSV-SVM dialect identification based on subspace mapping and score normalization" (基于子空间映射和得分规整的GSV-SVM方言识别), 《计算机工程与设计》 (Computer Engineering and Design) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640419A (en) * 2020-05-26 2020-09-08 合肥讯飞数码科技有限公司 Language identification method, system, electronic equipment and storage medium
CN111640419B (en) * 2020-05-26 2023-04-07 合肥讯飞数码科技有限公司 Language identification method, system, electronic equipment and storage medium
CN112233651A (en) * 2020-10-10 2021-01-15 深圳前海微众银行股份有限公司 Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112233651B (en) * 2020-10-10 2024-06-04 深圳前海微众银行股份有限公司 Dialect type determining method, device, equipment and storage medium
CN112908295A (en) * 2021-02-02 2021-06-04 睿云联(厦门)网络通讯技术有限公司 Method and device for generating regional offline accent voice recognition system
CN112908295B (en) * 2021-02-02 2023-05-16 睿云联(厦门)网络通讯技术有限公司 Generation method and device of regional offline accent voice recognition system

Also Published As

Publication number Publication date
CN111091809B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN109256150B (en) Speech emotion recognition system and method based on machine learning
CN108281137A (en) A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN106875936B (en) Voice recognition method and device
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
WO2014029099A1 (en) I-vector based clustering training data in speech recognition
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN107093422B (en) Voice recognition method and voice recognition system
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN108735200A (en) A kind of speaker's automatic marking method
CN111583906A (en) Role recognition method, device and terminal for voice conversation
CN116665676B (en) Semantic recognition method for intelligent voice outbound system
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN114678030A (en) Voiceprint identification method and device based on depth residual error network and attention mechanism
CN110706710A (en) Voice recognition method and device, electronic equipment and storage medium
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN106710588B (en) Speech data sentence recognition method, device and system
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
Ling An acoustic model for English speech recognition based on deep learning
CN112614510B (en) Audio quality assessment method and device
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN113744727A (en) Model training method, system, terminal device and storage medium
CN111326161B (en) Voiceprint determining method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant