WO2020034628A1 - Accent identification method and device, computer device, and storage medium - Google Patents


Info

Publication number
WO2020034628A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
speech signal
accent
frequency
identified
Prior art date
Application number
PCT/CN2019/077512
Other languages
French (fr)
Chinese (zh)
Inventor
张丝潆
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020034628A1 publication Critical patent/WO2020034628A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • The present application relates to the field of computer audition technology, and in particular to an accent recognition method and device, a computer device, and a storage medium.
  • With the emergence and deployment of intelligent identity authentication such as face recognition and voiceprint recognition, recognition accuracy still has room for improvement, and the accent factor is one of the breakthrough points for obtaining more accurate voiceprint recognition results. Because speakers live in different regions, accent differences remain, to a greater or lesser degree, even when everyone speaks Mandarin. If accent recognition can be added to existing voiceprint recognition as a supplement, the application scenarios can be extended further; the most direct application is to identify the region the speaker comes from before voiceprint recognition and thereby narrow the set of candidates for subsequent recognition. However, the existing accent recognition effect is not ideal: recognition is slow and accuracy is not high.
  • A first aspect of the present application provides an accent recognition method, which includes:
  • preprocessing a speech signal to be recognized;
  • detecting valid speech in the preprocessed speech signal to be recognized;
  • extracting Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • extracting an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and
  • calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
  • A second aspect of the present application provides an accent recognition device, which includes:
  • a preprocessing unit, configured to preprocess a speech signal to be recognized;
  • a detection unit, configured to detect valid speech in the preprocessed speech signal to be recognized;
  • a first extraction unit, configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • a second extraction unit, configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and
  • a recognition unit, configured to calculate a decision score of the speech signal to be recognized for a given accent according to the iVector and obtain an accent recognition result of the speech signal to be recognized according to the decision score.
  • a third aspect of the present application provides a computer device including a processor, where the processor is configured to implement the accent recognition method when executing computer-readable instructions stored in a memory.
  • a fourth aspect of the present application provides a non-volatile readable storage medium on which computer-readable instructions are stored, and the computer-readable instructions implement the accent recognition method when executed by a processor.
  • This application preprocesses the speech signal to be recognized; detects valid speech in the preprocessed speech signal; extracts Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech; extracts the identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and calculates a decision score of the speech signal to be recognized for a given accent according to the iVector, obtaining the accent recognition result of the speech signal to be recognized according to the decision score.
  • This application can find problems at the database level, without the need for testers to find problems through complex and extensive functional tests. This application can realize fast and accurate accent recognition.
  • FIG. 1 is a flowchart of an accent recognition method according to an embodiment of the present application.
  • FIG. 2 is a structural diagram of an accent recognition device according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the accent recognition method of the present application is applied in one or more computer devices.
  • the computer device is a device capable of automatically performing numerical calculations and / or information processing in accordance with instructions set or stored in advance.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, or a voice control device.
  • FIG. 1 is a flowchart of an accent recognition method provided in Embodiment 1 of the present application.
  • the accent recognition method is applied to a computer device.
  • the accent recognition method detects a failed object in the test library so that developers can fix the code and complete the rectification of the failed object.
  • the accent recognition method specifically includes the following steps:
  • Step 101 Pre-process a speech signal to be recognized.
  • the voice signal to be identified may be an analog voice signal or a digital voice signal. If the speech signal to be identified is an analog speech signal, the analog speech signal is subjected to analog-to-digital conversion to be converted into a digital speech signal.
  • the voice signal to be recognized may be a voice signal collected through a voice input device (for example, a microphone, a mobile phone microphone, etc.).
  • Preprocessing the speech signal to be identified may include pre-emphasizing the speech signal to be identified.
  • The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Because of the influence of glottal excitation and mouth-nose radiation, the energy of a speech signal drops noticeably at the high-frequency end; usually, the higher the frequency, the smaller the amplitude, with the power-spectrum amplitude falling by about 6 dB per octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis of the speech signal to be recognized, the high-frequency part of the signal needs to be boosted, that is, the signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter.
  • The transfer function of the high-pass filter can be: H(z) = 1 - κz⁻¹, 0.9 ≤ κ ≤ 1.0,
  • where κ is the pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
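  • For illustration only, the following minimal Python sketch applies such a pre-emphasis filter to a digital speech signal; the function name and the default coefficient of 0.97 are assumptions, not part of the patent.

```python
import numpy as np

def pre_emphasize(signal: np.ndarray, k: float = 0.97) -> np.ndarray:
    """Apply the high-pass pre-emphasis filter y[n] = x[n] - k * x[n-1]."""
    return np.append(signal[0], signal[1:] - k * signal[:-1])
```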
  • Preprocessing the speech signal to be identified may further include windowing and framing the speech signal to be identified.
  • Speech signals are a kind of non-stationary time-varying signals, which are mainly divided into two categories, voiced and unvoiced.
  • The pitch period of voiced speech, the amplitude of the voiced speech signal, and the vocal-tract parameters all change slowly with time, but the signal can usually be regarded as short-time stationary within 10 ms-30 ms.
  • Therefore, in speech signal processing the speech signal can be divided into short segments for processing. This process is called framing, and each resulting short segment of speech is called a speech frame. Framing is achieved by windowing the speech signal.
  • each voice frame is 20 milliseconds, and there is a 10 millisecond overlap between two adjacent voice frames, that is, one voice frame is taken every 10 milliseconds.
  • the commonly used window functions are rectangular window, Hamming window and Hanning window.
  • The rectangular window function is: w(n) = 1, 0 ≤ n ≤ N-1.
  • The Hamming window function is: w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
  • The Hanning window function is: w(n) = 0.5·(1 - cos(2πn/(N-1))), 0 ≤ n ≤ N-1.
  • Here N is the number of sampling points contained in one speech frame.
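  • A hedged sketch of the framing and windowing step is shown below; the 20 ms frame length, 10 ms shift, and the choice of a Hamming window follow the description above, while the function name is an illustrative assumption.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, fs: int,
                     frame_ms: float = 20.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a signal into overlapping frames and apply a Hamming window.

    Assumes len(signal) is at least one frame long.
    """
    frame_len = int(fs * frame_ms / 1000)   # N sampling points per frame
    shift = int(fs * shift_ms / 1000)       # frame shift (overlap = frame_len - shift)
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift: i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # one windowed frame per row
```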
  • Step 102: Detect valid speech in the preprocessed speech signal to be recognized.
  • Endpoint detection may be performed according to the short-time energy and short-time zero-crossing rate of the preprocessed speech signal to be recognized, so as to determine the valid speech in it.
  • Specifically, the valid speech in the preprocessed speech signal to be recognized can be detected as follows:
  • A Hamming window may be applied to the preprocessed speech signal to be recognized, with each frame being 20 ms and the frame shift being 10 ms; if the preprocessed signal has already been windowed and framed, this step is omitted.
  • A discrete Fourier transform (DFT) is performed on each windowed frame to obtain its discrete spectrum; E(m) denotes the cumulative energy of the m-th frequency band, and (m1, m2) denote the starting and ending frequency points of the m-th band.
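  • The band-energy endpoint detection described above could be sketched as follows; the number of frequency bands, the band boundaries, and the threshold value are assumptions, since the patent does not fix them.

```python
import numpy as np

def detect_valid_frames(frames: np.ndarray, n_bands: int = 4,
                        log_energy_threshold: float = -5.0) -> np.ndarray:
    """Mark frames whose per-band log cumulative energy exceeds a preset threshold."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # DFT power spectrum
    edges = np.linspace(0, spectrum.shape[1], n_bands + 1, dtype=int)
    band_energy = np.stack([spectrum[:, edges[m]:edges[m + 1]].sum(axis=1)
                            for m in range(n_bands)], axis=1)   # E(m) per frame
    log_energy = np.log(band_energy + 1e-10)
    return (log_energy > log_energy_threshold).any(axis=1)      # valid-speech mask
```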
  • Step 103: Extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech.
  • the discrete energy spectrum of the speech frame is passed through a set of triangular filters (ie, triangular filter groups) uniformly distributed on the Mel frequency to obtain the output of each triangular filter.
  • the center frequencies of the set of triangular filters are evenly arranged on the Mel frequency scale, and the frequencies of the two bottom points of the triangles of each triangular filter are respectively equal to the center frequencies of two adjacent triangular filters.
  • the center frequency of the triangular filter is:
  • the frequency response of the triangular filter is:
  • Discrete Cosine Transform is performed on S (m) to obtain the initial MFCC characteristic parameters of the speech frame.
  • the discrete cosine transform is:
  • the initial MFCC characteristic parameters only reflect the static characteristics of the speech parameters.
  • The dynamic characteristics of speech can be described by the differential spectrum of the static features, and combining dynamic and static features can effectively improve the recognition performance of the system. Usually, first-order and/or second-order differential MFCC feature parameters are used.
  • the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
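  • A compact way to obtain such 39-dimensional features, sketched here with the librosa library rather than the patent's own implementation, is:

```python
import numpy as np
import librosa

def mfcc_39(y: np.ndarray, sr: int) -> np.ndarray:
    """13 static MFCCs plus first- and second-order deltas -> (frames, 39)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.02 * sr), hop_length=int(0.01 * sr))
    d1 = librosa.feature.delta(mfcc)            # first-order differential MFCC
    d2 = librosa.feature.delta(mfcc, order=2)   # second-order differential MFCC
    return np.vstack([mfcc, d1, d2]).T
```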
  • the triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has good recognition performance in a noisy environment.
  • the extracted MFCC feature parameters may also be subjected to dimensionality reduction processing to obtain the dimensionality-reduced MFCC feature parameters.
  • A piecewise-mean data dimensionality reduction algorithm may be applied to the MFCC feature parameters to obtain the dimensionality-reduced MFCC feature parameters.
  • the reduced MFCC feature parameters will be used in subsequent steps.
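  • The patent does not spell out the piecewise-mean algorithm; one plausible reading, averaging the frame axis over equal segments, is sketched below as an assumption.

```python
import numpy as np

def piecewise_mean_reduce(features: np.ndarray, n_segments: int = 50) -> np.ndarray:
    """Reduce (frames, dims) to (n_segments, dims) by averaging within equal segments."""
    n_segments = min(n_segments, len(features))          # avoid empty segments
    segments = np.array_split(features, n_segments, axis=0)
    return np.stack([seg.mean(axis=0) for seg in segments])
```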
  • Step 104: According to the MFCC feature parameters, use a pre-trained Gaussian mixture model (GMM)-universal background model (UBM) to extract the identity vector (iVector) of the valid speech.
  • The universal background model is itself a Gaussian mixture model (GMM), and it is intended to solve the problem of scarce data in real scenes.
  • A GMM is a parametric generative model with a strong ability to represent real data through its Gaussian components: the more components, the stronger the representation, but also the larger the model, and the negative effects gradually become prominent. Obtaining a GMM with strong generalization ability requires sufficient data to drive parameter training, whereas the speech available in real scenes often does not even reach the minute level. The UBM solves this problem of insufficient training data.
  • The UBM is fully trained with a large amount of training data covering different accents (regardless of speaker and region) to obtain a global GMM that characterizes the common properties of speech, which greatly reduces the resources consumed in estimating GMM parameters from scratch.
  • After training of the universal background model is completed, only the training data belonging to each accent is needed to fine-tune the UBM parameters (for example, through UBM adaptation) to obtain the GMM of each accent.
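  • A hedged sketch of UBM training and per-accent adaptation with scikit-learn is given below; the component count, the diagonal covariance, and the relevance factor are illustrative choices, and only the means are adapted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features: np.ndarray, n_components: int = 64) -> GaussianMixture:
    """Train a global GMM (the UBM) on speech from all accents pooled together."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(pooled_features)
    return ubm

def map_adapt_means(ubm: GaussianMixture, accent_features: np.ndarray,
                    relevance: float = 16.0) -> np.ndarray:
    """Relevance-MAP adaptation of the UBM means toward one accent's data."""
    post = ubm.predict_proba(accent_features)          # responsibilities (T, C)
    n_c = post.sum(axis=0)                             # zero-order statistics
    f_c = post.T @ accent_features                     # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]
    return alpha * (f_c / np.maximum(n_c[:, None], 1e-10)) + (1 - alpha) * ubm.means_
```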
  • different accents may be accents belonging to different regions.
  • the regions may be divided according to administrative regions, such as Liaoning, Beijing, Tianjin, Shanghai, Henan, Guangdong, and so on.
  • the region may also be divided into regions based on accent according to common experience, such as southern Fujian and Hakka.
  • The extraction of the iVector is based on total variability (TV) space modeling, which maps the high-dimensional GMM supervector obtained from the UBM into a low-dimensional total-variability subspace. This overcomes the limitation that the dimension of the extracted vector grows inconveniently large as the speech signal gets longer, increases the speed of calculation, and expresses more comprehensive characteristics.
  • the GMM supervector in GMM-UBM may include a linear superposition of vector features related to the speaker itself and vector features related to channels and other changes.
  • The subspace modeling form of the TV model is: M = m + Tw, where
  • M represents the GMM supervector of the speech, computed from the MFCC feature parameters;
  • m represents the accent-independent GMM supervector;
  • T represents the load matrix of the space describing the differences; and
  • w represents the low-dimensional factor representation of the GMM supervector M in the load-matrix space, i.e., the iVector.
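  • Given a trained total-variability matrix T, the iVector is commonly taken as the posterior mean w = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ F, where N holds the zero-order and F the centered first-order Baum-Welch statistics. The sketch below assumes diagonal UBM covariances and omits the training of T; it is an illustration, not the patent's own procedure.

```python
import numpy as np

def extract_ivector(T: np.ndarray, sigma_diag: np.ndarray,
                    n_c: np.ndarray, f_centered: np.ndarray, dim: int) -> np.ndarray:
    """T: (C*dim, R) load matrix; sigma_diag: (C*dim,) UBM variances;
    n_c: (C,) zero-order stats; f_centered: (C*dim,) centered first-order stats."""
    counts = np.repeat(n_c, dim)                               # expand counts per feature dim
    precision = np.eye(T.shape[1]) + T.T @ ((counts / sigma_diag)[:, None] * T)
    b = T.T @ (f_centered / sigma_diag)
    return np.linalg.solve(precision, b)                       # the iVector w
```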
  • Noise compensation can be performed on the extracted iVector.
  • Specifically, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) can be used to perform noise compensation on the extracted iVector.
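  • One common way to realise this compensation, sketched here with scikit-learn for the LDA step and a hand-rolled WCCN whitening, is the following; the projection dimensionality is left to LDA's default and the function name is an assumption.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_wccn(ivectors: np.ndarray, accent_labels: np.ndarray, n_dims=None):
    """Project iVectors with LDA, then whiten with the within-class covariance (WCCN)."""
    lda = LinearDiscriminantAnalysis(n_components=n_dims)
    z = lda.fit_transform(ivectors, accent_labels)
    classes = np.unique(accent_labels)
    w_cov = np.zeros((z.shape[1], z.shape[1]))
    for c in classes:                                   # average within-class covariance
        zc = z[accent_labels == c] - z[accent_labels == c].mean(axis=0)
        w_cov += zc.T @ zc / max(len(zc), 1)
    w_cov /= len(classes)
    B = np.linalg.cholesky(np.linalg.inv(w_cov))        # WCCN projection
    return z @ B, lda, B
```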
  • Step 105 Calculate a judgment score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be identified according to the judgment score.
  • There may be one given accent or multiple given accents.
  • When there is one given accent, the decision score of the speech signal to be recognized for that accent may be calculated according to the iVector, and whether the speech signal is of the given accent may be judged according to the decision score.
  • For example, it can be judged whether the decision score is greater than a preset score (for example, 9 points); if so, the speech signal to be recognized is judged to be of the given accent.
  • When there are multiple given accents, the decision score of the speech signal to be recognized for each given accent may be calculated according to the iVector, and the scores may be compared to determine which of the given accents the speech belongs to: the highest of the decision scores is determined, and the given accent corresponding to the highest score is taken as the accent of the speech signal to be recognized.
  • a Logistic Regression model can be used to calculate the decision score of the speech signal to be recognized for a given accent.
  • the logistic regression model can score the speech signal to be identified according to the iVector of the speech signal to be identified.
  • a multi-class logistic regression model may be used to calculate the decision score of the speech signal to be recognized for a given accent.
  • For example, an N-class logistic regression model is used to calculate the decision scores of the speech signal to be recognized for the N given accents.
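  • A minimal sketch of such an N-class scorer, using scikit-learn's multinomial logistic regression over compensated iVectors (an assumption about the concrete implementation), might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_accent_scorer(train_ivectors: np.ndarray, accent_labels: np.ndarray):
    """Fit one multinomial logistic regression model over N given accents."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_ivectors, accent_labels)
    return clf

def recognize_accent(clf, ivector: np.ndarray):
    """Return the accent with the highest decision score, plus all scores."""
    scores = clf.predict_proba(ivector.reshape(1, -1))[0]
    return clf.classes_[int(np.argmax(scores))], dict(zip(clf.classes_, scores))
```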
  • To sum up, the accent recognition method of the first embodiment preprocesses the speech signal to be recognized; detects valid speech in the preprocessed signal; extracts MFCC feature parameters from the valid speech; uses the pre-trained GMM-UBM to extract the iVector of the valid speech according to the MFCC feature parameters; and calculates the decision score of the speech signal to be recognized for a given accent according to the iVector, obtaining the accent recognition result according to the decision score.
  • the first embodiment can realize fast and accurate accent recognition.
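  • Tying the steps together, a hypothetical end-to-end pass over one utterance could read as follows; every function here refers to the illustrative sketches above, not to code disclosed by the patent, and the re-use of concatenated valid frames for MFCC extraction is a deliberate simplification.

```python
import numpy as np

def baum_welch_stats(ubm, feats):
    """Zero- and centered first-order statistics of features against the UBM."""
    post = ubm.predict_proba(feats)                     # (T, C) responsibilities
    n_c = post.sum(axis=0)                              # (C,)
    f_c = post.T @ feats - n_c[:, None] * ubm.means_    # centered first-order stats
    return n_c, f_c.reshape(-1)                         # flatten to (C*dim,)

def recognize_accent_end_to_end(signal, fs, ubm, T, scorer):
    """Illustrative glue code for steps 101-105, reusing the sketches above."""
    emphasized = pre_emphasize(signal)                          # step 101: preprocess
    frames = frame_and_window(emphasized, fs)
    voiced = frames[detect_valid_frames(frames)]                # step 102: valid speech
    feats = mfcc_39(voiced.reshape(-1), fs)                     # step 103 (simplified re-join)
    n_c, f_centered = baum_welch_stats(ubm, feats)
    sigma = ubm.covariances_.reshape(-1)                        # diagonal UBM variances
    w = extract_ivector(T, sigma, n_c, f_centered, feats.shape[1])   # step 104: iVector
    return recognize_accent(scorer, w)                          # step 105: decision score
```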
  • When extracting the MFCC feature parameters, vocal tract length normalization (VTLN) may also be performed to obtain MFCC feature parameters with a normalized vocal tract length.
  • The vocal tract can be represented as a cascaded acoustic-tube model.
  • Each tube can be regarded as a resonant cavity whose resonance frequencies depend on its length and shape. Part of the acoustic difference between speakers therefore stems from differences in vocal tract length, which generally ranges from about 13 cm (adult female) to 18 cm (adult male). As a result, the formant frequencies of the same vowel differ greatly between speakers of different genders. VTLN aims to eliminate the difference in vocal tract length between males and females, so that the accent recognition result is not disturbed by gender.
  • VTLN can match the frequency of formants of each speaker by bending and translating the frequency coordinates.
  • a VTLN method based on bilinear transformation may be used.
  • The bilinear-transformation-based VTLN method does not directly fold the frequency spectrum of the speech signal to be recognized. Instead, a frequency warping factor is calculated from a mapping formula for the cutoff frequency of a bilinear-transformation low-pass filter, so that the average third-formant frequencies of different speakers are aligned; according to the warping factor, the positions (for example, the starting point, middle point, and ending point of each triangular filter) and widths of the triangular filter bank are adjusted by the bilinear transformation; and the vocal-tract-normalized MFCC feature parameters are then calculated with the adjusted triangular filter bank.
  • the scale of the triangular filter is stretched, and the triangular filter bank is expanded and moved to the left at this time.
  • the scale of the triangular filter is compressed, and the triangular filter bank is compressed and moved to the right.
  • the VTLN method based on bilinear transformation avoids a linear search for frequency factors and reduces the computational complexity.
  • the VTLN method based on bilinear transformation uses bilinear transformation to make the bending frequency continuous without bandwidth change.
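  • As an illustration of bilinear frequency warping (the exact mapping formula used in the patent is not reproduced here), the standard all-pass warping ω' = ω + 2·arctan(α·sin ω / (1 - α·cos ω)) could be used to shift the triangular filter edge frequencies; the function names and parameters are assumptions.

```python
import numpy as np

def bilinear_warp(omega: np.ndarray, alpha: float) -> np.ndarray:
    """All-pass / bilinear warping of normalized angular frequencies (radians)."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def warp_filter_edges(edge_hz: np.ndarray, fs: int, alpha: float) -> np.ndarray:
    """Warp the start, centre and end frequencies of the triangular filter bank."""
    omega = 2.0 * np.pi * edge_hz / fs
    return bilinear_warp(omega, alpha) * fs / (2.0 * np.pi)
```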
  • The accent recognition method may further include: performing voiceprint recognition according to the accent recognition result. Because speakers live in different regions, there will be more or less accent difference even if they all speak Mandarin; applying accent recognition before voiceprint recognition can narrow the scope of subsequent voiceprint recognition objects and yield more accurate recognition results.
  • FIG. 2 is a structural diagram of an accent recognition device provided in Embodiment 2 of the present application.
  • the accent recognition device 10 may include a preprocessing unit 201, a detection unit 202, a first extraction unit 203, a second extraction unit 204, and a recognition unit 205.
  • the pre-processing unit 201 is configured to pre-process a speech signal to be recognized.
  • the voice signal to be identified may be an analog voice signal or a digital voice signal. If the speech signal to be identified is an analog speech signal, the analog speech signal is subjected to analog-to-digital conversion to be converted into a digital speech signal.
  • the voice signal to be recognized may be a voice signal collected through a voice input device (for example, a microphone, a mobile phone microphone, etc.).
  • Preprocessing the speech signal to be identified may include pre-emphasizing the speech signal to be identified.
  • The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Because of the influence of glottal excitation and mouth-nose radiation, the energy of a speech signal drops noticeably at the high-frequency end; usually, the higher the frequency, the smaller the amplitude, with the power-spectrum amplitude falling by about 6 dB per octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis of the speech signal to be recognized, the high-frequency part of the signal needs to be boosted, that is, the signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter.
  • The transfer function of the high-pass filter can be: H(z) = 1 - κz⁻¹, 0.9 ≤ κ ≤ 1.0,
  • where κ is the pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
  • Preprocessing the speech signal to be identified may further include windowing and framing the speech signal to be identified.
  • Speech signals are a kind of non-stationary time-varying signals, which are mainly divided into two categories, voiced and unvoiced.
  • The pitch period of voiced speech, the amplitude of the voiced speech signal, and the vocal-tract parameters all change slowly with time, but the signal can usually be regarded as short-time stationary within 10 ms-30 ms.
  • Therefore, in speech signal processing the speech signal can be divided into short segments for processing. This process is called framing, and each resulting short segment of speech is called a speech frame. Framing is achieved by windowing the speech signal.
  • each voice frame is 20 milliseconds, and there is a 10 millisecond overlap between two adjacent voice frames, that is, one voice frame is taken every 10 milliseconds.
  • the commonly used window functions are rectangular window, Hamming window and Hanning window.
  • The rectangular window function is: w(n) = 1, 0 ≤ n ≤ N-1.
  • The Hamming window function is: w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
  • The Hanning window function is: w(n) = 0.5·(1 - cos(2πn/(N-1))), 0 ≤ n ≤ N-1.
  • Here N is the number of sampling points contained in one speech frame.
  • The detection unit 202 is configured to detect valid speech in the preprocessed speech signal to be recognized.
  • Endpoint detection may be performed according to the short-time energy and short-time zero-crossing rate of the preprocessed speech signal to be recognized, so as to determine the valid speech in it.
  • Specifically, the valid speech in the preprocessed speech signal to be recognized can be detected as follows:
  • A Hamming window may be applied to the preprocessed speech signal to be recognized, with each frame being 20 ms and the frame shift being 10 ms; if the preprocessed signal has already been windowed and framed, this step is omitted.
  • A discrete Fourier transform (DFT) is performed on each windowed frame to obtain its discrete spectrum; E(m) denotes the cumulative energy of the m-th frequency band, and (m1, m2) denote the starting and ending frequency points of the m-th band.
  • The first extraction unit 203 is configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech.
  • the discrete energy spectrum of the speech frame is passed through a set of triangular filters (ie, triangular filter groups) uniformly distributed on the Mel frequency to obtain the output of each triangular filter.
  • the center frequencies of the set of triangular filters are evenly arranged on the Mel frequency scale, and the frequencies of the two bottom points of the triangles of each triangular filter are respectively equal to the center frequencies of two adjacent triangular filters.
  • the center frequency of the triangular filter is:
  • the frequency response of the triangular filter is:
  • Discrete Cosine Transform is performed on S (m) to obtain the initial MFCC characteristic parameters of the speech frame.
  • the discrete cosine transform is:
  • the initial MFCC characteristic parameters only reflect the static characteristics of the speech parameters.
  • The dynamic characteristics of speech can be described by the differential spectrum of the static features, and combining dynamic and static features can effectively improve the recognition performance of the system. Usually, first-order and/or second-order differential MFCC feature parameters are used.
  • the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
  • the triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has good recognition performance in a noisy environment.
  • the extracted MFCC feature parameters may also be subjected to dimensionality reduction processing to obtain the dimensionality-reduced MFCC feature parameters.
  • A piecewise-mean data dimensionality reduction algorithm may be applied to the MFCC feature parameters to obtain the dimensionality-reduced MFCC feature parameters.
  • the reduced MFCC feature parameters will be used in subsequent steps.
  • The second extraction unit 204 is configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model (GMM)-universal background model (UBM) according to the MFCC feature parameters.
  • The universal background model is itself a Gaussian mixture model (GMM), and it is intended to solve the problem of scarce data in real scenes.
  • A GMM is a parametric generative model with a strong ability to represent real data through its Gaussian components: the more components, the stronger the representation, but also the larger the model, and the negative effects gradually become prominent. Obtaining a GMM with strong generalization ability requires sufficient data to drive parameter training, whereas the speech available in real scenes often does not even reach the minute level. The UBM solves this problem of insufficient training data.
  • The UBM is fully trained with a large amount of training data covering different accents (regardless of speaker and region) to obtain a global GMM that characterizes the common properties of speech, which greatly reduces the resources consumed in estimating GMM parameters from scratch.
  • After training of the universal background model is completed, only the training data belonging to each accent is needed to fine-tune the UBM parameters (for example, through UBM adaptation) to obtain the GMM of each accent.
  • different accents may be accents belonging to different regions.
  • the regions may be divided according to administrative regions, such as Liaoning, Beijing, Tianjin, Shanghai, Henan, Guangdong, and so on.
  • the region may also be divided into regions based on accent according to common experience, such as southern Fujian and Hakka.
  • The extraction of the iVector is based on total variability (TV) space modeling, which maps the high-dimensional GMM supervector obtained from the UBM into a low-dimensional total-variability subspace. This overcomes the limitation that the dimension of the extracted vector grows inconveniently large as the speech signal gets longer, increases the speed of calculation, and expresses more comprehensive characteristics.
  • the GMM supervector in GMM-UBM may include a linear superposition of vector features related to the speaker itself and vector features related to channels and other changes.
  • The subspace modeling form of the TV model is: M = m + Tw, where
  • M represents the GMM supervector of the speech, computed from the MFCC feature parameters;
  • m represents the accent-independent GMM supervector;
  • T represents the load matrix of the space describing the differences; and
  • w represents the low-dimensional factor representation of the GMM supervector M in the load-matrix space, i.e., the iVector.
  • Noise compensation can be performed on the extracted iVector.
  • Specifically, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) can be used to perform noise compensation on the extracted iVector.
  • the recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
  • There may be one given accent or multiple given accents.
  • When there is one given accent, the decision score of the speech signal to be recognized for that accent may be calculated according to the iVector, and whether the speech signal is of the given accent may be judged according to the decision score.
  • For example, it can be judged whether the decision score is greater than a preset score (for example, 9 points); if so, the speech signal to be recognized is judged to be of the given accent.
  • When there are multiple given accents, the decision score of the speech signal to be recognized for each given accent may be calculated according to the iVector, and the scores may be compared to determine which of the given accents the speech belongs to: the highest of the decision scores is determined, and the given accent corresponding to the highest score is taken as the accent of the speech signal to be recognized.
  • a Logistic Regression model can be used to calculate the decision score of the speech signal to be recognized for a given accent.
  • the logistic regression model can score the speech signal to be identified according to the iVector of the speech signal to be identified.
  • a multi-class logistic regression model may be used to calculate the decision score of the speech signal to be recognized for a given accent.
  • For example, an N-class logistic regression model is used to calculate the decision scores of the speech signal to be recognized for the N given accents.
  • Here w_i and k_i are the parameters of the N-class logistic regression model: w_i is the regression coefficient and k_i is a constant for the i-th given accent. Each given accent has its corresponding w_i and k_i, and together they form the parameter set M = {(w_1, k_1), (w_2, k_2), ..., (w_N, k_N)}.
  • To sum up, the accent recognition device 10 of Embodiment 2 preprocesses the speech signal to be recognized; detects valid speech in the preprocessed signal; extracts MFCC feature parameters from the valid speech; uses the pre-trained GMM-UBM to extract the iVector of the valid speech according to the MFCC feature parameters; and calculates the decision score of the speech signal to be recognized for a given accent according to the iVector, obtaining the accent recognition result according to the decision score.
  • the second embodiment can realize fast and accurate accent recognition.
  • The first extraction unit 203 may also perform vocal tract length normalization (VTLN) to obtain MFCC feature parameters with a normalized vocal tract length.
  • The vocal tract can be represented as a cascaded acoustic-tube model.
  • Each tube can be regarded as a resonant cavity whose resonance frequencies depend on its length and shape. Part of the acoustic difference between speakers therefore stems from differences in vocal tract length, which generally ranges from about 13 cm (adult female) to 18 cm (adult male). As a result, the formant frequencies of the same vowel differ greatly between speakers of different genders. VTLN aims to eliminate the difference in vocal tract length between males and females, so that the accent recognition result is not disturbed by gender.
  • VTLN can match the frequency of formants of each speaker by bending and translating the frequency coordinates.
  • a VTLN method based on bilinear transformation may be used.
  • The bilinear-transformation-based VTLN method does not directly fold the frequency spectrum of the speech signal to be recognized. Instead, a frequency warping factor is calculated from a mapping formula for the cutoff frequency of a bilinear-transformation low-pass filter, so that the average third-formant frequencies of different speakers are aligned; according to the warping factor, the positions (for example, the starting point, middle point, and ending point of each triangular filter) and widths of the triangular filter bank are adjusted by the bilinear transformation; and the vocal-tract-normalized MFCC feature parameters are then calculated with the adjusted triangular filter bank.
  • the scale of the triangular filter is stretched, and the triangular filter bank is expanded and moved to the left at this time.
  • the scale of the triangular filter is compressed, and the triangular filter bank is compressed and moved to the right.
  • the VTLN method based on bilinear transformation avoids a linear search for frequency factors and reduces the computational complexity.
  • the VTLN method based on bilinear transformation uses bilinear transformation to make the bending frequency continuous without bandwidth change.
  • The recognition unit 205 may be further configured to perform voiceprint recognition according to the accent recognition result. Because speakers live in different regions, there will be more or less accent difference even if they all speak Mandarin; applying accent recognition before voiceprint recognition can narrow the scope of subsequent voiceprint recognition objects and yield more accurate recognition results.
  • This embodiment provides a non-volatile readable storage medium.
  • Computer-readable instructions are stored on the non-volatile readable storage medium.
  • When the computer-readable instructions are executed by a processor, the steps of the accent recognition method embodiment described above are implemented, such as steps 101-105 shown in FIG. 1:
  • Step 101: Preprocess a speech signal to be recognized;
  • Step 102: Detect valid speech in the preprocessed speech signal to be recognized;
  • Step 103: Extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • Step 104: Use a pre-trained Gaussian mixture model-universal background model (GMM-UBM) to extract the identity vector (iVector) of the valid speech according to the MFCC feature parameters;
  • Step 105: Calculate a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be recognized according to the decision score.
  • the detecting valid speech in the to-be-recognized voice signal after preprocessing may include:
  • The logarithm of the cumulative energy of each frequency band is compared with a preset threshold to obtain the valid speech.
  • The vocal-tract-normalized MFCC feature parameters are calculated according to the adjusted triangular filter bank.
  • a pre-processing unit 201 configured to pre-process a speech signal to be recognized
  • a detection unit 202, configured to detect valid speech in the preprocessed speech signal to be recognized;
  • a first extraction unit 203, configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • a second extraction unit 204, configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters;
  • the recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
  • the detection unit 202 may be specifically configured to:
  • The logarithm of the cumulative energy of each frequency band is compared with a preset threshold to obtain the valid speech.
  • the first extraction unit 203 may be specifically configured to:
  • The vocal-tract-normalized MFCC feature parameters are calculated according to the adjusted triangular filter bank.
  • FIG. 3 is a schematic diagram of a computer device according to a fourth embodiment of the present application.
  • the computer device 1 includes a memory 20, a processor 30, and computer-readable instructions 40 stored in the memory 20 and executable on the processor 30, such as an accent recognition program.
  • the processor 30 executes the computer-readable instructions 40, the steps in the embodiment of the accent recognition method described above are implemented, for example, steps 101-105 shown in FIG. 1:
  • Step 101: Preprocess a speech signal to be recognized;
  • Step 102: Detect valid speech in the preprocessed speech signal to be recognized;
  • Step 103: Extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • Step 104: Use a pre-trained Gaussian mixture model-universal background model (GMM-UBM) to extract the identity vector (iVector) of the valid speech according to the MFCC feature parameters;
  • Step 105: Calculate a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be recognized according to the decision score.
  • the detecting valid speech in the to-be-recognized voice signal after preprocessing may include:
  • The logarithm of the cumulative energy of each frequency band is compared with a preset threshold to obtain the valid speech.
  • The vocal-tract-normalized MFCC feature parameters are calculated according to the adjusted triangular filter bank.
  • a pre-processing unit 201 configured to pre-process a speech signal to be recognized
  • a detection unit 202, configured to detect valid speech in the preprocessed speech signal to be recognized;
  • a first extraction unit 203, configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • a second extraction unit 204, configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters;
  • the recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
  • the detection unit 202 may be specifically configured to:
  • The logarithm of the cumulative energy of each frequency band is compared with a preset threshold to obtain the valid speech.
  • the first extraction unit 203 may be specifically configured to:
  • The vocal-tract-normalized MFCC feature parameters are calculated according to the adjusted triangular filter bank.
  • the computer-readable instructions 40 may be divided into one or more modules / units, the one or more modules / units are stored in the memory 20 and executed by the processor 30, To complete this application.
  • the one or more modules / units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 40 in the computer device 1.
  • For example, the computer-readable instructions 40 may be divided into the preprocessing unit 201, the detection unit 202, the first extraction unit 203, the second extraction unit 204, and the recognition unit 205 shown in FIG. 2.
  • the computer device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • The schematic diagram in FIG. 3 is only an example of the computer device 1 and does not constitute a limitation on it; the computer device 1 may include more or fewer components than shown, some components may be combined, or different components may be used. For example, the computer device 1 may further include input/output devices, network access devices, buses, and the like.
  • The processor 30 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor, etc.
  • The processor 30 is the control center of the computer device 1, and uses various interfaces and lines to connect the various parts of the entire computer device 1.
  • The memory 20 may be configured to store the computer-readable instructions 40 and/or the modules/units, and the processor 30 implements the various functions of the computer device 1 by running or executing the computer-readable instructions and/or modules/units stored in the memory 20 and calling the data stored in the memory 20.
  • the memory 20 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, application programs required for at least one function (such as a sound playback function, an image playback function, etc.), etc .; the storage data area may Data (such as audio data, phone book, etc.) created according to the use of the computer device 1 are stored.
  • The memory 20 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • When the modules/units integrated in the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a non-volatile readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can also be completed by computer-readable instructions instructing the relevant hardware.
  • The computer-readable instructions can be stored in a non-volatile readable storage medium, and when the computer-readable instructions are executed by a processor, the steps of the foregoing method embodiments can be implemented.
  • the computer-readable instructions include computer-readable instruction codes, and the computer-readable instruction codes may be in a source code form, an object code form, an executable file, or some intermediate form.
  • The non-volatile readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, non-volatile readable media do not include electrical carrier signals and telecommunication signals.
  • each functional unit in each embodiment of the present application may be integrated in the same processing unit, or each unit may exist separately physically, or two or more units may be integrated in the same unit.
  • the integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional modules.

Abstract

An accent identification method, comprising: pre-processing a voice signal to be identified (101); detecting an effective voice in said voice signal having been pre-processed (102); extracting a mel frequency cepstrum coefficient (MFCC) feature parameter with respect to the effective voice (103); according to the MFCC feature parameter, using a pre-trained Gaussian mixture model (GMM)-universal background model (UBM) to extract an identity vector (iVector) of the effective voice (104); and calculating, according to the iVector, a determination score of said voice signal with respect to a given accent, and obtaining, according to the determination score, an accent identification result of said voice signal (105).

Description

Accent recognition method, device, computer device and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on August 14, 2018, with application number 201810922056.0 and invention title "Accent Recognition Method, Device, Computer Device, and Computer-readable Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer audition technology, and in particular to an accent recognition method and device, a computer device, and a storage medium.
Background
With the continuous emergence and deployment of various types of intelligent identity authentication, face recognition and voiceprint recognition have reached a relatively mature stage of development, but recognition accuracy still has room for improvement; in voiceprint recognition in particular, breakthrough points can still be found to obtain more accurate recognition results, and the accent factor is one of them. Because speakers live in different regions, accent differences remain, to a greater or lesser degree, even when everyone speaks Mandarin. If accent recognition can be added to existing voiceprint recognition as a supplement, the application scenarios can be extended further; the most direct application is to identify the region the speaker comes from before voiceprint recognition and thereby narrow the set of candidates for subsequent recognition. However, the existing accent recognition effect is not ideal: recognition is slow and accuracy is not high.
Summary of the Invention
In view of the above, it is necessary to provide an accent recognition method and device, a computer device, and a storage medium that can realize fast and accurate accent recognition.
A first aspect of the present application provides an accent recognition method, which includes:
preprocessing a speech signal to be recognized;
detecting valid speech in the preprocessed speech signal to be recognized;
extracting Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
extracting an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and
calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
A second aspect of the present application provides an accent recognition device, which includes:
a preprocessing unit, configured to preprocess a speech signal to be recognized;
a detection unit, configured to detect valid speech in the preprocessed speech signal to be recognized;
a first extraction unit, configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
a second extraction unit, configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and
a recognition unit, configured to calculate a decision score of the speech signal to be recognized for a given accent according to the iVector and obtain an accent recognition result of the speech signal to be recognized according to the decision score.
A third aspect of the present application provides a computer device, which includes a processor, where the processor implements the accent recognition method when executing computer-readable instructions stored in a memory.
A fourth aspect of the present application provides a non-volatile readable storage medium on which computer-readable instructions are stored, and the computer-readable instructions implement the accent recognition method when executed by a processor.
This application preprocesses the speech signal to be recognized; detects valid speech in the preprocessed speech signal; extracts Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech; extracts the identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and calculates a decision score of the speech signal to be recognized for a given accent according to the iVector, obtaining the accent recognition result of the speech signal to be recognized according to the decision score. This application can find problems at the database level without requiring testers to discover them through complex and extensive functional tests. This application can realize fast and accurate accent recognition.
Brief Description of the Drawings
FIG. 1 is a flowchart of an accent recognition method according to an embodiment of the present application.
FIG. 2 is a structural diagram of an accent recognition device according to an embodiment of the present application.
FIG. 3 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to understand the above objectives, features, and advantages of the present application more clearly, the present application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where there is no conflict, the embodiments of the present application and the features in the embodiments can be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present application; the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used in the specification of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application.
Preferably, the accent recognition method of the present application is applied in one or more computer devices. A computer device is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, a voice control device, or the like.
实施例一Example one
图1是本申请实施例一提供的口音识别方法的流程图。所述口音识别方法应用于计算机装置，对待识别语音信号进行口音识别，得到所述待识别语音信号所属的口音。FIG. 1 is a flowchart of the accent recognition method provided in Embodiment 1 of the present application. The accent recognition method is applied to a computer device, performs accent recognition on a speech signal to be recognized, and obtains the accent to which the speech signal to be recognized belongs.
如图1所示,所述口音识别方法具体包括以下步骤:As shown in FIG. 1, the accent recognition method specifically includes the following steps:
步骤101,对待识别语音信号进行预处理。Step 101: Pre-process a speech signal to be recognized.
所述待识别语音信号可以是模拟语音信号,也可以是数字语音信号。若所述待识别语音信号是模拟语音信号,则将所述模拟语音信号进行模数变换,转换为数字语音信号。The voice signal to be identified may be an analog voice signal or a digital voice signal. If the speech signal to be identified is an analog speech signal, the analog speech signal is subjected to analog-to-digital conversion to be converted into a digital speech signal.
所述待识别语音信号可以是通过语音输入设备(例如麦克风、手机话筒等)采集到的语音信号。The voice signal to be recognized may be a voice signal collected through a voice input device (for example, a microphone, a mobile phone microphone, etc.).
对所述待识别语音信号进行预处理可以包括对所述待识别语音信号进行预加重。Preprocessing the speech signal to be identified may include pre-emphasizing the speech signal to be identified.
预加重的目的是提升语音的高频分量,使信号的频谱变得平坦。语音信号由于受声门激励和口鼻辐射的影响,能量在高频端明显减小,通常是频率越高幅值越小。当频率提升两倍时,功率谱幅度按6dB/oct跌落。因此,在对待识别语音信号进行频谱分析或声道参数分析前,需要对待识别语音信号的高频部分进行频率提升,即对待识别语音信号进行预加重。预加重一般利用高通滤波器实现,高通滤波器的传递函数可以为:The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Due to the influence of the glottal excitation and mouth-nose radiation, the energy of the speech signal is significantly reduced at the high-frequency end, usually the higher the frequency, the smaller the amplitude. When the frequency is doubled, the power spectrum amplitude drops by 6dB / oct. Therefore, before performing spectrum analysis or channel parameter analysis of the speech signal to be identified, it is necessary to perform frequency boosting on the high frequency portion of the speech signal to be identified, that is, pre-emphasis of the speech signal to be identified. Pre-emphasis is generally implemented using a high-pass filter. The transfer function of the high-pass filter can be:
H(z) = 1 - κz⁻¹，0.9 ≤ κ ≤ 1.0，
其中,κ为预加重系数,优选取值在0.94-0.97之间。Among them, κ is a pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
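Purely as an illustrative, non-limiting sketch of the pre-emphasis described above, the following Python snippet applies the high-pass filter H(z) = 1 - κz⁻¹ to a digital speech signal held in a numpy array; the default coefficient 0.95 is only an assumed example inside the 0.94-0.97 range mentioned above.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, kappa: float = 0.95) -> np.ndarray:
    """Apply the pre-emphasis filter y(n) = x(n) - kappa * x(n-1), i.e. H(z) = 1 - kappa * z^-1."""
    # The first sample has no predecessor, so it is kept unchanged.
    return np.append(signal[0], signal[1:] - kappa * signal[:-1])
```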
对所述待识别语音信号进行预处理还可以包括对所述待识别语音信号进行加窗分帧。Preprocessing the speech signal to be identified may further include windowing and framing the speech signal to be identified.
语音信号是一种非平稳的时变信号，主要分为浊音和清音两大类。浊音的基音周期、浊音信号幅度和声道参数等都随时间而缓慢变化，但通常在10ms-30ms的时间内可以认为具有短时平稳性。为了获得短时平稳信号，语音信号处理中可以把语音信号分成一些短段来进行处理，这个过程称为分帧，得到的短段的语音信号称为语音帧。分帧是通过对语音信号进行加窗处理来实现的。为了避免相邻两帧的变化幅度过大，帧与帧之间需要重叠一部分。在本申请的一个实施例中，每个语音帧为20毫秒，相邻两个语音帧之间存在10毫秒重叠，也就是每隔10毫秒取一个语音帧。A speech signal is a non-stationary, time-varying signal, mainly divided into voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal and the vocal tract parameters all change slowly with time, but the signal can usually be regarded as short-term stationary within 10 ms-30 ms. To obtain short-term stationary signals, speech signal processing may divide the speech signal into short segments for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is achieved by windowing the speech signal. To avoid excessive change between two adjacent frames, adjacent frames need to overlap partially. In one embodiment of the present application, each speech frame is 20 milliseconds, and two adjacent speech frames overlap by 10 milliseconds, that is, one speech frame is taken every 10 milliseconds.
常用的窗函数有矩形窗、汉明窗和汉宁窗,矩形窗函数为:The commonly used window functions are rectangular window, Hamming window and Hanning window. The rectangular window function is:
w(n) = 1，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
汉明窗函数为:The Hamming window function is:
w(n) = 0.54 - 0.46cos(2πn/(N-1))，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
汉宁窗函数为:The Hanning window function is:
w(n) = 0.5[1 - cos(2πn/(N-1))]，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
其中,N为一个语音帧所包含的采样点的个数。Among them, N is the number of sampling points included in a speech frame.
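The windowing and framing described above can be sketched as follows; this is only an assumed implementation (Hamming window, 20 ms frames and 10 ms shift as in the embodiment, with a 16 kHz sampling rate chosen purely for illustration), not a definitive one.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, fs: int = 16000,
                     frame_ms: float = 20.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a speech signal into overlapping frames and apply a Hamming window.

    Frame length 20 ms and shift 10 ms follow the embodiment above; the 16 kHz
    sampling rate is only an assumed example.
    """
    frame_len = int(fs * frame_ms / 1000)   # N, samples per frame
    shift = int(fs * shift_ms / 1000)
    num_frames = max(1, 1 + (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)          # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.zeros((num_frames, frame_len))
    for i in range(num_frames):
        start = i * shift
        chunk = signal[start:start + frame_len]
        frames[i, :len(chunk)] = chunk       # zero-pad the last, possibly short, frame
    return frames * window
```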
步骤102,检测预处理后的所述待识别语音信号中的有效语音。Step 102: Detect a valid voice in the pre-recognized voice signal.
可以根据预处理后的所述待识别语音信号的短时能量和短时过零率等进行端点检测,以确定所述待识别语音信号中的有效语音。Endpoint detection may be performed according to the pre-processed short-term energy and short-time zero-crossing rate of the speech signal to be identified to determine valid speech in the speech signal to be identified.
在本实施例中,可以通过下述方法检测预处理后的所述待识别语音信号中的有效语音:In this embodiment, the valid voice in the pre-recognized voice signal can be detected by the following methods:
(1)对预处理后的所述待识别语音信号进行加窗分帧,得到所述待识别语音信号的语音帧x(n)。在一个具体实施例中,可以对预处理后的所述待识别语音信号加汉明窗,每帧20ms,帧移10ms。若预处理过程中已对待识别语音信号加窗分帧,则该步骤省略。(1) Windowing and framing the pre-processed speech signal to be identified to obtain a speech frame x (n) of the speech signal to be identified. In a specific embodiment, a Hamming window may be added to the pre-processed speech signal to be identified, each frame being 20 ms, and the frame shifting being 10 ms. If window frames have been framed in the pre-processed speech signal, this step is omitted.
(2)对所述语音帧x(n)进行离散傅里叶变换(Discrete Fourier Transform,DFT),得到所述语音帧x(n)的频谱:(2) Discrete Fourier Transform (DFT) is performed on the speech frame x (n) to obtain the frequency spectrum of the speech frame x (n):
X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N)，k = 0, 1, …, N-1
(3)根据所述语音帧x(n)的频谱计算各个频带的累计能量:(3) Calculate the cumulative energy of each frequency band according to the frequency spectrum of the speech frame x (n):
E(m) = Σ_{k=m1}^{m2} |X(k)|²
其中E(m)表示第m个频带的累计能量，m1和m2分别表示第m个频带的起始和截止频点。Where E(m) represents the cumulative energy of the m-th frequency band, and m1 and m2 represent the start and end frequency points of the m-th frequency band, respectively.
(4)对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值。(4) Perform a logarithmic operation on the accumulated energy of each frequency band to obtain a logarithm value of the accumulated energy of each frequency band.
(5)将所述各个频带的累计能量对数值与预设阈值比较,得到所述有效语音。若一个频带的累计能量对数值高于预设阈值,则所述频带对应的语音为有效语音。(5) Compare the cumulative energy logarithm of each frequency band with a preset threshold to obtain the effective speech. If the cumulative energy log value of a frequency band is higher than a preset threshold, the speech corresponding to the frequency band is a valid speech.
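A minimal sketch of the endpoint-detection steps (1)-(5) above is given below; the number of frequency bands, the uniform band split and the threshold value are illustrative assumptions rather than values fixed by the embodiment.

```python
import numpy as np

def detect_valid_speech(frames: np.ndarray, num_bands: int = 4,
                        threshold: float = 10.0) -> np.ndarray:
    """Return a boolean mask of frames whose per-band log cumulative energy exceeds a threshold.

    Frames are assumed to be already windowed (e.g. by frame_and_window above);
    the band count, the uniform band split and the threshold are illustrative only.
    """
    spectrum = np.fft.rfft(frames, axis=1)            # step (2): DFT of each frame
    power = np.abs(spectrum) ** 2
    bands = np.array_split(power, num_bands, axis=1)  # step (3): split spectrum into bands
    band_energy = np.stack([b.sum(axis=1) for b in bands], axis=1)
    log_energy = np.log(band_energy + 1e-10)          # step (4): logarithm of cumulative energy
    return (log_energy > threshold).any(axis=1)       # step (5): compare with preset threshold
```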
步骤103，对所述有效语音提取梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征参数。In step 103, Mel Frequency Cepstrum Coefficient (MFCC) feature parameters are extracted from the valid speech.
提取MFCC特征参数的流程如下:The process of extracting MFCC characteristic parameters is as follows:
(1)对每一个语音帧进行离散傅里叶变换(可以是快速傅里叶变换),得到该语音帧的频谱。(1) Perform a discrete Fourier transform (which can be a fast Fourier transform) on each speech frame to obtain the frequency spectrum of the speech frame.
(2)求该语音帧的频谱幅度的平方,得到该语音帧的离散能量谱。(2) The square of the spectral amplitude of the speech frame is obtained to obtain the discrete energy spectrum of the speech frame.
(3)将该语音帧的离散能量谱通过一组Mel频率上均匀分布的三角滤波器(即三角滤波器组),得到各个三角滤波器的输出。该组三角滤波器的中心频率在Mel频率刻度上均匀排列,且每个三角滤波器的三角形两个底点的频率分别等于相邻的两个三角滤波器的中心频率。三角滤波器的中心频率为:(3) The discrete energy spectrum of the speech frame is passed through a set of triangular filters (ie, triangular filter groups) uniformly distributed on the Mel frequency to obtain the output of each triangular filter. The center frequencies of the set of triangular filters are evenly arranged on the Mel frequency scale, and the frequencies of the two bottom points of the triangles of each triangular filter are respectively equal to the center frequencies of two adjacent triangular filters. The center frequency of the triangular filter is:
f(m) = (N/F_s)·B⁻¹( B(f_l) + m·[B(f_h) - B(f_l)]/(M+1) )，m = 1, 2, …, M，其中B(f) = 1125·ln(1 + f/700) (where B(f) = 1125·ln(1 + f/700))
三角滤波器的频率响应为:The frequency response of the triangular filter is:
H_m(k) = 0，k &lt; f(m-1)或k &gt; f(m+1)；H_m(k) = (k - f(m-1))/(f(m) - f(m-1))，f(m-1) ≤ k ≤ f(m)；H_m(k) = (f(m+1) - k)/(f(m+1) - f(m))，f(m) &lt; k ≤ f(m+1)
其中，f_h、f_l分别为三角滤波器组的最高频率和最低频率；N为傅里叶变换的点数；F_s为采样频率；M为三角滤波器的个数；B⁻¹(b) = 700(e^(b/1125) - 1)是Mel频率函数B(f)的逆函数。Here, f_h and f_l are the highest and lowest frequencies covered by the triangular filter bank; N is the number of Fourier transform points; F_s is the sampling frequency; M is the number of triangular filters; and B⁻¹(b) = 700(e^(b/1125) - 1) is the inverse of the Mel frequency function B(f).
(4)对所有三角滤波器的输出做对数运算,得到该语音帧的对数功率谱S(m)。(4) Logarithmic operations are performed on the outputs of all triangular filters to obtain the logarithmic power spectrum S (m) of the speech frame.
(5)对S(m)做离散余弦变换(Discrete Cosine Transform,DCT),得到该语音帧的初始MFCC特征参数。离散余弦变换为:(5) Discrete Cosine Transform (DCT) is performed on S (m) to obtain the initial MFCC characteristic parameters of the speech frame. The discrete cosine transform is:
C(n) = Σ_{m=0}^{M-1} S(m)·cos( πn(m + 0.5)/M )，n = 1, 2, …, L
(6)提取语音帧的动态差分MFCC特征参数。初始MFCC特征参数只反映了语音参数的静态特性，语音的动态特性可通过静态特征的差分谱来描述，动静态结合可以有效提升系统的识别性能，通常使用一阶和/或者二阶差分MFCC特征参数。(6) Extract the dynamic differential MFCC feature parameters of the speech frame. The initial MFCC feature parameters only reflect the static characteristics of the speech; the dynamic characteristics of speech can be described by the differential spectrum of the static features. Combining dynamic and static features can effectively improve the recognition performance of the system, and first-order and/or second-order differential MFCC feature parameters are usually used.
在一具体实施例中,提取的MFCC特征参数为39维的特征矢量,包括13维初始MFCC特征参数、13维一阶差分MFCC特征参数和13维二阶差分MFCC特征参数。In a specific embodiment, the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
MFCC中引入了三角滤波器组,且三角滤波器在低频段分布较密,在高频段分布较疏,符合人耳听觉特性,在噪声环境下仍具有较好的识别性能。The triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has good recognition performance in a noisy environment.
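The MFCC extraction flow of steps (1)-(6) can be sketched as follows. The filterbank size (26), the 16 kHz sampling rate and the simple two-frame delta are assumed illustrative choices, and the filterbank construction uses the common FFT-bin rounding approximation of the Mel triangular filters rather than reproducing the exact formulas above.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(num_filters: int, nfft: int, fs: int) -> np.ndarray:
    """Triangular filters spaced uniformly on the Mel scale B(f) = 1125*ln(1 + f/700)."""
    mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
    mel_inv = lambda b: 700.0 * (np.exp(b / 1125.0) - 1.0)
    mel_points = np.linspace(mel(0), mel(fs / 2), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_inv(mel_points) / fs).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def delta(feat: np.ndarray) -> np.ndarray:
    """Simple first-order difference along the time axis (frame t+1 minus frame t-1)."""
    padded = np.pad(feat, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def mfcc_39(frames: np.ndarray, fs: int = 16000, num_filters: int = 26,
            num_ceps: int = 13) -> np.ndarray:
    """39-dim MFCC: 13 static + 13 delta + 13 delta-delta, as in the embodiment above."""
    nfft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2            # steps (1)-(2)
    fbank_out = np.dot(power, mel_filterbank(num_filters, nfft, fs).T)  # step (3)
    log_power = np.log(fbank_out + 1e-10)                               # step (4)
    static = dct(log_power, type=2, axis=1, norm="ortho")[:, :num_ceps] # step (5)
    d1 = delta(static)                                                  # step (6)
    d2 = delta(d1)
    return np.hstack([static, d1, d2])
```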
在本申请的一个实施中，在对预处理后的待识别语音信号提取MFCC特征参数之后，还可以对提取的MFCC特征参数进行降维处理，得到降维后的MFCC特征参数。例如，采用分段均值数据降维算法对MFCC特征参数进行降维处理，得到降维后的MFCC特征参数。降维后的MFCC特征参数将用于后续的步骤。In one implementation of the present application, after the MFCC feature parameters are extracted from the preprocessed speech signal to be recognized, the extracted MFCC feature parameters may further undergo dimensionality reduction to obtain dimensionality-reduced MFCC feature parameters. For example, a piecewise-mean data dimensionality reduction algorithm is applied to the MFCC feature parameters to obtain the dimensionality-reduced MFCC feature parameters. The dimensionality-reduced MFCC feature parameters are used in the subsequent steps.
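One possible reading of the segment-mean dimensionality reduction mentioned above is to average the MFCC frames within equal-length time segments, as sketched below; the number of segments and this particular interpretation are assumptions.

```python
import numpy as np

def segment_mean_reduce(features: np.ndarray, num_segments: int = 20) -> np.ndarray:
    """Reduce a (frames x dims) MFCC sequence by averaging within equal time segments.

    A minimal reading of the 'segment-mean' dimensionality reduction mentioned above;
    the number of segments is an assumed example.
    """
    segments = np.array_split(features, num_segments, axis=0)
    return np.vstack([seg.mean(axis=0) for seg in segments if len(seg) > 0])
```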
步骤104，根据所述MFCC特征参数，利用预先训练好的高斯混合模型(Gaussian Mixture Model,GMM)-通用背景模型(Universal Background Model,UBM)提取所述有效语音的身份矢量(identity-vector,iVector)。Step 104: According to the MFCC feature parameters, extract the identity vector (identity-vector, iVector) of the valid speech using a pre-trained Gaussian Mixture Model (GMM)-Universal Background Model (UBM).
提取iVector之前,首先要用大量属于不同口音的训练数据训练出通用背景模型。通用背景模型实际上是一种高斯混合模型(GMM),旨在解决实际场景数据量稀缺的问题。GMM是一种参数化的生成性模型,具备对实际数据极强的表征力(基于高斯分量实现)。高斯分量越多,GMM表征力越强,规模也越庞大,此时负面效应逐步凸显——若想获得一个泛化能力较强的GMM模型,则需要足够的数据来驱动GMM的参数训练,然而实际场景中获取的语音数据甚至连分钟级都很难企及。UBM正是解决了训练数据不足的问题。UBM是利用大量属于不同口音的训练数据(无关乎说话人、地域)混合起来充分训练,得到一个可以对语音共通特性进行表征的全局GMM,可大大缩减从头计算GMM参数所消耗的资源。通用背景模型训练完成后, 只需利用单独属于每个口音的训练数据,分别对UBM的参数进行微调(例如通过UBM自适应),得到属于各个口音的GMM。Before extracting iVector, we must first train a universal background model with a large amount of training data belonging to different accents. The general background model is actually a Gaussian mixture model (GMM), which aims to solve the problem of scarce data volume in actual scenes. GMM is a parametric generative model, which has a strong representation of actual data (based on Gaussian components). The more Gaussian components, the stronger the GMM characterization, and the larger the scale. At this time, the negative effects are gradually prominent. If you want to obtain a GMM model with strong generalization ability, you need sufficient data to drive GMM parameter training. However, The voice data obtained in the actual scene is difficult to reach even the minute level. UBM solves the problem of insufficient training data. UBM uses a large amount of training data belonging to different accents (irrespective of speakers and regions) to be fully trained to obtain a global GMM that can characterize common characteristics of speech, which can greatly reduce the resources consumed to calculate GMM parameters from scratch. After the training of the universal background model is completed, only the training data belonging to each accent needs to be used to fine-tune the parameters of the UBM (for example, through UBM adaptation) to obtain the GMM belonging to each accent.
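As a hedged illustration of the UBM training and per-accent adaptation described above, the following sketch trains a diagonal-covariance GMM on pooled multi-accent data with scikit-learn and then performs a simple mean-only MAP adaptation for one accent; the component count and relevance factor are assumed values, and this is not claimed to be the exact training procedure of the embodiment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_accent_features: np.ndarray, num_components: int = 64) -> GaussianMixture:
    """Train the universal background model on pooled features from many accents."""
    ubm = GaussianMixture(n_components=num_components, covariance_type="diag", max_iter=100)
    ubm.fit(all_accent_features)
    return ubm

def adapt_means(ubm: GaussianMixture, accent_features: np.ndarray,
                relevance: float = 16.0) -> np.ndarray:
    """Mean-only MAP adaptation of the UBM towards one accent's training data."""
    post = ubm.predict_proba(accent_features)            # frame-component posteriors
    n_k = post.sum(axis=0)                                # soft counts per component
    f_k = post.T @ accent_features                        # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]            # adaptation weights
    new_means = alpha * (f_k / np.maximum(n_k, 1e-10)[:, None]) + (1.0 - alpha) * ubm.means_
    return new_means
```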
在一个实施例中,不同口音可以是属于不同地域的口音。所述地域可以是按照行政区域来划分,例如辽宁、北京、天津、上海、河南、广东等。所述地域也可以是按照普遍的经验以口音对地区来划分,例如闽南、客家等。In one embodiment, different accents may be accents belonging to different regions. The regions may be divided according to administrative regions, such as Liaoning, Beijing, Tianjin, Shanghai, Henan, Guangdong, and so on. The region may also be divided into regions based on accent according to common experience, such as southern Fujian and Hakka.
提取iVector是基于全差异空间建模(TV)方法将UBM训练得出的高维GMM映射至低维度的全变量子空间,可突破随语音信号时长过长而提取的向量维度过大不便计算的限制,并能提升计算速度,表达出更全面的特征。GMM-UBM中的GMM超矢量可以包括跟说话人本身有关的矢量特征和跟信道以及其他变化有关的矢量特征的线性叠加。The extraction of iVector is based on the full difference space modeling (TV) method, which maps the high-dimensional GMM trained by UBM to the low-dimensional full-variable subspace, which can break through the inconvenient calculation of the extracted vector dimension as the length of the speech signal is too long. Limits, and can increase the speed of calculation and express more comprehensive characteristics. The GMM supervector in GMM-UBM may include a linear superposition of vector features related to the speaker itself and vector features related to channels and other changes.
TV模型的子空间建模形式为:The subspace modeling form of the TV model is:
M=m+TwM = m + Tw
其中，M表示语音的GMM超矢量，即所述MFCC特征参数，m表示口音无关的GMM超矢量，T表示描述差异的空间的载荷矩阵，w表示GMM超矢量M在载荷矩阵空间下对应的低维因子表示，即iVector。Here, M represents the GMM supervector of the speech, that is, the MFCC feature parameters; m represents the accent-independent GMM supervector; T represents the loading matrix of the space describing the differences; and w represents the low-dimensional factor representation of the GMM supervector M in the loading-matrix space, that is, the iVector.
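For illustration only, the following sketch computes the standard posterior-mean estimate of w in the TV model M = m + Tw from Baum-Welch statistics collected under a diagonal-covariance UBM; it assumes the total-variability matrix T has already been trained, and it omits the LDA/WCCN compensation mentioned below.

```python
import numpy as np

def extract_ivector(features: np.ndarray, ubm_weights: np.ndarray, ubm_means: np.ndarray,
                    ubm_vars: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Posterior-mean estimate of w in M = m + Tw (a standard textbook i-vector formula).

    features: (frames x D) MFCCs; ubm_means/ubm_vars: (C x D) diagonal UBM;
    T: (C*D x R) total-variability matrix, assumed already trained elsewhere.
    """
    C, D = ubm_means.shape
    # Frame-component posteriors under the diagonal-covariance UBM (log-domain for stability).
    log_prob = -0.5 * (((features[:, None, :] - ubm_means[None]) ** 2) / ubm_vars[None]).sum(-1)
    log_prob += np.log(ubm_weights) - 0.5 * np.log(ubm_vars).sum(axis=1)
    log_prob -= log_prob.max(axis=1, keepdims=True)
    post = np.exp(log_prob)
    post /= post.sum(axis=1, keepdims=True)
    N = post.sum(axis=0)                                   # zero-order statistics per component
    F = post.T @ features - N[:, None] * ubm_means         # centred first-order statistics
    sigma_inv = (1.0 / ubm_vars).reshape(-1)               # diagonal covariance, flattened to C*D
    N_big = np.repeat(N, D)                                # N expanded to the supervector layout
    R = T.shape[1]
    precision = np.eye(R) + (T * (N_big * sigma_inv)[:, None]).T @ T
    return np.linalg.solve(precision, T.T @ (sigma_inv * F.reshape(-1)))
```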
在本实施例中，可以对提取的iVector进行噪声补偿。在一实施例中，可以采用线性判别分析(Linear Discriminant Analysis,LDA)和类内协方差规整(Within Class Covariance Normalization,WCCN)对提取的iVector进行噪声补偿。In this embodiment, noise compensation may be performed on the extracted iVector. In one embodiment, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) may be used to perform noise compensation on the extracted iVector.
步骤105,根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。Step 105: Calculate a judgment score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be identified according to the judgment score.
给定口音可以是一个也可以是多个。例如,若给定口音为一个,可以根据所述iVector计算所述待识别语音信号对该给定口音的判决得分,根据所述待识别语音信号对该给定口音的判决得分判断所述待识别语音信号是否为该给定口音。可以判断所述判决得分是否大于预设得分(例如9分),若所述判决得分大于预设得分,则判断所述待识别语音信号为该给定口音。A given accent can be one or more. For example, if the given accent is one, the judgment score of the to-be-recognized voice signal for the given accent may be calculated according to the iVector, and the to-be-recognized may be judged according to the judgment score of the to-be-recognized voice signal for the given accent. Whether the speech signal is the given accent. It can be judged whether the judgment score is greater than a preset score (for example, 9 points), and if the judgment score is greater than a preset score, it is judged that the speech signal to be recognized is the given accent.
若给定口音为多个，可以根据所述iVector计算所述待识别语音信号对每个给定口音的判决得分，根据所述待识别语音信号对每个给定口音的判决得分判断所述语音为多个给定口音中的哪一个。可以确定对多个给定口音的判决得分中的最高得分，将所述最高得分对应的给定口音作为所述待识别语音信号所属的口音。If there are multiple given accents, the decision score of the speech signal to be recognized for each given accent may be calculated according to the iVector, and which of the multiple given accents the speech belongs to may be judged according to these decision scores. The highest score among the decision scores for the multiple given accents may be determined, and the given accent corresponding to the highest score is taken as the accent to which the speech signal to be recognized belongs.
在本实施例中,可以利用逻辑回归(Logistic Regression)模型计算所述待识别语音信号对给定口音的判决得分。逻辑回归模型作为一个分类器,可根据待识别语音信号的iVector对待识别语音信号进行打分。特别地,在一具体实施例中,可以使用多类逻辑回归模型计算所述待识别语音信号对给定口音的判决得分。In this embodiment, a Logistic Regression model can be used to calculate the decision score of the speech signal to be recognized for a given accent. As a classifier, the logistic regression model can score the speech signal to be identified according to the iVector of the speech signal to be identified. Specifically, in a specific embodiment, a multi-class logistic regression model may be used to calculate the decision score of the speech signal to be recognized for a given accent.
假设给定口音包括口音1、口音2、…、口音N共N种口音，则利用N类逻辑回归模型计算所述待识别语音信号对给定口音的判决得分。将待识别语音信号的iVector（记为x_t）输入所述N类逻辑回归模型，得到N个判决得分s_it（即所述待识别语音信号对N种给定口音的判决得分），s_it = w_i·x_t + k_i，i = 1, …, N。求取N个判决得分s_it（i = 1, …, N）中的最高得分s_jt，最高得分s_jt对应的口音j即为所述待识别语音信号所属的口音。其中，w_i、k_i是N类逻辑回归模型的参数，w_i为回归系数，k_i为常数，针对每个给定口音均有对应的w_i和k_i，各组(w_i, k_i)组成N类逻辑回归模型的参数向量M = {(w_1, k_1), (w_2, k_2), …, (w_N, k_N)}。Assuming the given accents include N accents (accent 1, accent 2, ..., accent N), an N-class logistic regression model is used to calculate the decision scores of the speech signal to be recognized for the given accents. The iVector of the speech signal to be recognized (denoted x_t) is input into the N-class logistic regression model to obtain N decision scores s_it (that is, the decision scores of the speech signal to be recognized for the N given accents), where s_it = w_i·x_t + k_i, i = 1, ..., N. The highest score s_jt among the N decision scores s_it (i = 1, ..., N) is determined, and the accent j corresponding to the highest score s_jt is the accent to which the speech signal to be recognized belongs. Here, w_i and k_i are parameters of the N-class logistic regression model, w_i being a regression coefficient and k_i a constant; each given accent has its corresponding w_i and k_i, and these pairs form the parameter vector of the N-class logistic regression model, M = {(w_1, k_1), (w_2, k_2), ..., (w_N, k_N)}.
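A minimal sketch of the N-class scoring described above (s_it = w_i·x_t + k_i followed by taking the maximum) is shown below; the parameters W and k are assumed to come from an already-trained logistic regression model, and the commented usage example uses purely hypothetical random values.

```python
import numpy as np

def accent_decision(x_t: np.ndarray, W: np.ndarray, k: np.ndarray):
    """Score an iVector against N given accents with s_i = w_i . x_t + k_i and pick the argmax.

    W: (N x R) rows are the per-accent regression coefficients w_i; k: (N,) constants k_i.
    Both are assumed to come from an already-trained N-class logistic regression model.
    """
    scores = W @ x_t + k          # decision scores s_1t, ..., s_Nt
    best = int(np.argmax(scores)) # accent j with the highest score s_jt
    return scores, best

# Hypothetical usage with randomly initialised parameters (illustration only):
# rng = np.random.default_rng(0)
# scores, accent_j = accent_decision(rng.standard_normal(400),
#                                    rng.standard_normal((5, 400)), np.zeros(5))
```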
实施例一的口音识别方法对待识别语音信号进行预处理；检测预处理后的所述待识别语音信号中的有效语音；对所述有效语音提取MFCC特征参数；根据所述MFCC特征参数，利用预先训练好的GMM-UBM提取所述有效语音的iVector；根据所述iVector计算所述待识别语音信号对给定口音的判决得分，根据所述判决得分得到所述待识别语音信号的口音识别结果。实施例一可以实现快速准确的口音识别。The accent recognition method of Embodiment 1 preprocesses the speech signal to be recognized; detects valid speech in the preprocessed speech signal to be recognized; extracts MFCC feature parameters from the valid speech; extracts the iVector of the valid speech using the pre-trained GMM-UBM according to the MFCC feature parameters; calculates the decision score of the speech signal to be recognized for a given accent according to the iVector, and obtains the accent recognition result of the speech signal to be recognized according to the decision score. Embodiment 1 can realize fast and accurate accent recognition.
在其他的实施例中，在提取MFCC特征参数时，可以进行声道长度归一化(Vocal Tract Length Normalization,VTLN)，得到声道长度归一化的MFCC特征参数。In other embodiments, when extracting the MFCC feature parameters, vocal tract length normalization (Vocal Tract Length Normalization, VTLN) may be performed to obtain vocal-tract-length-normalized MFCC feature parameters.
声道可以表示为级联声管模型,每个声管都可以看成是一个谐振腔,它们的共振频率取决于声管的长度和形状。因此,说话人之间的部分声学差异是由于说话人的声道长度不同。例如,声道长度的变化范围一般从13cm(成年女性)变化到18cm(成年男性),因此,不同性别的人说同一个元音的共振峰频率相差很大。VTLN就是为了消除男、女声道长度的差异,使口音识别的结果不受性别的干扰。The channels can be represented as a cascaded sound tube model. Each sound tube can be regarded as a resonant cavity, and their resonance frequency depends on the length and shape of the sound tube. Therefore, some of the acoustic differences between speakers are due to the difference in speaker channel length. For example, the range of the channel length generally varies from 13 cm (adult female) to 18 cm (adult male). Therefore, people of different genders say that the formant frequencies of the same vowel differ greatly. VTLN is to eliminate the difference in the length of the male and female channels, so that the result of accent recognition is not disturbed by gender.
VTLN可以通过弯折和平移频率坐标来使各说话人的共振峰频率相匹配。在本实施例中,可以采用基于双线性变换的VTLN方法。该基于双线性变换的VTLN方法并不直接对待识别语音信号的频谱进行折叠,而是采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;根据所述频率弯折因子,采用双线性变换对三角滤波器组的位置(例如三角滤波器的起点、中间点和结束点)和宽度进行调整;根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。例如,若要对待识别语音信号进行频谱压缩,则对三角滤波器的刻度进行拉伸,此时三角滤波器组向左扩展和移动。若要对待识别语音信号进行频谱拉伸,则对三角滤波器的刻度进行压缩,此时三角滤波器组向右压缩和移动。采用该基于双线性变换的VTLN方法对特定人群或特定人进行声道归一化时,仅需要对三角滤波器组系数进行一次变换即可,无需每次在提取特征参数时都对信号频谱折叠,从而大大减小了计算量。并且,该基于双线性变换的VTLN方法避免了对频率因子线性搜索,减小了运算复杂度。同时,该基于双线性变换的VTLN方法利用双线性变换,使弯折的频率连续且无带宽改变。VTLN can match the frequency of formants of each speaker by bending and translating the frequency coordinates. In this embodiment, a VTLN method based on bilinear transformation may be used. The bilinear transformation-based VTLN method does not directly fold the frequency spectrum of the recognized speech signal, but uses a mapping formula of the bilinear transformation low-pass filter cutoff frequency to calculate the average third formant frequency aligned with different speakers A bending factor; according to the frequency bending factor, the position (for example, the starting point, the middle point, and the ending point of the triangular filter) and the width of the triangular filter bank are adjusted by using a bilinear transformation; according to the adjusted triangular filter The group calculates the channel normalized MFCC characteristic parameters. For example, if spectrum compression is to be performed on the speech signal to be recognized, the scale of the triangular filter is stretched, and the triangular filter bank is expanded and moved to the left at this time. To spectrally stretch the recognition speech signal, the scale of the triangular filter is compressed, and the triangular filter bank is compressed and moved to the right. When the VTLN method based on the bilinear transformation is used to normalize the channel of a specific group of people or a specific person, only the triangular filter bank coefficients need to be transformed once, and the signal spectrum does not need to be extracted each time the feature parameters are extracted. Folding, which greatly reduces the amount of calculation. In addition, the VTLN method based on bilinear transformation avoids a linear search for frequency factors and reduces the computational complexity. At the same time, the VTLN method based on bilinear transformation uses bilinear transformation to make the bending frequency continuous without bandwidth change.
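The following is only a rough sketch, under stated assumptions, of warping the triangular-filterbank edge frequencies once per speaker with a first-order all-pass (bilinear) map; the exact warping formula used by the embodiment and the estimation of the bending factor α from average third formants are not reproduced here.

```python
import numpy as np

def bilinear_warp(freq_hz: np.ndarray, alpha: float, fs: int) -> np.ndarray:
    """Warp frequencies with the first-order all-pass (bilinear) map
    w_hat = w + 2*arctan(alpha*sin(w) / (1 - alpha*cos(w))), w in radians (0..pi).

    alpha is the speaker-dependent bending factor; how it is estimated from
    average third-formant alignment is assumed to be done elsewhere.
    """
    w = np.pi * freq_hz / (fs / 2.0)
    w_hat = w + 2.0 * np.arctan(alpha * np.sin(w) / (1.0 - alpha * np.cos(w)))
    return w_hat * (fs / 2.0) / np.pi

def warp_filterbank_edges(edges_hz: np.ndarray, alpha: float, fs: int) -> np.ndarray:
    """Shift/stretch the start, centre and end points of each triangular filter once per
    speaker (or speaker group), instead of re-warping every frame's spectrum."""
    return bilinear_warp(edges_hz, alpha, fs)
```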
在另一实施例中,所述口音识别方法还可以包括:根据所述口音识别结果进行声纹识别。由于说话人所生活的地域不同,即使在都讲普通话的情况下或多或少依然会有口音的差别,将口音识别应用到声纹识别中,可以缩小后续声纹识别的对象范围,得到更为准确的识别结果。In another embodiment, the accent recognition method may further include: performing voiceprint recognition according to the accent recognition result. Because the speakers live in different regions, even if they all speak Mandarin, there will be more or less accent differences. Applying accent recognition to voiceprint recognition can reduce the scope of subsequent voiceprint recognition objects and get more For accurate recognition results.
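As a trivial, hypothetical illustration of narrowing the voiceprint search space with the accent result, a candidate set keyed by speaker id could be filtered as follows; the record layout with an "accent" field is an assumption, not something defined by the embodiment.

```python
def narrow_candidates(enrolled: dict, accent: str) -> dict:
    """Keep only enrolled speakers whose registered accent matches the recognized accent.

    'enrolled' maps a speaker id to a record assumed to contain an 'accent' field;
    the surviving subset would then be passed to the subsequent voiceprint recognition.
    """
    return {spk: rec for spk, rec in enrolled.items() if rec.get("accent") == accent}
```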
实施例二Example two
图2为本申请实施例二提供的口音识别装置的结构图。如图2所示,所述口音识别装置10可以包括:预处理单元201、检测单元202、第一提取单元203、第二提取单元204、识别单元205。FIG. 2 is a structural diagram of an accent recognition device provided in Embodiment 2 of the present application. As shown in FIG. 2, the accent recognition device 10 may include a preprocessing unit 201, a detection unit 202, a first extraction unit 203, a second extraction unit 204, and a recognition unit 205.
预处理单元201,用于对待识别语音信号进行预处理。The pre-processing unit 201 is configured to pre-process a speech signal to be recognized.
所述待识别语音信号可以是模拟语音信号,也可以是数字语音信号。若所述待识别语音信号是模拟语音信号,则将所述模拟语音信号进行模数变换,转换为数字语音信号。The voice signal to be identified may be an analog voice signal or a digital voice signal. If the speech signal to be identified is an analog speech signal, the analog speech signal is subjected to analog-to-digital conversion to be converted into a digital speech signal.
所述待识别语音信号可以是通过语音输入设备(例如麦克风、手机话筒等)采集到的语音信号。The voice signal to be recognized may be a voice signal collected through a voice input device (for example, a microphone, a mobile phone microphone, etc.).
对所述待识别语音信号进行预处理可以包括对所述待识别语音信号进行预加重。Preprocessing the speech signal to be identified may include pre-emphasizing the speech signal to be identified.
预加重的目的是提升语音的高频分量,使信号的频谱变得平坦。语音信号由于受声门激励和口鼻辐射的影响,能量在高频端明显减小,通常是频率越高幅值越小。当频率提升两倍时,功率谱幅度按6dB/oct跌落。因此,在对待识别语音信号进行频谱分析或声道参数分析前,需要对待识别语音信号的高频部分进行频率提升,即对待识别语音信号进行预加重。预加重一般利用高通滤波器实现,高通滤波器的传递函数可以为:The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Due to the influence of the glottal excitation and mouth-nose radiation, the energy of the speech signal is significantly reduced at the high-frequency end, usually the higher the frequency, the smaller the amplitude. When the frequency is doubled, the power spectrum amplitude drops by 6dB / oct. Therefore, before performing spectrum analysis or channel parameter analysis of the speech signal to be identified, it is necessary to perform frequency boosting on the high frequency portion of the speech signal to be identified, that is, pre-emphasis of the speech signal to be identified. Pre-emphasis is generally implemented using a high-pass filter. The transfer function of the high-pass filter can be:
H(z) = 1 - κz⁻¹，0.9 ≤ κ ≤ 1.0，
其中,κ为预加重系数,优选取值在0.94-0.97之间。Among them, κ is a pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
对所述待识别语音信号进行预处理还可以包括对所述待识别语音信号进行加窗分帧。Preprocessing the speech signal to be identified may further include windowing and framing the speech signal to be identified.
语音信号是一种非平稳的时变信号，主要分为浊音和清音两大类。浊音的基音周期、浊音信号幅度和声道参数等都随时间而缓慢变化，但通常在10ms-30ms的时间内可以认为具有短时平稳性。为了获得短时平稳信号，语音信号处理中可以把语音信号分成一些短段来进行处理，这个过程称为分帧，得到的短段的语音信号称为语音帧。分帧是通过对语音信号进行加窗处理来实现的。为了避免相邻两帧的变化幅度过大，帧与帧之间需要重叠一部分。在本申请的一个实施例中，每个语音帧为20毫秒，相邻两个语音帧之间存在10毫秒重叠，也就是每隔10毫秒取一个语音帧。A speech signal is a non-stationary, time-varying signal, mainly divided into voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal and the vocal tract parameters all change slowly with time, but the signal can usually be regarded as short-term stationary within 10 ms-30 ms. To obtain short-term stationary signals, speech signal processing may divide the speech signal into short segments for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is achieved by windowing the speech signal. To avoid excessive change between two adjacent frames, adjacent frames need to overlap partially. In one embodiment of the present application, each speech frame is 20 milliseconds, and two adjacent speech frames overlap by 10 milliseconds, that is, one speech frame is taken every 10 milliseconds.
常用的窗函数有矩形窗、汉明窗和汉宁窗,矩形窗函数为:The commonly used window functions are rectangular window, Hamming window and Hanning window. The rectangular window function is:
w(n) = 1，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
汉明窗函数为:The Hamming window function is:
w(n) = 0.54 - 0.46cos(2πn/(N-1))，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
汉宁窗函数为:The Hanning window function is:
w(n) = 0.5[1 - cos(2πn/(N-1))]，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
其中,N为一个语音帧所包含的采样点的个数。Among them, N is the number of sampling points included in a speech frame.
检测单元202,用于检测预处理后的所述待识别语音信号中的有效语音。The detecting unit 202 is configured to detect a valid voice in the pre-recognized voice signal.
可以根据预处理后的所述待识别语音信号的短时能量和短时过零率等进行端点检测,以确定所述待识别语音信号中的有效语音。Endpoint detection may be performed according to the pre-processed short-term energy and short-time zero-crossing rate of the speech signal to be identified to determine valid speech in the speech signal to be identified.
在本实施例中,可以通过下述方法检测预处理后的所述待识别语音信号中的有效语音:In this embodiment, the valid voice in the pre-recognized voice signal can be detected by the following methods:
(1)对预处理后的所述待识别语音信号进行加窗分帧,得到所述待识别语音信号的语音帧x(n)。在一个具体实施例中,可以对预处理后的所述待识别语音信号加汉明窗,每帧20ms,帧移10ms。若预处理过程中已对待识别语音信号加窗分帧,则该步骤省略。(1) Windowing and framing the pre-processed speech signal to be identified to obtain a speech frame x (n) of the speech signal to be identified. In a specific embodiment, a Hamming window may be added to the pre-processed speech signal to be identified, each frame being 20 ms, and the frame shifting being 10 ms. If window frames have been framed in the pre-processed speech signal, this step is omitted.
(2)对所述语音帧x(n)进行离散傅里叶变换(Discrete Fourier Transform,DFT),得到所述语音帧x(n)的频谱:(2) Discrete Fourier Transform (DFT) is performed on the speech frame x (n) to obtain the frequency spectrum of the speech frame x (n):
X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N)，k = 0, 1, …, N-1
(3)根据所述语音帧x(n)的频谱计算各个频带的累计能量:(3) Calculate the cumulative energy of each frequency band according to the frequency spectrum of the speech frame x (n):
E(m) = Σ_{k=m1}^{m2} |X(k)|²
其中E(m)表示第m个频带的累计能量，m1和m2分别表示第m个频带的起始和截止频点。Where E(m) represents the cumulative energy of the m-th frequency band, and m1 and m2 represent the start and end frequency points of the m-th frequency band, respectively.
(4)对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值。(4) Perform a logarithmic operation on the accumulated energy of each frequency band to obtain a logarithm value of the accumulated energy of each frequency band.
(5)将所述各个频带的累计能量对数值与预设阈值比较,得到所述有效语音。若一个频带的累计能量对数值高于预设阈值,则所述频带对应的语音为有效语音。(5) Compare the cumulative energy logarithm of each frequency band with a preset threshold to obtain the effective speech. If the cumulative energy log value of a frequency band is higher than a preset threshold, the speech corresponding to the frequency band is a valid speech.
第一提取单元203，用于对所述有效语音提取梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征参数。The first extraction unit 203 is configured to extract Mel Frequency Cepstrum Coefficient (MFCC) feature parameters from the valid speech.
提取MFCC特征参数的流程如下:The process of extracting MFCC characteristic parameters is as follows:
(1)对每一个语音帧进行离散傅里叶变换(可以是快速傅里叶变换),得到该语音帧的频谱。(1) Perform a discrete Fourier transform (which can be a fast Fourier transform) on each speech frame to obtain the frequency spectrum of the speech frame.
(2)求该语音帧的频谱幅度的平方,得到该语音帧的离散能量谱。(2) The square of the spectral amplitude of the speech frame is obtained to obtain the discrete energy spectrum of the speech frame.
(3)将该语音帧的离散能量谱通过一组Mel频率上均匀分布的三角滤波器(即三角滤波器组),得到各个三角滤波器的输出。该组三角滤波器的中心频率在Mel频率刻度上均匀排列,且每个三角滤波器的三角形两个底点的频率分别等于相邻的两个三角滤波器的中心频率。三角滤波器的中心频率为:(3) The discrete energy spectrum of the speech frame is passed through a set of triangular filters (ie, triangular filter groups) uniformly distributed on the Mel frequency to obtain the output of each triangular filter. The center frequencies of the set of triangular filters are evenly arranged on the Mel frequency scale, and the frequencies of the two bottom points of the triangles of each triangular filter are respectively equal to the center frequencies of two adjacent triangular filters. The center frequency of the triangular filter is:
f(m) = (N/F_s)·B⁻¹( B(f_l) + m·[B(f_h) - B(f_l)]/(M+1) )，m = 1, 2, …, M，其中B(f) = 1125·ln(1 + f/700) (where B(f) = 1125·ln(1 + f/700))
三角滤波器的频率响应为:The frequency response of the triangular filter is:
H_m(k) = 0，k &lt; f(m-1)或k &gt; f(m+1)；H_m(k) = (k - f(m-1))/(f(m) - f(m-1))，f(m-1) ≤ k ≤ f(m)；H_m(k) = (f(m+1) - k)/(f(m+1) - f(m))，f(m) &lt; k ≤ f(m+1)
其中，f_h、f_l分别为三角滤波器组的最高频率和最低频率；N为傅里叶变换的点数；F_s为采样频率；M为三角滤波器的个数；B⁻¹(b) = 700(e^(b/1125) - 1)是Mel频率函数B(f)的逆函数。Here, f_h and f_l are the highest and lowest frequencies covered by the triangular filter bank; N is the number of Fourier transform points; F_s is the sampling frequency; M is the number of triangular filters; and B⁻¹(b) = 700(e^(b/1125) - 1) is the inverse of the Mel frequency function B(f).
(4)对所有三角滤波器的输出做对数运算,得到该语音帧的对数功率谱S(m)。(4) Logarithmic operations are performed on the outputs of all triangular filters to obtain the logarithmic power spectrum S (m) of the speech frame.
(5)对S(m)做离散余弦变换(Discrete Cosine Transform,DCT),得到该语音帧的初始MFCC特征参数。离散余弦变换为:(5) Discrete Cosine Transform (DCT) is performed on S (m) to obtain the initial MFCC characteristic parameters of the speech frame. The discrete cosine transform is:
C(n) = Σ_{m=0}^{M-1} S(m)·cos( πn(m + 0.5)/M )，n = 1, 2, …, L
(6)提取语音帧的动态差分MFCC特征参数。初始MFCC特征参数只反映了语音参数的静态特性，语音的动态特性可通过静态特征的差分谱来描述，动静态结合可以有效提升系统的识别性能，通常使用一阶和/或者二阶差分MFCC特征参数。(6) Extract the dynamic differential MFCC feature parameters of the speech frame. The initial MFCC feature parameters only reflect the static characteristics of the speech; the dynamic characteristics of speech can be described by the differential spectrum of the static features. Combining dynamic and static features can effectively improve the recognition performance of the system, and first-order and/or second-order differential MFCC feature parameters are usually used.
在一具体实施例中,提取的MFCC特征参数为39维的特征矢量,包括13维初始MFCC特征参数、13维一阶差分MFCC特征参数和13维二阶差分MFCC特征参数。In a specific embodiment, the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
MFCC中引入了三角滤波器组,且三角滤波器在低频段分布较密,在高频段分布较疏,符合人耳听觉特性,在噪声环境下仍具有较好的识别性能。The triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has good recognition performance in a noisy environment.
在本申请的一个实施中，在对预处理后的待识别语音信号提取MFCC特征参数之后，还可以对提取的MFCC特征参数进行降维处理，得到降维后的MFCC特征参数。例如，采用分段均值数据降维算法对MFCC特征参数进行降维处理，得到降维后的MFCC特征参数。降维后的MFCC特征参数将用于后续的步骤。In one implementation of the present application, after the MFCC feature parameters are extracted from the preprocessed speech signal to be recognized, the extracted MFCC feature parameters may further undergo dimensionality reduction to obtain dimensionality-reduced MFCC feature parameters. For example, a piecewise-mean data dimensionality reduction algorithm is applied to the MFCC feature parameters to obtain the dimensionality-reduced MFCC feature parameters. The dimensionality-reduced MFCC feature parameters are used in the subsequent steps.
第二提取单元204,用于根据所述MFCC特征参数,利用预先训练好的高斯混合模型(Gaussian Mixture Model,GMM)-通用背景模型(Universal Background Model,UBM)提取所述有效语音的身份矢量(identity-vector,iVector)。A second extraction unit 204 is configured to extract an identity vector of the effective voice using a pre-trained Gaussian Mixture Model (GMM) -Universal Background Model (UBM) according to the MFCC feature parameters ( identity-vector, iVector).
提取iVector之前,首先要用大量属于不同口音的训练数据训练出通用背景模型。通用背景模型实际上是一种高斯混合模型(GMM),旨在解决实际场景数据量稀缺的问题。GMM是一种参数化的生成性模型,具备对实际数据极强的表征力(基于高斯分量实现)。高斯分量越多,GMM表征力越强,规模也越庞大,此时负面效应逐步凸显——若想获得一个泛化能力较强的GMM模型,则需要足够的数据来驱动GMM的参数训练,然而实际场景中获取的语音数据甚至连分钟级都很难企及。UBM正是解决了训练数据不足的问题。UBM是利用大量属于不同口音的训练数据(无关乎说话人、地域)混合起来充分训练,得到一个可以对语音共通特性进行表征的全局GMM, 可大大缩减从头计算GMM参数所消耗的资源。通用背景模型训练完成后,只需利用单独属于每个口音的训练数据,分别对UBM的参数进行微调(例如通过UBM自适应),得到属于各个口音的GMM。Before extracting iVector, we must first train a universal background model with a large amount of training data belonging to different accents. The general background model is actually a Gaussian mixture model (GMM), which aims to solve the problem of scarce data volume in actual scenes. GMM is a parametric generative model, which has a strong representation of actual data (based on Gaussian components). The more Gaussian components, the stronger the GMM characterization, and the larger the scale. At this time, the negative effects are gradually prominent. If you want to obtain a GMM model with strong generalization ability, you need sufficient data to drive GMM parameter training. The voice data obtained in the actual scene is difficult to reach even the minute level. UBM solves the problem of insufficient training data. UBM uses a large amount of training data belonging to different accents (irrespective of speakers and regions) to be fully trained to obtain a global GMM that can characterize common characteristics of speech, which can greatly reduce the resources consumed to calculate GMM parameters from scratch. After the training of the universal background model is completed, only the training data belonging to each accent needs to be used to fine-tune the parameters of the UBM (for example, through UBM adaptation) to obtain the GMM belonging to each accent.
在一个实施例中,不同口音可以是属于不同地域的口音。所述地域可以是按照行政区域来划分,例如辽宁、北京、天津、上海、河南、广东等。所述地域也可以是按照普遍的经验以口音对地区来划分,例如闽南、客家等。In one embodiment, different accents may be accents belonging to different regions. The regions may be divided according to administrative regions, such as Liaoning, Beijing, Tianjin, Shanghai, Henan, Guangdong, and so on. The region may also be divided into regions based on accent according to common experience, such as southern Fujian and Hakka.
提取iVector是基于全差异空间建模(TV)方法将UBM训练得出的高维GMM映射至低维度的全变量子空间,可突破随语音信号时长过长而提取的向量维度过大不便计算的限制,并能提升计算速度,表达出更全面的特征。GMM-UBM中的GMM超矢量可以包括跟说话人本身有关的矢量特征和跟信道以及其他变化有关的矢量特征的线性叠加。The extraction of iVector is based on the full difference space modeling (TV) method, which maps the high-dimensional GMM trained by UBM to the low-dimensional full-variable subspace, which can break through the inconvenient calculation of the extracted vector dimension as the length of the speech signal is too long Limits, and can increase the speed of calculation and express more comprehensive characteristics. The GMM supervector in GMM-UBM may include a linear superposition of vector features related to the speaker itself and vector features related to channels and other changes.
TV模型的子空间建模形式为:The subspace modeling form of the TV model is:
M=m+TwM = m + Tw
其中，M表示语音的GMM超矢量，即所述MFCC特征参数，m表示口音无关的GMM超矢量，T表示描述差异的空间的载荷矩阵，w表示GMM超矢量M在载荷矩阵空间下对应的低维因子表示，即iVector。Here, M represents the GMM supervector of the speech, that is, the MFCC feature parameters; m represents the accent-independent GMM supervector; T represents the loading matrix of the space describing the differences; and w represents the low-dimensional factor representation of the GMM supervector M in the loading-matrix space, that is, the iVector.
在本实施例中，可以对提取的iVector进行噪声补偿。在一实施例中，可以采用线性判别分析(Linear Discriminant Analysis,LDA)和类内协方差规整(Within Class Covariance Normalization,WCCN)对提取的iVector进行噪声补偿。In this embodiment, noise compensation may be performed on the extracted iVector. In one embodiment, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) may be used to perform noise compensation on the extracted iVector.
识别单元205,用于根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。The recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
给定口音可以是一个也可以是多个。例如,若给定口音为一个,可以根据所述iVector计算所述待识别语音信号对该给定口音的判决得分,根据所述待识别语音信号对该给定口音的判决得分判断所述待识别语音信号是否为该给定口音。可以判断所述判决得分是否大于预设得分(例如9分),若所述判决得分大于预设得分,则判断所述待识别语音信号为该给定口音。A given accent can be one or more. For example, if the given accent is one, the judgment score of the to-be-recognized voice signal for the given accent may be calculated according to the iVector, and the to-be-recognized may be judged according to the judgment score of the to-be-recognized voice signal for the given accent. Whether the speech signal is the given accent. It can be judged whether the judgment score is greater than a preset score (for example, 9 points), and if the judgment score is greater than a preset score, it is judged that the speech signal to be recognized is the given accent.
若给定口音为多个，可以根据所述iVector计算所述待识别语音信号对每个给定口音的判决得分，根据所述待识别语音信号对每个给定口音的判决得分判断所述语音为多个给定口音中的哪一个。可以确定对多个给定口音的判决得分中的最高得分，将所述最高得分对应的给定口音作为所述待识别语音信号所属的口音。If there are multiple given accents, the decision score of the speech signal to be recognized for each given accent may be calculated according to the iVector, and which of the multiple given accents the speech belongs to may be judged according to these decision scores. The highest score among the decision scores for the multiple given accents may be determined, and the given accent corresponding to the highest score is taken as the accent to which the speech signal to be recognized belongs.
在本实施例中,可以利用逻辑回归(Logistic Regression)模型计算所述待识别语音信号对给定口音的判决得分。逻辑回归模型作为一个分类器,可根据待识别语音信号的iVector对待识别语音信号进行打分。特别地,在一具体实施例中,可以使用多类逻辑回归模型计算所述待识别语音信号对给定口音的判决得分。In this embodiment, a Logistic Regression model can be used to calculate the decision score of the speech signal to be recognized for a given accent. As a classifier, the logistic regression model can score the speech signal to be identified according to the iVector of the speech signal to be identified. Specifically, in a specific embodiment, a multi-class logistic regression model may be used to calculate the decision score of the speech signal to be recognized for a given accent.
假设给定口音包括口音1、口音2、…、口音N共N种口音，则利用N类逻辑回归模型计算所述待识别语音信号对给定口音的判决得分。将待识别语音信号的iVector（记为x_t）输入所述N类逻辑回归模型，得到N个判决得分s_it（即所述待识别语音信号对N种给定口音的判决得分），s_it = w_i·x_t + k_i，i = 1, …, N。求取N个判决得分s_it（i = 1, …, N）中的最高得分s_jt，最高得分s_jt对应的口音j即为所述待识别语音信号所属的口音。其中，w_i、k_i是N类逻辑回归模型的参数，w_i为回归系数，k_i为常数，针对每个给定口音均有对应的w_i和k_i，各组(w_i, k_i)组成N类逻辑回归模型的参数向量M = {(w_1, k_1), (w_2, k_2), …, (w_N, k_N)}。Assuming the given accents include N accents (accent 1, accent 2, ..., accent N), an N-class logistic regression model is used to calculate the decision scores of the speech signal to be recognized for the given accents. The iVector of the speech signal to be recognized (denoted x_t) is input into the N-class logistic regression model to obtain N decision scores s_it (that is, the decision scores of the speech signal to be recognized for the N given accents), where s_it = w_i·x_t + k_i, i = 1, ..., N. The highest score s_jt among the N decision scores s_it (i = 1, ..., N) is determined, and the accent j corresponding to the highest score s_jt is the accent to which the speech signal to be recognized belongs. Here, w_i and k_i are parameters of the N-class logistic regression model, w_i being a regression coefficient and k_i a constant; each given accent has its corresponding w_i and k_i, and these pairs form the parameter vector of the N-class logistic regression model, M = {(w_1, k_1), (w_2, k_2), ..., (w_N, k_N)}.
实施例二的口音识别装置10对待识别语音信号进行预处理；检测预处理后的所述待识别语音信号中的有效语音；对所述有效语音提取MFCC特征参数；根据所述MFCC特征参数，利用预先训练好的GMM-UBM提取所述有效语音的iVector；根据所述iVector计算所述待识别语音信号对给定口音的判决得分，根据所述判决得分得到所述待识别语音信号的口音识别结果。实施例二可以实现快速准确的口音识别。The accent recognition device 10 of Embodiment 2 preprocesses the speech signal to be recognized; detects valid speech in the preprocessed speech signal to be recognized; extracts MFCC feature parameters from the valid speech; extracts the iVector of the valid speech using the pre-trained GMM-UBM according to the MFCC feature parameters; calculates the decision score of the speech signal to be recognized for a given accent according to the iVector, and obtains the accent recognition result of the speech signal to be recognized according to the decision score. Embodiment 2 can realize fast and accurate accent recognition.
在其他的实施例中,第一提取单元203在提取MFCC特征参数时,可以进行声道长度归一化(Vocal Tract Length Normalization,VTLN),得到声道长度归一化的MFCC特征参数。In other embodiments, when extracting the MFCC characteristic parameters, the first extraction unit 203 may perform channel length normalization (Vocal Tract Length Normalization, VTLN) to obtain the MFCC characteristic parameters with normalized channel length.
声道可以表示为级联声管模型,每个声管都可以看成是一个谐振腔,它们的共振频率取决于声管的长度和形状。因此,说话人之间的部分声学差异是由于说话人的声道长度不同。例如,声道长度的变化范围一般从13cm(成年女性)变化到18cm(成年男性),因此,不同性别的人说同一个元音的共振峰频率相差很大。VTLN就是为了消除男、女声道长度的差异,使口音识别的结果不受性别的干扰。The channels can be represented as a cascaded sound tube model. Each sound tube can be regarded as a resonant cavity, and their resonance frequency depends on the length and shape of the sound tube. Therefore, some of the acoustic differences between speakers are due to the difference in speaker channel length. For example, the range of the channel length generally varies from 13 cm (adult female) to 18 cm (adult male). Therefore, people of different genders say that the formant frequencies of the same vowel differ greatly. VTLN is to eliminate the difference in the length of the male and female channels, so that the result of accent recognition is not disturbed by gender.
VTLN可以通过弯折和平移频率坐标来使各说话人的共振峰频率相匹配。在本实施例中,可以采用基于双线性变换的VTLN方法。该基于双线性变换的VTLN方法并不直接对待识别语音信号的频谱进行折叠,而是采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;根据所述频率弯折因子,采用双线性变换对三角滤波器组的位置(例如三角滤波器的起点、中间点和结束点)和宽度进行调整;根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。例如,若要对待识别语音信号进行频谱压缩,则对三角滤波器的刻度进行拉伸,此时三角滤波器组向左扩展和移动。若要对待识别语音信号进行频谱拉伸,则对三角滤波器的刻度进行压缩,此时三角滤波器组向右压缩和移动。采用该基于双线性变换的VTLN方法对特定人群或特定人进行声道归一化时,仅需要对三角滤波器组系数进行一次变换即可,无需每次在提取特征参数时都对信号频谱折叠,从而大大减小了计算量。并且,该基于双线性变换的VTLN方法避免了对频率因子线性搜索,减小了运算复杂度。同时,该基于双线性变换的VTLN方法利用双线性变换,使弯折的频率连续且无带宽改变。VTLN can match the frequency of formants of each speaker by bending and translating the frequency coordinates. In this embodiment, a VTLN method based on bilinear transformation may be used. The bilinear transformation-based VTLN method does not directly fold the frequency spectrum of the recognized speech signal, but uses a mapping formula of the bilinear transformation low-pass filter cutoff frequency to calculate the average third formant frequency aligned with different speakers A bending factor; according to the frequency bending factor, the position (for example, the starting point, the middle point, and the ending point of the triangular filter) and the width of the triangular filter bank are adjusted by using a bilinear transformation; according to the adjusted triangular filter The group calculates the channel normalized MFCC characteristic parameters. For example, if spectrum compression is to be performed on the speech signal to be recognized, the scale of the triangular filter is stretched, and the triangular filter bank is expanded and moved to the left at this time. To spectrally stretch the recognition speech signal, the scale of the triangular filter is compressed, and the triangular filter bank is compressed and moved to the right. When the VTLN method based on the bilinear transformation is used to normalize the channel of a specific group of people or a specific person, only the triangular filter bank coefficients need to be transformed once, and the signal spectrum does not need to be extracted each time the feature parameters are extracted. Folding, which greatly reduces the amount of calculation. In addition, the VTLN method based on bilinear transformation avoids a linear search for frequency factors and reduces the computational complexity. At the same time, the VTLN method based on bilinear transformation uses bilinear transformation to make the bending frequency continuous without bandwidth change.
在另一实施例中,所述识别单元205还可以用于根据所述口音识别结果进行声纹识别。由于说话人所生活的地域不同,即使在都讲普通话的情况下或多或少依然会有口音的差别,将口音识别应用到声纹识别中,可以缩小后 续声纹识别的对象范围,得到更为准确的识别结果。In another embodiment, the recognition unit 205 may be further configured to perform voiceprint recognition according to the accent recognition result. Because the speakers live in different regions, even if they all speak Mandarin, there will be more or less accent differences. Applying accent recognition to voiceprint recognition can reduce the scope of subsequent voiceprint recognition objects and get more For accurate recognition results.
实施例三Example three
本实施例提供一种非易失性可读存储介质，该非易失性可读存储介质上存储有计算机可读指令，该计算机可读指令被处理器执行时实现上述口音识别方法实施例中的步骤，例如图1所示的步骤101-105：This embodiment provides a non-volatile readable storage medium on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, the steps in the above accent recognition method embodiment are implemented, for example, steps 101-105 shown in FIG. 1:
步骤101,对待识别语音信号进行预处理;Step 101: Pre-process a speech signal to be recognized;
步骤102,检测预处理后的所述待识别语音信号中的有效语音;Step 102: Detect a valid voice in the pre-recognized voice signal;
步骤103，对所述有效语音提取梅尔频率倒谱系数MFCC特征参数；Step 103: extracting Mel frequency cepstrum coefficient MFCC feature parameters from the valid speech;
步骤104,根据所述MFCC特征参数,利用预先训练好的高斯混合模型-通用背景模型GMM-UBM提取所述有效语音的身份矢量iVector;Step 104: Use a pre-trained Gaussian mixture model-general background model GMM-UBM to extract the identity vector iVector of the effective voice according to the MFCC feature parameters;
步骤105,根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。Step 105: Calculate a judgment score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be identified according to the judgment score.
所述检测预处理后的所述待识别语音信号中的有效语音可以包括:The detecting valid speech in the to-be-recognized voice signal after preprocessing may include:
对所述待识别语音信号进行加窗分帧,得到所述待识别语音信号的语音帧;Windowing and framing the speech signal to be identified to obtain a speech frame of the speech signal to be identified;
对所述语音帧进行离散傅里叶变换,得到所述语音帧的频谱;Performing discrete Fourier transform on the speech frame to obtain a frequency spectrum of the speech frame;
根据所述语音帧的频谱计算各个频带的累计能量;Calculate the cumulative energy of each frequency band according to the frequency spectrum of the speech frame;
对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值;Performing a logarithmic operation on the accumulated energy of each frequency band to obtain a logarithmic value of the accumulated energy of each frequency band;
将所述各个频带的累计能量对数值与预设阈值进行比较,得到所述有效语音。The cumulative energy log value of each frequency band is compared with a preset threshold to obtain the effective speech.
所述对所述有效语音提取梅尔频率倒谱系数MFCC特征参数可以包括:The feature parameters for extracting Mel frequency cepstrum coefficient MFCC for the effective speech may include:
采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;Using the bilinear transformation low-pass filter cutoff frequency mapping formula to calculate the frequency bending factor of the average third formant of different speakers;
根据所述频率弯折因子,采用双线性变换对MFCC特征参数提取所使用的三角滤波器组的位置和宽度进行调整;Adjusting the position and width of the triangular filter bank used for the extraction of MFCC feature parameters according to the frequency bending factor;
根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。The channel normalized MFCC characteristic parameters are calculated according to the adjusted triangular filter bank.
或者,该计算机可读指令被处理器执行时实现上述装置实施例中各模块/单元的功能,例如图2中的单元201-205:Alternatively, when the computer-readable instructions are executed by a processor, the functions of the modules / units in the foregoing device embodiments are implemented, for example, units 201-205 in FIG. 2:
预处理单元201,用于对待识别语音信号进行预处理;A pre-processing unit 201, configured to pre-process a speech signal to be recognized;
检测单元202,用于检测预处理后的所述待识别语音信号中的有效语音;A detection unit 202, configured to detect valid speech in the pre-recognized speech signal;
第一提取单元203，用于对所述有效语音提取梅尔频率倒谱系数MFCC特征参数；A first extraction unit 203, configured to extract Mel frequency cepstrum coefficient MFCC feature parameters from the valid speech;
第二提取单元204,用于根据所述MFCC特征参数,利用预先训练好的高斯混合模型-通用背景模型GMM-UBM提取所述有效语音的身份矢量iVector;A second extraction unit 204, configured to extract an identity vector iVector of the effective voice by using a pre-trained Gaussian mixture model-general background model GMM-UBM according to the MFCC feature parameters;
识别单元205,用于根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。The recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
所述检测单元202具体可以用于:The detection unit 202 may be specifically configured to:
对所述待识别语音信号进行加窗分帧,得到所述待识别语音信号的语音帧;Windowing and framing the speech signal to be identified to obtain a speech frame of the speech signal to be identified;
对所述语音帧进行离散傅里叶变换,得到所述语音帧的频谱;Performing discrete Fourier transform on the speech frame to obtain a frequency spectrum of the speech frame;
根据所述语音帧的频谱计算各个频带的累计能量;Calculate the cumulative energy of each frequency band according to the frequency spectrum of the speech frame;
对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值;Performing a logarithmic operation on the accumulated energy of each frequency band to obtain a logarithmic value of the accumulated energy of each frequency band;
将所述各个频带的累计能量对数值与预设阈值进行比较,得到所述有效语音。The cumulative energy log value of each frequency band is compared with a preset threshold to obtain the effective speech.
所述第一提取单元203具体可以用于:The first extraction unit 203 may be specifically configured to:
采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;Using the bilinear transformation low-pass filter cutoff frequency mapping formula to calculate the frequency bending factor of the average third formant of different speakers;
根据所述频率弯折因子,采用双线性变换对MFCC特征参数提取所使用的三角滤波器组的位置和宽度进行调整;Adjusting the position and width of the triangular filter bank used for the extraction of MFCC feature parameters according to the frequency bending factor;
根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。The channel normalized MFCC characteristic parameters are calculated according to the adjusted triangular filter bank.
实施例四Embodiment 4
图3为本申请实施例四提供的计算机装置的示意图。所述计算机装置1包括存储器20、处理器30以及存储在所述存储器20中并可在所述处理器30上运行的计算机可读指令40,例如口音识别程序。所述处理器30执行所述计算机可读指令40时实现上述口音识别方法实施例中的步骤,例如图1所示的步骤101-105:FIG. 3 is a schematic diagram of a computer device according to a fourth embodiment of the present application. The computer device 1 includes a memory 20, a processor 30, and computer-readable instructions 40 stored in the memory 20 and executable on the processor 30, such as an accent recognition program. When the processor 30 executes the computer-readable instructions 40, the steps in the embodiment of the accent recognition method described above are implemented, for example, steps 101-105 shown in FIG. 1:
步骤101,对待识别语音信号进行预处理;Step 101: Pre-process a speech signal to be recognized;
步骤102,检测预处理后的所述待识别语音信号中的有效语音;Step 102: Detect a valid voice in the pre-recognized voice signal;
步骤103，对所述有效语音提取梅尔频率倒谱系数MFCC特征参数；Step 103: extracting Mel frequency cepstrum coefficient MFCC feature parameters from the valid speech;
步骤104,根据所述MFCC特征参数,利用预先训练好的高斯混合模型-通用背景模型GMM-UBM提取所述有效语音的身份矢量iVector;Step 104: Use a pre-trained Gaussian mixture model-general background model GMM-UBM to extract the identity vector iVector of the effective voice according to the MFCC feature parameters;
步骤105,根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。Step 105: Calculate a judgment score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be identified according to the judgment score.
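For illustration, a minimal sketch of steps 104-105 follows: Baum-Welch statistics are collected against a pre-trained GMM-UBM, an iVector is computed with an assumed total-variability matrix T, and a logistic regression classifier produces the decision score for a given accent, consistent with the embodiment in which the iVector is input into a logistic regression model. The helper names, the use of scikit-learn, and the diagonal-covariance UBM are assumptions of this sketch, not requirements of the specification.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def extract_ivector(mfcc, ubm: GaussianMixture, T: np.ndarray) -> np.ndarray:
    """mfcc: (n_frames, dim); ubm: pre-trained diagonal-covariance GaussianMixture;
    T: assumed total-variability matrix of shape (n_components * dim, ivec_dim)."""
    gamma = ubm.predict_proba(mfcc)                   # frame posteriors, (n_frames, C)
    N = gamma.sum(axis=0)                             # zero-order statistics, (C,)
    F = gamma.T @ mfcc - N[:, None] * ubm.means_      # centred first-order statistics
    C, dim = ubm.means_.shape
    ivec_dim = T.shape[1]

    inv_cov = 1.0 / ubm.covariances_                  # (C, dim); 'diag' covariances assumed
    L = np.eye(ivec_dim)
    b = np.zeros(ivec_dim)
    for c in range(C):
        Tc = T[c * dim:(c + 1) * dim]                 # (dim, ivec_dim) block for component c
        L += N[c] * (Tc.T * inv_cov[c]) @ Tc
        b += (Tc.T * inv_cov[c]) @ F[c]
    return np.linalg.solve(L, b)                      # posterior mean, i.e. the iVector

def accent_score(ivector, clf: LogisticRegression, accent_label):
    # Decision score of the utterance for a given accent, from a pre-trained
    # logistic regression classifier (clf must have been fitted beforehand)
    idx = list(clf.classes_).index(accent_label)
    return clf.predict_proba(ivector.reshape(1, -1))[0, idx]

Here clf is assumed to have been fitted beforehand on iVectors extracted from accent-labelled training speech; the highest-scoring accent, or a thresholded score, then gives the accent recognition result of step 105.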
The detecting valid speech in the preprocessed speech signal to be recognized may include:
performing windowing and framing on the speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
calculating the cumulative energy of each frequency band according to the spectrum of the speech frames;
taking the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
comparing the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
The extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech may include:
calculating, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
calculating vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
Alternatively, when the processor 30 executes the computer-readable instructions 40, the functions of the modules/units in the foregoing device embodiment are implemented, for example, the units 201-205 in FIG. 2:
A preprocessing unit 201, configured to preprocess a speech signal to be recognized;
A detection unit 202, configured to detect valid speech in the preprocessed speech signal to be recognized;
A first extraction unit 203, configured to extract Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
A second extraction unit 204, configured to extract an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
A recognition unit 205, configured to calculate a decision score of the speech signal to be recognized for a given accent according to the iVector, and to obtain an accent recognition result of the speech signal to be recognized according to the decision score.
The detection unit 202 may specifically be configured to:
perform windowing and framing on the speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
perform a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
calculate the cumulative energy of each frequency band according to the spectrum of the speech frames;
take the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
compare the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
The first extraction unit 203 may specifically be configured to:
calculate, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
adjust, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
calculate vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
Exemplarily, the computer-readable instructions 40 may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 30 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments describe the execution process of the computer-readable instructions 40 in the computer device 1. For example, the computer-readable instructions 40 may be divided into the preprocessing unit 201, the detection unit 202, the first extraction unit 203, the second extraction unit 204, and the recognition unit 205 in FIG. 2; for the specific functions of each unit, refer to Embodiment 2.
The computer device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art can understand that FIG. 3 is merely an example of the computer device 1 and does not constitute a limitation on the computer device 1; the computer device 1 may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device 1 may further include input/output devices, a network access device, a bus, and the like.
The processor 30 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor. The processor 30 is the control center of the computer device 1 and connects the various parts of the entire computer device 1 through various interfaces and lines.
The memory 20 may be configured to store the computer-readable instructions 40 and/or the modules/units. The processor 30 implements the various functions of the computer device 1 by running or executing the computer-readable instructions and/or modules/units stored in the memory 20 and by calling data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the computer device 1 (such as audio data or a phone book). In addition, the memory 20 may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another solid-state storage device.
If the modules/units integrated in the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile readable storage medium. Based on such an understanding, all or part of the processes in the methods of the foregoing embodiments of the present application may also be completed by instructing related hardware through computer-readable instructions. The computer-readable instructions may be stored in a non-volatile readable storage medium, and when executed by a processor, they can implement the steps of the foregoing method embodiments. The computer-readable instructions include computer-readable instruction code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The non-volatile readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the non-volatile readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided in the present application, it should be understood that the disclosed computer device and method may be implemented in other ways. For example, the computer device embodiment described above is merely illustrative; for example, the division of the units is only a division of logical functions, and other division manners may be used in actual implementation.
In addition, the functional units in the embodiments of the present application may be integrated in the same processing unit, or each unit may exist alone physically, or two or more units may be integrated in the same unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the present application is not limited to the details of the foregoing exemplary embodiments, and that the present application can be implemented in other specific forms without departing from the spirit or essential characteristics of the application. Therefore, the embodiments should be regarded as exemplary and non-limiting in every respect, and the scope of the present application is defined by the appended claims rather than by the foregoing description; all changes falling within the meaning and scope of equivalents of the claims are therefore intended to be embraced by the present application. Any reference sign in the claims should not be construed as limiting the claim concerned. In addition, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices stated in a computer device claim may also be implemented by the same unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the foregoing embodiments are only used to illustrate, and not to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements can be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.

Claims (20)

  1. An accent recognition method, characterized in that the method comprises:
    preprocessing a speech signal to be recognized;
    detecting valid speech in the preprocessed speech signal to be recognized;
    extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
    extracting an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
    calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
  2. The method according to claim 1, wherein the detecting valid speech in the preprocessed speech signal to be recognized comprises:
    performing windowing and framing on the preprocessed speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
    performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
    calculating the cumulative energy of each frequency band according to the spectrum of the speech frames;
    taking the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
    comparing the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
  3. The method according to claim 1, wherein the MFCC feature parameters comprise initial MFCC feature parameters, first-order differential MFCC feature parameters, and second-order differential MFCC feature parameters.
  4. The method according to claim 1, wherein the method further comprises:
    performing noise compensation on the iVector.
  5. The method according to claim 1, wherein the calculating a decision score of the speech signal to be recognized for a given accent according to the iVector comprises:
    inputting the iVector into a logistic regression model to obtain the decision score of the speech signal to be recognized for the given accent.
  6. The method according to claim 1, wherein the extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech comprises:
    calculating, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
    adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
    calculating vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
  7. The method according to claim 1, wherein the preprocessing a speech signal to be recognized comprises:
    performing pre-emphasis on the speech signal to be recognized; and
    performing windowing and framing on the speech signal to be recognized.
  8. An accent recognition device, characterized in that the device comprises:
    a preprocessing unit, configured to preprocess a speech signal to be recognized;
    a detection unit, configured to detect valid speech in the preprocessed speech signal to be recognized;
    a first extraction unit, configured to extract Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
    a second extraction unit, configured to extract an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
    a recognition unit, configured to calculate a decision score of the speech signal to be recognized for a given accent according to the iVector, and to obtain an accent recognition result of the speech signal to be recognized according to the decision score.
  9. A computer device, characterized in that the computer device comprises a processor, and the processor is configured to execute computer-readable instructions stored in a memory to implement the following steps:
    preprocessing a speech signal to be recognized;
    detecting valid speech in the preprocessed speech signal to be recognized;
    extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
    extracting an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
    calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
  10. The computer device according to claim 9, wherein the detecting valid speech in the preprocessed speech signal to be recognized comprises:
    performing windowing and framing on the preprocessed speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
    performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
    calculating the cumulative energy of each frequency band according to the spectrum of the speech frames;
    taking the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
    comparing the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
  11. The computer device according to claim 9, wherein the processor is further configured to execute the computer-readable instructions to implement the following step:
    performing noise compensation on the iVector.
  12. The computer device according to claim 9, wherein the calculating a decision score of the speech signal to be recognized for a given accent according to the iVector comprises:
    inputting the iVector into a logistic regression model to obtain the decision score of the speech signal to be recognized for the given accent.
  13. The computer device according to claim 9, wherein the extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech comprises:
    calculating, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
    adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
    calculating vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
  14. The computer device according to claim 9, wherein the preprocessing a speech signal to be recognized comprises:
    performing pre-emphasis on the speech signal to be recognized; and
    performing windowing and framing on the speech signal to be recognized.
  15. A non-volatile readable storage medium storing computer-readable instructions, characterized in that, when the computer-readable instructions are executed by a processor, the following steps are implemented:
    preprocessing a speech signal to be recognized;
    detecting valid speech in the preprocessed speech signal to be recognized;
    extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
    extracting an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
    calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
  16. The storage medium according to claim 15, wherein the detecting valid speech in the preprocessed speech signal to be recognized comprises:
    performing windowing and framing on the preprocessed speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
    performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
    calculating the cumulative energy of each frequency band according to the spectrum of the speech frames;
    taking the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
    comparing the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
  17. The storage medium according to claim 15, wherein the computer-readable instructions, when executed by the processor, are further used to implement the following step:
    performing noise compensation on the iVector.
  18. The storage medium according to claim 15, wherein the calculating a decision score of the speech signal to be recognized for a given accent according to the iVector comprises:
    inputting the iVector into a logistic regression model to obtain the decision score of the speech signal to be recognized for the given accent.
  19. The storage medium according to claim 15, wherein the extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech comprises:
    calculating, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
    adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
    calculating vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
  20. The storage medium according to claim 15, wherein the preprocessing a speech signal to be recognized comprises:
    performing pre-emphasis on the speech signal to be recognized; and
    performing windowing and framing on the speech signal to be recognized.
PCT/CN2019/077512 2018-08-14 2019-03-08 Accent identification method and device, computer device, and storage medium WO2020034628A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810922056.0 2018-08-14
CN201810922056.0A CN109036437A (en) 2018-08-14 2018-08-14 Accents recognition method, apparatus, computer installation and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2020034628A1 true WO2020034628A1 (en) 2020-02-20

Family

ID=64634084

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/077512 WO2020034628A1 (en) 2018-08-14 2019-03-08 Accent identification method and device, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN109036437A (en)
WO (1) WO2020034628A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036437A (en) * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 Accents recognition method, apparatus, computer installation and computer readable storage medium
CN109686362B (en) * 2019-01-02 2021-04-02 百度在线网络技术(北京)有限公司 Voice broadcasting method and device and computer readable storage medium
CN110111769B (en) * 2019-04-28 2021-10-15 深圳信息职业技术学院 Electronic cochlea control method and device, readable storage medium and electronic cochlea
CN112116909A (en) * 2019-06-20 2020-12-22 杭州海康威视数字技术股份有限公司 Voice recognition method, device and system
CN111128229A (en) * 2019-08-05 2020-05-08 上海海事大学 Voice classification method and device and computer storage medium
US11227601B2 (en) * 2019-09-21 2022-01-18 Merry Electronics(Shenzhen) Co., Ltd. Computer-implement voice command authentication method and electronic device
CN112712792A (en) * 2019-10-25 2021-04-27 Tcl集团股份有限公司 Dialect recognition model training method, readable storage medium and terminal device
CN111508498B (en) * 2020-04-09 2024-01-30 携程计算机技术(上海)有限公司 Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
CN113689863B (en) * 2021-09-24 2024-01-16 广东电网有限责任公司 Voiceprint feature extraction method, voiceprint feature extraction device, voiceprint feature extraction equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
CN105679321A (en) * 2016-01-29 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and device and terminal
CN105976819A (en) * 2016-03-23 2016-09-28 广州势必可赢网络科技有限公司 Rnorm score normalization based speaker verification method
CN108369813A (en) * 2017-07-31 2018-08-03 深圳和而泰智能家居科技有限公司 Specific sound recognition methods, equipment and storage medium
CN109036437A (en) * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 Accents recognition method, apparatus, computer installation and computer readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2793137B2 (en) * 1994-12-14 1998-09-03 株式会社エイ・ティ・アール音声翻訳通信研究所 Accent phrase boundary detector for continuous speech recognition
CN101894548B (en) * 2010-06-23 2012-07-04 清华大学 Modeling method and modeling device for language identification
US9966064B2 (en) * 2012-07-18 2018-05-08 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
CN104538035B (en) * 2014-12-19 2018-05-01 深圳先进技术研究院 A kind of method for distinguishing speek person and system based on Fisher super vectors
CN107274905B (en) * 2016-04-08 2019-09-27 腾讯科技(深圳)有限公司 A kind of method for recognizing sound-groove and system
CN107610707B (en) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN108122554B (en) * 2017-12-25 2021-12-21 广东小天才科技有限公司 Control method of microphone device in charging state and microphone device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
CN105679321A (en) * 2016-01-29 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and device and terminal
CN105976819A (en) * 2016-03-23 2016-09-28 广州势必可赢网络科技有限公司 Rnorm score normalization based speaker verification method
CN108369813A (en) * 2017-07-31 2018-08-03 深圳和而泰智能家居科技有限公司 Specific sound recognition methods, equipment and storage medium
CN109036437A (en) * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 Accents recognition method, apparatus, computer installation and computer readable storage medium

Also Published As

Publication number Publication date
CN109036437A (en) 2018-12-18


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19850190

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19850190

Country of ref document: EP

Kind code of ref document: A1