WO2020034628A1 - Accent identification method and device, computer device, and storage medium - Google Patents


Info

Publication number
WO2020034628A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
speech signal
accent
frequency
identified
Prior art date
Application number
PCT/CN2019/077512
Other languages
French (fr)
Chinese (zh)
Inventor
张丝潆
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020034628A1 publication Critical patent/WO2020034628A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • The present application relates to the field of computer audition technology, and in particular to an accent recognition method and device, a computer device, and a storage medium.
  • With the emergence and deployment of intelligent identity authentication such as face recognition and voiceprint recognition, recognition accuracy still has room for improvement, and the accent factor is one of the breakthrough points for obtaining more accurate voiceprint recognition results. Because speakers live in different regions, accent differences remain, to a greater or lesser degree, even when everyone speaks Mandarin. If accent recognition can be added to existing voiceprint recognition as a supplement, the application scenarios can be extended further; the most direct application is to identify the region the speaker comes from before voiceprint recognition and thereby narrow the set of candidates for subsequent recognition. However, the existing accent recognition effect is not ideal: recognition is slow and accuracy is not high.
  • A first aspect of the present application provides an accent recognition method, which includes:
  • preprocessing a speech signal to be recognized;
  • detecting valid speech in the preprocessed speech signal to be recognized;
  • extracting Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • extracting an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and
  • calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
  • A second aspect of the present application provides an accent recognition device, which includes:
  • a preprocessing unit, configured to preprocess a speech signal to be recognized;
  • a detection unit, configured to detect valid speech in the preprocessed speech signal to be recognized;
  • a first extraction unit, configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • a second extraction unit, configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and
  • a recognition unit, configured to calculate a decision score of the speech signal to be recognized for a given accent according to the iVector and obtain an accent recognition result of the speech signal to be recognized according to the decision score.
  • a third aspect of the present application provides a computer device including a processor, where the processor is configured to implement the accent recognition method when executing computer-readable instructions stored in a memory.
  • a fourth aspect of the present application provides a non-volatile readable storage medium on which computer-readable instructions are stored, and the computer-readable instructions implement the accent recognition method when executed by a processor.
  • This application preprocesses the speech signal to be recognized; detects valid speech in the preprocessed speech signal; extracts Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech; extracts the identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and calculates a decision score of the speech signal to be recognized for a given accent according to the iVector, obtaining the accent recognition result of the speech signal to be recognized according to the decision score.
  • This application can find problems at the database level, without the need for testers to find problems through complex and extensive functional tests. This application can realize fast and accurate accent recognition.
  • FIG. 1 is a flowchart of an accent recognition method according to an embodiment of the present application.
  • FIG. 2 is a structural diagram of an accent recognition device according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the accent recognition method of the present application is applied in one or more computer devices.
  • the computer device is a device capable of automatically performing numerical calculations and / or information processing in accordance with instructions set or stored in advance.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, or a voice control device.
  • FIG. 1 is a flowchart of an accent recognition method provided in Embodiment 1 of the present application.
  • the accent recognition method is applied to a computer device.
  • the accent recognition method detects a failed object in the test library so that developers can fix the code and complete the rectification of the failed object.
  • the accent recognition method specifically includes the following steps:
  • Step 101 Pre-process a speech signal to be recognized.
  • the voice signal to be identified may be an analog voice signal or a digital voice signal. If the speech signal to be identified is an analog speech signal, the analog speech signal is subjected to analog-to-digital conversion to be converted into a digital speech signal.
  • the voice signal to be recognized may be a voice signal collected through a voice input device (for example, a microphone, a mobile phone microphone, etc.).
  • Preprocessing the speech signal to be identified may include pre-emphasizing the speech signal to be identified.
  • The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Because of the influence of glottal excitation and mouth-nose radiation, the energy of a speech signal drops noticeably at the high-frequency end; usually, the higher the frequency, the smaller the amplitude, with the power-spectrum amplitude falling by about 6 dB per octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis of the speech signal to be recognized, the high-frequency part of the signal needs to be boosted, that is, the signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter.
  • The transfer function of the high-pass filter can be: H(z) = 1 - κz⁻¹, 0.9 ≤ κ ≤ 1.0,
  • where κ is the pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
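  • For illustration only, the following minimal Python sketch applies such a pre-emphasis filter to a digital speech signal; the function name and the default coefficient of 0.97 are assumptions, not part of the patent.

```python
import numpy as np

def pre_emphasize(signal: np.ndarray, k: float = 0.97) -> np.ndarray:
    """Apply the high-pass pre-emphasis filter y[n] = x[n] - k * x[n-1]."""
    return np.append(signal[0], signal[1:] - k * signal[:-1])
```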
  • Preprocessing the speech signal to be identified may further include windowing and framing the speech signal to be identified.
  • Speech signals are a kind of non-stationary time-varying signals, which are mainly divided into two categories, voiced and unvoiced.
  • The pitch period of voiced speech, the amplitude of the voiced speech signal, and the vocal-tract parameters all change slowly with time, but the signal can usually be regarded as short-time stationary within 10 ms-30 ms.
  • Therefore, in speech signal processing the speech signal can be divided into short segments for processing. This process is called framing, and each resulting short segment of speech is called a speech frame. Framing is achieved by windowing the speech signal.
  • each voice frame is 20 milliseconds, and there is a 10 millisecond overlap between two adjacent voice frames, that is, one voice frame is taken every 10 milliseconds.
  • the commonly used window functions are rectangular window, Hamming window and Hanning window.
  • The rectangular window function is: w(n) = 1, 0 ≤ n ≤ N-1.
  • The Hamming window function is: w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
  • The Hanning window function is: w(n) = 0.5·(1 - cos(2πn/(N-1))), 0 ≤ n ≤ N-1.
  • Here N is the number of sampling points contained in one speech frame.
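  • A hedged sketch of the framing and windowing step is shown below; the 20 ms frame length, 10 ms shift, and the choice of a Hamming window follow the description above, while the function name is an illustrative assumption.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, fs: int,
                     frame_ms: float = 20.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a signal into overlapping frames and apply a Hamming window.

    Assumes len(signal) is at least one frame long.
    """
    frame_len = int(fs * frame_ms / 1000)   # N sampling points per frame
    shift = int(fs * shift_ms / 1000)       # frame shift (overlap = frame_len - shift)
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift: i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # one windowed frame per row
```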
  • Step 102: Detect valid speech in the preprocessed speech signal to be recognized.
  • Endpoint detection may be performed according to the short-time energy and short-time zero-crossing rate of the preprocessed speech signal to be recognized, so as to determine the valid speech in it.
  • Specifically, the valid speech in the preprocessed speech signal to be recognized can be detected as follows:
  • A Hamming window may be applied to the preprocessed speech signal to be recognized, with each frame being 20 ms and the frame shift being 10 ms; if the preprocessed signal has already been windowed and framed, this step is omitted.
  • A discrete Fourier transform (DFT) is performed on each windowed frame to obtain its discrete spectrum; E(m) denotes the cumulative energy of the m-th frequency band, and (m1, m2) denote the starting and ending frequency points of the m-th band.
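  • The band-energy endpoint detection described above could be sketched as follows; the number of frequency bands, the band boundaries, and the threshold value are assumptions, since the patent does not fix them.

```python
import numpy as np

def detect_valid_frames(frames: np.ndarray, n_bands: int = 4,
                        log_energy_threshold: float = -5.0) -> np.ndarray:
    """Mark frames whose per-band log cumulative energy exceeds a preset threshold."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # DFT power spectrum
    edges = np.linspace(0, spectrum.shape[1], n_bands + 1, dtype=int)
    band_energy = np.stack([spectrum[:, edges[m]:edges[m + 1]].sum(axis=1)
                            for m in range(n_bands)], axis=1)   # E(m) per frame
    log_energy = np.log(band_energy + 1e-10)
    return (log_energy > log_energy_threshold).any(axis=1)      # valid-speech mask
```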
  • Step 103: Extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech.
  • the discrete energy spectrum of the speech frame is passed through a set of triangular filters (ie, triangular filter groups) uniformly distributed on the Mel frequency to obtain the output of each triangular filter.
  • the center frequencies of the set of triangular filters are evenly arranged on the Mel frequency scale, and the frequencies of the two bottom points of the triangles of each triangular filter are respectively equal to the center frequencies of two adjacent triangular filters.
  • the center frequency of the triangular filter is:
  • the frequency response of the triangular filter is:
  • Discrete Cosine Transform is performed on S (m) to obtain the initial MFCC characteristic parameters of the speech frame.
  • the discrete cosine transform is:
  • the initial MFCC characteristic parameters only reflect the static characteristics of the speech parameters.
  • The dynamic characteristics of speech can be described by the differential spectrum of the static features, and combining dynamic and static features can effectively improve the recognition performance of the system. Usually, first-order and/or second-order differential MFCC feature parameters are used.
  • the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
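  • A compact way to obtain such 39-dimensional features, sketched here with the librosa library rather than the patent's own implementation, is:

```python
import numpy as np
import librosa

def mfcc_39(y: np.ndarray, sr: int) -> np.ndarray:
    """13 static MFCCs plus first- and second-order deltas -> (frames, 39)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.02 * sr), hop_length=int(0.01 * sr))
    d1 = librosa.feature.delta(mfcc)            # first-order differential MFCC
    d2 = librosa.feature.delta(mfcc, order=2)   # second-order differential MFCC
    return np.vstack([mfcc, d1, d2]).T
```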
  • the triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has good recognition performance in a noisy environment.
  • the extracted MFCC feature parameters may also be subjected to dimensionality reduction processing to obtain the dimensionality-reduced MFCC feature parameters.
  • A piecewise-mean data dimensionality reduction algorithm may be applied to the MFCC feature parameters to obtain the dimensionality-reduced MFCC feature parameters.
  • the reduced MFCC feature parameters will be used in subsequent steps.
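  • The patent does not spell out the piecewise-mean algorithm; one plausible reading, averaging the frame axis over equal segments, is sketched below as an assumption.

```python
import numpy as np

def piecewise_mean_reduce(features: np.ndarray, n_segments: int = 50) -> np.ndarray:
    """Reduce (frames, dims) to (n_segments, dims) by averaging within equal segments."""
    n_segments = min(n_segments, len(features))          # avoid empty segments
    segments = np.array_split(features, n_segments, axis=0)
    return np.stack([seg.mean(axis=0) for seg in segments])
```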
  • Step 104: According to the MFCC feature parameters, use a pre-trained Gaussian mixture model (GMM)-universal background model (UBM) to extract the identity vector (iVector) of the valid speech.
  • The universal background model is itself a Gaussian mixture model (GMM), and it is intended to solve the problem of scarce data in real scenes.
  • A GMM is a parametric generative model with a strong ability to represent real data through its Gaussian components: the more components, the stronger the representation, but also the larger the model, and the negative effects gradually become prominent. Obtaining a GMM with strong generalization ability requires sufficient data to drive parameter training, whereas the speech available in real scenes often does not even reach the minute level. The UBM solves this problem of insufficient training data.
  • The UBM is fully trained with a large amount of training data covering different accents (regardless of speaker and region) to obtain a global GMM that characterizes the common properties of speech, which greatly reduces the resources consumed in estimating GMM parameters from scratch.
  • After training of the universal background model is completed, only the training data belonging to each accent is needed to fine-tune the UBM parameters (for example, through UBM adaptation) to obtain the GMM of each accent.
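  • A hedged sketch of UBM training and per-accent adaptation with scikit-learn is given below; the component count, the diagonal covariance, and the relevance factor are illustrative choices, and only the means are adapted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features: np.ndarray, n_components: int = 64) -> GaussianMixture:
    """Train a global GMM (the UBM) on speech from all accents pooled together."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(pooled_features)
    return ubm

def map_adapt_means(ubm: GaussianMixture, accent_features: np.ndarray,
                    relevance: float = 16.0) -> np.ndarray:
    """Relevance-MAP adaptation of the UBM means toward one accent's data."""
    post = ubm.predict_proba(accent_features)          # responsibilities (T, C)
    n_c = post.sum(axis=0)                             # zero-order statistics
    f_c = post.T @ accent_features                     # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]
    return alpha * (f_c / np.maximum(n_c[:, None], 1e-10)) + (1 - alpha) * ubm.means_
```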
  • different accents may be accents belonging to different regions.
  • the regions may be divided according to administrative regions, such as Liaoning, Beijing, Tianjin, Shanghai, Henan, Guangdong, and so on.
  • the region may also be divided into regions based on accent according to common experience, such as southern Fujian and Hakka.
  • The extraction of the iVector is based on total variability (TV) space modeling, which maps the high-dimensional GMM supervector obtained from the UBM into a low-dimensional total-variability subspace. This overcomes the limitation that the dimension of the extracted vector grows inconveniently large as the speech signal gets longer, increases the speed of calculation, and expresses more comprehensive characteristics.
  • the GMM supervector in GMM-UBM may include a linear superposition of vector features related to the speaker itself and vector features related to channels and other changes.
  • The subspace modeling form of the TV model is: M = m + Tw, where
  • M represents the GMM supervector of the speech, computed from the MFCC feature parameters;
  • m represents the accent-independent GMM supervector;
  • T represents the load matrix of the space describing the differences; and
  • w represents the low-dimensional factor representation of the GMM supervector M in the load-matrix space, i.e., the iVector.
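  • Given a trained total-variability matrix T, the iVector is commonly taken as the posterior mean w = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ F, where N holds the zero-order and F the centered first-order Baum-Welch statistics. The sketch below assumes diagonal UBM covariances and omits the training of T; it is an illustration, not the patent's own procedure.

```python
import numpy as np

def extract_ivector(T: np.ndarray, sigma_diag: np.ndarray,
                    n_c: np.ndarray, f_centered: np.ndarray, dim: int) -> np.ndarray:
    """T: (C*dim, R) load matrix; sigma_diag: (C*dim,) UBM variances;
    n_c: (C,) zero-order stats; f_centered: (C*dim,) centered first-order stats."""
    counts = np.repeat(n_c, dim)                               # expand counts per feature dim
    precision = np.eye(T.shape[1]) + T.T @ ((counts / sigma_diag)[:, None] * T)
    b = T.T @ (f_centered / sigma_diag)
    return np.linalg.solve(precision, b)                       # the iVector w
```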
  • Noise compensation can be performed on the extracted iVector.
  • Specifically, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) can be used to perform noise compensation on the extracted iVector.
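  • One common way to realise this compensation, sketched here with scikit-learn for the LDA step and a hand-rolled WCCN whitening, is the following; the projection dimensionality is left to LDA's default and the function name is an assumption.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_wccn(ivectors: np.ndarray, accent_labels: np.ndarray, n_dims=None):
    """Project iVectors with LDA, then whiten with the within-class covariance (WCCN)."""
    lda = LinearDiscriminantAnalysis(n_components=n_dims)
    z = lda.fit_transform(ivectors, accent_labels)
    classes = np.unique(accent_labels)
    w_cov = np.zeros((z.shape[1], z.shape[1]))
    for c in classes:                                   # average within-class covariance
        zc = z[accent_labels == c] - z[accent_labels == c].mean(axis=0)
        w_cov += zc.T @ zc / max(len(zc), 1)
    w_cov /= len(classes)
    B = np.linalg.cholesky(np.linalg.inv(w_cov))        # WCCN projection
    return z @ B, lda, B
```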
  • Step 105 Calculate a judgment score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be identified according to the judgment score.
  • There may be one given accent or multiple given accents.
  • When there is one given accent, the decision score of the speech signal to be recognized for that accent may be calculated according to the iVector, and whether the speech signal is of the given accent may be judged according to the decision score.
  • For example, it can be judged whether the decision score is greater than a preset score (for example, 9 points); if so, the speech signal to be recognized is judged to be of the given accent.
  • When there are multiple given accents, the decision score of the speech signal to be recognized for each given accent may be calculated according to the iVector, and the scores may be compared to determine which of the given accents the speech belongs to: the highest of the decision scores is determined, and the given accent corresponding to the highest score is taken as the accent of the speech signal to be recognized.
  • a Logistic Regression model can be used to calculate the decision score of the speech signal to be recognized for a given accent.
  • the logistic regression model can score the speech signal to be identified according to the iVector of the speech signal to be identified.
  • a multi-class logistic regression model may be used to calculate the decision score of the speech signal to be recognized for a given accent.
  • For example, an N-class logistic regression model is used to calculate the decision scores of the speech signal to be recognized for the N given accents.
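  • A minimal sketch of such an N-class scorer, using scikit-learn's multinomial logistic regression over compensated iVectors (an assumption about the concrete implementation), might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_accent_scorer(train_ivectors: np.ndarray, accent_labels: np.ndarray):
    """Fit one multinomial logistic regression model over N given accents."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_ivectors, accent_labels)
    return clf

def recognize_accent(clf, ivector: np.ndarray):
    """Return the accent with the highest decision score, plus all scores."""
    scores = clf.predict_proba(ivector.reshape(1, -1))[0]
    return clf.classes_[int(np.argmax(scores))], dict(zip(clf.classes_, scores))
```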
  • To sum up, the accent recognition method of the first embodiment preprocesses the speech signal to be recognized; detects valid speech in the preprocessed signal; extracts MFCC feature parameters from the valid speech; uses the pre-trained GMM-UBM to extract the iVector of the valid speech according to the MFCC feature parameters; and calculates the decision score of the speech signal to be recognized for a given accent according to the iVector, obtaining the accent recognition result according to the decision score.
  • the first embodiment can realize fast and accurate accent recognition.
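  • Tying the steps together, a hypothetical end-to-end pass over one utterance could read as follows; every function here refers to the illustrative sketches above, not to code disclosed by the patent, and the re-use of concatenated valid frames for MFCC extraction is a deliberate simplification.

```python
import numpy as np

def baum_welch_stats(ubm, feats):
    """Zero- and centered first-order statistics of features against the UBM."""
    post = ubm.predict_proba(feats)                     # (T, C) responsibilities
    n_c = post.sum(axis=0)                              # (C,)
    f_c = post.T @ feats - n_c[:, None] * ubm.means_    # centered first-order stats
    return n_c, f_c.reshape(-1)                         # flatten to (C*dim,)

def recognize_accent_end_to_end(signal, fs, ubm, T, scorer):
    """Illustrative glue code for steps 101-105, reusing the sketches above."""
    emphasized = pre_emphasize(signal)                          # step 101: preprocess
    frames = frame_and_window(emphasized, fs)
    voiced = frames[detect_valid_frames(frames)]                # step 102: valid speech
    feats = mfcc_39(voiced.reshape(-1), fs)                     # step 103 (simplified re-join)
    n_c, f_centered = baum_welch_stats(ubm, feats)
    sigma = ubm.covariances_.reshape(-1)                        # diagonal UBM variances
    w = extract_ivector(T, sigma, n_c, f_centered, feats.shape[1])   # step 104: iVector
    return recognize_accent(scorer, w)                          # step 105: decision score
```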
  • When extracting the MFCC feature parameters, vocal tract length normalization (VTLN) may also be performed to obtain MFCC feature parameters with a normalized vocal tract length.
  • The vocal tract can be represented as a cascaded acoustic-tube model.
  • Each tube can be regarded as a resonant cavity whose resonance frequencies depend on its length and shape. Part of the acoustic difference between speakers therefore stems from differences in vocal tract length, which generally ranges from about 13 cm (adult female) to 18 cm (adult male). As a result, the formant frequencies of the same vowel differ greatly between speakers of different genders. VTLN aims to eliminate the difference in vocal tract length between males and females, so that the accent recognition result is not disturbed by gender.
  • VTLN can match the frequency of formants of each speaker by bending and translating the frequency coordinates.
  • a VTLN method based on bilinear transformation may be used.
  • The bilinear-transformation-based VTLN method does not directly fold the frequency spectrum of the speech signal to be recognized. Instead, a frequency warping factor is calculated from a mapping formula for the cutoff frequency of a bilinear-transformation low-pass filter, so that the average third-formant frequencies of different speakers are aligned; according to the warping factor, the positions (for example, the starting point, middle point, and ending point of each triangular filter) and widths of the triangular filter bank are adjusted by the bilinear transformation; and the vocal-tract-normalized MFCC feature parameters are then calculated with the adjusted triangular filter bank.
  • the scale of the triangular filter is stretched, and the triangular filter bank is expanded and moved to the left at this time.
  • the scale of the triangular filter is compressed, and the triangular filter bank is compressed and moved to the right.
  • the VTLN method based on bilinear transformation avoids a linear search for frequency factors and reduces the computational complexity.
  • the VTLN method based on bilinear transformation uses bilinear transformation to make the bending frequency continuous without bandwidth change.
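  • As an illustration of bilinear frequency warping (the exact mapping formula used in the patent is not reproduced here), the standard all-pass warping ω' = ω + 2·arctan(α·sin ω / (1 - α·cos ω)) could be used to shift the triangular filter edge frequencies; the function names and parameters are assumptions.

```python
import numpy as np

def bilinear_warp(omega: np.ndarray, alpha: float) -> np.ndarray:
    """All-pass / bilinear warping of normalized angular frequencies (radians)."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def warp_filter_edges(edge_hz: np.ndarray, fs: int, alpha: float) -> np.ndarray:
    """Warp the start, centre and end frequencies of the triangular filter bank."""
    omega = 2.0 * np.pi * edge_hz / fs
    return bilinear_warp(omega, alpha) * fs / (2.0 * np.pi)
```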
  • The accent recognition method may further include: performing voiceprint recognition according to the accent recognition result. Because speakers live in different regions, there will be more or less accent difference even if they all speak Mandarin; applying accent recognition before voiceprint recognition can narrow the scope of subsequent voiceprint recognition objects and yield more accurate recognition results.
  • FIG. 2 is a structural diagram of an accent recognition device provided in Embodiment 2 of the present application.
  • the accent recognition device 10 may include a preprocessing unit 201, a detection unit 202, a first extraction unit 203, a second extraction unit 204, and a recognition unit 205.
  • the pre-processing unit 201 is configured to pre-process a speech signal to be recognized.
  • the voice signal to be identified may be an analog voice signal or a digital voice signal. If the speech signal to be identified is an analog speech signal, the analog speech signal is subjected to analog-to-digital conversion to be converted into a digital speech signal.
  • the voice signal to be recognized may be a voice signal collected through a voice input device (for example, a microphone, a mobile phone microphone, etc.).
  • Preprocessing the speech signal to be identified may include pre-emphasizing the speech signal to be identified.
  • The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Because of the influence of glottal excitation and mouth-nose radiation, the energy of a speech signal drops noticeably at the high-frequency end; usually, the higher the frequency, the smaller the amplitude, with the power-spectrum amplitude falling by about 6 dB per octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis of the speech signal to be recognized, the high-frequency part of the signal needs to be boosted, that is, the signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter.
  • The transfer function of the high-pass filter can be: H(z) = 1 - κz⁻¹, 0.9 ≤ κ ≤ 1.0,
  • where κ is the pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
  • Preprocessing the speech signal to be identified may further include windowing and framing the speech signal to be identified.
  • Speech signals are a kind of non-stationary time-varying signals, which are mainly divided into two categories, voiced and unvoiced.
  • The pitch period of voiced speech, the amplitude of the voiced speech signal, and the vocal-tract parameters all change slowly with time, but the signal can usually be regarded as short-time stationary within 10 ms-30 ms.
  • Therefore, in speech signal processing the speech signal can be divided into short segments for processing. This process is called framing, and each resulting short segment of speech is called a speech frame. Framing is achieved by windowing the speech signal.
  • each voice frame is 20 milliseconds, and there is a 10 millisecond overlap between two adjacent voice frames, that is, one voice frame is taken every 10 milliseconds.
  • the commonly used window functions are rectangular window, Hamming window and Hanning window.
  • The rectangular window function is: w(n) = 1, 0 ≤ n ≤ N-1.
  • The Hamming window function is: w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
  • The Hanning window function is: w(n) = 0.5·(1 - cos(2πn/(N-1))), 0 ≤ n ≤ N-1.
  • Here N is the number of sampling points contained in one speech frame.
  • The detection unit 202 is configured to detect valid speech in the preprocessed speech signal to be recognized.
  • Endpoint detection may be performed according to the short-time energy and short-time zero-crossing rate of the preprocessed speech signal to be recognized, so as to determine the valid speech in it.
  • Specifically, the valid speech in the preprocessed speech signal to be recognized can be detected as follows:
  • A Hamming window may be applied to the preprocessed speech signal to be recognized, with each frame being 20 ms and the frame shift being 10 ms; if the preprocessed signal has already been windowed and framed, this step is omitted.
  • A discrete Fourier transform (DFT) is performed on each windowed frame to obtain its discrete spectrum; E(m) denotes the cumulative energy of the m-th frequency band, and (m1, m2) denote the starting and ending frequency points of the m-th band.
  • The first extraction unit 203 is configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech.
  • the discrete energy spectrum of the speech frame is passed through a set of triangular filters (ie, triangular filter groups) uniformly distributed on the Mel frequency to obtain the output of each triangular filter.
  • the center frequencies of the set of triangular filters are evenly arranged on the Mel frequency scale, and the frequencies of the two bottom points of the triangles of each triangular filter are respectively equal to the center frequencies of two adjacent triangular filters.
  • the center frequency of the triangular filter is:
  • the frequency response of the triangular filter is:
  • Discrete Cosine Transform is performed on S (m) to obtain the initial MFCC characteristic parameters of the speech frame.
  • the discrete cosine transform is:
  • the initial MFCC characteristic parameters only reflect the static characteristics of the speech parameters.
  • The dynamic characteristics of speech can be described by the differential spectrum of the static features, and combining dynamic and static features can effectively improve the recognition performance of the system. Usually, first-order and/or second-order differential MFCC feature parameters are used.
  • the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
  • the triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has good recognition performance in a noisy environment.
  • the extracted MFCC feature parameters may also be subjected to dimensionality reduction processing to obtain the dimensionality-reduced MFCC feature parameters.
  • A piecewise-mean data dimensionality reduction algorithm may be applied to the MFCC feature parameters to obtain the dimensionality-reduced MFCC feature parameters.
  • the reduced MFCC feature parameters will be used in subsequent steps.
  • The second extraction unit 204 is configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model (GMM)-universal background model (UBM) according to the MFCC feature parameters.
  • The universal background model is itself a Gaussian mixture model (GMM), and it is intended to solve the problem of scarce data in real scenes.
  • A GMM is a parametric generative model with a strong ability to represent real data through its Gaussian components: the more components, the stronger the representation, but also the larger the model, and the negative effects gradually become prominent. Obtaining a GMM with strong generalization ability requires sufficient data to drive parameter training, whereas the speech available in real scenes often does not even reach the minute level. The UBM solves this problem of insufficient training data.
  • The UBM is fully trained with a large amount of training data covering different accents (regardless of speaker and region) to obtain a global GMM that characterizes the common properties of speech, which greatly reduces the resources consumed in estimating GMM parameters from scratch.
  • After training of the universal background model is completed, only the training data belonging to each accent is needed to fine-tune the UBM parameters (for example, through UBM adaptation) to obtain the GMM of each accent.
  • different accents may be accents belonging to different regions.
  • the regions may be divided according to administrative regions, such as Liaoning, Beijing, Tianjin, Shanghai, Henan, Guangdong, and so on.
  • the region may also be divided into regions based on accent according to common experience, such as southern Fujian and Hakka.
  • The extraction of the iVector is based on total variability (TV) space modeling, which maps the high-dimensional GMM supervector obtained from the UBM into a low-dimensional total-variability subspace. This overcomes the limitation that the dimension of the extracted vector grows inconveniently large as the speech signal gets longer, increases the speed of calculation, and expresses more comprehensive characteristics.
  • the GMM supervector in GMM-UBM may include a linear superposition of vector features related to the speaker itself and vector features related to channels and other changes.
  • The subspace modeling form of the TV model is: M = m + Tw, where
  • M represents the GMM supervector of the speech, computed from the MFCC feature parameters;
  • m represents the accent-independent GMM supervector;
  • T represents the load matrix of the space describing the differences; and
  • w represents the low-dimensional factor representation of the GMM supervector M in the load-matrix space, i.e., the iVector.
  • Noise compensation can be performed on the extracted iVector.
  • Specifically, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) can be used to perform noise compensation on the extracted iVector.
  • the recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
  • There may be one given accent or multiple given accents.
  • When there is one given accent, the decision score of the speech signal to be recognized for that accent may be calculated according to the iVector, and whether the speech signal is of the given accent may be judged according to the decision score.
  • For example, it can be judged whether the decision score is greater than a preset score (for example, 9 points); if so, the speech signal to be recognized is judged to be of the given accent.
  • When there are multiple given accents, the decision score of the speech signal to be recognized for each given accent may be calculated according to the iVector, and the scores may be compared to determine which of the given accents the speech belongs to: the highest of the decision scores is determined, and the given accent corresponding to the highest score is taken as the accent of the speech signal to be recognized.
  • a Logistic Regression model can be used to calculate the decision score of the speech signal to be recognized for a given accent.
  • the logistic regression model can score the speech signal to be identified according to the iVector of the speech signal to be identified.
  • a multi-class logistic regression model may be used to calculate the decision score of the speech signal to be recognized for a given accent.
  • For example, an N-class logistic regression model is used to calculate the decision scores of the speech signal to be recognized for the N given accents.
  • Here w_i and k_i are the parameters of the N-class logistic regression model: w_i is the regression coefficient and k_i is a constant for the i-th given accent. Each given accent has its corresponding w_i and k_i, and together they form the parameter set M = {(w_1, k_1), (w_2, k_2), ..., (w_N, k_N)}.
  • To sum up, the accent recognition device 10 of Embodiment 2 preprocesses the speech signal to be recognized; detects valid speech in the preprocessed signal; extracts MFCC feature parameters from the valid speech; uses the pre-trained GMM-UBM to extract the iVector of the valid speech according to the MFCC feature parameters; and calculates the decision score of the speech signal to be recognized for a given accent according to the iVector, obtaining the accent recognition result according to the decision score.
  • the second embodiment can realize fast and accurate accent recognition.
  • The first extraction unit 203 may also perform vocal tract length normalization (VTLN) to obtain MFCC feature parameters with a normalized vocal tract length.
  • The vocal tract can be represented as a cascaded acoustic-tube model.
  • Each tube can be regarded as a resonant cavity whose resonance frequencies depend on its length and shape. Part of the acoustic difference between speakers therefore stems from differences in vocal tract length, which generally ranges from about 13 cm (adult female) to 18 cm (adult male). As a result, the formant frequencies of the same vowel differ greatly between speakers of different genders. VTLN aims to eliminate the difference in vocal tract length between males and females, so that the accent recognition result is not disturbed by gender.
  • VTLN can match the frequency of formants of each speaker by bending and translating the frequency coordinates.
  • a VTLN method based on bilinear transformation may be used.
  • The bilinear-transformation-based VTLN method does not directly fold the frequency spectrum of the speech signal to be recognized. Instead, a frequency warping factor is calculated from a mapping formula for the cutoff frequency of a bilinear-transformation low-pass filter, so that the average third-formant frequencies of different speakers are aligned; according to the warping factor, the positions (for example, the starting point, middle point, and ending point of each triangular filter) and widths of the triangular filter bank are adjusted by the bilinear transformation; and the vocal-tract-normalized MFCC feature parameters are then calculated with the adjusted triangular filter bank.
  • the scale of the triangular filter is stretched, and the triangular filter bank is expanded and moved to the left at this time.
  • the scale of the triangular filter is compressed, and the triangular filter bank is compressed and moved to the right.
  • the VTLN method based on bilinear transformation avoids a linear search for frequency factors and reduces the computational complexity.
  • the VTLN method based on bilinear transformation uses bilinear transformation to make the bending frequency continuous without bandwidth change.
  • The recognition unit 205 may be further configured to perform voiceprint recognition according to the accent recognition result. Because speakers live in different regions, there will be more or less accent difference even if they all speak Mandarin; applying accent recognition before voiceprint recognition can narrow the scope of subsequent voiceprint recognition objects and yield more accurate recognition results.
  • This embodiment provides a non-volatile readable storage medium.
  • Computer-readable instructions are stored on the non-volatile readable storage medium.
  • When the computer-readable instructions are executed by a processor, the steps of the accent recognition method embodiment described above are implemented, such as steps 101-105 shown in FIG. 1:
  • Step 101: Preprocess a speech signal to be recognized;
  • Step 102: Detect valid speech in the preprocessed speech signal to be recognized;
  • Step 103: Extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • Step 104: Use a pre-trained Gaussian mixture model-universal background model (GMM-UBM) to extract the identity vector (iVector) of the valid speech according to the MFCC feature parameters;
  • Step 105: Calculate a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be recognized according to the decision score.
  • the detecting valid speech in the to-be-recognized voice signal after preprocessing may include:
  • The logarithm of the cumulative energy of each frequency band is compared with a preset threshold to obtain the valid speech.
  • The vocal-tract-normalized MFCC feature parameters are calculated according to the adjusted triangular filter bank.
  • a pre-processing unit 201 configured to pre-process a speech signal to be recognized
  • a detection unit 202, configured to detect valid speech in the preprocessed speech signal to be recognized;
  • a first extraction unit 203, configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • a second extraction unit 204, configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters;
  • the recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
  • the detection unit 202 may be specifically configured to:
  • The logarithm of the cumulative energy of each frequency band is compared with a preset threshold to obtain the valid speech.
  • the first extraction unit 203 may be specifically configured to:
  • The vocal-tract-normalized MFCC feature parameters are calculated according to the adjusted triangular filter bank.
  • FIG. 3 is a schematic diagram of a computer device according to a fourth embodiment of the present application.
  • the computer device 1 includes a memory 20, a processor 30, and computer-readable instructions 40 stored in the memory 20 and executable on the processor 30, such as an accent recognition program.
  • the processor 30 executes the computer-readable instructions 40, the steps in the embodiment of the accent recognition method described above are implemented, for example, steps 101-105 shown in FIG. 1:
  • Step 101: Preprocess a speech signal to be recognized;
  • Step 102: Detect valid speech in the preprocessed speech signal to be recognized;
  • Step 103: Extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • Step 104: Use a pre-trained Gaussian mixture model-universal background model (GMM-UBM) to extract the identity vector (iVector) of the valid speech according to the MFCC feature parameters;
  • Step 105: Calculate a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be recognized according to the decision score.
  • the detecting valid speech in the to-be-recognized voice signal after preprocessing may include:
  • The logarithm of the cumulative energy of each frequency band is compared with a preset threshold to obtain the valid speech.
  • The vocal-tract-normalized MFCC feature parameters are calculated according to the adjusted triangular filter bank.
  • a pre-processing unit 201 configured to pre-process a speech signal to be recognized
  • a detection unit 202, configured to detect valid speech in the preprocessed speech signal to be recognized;
  • a first extraction unit 203, configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
  • a second extraction unit 204, configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters;
  • the recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
  • the detection unit 202 may be specifically configured to:
  • The logarithm of the cumulative energy of each frequency band is compared with a preset threshold to obtain the valid speech.
  • the first extraction unit 203 may be specifically configured to:
  • The vocal-tract-normalized MFCC feature parameters are calculated according to the adjusted triangular filter bank.
  • the computer-readable instructions 40 may be divided into one or more modules / units, the one or more modules / units are stored in the memory 20 and executed by the processor 30, To complete this application.
  • the one or more modules / units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 40 in the computer device 1.
  • For example, the computer-readable instructions 40 may be divided into the preprocessing unit 201, the detection unit 202, the first extraction unit 203, the second extraction unit 204, and the recognition unit 205 shown in FIG. 2.
  • the computer device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • The schematic diagram in FIG. 3 is only an example of the computer device 1 and does not constitute a limitation on it; the computer device 1 may include more or fewer components than shown, some components may be combined, or different components may be used. For example, the computer device 1 may further include input/output devices, network access devices, buses, and the like.
  • The processor 30 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor, etc.
  • The processor 30 is the control center of the computer device 1, and uses various interfaces and lines to connect the various parts of the entire computer device 1.
  • The memory 20 may be configured to store the computer-readable instructions 40 and/or the modules/units, and the processor 30 implements the various functions of the computer device 1 by running or executing the computer-readable instructions and/or modules/units stored in the memory 20 and calling the data stored in the memory 20.
  • the memory 20 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, application programs required for at least one function (such as a sound playback function, an image playback function, etc.), etc .; the storage data area may Data (such as audio data, phone book, etc.) created according to the use of the computer device 1 are stored.
  • The memory 20 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • When the modules/units integrated in the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a non-volatile readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can also be completed by computer-readable instructions instructing the relevant hardware.
  • The computer-readable instructions can be stored in a non-volatile readable storage medium, and when the computer-readable instructions are executed by a processor, the steps of the foregoing method embodiments can be implemented.
  • the computer-readable instructions include computer-readable instruction codes, and the computer-readable instruction codes may be in a source code form, an object code form, an executable file, or some intermediate form.
  • The non-volatile readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, non-volatile readable media do not include electrical carrier signals and telecommunication signals.
  • each functional unit in each embodiment of the present application may be integrated in the same processing unit, or each unit may exist separately physically, or two or more units may be integrated in the same unit.
  • the integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional modules.

Abstract

An accent identification method, comprising: pre-processing a voice signal to be identified (101); detecting an effective voice in said voice signal having been pre-processed (102); extracting a mel frequency cepstrum coefficient (MFCC) feature parameter with respect to the effective voice (103); according to the MFCC feature parameter, using a pre-trained Gaussian mixture model (GMM)-universal background model (UBM) to extract an identity vector (iVector) of the effective voice (104); and calculating, according to the iVector, a determination score of said voice signal with respect to a given accent, and obtaining, according to the determination score, an accent identification result of said voice signal (105).

Description

Accent recognition method, device, computer device and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on August 14, 2018, with application number 201810922056.0 and invention title "Accent Recognition Method, Device, Computer Device, and Computer-readable Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer audition technology, and in particular to an accent recognition method and device, a computer device, and a storage medium.
Background
With the continuous emergence and deployment of various types of intelligent identity authentication, face recognition and voiceprint recognition have reached a relatively mature stage of development, but recognition accuracy still has room for improvement; in voiceprint recognition in particular, breakthrough points can still be found to obtain more accurate recognition results, and the accent factor is one of them. Because speakers live in different regions, accent differences remain, to a greater or lesser degree, even when everyone speaks Mandarin. If accent recognition can be added to existing voiceprint recognition as a supplement, the application scenarios can be extended further; the most direct application is to identify the region the speaker comes from before voiceprint recognition and thereby narrow the set of candidates for subsequent recognition. However, the existing accent recognition effect is not ideal: recognition is slow and accuracy is not high.
Summary of the Invention
In view of the above, it is necessary to provide an accent recognition method and device, a computer device, and a storage medium that can realize fast and accurate accent recognition.
A first aspect of the present application provides an accent recognition method, which includes:
preprocessing a speech signal to be recognized;
detecting valid speech in the preprocessed speech signal to be recognized;
extracting Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
extracting an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and
calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
A second aspect of the present application provides an accent recognition device, which includes:
a preprocessing unit, configured to preprocess a speech signal to be recognized;
a detection unit, configured to detect valid speech in the preprocessed speech signal to be recognized;
a first extraction unit, configured to extract Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
a second extraction unit, configured to extract an identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and
a recognition unit, configured to calculate a decision score of the speech signal to be recognized for a given accent according to the iVector and obtain an accent recognition result of the speech signal to be recognized according to the decision score.
A third aspect of the present application provides a computer device, which includes a processor, where the processor implements the accent recognition method when executing computer-readable instructions stored in a memory.
A fourth aspect of the present application provides a non-volatile readable storage medium on which computer-readable instructions are stored, and the computer-readable instructions implement the accent recognition method when executed by a processor.
This application preprocesses the speech signal to be recognized; detects valid speech in the preprocessed speech signal; extracts Mel frequency cepstral coefficient (MFCC) feature parameters from the valid speech; extracts the identity vector (iVector) of the valid speech by using a pre-trained Gaussian mixture model-universal background model (GMM-UBM) according to the MFCC feature parameters; and calculates a decision score of the speech signal to be recognized for a given accent according to the iVector, obtaining the accent recognition result of the speech signal to be recognized according to the decision score. This application can find problems at the database level without requiring testers to discover them through complex and extensive functional tests. This application can realize fast and accurate accent recognition.
Brief Description of the Drawings
FIG. 1 is a flowchart of an accent recognition method according to an embodiment of the present application.
FIG. 2 is a structural diagram of an accent recognition device according to an embodiment of the present application.
FIG. 3 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to understand the above objectives, features, and advantages of the present application more clearly, the present application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where there is no conflict, the embodiments of the present application and the features in the embodiments can be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present application; the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used in the specification of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application.
Preferably, the accent recognition method of the present application is applied in one or more computer devices. A computer device is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, a voice control device, or the like.
实施例一Example one
图1是本申请实施例一提供的口音识别方法的流程图。所述口音识别方法应用于计算机装置，对待识别语音信号进行口音识别，得到所述待识别语音信号所属的口音。FIG. 1 is a flowchart of the accent recognition method provided in Embodiment 1 of the present application. The accent recognition method is applied to a computer device, performs accent recognition on a speech signal to be recognized, and obtains the accent to which the speech signal to be recognized belongs.
如图1所示,所述口音识别方法具体包括以下步骤:As shown in FIG. 1, the accent recognition method specifically includes the following steps:
步骤101,对待识别语音信号进行预处理。Step 101: Pre-process a speech signal to be recognized.
所述待识别语音信号可以是模拟语音信号,也可以是数字语音信号。若所述待识别语音信号是模拟语音信号,则将所述模拟语音信号进行模数变换,转换为数字语音信号。The voice signal to be identified may be an analog voice signal or a digital voice signal. If the speech signal to be identified is an analog speech signal, the analog speech signal is subjected to analog-to-digital conversion to be converted into a digital speech signal.
所述待识别语音信号可以是通过语音输入设备(例如麦克风、手机话筒等)采集到的语音信号。The voice signal to be recognized may be a voice signal collected through a voice input device (for example, a microphone, a mobile phone microphone, etc.).
对所述待识别语音信号进行预处理可以包括对所述待识别语音信号进行预加重。Preprocessing the speech signal to be identified may include pre-emphasizing the speech signal to be identified.
预加重的目的是提升语音的高频分量,使信号的频谱变得平坦。语音信号由于受声门激励和口鼻辐射的影响,能量在高频端明显减小,通常是频率越高幅值越小。当频率提升两倍时,功率谱幅度按6dB/oct跌落。因此,在对待识别语音信号进行频谱分析或声道参数分析前,需要对待识别语音信号的高频部分进行频率提升,即对待识别语音信号进行预加重。预加重一般利用高通滤波器实现,高通滤波器的传递函数可以为:The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Due to the influence of the glottal excitation and mouth-nose radiation, the energy of the speech signal is significantly reduced at the high-frequency end, usually the higher the frequency, the smaller the amplitude. When the frequency is doubled, the power spectrum amplitude drops by 6dB / oct. Therefore, before performing spectrum analysis or channel parameter analysis of the speech signal to be identified, it is necessary to perform frequency boosting on the high frequency portion of the speech signal to be identified, that is, pre-emphasis of the speech signal to be identified. Pre-emphasis is generally implemented using a high-pass filter. The transfer function of the high-pass filter can be:
H(z) = 1 - κz⁻¹，0.9 ≤ κ ≤ 1.0，
其中,κ为预加重系数,优选取值在0.94-0.97之间。Among them, κ is a pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
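Purely as an illustrative, non-limiting sketch of the pre-emphasis described above, the following Python snippet applies the high-pass filter H(z) = 1 - κz⁻¹ to a digital speech signal held in a numpy array; the default coefficient 0.95 is only an assumed example inside the 0.94-0.97 range mentioned above.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, kappa: float = 0.95) -> np.ndarray:
    """Apply the pre-emphasis filter y(n) = x(n) - kappa * x(n-1), i.e. H(z) = 1 - kappa * z^-1."""
    # The first sample has no predecessor, so it is kept unchanged.
    return np.append(signal[0], signal[1:] - kappa * signal[:-1])
```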
对所述待识别语音信号进行预处理还可以包括对所述待识别语音信号进行加窗分帧。Preprocessing the speech signal to be identified may further include windowing and framing the speech signal to be identified.
语音信号是一种非平稳的时变信号，主要分为浊音和清音两大类。浊音的基音周期、浊音信号幅度和声道参数等都随时间而缓慢变化，但通常在10ms-30ms的时间内可以认为具有短时平稳性。为了获得短时平稳信号，语音信号处理中可以把语音信号分成一些短段来进行处理，这个过程称为分帧，得到的短段的语音信号称为语音帧。分帧是通过对语音信号进行加窗处理来实现的。为了避免相邻两帧的变化幅度过大，帧与帧之间需要重叠一部分。在本申请的一个实施例中，每个语音帧为20毫秒，相邻两个语音帧之间存在10毫秒重叠，也就是每隔10毫秒取一个语音帧。A speech signal is a non-stationary, time-varying signal, mainly divided into voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal and the vocal tract parameters all change slowly with time, but the signal can usually be regarded as short-term stationary within 10 ms-30 ms. To obtain short-term stationary signals, speech signal processing may divide the speech signal into short segments for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is achieved by windowing the speech signal. To avoid excessive change between two adjacent frames, adjacent frames need to overlap partially. In one embodiment of the present application, each speech frame is 20 milliseconds, and two adjacent speech frames overlap by 10 milliseconds, that is, one speech frame is taken every 10 milliseconds.
常用的窗函数有矩形窗、汉明窗和汉宁窗,矩形窗函数为:The commonly used window functions are rectangular window, Hamming window and Hanning window. The rectangular window function is:
w(n) = 1，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
汉明窗函数为:The Hamming window function is:
w(n) = 0.54 - 0.46cos(2πn/(N-1))，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
汉宁窗函数为:The Hanning window function is:
w(n) = 0.5[1 - cos(2πn/(N-1))]，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
其中,N为一个语音帧所包含的采样点的个数。Among them, N is the number of sampling points included in a speech frame.
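The windowing and framing described above can be sketched as follows; this is only an assumed implementation (Hamming window, 20 ms frames and 10 ms shift as in the embodiment, with a 16 kHz sampling rate chosen purely for illustration), not a definitive one.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, fs: int = 16000,
                     frame_ms: float = 20.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a speech signal into overlapping frames and apply a Hamming window.

    Frame length 20 ms and shift 10 ms follow the embodiment above; the 16 kHz
    sampling rate is only an assumed example.
    """
    frame_len = int(fs * frame_ms / 1000)   # N, samples per frame
    shift = int(fs * shift_ms / 1000)
    num_frames = max(1, 1 + (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)          # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.zeros((num_frames, frame_len))
    for i in range(num_frames):
        start = i * shift
        chunk = signal[start:start + frame_len]
        frames[i, :len(chunk)] = chunk       # zero-pad the last, possibly short, frame
    return frames * window
```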
步骤102,检测预处理后的所述待识别语音信号中的有效语音。Step 102: Detect a valid voice in the pre-recognized voice signal.
可以根据预处理后的所述待识别语音信号的短时能量和短时过零率等进行端点检测,以确定所述待识别语音信号中的有效语音。Endpoint detection may be performed according to the pre-processed short-term energy and short-time zero-crossing rate of the speech signal to be identified to determine valid speech in the speech signal to be identified.
在本实施例中,可以通过下述方法检测预处理后的所述待识别语音信号中的有效语音:In this embodiment, the valid voice in the pre-recognized voice signal can be detected by the following methods:
(1)对预处理后的所述待识别语音信号进行加窗分帧,得到所述待识别语音信号的语音帧x(n)。在一个具体实施例中,可以对预处理后的所述待识别语音信号加汉明窗,每帧20ms,帧移10ms。若预处理过程中已对待识别语音信号加窗分帧,则该步骤省略。(1) Windowing and framing the pre-processed speech signal to be identified to obtain a speech frame x (n) of the speech signal to be identified. In a specific embodiment, a Hamming window may be added to the pre-processed speech signal to be identified, each frame being 20 ms, and the frame shifting being 10 ms. If window frames have been framed in the pre-processed speech signal, this step is omitted.
(2)对所述语音帧x(n)进行离散傅里叶变换(Discrete Fourier Transform,DFT),得到所述语音帧x(n)的频谱:(2) Discrete Fourier Transform (DFT) is performed on the speech frame x (n) to obtain the frequency spectrum of the speech frame x (n):
X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N)，k = 0, 1, …, N-1
(3)根据所述语音帧x(n)的频谱计算各个频带的累计能量:(3) Calculate the cumulative energy of each frequency band according to the frequency spectrum of the speech frame x (n):
E(m) = Σ_{k=m1}^{m2} |X(k)|²
其中E(m)表示第m个频带的累计能量，m1和m2分别表示第m个频带的起始和截止频点。Where E(m) represents the cumulative energy of the m-th frequency band, and m1 and m2 represent the start and end frequency points of the m-th frequency band, respectively.
(4)对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值。(4) Perform a logarithmic operation on the accumulated energy of each frequency band to obtain a logarithm value of the accumulated energy of each frequency band.
(5)将所述各个频带的累计能量对数值与预设阈值比较,得到所述有效语音。若一个频带的累计能量对数值高于预设阈值,则所述频带对应的语音为有效语音。(5) Compare the cumulative energy logarithm of each frequency band with a preset threshold to obtain the effective speech. If the cumulative energy log value of a frequency band is higher than a preset threshold, the speech corresponding to the frequency band is a valid speech.
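A minimal sketch of the endpoint-detection steps (1)-(5) above is given below; the number of frequency bands, the uniform band split and the threshold value are illustrative assumptions rather than values fixed by the embodiment.

```python
import numpy as np

def detect_valid_speech(frames: np.ndarray, num_bands: int = 4,
                        threshold: float = 10.0) -> np.ndarray:
    """Return a boolean mask of frames whose per-band log cumulative energy exceeds a threshold.

    Frames are assumed to be already windowed (e.g. by frame_and_window above);
    the band count, the uniform band split and the threshold are illustrative only.
    """
    spectrum = np.fft.rfft(frames, axis=1)            # step (2): DFT of each frame
    power = np.abs(spectrum) ** 2
    bands = np.array_split(power, num_bands, axis=1)  # step (3): split spectrum into bands
    band_energy = np.stack([b.sum(axis=1) for b in bands], axis=1)
    log_energy = np.log(band_energy + 1e-10)          # step (4): logarithm of cumulative energy
    return (log_energy > threshold).any(axis=1)       # step (5): compare with preset threshold
```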
步骤103，对所述有效语音提取梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征参数。In step 103, Mel Frequency Cepstrum Coefficient (MFCC) feature parameters are extracted from the valid speech.
提取MFCC特征参数的流程如下:The process of extracting MFCC characteristic parameters is as follows:
(1)对每一个语音帧进行离散傅里叶变换(可以是快速傅里叶变换),得到该语音帧的频谱。(1) Perform a discrete Fourier transform (which can be a fast Fourier transform) on each speech frame to obtain the frequency spectrum of the speech frame.
(2)求该语音帧的频谱幅度的平方,得到该语音帧的离散能量谱。(2) The square of the spectral amplitude of the speech frame is obtained to obtain the discrete energy spectrum of the speech frame.
(3)将该语音帧的离散能量谱通过一组Mel频率上均匀分布的三角滤波器(即三角滤波器组),得到各个三角滤波器的输出。该组三角滤波器的中心频率在Mel频率刻度上均匀排列,且每个三角滤波器的三角形两个底点的频率分别等于相邻的两个三角滤波器的中心频率。三角滤波器的中心频率为:(3) The discrete energy spectrum of the speech frame is passed through a set of triangular filters (ie, triangular filter groups) uniformly distributed on the Mel frequency to obtain the output of each triangular filter. The center frequencies of the set of triangular filters are evenly arranged on the Mel frequency scale, and the frequencies of the two bottom points of the triangles of each triangular filter are respectively equal to the center frequencies of two adjacent triangular filters. The center frequency of the triangular filter is:
f(m) = (N/F_s)·B⁻¹( B(f_l) + m·[B(f_h) - B(f_l)]/(M+1) )，m = 1, 2, …, M，其中B(f) = 1125·ln(1 + f/700) (where B(f) = 1125·ln(1 + f/700))
三角滤波器的频率响应为:The frequency response of the triangular filter is:
H_m(k) = 0，k &lt; f(m-1)或k &gt; f(m+1)；H_m(k) = (k - f(m-1))/(f(m) - f(m-1))，f(m-1) ≤ k ≤ f(m)；H_m(k) = (f(m+1) - k)/(f(m+1) - f(m))，f(m) &lt; k ≤ f(m+1)
其中，f_h、f_l分别为三角滤波器组的最高频率和最低频率；N为傅里叶变换的点数；F_s为采样频率；M为三角滤波器的个数；B⁻¹(b) = 700(e^(b/1125) - 1)是Mel频率函数B(f)的逆函数。Here, f_h and f_l are the highest and lowest frequencies covered by the triangular filter bank; N is the number of Fourier transform points; F_s is the sampling frequency; M is the number of triangular filters; and B⁻¹(b) = 700(e^(b/1125) - 1) is the inverse of the Mel frequency function B(f).
(4)对所有三角滤波器的输出做对数运算,得到该语音帧的对数功率谱S(m)。(4) Logarithmic operations are performed on the outputs of all triangular filters to obtain the logarithmic power spectrum S (m) of the speech frame.
(5)对S(m)做离散余弦变换(Discrete Cosine Transform,DCT),得到该语音帧的初始MFCC特征参数。离散余弦变换为:(5) Discrete Cosine Transform (DCT) is performed on S (m) to obtain the initial MFCC characteristic parameters of the speech frame. The discrete cosine transform is:
C(n) = Σ_{m=0}^{M-1} S(m)·cos( πn(m + 0.5)/M )，n = 1, 2, …, L
(6)提取语音帧的动态差分MFCC特征参数。初始MFCC特征参数只反映了语音参数的静态特性，语音的动态特性可通过静态特征的差分谱来描述，动静态结合可以有效提升系统的识别性能，通常使用一阶和/或者二阶差分MFCC特征参数。(6) Extract the dynamic differential MFCC feature parameters of the speech frame. The initial MFCC feature parameters only reflect the static characteristics of the speech; the dynamic characteristics of speech can be described by the differential spectrum of the static features. Combining dynamic and static features can effectively improve the recognition performance of the system, and first-order and/or second-order differential MFCC feature parameters are usually used.
在一具体实施例中,提取的MFCC特征参数为39维的特征矢量,包括13维初始MFCC特征参数、13维一阶差分MFCC特征参数和13维二阶差分MFCC特征参数。In a specific embodiment, the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
MFCC中引入了三角滤波器组,且三角滤波器在低频段分布较密,在高频段分布较疏,符合人耳听觉特性,在噪声环境下仍具有较好的识别性能。The triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has good recognition performance in a noisy environment.
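The MFCC extraction flow of steps (1)-(6) can be sketched as follows. The filterbank size (26), the 16 kHz sampling rate and the simple two-frame delta are assumed illustrative choices, and the filterbank construction uses the common FFT-bin rounding approximation of the Mel triangular filters rather than reproducing the exact formulas above.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(num_filters: int, nfft: int, fs: int) -> np.ndarray:
    """Triangular filters spaced uniformly on the Mel scale B(f) = 1125*ln(1 + f/700)."""
    mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
    mel_inv = lambda b: 700.0 * (np.exp(b / 1125.0) - 1.0)
    mel_points = np.linspace(mel(0), mel(fs / 2), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_inv(mel_points) / fs).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def delta(feat: np.ndarray) -> np.ndarray:
    """Simple first-order difference along the time axis (frame t+1 minus frame t-1)."""
    padded = np.pad(feat, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def mfcc_39(frames: np.ndarray, fs: int = 16000, num_filters: int = 26,
            num_ceps: int = 13) -> np.ndarray:
    """39-dim MFCC: 13 static + 13 delta + 13 delta-delta, as in the embodiment above."""
    nfft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2            # steps (1)-(2)
    fbank_out = np.dot(power, mel_filterbank(num_filters, nfft, fs).T)  # step (3)
    log_power = np.log(fbank_out + 1e-10)                               # step (4)
    static = dct(log_power, type=2, axis=1, norm="ortho")[:, :num_ceps] # step (5)
    d1 = delta(static)                                                  # step (6)
    d2 = delta(d1)
    return np.hstack([static, d1, d2])
```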
在本申请的一个实施中，在对预处理后的待识别语音信号提取MFCC特征参数之后，还可以对提取的MFCC特征参数进行降维处理，得到降维后的MFCC特征参数。例如，采用分段均值数据降维算法对MFCC特征参数进行降维处理，得到降维后的MFCC特征参数。降维后的MFCC特征参数将用于后续的步骤。In one implementation of the present application, after the MFCC feature parameters are extracted from the preprocessed speech signal to be recognized, the extracted MFCC feature parameters may further undergo dimensionality reduction to obtain dimensionality-reduced MFCC feature parameters. For example, a piecewise-mean data dimensionality reduction algorithm is applied to the MFCC feature parameters to obtain the dimensionality-reduced MFCC feature parameters. The dimensionality-reduced MFCC feature parameters are used in the subsequent steps.
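One possible reading of the segment-mean dimensionality reduction mentioned above is to average the MFCC frames within equal-length time segments, as sketched below; the number of segments and this particular interpretation are assumptions.

```python
import numpy as np

def segment_mean_reduce(features: np.ndarray, num_segments: int = 20) -> np.ndarray:
    """Reduce a (frames x dims) MFCC sequence by averaging within equal time segments.

    A minimal reading of the 'segment-mean' dimensionality reduction mentioned above;
    the number of segments is an assumed example.
    """
    segments = np.array_split(features, num_segments, axis=0)
    return np.vstack([seg.mean(axis=0) for seg in segments if len(seg) > 0])
```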
步骤104，根据所述MFCC特征参数，利用预先训练好的高斯混合模型(Gaussian Mixture Model,GMM)-通用背景模型(Universal Background Model,UBM)提取所述有效语音的身份矢量(identity-vector,iVector)。Step 104: According to the MFCC feature parameters, extract the identity vector (identity-vector, iVector) of the valid speech using a pre-trained Gaussian Mixture Model (GMM)-Universal Background Model (UBM).
提取iVector之前,首先要用大量属于不同口音的训练数据训练出通用背景模型。通用背景模型实际上是一种高斯混合模型(GMM),旨在解决实际场景数据量稀缺的问题。GMM是一种参数化的生成性模型,具备对实际数据极强的表征力(基于高斯分量实现)。高斯分量越多,GMM表征力越强,规模也越庞大,此时负面效应逐步凸显——若想获得一个泛化能力较强的GMM模型,则需要足够的数据来驱动GMM的参数训练,然而实际场景中获取的语音数据甚至连分钟级都很难企及。UBM正是解决了训练数据不足的问题。UBM是利用大量属于不同口音的训练数据(无关乎说话人、地域)混合起来充分训练,得到一个可以对语音共通特性进行表征的全局GMM,可大大缩减从头计算GMM参数所消耗的资源。通用背景模型训练完成后, 只需利用单独属于每个口音的训练数据,分别对UBM的参数进行微调(例如通过UBM自适应),得到属于各个口音的GMM。Before extracting iVector, we must first train a universal background model with a large amount of training data belonging to different accents. The general background model is actually a Gaussian mixture model (GMM), which aims to solve the problem of scarce data volume in actual scenes. GMM is a parametric generative model, which has a strong representation of actual data (based on Gaussian components). The more Gaussian components, the stronger the GMM characterization, and the larger the scale. At this time, the negative effects are gradually prominent. If you want to obtain a GMM model with strong generalization ability, you need sufficient data to drive GMM parameter training. However, The voice data obtained in the actual scene is difficult to reach even the minute level. UBM solves the problem of insufficient training data. UBM uses a large amount of training data belonging to different accents (irrespective of speakers and regions) to be fully trained to obtain a global GMM that can characterize common characteristics of speech, which can greatly reduce the resources consumed to calculate GMM parameters from scratch. After the training of the universal background model is completed, only the training data belonging to each accent needs to be used to fine-tune the parameters of the UBM (for example, through UBM adaptation) to obtain the GMM belonging to each accent.
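As a hedged illustration of the UBM training and per-accent adaptation described above, the following sketch trains a diagonal-covariance GMM on pooled multi-accent data with scikit-learn and then performs a simple mean-only MAP adaptation for one accent; the component count and relevance factor are assumed values, and this is not claimed to be the exact training procedure of the embodiment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_accent_features: np.ndarray, num_components: int = 64) -> GaussianMixture:
    """Train the universal background model on pooled features from many accents."""
    ubm = GaussianMixture(n_components=num_components, covariance_type="diag", max_iter=100)
    ubm.fit(all_accent_features)
    return ubm

def adapt_means(ubm: GaussianMixture, accent_features: np.ndarray,
                relevance: float = 16.0) -> np.ndarray:
    """Mean-only MAP adaptation of the UBM towards one accent's training data."""
    post = ubm.predict_proba(accent_features)            # frame-component posteriors
    n_k = post.sum(axis=0)                                # soft counts per component
    f_k = post.T @ accent_features                        # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]            # adaptation weights
    new_means = alpha * (f_k / np.maximum(n_k, 1e-10)[:, None]) + (1.0 - alpha) * ubm.means_
    return new_means
```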
在一个实施例中,不同口音可以是属于不同地域的口音。所述地域可以是按照行政区域来划分,例如辽宁、北京、天津、上海、河南、广东等。所述地域也可以是按照普遍的经验以口音对地区来划分,例如闽南、客家等。In one embodiment, different accents may be accents belonging to different regions. The regions may be divided according to administrative regions, such as Liaoning, Beijing, Tianjin, Shanghai, Henan, Guangdong, and so on. The region may also be divided into regions based on accent according to common experience, such as southern Fujian and Hakka.
提取iVector是基于全差异空间建模(TV)方法将UBM训练得出的高维GMM映射至低维度的全变量子空间,可突破随语音信号时长过长而提取的向量维度过大不便计算的限制,并能提升计算速度,表达出更全面的特征。GMM-UBM中的GMM超矢量可以包括跟说话人本身有关的矢量特征和跟信道以及其他变化有关的矢量特征的线性叠加。The extraction of iVector is based on the full difference space modeling (TV) method, which maps the high-dimensional GMM trained by UBM to the low-dimensional full-variable subspace, which can break through the inconvenient calculation of the extracted vector dimension as the length of the speech signal is too long. Limits, and can increase the speed of calculation and express more comprehensive characteristics. The GMM supervector in GMM-UBM may include a linear superposition of vector features related to the speaker itself and vector features related to channels and other changes.
TV模型的子空间建模形式为:The subspace modeling form of the TV model is:
M=m+TwM = m + Tw
其中，M表示语音的GMM超矢量，即所述MFCC特征参数，m表示口音无关的GMM超矢量，T表示描述差异的空间的载荷矩阵，w表示GMM超矢量M在载荷矩阵空间下对应的低维因子表示，即iVector。Here, M represents the GMM supervector of the speech, that is, the MFCC feature parameters; m represents the accent-independent GMM supervector; T represents the loading matrix of the space describing the differences; and w represents the low-dimensional factor representation of the GMM supervector M in the loading-matrix space, that is, the iVector.
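For illustration only, the following sketch computes the standard posterior-mean estimate of w in the TV model M = m + Tw from Baum-Welch statistics collected under a diagonal-covariance UBM; it assumes the total-variability matrix T has already been trained, and it omits the LDA/WCCN compensation mentioned below.

```python
import numpy as np

def extract_ivector(features: np.ndarray, ubm_weights: np.ndarray, ubm_means: np.ndarray,
                    ubm_vars: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Posterior-mean estimate of w in M = m + Tw (a standard textbook i-vector formula).

    features: (frames x D) MFCCs; ubm_means/ubm_vars: (C x D) diagonal UBM;
    T: (C*D x R) total-variability matrix, assumed already trained elsewhere.
    """
    C, D = ubm_means.shape
    # Frame-component posteriors under the diagonal-covariance UBM (log-domain for stability).
    log_prob = -0.5 * (((features[:, None, :] - ubm_means[None]) ** 2) / ubm_vars[None]).sum(-1)
    log_prob += np.log(ubm_weights) - 0.5 * np.log(ubm_vars).sum(axis=1)
    log_prob -= log_prob.max(axis=1, keepdims=True)
    post = np.exp(log_prob)
    post /= post.sum(axis=1, keepdims=True)
    N = post.sum(axis=0)                                   # zero-order statistics per component
    F = post.T @ features - N[:, None] * ubm_means         # centred first-order statistics
    sigma_inv = (1.0 / ubm_vars).reshape(-1)               # diagonal covariance, flattened to C*D
    N_big = np.repeat(N, D)                                # N expanded to the supervector layout
    R = T.shape[1]
    precision = np.eye(R) + (T * (N_big * sigma_inv)[:, None]).T @ T
    return np.linalg.solve(precision, T.T @ (sigma_inv * F.reshape(-1)))
```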
在本实施例中，可以对提取的iVector进行噪声补偿。在一实施例中，可以采用线性判别分析(Linear Discriminant Analysis,LDA)和类内协方差规整(Within Class Covariance Normalization,WCCN)对提取的iVector进行噪声补偿。In this embodiment, noise compensation may be performed on the extracted iVector. In one embodiment, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) may be used to perform noise compensation on the extracted iVector.
步骤105,根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。Step 105: Calculate a judgment score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be identified according to the judgment score.
给定口音可以是一个也可以是多个。例如,若给定口音为一个,可以根据所述iVector计算所述待识别语音信号对该给定口音的判决得分,根据所述待识别语音信号对该给定口音的判决得分判断所述待识别语音信号是否为该给定口音。可以判断所述判决得分是否大于预设得分(例如9分),若所述判决得分大于预设得分,则判断所述待识别语音信号为该给定口音。A given accent can be one or more. For example, if the given accent is one, the judgment score of the to-be-recognized voice signal for the given accent may be calculated according to the iVector, and the to-be-recognized may be judged according to the judgment score of the to-be-recognized voice signal for the given accent. Whether the speech signal is the given accent. It can be judged whether the judgment score is greater than a preset score (for example, 9 points), and if the judgment score is greater than a preset score, it is judged that the speech signal to be recognized is the given accent.
若给定口音为多个，可以根据所述iVector计算所述待识别语音信号对每个给定口音的判决得分，根据所述待识别语音信号对每个给定口音的判决得分判断所述语音为多个给定口音中的哪一个。可以确定对多个给定口音的判决得分中的最高得分，将所述最高得分对应的给定口音作为所述待识别语音信号所属的口音。If there are multiple given accents, the decision score of the speech signal to be recognized for each given accent may be calculated according to the iVector, and which of the multiple given accents the speech belongs to may be judged according to these decision scores. The highest score among the decision scores for the multiple given accents may be determined, and the given accent corresponding to the highest score is taken as the accent to which the speech signal to be recognized belongs.
在本实施例中,可以利用逻辑回归(Logistic Regression)模型计算所述待识别语音信号对给定口音的判决得分。逻辑回归模型作为一个分类器,可根据待识别语音信号的iVector对待识别语音信号进行打分。特别地,在一具体实施例中,可以使用多类逻辑回归模型计算所述待识别语音信号对给定口音的判决得分。In this embodiment, a Logistic Regression model can be used to calculate the decision score of the speech signal to be recognized for a given accent. As a classifier, the logistic regression model can score the speech signal to be identified according to the iVector of the speech signal to be identified. Specifically, in a specific embodiment, a multi-class logistic regression model may be used to calculate the decision score of the speech signal to be recognized for a given accent.
假设给定口音包括口音1、口音2、…、口音N共N种口音，则利用N类逻辑回归模型计算所述待识别语音信号对给定口音的判决得分。将待识别语音信号的iVector（记为x_t）输入所述N类逻辑回归模型，得到N个判决得分s_it（即所述待识别语音信号对N种给定口音的判决得分），s_it = w_i·x_t + k_i，i = 1, …, N。求取N个判决得分s_it（i = 1, …, N）中的最高得分s_jt，最高得分s_jt对应的口音j即为所述待识别语音信号所属的口音。其中，w_i、k_i是N类逻辑回归模型的参数，w_i为回归系数，k_i为常数，针对每个给定口音均有对应的w_i和k_i，各组(w_i, k_i)组成N类逻辑回归模型的参数向量M = {(w_1, k_1), (w_2, k_2), …, (w_N, k_N)}。Assuming the given accents include N accents (accent 1, accent 2, ..., accent N), an N-class logistic regression model is used to calculate the decision scores of the speech signal to be recognized for the given accents. The iVector of the speech signal to be recognized (denoted x_t) is input into the N-class logistic regression model to obtain N decision scores s_it (that is, the decision scores of the speech signal to be recognized for the N given accents), where s_it = w_i·x_t + k_i, i = 1, ..., N. The highest score s_jt among the N decision scores s_it (i = 1, ..., N) is determined, and the accent j corresponding to the highest score s_jt is the accent to which the speech signal to be recognized belongs. Here, w_i and k_i are parameters of the N-class logistic regression model, w_i being a regression coefficient and k_i a constant; each given accent has its corresponding w_i and k_i, and these pairs form the parameter vector of the N-class logistic regression model, M = {(w_1, k_1), (w_2, k_2), ..., (w_N, k_N)}.
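A minimal sketch of the N-class scoring described above (s_it = w_i·x_t + k_i followed by taking the maximum) is shown below; the parameters W and k are assumed to come from an already-trained logistic regression model, and the commented usage example uses purely hypothetical random values.

```python
import numpy as np

def accent_decision(x_t: np.ndarray, W: np.ndarray, k: np.ndarray):
    """Score an iVector against N given accents with s_i = w_i . x_t + k_i and pick the argmax.

    W: (N x R) rows are the per-accent regression coefficients w_i; k: (N,) constants k_i.
    Both are assumed to come from an already-trained N-class logistic regression model.
    """
    scores = W @ x_t + k          # decision scores s_1t, ..., s_Nt
    best = int(np.argmax(scores)) # accent j with the highest score s_jt
    return scores, best

# Hypothetical usage with randomly initialised parameters (illustration only):
# rng = np.random.default_rng(0)
# scores, accent_j = accent_decision(rng.standard_normal(400),
#                                    rng.standard_normal((5, 400)), np.zeros(5))
```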
实施例一的口音识别方法对待识别语音信号进行预处理；检测预处理后的所述待识别语音信号中的有效语音；对所述有效语音提取MFCC特征参数；根据所述MFCC特征参数，利用预先训练好的GMM-UBM提取所述有效语音的iVector；根据所述iVector计算所述待识别语音信号对给定口音的判决得分，根据所述判决得分得到所述待识别语音信号的口音识别结果。实施例一可以实现快速准确的口音识别。The accent recognition method of Embodiment 1 preprocesses the speech signal to be recognized; detects valid speech in the preprocessed speech signal to be recognized; extracts MFCC feature parameters from the valid speech; extracts the iVector of the valid speech using the pre-trained GMM-UBM according to the MFCC feature parameters; calculates the decision score of the speech signal to be recognized for a given accent according to the iVector, and obtains the accent recognition result of the speech signal to be recognized according to the decision score. Embodiment 1 can realize fast and accurate accent recognition.
在其他的实施例中，在提取MFCC特征参数时，可以进行声道长度归一化(Vocal Tract Length Normalization,VTLN)，得到声道长度归一化的MFCC特征参数。In other embodiments, when extracting the MFCC feature parameters, vocal tract length normalization (Vocal Tract Length Normalization, VTLN) may be performed to obtain vocal-tract-length-normalized MFCC feature parameters.
声道可以表示为级联声管模型,每个声管都可以看成是一个谐振腔,它们的共振频率取决于声管的长度和形状。因此,说话人之间的部分声学差异是由于说话人的声道长度不同。例如,声道长度的变化范围一般从13cm(成年女性)变化到18cm(成年男性),因此,不同性别的人说同一个元音的共振峰频率相差很大。VTLN就是为了消除男、女声道长度的差异,使口音识别的结果不受性别的干扰。The channels can be represented as a cascaded sound tube model. Each sound tube can be regarded as a resonant cavity, and their resonance frequency depends on the length and shape of the sound tube. Therefore, some of the acoustic differences between speakers are due to the difference in speaker channel length. For example, the range of the channel length generally varies from 13 cm (adult female) to 18 cm (adult male). Therefore, people of different genders say that the formant frequencies of the same vowel differ greatly. VTLN is to eliminate the difference in the length of the male and female channels, so that the result of accent recognition is not disturbed by gender.
VTLN可以通过弯折和平移频率坐标来使各说话人的共振峰频率相匹配。在本实施例中,可以采用基于双线性变换的VTLN方法。该基于双线性变换的VTLN方法并不直接对待识别语音信号的频谱进行折叠,而是采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;根据所述频率弯折因子,采用双线性变换对三角滤波器组的位置(例如三角滤波器的起点、中间点和结束点)和宽度进行调整;根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。例如,若要对待识别语音信号进行频谱压缩,则对三角滤波器的刻度进行拉伸,此时三角滤波器组向左扩展和移动。若要对待识别语音信号进行频谱拉伸,则对三角滤波器的刻度进行压缩,此时三角滤波器组向右压缩和移动。采用该基于双线性变换的VTLN方法对特定人群或特定人进行声道归一化时,仅需要对三角滤波器组系数进行一次变换即可,无需每次在提取特征参数时都对信号频谱折叠,从而大大减小了计算量。并且,该基于双线性变换的VTLN方法避免了对频率因子线性搜索,减小了运算复杂度。同时,该基于双线性变换的VTLN方法利用双线性变换,使弯折的频率连续且无带宽改变。VTLN can match the frequency of formants of each speaker by bending and translating the frequency coordinates. In this embodiment, a VTLN method based on bilinear transformation may be used. The bilinear transformation-based VTLN method does not directly fold the frequency spectrum of the recognized speech signal, but uses a mapping formula of the bilinear transformation low-pass filter cutoff frequency to calculate the average third formant frequency aligned with different speakers A bending factor; according to the frequency bending factor, the position (for example, the starting point, the middle point, and the ending point of the triangular filter) and the width of the triangular filter bank are adjusted by using a bilinear transformation; according to the adjusted triangular filter The group calculates the channel normalized MFCC characteristic parameters. For example, if spectrum compression is to be performed on the speech signal to be recognized, the scale of the triangular filter is stretched, and the triangular filter bank is expanded and moved to the left at this time. To spectrally stretch the recognition speech signal, the scale of the triangular filter is compressed, and the triangular filter bank is compressed and moved to the right. When the VTLN method based on the bilinear transformation is used to normalize the channel of a specific group of people or a specific person, only the triangular filter bank coefficients need to be transformed once, and the signal spectrum does not need to be extracted each time the feature parameters are extracted. Folding, which greatly reduces the amount of calculation. In addition, the VTLN method based on bilinear transformation avoids a linear search for frequency factors and reduces the computational complexity. At the same time, the VTLN method based on bilinear transformation uses bilinear transformation to make the bending frequency continuous without bandwidth change.
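The following is only a rough sketch, under stated assumptions, of warping the triangular-filterbank edge frequencies once per speaker with a first-order all-pass (bilinear) map; the exact warping formula used by the embodiment and the estimation of the bending factor α from average third formants are not reproduced here.

```python
import numpy as np

def bilinear_warp(freq_hz: np.ndarray, alpha: float, fs: int) -> np.ndarray:
    """Warp frequencies with the first-order all-pass (bilinear) map
    w_hat = w + 2*arctan(alpha*sin(w) / (1 - alpha*cos(w))), w in radians (0..pi).

    alpha is the speaker-dependent bending factor; how it is estimated from
    average third-formant alignment is assumed to be done elsewhere.
    """
    w = np.pi * freq_hz / (fs / 2.0)
    w_hat = w + 2.0 * np.arctan(alpha * np.sin(w) / (1.0 - alpha * np.cos(w)))
    return w_hat * (fs / 2.0) / np.pi

def warp_filterbank_edges(edges_hz: np.ndarray, alpha: float, fs: int) -> np.ndarray:
    """Shift/stretch the start, centre and end points of each triangular filter once per
    speaker (or speaker group), instead of re-warping every frame's spectrum."""
    return bilinear_warp(edges_hz, alpha, fs)
```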
在另一实施例中,所述口音识别方法还可以包括:根据所述口音识别结果进行声纹识别。由于说话人所生活的地域不同,即使在都讲普通话的情况下或多或少依然会有口音的差别,将口音识别应用到声纹识别中,可以缩小后续声纹识别的对象范围,得到更为准确的识别结果。In another embodiment, the accent recognition method may further include: performing voiceprint recognition according to the accent recognition result. Because the speakers live in different regions, even if they all speak Mandarin, there will be more or less accent differences. Applying accent recognition to voiceprint recognition can reduce the scope of subsequent voiceprint recognition objects and get more For accurate recognition results.
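As a trivial, hypothetical illustration of narrowing the voiceprint search space with the accent result, a candidate set keyed by speaker id could be filtered as follows; the record layout with an "accent" field is an assumption, not something defined by the embodiment.

```python
def narrow_candidates(enrolled: dict, accent: str) -> dict:
    """Keep only enrolled speakers whose registered accent matches the recognized accent.

    'enrolled' maps a speaker id to a record assumed to contain an 'accent' field;
    the surviving subset would then be passed to the subsequent voiceprint recognition.
    """
    return {spk: rec for spk, rec in enrolled.items() if rec.get("accent") == accent}
```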
实施例二Example two
图2为本申请实施例二提供的口音识别装置的结构图。如图2所示,所述口音识别装置10可以包括:预处理单元201、检测单元202、第一提取单元203、第二提取单元204、识别单元205。FIG. 2 is a structural diagram of an accent recognition device provided in Embodiment 2 of the present application. As shown in FIG. 2, the accent recognition device 10 may include a preprocessing unit 201, a detection unit 202, a first extraction unit 203, a second extraction unit 204, and a recognition unit 205.
预处理单元201,用于对待识别语音信号进行预处理。The pre-processing unit 201 is configured to pre-process a speech signal to be recognized.
所述待识别语音信号可以是模拟语音信号,也可以是数字语音信号。若所述待识别语音信号是模拟语音信号,则将所述模拟语音信号进行模数变换,转换为数字语音信号。The voice signal to be identified may be an analog voice signal or a digital voice signal. If the speech signal to be identified is an analog speech signal, the analog speech signal is subjected to analog-to-digital conversion to be converted into a digital speech signal.
所述待识别语音信号可以是通过语音输入设备(例如麦克风、手机话筒等)采集到的语音信号。The voice signal to be recognized may be a voice signal collected through a voice input device (for example, a microphone, a mobile phone microphone, etc.).
对所述待识别语音信号进行预处理可以包括对所述待识别语音信号进行预加重。Preprocessing the speech signal to be identified may include pre-emphasizing the speech signal to be identified.
预加重的目的是提升语音的高频分量,使信号的频谱变得平坦。语音信号由于受声门激励和口鼻辐射的影响,能量在高频端明显减小,通常是频率越高幅值越小。当频率提升两倍时,功率谱幅度按6dB/oct跌落。因此,在对待识别语音信号进行频谱分析或声道参数分析前,需要对待识别语音信号的高频部分进行频率提升,即对待识别语音信号进行预加重。预加重一般利用高通滤波器实现,高通滤波器的传递函数可以为:The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Due to the influence of the glottal excitation and mouth-nose radiation, the energy of the speech signal is significantly reduced at the high-frequency end, usually the higher the frequency, the smaller the amplitude. When the frequency is doubled, the power spectrum amplitude drops by 6dB / oct. Therefore, before performing spectrum analysis or channel parameter analysis of the speech signal to be identified, it is necessary to perform frequency boosting on the high frequency portion of the speech signal to be identified, that is, pre-emphasis of the speech signal to be identified. Pre-emphasis is generally implemented using a high-pass filter. The transfer function of the high-pass filter can be:
H(z) = 1 - κz⁻¹，0.9 ≤ κ ≤ 1.0，
其中,κ为预加重系数,优选取值在0.94-0.97之间。Among them, κ is a pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
对所述待识别语音信号进行预处理还可以包括对所述待识别语音信号进行加窗分帧。Preprocessing the speech signal to be identified may further include windowing and framing the speech signal to be identified.
语音信号是一种非平稳的时变信号，主要分为浊音和清音两大类。浊音的基音周期、浊音信号幅度和声道参数等都随时间而缓慢变化，但通常在10ms-30ms的时间内可以认为具有短时平稳性。为了获得短时平稳信号，语音信号处理中可以把语音信号分成一些短段来进行处理，这个过程称为分帧，得到的短段的语音信号称为语音帧。分帧是通过对语音信号进行加窗处理来实现的。为了避免相邻两帧的变化幅度过大，帧与帧之间需要重叠一部分。在本申请的一个实施例中，每个语音帧为20毫秒，相邻两个语音帧之间存在10毫秒重叠，也就是每隔10毫秒取一个语音帧。A speech signal is a non-stationary, time-varying signal, mainly divided into voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal and the vocal tract parameters all change slowly with time, but the signal can usually be regarded as short-term stationary within 10 ms-30 ms. To obtain short-term stationary signals, speech signal processing may divide the speech signal into short segments for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is achieved by windowing the speech signal. To avoid excessive change between two adjacent frames, adjacent frames need to overlap partially. In one embodiment of the present application, each speech frame is 20 milliseconds, and two adjacent speech frames overlap by 10 milliseconds, that is, one speech frame is taken every 10 milliseconds.
常用的窗函数有矩形窗、汉明窗和汉宁窗,矩形窗函数为:The commonly used window functions are rectangular window, Hamming window and Hanning window. The rectangular window function is:
w(n) = 1，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
汉明窗函数为:The Hamming window function is:
w(n) = 0.54 - 0.46cos(2πn/(N-1))，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
汉宁窗函数为:The Hanning window function is:
w(n) = 0.5[1 - cos(2πn/(N-1))]，0 ≤ n ≤ N-1；w(n) = 0，其他 (otherwise)
其中,N为一个语音帧所包含的采样点的个数。Among them, N is the number of sampling points included in a speech frame.
检测单元202,用于检测预处理后的所述待识别语音信号中的有效语音。The detecting unit 202 is configured to detect a valid voice in the pre-recognized voice signal.
可以根据预处理后的所述待识别语音信号的短时能量和短时过零率等进行端点检测,以确定所述待识别语音信号中的有效语音。Endpoint detection may be performed according to the pre-processed short-term energy and short-time zero-crossing rate of the speech signal to be identified to determine valid speech in the speech signal to be identified.
在本实施例中,可以通过下述方法检测预处理后的所述待识别语音信号中的有效语音:In this embodiment, the valid voice in the pre-recognized voice signal can be detected by the following methods:
(1)对预处理后的所述待识别语音信号进行加窗分帧,得到所述待识别语音信号的语音帧x(n)。在一个具体实施例中,可以对预处理后的所述待识别语音信号加汉明窗,每帧20ms,帧移10ms。若预处理过程中已对待识别语音信号加窗分帧,则该步骤省略。(1) Windowing and framing the pre-processed speech signal to be identified to obtain a speech frame x (n) of the speech signal to be identified. In a specific embodiment, a Hamming window may be added to the pre-processed speech signal to be identified, each frame being 20 ms, and the frame shifting being 10 ms. If window frames have been framed in the pre-processed speech signal, this step is omitted.
(2)对所述语音帧x(n)进行离散傅里叶变换(Discrete Fourier Transform,DFT),得到所述语音帧x(n)的频谱:(2) Discrete Fourier Transform (DFT) is performed on the speech frame x (n) to obtain the frequency spectrum of the speech frame x (n):
X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N)，k = 0, 1, …, N-1
(3)根据所述语音帧x(n)的频谱计算各个频带的累计能量:(3) Calculate the cumulative energy of each frequency band according to the frequency spectrum of the speech frame x (n):
E(m) = Σ_{k=m1}^{m2} |X(k)|²
其中E(m)表示第m个频带的累计能量，m1和m2分别表示第m个频带的起始和截止频点。Where E(m) represents the cumulative energy of the m-th frequency band, and m1 and m2 represent the start and end frequency points of the m-th frequency band, respectively.
(4)对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值。(4) Perform a logarithmic operation on the accumulated energy of each frequency band to obtain a logarithm value of the accumulated energy of each frequency band.
(5)将所述各个频带的累计能量对数值与预设阈值比较,得到所述有效语音。若一个频带的累计能量对数值高于预设阈值,则所述频带对应的语音为有效语音。(5) Compare the cumulative energy logarithm of each frequency band with a preset threshold to obtain the effective speech. If the cumulative energy log value of a frequency band is higher than a preset threshold, the speech corresponding to the frequency band is a valid speech.
第一提取单元203，用于对所述有效语音提取梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征参数。The first extraction unit 203 is configured to extract Mel Frequency Cepstrum Coefficient (MFCC) feature parameters from the valid speech.
提取MFCC特征参数的流程如下:The process of extracting MFCC characteristic parameters is as follows:
(1)对每一个语音帧进行离散傅里叶变换(可以是快速傅里叶变换),得到该语音帧的频谱。(1) Perform a discrete Fourier transform (which can be a fast Fourier transform) on each speech frame to obtain the frequency spectrum of the speech frame.
(2)求该语音帧的频谱幅度的平方,得到该语音帧的离散能量谱。(2) The square of the spectral amplitude of the speech frame is obtained to obtain the discrete energy spectrum of the speech frame.
(3)将该语音帧的离散能量谱通过一组Mel频率上均匀分布的三角滤波器(即三角滤波器组),得到各个三角滤波器的输出。该组三角滤波器的中心频率在Mel频率刻度上均匀排列,且每个三角滤波器的三角形两个底点的频率分别等于相邻的两个三角滤波器的中心频率。三角滤波器的中心频率为:(3) The discrete energy spectrum of the speech frame is passed through a set of triangular filters (ie, triangular filter groups) uniformly distributed on the Mel frequency to obtain the output of each triangular filter. The center frequencies of the set of triangular filters are evenly arranged on the Mel frequency scale, and the frequencies of the two bottom points of the triangles of each triangular filter are respectively equal to the center frequencies of two adjacent triangular filters. The center frequency of the triangular filter is:
f(m) = (N/F_s)·B⁻¹( B(f_l) + m·[B(f_h) - B(f_l)]/(M+1) )，m = 1, 2, …, M，其中B(f) = 1125·ln(1 + f/700) (where B(f) = 1125·ln(1 + f/700))
三角滤波器的频率响应为:The frequency response of the triangular filter is:
H_m(k) = 0，k &lt; f(m-1)或k &gt; f(m+1)；H_m(k) = (k - f(m-1))/(f(m) - f(m-1))，f(m-1) ≤ k ≤ f(m)；H_m(k) = (f(m+1) - k)/(f(m+1) - f(m))，f(m) &lt; k ≤ f(m+1)
其中，f_h、f_l分别为三角滤波器组的最高频率和最低频率；N为傅里叶变换的点数；F_s为采样频率；M为三角滤波器的个数；B⁻¹(b) = 700(e^(b/1125) - 1)是Mel频率函数B(f)的逆函数。Here, f_h and f_l are the highest and lowest frequencies covered by the triangular filter bank; N is the number of Fourier transform points; F_s is the sampling frequency; M is the number of triangular filters; and B⁻¹(b) = 700(e^(b/1125) - 1) is the inverse of the Mel frequency function B(f).
(4)对所有三角滤波器的输出做对数运算,得到该语音帧的对数功率谱S(m)。(4) Logarithmic operations are performed on the outputs of all triangular filters to obtain the logarithmic power spectrum S (m) of the speech frame.
(5)对S(m)做离散余弦变换(Discrete Cosine Transform,DCT),得到该语音帧的初始MFCC特征参数。离散余弦变换为:(5) Discrete Cosine Transform (DCT) is performed on S (m) to obtain the initial MFCC characteristic parameters of the speech frame. The discrete cosine transform is:
C(n) = Σ_{m=0}^{M-1} S(m)·cos( πn(m + 0.5)/M )，n = 1, 2, …, L
(6)提取语音帧的动态差分MFCC特征参数。初始MFCC特征参数只反映了语音参数的静态特性，语音的动态特性可通过静态特征的差分谱来描述，动静态结合可以有效提升系统的识别性能，通常使用一阶和/或者二阶差分MFCC特征参数。(6) Extract the dynamic differential MFCC feature parameters of the speech frame. The initial MFCC feature parameters only reflect the static characteristics of the speech; the dynamic characteristics of speech can be described by the differential spectrum of the static features. Combining dynamic and static features can effectively improve the recognition performance of the system, and first-order and/or second-order differential MFCC feature parameters are usually used.
在一具体实施例中,提取的MFCC特征参数为39维的特征矢量,包括13维初始MFCC特征参数、13维一阶差分MFCC特征参数和13维二阶差分MFCC特征参数。In a specific embodiment, the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
MFCC中引入了三角滤波器组,且三角滤波器在低频段分布较密,在高频段分布较疏,符合人耳听觉特性,在噪声环境下仍具有较好的识别性能。The triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has good recognition performance in a noisy environment.
在本申请的一个实施中，在对预处理后的待识别语音信号提取MFCC特征参数之后，还可以对提取的MFCC特征参数进行降维处理，得到降维后的MFCC特征参数。例如，采用分段均值数据降维算法对MFCC特征参数进行降维处理，得到降维后的MFCC特征参数。降维后的MFCC特征参数将用于后续的步骤。In one implementation of the present application, after the MFCC feature parameters are extracted from the preprocessed speech signal to be recognized, the extracted MFCC feature parameters may further undergo dimensionality reduction to obtain dimensionality-reduced MFCC feature parameters. For example, a piecewise-mean data dimensionality reduction algorithm is applied to the MFCC feature parameters to obtain the dimensionality-reduced MFCC feature parameters. The dimensionality-reduced MFCC feature parameters are used in the subsequent steps.
第二提取单元204,用于根据所述MFCC特征参数,利用预先训练好的高斯混合模型(Gaussian Mixture Model,GMM)-通用背景模型(Universal Background Model,UBM)提取所述有效语音的身份矢量(identity-vector,iVector)。A second extraction unit 204 is configured to extract an identity vector of the effective voice using a pre-trained Gaussian Mixture Model (GMM) -Universal Background Model (UBM) according to the MFCC feature parameters ( identity-vector, iVector).
提取iVector之前,首先要用大量属于不同口音的训练数据训练出通用背景模型。通用背景模型实际上是一种高斯混合模型(GMM),旨在解决实际场景数据量稀缺的问题。GMM是一种参数化的生成性模型,具备对实际数据极强的表征力(基于高斯分量实现)。高斯分量越多,GMM表征力越强,规模也越庞大,此时负面效应逐步凸显——若想获得一个泛化能力较强的GMM模型,则需要足够的数据来驱动GMM的参数训练,然而实际场景中获取的语音数据甚至连分钟级都很难企及。UBM正是解决了训练数据不足的问题。UBM是利用大量属于不同口音的训练数据(无关乎说话人、地域)混合起来充分训练,得到一个可以对语音共通特性进行表征的全局GMM, 可大大缩减从头计算GMM参数所消耗的资源。通用背景模型训练完成后,只需利用单独属于每个口音的训练数据,分别对UBM的参数进行微调(例如通过UBM自适应),得到属于各个口音的GMM。Before extracting iVector, we must first train a universal background model with a large amount of training data belonging to different accents. The general background model is actually a Gaussian mixture model (GMM), which aims to solve the problem of scarce data volume in actual scenes. GMM is a parametric generative model, which has a strong representation of actual data (based on Gaussian components). The more Gaussian components, the stronger the GMM characterization, and the larger the scale. At this time, the negative effects are gradually prominent. If you want to obtain a GMM model with strong generalization ability, you need sufficient data to drive GMM parameter training. The voice data obtained in the actual scene is difficult to reach even the minute level. UBM solves the problem of insufficient training data. UBM uses a large amount of training data belonging to different accents (irrespective of speakers and regions) to be fully trained to obtain a global GMM that can characterize common characteristics of speech, which can greatly reduce the resources consumed to calculate GMM parameters from scratch. After the training of the universal background model is completed, only the training data belonging to each accent needs to be used to fine-tune the parameters of the UBM (for example, through UBM adaptation) to obtain the GMM belonging to each accent.
在一个实施例中,不同口音可以是属于不同地域的口音。所述地域可以是按照行政区域来划分,例如辽宁、北京、天津、上海、河南、广东等。所述地域也可以是按照普遍的经验以口音对地区来划分,例如闽南、客家等。In one embodiment, different accents may be accents belonging to different regions. The regions may be divided according to administrative regions, such as Liaoning, Beijing, Tianjin, Shanghai, Henan, Guangdong, and so on. The region may also be divided into regions based on accent according to common experience, such as southern Fujian and Hakka.
提取iVector是基于全差异空间建模(TV)方法将UBM训练得出的高维GMM映射至低维度的全变量子空间,可突破随语音信号时长过长而提取的向量维度过大不便计算的限制,并能提升计算速度,表达出更全面的特征。GMM-UBM中的GMM超矢量可以包括跟说话人本身有关的矢量特征和跟信道以及其他变化有关的矢量特征的线性叠加。The extraction of iVector is based on the full difference space modeling (TV) method, which maps the high-dimensional GMM trained by UBM to the low-dimensional full-variable subspace, which can break through the inconvenient calculation of the extracted vector dimension as the length of the speech signal is too long Limits, and can increase the speed of calculation and express more comprehensive characteristics. The GMM supervector in GMM-UBM may include a linear superposition of vector features related to the speaker itself and vector features related to channels and other changes.
TV模型的子空间建模形式为:The subspace modeling form of the TV model is:
M=m+TwM = m + Tw
其中，M表示语音的GMM超矢量，即所述MFCC特征参数，m表示口音无关的GMM超矢量，T表示描述差异的空间的载荷矩阵，w表示GMM超矢量M在载荷矩阵空间下对应的低维因子表示，即iVector。Here, M represents the GMM supervector of the speech, that is, the MFCC feature parameters; m represents the accent-independent GMM supervector; T represents the loading matrix of the space describing the differences; and w represents the low-dimensional factor representation of the GMM supervector M in the loading-matrix space, that is, the iVector.
在本实施例中，可以对提取的iVector进行噪声补偿。在一实施例中，可以采用线性判别分析(Linear Discriminant Analysis,LDA)和类内协方差规整(Within Class Covariance Normalization,WCCN)对提取的iVector进行噪声补偿。In this embodiment, noise compensation may be performed on the extracted iVector. In one embodiment, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) may be used to perform noise compensation on the extracted iVector.
识别单元205,用于根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。The recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
给定口音可以是一个也可以是多个。例如,若给定口音为一个,可以根据所述iVector计算所述待识别语音信号对该给定口音的判决得分,根据所述待识别语音信号对该给定口音的判决得分判断所述待识别语音信号是否为该给定口音。可以判断所述判决得分是否大于预设得分(例如9分),若所述判决得分大于预设得分,则判断所述待识别语音信号为该给定口音。A given accent can be one or more. For example, if the given accent is one, the judgment score of the to-be-recognized voice signal for the given accent may be calculated according to the iVector, and the to-be-recognized may be judged according to the judgment score of the to-be-recognized voice signal for the given accent. Whether the speech signal is the given accent. It can be judged whether the judgment score is greater than a preset score (for example, 9 points), and if the judgment score is greater than a preset score, it is judged that the speech signal to be recognized is the given accent.
若给定口音为多个，可以根据所述iVector计算所述待识别语音信号对每个给定口音的判决得分，根据所述待识别语音信号对每个给定口音的判决得分判断所述语音为多个给定口音中的哪一个。可以确定对多个给定口音的判决得分中的最高得分，将所述最高得分对应的给定口音作为所述待识别语音信号所属的口音。If there are multiple given accents, the decision score of the speech signal to be recognized for each given accent may be calculated according to the iVector, and which of the multiple given accents the speech belongs to may be judged according to these decision scores. The highest score among the decision scores for the multiple given accents may be determined, and the given accent corresponding to the highest score is taken as the accent to which the speech signal to be recognized belongs.
在本实施例中,可以利用逻辑回归(Logistic Regression)模型计算所述待识别语音信号对给定口音的判决得分。逻辑回归模型作为一个分类器,可根据待识别语音信号的iVector对待识别语音信号进行打分。特别地,在一具体实施例中,可以使用多类逻辑回归模型计算所述待识别语音信号对给定口音的判决得分。In this embodiment, a Logistic Regression model can be used to calculate the decision score of the speech signal to be recognized for a given accent. As a classifier, the logistic regression model can score the speech signal to be identified according to the iVector of the speech signal to be identified. Specifically, in a specific embodiment, a multi-class logistic regression model may be used to calculate the decision score of the speech signal to be recognized for a given accent.
假设给定口音包括口音1、口音2、…、口音N共N种口音，则利用N类逻辑回归模型计算所述待识别语音信号对给定口音的判决得分。将待识别语音信号的iVector（记为x_t）输入所述N类逻辑回归模型，得到N个判决得分s_it（即所述待识别语音信号对N种给定口音的判决得分），s_it = w_i·x_t + k_i，i = 1, …, N。求取N个判决得分s_it（i = 1, …, N）中的最高得分s_jt，最高得分s_jt对应的口音j即为所述待识别语音信号所属的口音。其中，w_i、k_i是N类逻辑回归模型的参数，w_i为回归系数，k_i为常数，针对每个给定口音均有对应的w_i和k_i，各组(w_i, k_i)组成N类逻辑回归模型的参数向量M = {(w_1, k_1), (w_2, k_2), …, (w_N, k_N)}。Assuming the given accents include N accents (accent 1, accent 2, ..., accent N), an N-class logistic regression model is used to calculate the decision scores of the speech signal to be recognized for the given accents. The iVector of the speech signal to be recognized (denoted x_t) is input into the N-class logistic regression model to obtain N decision scores s_it (that is, the decision scores of the speech signal to be recognized for the N given accents), where s_it = w_i·x_t + k_i, i = 1, ..., N. The highest score s_jt among the N decision scores s_it (i = 1, ..., N) is determined, and the accent j corresponding to the highest score s_jt is the accent to which the speech signal to be recognized belongs. Here, w_i and k_i are parameters of the N-class logistic regression model, w_i being a regression coefficient and k_i a constant; each given accent has its corresponding w_i and k_i, and these pairs form the parameter vector of the N-class logistic regression model, M = {(w_1, k_1), (w_2, k_2), ..., (w_N, k_N)}.
实施例二的口音识别装置10对待识别语音信号进行预处理；检测预处理后的所述待识别语音信号中的有效语音；对所述有效语音提取MFCC特征参数；根据所述MFCC特征参数，利用预先训练好的GMM-UBM提取所述有效语音的iVector；根据所述iVector计算所述待识别语音信号对给定口音的判决得分，根据所述判决得分得到所述待识别语音信号的口音识别结果。实施例二可以实现快速准确的口音识别。The accent recognition device 10 of Embodiment 2 preprocesses the speech signal to be recognized; detects valid speech in the preprocessed speech signal to be recognized; extracts MFCC feature parameters from the valid speech; extracts the iVector of the valid speech using the pre-trained GMM-UBM according to the MFCC feature parameters; calculates the decision score of the speech signal to be recognized for a given accent according to the iVector, and obtains the accent recognition result of the speech signal to be recognized according to the decision score. Embodiment 2 can realize fast and accurate accent recognition.
在其他的实施例中,第一提取单元203在提取MFCC特征参数时,可以进行声道长度归一化(Vocal Tract Length Normalization,VTLN),得到声道长度归一化的MFCC特征参数。In other embodiments, when extracting the MFCC characteristic parameters, the first extraction unit 203 may perform channel length normalization (Vocal Tract Length Normalization, VTLN) to obtain the MFCC characteristic parameters with normalized channel length.
声道可以表示为级联声管模型,每个声管都可以看成是一个谐振腔,它们的共振频率取决于声管的长度和形状。因此,说话人之间的部分声学差异是由于说话人的声道长度不同。例如,声道长度的变化范围一般从13cm(成年女性)变化到18cm(成年男性),因此,不同性别的人说同一个元音的共振峰频率相差很大。VTLN就是为了消除男、女声道长度的差异,使口音识别的结果不受性别的干扰。The channels can be represented as a cascaded sound tube model. Each sound tube can be regarded as a resonant cavity, and their resonance frequency depends on the length and shape of the sound tube. Therefore, some of the acoustic differences between speakers are due to the difference in speaker channel length. For example, the range of the channel length generally varies from 13 cm (adult female) to 18 cm (adult male). Therefore, people of different genders say that the formant frequencies of the same vowel differ greatly. VTLN is to eliminate the difference in the length of the male and female channels, so that the result of accent recognition is not disturbed by gender.
VTLN可以通过弯折和平移频率坐标来使各说话人的共振峰频率相匹配。在本实施例中,可以采用基于双线性变换的VTLN方法。该基于双线性变换的VTLN方法并不直接对待识别语音信号的频谱进行折叠,而是采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;根据所述频率弯折因子,采用双线性变换对三角滤波器组的位置(例如三角滤波器的起点、中间点和结束点)和宽度进行调整;根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。例如,若要对待识别语音信号进行频谱压缩,则对三角滤波器的刻度进行拉伸,此时三角滤波器组向左扩展和移动。若要对待识别语音信号进行频谱拉伸,则对三角滤波器的刻度进行压缩,此时三角滤波器组向右压缩和移动。采用该基于双线性变换的VTLN方法对特定人群或特定人进行声道归一化时,仅需要对三角滤波器组系数进行一次变换即可,无需每次在提取特征参数时都对信号频谱折叠,从而大大减小了计算量。并且,该基于双线性变换的VTLN方法避免了对频率因子线性搜索,减小了运算复杂度。同时,该基于双线性变换的VTLN方法利用双线性变换,使弯折的频率连续且无带宽改变。VTLN can match the frequency of formants of each speaker by bending and translating the frequency coordinates. In this embodiment, a VTLN method based on bilinear transformation may be used. The bilinear transformation-based VTLN method does not directly fold the frequency spectrum of the recognized speech signal, but uses a mapping formula of the bilinear transformation low-pass filter cutoff frequency to calculate the average third formant frequency aligned with different speakers A bending factor; according to the frequency bending factor, the position (for example, the starting point, the middle point, and the ending point of the triangular filter) and the width of the triangular filter bank are adjusted by using a bilinear transformation; according to the adjusted triangular filter The group calculates the channel normalized MFCC characteristic parameters. For example, if spectrum compression is to be performed on the speech signal to be recognized, the scale of the triangular filter is stretched, and the triangular filter bank is expanded and moved to the left at this time. To spectrally stretch the recognition speech signal, the scale of the triangular filter is compressed, and the triangular filter bank is compressed and moved to the right. When the VTLN method based on the bilinear transformation is used to normalize the channel of a specific group of people or a specific person, only the triangular filter bank coefficients need to be transformed once, and the signal spectrum does not need to be extracted each time the feature parameters are extracted. Folding, which greatly reduces the amount of calculation. In addition, the VTLN method based on bilinear transformation avoids a linear search for frequency factors and reduces the computational complexity. At the same time, the VTLN method based on bilinear transformation uses bilinear transformation to make the bending frequency continuous without bandwidth change.
在另一实施例中,所述识别单元205还可以用于根据所述口音识别结果进行声纹识别。由于说话人所生活的地域不同,即使在都讲普通话的情况下或多或少依然会有口音的差别,将口音识别应用到声纹识别中,可以缩小后 续声纹识别的对象范围,得到更为准确的识别结果。In another embodiment, the recognition unit 205 may be further configured to perform voiceprint recognition according to the accent recognition result. Because the speakers live in different regions, even if they all speak Mandarin, there will be more or less accent differences. Applying accent recognition to voiceprint recognition can reduce the scope of subsequent voiceprint recognition objects and get more For accurate recognition results.
实施例三Example three
本实施例提供一种非易失性可读存储介质，该非易失性可读存储介质上存储有计算机可读指令，该计算机可读指令被处理器执行时实现上述口音识别方法实施例中的步骤，例如图1所示的步骤101-105：This embodiment provides a non-volatile readable storage medium on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, the steps in the above accent recognition method embodiment are implemented, for example, steps 101-105 shown in FIG. 1:
步骤101,对待识别语音信号进行预处理;Step 101: Pre-process a speech signal to be recognized;
步骤102,检测预处理后的所述待识别语音信号中的有效语音;Step 102: Detect a valid voice in the pre-recognized voice signal;
步骤103，对所述有效语音提取梅尔频率倒谱系数MFCC特征参数；Step 103: extracting Mel frequency cepstrum coefficient MFCC feature parameters from the valid speech;
步骤104,根据所述MFCC特征参数,利用预先训练好的高斯混合模型-通用背景模型GMM-UBM提取所述有效语音的身份矢量iVector;Step 104: Use a pre-trained Gaussian mixture model-general background model GMM-UBM to extract the identity vector iVector of the effective voice according to the MFCC feature parameters;
步骤105,根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。Step 105: Calculate a judgment score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be identified according to the judgment score.
所述检测预处理后的所述待识别语音信号中的有效语音可以包括:The detecting valid speech in the to-be-recognized voice signal after preprocessing may include:
对所述待识别语音信号进行加窗分帧,得到所述待识别语音信号的语音帧;Windowing and framing the speech signal to be identified to obtain a speech frame of the speech signal to be identified;
对所述语音帧进行离散傅里叶变换,得到所述语音帧的频谱;Performing discrete Fourier transform on the speech frame to obtain a frequency spectrum of the speech frame;
根据所述语音帧的频谱计算各个频带的累计能量;Calculate the cumulative energy of each frequency band according to the frequency spectrum of the speech frame;
对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值;Performing a logarithmic operation on the accumulated energy of each frequency band to obtain a logarithmic value of the accumulated energy of each frequency band;
将所述各个频带的累计能量对数值与预设阈值进行比较,得到所述有效语音。The cumulative energy log value of each frequency band is compared with a preset threshold to obtain the effective speech.
所述对所述有效语音提取梅尔频率倒谱系数MFCC特征参数可以包括:The feature parameters for extracting Mel frequency cepstrum coefficient MFCC for the effective speech may include:
采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;Using the bilinear transformation low-pass filter cutoff frequency mapping formula to calculate the frequency bending factor of the average third formant of different speakers;
根据所述频率弯折因子,采用双线性变换对MFCC特征参数提取所使用的三角滤波器组的位置和宽度进行调整;Adjusting the position and width of the triangular filter bank used for the extraction of MFCC feature parameters according to the frequency bending factor;
根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。The channel normalized MFCC characteristic parameters are calculated according to the adjusted triangular filter bank.
或者,该计算机可读指令被处理器执行时实现上述装置实施例中各模块/单元的功能,例如图2中的单元201-205:Alternatively, when the computer-readable instructions are executed by a processor, the functions of the modules / units in the foregoing device embodiments are implemented, for example, units 201-205 in FIG. 2:
预处理单元201,用于对待识别语音信号进行预处理;A pre-processing unit 201, configured to pre-process a speech signal to be recognized;
检测单元202,用于检测预处理后的所述待识别语音信号中的有效语音;A detection unit 202, configured to detect valid speech in the pre-recognized speech signal;
第一提取单元203，用于对所述有效语音提取梅尔频率倒谱系数MFCC特征参数；A first extraction unit 203, configured to extract Mel frequency cepstrum coefficient MFCC feature parameters from the valid speech;
第二提取单元204,用于根据所述MFCC特征参数,利用预先训练好的高斯混合模型-通用背景模型GMM-UBM提取所述有效语音的身份矢量iVector;A second extraction unit 204, configured to extract an identity vector iVector of the effective voice by using a pre-trained Gaussian mixture model-general background model GMM-UBM according to the MFCC feature parameters;
识别单元205,用于根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。The recognition unit 205 is configured to calculate a decision score of the voice signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the voice signal to be recognized according to the judgment score.
所述检测单元202具体可以用于:The detection unit 202 may be specifically configured to:
对所述待识别语音信号进行加窗分帧,得到所述待识别语音信号的语音帧;Windowing and framing the speech signal to be identified to obtain a speech frame of the speech signal to be identified;
对所述语音帧进行离散傅里叶变换,得到所述语音帧的频谱;Performing discrete Fourier transform on the speech frame to obtain a frequency spectrum of the speech frame;
根据所述语音帧的频谱计算各个频带的累计能量;Calculate the cumulative energy of each frequency band according to the frequency spectrum of the speech frame;
对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值;Performing a logarithmic operation on the accumulated energy of each frequency band to obtain a logarithmic value of the accumulated energy of each frequency band;
将所述各个频带的累计能量对数值与预设阈值进行比较,得到所述有效语音。The cumulative energy log value of each frequency band is compared with a preset threshold to obtain the effective speech.
所述第一提取单元203具体可以用于:The first extraction unit 203 may be specifically configured to:
采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;Using the bilinear transformation low-pass filter cutoff frequency mapping formula to calculate the frequency bending factor of the average third formant of different speakers;
根据所述频率弯折因子,采用双线性变换对MFCC特征参数提取所使用的三角滤波器组的位置和宽度进行调整;Adjusting the position and width of the triangular filter bank used for the extraction of MFCC feature parameters according to the frequency bending factor;
根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。The channel normalized MFCC characteristic parameters are calculated according to the adjusted triangular filter bank.
实施例四Embodiment 4
图3为本申请实施例四提供的计算机装置的示意图。所述计算机装置1包括存储器20、处理器30以及存储在所述存储器20中并可在所述处理器30上运行的计算机可读指令40,例如口音识别程序。所述处理器30执行所述计算机可读指令40时实现上述口音识别方法实施例中的步骤,例如图1所示的步骤101-105:FIG. 3 is a schematic diagram of a computer device according to a fourth embodiment of the present application. The computer device 1 includes a memory 20, a processor 30, and computer-readable instructions 40 stored in the memory 20 and executable on the processor 30, such as an accent recognition program. When the processor 30 executes the computer-readable instructions 40, the steps in the embodiment of the accent recognition method described above are implemented, for example, steps 101-105 shown in FIG. 1:
步骤101,对待识别语音信号进行预处理;Step 101: Pre-process a speech signal to be recognized;
步骤102,检测预处理后的所述待识别语音信号中的有效语音;Step 102: Detect a valid voice in the pre-recognized voice signal;
步骤103，对所述有效语音提取梅尔频率倒谱系数MFCC特征参数；Step 103: extracting Mel frequency cepstrum coefficient MFCC feature parameters from the valid speech;
步骤104,根据所述MFCC特征参数,利用预先训练好的高斯混合模型-通用背景模型GMM-UBM提取所述有效语音的身份矢量iVector;Step 104: Use a pre-trained Gaussian mixture model-general background model GMM-UBM to extract the identity vector iVector of the effective voice according to the MFCC feature parameters;
步骤105,根据所述iVector计算所述待识别语音信号对给定口音的判决得分,根据所述判决得分得到所述待识别语音信号的口音识别结果。Step 105: Calculate a judgment score of the speech signal to be recognized for a given accent according to the iVector, and obtain an accent recognition result of the speech signal to be identified according to the judgment score.
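For illustration, a minimal sketch of steps 104-105 follows: Baum-Welch statistics are collected against a pre-trained GMM-UBM, an iVector is computed with an assumed total-variability matrix T, and a logistic regression classifier produces the decision score for a given accent, consistent with the embodiment in which the iVector is input into a logistic regression model. The helper names, the use of scikit-learn, and the diagonal-covariance UBM are assumptions of this sketch, not requirements of the specification.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def extract_ivector(mfcc, ubm: GaussianMixture, T: np.ndarray) -> np.ndarray:
    """mfcc: (n_frames, dim); ubm: pre-trained diagonal-covariance GaussianMixture;
    T: assumed total-variability matrix of shape (n_components * dim, ivec_dim)."""
    gamma = ubm.predict_proba(mfcc)                   # frame posteriors, (n_frames, C)
    N = gamma.sum(axis=0)                             # zero-order statistics, (C,)
    F = gamma.T @ mfcc - N[:, None] * ubm.means_      # centred first-order statistics
    C, dim = ubm.means_.shape
    ivec_dim = T.shape[1]

    inv_cov = 1.0 / ubm.covariances_                  # (C, dim); 'diag' covariances assumed
    L = np.eye(ivec_dim)
    b = np.zeros(ivec_dim)
    for c in range(C):
        Tc = T[c * dim:(c + 1) * dim]                 # (dim, ivec_dim) block for component c
        L += N[c] * (Tc.T * inv_cov[c]) @ Tc
        b += (Tc.T * inv_cov[c]) @ F[c]
    return np.linalg.solve(L, b)                      # posterior mean, i.e. the iVector

def accent_score(ivector, clf: LogisticRegression, accent_label):
    # Decision score of the utterance for a given accent, from a pre-trained
    # logistic regression classifier (clf must have been fitted beforehand)
    idx = list(clf.classes_).index(accent_label)
    return clf.predict_proba(ivector.reshape(1, -1))[0, idx]

Here clf is assumed to have been fitted beforehand on iVectors extracted from accent-labelled training speech; the highest-scoring accent, or a thresholded score, then gives the accent recognition result of step 105.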
The detecting valid speech in the preprocessed speech signal to be recognized may include:
performing windowing and framing on the speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
calculating the cumulative energy of each frequency band according to the spectrum of the speech frames;
taking the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
comparing the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
The extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech may include:
calculating, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
calculating vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
Alternatively, when the processor 30 executes the computer-readable instructions 40, the functions of the modules/units in the foregoing device embodiment are implemented, for example, the units 201-205 in FIG. 2:
A preprocessing unit 201, configured to preprocess a speech signal to be recognized;
A detection unit 202, configured to detect valid speech in the preprocessed speech signal to be recognized;
A first extraction unit 203, configured to extract Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
A second extraction unit 204, configured to extract an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
A recognition unit 205, configured to calculate a decision score of the speech signal to be recognized for a given accent according to the iVector, and to obtain an accent recognition result of the speech signal to be recognized according to the decision score.
The detection unit 202 may specifically be configured to:
perform windowing and framing on the speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
perform a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
calculate the cumulative energy of each frequency band according to the spectrum of the speech frames;
take the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
compare the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
The first extraction unit 203 may specifically be configured to:
calculate, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
adjust, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
calculate vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
Exemplarily, the computer-readable instructions 40 may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 30 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments describe the execution process of the computer-readable instructions 40 in the computer device 1. For example, the computer-readable instructions 40 may be divided into the preprocessing unit 201, the detection unit 202, the first extraction unit 203, the second extraction unit 204, and the recognition unit 205 in FIG. 2; for the specific functions of each unit, refer to Embodiment 2.
The computer device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art can understand that FIG. 3 is merely an example of the computer device 1 and does not constitute a limitation on the computer device 1; the computer device 1 may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device 1 may further include input/output devices, a network access device, a bus, and the like.
The processor 30 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor. The processor 30 is the control center of the computer device 1 and connects the various parts of the entire computer device 1 through various interfaces and lines.
The memory 20 may be configured to store the computer-readable instructions 40 and/or the modules/units. The processor 30 implements the various functions of the computer device 1 by running or executing the computer-readable instructions and/or modules/units stored in the memory 20 and by calling data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the computer device 1 (such as audio data or a phone book). In addition, the memory 20 may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another solid-state storage device.
If the modules/units integrated in the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile readable storage medium. Based on such an understanding, all or part of the processes in the methods of the foregoing embodiments of the present application may also be completed by instructing related hardware through computer-readable instructions. The computer-readable instructions may be stored in a non-volatile readable storage medium, and when executed by a processor, they can implement the steps of the foregoing method embodiments. The computer-readable instructions include computer-readable instruction code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The non-volatile readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the non-volatile readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided in the present application, it should be understood that the disclosed computer device and method may be implemented in other ways. For example, the computer device embodiment described above is merely illustrative; for example, the division of the units is only a division of logical functions, and other division manners may be used in actual implementation.
In addition, the functional units in the embodiments of the present application may be integrated in the same processing unit, or each unit may exist alone physically, or two or more units may be integrated in the same unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the present application is not limited to the details of the foregoing exemplary embodiments, and that the present application can be implemented in other specific forms without departing from the spirit or essential characteristics of the application. Therefore, the embodiments should be regarded as exemplary and non-limiting in every respect, and the scope of the present application is defined by the appended claims rather than by the foregoing description; all changes falling within the meaning and scope of equivalents of the claims are therefore intended to be embraced by the present application. Any reference sign in the claims should not be construed as limiting the claim concerned. In addition, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices stated in a computer device claim may also be implemented by the same unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the foregoing embodiments are only used to illustrate, and not to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements can be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.

Claims (20)

  1. An accent recognition method, characterized in that the method comprises:
    preprocessing a speech signal to be recognized;
    detecting valid speech in the preprocessed speech signal to be recognized;
    extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
    extracting an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
    calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
  2. The method according to claim 1, wherein the detecting valid speech in the preprocessed speech signal to be recognized comprises:
    performing windowing and framing on the preprocessed speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
    performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
    calculating the cumulative energy of each frequency band according to the spectrum of the speech frames;
    taking the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
    comparing the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
  3. The method according to claim 1, wherein the MFCC feature parameters comprise initial MFCC feature parameters, first-order differential MFCC feature parameters, and second-order differential MFCC feature parameters.
  4. The method according to claim 1, wherein the method further comprises:
    performing noise compensation on the iVector.
  5. The method according to claim 1, wherein the calculating a decision score of the speech signal to be recognized for a given accent according to the iVector comprises:
    inputting the iVector into a logistic regression model to obtain the decision score of the speech signal to be recognized for the given accent.
  6. The method according to claim 1, wherein the extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech comprises:
    calculating, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
    adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
    calculating vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
  7. The method according to claim 1, wherein the preprocessing a speech signal to be recognized comprises:
    performing pre-emphasis on the speech signal to be recognized; and
    performing windowing and framing on the speech signal to be recognized.
  8. An accent recognition device, characterized in that the device comprises:
    a preprocessing unit, configured to preprocess a speech signal to be recognized;
    a detection unit, configured to detect valid speech in the preprocessed speech signal to be recognized;
    a first extraction unit, configured to extract Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
    a second extraction unit, configured to extract an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
    a recognition unit, configured to calculate a decision score of the speech signal to be recognized for a given accent according to the iVector, and to obtain an accent recognition result of the speech signal to be recognized according to the decision score.
  9. A computer device, characterized in that the computer device comprises a processor, and the processor is configured to execute computer-readable instructions stored in a memory to implement the following steps:
    preprocessing a speech signal to be recognized;
    detecting valid speech in the preprocessed speech signal to be recognized;
    extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
    extracting an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
    calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
  10. The computer device according to claim 9, wherein the detecting valid speech in the preprocessed speech signal to be recognized comprises:
    performing windowing and framing on the preprocessed speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
    performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
    calculating the cumulative energy of each frequency band according to the spectrum of the speech frames;
    taking the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
    comparing the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
  11. The computer device according to claim 9, wherein the processor is further configured to execute the computer-readable instructions to implement the following step:
    performing noise compensation on the iVector.
  12. The computer device according to claim 9, wherein the calculating a decision score of the speech signal to be recognized for a given accent according to the iVector comprises:
    inputting the iVector into a logistic regression model to obtain the decision score of the speech signal to be recognized for the given accent.
  13. The computer device according to claim 9, wherein the extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech comprises:
    calculating, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
    adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
    calculating vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
  14. The computer device according to claim 9, wherein the preprocessing a speech signal to be recognized comprises:
    performing pre-emphasis on the speech signal to be recognized; and
    performing windowing and framing on the speech signal to be recognized.
  15. A non-volatile readable storage medium storing computer-readable instructions, characterized in that, when the computer-readable instructions are executed by a processor, the following steps are implemented:
    preprocessing a speech signal to be recognized;
    detecting valid speech in the preprocessed speech signal to be recognized;
    extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech;
    extracting an identity vector (iVector) of the valid speech according to the MFCC feature parameters, using a pre-trained Gaussian mixture model-universal background model (GMM-UBM);
    calculating a decision score of the speech signal to be recognized for a given accent according to the iVector, and obtaining an accent recognition result of the speech signal to be recognized according to the decision score.
  16. The storage medium according to claim 15, wherein the detecting valid speech in the preprocessed speech signal to be recognized comprises:
    performing windowing and framing on the preprocessed speech signal to be recognized to obtain speech frames of the speech signal to be recognized;
    performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
    calculating the cumulative energy of each frequency band according to the spectrum of the speech frames;
    taking the logarithm of the cumulative energy of each frequency band to obtain a logarithmic cumulative energy value for each frequency band;
    comparing the logarithmic cumulative energy value of each frequency band with a preset threshold to obtain the valid speech.
  17. The storage medium according to claim 15, wherein the computer-readable instructions, when executed by the processor, are further used to implement the following step:
    performing noise compensation on the iVector.
  18. The storage medium according to claim 15, wherein the calculating a decision score of the speech signal to be recognized for a given accent according to the iVector comprises:
    inputting the iVector into a logistic regression model to obtain the decision score of the speech signal to be recognized for the given accent.
  19. The storage medium according to claim 15, wherein the extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the valid speech comprises:
    calculating, using the cutoff-frequency mapping formula of a bilinear-transform low-pass filter, a frequency warping factor that aligns the average third formants of different speakers;
    adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used for MFCC feature extraction by means of the bilinear transform;
    calculating vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
  20. The storage medium according to claim 15, wherein the preprocessing a speech signal to be recognized comprises:
    performing pre-emphasis on the speech signal to be recognized; and
    performing windowing and framing on the speech signal to be recognized.
PCT/CN2019/077512 2018-08-14 2019-03-08 Accent identification method and device, computer device, and storage medium WO2020034628A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810922056.0 2018-08-14
CN201810922056.0A CN109036437A (en) 2018-08-14 2018-08-14 Accents recognition method, apparatus, computer installation and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2020034628A1 true WO2020034628A1 (en) 2020-02-20

Family

ID=64634084

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/077512 WO2020034628A1 (en) 2018-08-14 2019-03-08 Accent identification method and device, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN109036437A (en)
WO (1) WO2020034628A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036437A (en) * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 Accents recognition method, apparatus, computer installation and computer readable storage medium
CN109686362B (en) * 2019-01-02 2021-04-02 百度在线网络技术(北京)有限公司 Voice broadcasting method and device and computer readable storage medium
CN110111769B (en) * 2019-04-28 2021-10-15 深圳信息职业技术学院 Electronic cochlea control method and device, readable storage medium and electronic cochlea
CN112116909A (en) * 2019-06-20 2020-12-22 杭州海康威视数字技术股份有限公司 Voice recognition method, device and system
CN111128229A (en) * 2019-08-05 2020-05-08 上海海事大学 Voice classification method and device and computer storage medium
US11227601B2 (en) * 2019-09-21 2022-01-18 Merry Electronics(Shenzhen) Co., Ltd. Computer-implement voice command authentication method and electronic device
CN112712792A (en) * 2019-10-25 2021-04-27 Tcl集团股份有限公司 Dialect recognition model training method, readable storage medium and terminal device
CN111508498B (en) * 2020-04-09 2024-01-30 携程计算机技术(上海)有限公司 Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
CN113689863B (en) * 2021-09-24 2024-01-16 广东电网有限责任公司 Voiceprint feature extraction method, voiceprint feature extraction device, voiceprint feature extraction equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
CN105679321A (en) * 2016-01-29 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and device and terminal
CN105976819A (en) * 2016-03-23 2016-09-28 广州势必可赢网络科技有限公司 Rnorm score normalization based speaker verification method
CN108369813A (en) * 2017-07-31 2018-08-03 深圳和而泰智能家居科技有限公司 Specific sound recognition methods, equipment and storage medium
CN109036437A (en) * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 Accents recognition method, apparatus, computer installation and computer readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2793137B2 (en) * 1994-12-14 1998-09-03 株式会社エイ・ティ・アール音声翻訳通信研究所 Accent phrase boundary detector for continuous speech recognition
CN101894548B (en) * 2010-06-23 2012-07-04 清华大学 Modeling method and modeling device for language identification
US9966064B2 (en) * 2012-07-18 2018-05-08 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
CN104538035B (en) * 2014-12-19 2018-05-01 深圳先进技术研究院 A kind of method for distinguishing speek person and system based on Fisher super vectors
CN107274905B (en) * 2016-04-08 2019-09-27 腾讯科技(深圳)有限公司 A kind of method for recognizing sound-groove and system
CN107610707B (en) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN108122554B (en) * 2017-12-25 2021-12-21 广东小天才科技有限公司 Control method of microphone device in charging state and microphone device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
CN105679321A (en) * 2016-01-29 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and device and terminal
CN105976819A (en) * 2016-03-23 2016-09-28 广州势必可赢网络科技有限公司 Rnorm score normalization based speaker verification method
CN108369813A (en) * 2017-07-31 2018-08-03 深圳和而泰智能家居科技有限公司 Specific sound recognition methods, equipment and storage medium
CN109036437A (en) * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 Accents recognition method, apparatus, computer installation and computer readable storage medium

Also Published As

Publication number Publication date
CN109036437A (en) 2018-12-18


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19850190

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19850190

Country of ref document: EP

Kind code of ref document: A1