KR20130068196A - Apparatus and method for clustering using confusion matrix of voice recognition error - Google Patents

Apparatus and method for clustering using confusion matrix of voice recognition error

Info

Publication number
KR20130068196A
KR20130068196A (Application No. KR1020110134836A)
Authority
KR
South Korea
Prior art keywords
error
acoustic model
clustering
matrix
speech recognition
Prior art date
Application number
KR1020110134836A
Other languages
Korean (ko)
Inventor
강병옥
박기영
이성주
정호영
이윤근
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원
Priority to KR1020110134836A priority Critical patent/KR20130068196A/en
Publication of KR20130068196A publication Critical patent/KR20130068196A/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Abstract

The present invention relates to a clustering apparatus using a speech recognition error confusion matrix, comprising: an acoustic model generator for receiving training voice data and generating an acoustic model; a voice recognition unit for performing voice recognition based on the generated acoustic model and the received test and user voice data; an error confusion matrix construction unit for forming a confusion matrix from error pairs extracted by comparing the speech recognition result with transcription data; a high-frequency error pair extractor for extracting error pairs having a high extraction frequency from the confusion matrix; and a state clustering unit for state-clustering the acoustic model based on a result of the high-frequency error pair extractor.

Figure P1020110134836

Description

Clustering apparatus using a speech recognition error confusion matrix and method thereof {APPARATUS AND METHOD FOR CLUSTERING USING CONFUSION MATRIX OF VOICE RECOGNITION ERROR}

The present invention relates to a clustering apparatus and method using a speech recognition error confusion matrix and, more particularly, to a clustering apparatus and method that extract error pairs occurring frequently in recognition results and cluster the acoustic model based on them, thereby improving the discriminability of the acoustic model.

Speech recognition systems rely on the correspondence between speech and its characterization in an acoustic space, a characterization that is typically obtained from training data. Training data may be collected from multiple speakers to build a speaker-independent system, or from a single speaker to build a speaker-dependent system.

Because a speaker-independent system averages statistics over several speakers, its recognition performance for any particular speaker is relatively poor.

A speaker-dependent system, on the other hand, achieves better recognition performance for its specific speaker than a speaker-independent system, but has the disadvantage that a large amount of training data must be collected from the speaker who will use the system.

Recognizing speech regardless of the talker can be said to be the ultimate goal of speech recognition, and speaker adaptation is one way to approach this goal. A speaker adaptation system is an intermediate form between the speaker-independent and speaker-dependent systems: it basically provides speaker-independent operation, but improves performance for a specific speaker through an additional training process. That is, a speaker-independent system is first created from training data produced by several speakers, and a system adapted to a newly registered speaker is then built using a small amount of that speaker's training data.

In general, a speaker adaptation system composes classes using a phonological knowledge base and the clustering characteristics of the acoustic-model space. However, this approach assumes that phonemes produced in similar ways are located in similar regions of the acoustic-model space, an assumption with no mathematical or logical basis to support it. In addition, any difference in clustering between the models before and after speaker adaptation is not taken into account.

That is, when clustering is based only on the distribution of each model in the acoustic-model space of the speaker-independent model before adaptation, models belonging to one cluster may move to another cluster after the speaker adaptation. As a result, speaker adaptation for these shifted models is performed with an incorrect transformation matrix.

The present invention has been devised to solve the above problems. An object of the present invention is to provide a clustering apparatus and method using a speech recognition error confusion matrix that can improve the discriminative power of the acoustic model by extracting error pairs that occur frequently in recognition results and clustering the acoustic model based on these errors.

Another object of the present invention is to provide a clustering apparatus and method using a speech recognition error confusion matrix that can maximize speech recognition performance by improving discrimination between high-frequency error pairs.

In order to achieve the above objects, a clustering apparatus using a speech recognition error confusion matrix according to an embodiment of the present invention includes: an acoustic model generator for generating an acoustic model from received training voice data; a voice recognition unit for performing voice recognition based on the generated acoustic model and the received test and user voice data and outputting the result; an error confusion matrix construction unit for building a confusion matrix from error pairs extracted by comparing the speech recognition result with transcription data; a high-frequency error pair extractor for extracting error pairs with a high extraction frequency from the confusion matrix; and a state clustering unit for state-clustering the acoustic model based on the result of the high-frequency error pair extractor.

According to the present invention configured as described above, the clustering apparatus and method using the speech recognition error confusion matrix extract error pairs that occur frequently in speech recognition and cluster the acoustic model based on them, thereby improving the discriminability and reliability of the acoustic model.

Therefore, the present invention can maximize speech recognition performance by improving discrimination between high-frequency error pairs.

FIG. 1 is a schematic diagram illustrating a clustering apparatus using a speech recognition error confusion matrix according to an embodiment of the present invention.
FIG. 2 is a schematic diagram illustrating another configuration of a clustering apparatus using a speech recognition error confusion matrix according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a clustering method using a speech recognition error confusion matrix according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily carry out the technical idea of the present invention. In adding reference numerals to the elements of the drawings, the same elements are denoted by the same reference numerals wherever possible, even when they appear in different drawings. In the following description, detailed descriptions of well-known functions and configurations are omitted when they would obscure the subject matter of the present invention.

Hereinafter, a clustering apparatus using a speech recognition error confusion matrix and a method thereof according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram illustrating a clustering apparatus using a speech recognition error confusion matrix, and FIG. 2 is a schematic diagram illustrating another configuration of the clustering apparatus.

Referring to FIGS. 1 and 2, the clustering apparatus using the speech recognition error confusion matrix largely comprises an acoustic model generator 100, a speech recognition unit 110, an error confusion matrix construction unit 120, a high-frequency error pair extractor 130, and state clustering units 101a and 101b.

The acoustic model generator 100 receives training voice data and generates an acoustic model. The acoustic model generator 100 includes the state clustering units 101a and 101b, which are described in detail later.

The speech recognition unit 110 receives the acoustic model generated by the acoustic model generator 100, together with a pronunciation dictionary, a language model, and calibration data supplied from the outside, and performs speech recognition. Here, the calibration data are reference data adjusted to the reference and scale used in speech recognition, and include the test voice data and the user voice data.

That is, the speech recognition unit 110 performs speech recognition based on the acoustic model, the test speech data, and the user speech data, and outputs the result.

The error confusion matrix construction unit 120 constructs a confusion matrix from error pairs extracted by comparing the speech recognition result with the transcription data. Here, the transcription data are phonetic (pronunciation-level) transcriptions of the test voice data and the user voice data.

Here, a confusion matrix is a matrix that represents the relationship between actual values and predicted values; it records the responses produced for each stimulus or recognition target (data item). Since it is a tool commonly used in analysis methods for efficient control and management, it is not described in further detail in the present invention.

More specifically, the error confusion matrix construction unit 120 compares the speech recognition result with the transcription data and lists the corresponding correct-answer words in order, where each word is represented in triphone units that treat three phonemes as one unit. A triphone unit is context dependent on the phonemes to the left and right of its central phoneme. For example, the Korean word '사람' (person) is constructed from the triphone models 'silence-ㅅ+ㅏ', 'ㅅ-ㅏ+ㄹ', 'ㅏ-ㄹ+ㅏ', 'ㄹ-ㅏ+ㅁ', and 'ㅏ-ㅁ+silence'.
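
For illustration only, the following Python sketch (no such code appears in the patent; the function name, the romanized phoneme symbols, and the 'sil' boundary marker are assumptions) expands a word's phoneme sequence into the left/right context-dependent triphone units described above:

    def to_triphones(phonemes, sil="sil"):
        """Expand a phoneme sequence into context-dependent triphone units.

        Each triphone is written 'left-center+right', with a silence marker
        padding the word boundaries."""
        padded = [sil] + list(phonemes) + [sil]
        return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
                for i in range(1, len(padded) - 1)]

    # The word '사람' (person) from the example above, romanized:
    print(to_triphones(["s", "a", "r", "a", "m"]))
    # ['sil-s+a', 's-a+r', 'a-r+a', 'r-a+m', 'a-m+sil']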

The error confusion matrix construction unit 120 then extracts error pairs from the listed words, and the extracted error pairs are arranged in the form of a confusion matrix. Here, the confusion matrix is built only from misrecognized results, that is, from errors in the recognition output.
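
The construction of this error confusion matrix can be sketched as follows, assuming the recognition result and the transcription data have already been aligned into (reference, hypothesis) triphone pairs; the alignment step and the function name are assumptions, as the patent does not specify them. Only mismatching pairs are counted, consistent with the statement that the matrix holds errors only:

    from collections import Counter

    def build_error_confusion_matrix(aligned_pairs):
        """Count (reference triphone, hypothesis triphone) substitution errors.

        aligned_pairs: iterable of (ref, hyp) triphone pairs obtained by
        aligning the recognition result with the transcription data.
        Pairs that agree are skipped, so the matrix contains errors only."""
        matrix = Counter()
        for ref, hyp in aligned_pairs:
            if ref != hyp:
                matrix[(ref, hyp)] += 1
        return matrix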

The high-frequency error pair extractor 130 extracts error pairs with a high occurrence frequency from the confusion matrix built by the error confusion matrix construction unit 120.
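
The patent says only that pairs with a high extraction frequency are kept; a fixed count threshold (or, equivalently, a top-N cut) is one plausible realization, sketched here with a hypothetical function name and threshold:

    def extract_high_frequency_pairs(matrix, min_count=5):
        """Return (error pair, count) entries whose count meets a threshold,
        in descending order of frequency."""
        return [(pair, count) for pair, count in matrix.most_common()
                if count >= min_count]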

The state clustering units 101a and 101b state-cluster the acoustic model based on the result of the high-frequency error pair extractor 130. Here, clustering groups a given set of data into several clusters (sets) in order to find the characteristics of the data contained in each cluster; in other words, the purpose of clustering is to group data with similar properties into the same set.
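
The patent does not detail how the high-frequency error pairs steer the state clustering. One plausible reading, sketched below purely as an assumption, treats each frequently confused pair as a cannot-link constraint, so that the states of often-confused triphones are never tied into the same cluster and therefore remain discriminable:

    def cluster_states(states, distance, cannot_link, max_dist=1.0):
        """Greedy agglomerative clustering with cannot-link constraints.

        states: list of state identifiers (e.g. triphone HMM states).
        distance: callable giving a dissimilarity between two states,
            for instance computed from their Gaussian parameters.
        cannot_link: set of frozensets built from high-frequency error pairs;
            two states in such a pair are never merged into one cluster."""
        clusters = [{s} for s in states]

        def violates(a, b):
            return any(frozenset((x, y)) in cannot_link for x in a for y in b)

        def cluster_dist(a, b):  # complete linkage between two clusters
            return max(distance(x, y) for x in a for y in b)

        merged = True
        while merged and len(clusters) > 1:
            merged, best = False, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if violates(clusters[i], clusters[j]):
                        continue
                    d = cluster_dist(clusters[i], clusters[j])
                    if d <= max_dist and (best is None or d < best[0]):
                        best = (d, i, j)
            if best is not None:
                _, i, j = best
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
        return clusters

The distance function is deliberately left as an injected callable, since the patent does not commit to a particular measure between acoustic-model states.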

The state clustering units 101a and 101b may be included in the acoustic model generator 100 as illustrated in FIG. 1, or may be provided as a separate component as illustrated in FIG. 2. In the present invention, the configuration of FIG. 1 is described as a representative example.

Hereinafter, a clustering method using a speech recognition error confusion matrix will be described in detail with reference to FIG. 3. FIG. 3 is a flowchart illustrating the clustering method.

Referring to FIG. 3, the clustering apparatus using the speech recognition error confusion matrix first receives training voice data and generates an acoustic model.

Next, speech recognition is performed based on the generated acoustic model, together with a pronunciation dictionary, a language model, and calibration data received from the outside, and the result is output.

Next, a confusion matrix is formed from the error pairs obtained by comparing the speech recognition result with the transcription data (S303). That is, the speech recognition result is listed in triphone units, error pairs are extracted from the listed words, and the extracted error pairs form the confusion matrix.

Next, error pairs with a high occurrence frequency are extracted from the confusion matrix (S304).

Next, the acoustic model is state clustered based on the high frequency error pair extraction result (S305).
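
Composing the hypothetical helpers sketched earlier, steps S303 to S305 could be wired together as follows; acoustic-model training, recognition, and the alignment that yields the (reference, hypothesis) triphone pairs are outside the scope of this sketch:

    def cluster_with_error_confusion_matrix(aligned_pairs, states, distance,
                                            min_count=5):
        """Build the error confusion matrix (S303), keep the high-frequency
        error pairs (S304), and state-cluster the acoustic model so that
        confused states are never tied together (S305)."""
        matrix = build_error_confusion_matrix(aligned_pairs)
        frequent = extract_high_frequency_pairs(matrix, min_count)
        cannot_link = {frozenset(pair) for pair, _ in frequent}
        return cluster_states(states, distance, cannot_link)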

As described above, the clustering apparatus and method using the speech recognition error confusion matrix of the present invention can improve the discriminative power and reliability of the acoustic model by extracting error pairs that occur frequently in recognition results and clustering the acoustic model based on them. Therefore, the present invention can maximize speech recognition performance by improving discrimination between high-frequency error pairs.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that many alternatives, modifications, and variations may be made without departing from the scope of the appended claims.

100: acoustic model generator 101a, 101b: state clustering unit
110: speech recognition unit 120: error confusion matrix construction unit
130: high-frequency error pair extractor

Claims (1)

A clustering apparatus using a speech recognition error confusion matrix, comprising:
an acoustic model generator configured to receive training voice data and generate an acoustic model;
a voice recognition unit configured to perform voice recognition based on the generated acoustic model and the received test and user voice data;
an error confusion matrix construction unit configured to form a confusion matrix from error pairs extracted by comparing the speech recognition result with transcription data;
a high-frequency error pair extractor configured to extract error pairs having a high extraction frequency from the confusion matrix; and
a state clustering unit configured to state-cluster the acoustic model based on a result of the high-frequency error pair extractor.

KR1020110134836A 2011-12-14 2011-12-14 Apparatus and method for clustering using confusion matrix of voice recognition error KR20130068196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020110134836A KR20130068196A (en) 2011-12-14 2011-12-14 Apparatus and method for clustering using confusion matrix of voice recognition error

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020110134836A KR20130068196A (en) 2011-12-14 2011-12-14 Apparatus and method for clustering using confusion matrix of voice recognition error

Publications (1)

Publication Number Publication Date
KR20130068196A true KR20130068196A (en) 2013-06-26

Family

ID=48863870

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020110134836A KR20130068196A (en) 2011-12-14 2011-12-14 Apparatus and method for clustering using confusion matrix of voice recognition error

Country Status (1)

Country Link
KR (1) KR20130068196A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101579544B1 (en) * 2014-09-04 2015-12-23 에스케이 텔레콤주식회사 Apparatus and Method for Calculating Similarity of Natural Language

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination