KR20130068196A - Apparatus and method for clustering using confusion matrix of voice recognition error - Google Patents

Apparatus and method for clustering using confusion matrix of voice recognition error

Info

Publication number
KR20130068196A
KR20130068196A (Application No. KR1020110134836A)
Authority
KR
South Korea
Prior art keywords
error
acoustic model
clustering
matrix
speech recognition
Prior art date
Application number
KR1020110134836A
Other languages
Korean (ko)
Inventor
강병옥
박기영
이성주
정호영
이윤근
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원
Priority to KR1020110134836A priority Critical patent/KR20130068196A/en
Publication of KR20130068196A publication Critical patent/KR20130068196A/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Abstract

The present invention relates to a clustering apparatus using a speech recognition error confusion matrix, comprising: an acoustic model generator for receiving training voice data and generating an acoustic model; a voice recognition unit for performing voice recognition based on the generated acoustic model and the received test and user voice data; an error confusion matrix construction unit for forming a confusion matrix from error pairs extracted by comparing the speech recognition result with transcription data; a high-frequency error pair extractor for extracting error pairs having a high extraction frequency from the confusion matrix; and a state clustering unit for state-clustering the acoustic model based on a result of the high-frequency error pair extractor.

Figure P1020110134836

Description

Clustering apparatus using a speech recognition error confusion matrix and method thereof {APPARATUS AND METHOD FOR CLUSTERING USING CONFUSION MATRIX OF VOICE RECOGNITION ERROR}

The present invention relates to a clustering apparatus and method using a speech recognition error confusion matrix and, more particularly, to a clustering apparatus and method that extract error pairs occurring frequently in recognition results and cluster the acoustic model based on them, thereby improving the discriminability of the acoustic model.

Speech recognition systems rely on the correspondence between speech and its characterization in an acoustic space, a characterization that is typically obtained from training data. Training data may be collected from multiple speakers to build a speaker-independent system, or from a single speaker to build a speaker-dependent system.

Because a speaker-independent system averages statistics over several speakers, its recognition performance for any particular speaker is relatively poor.

A speaker-dependent system, on the other hand, achieves better recognition performance for its specific speaker than a speaker-independent system, but has the disadvantage that a large amount of training data must be collected from the speaker who will use the system.

Recognizing speech regardless of the talker can be said to be the ultimate goal of speech recognition, and speaker adaptation is one way to approach this goal. A speaker adaptation system is an intermediate form between the speaker-independent and speaker-dependent systems: it basically provides speaker-independent operation, but improves performance for a specific speaker through an additional training process. That is, a speaker-independent system is first created from training data produced by several speakers, and a system adapted to a newly registered speaker is then built using a small amount of that speaker's training data.

In general, a speaker adaptation system composes classes using a phonological knowledge base and the clustering characteristics of the acoustic-model space. However, this approach assumes that phonemes produced in similar ways are located in similar regions of the acoustic-model space, an assumption with no mathematical or logical basis to support it. In addition, any difference in clustering between the models before and after speaker adaptation is not taken into account.

That is, when clustering is based only on the distribution of each model in the acoustic-model space of the speaker-independent model before adaptation, models belonging to one cluster may move to another cluster after the speaker adaptation. As a result, speaker adaptation for these shifted models is performed with an incorrect transformation matrix.

The present invention has been devised to solve the above problems. An object of the present invention is to provide a clustering apparatus and method using a speech recognition error confusion matrix that can improve the discriminative power of the acoustic model by extracting error pairs that occur frequently in recognition results and clustering the acoustic model based on these errors.

Another object of the present invention is to provide a clustering apparatus and method using a speech recognition error confusion matrix that can maximize speech recognition performance by improving discrimination between high-frequency error pairs.

In order to achieve the above objects, a clustering apparatus using a speech recognition error confusion matrix according to an embodiment of the present invention includes: an acoustic model generator for generating an acoustic model from received training voice data; a voice recognition unit for performing voice recognition based on the generated acoustic model and the received test and user voice data and outputting the result; an error confusion matrix construction unit for building a confusion matrix from error pairs extracted by comparing the speech recognition result with transcription data; a high-frequency error pair extractor for extracting error pairs with a high extraction frequency from the confusion matrix; and a state clustering unit for state-clustering the acoustic model based on the result of the high-frequency error pair extractor.

According to the present invention configured as described above, the clustering apparatus and method using the speech recognition error confusion matrix extract error pairs that occur frequently in speech recognition and cluster the acoustic model based on them, thereby improving the discriminability and reliability of the acoustic model.

Therefore, the present invention can maximize speech recognition performance by improving discrimination between high-frequency error pairs.

FIG. 1 is a schematic diagram illustrating a clustering apparatus using a speech recognition error confusion matrix according to an embodiment of the present invention.
FIG. 2 is a schematic diagram illustrating another configuration of a clustering apparatus using a speech recognition error confusion matrix according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a clustering method using a speech recognition error confusion matrix according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily carry out the technical idea of the present invention. In adding reference numerals to the elements of the drawings, the same elements are denoted by the same reference numerals wherever possible, even when they appear in different drawings. In the following description, detailed descriptions of well-known functions and configurations are omitted when they would obscure the subject matter of the present invention.

Hereinafter, a clustering apparatus using a speech recognition error confusion matrix and a method thereof according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram illustrating a clustering apparatus using a speech recognition error confusion matrix, and FIG. 2 is a schematic diagram illustrating another configuration of the clustering apparatus.

Referring to FIGS. 1 and 2, the clustering apparatus using the speech recognition error confusion matrix largely comprises an acoustic model generator 100, a speech recognition unit 110, an error confusion matrix construction unit 120, a high-frequency error pair extractor 130, and state clustering units 101a and 101b.

The acoustic model generator 100 receives training voice data and generates an acoustic model. The acoustic model generator 100 includes the state clustering units 101a and 101b, which are described in detail later.

The speech recognition unit 110 receives the acoustic model generated by the acoustic model generator 100, together with a pronunciation dictionary, a language model, and calibration data supplied from the outside, and performs speech recognition. Here, the calibration data are reference data adjusted to the reference and scale used in speech recognition, and include the test voice data and the user voice data.

That is, the speech recognition unit 110 performs speech recognition based on the acoustic model, the test speech data, and the user speech data, and outputs the result.

The error confusion matrix construction unit 120 constructs a confusion matrix from error pairs extracted by comparing the speech recognition result with the transcription data. Here, the transcription data are phonetic (pronunciation-level) transcriptions of the test voice data and the user voice data.

Here, a confusion matrix is a matrix that represents the relationship between actual values and predicted values; it records the responses produced for each stimulus or recognition target (data item). Since it is a tool commonly used in analysis methods for efficient control and management, it is not described in further detail in the present invention.

More specifically, the error confusion matrix construction unit 120 compares the speech recognition result with the transcription data and lists the corresponding correct-answer words in order, where each word is represented in triphone units that treat three phonemes as one unit. A triphone unit is context dependent on the phonemes to the left and right of its central phoneme. For example, the Korean word '사람' (person) is constructed from the triphone models 'silence-ㅅ+ㅏ', 'ㅅ-ㅏ+ㄹ', 'ㅏ-ㄹ+ㅏ', 'ㄹ-ㅏ+ㅁ', and 'ㅏ-ㅁ+silence'.
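
For illustration only, the following Python sketch (no such code appears in the patent; the function name, the romanized phoneme symbols, and the 'sil' boundary marker are assumptions) expands a word's phoneme sequence into the left/right context-dependent triphone units described above:

    def to_triphones(phonemes, sil="sil"):
        """Expand a phoneme sequence into context-dependent triphone units.

        Each triphone is written 'left-center+right', with a silence marker
        padding the word boundaries."""
        padded = [sil] + list(phonemes) + [sil]
        return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
                for i in range(1, len(padded) - 1)]

    # The word '사람' (person) from the example above, romanized:
    print(to_triphones(["s", "a", "r", "a", "m"]))
    # ['sil-s+a', 's-a+r', 'a-r+a', 'r-a+m', 'a-m+sil']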

The error confusion matrix construction unit 120 then extracts error pairs from the listed words, and the extracted error pairs are arranged in the form of a confusion matrix. Here, the confusion matrix is built only from misrecognized results, that is, from errors in the recognition output.
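
The construction of this error confusion matrix can be sketched as follows, assuming the recognition result and the transcription data have already been aligned into (reference, hypothesis) triphone pairs; the alignment step and the function name are assumptions, as the patent does not specify them. Only mismatching pairs are counted, consistent with the statement that the matrix holds errors only:

    from collections import Counter

    def build_error_confusion_matrix(aligned_pairs):
        """Count (reference triphone, hypothesis triphone) substitution errors.

        aligned_pairs: iterable of (ref, hyp) triphone pairs obtained by
        aligning the recognition result with the transcription data.
        Pairs that agree are skipped, so the matrix contains errors only."""
        matrix = Counter()
        for ref, hyp in aligned_pairs:
            if ref != hyp:
                matrix[(ref, hyp)] += 1
        return matrix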

The high-frequency error pair extractor 130 extracts error pairs with a high occurrence frequency from the confusion matrix built by the error confusion matrix construction unit 120.
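
The patent says only that pairs with a high extraction frequency are kept; a fixed count threshold (or, equivalently, a top-N cut) is one plausible realization, sketched here with a hypothetical function name and threshold:

    def extract_high_frequency_pairs(matrix, min_count=5):
        """Return (error pair, count) entries whose count meets a threshold,
        in descending order of frequency."""
        return [(pair, count) for pair, count in matrix.most_common()
                if count >= min_count]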

The state clustering units 101a and 101b state-cluster the acoustic model based on the result of the high-frequency error pair extractor 130. Here, clustering groups a given set of data into several clusters (sets) in order to find the characteristics of the data contained in each cluster; in other words, the purpose of clustering is to group data with similar properties into the same set.
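
The patent does not detail how the high-frequency error pairs steer the state clustering. One plausible reading, sketched below purely as an assumption, treats each frequently confused pair as a cannot-link constraint, so that the states of often-confused triphones are never tied into the same cluster and therefore remain discriminable:

    def cluster_states(states, distance, cannot_link, max_dist=1.0):
        """Greedy agglomerative clustering with cannot-link constraints.

        states: list of state identifiers (e.g. triphone HMM states).
        distance: callable giving a dissimilarity between two states,
            for instance computed from their Gaussian parameters.
        cannot_link: set of frozensets built from high-frequency error pairs;
            two states in such a pair are never merged into one cluster."""
        clusters = [{s} for s in states]

        def violates(a, b):
            return any(frozenset((x, y)) in cannot_link for x in a for y in b)

        def cluster_dist(a, b):  # complete linkage between two clusters
            return max(distance(x, y) for x in a for y in b)

        merged = True
        while merged and len(clusters) > 1:
            merged, best = False, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if violates(clusters[i], clusters[j]):
                        continue
                    d = cluster_dist(clusters[i], clusters[j])
                    if d <= max_dist and (best is None or d < best[0]):
                        best = (d, i, j)
            if best is not None:
                _, i, j = best
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
        return clusters

The distance function is deliberately left as an injected callable, since the patent does not commit to a particular measure between acoustic-model states.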

The state clustering units 101a and 101b may be included in the acoustic model generator 100 as illustrated in FIG. 1, or may be provided as a separate component as illustrated in FIG. 2. In the present invention, the configuration of FIG. 1 is described as a representative example.

Hereinafter, a clustering method using a speech recognition error confusion matrix will be described in detail with reference to FIG. 3. FIG. 3 is a flowchart illustrating the clustering method.

Referring to FIG. 3, the clustering apparatus using the speech recognition error confusion matrix first receives training voice data and generates an acoustic model.

Next, speech recognition is performed based on the generated acoustic model, together with a pronunciation dictionary, a language model, and calibration data received from the outside, and the result is output.

Next, a confusion matrix is formed from the error pairs obtained by comparing the speech recognition result with the transcription data (S303). That is, the speech recognition result is listed in triphone units, error pairs are extracted from the listed words, and the extracted error pairs form the confusion matrix.

Next, error pairs with a high occurrence frequency are extracted from the confusion matrix (S304).

Next, the acoustic model is state clustered based on the high frequency error pair extraction result (S305).
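
Composing the hypothetical helpers sketched earlier, steps S303 to S305 could be wired together as follows; acoustic-model training, recognition, and the alignment that yields the (reference, hypothesis) triphone pairs are outside the scope of this sketch:

    def cluster_with_error_confusion_matrix(aligned_pairs, states, distance,
                                            min_count=5):
        """Build the error confusion matrix (S303), keep the high-frequency
        error pairs (S304), and state-cluster the acoustic model so that
        confused states are never tied together (S305)."""
        matrix = build_error_confusion_matrix(aligned_pairs)
        frequent = extract_high_frequency_pairs(matrix, min_count)
        cannot_link = {frozenset(pair) for pair, _ in frequent}
        return cluster_states(states, distance, cannot_link)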

As described above, the clustering apparatus and method using the speech recognition error confusion matrix of the present invention can improve the discriminative power and reliability of the acoustic model by extracting error pairs that occur frequently in recognition results and clustering the acoustic model based on them. Therefore, the present invention can maximize speech recognition performance by improving discrimination between high-frequency error pairs.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that many alternatives, modifications, and variations may be made without departing from the scope of the appended claims.

100: acoustic model generator 101a, 101b: state clustering unit
110: speech recognition unit 120: error confusion matrix construction unit
130: high-frequency error pair extractor

Claims (1)

A clustering apparatus using a speech recognition error confusion matrix, comprising:
an acoustic model generator configured to receive training voice data and generate an acoustic model;
a voice recognition unit configured to perform voice recognition based on the generated acoustic model and the received test and user voice data;
an error confusion matrix construction unit configured to form a confusion matrix from error pairs extracted by comparing the speech recognition result with transcription data;
a high-frequency error pair extractor configured to extract error pairs having a high extraction frequency from the confusion matrix; and
a state clustering unit configured to state-cluster the acoustic model based on a result of the high-frequency error pair extractor.

KR1020110134836A 2011-12-14 2011-12-14 Apparatus and method for clustering using confusion matrix of voice recognition error KR20130068196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020110134836A KR20130068196A (en) 2011-12-14 2011-12-14 Apparatus and method for clustering using confusion matrix of voice recognition error

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020110134836A KR20130068196A (en) 2011-12-14 2011-12-14 Apparatus and method for clustering using confusion matrix of voice recognition error

Publications (1)

Publication Number Publication Date
KR20130068196A true KR20130068196A (en) 2013-06-26

Family

ID=48863870

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020110134836A KR20130068196A (en) 2011-12-14 2011-12-14 Apparatus and method for clustering using confusion matrix of voice recognition error

Country Status (1)

Country Link
KR (1) KR20130068196A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101579544B1 (en) * 2014-09-04 2015-12-23 에스케이 텔레콤주식회사 Apparatus and Method for Calculating Similarity of Natural Language

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination