KR20000033276A

KR20000033276A - Frame compression method using representative characteristic column and audio recognition method using the same

Info

Publication number: KR20000033276A
Application number: KR1019980050076A
Authority: KR
Inventors: 황규웅; 박준; 권오욱
Original assignee: 정선종; 한국전자통신연구원
Priority date: 1998-11-21
Filing date: 1998-11-21
Publication date: 2000-06-15

Abstract

PURPOSE: A frame compression method using representative feature sequences, and a speech recognition method using the same, are provided to reduce the number of frames without degrading performance by extracting a feature sequence from each frame taken at a constant time interval and then obtaining one representative frame for similar frames. CONSTITUTION: The frame compression method comprises the steps of: dividing an input signal into frames at a prescribed time interval; extracting a feature sequence from each frame; obtaining the similarity between the extracted feature sequences; and obtaining a representative feature sequence for similar feature sequences using the obtained similarity.

Description

Frame Compression Method Using Representative Feature Sequences and Speech Recognition Method Using the Same

The present invention relates to a frame compression method using representative feature sequences, and to a computer-readable recording medium on which a program for realizing the method is recorded.

The present invention also relates to a speech recognition method that processes input speech in units of frames and recognizes the speech on that basis, and to a computer-readable recording medium on which a program for realizing the method is recorded. In particular, it relates to a speech recognition method that improves recognition speed by using the frame compression method to reduce the number of frames, and to a computer-readable recording medium on which a program for realizing that method is recorded.

In general, a speech recognition system is a device that takes human speech as input and outputs the content of the utterance.

Conventional speech recognition systems use a fixed frame rate: they extract features at constant time intervals regardless of the content of the speech and process those features to perform recognition.

U.S. Patent No. 4,979,212 to OKI Electric Industry Co., Ltd. of Japan (Speech recognition system in which voiced intervals are broken into segments that may have unequal durations) extracts feature sequences for speech recognition by determining the length of each speech segment subject to feature extraction from changes in the power of the speech.

Taking the points where the power changes sharply as boundaries, a vowel portion, which generally has high power, forms one frame, and the consonant between vowels forms another frame, so that the number of extracted feature sequences is proportional to the number of syllables.

As a result, each syllable is normalized to a feature sequence of constant length regardless of how long the syllable is uttered.

A conventional speech recognition system that extracts feature sequences from frames at constant time intervals and processes them to perform recognition has the problem that, because a frame is generated at every interval, the number of frames is large and recognition is slowed.

In addition, speech recognition systems based on the widely used hidden Markov model (HMM) have a problem with duration modeling.

The present invention, devised to solve the above problems, has the object of providing a frame compression method that can reduce the number of frames without degrading performance by extracting feature sequences from frames at constant time intervals and then obtaining one representative frame (representative feature sequence) for similar frames, and a computer-readable recording medium on which a program for realizing the method is recorded.

Another object of the present invention is to provide a frame compression method in which the number of actual frames represented by each extracted representative feature sequence is added as a feature, so that the length information of each frame is included as a feature, and a computer-readable recording medium on which a program for realizing the method is recorded.

A further object of the present invention is to provide a speech recognition method that uses the frame compression method to improve both the speed and the performance (recognition rate) of speech recognition, and a computer-readable recording medium on which a program for realizing the method is recorded.

FIG. 1 is an exemplary configuration diagram of a speech recognition system to which the present invention is applied.

FIG. 2 is a conceptual diagram of the frame compression method using representative feature sequences according to the present invention.

FIG. 3 is a flowchart of one embodiment of the frame compression method using representative feature sequences and the speech recognition method using the same according to the present invention.

FIG. 4 is an example of the feature sequence similarity measurement process using a multilayer perceptron artificial neural network according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

102: microphone

103: A/D converter

104: feature extraction unit

105: optimal model search unit

106: speech data

107: speech model selection unit

108: speech model training unit

To achieve the above object, the frame compression method of the present invention, applied to a frame compression apparatus, comprises: a first step of dividing an input signal into frames at predetermined time intervals; a second step of extracting a feature sequence for each of the frames; a third step of obtaining the similarity between the extracted feature sequences; and a fourth step of reducing the number of frames by using the obtained similarity to obtain a representative feature sequence that represents similar feature sequences.

The frame compression method further comprises a fifth step of adding to the representative feature sequence, as a feature, the number of original (actual) feature sequences that it represents.

To achieve the other object, the speech recognition method of the present invention, applied to a speech recognition system, comprises: a first step of dividing a speech input signal into frames at predetermined time intervals; a second step of extracting a feature sequence for each of the frames; a third step of obtaining the similarity between the extracted feature sequences; a fourth step of reducing the number of frames by using the obtained similarity to obtain a representative feature sequence that represents similar feature sequences; and a fifth step of performing speech recognition on the speech signal with the reduced number of frames and outputting the result.

The speech recognition method further comprises, after the fourth step, a sixth step of adding to the representative feature sequence, as a feature, the number of original (actual) feature sequences that it represents, so that the speech can be recognized using its duration.

The present invention also provides a computer-readable recording medium on which is recorded a program for realizing, in a frame compression apparatus having a processor: a first function of dividing an input signal into frames at predetermined time intervals; a second function of extracting a feature sequence for each of the frames; a third function of obtaining the similarity between the extracted feature sequences; and a fourth function of reducing the number of frames by using the obtained similarity to obtain a representative feature sequence that represents similar feature sequences.

The present invention further provides a computer-readable recording medium on which is recorded a program for realizing, in a speech recognition system having a processor: a first function of dividing a speech input signal into frames at predetermined time intervals; a second function of extracting a feature sequence for each of the frames; a third function of obtaining the similarity between the extracted feature sequences; a fourth function of reducing the number of frames by using the obtained similarity to obtain a representative feature sequence that represents similar feature sequences; and a fifth function of performing speech recognition on the speech signal with the reduced number of frames and outputting the result.

The above objects, features, and advantages will become clearer from the following detailed description taken in conjunction with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is an exemplary configuration diagram of a speech recognition system to which the present invention is applied.

First, when a person speaks, the speech (101) is input to the speech recognition system through the microphone (102). The analog/digital (A/D) converter (103) converts the speech signal input through the microphone (102) into digital data, and the feature extraction unit (104) extracts the features needed for speech recognition from the digital data produced by the A/D converter (103).

The optimal model search unit (105) then compares the feature vectors extracted by the feature extraction unit (104) with the speech models (109) and outputs, as the recognition result, the word represented by the most similar speech model. The speech models (109) used for this comparison are obtained by selecting speech models (107) on the basis of speech data (106) recorded from many speakers and then training them (108).

FIG. 2 is a conceptual diagram of the frame compression method using representative feature sequences according to the present invention, and FIG. 3 is a flowchart of one embodiment of the frame compression method and the speech recognition method using the same.

In outline, whereas existing speech recognition systems perform recognition on frames taken at constant time intervals, the present invention obtains one representative frame (representative feature sequence) for similar frames and performs recognition on the representative frames. This reduces the number of frames the speech recognition system must process and thereby improves processing speed.

In addition, the present invention adds to the speech recognition features obtained from a frame, as a further feature, the number of actual frames that the frame represents. Unlike existing speech recognition systems, it thereby uses speech length information to improve recognition performance (recognition rate).

The frame compression method using representative feature sequences and the speech recognition method using the same are described in detail below with reference to FIGS. 2 and 3.

First, when speech is input from the user through the microphone (102) (301), the analog speech signal is converted to digital form (302). In FIG. 2, 201 denotes the waveform of the input speech.

Next, the input speech waveform is divided into frames at constant time intervals (303). As shown at 201 in FIG. 2, each section delimited by vertical lines is called a frame. The frames may partially overlap.
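The framing step (303) can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_into_frames`, the parameters `frame_len` and `hop_len`, and the choice to drop a trailing partial frame are not taken from the source, which specifies only constant-interval frames that may partially overlap.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Split a list of samples into fixed-length frames.

    A hop_len smaller than frame_len yields partially overlapping frames.
    A trailing partial frame is dropped (an assumption; the source is silent).
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# Ten samples, frames of four samples, advancing two samples at a time.
frames = split_into_frames(list(range(10)), frame_len=4, hop_len=2)
```

With these toy values each frame shares half of its samples with the next one.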

Next, a feature sequence is extracted for each frame (304). In FIG. 2, 202 denotes the extracted feature sequences. A typical speech recognition system performs recognition directly on these feature sequences.

In the present invention, however, the similarity between each feature sequence and its neighboring feature sequence is examined, and when similar feature sequences occur consecutively, one representative feature sequence (203) that can represent them is obtained (305). Speech recognition is then performed on the representative feature sequences (307) and the recognition result is output (308).

In addition, the present invention further performs a step (306) of adding to the representative feature sequence (203), as a feature, the number (204) of original feature sequences that it represents, and then performs speech recognition using the duration of the speech (307) and outputs the recognition result (308).
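Step 306 can be pictured as appending one extra dimension to each representative feature sequence. The sketch below assumes the count is simply concatenated as a final element; the source does not say how the count is encoded or scaled, and the name `append_duration_feature` is hypothetical.

```python
def append_duration_feature(representatives, counts):
    """Append to each representative the number of original frames it stands for."""
    if len(representatives) != len(counts):
        raise ValueError("one count is required per representative")
    return [list(rep) + [float(n)] for rep, n in zip(representatives, counts)]


# Two representatives that stand for 3 frames and 1 frame respectively.
augmented = append_duration_feature([[1.0, 2.0], [3.0, 4.0]], [3, 1])
```

The recognizer then sees the run length as an ordinary feature, which is how duration information reaches a frame-based model.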

Looking more closely at how the similarity between feature sequences is used: if the similarities among two or more consecutive feature sequences are all greater than a predetermined value, those consecutive feature sequences are expressed as one representative feature sequence. The representative feature sequence is defined as the average of the feature sequences it replaces.
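A minimal sketch of this merging rule is given below. It assumes that "all similarities greater than a predetermined value" is checked between adjacent feature sequences only, which is one possible reading of the text. The names `compress_frames` and `toy_similarity` and the toy data are illustrative, and the similarity measure is passed in as a parameter so that any measure can be used.

```python
def compress_frames(features, similarity, threshold):
    """Merge runs of consecutive similar feature sequences.

    Returns (representatives, counts): each representative is the
    element-wise average of one run, and counts holds the run lengths.
    """
    if not features:
        return [], []
    runs = [[features[0]]]
    for prev, cur in zip(features, features[1:]):
        if similarity(prev, cur) > threshold:
            runs[-1].append(cur)   # still similar: extend the current run
        else:
            runs.append([cur])     # dissimilar: start a new run
    representatives = [[sum(col) / len(run) for col in zip(*run)] for run in runs]
    counts = [len(run) for run in runs]
    return representatives, counts


def toy_similarity(a, b):
    """Stand-in measure: 1.0 if the first dimensions are close, else 0.0."""
    return 1.0 if abs(a[0] - b[0]) < 1.0 else 0.0


feats = [[1.0, 0.0], [1.5, 0.5], [5.0, 1.0], [5.25, 1.0], [5.5, 1.0]]
reps, counts = compress_frames(feats, toy_similarity, threshold=0.5)
```

Here five frames collapse into two representatives, and `counts` records how many original frames each one replaces.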

As examples of how to measure the similarity of feature sequences, the following three methods are proposed.

The first example uses the inverse of the Euclidean distance.

The similarity between feature sequences A and B is

$$\text{similarity}(A, B) = \frac{1}{\sqrt{\sum_{i=1}^{N} (A_i - B_i)^2}}$$

where $A_i$ is the value of the $i$-th dimension of feature sequence A, $B_i$ is the value of the $i$-th dimension of feature sequence B, and $N$ is the number of dimensions of the feature sequences.
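Taking "inverse of the Euclidean distance" literally, the first measure can be sketched as below. Returning infinity for identical feature sequences is an assumption made here to avoid division by zero; the source does not address that case, and the function name is illustrative.

```python
import math


def inverse_euclidean_similarity(a, b):
    """Similarity as the reciprocal of the Euclidean distance between a and b."""
    if len(a) != len(b):
        raise ValueError("feature sequences must have the same dimension")
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Identical inputs have zero distance; treat them as maximally similar.
    return math.inf if distance == 0.0 else 1.0 / distance


sim = inverse_euclidean_similarity([0.0, 0.0], [3.0, 4.0])  # distance is 5.0
```

Because the value is unbounded near zero distance, the merging threshold has to be chosen in the units of inverse distance rather than on a 0 to 1 scale.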

The second example uses the output of a multilayer perceptron artificial neural network.

As shown in FIG. 4, two feature sequences (401) are fed to the neural network as input (402), and a single output node (403) is provided whose value is taken as the similarity. The target values (target patterns) for training the network are obtained by applying the Viterbi algorithm to the training speech data to find the state boundaries of each hidden Markov model of the speech recognition system; feature sequences belonging to the same state are given a target similarity of 1, and feature sequences belonging to different states are given a target similarity of 0.
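The sketch below shows how such training targets might be assembled and how a trained network would be queried. Several points are assumptions rather than details from the source: pairs are formed from adjacent frames only, the per-frame state labels in `state_ids` are taken as already produced by a Viterbi alignment, and the network has one hidden layer with sigmoid units. Training itself is omitted, and the weights shown are arbitrary placeholders.

```python
import math


def make_training_pairs(features, state_ids):
    """Pair each frame with its successor; target is 1.0 for the same state, else 0.0."""
    if len(features) != len(state_ids):
        raise ValueError("one state label is required per frame")
    pairs = []
    for i in range(len(features) - 1):
        inputs = list(features[i]) + list(features[i + 1])  # two sequences as one input
        target = 1.0 if state_ids[i] == state_ids[i + 1] else 0.0
        pairs.append((inputs, target))
    return pairs


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp_similarity(inputs, hidden_weights, hidden_biases, out_weights, out_bias):
    """Forward pass of a one-hidden-layer perceptron with a single output node."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(hidden_weights, hidden_biases)]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)


pairs = make_training_pairs([[0.0], [0.1], [0.9]], state_ids=[3, 3, 4])
score = mlp_similarity(pairs[0][0], [[1.0, -1.0]], [0.0], [2.0], 0.0)
```

The sigmoid output lies strictly between 0 and 1, so it can be compared directly against a merging threshold such as 0.5.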

The third example is as follows.

Feature sequences are obtained for the training speech data, and then the difference between each pair of consecutive feature sequences is computed. Vector quantization is performed on these difference vectors, and a Gaussian distribution function is obtained for each group. Each Gaussian function is then labeled 'same' or 'different' according to whether the majority of the differences in its group come from feature sequences in the same state or not. At recognition time, the difference between feature sequences is input to each distribution function to obtain a probability density, and the similarity is set according to the label of the distribution function with the highest probability density: 1 for 'same' and 0 for 'different'.
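The third measure can be sketched end to end as below. The source names only "vector quantization" and "Gaussian distribution function", so the specifics here are assumptions: plain k-means (Lloyd's algorithm) as the quantizer, diagonal-covariance Gaussians with a variance floor, and ties in the majority vote resolved as 'different'. All function names and the toy data are illustrative.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's algorithm; returns one cluster index per vector."""
    centers = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    assign = [0] * len(vectors)
    for _ in range(iterations):
        assign = [min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def fit_labelled_gaussians(diffs, same_state, k, var_floor=1e-3):
    """Fit one diagonal Gaussian per cluster and label it by majority vote."""
    if not diffs:
        raise ValueError("at least one difference vector is required")
    assign = kmeans(diffs, k)
    models = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if not idx:
            continue  # skip empty clusters
        members = [diffs[i] for i in idx]
        mean = [sum(col) / len(members) for col in zip(*members)]
        var = [max(sum((x - m) ** 2 for x in col) / len(members), var_floor)
               for col, m in zip(zip(*members), mean)]
        votes_same = sum(1 for i in idx if same_state[i])
        label = 1.0 if votes_same * 2 > len(idx) else 0.0  # 1.0 = 'same'
        models.append((mean, var, label))
    return models


def log_density(x, mean, var):
    """Log of a diagonal Gaussian density; its ordering matches the density's."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gaussian_similarity(diff, models):
    """Return the label of the Gaussian with the highest density at diff."""
    _, _, label = max(models, key=lambda m: log_density(diff, m[0], m[1]))
    return label


# Toy training data: small differences within a state, large ones across states.
diffs = [[0.0], [0.1], [-0.1], [2.0], [2.1], [1.9]]
same = [True, True, True, False, False, False]
models = fit_labelled_gaussians(diffs, same, k=2)
```

Comparing log densities rather than raw densities selects the same Gaussian while avoiding numerical underflow in higher dimensions.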

The present invention described above is not limited by the foregoing embodiment and the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the invention belongs that various substitutions, modifications, and changes are possible without departing from the technical spirit of the invention.

As described above, the present invention can greatly improve the execution speed and recognition performance of existing frame-based speech recognition systems.

That is, whereas existing speech recognition systems perform recognition on frames taken at constant time intervals, the present invention obtains one representative frame (representative feature sequence) for similar frames and performs recognition on the representative frames, which reduces the number of frames the speech recognition system must process and improves processing speed.

In addition, the present invention adds to the speech recognition features obtained from a frame, as a further feature, the number of actual frames that the frame represents. Unlike existing speech recognition systems, it thereby uses speech length information to improve recognition performance (recognition rate).

Claims

A frame compression method applied to a frame compression apparatus, the method comprising:

a first step of dividing an input signal into frames at predetermined time intervals;

a second step of extracting a feature sequence for each of the frames;

a third step of obtaining the similarity between the extracted feature sequences; and

a fourth step of reducing the number of frames by using the obtained similarity to obtain a representative feature sequence that represents similar feature sequences.

The method of claim 1, further comprising:

a fifth step of adding to the representative feature sequence, as a feature, the number of original (actual) feature sequences that it represents.

The method according to claim 1 or 2,

wherein, in using the similarity between feature sequences obtained in the third step,

if the similarities among two or more consecutive feature sequences are all greater than a predetermined value, the consecutive feature sequences are expressed as one representative feature sequence.

The method of claim 3,

wherein the representative feature sequence

is the average of the consecutive feature sequences whose similarities are greater than the predetermined value.

The method according to claim 1 or 2,

wherein the third step

uses the inverse of the Euclidean distance.

The method of claim 5,

wherein the third step obtains the similarity between feature sequences A and B by the following equation:

$$\text{similarity}(A, B) = \frac{1}{\sqrt{\sum_{i=1}^{N} (A_i - B_i)^2}}$$

(where $A_i$ is the value of the $i$-th dimension of feature sequence A, $B_i$ is the value of the $i$-th dimension of feature sequence B, and $N$ is the number of dimensions of the feature sequences)

The method according to claim 1 or 2,

wherein the third step

uses the output of a multilayer perceptron artificial neural network.

The method of claim 7,

wherein the multilayer perceptron artificial neural network

uses target values derived from the state boundaries of a hidden Markov model.

The method according to claim 1 or 2,

wherein the third step

uses the magnitude of the probability density of a Gaussian distribution function.

The method of claim 9,

wherein the third step comprises:

a sixth step of obtaining feature sequences for training speech data and then obtaining the difference between each two consecutive feature sequences;

a seventh step of performing vector quantization on the obtained feature sequence differences and obtaining a Gaussian distribution function for each group;

an eighth step of determining, for each obtained Gaussian function, by majority whether the feature sequence differences belonging to its group come from the same state or not; and

a ninth step of inputting a difference of feature sequences into each distribution function to obtain a probability density, and assigning the similarity according to the result of the eighth step for the distribution function having the highest probability density.

A speech recognition method applied to a speech recognition system, the method comprising:

a first step of dividing a speech input signal into frames at predetermined time intervals;

a second step of extracting a feature sequence for each of the frames;

a third step of obtaining the similarity between the extracted feature sequences;

a fourth step of reducing the number of frames by using the obtained similarity to obtain a representative feature sequence that represents similar feature sequences; and

a fifth step of performing speech recognition on the speech signal with the reduced number of frames and outputting the result.

The method of claim 11, further comprising:

a sixth step of, after the fourth step, adding to the representative feature sequence, as a feature, the number of original (actual) feature sequences that it represents, so that the speech can be recognized using its duration.

The method according to claim 11 or 12,

wherein, if the similarities among two or more consecutive feature sequences are all greater than a predetermined value, the consecutive feature sequences are expressed as one representative feature sequence.

The method of claim 13,

wherein the representative feature sequence

is the average of the consecutive feature sequences whose similarities are greater than the predetermined value.

The method according to claim 11 or 12,

wherein the third step

uses the inverse of the Euclidean distance.

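The method of claim 15,

wherein the third step obtains the similarity between feature sequences A and B by the following equation:

$$\text{similarity}(A, B) = \frac{1}{\sqrt{\sum_{i=1}^{N} (A_i - B_i)^2}}$$

(where $A_i$ is the value of the $i$-th dimension of feature sequence A, $B_i$ is the value of the $i$-th dimension of feature sequence B, and $N$ is the number of dimensions of the feature sequences)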

The method according to claim 11 or 12,

wherein the third step

uses the output of a multilayer perceptron artificial neural network.

The method of claim 17,

wherein the multilayer perceptron artificial neural network

uses target values derived from the state boundaries of a hidden Markov model.

The method according to claim 11 or 12,

wherein the third step

uses the magnitude of the probability density of a Gaussian distribution function.

The method of claim 19,

wherein the third step comprises:

a seventh step of obtaining feature sequences for training speech data and then obtaining the difference between each two consecutive feature sequences;

an eighth step of performing vector quantization on the obtained feature sequence differences and obtaining a Gaussian distribution function for each group;

a ninth step of determining, for each obtained Gaussian function, by majority whether the feature sequence differences belonging to its group come from the same state or not; and

a tenth step of inputting a difference of feature sequences into each distribution function to obtain a probability density, and assigning the similarity according to the result of the ninth step for the distribution function having the highest probability density.

A computer-readable recording medium on which is recorded a program for realizing, in a frame compression apparatus having a processor:

a first function of dividing an input signal into frames at predetermined time intervals;

a second function of extracting a feature sequence for each of the frames;

a third function of obtaining the similarity between the extracted feature sequences; and

a fourth function of reducing the number of frames by using the obtained similarity to obtain a representative feature sequence that represents similar feature sequences.

The recording medium of claim 21, wherein the program further realizes:

a fifth function of adding to the representative feature sequence, as a feature, the number of original (actual) feature sequences that it represents.

A computer-readable recording medium on which is recorded a program for realizing, in a speech recognition system having a processor:

a first function of dividing a speech input signal into frames at predetermined time intervals;

a second function of extracting a feature sequence for each of the frames;

a third function of obtaining the similarity between the extracted feature sequences;

a fourth function of reducing the number of frames by using the obtained similarity to obtain a representative feature sequence that represents similar feature sequences; and

a fifth function of performing speech recognition on the speech signal with the reduced number of frames and outputting the result.

The recording medium of claim 23, wherein the program further realizes:

a sixth function of, after the fourth function, adding to the representative feature sequence, as a feature, the number of original (actual) feature sequences that it represents, so that the speech can be recognized using its duration.