KR20060064494A

KR20060064494A - Method for verifying speech/non-speech and voice recognition apparatus using the same

Info

Publication number: KR20060064494A
Application number: KR1020050069041A
Authority: KR
Inventors: 김갑기; 이성주; 정호영; 김상훈
Original assignee: 한국전자통신연구원
Priority date: 2004-12-08
Filing date: 2005-07-28
Publication date: 2006-06-13
Also published as: KR100737358B1

Abstract

The technical problem addressed by the present invention is to provide a speech/non-speech verification method, and a speech recognition apparatus using it, that distinguish speech from non-speech more clearly. Doing so lowers the load on the speech recognition unit and reduces the recognition errors that occur when a non-speech signal is judged to be speech and passed to recognition.

The present invention provides a speech recognition apparatus comprising: a speech/non-speech verification unit that extracts a feature vector from input speech data and uses a speech/non-speech model to determine whether the feature vector corresponds to speech or non-speech; and a speech recognition unit that recognizes speech from the data corresponding to a section that the speech/non-speech verification unit has determined to be speech.

Description

Method for verifying speech/non-speech and voice recognition apparatus using the same

FIG. 1 is a diagram showing a speech recognition apparatus according to the prior art.

FIG. 2 is a diagram showing a speech recognition apparatus according to an embodiment of the present invention.

FIG. 3 is a diagram for describing an example in which the speech recognition apparatus of FIG. 2 is connected over a network.

FIG. 4 is a diagram for describing the speech/non-speech verification method performed by the speech/non-speech verification unit (22) of FIG. 2.

FIG. 5 is a diagram for describing the initial modeling method of the speech/non-speech model used in the step labeled S43 in FIG. 4.

The present invention relates to a speech/non-speech verification method and a speech recognition apparatus using the same, and more particularly to a speech/non-speech verification method, and a speech recognition apparatus using it, that can reduce the load on a speech recognition unit that requires a large amount of computation.

FIG. 1 is a diagram showing a speech recognition apparatus according to the prior art. Referring to FIG. 1, the prior-art speech recognition apparatus includes a speech endpoint detector (11) and a speech recognition unit (12). The speech endpoint detector (11) detects the start point and end point of a speech signal section; for example, it detects the speech section using the short-time energy and zero crossing rate of the speech signal. The speech recognition unit (12) recognizes speech within the speech section output by the speech endpoint detector (11).

In the prior-art speech recognition apparatus with this configuration, the speech endpoint detector (11) is limited in its ability to distinguish speech signals from non-speech signals. In particular, it does not reliably separate everyday noise, such as machine sounds and music, from speech signals. When the speech endpoint detector (11) fails to separate the two and outputs most input as speech, the speech recognition unit (12) must perform a large amount of computation. In a robot application especially, unlike a push-button scheme in which the user speaks after pressing a button, the apparatus must remain in standby at all times and judge whether each sound it hears is speech or non-speech. If speech and non-speech are not well distinguished, the frequent computation by the speech recognition unit (12) quickly drains the robot's rechargeable battery. In addition, when a non-speech signal is received but judged to be speech, recognition is performed on it and a speech recognition error results.

Accordingly, the technical problem addressed by the present invention is to solve the problems described above by providing a speech/non-speech verification method, and a speech recognition apparatus using it, that distinguish speech from non-speech more clearly, thereby lowering the load on the speech recognition unit and reducing the recognition errors caused by non-speech input.

As a technical means for achieving the above object, a first aspect of the present invention provides a speech recognition apparatus comprising: a speech/non-speech verification unit that extracts a feature vector from input speech data and uses a speech/non-speech model to determine whether the feature vector corresponds to speech or non-speech; and a speech recognition unit that recognizes speech from the data corresponding to a section that the speech/non-speech verification unit has determined to be speech.

A second aspect of the present invention provides a speech/non-speech verification method comprising: (a) extracting a feature vector from frame-based data; (b) making a speech/non-speech decision for each frame using a speech/non-speech model; (c) buffering the speech/non-speech decision values of a plurality of consecutive frames for the length of a window; and (d) making a speech/non-speech decision for each window.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments may be modified in various forms, and the scope of the present invention should not be construed as limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art.

FIG. 2 is a diagram showing a speech recognition apparatus according to an embodiment of the present invention. Referring to FIG. 2, the speech recognition apparatus includes a speech endpoint detector (21), a speech/non-speech verification unit (22), and a speech recognition unit (23).

The speech endpoint detector (21) detects the start point and end point of a speech signal section; for example, it detects the speech section using the short-time energy and zero crossing rate of the speech signal. The speech endpoint detector (21) is an optional component: a speech recognition apparatus consisting only of the speech/non-speech verification unit (22) and the speech recognition unit (23) can also achieve the object of the present invention. Including the speech endpoint detector (21), however, has the advantage of reducing the load on the speech/non-speech verification unit (22).
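The two measurements named above can be sketched as follows. This is a minimal illustration, not the detector described in the patent: the function names, the energy-only endpoint rule, and the threshold parameter are all assumptions made for the example.

```python
def short_time_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def detect_endpoints(frames, energy_threshold):
    """Return (start, end) frame indices of the candidate speech section.

    Hypothetical rule: a frame is active when its short-time energy
    reaches the threshold. Returns None when no frame is active.
    """
    active = [
        i for i, frame in enumerate(frames)
        if short_time_energy(frame) >= energy_threshold
    ]
    if not active:
        return None
    return active[0], active[-1]
```

A practical detector would typically combine both measurements and smooth the result over time; the energy-only rule here is kept deliberately simple.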

The speech/non-speech verification unit (22) uses feature vectors to verify which parts of the speech-section data output by the speech endpoint detector are speech and which are non-speech, and thereby finally detects the speech section.

Examples of the feature vectors extracted by the speech/non-speech verification unit (22) include filter bank energy, pitch, the variation in energy between filter banks within a frame, the variation in filter bank energy between frames, and mel filter bank coefficients. The speech/non-speech verification unit (22) uses a speech/non-speech model to determine whether a feature vector corresponds to speech or non-speech. When it detects a speech section, it outputs at least one of the speech data corresponding to the speech section and the feature vector corresponding to the speech section to the speech recognition unit (23). If the speech/non-speech verification unit (22) outputs the feature vector and the speech recognition unit (23) uses it for recognition, the speech recognition unit (23) has the advantage of not needing a separate component for extracting feature vectors.

The speech recognition unit (23) recognizes speech within the speech section using at least one of the speech data and the feature vector output by the speech/non-speech verification unit (22).

FIG. 3 is a diagram for describing an example in which the speech recognition apparatus of FIG. 2 is connected over a network. Referring to FIG. 3, the speech recognition apparatus includes a speech recognition server (31) and at least one client (32A, 32B, 32C).

The speech recognition server (31) is connected to the at least one client (32A, 32B, 32C) through communication and includes at least the speech recognition unit (23).

Each client (32A, 32B, 32C) is connected to the speech recognition server (31) through communication and includes a microphone (33A, 33B, 33C), a speech endpoint detector (21A, 21B, 21C), and a speech/non-speech verification unit (22A, 22B, 22C). The clients (32A, 32B, 32C) may preferably be robots.

In the speech recognition apparatus shown in the figure, the microphones (33A, 33B, 33C) must be located at the clients (32A, 32B, 32C) and the speech recognition unit (23) must be located at the speech recognition server (31). The other components can be placed in several ways. The speech endpoint detectors (21A, 21B, 21C) and the speech/non-speech verification units (22A, 22B, 22C) may both be located at the clients (32A, 32B, 32C), as in the figure. Alternatively, the speech endpoint detectors (21A, 21B, 21C) may be located at the clients while the speech/non-speech verification units (22A, 22B, 22C) are located at the speech recognition server (31), or both may be located at the speech recognition server (31).

When the speech endpoint detectors (21A, 21B, 21C) are located at the speech recognition server (31), a separate speech endpoint detector may be provided for each client (32A, 32B, 32C), or a single speech endpoint detector may serve a plurality of clients. Likewise, when the speech/non-speech verification units (22A, 22B, 22C) are located at the speech recognition server (31), a separate verification unit may be provided for each client, or a single verification unit may serve a plurality of clients.

When only the microphones (33A, 33B, 33C) are located at the clients (32A, 32B, 32C), the clients must communicate with the speech recognition server (31) constantly or very frequently, which places a heavy load on communication. The problem is especially serious in applications such as robots that do not operate in a push-button manner. It is therefore preferable, because it reduces the communication load, either to place the speech endpoint detectors (21A, 21B, 21C) at the clients and the speech/non-speech verification units (22A, 22B, 22C) at the speech recognition server (31), or to place both at the clients as in the figure. If the speech endpoint detectors (21A, 21B, 21C) are not used, it is preferable to place the speech/non-speech verification units (22A, 22B, 22C) at the clients (32A, 32B, 32C), since this reduces the communication load.

FIG. 4 is a diagram for describing the speech/non-speech verification method performed by the speech/non-speech verification unit (22) of FIG. 2. Referring to FIG. 4, the method includes a first buffering step (S41), a feature vector extraction step (S42), a model improvement step (S43), a frame classification step (S44), a second buffering step (S45), and a window classification step (S46).

In the first buffering step (S41), the input speech data is buffered and then output one frame at a time. The frame length is a length suitable for extracting feature vectors, for example 20 ms, and frames may be formed so that each frame overlaps the next by 10 ms.
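The framing described for step S41 can be sketched as below, using the 20 ms frame length and 10 ms overlap given as examples in the text. The function name, its parameters, and the choice to drop a trailing partial frame are assumptions for illustration.

```python
def split_into_frames(samples, sample_rate, frame_ms=20, hop_ms=10):
    """Cut a sample sequence into overlapping fixed-length frames.

    With frame_ms=20 and hop_ms=10, consecutive frames share 10 ms.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("sample_rate too low for the requested frame sizes")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames
```

At a 16 kHz sampling rate, for instance, this yields 320-sample frames advanced by 160 samples.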

In the feature vector extraction step (S42), at least one feature is extracted, for example filter bank energy, pitch, the variation in energy between filter banks within a frame, the variation in filter bank energy between frames, or mel filter bank coefficients. Filter bank energy is the energy computed for the required frequency bands within the frequency range of the speech data. A filter extracts only a required frequency band, and a filter bank is a set of such filters. Filter bank energy is expressed as one real value per filter. The variation in energy between filter banks within a frame is the energy difference between filter banks in a single frame, and it is expressed with one fewer value than the number of filters. The variation in filter bank energy between frames is the difference between the energy values produced by the same filter in adjacent frames of the frame sequence over time. Mel filter bank coefficients, also called MFCCs (Mel Frequency Cepstrum Coefficients), are obtained by computing energy values with filters mapped from the frequency scale to the mel frequency scale and then applying an inverse fast Fourier transform to those values.
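The two variation features can be illustrated with a short sketch that starts from already computed filter bank energies. The function names are hypothetical, and the sign convention (later value minus earlier value) is an assumption, since the text only speaks of a difference.

```python
def intra_frame_deltas(energies):
    """Differences between neighboring filter bank energies in one frame.

    For N filters this returns N - 1 values, matching the description
    of one fewer value than the number of filters.
    """
    return [b - a for a, b in zip(energies, energies[1:])]


def inter_frame_deltas(prev_energies, cur_energies):
    """Per-filter energy difference between two adjacent frames."""
    if len(prev_energies) != len(cur_energies):
        raise ValueError("frames must use the same number of filters")
    return [c - p for p, c in zip(prev_energies, cur_energies)]
```

For a three-filter frame with energies `[1.0, 4.0, 2.0]`, `intra_frame_deltas` returns the two values `[3.0, -2.0]`.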

In the model improvement step (S43), the speech/non-speech model is remodeled into an optimized model using an adaptation technique. At least one of methods such as eigenvoice, MLLR (Maximum Likelihood Linear Regression), and MAP (Maximum A Posteriori) may be used as the adaptation technique. The remodeling may also be performed on-line. When the speech/non-speech model is improved in this way, the speech/non-speech decision can become more accurate. The model improvement step (S43) is optional and may be performed as needed.
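The text names MAP adaptation without giving formulas. As one hedged illustration, the sketch below applies the common textbook MAP update to the mean of a single Gaussian, where a relevance factor controls how strongly the prior mean resists new data. The function name, the single-Gaussian simplification, and the default relevance value are assumptions, not details from the patent.

```python
def map_adapt_mean(prior_mean, observations, relevance=16.0):
    """MAP-style update of a Gaussian mean from new feature vectors.

    new_mean = (relevance * prior_mean + sum(observations))
               / (relevance + number_of_observations)

    With no observations the prior mean is returned unchanged; with
    many observations the result approaches their sample mean.
    """
    dim = len(prior_mean)
    sums = [0.0] * dim
    for x in observations:
        for d in range(dim):
            sums[d] += x[d]
    n = len(observations)
    return [
        (relevance * prior_mean[d] + sums[d]) / (relevance + n)
        for d in range(dim)
    ]
```

Because the update needs only running sums and a count, it lends itself to the on-line remodeling mentioned in the text.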

In the frame classification step (S44), a speech/non-speech decision is made for each frame using the speech/non-speech model. When the model was built with a statistical method, the decision is made using at least one of methods such as GMM (Gaussian Mixture Model), HMM (Hidden Markov Model), SVM (Support Vector Machine), and NN (Neural Network); when the model was built with a rule-based method, the decision for the frame is made using the rules. The decision value may be output as a hard decision, in which simply one of two values (speech or non-speech) is output, or as a soft decision, in which a real number expresses how close the frame is to speech or non-speech.
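One way to picture the hard and soft decisions is with a pair of diagonal-covariance GMMs, one for speech and one for non-speech, scored by log-likelihood ratio. This is a sketch under assumptions: the patent lists GMM as one option but does not specify the scoring rule, and the data layout (a list of `(weight, mean, variance)` tuples) and the zero threshold are chosen here for illustration.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, components):
    """Log-likelihood under a GMM given as (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def soft_decision(x, speech_gmm, nonspeech_gmm):
    """Real-valued score: positive means closer to speech."""
    return gmm_log_likelihood(x, speech_gmm) - gmm_log_likelihood(x, nonspeech_gmm)


def hard_decision(x, speech_gmm, nonspeech_gmm):
    """1 for speech, 0 for non-speech (threshold at zero is an assumption)."""
    return 1 if soft_decision(x, speech_gmm, nonspeech_gmm) > 0.0 else 0
```

The soft score can be passed on unchanged to the window-level decision, while the hard decision discards the margin and keeps only the label.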

In the second buffering step (S45), the speech/non-speech decision values produced in the frame classification step (S44) are buffered for consecutive frames over the length of a window. A window is a set of consecutive frames, and an appropriate size such as 300 ms to 1000 ms is used as needed. The feature vectors extracted in the feature vector extraction step (S42) may additionally be buffered in the second buffering step (S45).

In the window classification step (S46), the final speech/non-speech decision is made for each window. The decision may be made with a rule-based method, that is, using a threshold, or with a statistical method, that is, using a classifier such as GMM, HMM, SVM, or NN.
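The buffering and threshold-based window decision can be sketched together as follows. The class name, the sliding (rather than block-wise) window, and the rule "speech when at least half the frames are speech" are assumptions for illustration; the text only states that a threshold may be used.

```python
from collections import deque


class WindowVerifier:
    """Buffers per-frame hard decisions and decides per window."""

    def __init__(self, window_frames, min_speech_ratio=0.5):
        # window_frames: number of frames per window, e.g. about 30
        # frames for a 300 ms window with a 10 ms frame step (assumed).
        self.buffer = deque(maxlen=window_frames)
        self.min_speech_ratio = min_speech_ratio

    def push(self, frame_decision):
        """Add one frame decision (1 = speech, 0 = non-speech).

        Returns None until the window is full, then True for speech
        or False for non-speech.
        """
        self.buffer.append(frame_decision)
        if len(self.buffer) < self.buffer.maxlen:
            return None
        ratio = sum(self.buffer) / len(self.buffer)
        return ratio >= self.min_speech_ratio
```

Deciding over a window rather than a single frame means that a few misclassified frames do not flip the final result.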

By performing these steps, it can be determined whether the input data is speech or non-speech, and the speech section can be set accordingly so that the corresponding data (the speech signal and/or the feature vectors) is passed on. In particular, because the decision uses feature vectors, speech and non-speech can be distinguished more accurately than by a speech endpoint detector that uses short-time energy and/or zero crossing rate.

FIG. 5 is a diagram for describing the initial modeling method of the speech/non-speech model used in the step labeled S43 in FIG. 4. The initial modeling method is performed before the speech/non-speech verification method shown in FIG. 4, and the speech/non-speech model it produces is then used in the verification method. The modeling process preferably builds each model precisely, off-line, using feature vectors for speech and for non-speech. Referring to FIG. 5, the initial modeling method includes a buffering step (S51), a feature vector extraction step (S52), and a speech/non-speech modeling step (S53).

In the buffering step (S51), the input speech data is buffered and then output one frame at a time. The frame length is a length suitable for extracting feature vectors and may be the same as the frame length used in the first buffering step of FIG. 4.

In the feature vector extraction step (S52), feature vectors are extracted, for example filter bank energy, pitch, the variation in energy between filter banks within a frame, the variation in filter bank energy between frames, and mel filter bank coefficients.

In the speech/non-speech modeling step (S53), the model may be built with a statistical method, by defining rules for a rule-based model, or with a hybrid of the two. Statistical methods include vector quantization and methods based on Gaussian modeling, and discriminative training may be used to improve their discriminative power.
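As a minimal sketch of a Gaussian-based statistical model, the code below fits one diagonal-covariance Gaussian to labeled speech vectors and another to labeled non-speech vectors. The single-Gaussian form, the function names, and the variance floor are assumptions for illustration; the patent does not fix the model structure.

```python
def fit_diagonal_gaussian(vectors, var_floor=1e-6):
    """Estimate per-dimension mean and variance from feature vectors.

    var_floor keeps variances strictly positive so that later
    likelihood computations do not divide by zero.
    """
    if not vectors:
        raise ValueError("at least one feature vector is required")
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var


def train_models(speech_vectors, nonspeech_vectors):
    """Build the initial speech and non-speech models off-line."""
    return {
        "speech": fit_diagonal_gaussian(speech_vectors),
        "nonspeech": fit_diagonal_gaussian(nonspeech_vectors),
    }
```

A mixture of several such Gaussians per class, or a discriminatively trained classifier, would be natural extensions of this starting point.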

Tables 1 and 2 below compare the performance of the speech recognition apparatus according to an embodiment of the present invention with that of the prior-art speech recognition apparatus.

Table 1
Utterances | Recognition errors | Input rejections | Error rate
359 | 359 | 0 | 100%

Table 2
Utterances | Recognition errors | Input rejections | Error rate
359 | 62 | 297 | 17%

Table 1 shows how often the prior-art speech recognition apparatus treats noise as speech and produces an error when noise, not speech, is input. As Table 1 shows, of 359 noise inputs, none was rejected as non-speech; all were treated as speech and resulted in recognition errors. The error rate is therefore 100%.

Table 2 shows how often the speech recognition apparatus according to the present invention treats noise as speech and produces an error when noise, not speech, is input. As Table 2 shows, of 359 noise inputs, 297 were rejected as non-speech and 62 were treated as speech and resulted in recognition errors. The error rate is therefore 17%, which clearly shows that the speech recognition apparatus according to the present invention reduces speech recognition errors by rejecting non-speech noise.
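The error rates in Tables 1 and 2 are the number of recognition errors divided by the number of noise inputs. The short check below reproduces them from the counts in the tables; the function name is hypothetical.

```python
def error_rate(noise_inputs, recognition_errors):
    """Fraction of noise inputs that were wrongly recognized as speech."""
    return recognition_errors / noise_inputs


# Counts taken from Tables 1 and 2.
print(f"Prior art: {error_rate(359, 359):.0%}")         # 100%
print(f"Present invention: {error_rate(359, 62):.0%}")  # 17%
```

The rejected inputs account for the remainder: 359 - 62 = 297.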

Compared with the prior art, the speech/non-speech verification method according to the present invention and the speech recognition apparatus using it distinguish speech sections from non-speech sections clearly. This reduces both the load on the computation-heavy speech recognition unit and the time required for speech recognition, and it reduces the recognition errors caused by non-speech input.

In addition, when a speech recognition server recognizes speech signals originating from a plurality of clients (such as household robots), the speech/non-speech verification method according to the present invention and the speech recognition apparatus using it can reduce the load on, or the number of, speech recognition servers and can improve recognition performance.

Claims

A speech recognition apparatus comprising: a speech/non-speech verification unit that extracts a feature vector from input speech data and uses a speech/non-speech model to determine whether the feature vector corresponds to speech or non-speech; and

a speech recognition unit that recognizes speech from data corresponding to a section determined to be speech by the speech/non-speech verification unit.

The speech recognition apparatus of claim 1,

wherein the feature vector is at least one of filter bank energy, pitch, the variation in energy between filter banks within a frame, the variation in filter bank energy between frames, and mel filter bank coefficients.

The speech recognition apparatus of claim 1,

wherein the data corresponding to the section determined to be speech is at least one of a feature vector and speech data.

The speech recognition apparatus according to any one of claims 1 to 3,

wherein the speech/non-speech verification unit is located in a client and the speech recognition unit is located in a speech recognition server.

The speech recognition apparatus according to any one of claims 1 to 3,

further comprising a speech endpoint detector that detects a speech section using at least one of short-time energy and zero crossing rate and passes data corresponding to the speech section to the speech/non-speech verification unit.

The speech recognition apparatus of claim 5, wherein

the speech endpoint detector is located in a client, the speech recognition unit is located in a speech recognition server, and the speech/non-speech verification unit is located in the client or the speech recognition server.

A speech/non-speech verification method comprising: (a) extracting a feature vector from frame-based data;

(b) making a speech/non-speech decision for each frame using a speech/non-speech model;

(c) buffering the speech/non-speech decision values of a plurality of consecutive frames for the length of a window; and

(d) making a speech/non-speech decision for each window.

The method of claim 7, wherein

in step (b), the speech/non-speech decision is made using at least one of a statistical method and a rule-based method.

The method of claim 7, wherein

step (c) additionally buffers the feature vector.

The method of claim 7, wherein

in step (d), the speech/non-speech decision is made using at least one of a statistical method and a rule-based method.

The method according to any one of claims 7 to 11,

further comprising, before step (a),

(e) buffering the input speech data and outputting it as frame-based data.

The method according to any one of claims 7 to 11,

further comprising, after step (a),

(f) remodeling the speech/non-speech model with an adaptation technique using the feature vector.

The method according to any one of claims 7 to 11,

further comprising, before step (a),

(g) initializing the speech/non-speech model.

The method of claim 14,

wherein step (g) comprises:

buffering input speech data for initialization in units of frames;

extracting a feature vector for initialization from the buffered frame-based input speech data for initialization; and

creating a speech/non-speech model using the feature vector for initialization.