KR960001950B1

KR960001950B1 - Voice recognizing method and its device

Info

Publication number: KR960001950B1
Application number: KR1019920020262A
Authority: KR
Inventors: 김재영
Original assignee: 삼성전자주식회사; 윤종용
Priority date: 1992-10-30
Filing date: 1992-10-30
Publication date: 1996-02-08
Also published as: KR940009928A

Abstract

The voice recognition method comprises the steps of: storing a plurality of reference voice signals composed of frame data; comparing each reference voice signal one by one with a plurality of test voice signals and storing, for each reference voice signal, a plurality of path patterns together with the difference between the number of frames of the reference voice signal and that of the test voice signal; generating the number of frames and the frame data from an input voice signal; calculating frame data difference values between the input voice signal and the reference voice signal and accumulating them for each path pattern; and detecting the reference voice signal that matches the input voice signal by using the number of frames of the input voice signal, the number of frames of the reference voice signal, and the accumulated frame data difference values.

Description

Voice recognition method and device

FIG. 1 is a flowchart illustrating a conventional word recognition method using dynamic time warping (DTW).

FIG. 2 is a conceptual diagram showing the magnitude difference values between data produced by comparing the data magnitudes of a reference voice signal and a test voice signal.

FIG. 3 is a conceptual diagram showing the method of reducing the amount of data computation used in the conventional word recognition method.

FIG. 4 is a block diagram showing a voice recognition system implementing a preferred embodiment of the present invention.

FIG. 5 is a flowchart illustrating the process of extracting the minimum average difference value between the compared voice signals in the voice recognition method of the present invention.

* Explanation of reference numerals for main parts of the drawings

41: voice signal input unit 42: shape extraction unit

43: voice pattern comparison unit 44: reference pattern storage unit

45: recognition word output unit

The present invention relates to a voice recognition method and a device therefor, and more particularly to a voice recognition method and device that determine whether two voice signals match, based on a comparison between the voice signal input to a voice recognition system and a reference voice signal stored in the system.

A voice recognition system recognizes the meaning of an input voice signal by comparing a voice signal spoken by a person (hereinafter, the 'test voice signal') with reference voice signals stored in the system. Two methods are widely used in voice recognition systems: dynamic time warping (DTW) and the hidden Markov model (HMM). DTW is used mainly for word recognition (isolated word recognition), and the HMM is used for sentence recognition (connected word recognition).

In general, unlike mechanical signals, words spoken by people are pronounced with different durations by different speakers. For example, when pronouncing the word '시작' ('start'), some people lengthen the syllable '시', while others pronounce the syllable '작' relatively longer than '시'. Moreover, even when a word is pronounced with the same durations, the loudness at each point in time differs, so even when test voice signals are sampled for the same word, the data at a given sampling time also varies from speaker to speaker.

A voice recognition method, in particular a word recognition method, that takes into account the differences in voice data distribution caused by sound duration and loudness is described with reference to FIGS. 1 to 3.

FIG. 1 is a flowchart showing a conventional word recognition method using dynamic time warping. FIG. 2 is a conceptual diagram showing the magnitude difference values between data produced by comparing the data magnitudes of a reference voice signal and a test voice signal; it shows the difference values obtained by comparing a reference voice signal of 5 frames with a test voice signal of 4 frames. The difference values shown in FIG. 2 are set arbitrarily for ease of explanation.

FIG. 3 is a conceptual diagram for explaining the method of reducing the amount of data computation used in the conventional word recognition method. FIG. 3(a) distinguishes the region in which data comparison is performed from the region in which it is not; the region inside the window indicated by the dotted line is the region in which data comparison is performed. FIG. 3(b) illustrates the local path constraint between frames.

To compare a test voice signal with a reference voice signal, the voice recognition system samples the input test voice signal at a predetermined sampling interval. In general, a predetermined number of samples forms one frame, and a number of frames make up one voice signal. The representative value of each frame is generated from the samples belonging to that frame. There are various ways to determine the number of samples per frame and to calculate the data representing a frame, but since these are known to those skilled in the art, a detailed description is omitted. The words stored in the system are sampled at the same interval, so the term 'frame' is also used for the sampled data of the reference voice signals. In the following description, the reference voice signal is assumed to have 5 frames and the test voice signal 4 frames.
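The framing described above can be sketched as follows. The text leaves the frame size and the representative value unspecified, so the frame size of 4 samples and the use of the mean absolute amplitude are assumptions made only for illustration, as are the function names and sample values.

```python
from typing import List


def split_into_frames(samples: List[float], frame_size: int) -> List[List[float]]:
    """Group consecutive samples into frames of frame_size samples.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_frames = len(samples) // frame_size
    return [samples[k * frame_size:(k + 1) * frame_size] for k in range(n_frames)]


def representative_values(frames: List[List[float]]) -> List[float]:
    """One value per frame; mean absolute amplitude is assumed here."""
    return [sum(abs(s) for s in frame) / len(frame) for frame in frames]


# Hypothetical sampled signal: 12 samples, 4 samples per frame.
samples = [0, 2, -2, 4, 6, -6, 8, 8, 1, -1, 3, -3]
frames = split_into_frames(samples, frame_size=4)
frame_data = representative_values(frames)

print(len(frames))   # number of frames
print(frame_data)    # frame data used in later comparisons
```

Any other per-frame feature could be substituted for the mean absolute amplitude without changing the later comparison steps, which only need one comparable value per frame.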

The voice recognition system compares every stored word against the four-frame test voice signal according to the flowchart of FIG. 1 and calculates an average difference value for each. In the flowchart of FIG. 1, once the number of test frames (Nt), obtained by sampling the test voice signal, and the number of reference frames (Nr) of the reference voice signal are read (step 11), the frame number (J) of the reference voice signal and the frame number (I) of the test voice signal are each initialized to 1 (steps 12 and 13). The data of the reference voice signal at frame number J = 1 and the data of the test voice signal at frame number I = 1 are compared in magnitude to produce a data magnitude difference value (step 14). In step 15, the difference values are accumulated separately for every path that connects the position I = J = 1 to the position I = 4, J = 5 in the horizontal, vertical, and diagonal directions of FIG. 2. The data magnitude difference value produced for frame numbers I = J = 1 is shown as 2 in FIG. 2. When the accumulation step is performed for the first time, the accumulated value is simply the difference value itself.
When the accumulation step (step 15) is complete, the frame number (I) of the test voice signal is compared with the total number of frames (Nt) of the test voice signal (step 16). Since the frame number (I = 1) is smaller than the total number of frames (Nt = 4), 1 is added to I in step 17 and step 14 is performed again to calculate the next data difference value. The difference value produced by the second execution of step 14 is 7 in FIG. 2, so after the subsequent step 15 the accumulated difference value is 9.
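The nested loop of steps 12 to 19 can be sketched as below. The frame values are hypothetical; they were chosen only so that the first row reproduces the differences 2 and 7 and the running total 9 quoted in the text. The remaining values of FIG. 2 are not recoverable from the text and are not reproduced.

```python
from itertools import accumulate


def difference_grid(reference, test):
    """grid[j][i] = |reference[j] - test[i]| for every frame pair (step 14)."""
    grid = []
    for j in range(len(reference)):      # J = 1..Nr (steps 12, 18, 19)
        row = []
        for i in range(len(test)):       # I = 1..Nt (steps 13, 16, 17)
            row.append(abs(reference[j] - test[i]))
        grid.append(row)
    return grid


# Hypothetical frame data: Nr = 5 reference frames, Nt = 4 test frames.
reference = [3, 9, 4, 6, 2]
test = [1, 10, 11, 11]

grid = difference_grid(reference, test)
first_row_running = list(accumulate(grid[0]))  # accumulation along J = 1

print(grid[0])            # differences for J = 1
print(first_row_running)  # running accumulated values for J = 1
```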

In this way, when the frame number (J) of the reference voice signal is 1, the accumulated difference value obtained along the path connecting frame numbers I = 1 through 4 of the test voice signal is 25. If the comparison in step 16 finds that the frame number (I) is not smaller than the total number of test frames (Nt), the frame number (J) of the reference voice signal is compared with the total number of reference frames (Nr) (step 18). When J is 1, 1 is added to J (step 19), and step 13, which initializes the frame number (I) of the test voice signal to 1, is performed. When the data difference value for J = 2, I = 1 is calculated in step 14 and accumulated in step 15, a new accumulated difference value of 10 is obtained. The accumulation is calculated for every path that connects the start position, where the reference and test voice signals have their initial frame numbers (I = J = 1 in FIG. 2), to the end position, where they have their final frame numbers (I = 4, J = 5 in FIG. 2).
Accordingly, when processing has run through to the step in which the frame number of the reference voice signal (J = 5) is compared with the total number of frames (Nr) (step 18), the data difference values shown in FIG. 2 are obtained, along with accumulated data difference values for the many paths that lead from the start position to the end position in the horizontal, vertical, and diagonal directions. Taking as an example the path marked with a dash-dot line in FIG. 2, the accumulated data difference value is 2 + 1 + 2 + 2 + 1 = 8.
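The text describes accumulating differences along every path and then taking the smallest total. A minimal sketch of the same result, using the usual dynamic-programming recurrence of DTW instead of explicit path enumeration, is given below. The grid is hypothetical: its first row matches the values quoted in the text, and the other cells are invented so that one path sums to 2 + 1 + 2 + 2 + 1 = 8.

```python
INF = float("inf")


def min_accumulated_difference(grid):
    """Smallest accumulated difference over all paths from grid[0][0] to
    grid[-1][-1] that advance horizontally, vertically, or diagonally."""
    n_ref, n_test = len(grid), len(grid[0])
    acc = [[INF] * n_test for _ in range(n_ref)]
    for j in range(n_ref):
        for i in range(n_test):
            if j == 0 and i == 0:
                best_prev = 0
            else:
                best_prev = min(
                    acc[j - 1][i] if j > 0 else INF,                 # vertical move
                    acc[j][i - 1] if i > 0 else INF,                 # horizontal move
                    acc[j - 1][i - 1] if j > 0 and i > 0 else INF,   # diagonal move
                )
            acc[j][i] = grid[j][i] + best_prev
    return acc[-1][-1]


# Hypothetical difference grid: rows are reference frames J = 1..5,
# columns are test frames I = 1..4.
grid = [
    [2, 7, 8, 8],
    [8, 1, 9, 9],
    [9, 2, 9, 9],
    [9, 9, 2, 9],
    [9, 9, 9, 1],
]

print(min_accumulated_difference(grid))
```

The recurrence visits each cell once, so it finds the minimum over all paths without listing them, which is why it is the common way to implement this step.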

If the comparison in step 18 finds that the frame number (J) of the reference voice signal is not smaller than the total number of frames (Nr), step 20 is performed to extract the smallest accumulated difference value. In step 21, the average difference value is obtained by dividing that accumulated difference value by the sum (Nt + Nr) of the total number of frames of the reference voice signal (Nr) and the total number of frames of the test voice signal (Nt). If the smallest accumulated difference value is 8, the average difference value is 8 ÷ (4 + 5) ≈ 0.89. Once average difference values have been calculated for all stored words by this procedure, the voice recognition system recognizes the reference voice signal that yields the smallest average difference value as the word having the same meaning as the input test voice signal.
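Steps 20 and 21 and the final selection can be sketched as below. The candidate words and their accumulated values are invented for illustration; only the 8 ÷ (4 + 5) case comes from the text.

```python
def average_difference(min_accumulated, n_test, n_ref):
    """Step 21: smallest accumulated difference divided by (Nt + Nr)."""
    return min_accumulated / (n_test + n_ref)


def recognize(candidates, n_test):
    """Pick the word whose reference yields the smallest average difference.

    candidates maps a word to (smallest accumulated difference, Nr).
    """
    return min(
        candidates,
        key=lambda word: average_difference(candidates[word][0], n_test, candidates[word][1]),
    )


# Hypothetical vocabulary; "start" uses the values from the text (8, Nr = 5).
candidates = {"start": (8, 5), "stop": (12, 6), "go": (9, 3)}
n_test = 4

print(round(average_difference(8, 4, 5), 2))
print(recognize(candidates, n_test))
```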

Even if loudness and sound duration vary with each speaker's pronunciation, the sampled data for the same word are largely similar. The amount of data processing can therefore be reduced greatly by ignoring the regions that are meaningless for the comparison: in FIG. 2, the region where data with high frame numbers of the reference voice signal are compared with data with low frame numbers of the test voice signal, and the region where data with low frame numbers of the reference voice signal are compared with data with high frame numbers of the test voice signal. For this reason, a voice recognition system using the local path constraint method uses the window indicated by the dotted line in FIG. 3(a), compares only the test voice signal data and reference voice signal data that correspond to each other inside the window, and calculates the difference values between them. In addition, when accumulating the data difference values, it applies a local path constraint: paths that cannot occur between particular frames, such as the path C→B→A in FIG. 3(b), are excluded, and only data lying on the predefined paths, such as A→D→F, B→F, and C→E→F, are accumulated.
As a result, the number of accumulated difference values built from the data difference values inside the window falls sharply, and so does the overall amount of data processing.
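A rough sketch of the window idea follows. The exact window shape of FIG. 3(a) and the allowed transitions of FIG. 3(b) cannot be recovered from the text, so the band around the straight line between the start and end cells, the `half_width` parameter, and the default `steps` set are all assumptions. The grid is the same invented grid used to illustrate a path total of 8.

```python
INF = float("inf")


def in_window(j, i, n_ref, n_test, half_width):
    """True if cell (j, i) lies within half_width test frames of the straight
    line joining the first and last cells (assumed window shape)."""
    expected_i = j * (n_test - 1) / (n_ref - 1) if n_ref > 1 else 0
    return abs(i - expected_i) <= half_width


def windowed_min_accumulated(grid, half_width, steps=((1, 0), (0, 1), (1, 1))):
    """Minimum accumulated difference using only cells inside the window.

    steps lists the allowed (dj, di) moves into a cell; restricting this set
    is one way to express a local path constraint.
    Returns (minimum accumulated difference, number of cells compared).
    """
    n_ref, n_test = len(grid), len(grid[0])
    acc = [[INF] * n_test for _ in range(n_ref)]
    compared = 0
    for j in range(n_ref):
        for i in range(n_test):
            if not in_window(j, i, n_ref, n_test, half_width):
                continue  # cell outside the window: never compared
            compared += 1
            if j == 0 and i == 0:
                acc[j][i] = grid[j][i]
                continue
            prev = [acc[j - dj][i - di] for dj, di in steps
                    if j - dj >= 0 and i - di >= 0]
            acc[j][i] = grid[j][i] + min(prev, default=INF)
    return acc[-1][-1], compared


# Hypothetical 5 x 4 difference grid (rows: reference frames, columns: test frames).
grid = [
    [2, 7, 8, 8],
    [8, 1, 9, 9],
    [9, 2, 9, 9],
    [9, 9, 2, 9],
    [9, 9, 9, 1],
]

result, compared = windowed_min_accumulated(grid, half_width=1)
print(result, compared)
```

With this invented grid the window halves the number of compared cells (10 of 20) while the minimum accumulated difference is unchanged, because the best path stays near the diagonal.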

Despite the reduction in data processing achieved by the local path constraint method, a large amount of processing similar to the flowchart of FIG. 1 is still required between all frames of the reference and test voice signals inside the window. In particular, as the number of words the voice recognition system can recognize grows, the time required for word recognition becomes even longer.

To solve the above problem, an object of the present invention is to provide a voice recognition method that recognizes an input word as follows. A plurality of test voice signals that represent the same word but have different sampled data distributions owing to the speakers' pronunciation characteristics are each compared in data magnitude with the reference voice signal of that word, and the path pattern having the smallest accumulated difference value is obtained for each pair of test and reference voice signals. This process is carried out for every word the system supports, producing a plurality of path patterns. The data magnitudes of the test voice signal and the reference voice signal are then compared only at positions lying on those path patterns, and the resulting accumulated and average difference values are used to recognize the input word.

Another object of the present invention is to provide a voice recognition device that recognizes an input word in the same manner: path patterns having the smallest accumulated difference value are obtained, for every word the system supports, by comparing a plurality of test voice signals of the same word, which differ in sampled data distribution owing to the speakers' pronunciation characteristics, with the reference voice signal of that word; the data magnitudes of the test voice signal and the reference voice signal are then compared only at positions lying on those path patterns; and the resulting accumulated and average difference values are used to recognize the input word.

To achieve this object, the voice recognition method of the present invention comprises the steps of: storing a plurality of reference voice signals composed of frame data; for each of the stored reference voice signals, comparing the reference voice signal one-to-one with a plurality of test voice signals that represent the same word but differ in sound duration and loudness, and storing the plurality of path patterns thus generated for each reference voice signal together with the frame-count difference between the reference voice signal and the test voice signal used to generate each path pattern; generating the number of frames and the frame data from an input voice signal; for each of the stored reference voice signals, calculating the frame data difference values between the input voice signal and the reference voice signal on the path patterns that correspond to the frame-count difference between the two, and accumulating them for each path pattern; and detecting the reference voice signal that is the same word as the input voice signal by using the frame data difference values accumulated for each path pattern, the number of frames of the input voice signal, and the number of frames of the reference voice signal.

To achieve the other object of the present invention, the voice recognition device comprises: reference pattern storage means for storing a plurality of reference voice signals composed of frame data, a plurality of path patterns generated, for each reference voice signal, by comparing the reference voice signal one-to-one with a plurality of test voice signals that represent the same word but differ in sound duration and loudness, and the frame-count difference between the reference voice signal and the test voice signal used to generate each path pattern; shape extraction means for extracting frame data and the number of frames from an input voice signal; and voice pattern comparison means.
The voice pattern comparison means compares the number of frames of the input voice signal output from the shape extraction means with the number of frames of a reference voice signal stored in the reference pattern storage means to produce a frame-count difference; compares, on each of the path patterns corresponding to that frame-count difference, the mutually corresponding frame data of the reference voice signal and of the input voice signal; accumulates the resulting data difference values for each path pattern; divides each accumulated data difference value by a number derived from the number of frames of the reference voice signal and the number of frames of the input voice signal to obtain average difference values; and extracts the minimum of those average difference values. It performs this process for every reference voice signal stored in the system and recognizes, as the word of the input voice signal, the reference voice signal corresponding to the smallest of the minimum average difference values obtained from the comparisons between the input voice signal and each reference voice signal.

Hereinafter, a voice recognition method and device according to a preferred embodiment of the present invention are described in detail with reference to the accompanying drawings.

FIG. 4 is a block diagram showing a voice recognition system according to a preferred embodiment of the present invention. In the device of FIG. 4, the voice signal input unit (41) passes a word spoken by a person to the shape extraction unit (42) connected to its output. The shape extraction unit (42) samples the incoming word, whether an input voice signal or a test voice signal, at the same interval at which the words stored in the system were sampled. From the sampling result, the shape extraction unit (42) generates the frame data and number of frames (N_I) of the input voice signal, or the frame data and number of frames (Nt) of the test voice signal. The generated frame data and number of frames (N_I or Nt) are supplied to the voice pattern comparison unit (43).
As noted above, because pronunciation differs from speaker to speaker, even the same word gives rise to many test voice signals with different sampled data distributions. For this reason, the reference pattern storage unit (44) stores, for each of the differently formed test voice signals of a word, the path pattern corresponding to the minimum accumulated difference value between that test voice signal and the reference voice signal of the same word held in the voice recognition system, together with the total number of frames (Nr) of the reference voice signal used to generate the path pattern. The total number of frames (Nr) and the path patterns stored in the reference pattern storage unit (44) are obtained by the method described above in connection with FIG. 1, and the path patterns are obtained for a plurality of test voice signals accepted as the same word. In practice, therefore, the reference pattern storage unit (44) stores, for every word the voice recognition system can recognize, a plurality of path patterns, the total number of frames (Nr) and frame data of the reference voice signal used to generate them, and the frame-count difference corresponding to each path pattern. The voice pattern comparison unit (43) recognizes the word corresponding to the input voice signal by using the voice signal supplied from the shape extraction unit (42) and the information stored in the reference pattern storage unit (44). The word recognized by the voice pattern comparison unit (43) is converted by the recognition word output unit (45) into a form suitable for downstream devices (not shown) and output. The operation of the device of FIG. 4 is described below with reference to FIG. 5.
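One way to picture what the reference pattern storage unit (44) holds, and how a path pattern could be derived from one test utterance, is sketched below. The class and function names, the 0-based (reference index, test index) path representation, the tie-breaking rule, and the frame values are all hypothetical; the minimum-cost path is found with standard DTW backtracking, which is assumed to correspond to the method described for FIG. 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Path = List[Tuple[int, int]]  # (reference frame index, test frame index), 0-based


@dataclass
class ReferenceEntry:
    """Stored data for one word: frame data plus path patterns keyed by
    the frame-count difference |Nr - Nt| of the utterance that produced them."""
    frame_data: List[float]
    patterns: Dict[int, List[Path]] = field(default_factory=dict)

    @property
    def n_ref(self) -> int:
        return len(self.frame_data)


def best_path(reference: List[float], test: List[float]) -> Path:
    """Path with the smallest accumulated difference, recovered by backtracking."""
    n_ref, n_test = len(reference), len(test)
    inf = float("inf")
    acc = [[inf] * n_test for _ in range(n_ref)]
    back = [[None] * n_test for _ in range(n_ref)]
    for j in range(n_ref):
        for i in range(n_test):
            cost = abs(reference[j] - test[i])
            if j == 0 and i == 0:
                acc[j][i] = cost
                continue
            candidates = []
            if j > 0 and i > 0:
                candidates.append((acc[j - 1][i - 1], (j - 1, i - 1)))
            if j > 0:
                candidates.append((acc[j - 1][i], (j - 1, i)))
            if i > 0:
                candidates.append((acc[j][i - 1], (j, i - 1)))
            prev_cost, prev_cell = min(candidates)  # ties broken by cell index
            acc[j][i] = cost + prev_cost
            back[j][i] = prev_cell
    path = [(n_ref - 1, n_test - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return path


def train(entry: ReferenceEntry, test_utterances: List[List[float]]) -> None:
    """Store one path pattern per test utterance, grouped by frame-count difference."""
    for test in test_utterances:
        diff = abs(entry.n_ref - len(test))
        bucket = entry.patterns.setdefault(diff, [])
        path = best_path(entry.frame_data, test)
        if path not in bucket:
            bucket.append(path)


# Hypothetical reference word (Nr = 5) and one test utterance (Nt = 4).
entry = ReferenceEntry(frame_data=[1, 3, 5, 7, 9])
train(entry, [[1, 3, 6, 9]])
print(entry.patterns)
```

This stage runs once, ahead of recognition, so its full-grid cost is paid offline rather than for every input word.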

FIG. 5 is a flowchart for explaining the voice recognition method according to a preferred embodiment of the present invention.

The shape extraction unit (42) samples the word-form input voice signal received through the voice signal input unit (41) at the same interval at which the words stored in the system were sampled, and supplies the frame data and number of frames (N_I) obtained by processing the samples to the voice pattern comparison unit (43). Once the shape extraction unit (42) has generated the frame data and number of frames (N_I) of the input voice signal, the voice pattern comparison unit (43) reads the number of frames (N_I) of the input voice signal from the shape extraction unit (42) and the number of frames (Nr) of the reference voice signal for a given word stored in the reference pattern storage unit (44) (step 51).
The voice pattern comparison unit (43) calculates the absolute difference D = |Nr − N_I| between the number of frames (Nr) of the reference voice signal used in the comparison and the number of frames (N_I) of the input voice signal (step 52). From the plurality of path patterns stored in the reference pattern storage unit (44) for that reference voice signal, it reads the path patterns corresponding to the calculated difference (D), and compares the frame data of the reference voice signal with the frame data of the input voice signal along each path pattern to produce difference values between the data of the two signals (step 53). In step 53, difference values are calculated only between frame data lying on the path patterns; that is, accumulated difference values are calculated only for the path patterns that correspond to the absolute difference (D) determined by the input voice signal and the reference voice signal. When the difference calculations for these path patterns are complete, the voice pattern comparison unit (43) accumulates the difference values within each path pattern and calculates an average difference value from each accumulated value (step 54). As in the conventional method, the average difference value for each path pattern is obtained by dividing the accumulated difference value by the sum of the number of frames of the input voice signal and the number of frames of the reference voice signal (N_I + Nr). The voice pattern comparison unit (43) then detects the smallest of the calculated average difference values (step 55).
Alternatively, the voice pattern comparison unit (43) may first find the smallest accumulated difference value and then calculate its average, which yields the same smallest average difference value because every path pattern for a given reference voice signal and input voice signal shares the same divisor.
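Steps 51 to 55 for a single reference voice signal can be sketched as follows. The function name, the dictionary keyed by frame-count difference, the path representation as 0-based (reference index, input index) pairs, and all data values are hypothetical. Skipping a stored pattern whose last cell does not match the lengths of the two signals is an assumption of this sketch; the text does not say how such a case is handled.

```python
def score_reference(reference, patterns_by_diff, input_frames):
    """Smallest average difference between one reference and the input,
    evaluated only along stored path patterns. Returns None if no pattern applies."""
    n_ref, n_in = len(reference), len(input_frames)       # step 51
    diff = abs(n_ref - n_in)                              # step 52: D
    best = None
    for path in patterns_by_diff.get(diff, []):           # step 53: patterns for D only
        if path[-1] != (n_ref - 1, n_in - 1):
            continue  # assumed handling: pattern does not fit these lengths
        accumulated = sum(abs(reference[j] - input_frames[i]) for j, i in path)
        average = accumulated / (n_in + n_ref)            # step 54
        if best is None or average < best:                # step 55
            best = average
    return best


# Hypothetical stored data for one word (Nr = 5) with two path patterns for D = 1.
reference = [1, 3, 5, 7, 9]
patterns_by_diff = {
    1: [
        [(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)],
        [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3)],
    ]
}
input_frames = [1, 4, 6, 9]  # N_I = 4

best = score_reference(reference, patterns_by_diff, input_frames)
print(best)
```

Each pattern here touches 5 cells, so two patterns cost 10 frame comparisons, and no minimum search over the 5 x 4 grid is needed at recognition time.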

The data processing described above is performed for every reference voice signal stored in the system. That is, steps 51 through 54 are performed on the frame data and number of frames (Nr) of every word, i.e., every reference voice signal, in the reference pattern storage unit (44). This yields as many average difference values as there are reference voice signals, one for each pairing of a reference voice signal with the input voice signal. The voice pattern comparison unit (43) finds the minimum among the average difference values for all words supported by the voice recognition system (step 55). The reference voice signal that produced the minimum average difference value is judged to be the same word as the input voice signal. The result output by the voice pattern comparison unit (43) is output through the recognition word output unit (45) and used by the electronic device that incorporates the voice recognition system.
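The loop over the whole vocabulary might look like the sketch below. The store layout (word mapped to reference frame data and path patterns keyed by frame-count difference), the word labels, and the frame values are invented for illustration, and patterns that do not end at the last frames of both signals are skipped as an assumption.

```python
def recognize(store, input_frames):
    """Return (word, average difference) for the reference whose stored path
    patterns give the smallest average difference to the input."""
    best_word, best_average = None, float("inf")
    n_in = len(input_frames)
    for word, (reference, patterns_by_diff) in store.items():
        n_ref = len(reference)
        diff = abs(n_ref - n_in)
        for path in patterns_by_diff.get(diff, []):
            if path[-1] != (n_ref - 1, n_in - 1):
                continue  # assumed handling of patterns that do not fit
            accumulated = sum(abs(reference[j] - input_frames[i]) for j, i in path)
            average = accumulated / (n_in + n_ref)
            if average < best_average:
                best_word, best_average = word, average
    return best_word, best_average


# Hypothetical two-word vocabulary.
store = {
    "start": ([1, 3, 5, 7, 9], {1: [[(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)]]}),
    "stop": ([9, 7, 5, 3], {0: [[(0, 0), (1, 1), (2, 2), (3, 3)]]}),
}

word, average = recognize(store, [1, 4, 6, 9])
print(word, average)
```

In this toy vocabulary the input is scored with 9 frame comparisons (5 + 4 path cells) instead of the 36 that two full grids (5 x 4 and 4 x 4) would need, which illustrates the kind of saving the text claims.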

According to the voice recognition method of the present invention described above, the amount of data processing needed to recognize an input voice signal is greatly reduced, so the input voice signal can be recognized in a short time.
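A rough estimate suggests where the reduction comes from. It rests on assumptions not stated in this section: each path pattern advances by at most one frame along each axis per step, K denotes the (hypothetical) number of stored path patterns matching D for one reference voice signal, and the comparison baseline is an unconstrained search over the whole Nr by Nt grid rather than the reduced conventional method of FIG. 3.

```latex
\begin{align*}
C_{\text{grid}}  &= N_r \, N_t
  && \text{frame comparisons, full grid} \\
C_{\text{paths}} &\le K \,(N_r + N_t - 1)
  && \text{frame comparisons, stored path patterns only}
\end{align*}
```

With made-up values Nr = Nt = 50 and K = 3, this gives 2500 frame comparisons for the full grid against at most 297 along the stored path patterns. The saving shrinks as K grows.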

Claims

1. A voice recognition method comprising: storing a plurality of reference voice signals composed of frame data; storing, for each of the stored reference voice signals, a plurality of path patterns generated by comparing a predetermined reference voice signal one-to-one with a plurality of test voices that are the same character as the predetermined reference voice signal but have different sound fields and loudness, together with a frame number difference value between the predetermined reference voice signal and the predetermined test voice signal used to generate each path pattern; generating a frame number and frame data from a predetermined input voice signal; calculating, for each of the stored reference voice signals, frame data difference values between the input voice signal and the predetermined reference voice signal on the plurality of path patterns corresponding to the difference in the number of frames between the input voice signal and the predetermined reference voice signal, and accumulating them for each path pattern; and detecting the reference voice signal that is the same character as the input voice signal by using the frame data difference value accumulated for each path pattern, the number of frames of the input voice signal corresponding to each path pattern, and the number of frames of the predetermined reference voice signal.

2. The voice recognition method of claim 1, wherein each path pattern is obtained by arranging the frame data of a predetermined test voice signal on a horizontal axis and the frame data of the reference voice signal that is the same character as the predetermined test voice signal on a vertical axis, accumulating the frame data difference values generated along every path connecting the first frame data of the test voice signal and the first frame data of the reference voice signal to their respective last frame data, and taking the path on the rectangular coordinate system that has the smallest accumulated difference value.

3. The voice recognition method of claim 1 or 2, wherein the detecting step comprises: accumulating the frame data difference values for all path patterns corresponding to the difference in the number of frames, for all reference voice signals compared with the input voice signal; generating average difference values by dividing each accumulated difference value by a number obtained from the frame number of the input voice signal and the frame number of the predetermined reference voice signal; and judging the predetermined character of the reference voice signal corresponding to the path pattern having the smallest average difference value to be the same character as the input voice signal.

4. A voice recognition apparatus comprising: reference pattern storage means for storing a plurality of reference voice signals composed of frame data and, for each of the stored reference voice signals, a plurality of path patterns generated by one-to-one comparison of a predetermined reference voice signal with a plurality of test voices that are the same character as the predetermined reference voice signal but have different sound fields and loudness, together with a frame number difference value between the predetermined reference voice signal and the predetermined test voice signal used to generate each path pattern; shape extracting means for extracting frame data and the number of frames from an input voice signal; and voice pattern comparing means for generating a frame number difference value by comparing the frame number of the input voice signal output from the shape extracting means with the frame number of a predetermined reference voice signal stored in the reference pattern storage means, reading the path patterns corresponding to the generated frame number difference value, accumulating for each path pattern the data difference values generated by comparing the frame data of the reference voice signal and the frame data of the input voice signal that correspond to each other on that path pattern, extracting the minimum average difference value among the average difference values obtained by dividing each accumulated data difference value by a predetermined number generated using the number of frames of the reference voice signal used for the comparison and the number of frames of the input voice signal, performing this processing for all reference voice signals stored in the system, and recognizing the reference voice signal corresponding to the minimum average difference value produced by comparing the input voice signal with each reference voice signal as the same character as the input test voice signal.

5. The voice recognition apparatus of claim 4, wherein the shape extracting means calculates the frame data and the number of frames of the input voice signal from sampling data obtained by sampling the voice signal input in the form of voice.

6. The voice recognition apparatus of claim 5, wherein the input voice signal has a word form.

7. The voice recognition apparatus of claim 6, wherein the number used for calculating the average difference value is the sum of the frame number of the predetermined reference voice signal used for the data comparison and the frame number of the input voice signal.