KR100350003B1

KR100350003B1 - A system for determining a word from a speech signal

Info

Publication number: KR100350003B1
Application number: KR1019950031742A
Authority: KR
Inventors: 스테판도블레; 한스-빌름루흘
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 1994-09-20
Filing date: 1995-09-20
Publication date: 2002-12-31
Also published as: DE59507882D1; DE4438185A1; TW291555B; KR960011835A

Abstract

In speech recognition, test signals are derived from a speech signal and compared with predetermined reference signals to form scores. Each subsequent test signal is compared with reference values that lie within a predetermined neighborhood of the reference value found to be optimal for the preceding test signal. Depending on the position within this neighborhood, a transition value corresponding to a transition probability is added to the score. To improve the results substantially when the current speaker's speaking rate differs, it is proposed to adapt the transition values to the speaker's rate. A further improvement can be obtained by adapting the reference values to the pronunciation of the speaker concerned. This adaptation can also be carried out iteratively in several steps.

Description

A system for determining a word from a speech signal

The present invention relates to a system for determining words of a given vocabulary from a speech signal, the system comprising:

first means for receiving the speech signal and supplying a sequence of digital test signals,

second means for storing sequences of reference signals corresponding to words of the vocabulary,

third means, coupled to the first means and the second means, for comparing the test signals with first reference signals so as to form, for each first reference signal, a score that depends on the difference between the test signal and that first reference signal, each first reference signal being identical to, or a neighbor of, a second reference signal in the relevant sequence for which the comparison with the preceding test signal was carried out successfully according to a predetermined method, the third means being arranged to increase the score by a transition value that corresponds to a transition probability and depends on the distance from the second reference signal, and

fourth means for summing the increased scores for each sequence of reference signals compared with successive test signals, for determining the optimal sequence having the minimum sum of increased scores, and for outputting the word or words associated with the optimal sequence.

A system of this kind is known from DE 32 15 868 C2. The known system serves to determine a word sequence; a corresponding sequence of reference signals is stored for each individual word, and specific steps are taken to determine the word transitions. The comparison of successive test signals with the reference signals, or its result, can be represented as a two-dimensional grid. For each sequence of reference signals that has been compared, within a word, with successive test signals from a starting point onward, it is determined which reference signal yields the smallest score sum with the next test signal, so that a path through the grid is traced from the given starting point in the word to the end of that word. Within a word, the next test signal is therefore compared with reference values located in a given neighborhood around the end of the path reached so far. In this way, a non-linear time alignment between the word actually spoken and the reference-value sequence of that word is achieved. Within a word, the different transitions, that is, the different neighborhood relations between the reference value found optimal for a test signal and the optimal reference value for the preceding test signal, are all treated in the same way.
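The grid search described for the known system can be illustrated with a small dynamic-programming sketch. It is only an illustration under stated assumptions: scalar values stand in for the test and reference signals, the absolute difference stands in for the score, the function name `word_score` is hypothetical, and the permitted steps (stay, advance by one, skip one) are taken from the later description of FIG. 3. With all transition values at zero, every transition is treated alike, as in the known system.

```python
import math

def word_score(test, refs, a=(0.0, 0.0, 0.0)):
    """Minimum summed score of a path through one word's reference sequence.

    test and refs are lists of numbers (scalar stand-ins for signals).
    a[k] is the transition value for moving k references forward (k = 0, 1, 2).
    The path starts at refs[0] and must end at refs[-1].
    """
    n_refs = len(refs)
    inf = math.inf
    # scores[i]: best summed score of any path ending at reference i so far
    scores = [inf] * n_refs
    scores[0] = abs(test[0] - refs[0])
    for y in test[1:]:
        new = [inf] * n_refs
        for i in range(n_refs):
            # Best predecessor among references i, i-1, i-2 (if they exist)
            best_prev = min(scores[i - k] + a[k] for k in range(3) if i - k >= 0)
            if best_prev < inf:
                new[i] = best_prev + abs(y - refs[i])
        scores = new
    return scores[-1]

# A slowly spoken version of the same "word" still aligns without cost,
# which is the non-linear time alignment mentioned in the text.
print(word_score([1, 1, 2, 3, 3], [1, 2, 3]))
```

Passing non-zero values for `a` turns this into the variant with explicit transition values discussed next.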

DE 37 10 507 A1 discloses a similar system for recognizing spoken words in which the different neighborhood relations of the optimal reference signals for successive test signals are taken into account. The transition probabilities are thus modeled explicitly. In particular, a fixed transition value that depends on the neighborhood relation is added to the score. It is assumed that the score is formed as the negative logarithm of the probability that the word actually spoken corresponds to the relevant reference signal at this position.
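The relation between a transition probability and a transition value on a negative-logarithm scale can be sketched as follows. The probabilities below are hypothetical and chosen only for illustration; the natural logarithm is an assumption, since the text does not state the base or any scaling.

```python
import math

def transition_value(probability):
    """Negative natural logarithm of a transition probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)

# Hypothetical probabilities: repeat the reference, advance by one, skip one.
probs = {"stay": 0.25, "next": 0.5, "skip": 0.25}
values = {name: transition_value(p) for name, p in probs.items()}

# The most probable transition receives the smallest additive penalty.
print(values)
```

Because scores and transition values are both negative logarithms, adding them corresponds to multiplying the underlying probabilities.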

Because the diagonal path is the best fit when the speed of the spoken word matches that of the reference signal sequence, a diagonal course of the path can be favored by a suitable choice of transition values. Words spoken in a different manner can still be recognized, albeit with different scores. The speaking rate is thus modeled by the choice of the transition values.

The reference values are determined from test sentences that must be spoken before the system is actually used. When the system is intended for one given user, these test sentences are recorded by that user only, so that the speaking rate is modeled at the same time. However, when the system is intended for several users or, ideally, is to be fully speaker-independent, the reference values are derived from test sentences spoken by a number of different speakers. Mean values are then determined for the reference values as well as for the transition values, and the mean transition values are assumed to be the same for all positions in all words. This reduces the reliability of recognition, however: when a speaker talks so fast that the optimal path through the reference signals of a word is steeper than the diagonal, the transition values as a whole produce a less favorable score, and the spoken word is easily mistaken for a similar-sounding reference word.

It is an object of the invention to provide a system that offers high reliability when used by different speakers.

This object is achieved according to the invention in that fifth means are provided which, depending on the difference between the length of the speech signal from which the test signals compared with the optimal sequence of reference signals were derived and the length of that optimal sequence of reference signals, change the transition values into new transition values for subsequent comparisons.

By adapting the transition values in this way, the reference signal sequences in the system according to the invention are adapted to the speaking rate of the current user. As soon as a word has been recognized, the transition values can be adapted so that subsequent words are recognized more easily.

One possibility for adapting the transition values according to the invention is that the fifth means are arranged to change the transition values a into new transition values a' as a function of n = T/N as follows:

where T is the length of the sequence of test signals, N is the length of the sequence of reference signals, the index i,i means that the first and second reference signals are identical, the index i,i+1 means that the first and second reference signals are direct neighbors, the index i,i+2 means that the first and second reference signals are separated by one further reference signal, and b is a predetermined proportionality factor.

The ratio of the length of the word actually spoken to the length of the reference signal sequence is thus used to change the transition values. The overall transition probability remains constant across all transitions, because a deviation from the diagonal in one direction is favored to the same extent as a deviation in the other direction is suppressed.

Taking the actual speaking rate of the current user into account in this way improves the reliability of recognition.

A further improvement is achieved in an embodiment of the invention in which sixth means are provided for changing the reference values r_i into new reference values r_i' as follows:

where y_t is the test signal that was compared with the reference value r_i in the optimal sequence of reference values, and c is a predetermined value. In this way not only the speaking rate is taken into account, but also the manner of pronunciation, that is, the speaker's vocal tract. Adapting reference values to the current speaker is known per se, but not in combination with adaptation to the speaking rate.

The adaptation to the speaking rate and to the pronunciation of the current speaker must be carried out cautiously, because it should not follow accidental extreme values in the word or words that the current speaker happened to pronounce in a particular way; the same speaker may change his manner of speaking afterwards. The degree of adaptation is set by the proportionality factor b for the speaking rate and by the predetermined value c for the change of the reference values, and neither of these two factors should be too large. To nevertheless achieve adequate adaptation to the current speaker's manner of speaking, in a further embodiment of the invention at least the change of the reference values is carried out several times within the same speech signal. A stepwise adaptation to the current speaker's manner of speaking is then reached after several recognized words.

Embodiments of the invention are described in detail below with reference to the drawings.

The block diagram of FIG. 1 shows a microphone (2) that converts the speech signal spoken by a speaker into an electrical signal. This signal is processed and digitized in block (10); for example, the individual frequency components of the speech signal are determined segment by segment. The segments have a constant length of between 10 ms and 20 ms. Block (10) outputs the test signals to block (30).
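The segmentation step in block (10) can be pictured with the following minimal sketch. It only cuts a digitized signal into consecutive segments of constant length; the function name `frame_signal`, the 8 kHz sampling rate in the usage line, and the choice of non-overlapping segments are assumptions for illustration, and the analysis of frequency components per segment is left out.

```python
def frame_signal(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping segments of frame_ms."""
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("segment length must be at least one sample")
    # Drop a trailing partial segment so that all segments have equal length.
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, frame_len)]

# 100 ms of dummy samples at an assumed 8 kHz rate, cut into 10 ms segments.
frames = frame_signal(list(range(800)), sample_rate=8000, frame_ms=10)
print(len(frames), len(frames[0]))
```

Each segment would then be turned into one test signal, so the number of segments corresponds to the length T used later for the speaking-rate ratio.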

In block (30), the test signals are compared with reference signals supplied by a memory (20), which is controlled and addressed by block (30). These reference signals were determined in advance by analyzing test sentences spoken by several different speakers. Each comparison yields a score that is increased by a transition value stored in block (30). In block (40), the increased scores are summed for the different paths through the different words; this summation can, however, also take place at the same time as the comparisons that determine the scores. At the end of a word, or at the end of a speech signal consisting of several words, the optimal overall path is determined in block (40) and the corresponding word sequence is output to block (70). Block (70) may be, for example, a display screen, but is preferably formed by a device that is controlled by the spoken commands.

The comparison of successive test signals with the reference signals of different words and the determination of the optimal sequence are explained below with reference to FIG. 2.

The time axis t represents the sequence of test signals determined from the received speech signal, whereas the R axis represents the sequences of reference signals for a number of words. FIG. 2 shows that the first test signals correspond best to the sequence R1 of reference signals associated with the word W1. Comparisons with the starting points of the other sequences R2 and R3, which are associated with other words, are also started each time, but it is assumed that the similarity is so small that these series of comparisons are quickly terminated. In general, a new comparison with the start of the sequence R1 of reference signals is also begun with each subsequent test signal. These comparisons are likewise terminated quickly, because the later parts of the spoken speech signal deviate too much from the beginning of the sequence R1.

After the end of the path through the sequence R1 of reference signals (which, as stated above, corresponds to the word W1), the comparisons with the starting points of the sequences R1 to R3 are continued. In this example it is assumed that the path starting in the sequence R3 turns out to be optimal, so that the word W3 is subsequently recognized and output. When the speech signal lasts longer because several words are spoken, the comparisons continue in the same way.

What happens during the comparisons within a word is explained in detail with reference to FIG. 3, which shows a detail of FIG. 2 for only two successive test signals at instants t and t+1 and for some reference signals r_i, r_{i+1}, r_{i+2}, and so on. It is assumed that for the test signal at instant t the optimal path P ends at the reference value r_i. The next test signal, at instant t+1, is compared with the reference signals r_i, r_{i+1} and r_{i+2} in accordance with the permitted transitions, to which the transition values a_{i,i}, a_{i,i+1} and a_{i,i+2} are assigned. In this embodiment these transition values correspond to the negative logarithm of the transition probabilities. The comparison of the test signal at instant t+1 with the reference signal r_i yields a score that depends on the difference between the two signals and is increased by the transition value a_{i,i}. Similarly, the comparison with r_{i+1} yields a score increased by a_{i,i+1}, and the comparison with r_{i+2} yields a score increased by a_{i,i+2}.

For example, the transition values a_{i,i} and a_{i,i+2} are equal or nearly equal, whereas a_{i,i+1} is substantially smaller. When the test signal at instant t+1 differs by about the same amount from all three reference signals shown (which is quite possible, because neighboring reference values are often similar), the smallest increased score results from the comparison with r_{i+1}, so that the path P ending at r_i is continued in the diagonal direction. The diagonal direction is thus privileged.
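The single step from instant t to instant t+1 described for FIG. 3 can be sketched as a choice among three candidates. The function name `best_transition` and all numbers are hypothetical; the sketch only shows that, with equal local differences, the smaller diagonal transition value decides the outcome.

```python
def best_transition(prev_score, distances, transition_values):
    """Return (offset, score) of the cheapest continuation from reference i.

    distances[k] is the difference between the test signal at t+1 and
    reference i+k; transition_values[k] is a_{i,i+k}, for k = 0, 1, 2.
    """
    candidates = [
        (prev_score + distances[k] + transition_values[k], k)
        for k in range(3)
    ]
    score, offset = min(candidates)
    return offset, score

# Equal differences to r_i, r_{i+1}, r_{i+2}: the diagonal step (offset 1) wins.
print(best_transition(10.0, [2.0, 2.0, 2.0], [3.0, 1.0, 3.0]))

# Even when the test signal is closest to r_{i+2}, large values for the
# non-diagonal transitions can still force the diagonal step.
print(best_transition(10.0, [4.0, 3.0, 1.5], [3.0, 1.0, 3.0]))
```

The second call hints at why fixed transition values can penalize a speaker whose rate differs from that of the reference sequences.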

With a fast speaker, the test signal at instant t+1 will be more similar to the reference value r_{i+2}. However, if the transition value a_{i,i+2} is much larger than the transition value a_{i,i+1}, the diagonal direction may be favored too strongly. If this happens repeatedly within a word, that is, within a sequence of reference signals, the resulting sum of scores will ultimately be less favorable than the good similarity between the test signal sequence and the reference signal sequence, apart from the overly fast pronunciation, would warrant. This leads to less reliable recognition. Therefore, as soon as it has been established that a speaker talks faster or slower than normal by a given amount, the transition values are preferably changed so that a direction deviating somewhat from the diagonal is privileged.

This adaptation is carried out in block (50) of FIG. 1.

When a word or a short sequence of words has been recognized, that is, when a path leading to the end of the relevant sequence has been determined within at least one sequence of reference signals, it is known how many test signals were needed for this. Since the number of reference signals in this sequence is given, the ratio n can be calculated:

n = T/N

where T is the number of test signals over which the word was recognized and N is the number of reference signals in the relevant sequence. Using this n, new transition values a' are determined from the current transition values a.

Here the proportionality factor b determines the degree to which the transition values are adapted to the speaker's manner of speaking. To ensure that the adaptation does not depend too strongly on accidental extremes in the speaker's pronunciation, b should not be too large. When the transition values a are expressed as negative logarithms of the transition probabilities, as described above, a value of b = 180 has been found to be a suitable compromise. The transition value for the diagonal thus remains constant regardless of this value, because for fast speech the transition value a_{i,i+2} for the steeper transition is decreased to the same extent as the transition value a_{i,i} for the less steep transition is increased. As a result, transitions between reference signals for successive test signals that are steeper overall are privileged. The corresponding result is obtained for slow speech. The new transition values are transferred to block (30), where they are used for the subsequent comparisons.
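The behavior described for the adaptation of the transition values can be sketched as follows. The exact formula is not reproduced in this text, so the shift b * (1 - n) used here is purely an assumption; it is one form that matches the stated properties (diagonal value unchanged, a_{i,i} raised and a_{i,i+2} lowered by the same amount for fast speech, the reverse for slow speech). The function name and the starting values are hypothetical.

```python
def adapt_transition_values(a, num_test, num_ref, b=180.0):
    """Return adapted (a_stay, a_next, a_skip) for the ratio n = T/N.

    ASSUMPTION: the shift is b * (1 - n). The source text only fixes the
    direction and symmetry of the change, not this exact expression.
    """
    n = num_test / num_ref
    shift = b * (1.0 - n)  # positive for fast speech (n < 1)
    a_stay, a_next, a_skip = a
    return (a_stay + shift, a_next, a_skip - shift)

# Fast speaker: 15 test signals were matched to 20 reference signals.
print(adapt_transition_values((300.0, 100.0, 300.0), 15, 20))

# Slow speaker: 25 test signals were matched to 20 reference signals.
print(adapt_transition_values((300.0, 100.0, 300.0), 25, 20))
```

A smaller b makes each adaptation step more cautious, in line with the warning against following accidental extremes in a single word.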

The reliability of recognition can be improved further by adapting the reference signals to the speaker's pronunciation. This is done in block (60) of FIG. 1 as follows.

After a word has been recognized in the manner described, the test signals are compared once more with the sequence of reference signals for which the optimal path was found, and each reference signal r_i is changed into an adapted reference signal r_i' as follows:

where y_t is the test signal that was compared with the reference signal r_i at instant t, and the factor c determines the extent to which the current reference signal is changed. A value of c = 0.13 has proved particularly effective when the reference signals are adapted in several steps over successive words.
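The reference adaptation can be pictured with the sketch below. The update formula is not reproduced in this text, so the interpolation r_i' = r_i + c (y_t - r_i) is an assumption chosen because it moves the reference toward the matched test signal by a fraction c; the function name and the vectors are hypothetical.

```python
def adapt_reference(r, y, c=0.13):
    """Move reference vector r a fraction c of the way toward test vector y.

    ASSUMPTION: linear interpolation; the source only states that c controls
    how strongly the current reference signal is changed.
    """
    return [ri + c * (yi - ri) for ri, yi in zip(r, y)]

# Repeated small steps over successive words approach the speaker gradually.
reference = [0.0, 10.0]
speaker = [1.0, 8.0]
for _ in range(3):
    reference = adapt_reference(reference, speaker)
print(reference)
```

With a small c, a single unusual utterance changes the reference only slightly, while consistent pronunciation across several words accumulates.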

So far it has been assumed that each sequence of reference signals represents a whole word. However, there are also speech recognition systems in which the individual sequences of reference signals represent phonemes that recur in many words; the words are then composed from the recognized phonemes. The described method of adapting the transition values, and possibly also the reference signals, is equally suitable for such systems.

Furthermore, unlike in the embodiment described, the method can also be used when the transition values are not the same for all reference signals but differ according to the position of the reference signal within the sequence. The adaptation factor b must then, if necessary, be determined in a position-dependent manner.

FIG. 1 is a block diagram of a system according to the invention.

FIG. 2 is a diagram illustrating the formation of a path through successive words.

FIG. 3 is a diagram illustrating the transitions for two successive test signals.

* Explanation of reference numerals for the main parts of the drawings *
2: microphone 20: memory

Claims

1. A system for determining words of a given vocabulary from a speech signal, comprising:

first means for receiving the speech signal and supplying a sequence of digital test signals;

second means for storing sequences of reference signals corresponding to words of the vocabulary;

third means, coupled to the first means and the second means, for comparing the test signals with first reference signals so as to form, for each first reference signal, a score that depends on the difference between the test signal and that first reference signal, wherein each first reference signal is identical to, or a neighbor of, a second reference signal in the relevant sequence for which the comparison with the preceding test signal was carried out successfully according to a predetermined method, the third means being arranged to increase the score by a transition value that corresponds to a transition probability and depends on the distance from the second reference signal; and

fourth means for summing the increased scores for each sequence of reference signals compared with successive test signals, for determining the optimal sequence having the minimum sum of increased scores, and for outputting the word or words associated with the optimal sequence,

characterized in that fifth means are provided which change the transition values into new transition values for subsequent comparisons, depending on the difference between the length of the speech signal from which the test signals compared with the optimal sequence of reference signals were derived and the length of that optimal sequence of reference signals.

2. The system according to claim 1,

wherein the fifth means are arranged to change the transition values a into new transition values a' as a function of n = T/N as follows:

where T is the length of the sequence of test signals, N is the length of the sequence of reference signals, the index i,i means that the first and second reference signals are identical, the index i,i+1 means that the first and second reference signals are direct neighbors, the index i,i+2 means that the first and second reference signals are separated by one further reference signal, and b is a predetermined proportionality factor.

3. The system according to claim 1 or 2,

wherein the reference values r_i are changed into new reference values r_i' as follows:

where y_t is the test signal that was compared with the reference value r_i in the optimal sequence of reference values, and c is a predetermined value.

4. The system according to claim 3,

wherein at least the change of the reference values is carried out several times within the same speech signal.