KR100251000B1

KR100251000B1 - Speech recognition telephone information service system

Info

Publication number: KR100251000B1
Application number: KR1019970040745A
Authority: KR
Inventors: 김재인; 박성준
Original assignee: 이계철; 한국전기통신공사
Priority date: 1997-08-25
Filing date: 1997-08-25
Publication date: 2000-04-15
Also published as: KR19990017730A

Abstract

PURPOSE: A speech recognition telephone information service system is provided in which the fixed connection structure between DSP0 and DSP1 (DSP: Digital Signal Processor) is changed to a dynamic connection structure. CONSTITUTION: The system A/D converts the input speech and emphasizes its high-frequency components. It divides the speech into frames 20 msec long and overlaps the frames by 10 msec to prevent loss of information at frame boundaries. It extracts cepstral coefficients and other coefficients representing speech characteristics. For each type of coefficient, it consults a vector quantization codebook and obtains the index of the entry most similar to the extracted vector. DSP0 transmits the index to DSP1. While feature extraction and vector quantization proceed for one frame, the system performs a Viterbi search using the results of the preceding frames.

Description

Speech recognition telephone information service system

The present invention relates to a speech recognition telephone information service system in which, within the DSP (Digital Signal Processor) module of a speech recognition system, the fixed connection structure between DSP0 and DSP1 is replaced with a dynamic connection structure.

A speech recognition telephone information service system lets members of the public request information over the telephone network by voice rather than by DTMF (Dual Tone Multi Frequency) input, and plays back the relevant information as speech. In a conventional telephone voice information system, users can get the service they want only if they have memorized the relevant service code or know the menu well. In a stock information service, for example, the user must know and enter the code number for the desired company, whereas with speech recognition the user only has to say the company name. Likewise, in a nature sounds service, a user who wants to hear a magpie must first select the bird sounds menu and then select the magpie, whereas with speech recognition the user only has to say "magpie".

There are several ways to implement speech recognition. The most widely used is the HMM (Hidden Markov Model), a probabilistic, statistical method that maximizes the expected number of correct states by selecting the most likely state at each instant. It is preferred because it is robust with respect to vocabulary, speakers, pattern comparison algorithms, and word decision logic, and has shown good performance across application areas.

A telephone voice information system without speech recognition is inconvenient because the user must enter a service code or know the menu names to reach the desired information. A system with speech recognition added is conceivable, but its hardware is still expensive, which is an obstacle to commercialization.

To overcome this drawback, an object of the present invention is to provide a speech recognition telephone information service system with reduced hardware cost.

FIG. 1 is a block diagram showing the speech recognition process based on pattern recognition.

FIG. 2 is a diagram of the conventional system structure.

FIG. 3 is a table showing the DSP pipeline.

FIG. 4 is a diagram of the modified system structure.

The present system is implemented as a system with a speech recognition function added, and the process is shown in FIG. 1. Because the HMM method is computationally heavy, a single processor cannot handle all of the work when the service must run in real time, so the system is usually built with two processors. This system uses DSP chips as the processors, but other processors may be used if better ones are available.

FIG. 2 shows the configuration of a system equipped with DSP modules. In a conventional speaker-independent isolated-word speech recognition system, the DSP module performs the following functions. DSP0 handles announcement cancellation, which allows speech input to be processed even while an announcement is playing; endpoint detection, which automatically locates the speech segment; and analysis of the detected speech segment to find the matching code numbers in the vector quantization codebook held by the system.

Viterbi search, the algorithm that finds the recognized word by determining the sequence of states that best fits a given input, is handled by DSP1. Because DSP chips are designed for signal processing and speech recognition, they make real-time processing possible even when heavy mathematical computation is involved. However, the chips and the memory they use are still expensive, which is an obstacle to commercialization.
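As an illustration of the Viterbi search assigned to DSP1, the following is a minimal sketch for a discrete-observation HMM whose observations are vector quantization codebook indices. The function name `viterbi`, the helper `ln`, and the two-state toy model are assumptions made for this example; the patent does not give model parameters or an implementation.

```python
import math


def ln(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, log_init, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete HMM.

    obs       -- sequence of codebook indices (one per frame)
    log_init  -- log_init[s]: log-probability of starting in state s
    log_trans -- log_trans[p][s]: log-probability of moving from p to s
    log_emit  -- log_emit[s][o]: log-probability that state s emits index o
    """
    n_states = len(log_init)
    # score[s] = best log-probability of any path that ends in state s
    score = [log_init[s] + log_emit[s][obs[0]] for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = score
        step_back, score = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: prev[p] + log_trans[p][s])
            step_back.append(best_prev)
            score.append(prev[best_prev] + log_trans[best_prev][s]
                         + log_emit[s][o])
        back.append(step_back)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for step_back in reversed(back):
        path.append(step_back[path[-1]])
    path.reverse()
    return score[last], path


# Hypothetical two-state left-to-right model over a two-entry codebook.
log_init = [ln(1.0), ln(0.0)]
log_trans = [[ln(0.5), ln(0.5)], [ln(0.0), ln(1.0)]]
log_emit = [[ln(0.9), ln(0.1)], [ln(0.2), ln(0.8)]]
obs = [0, 0, 1, 1]

best_score, best_path = viterbi(obs, log_init, log_trans, log_emit)
print(best_path, round(math.exp(best_score), 4))  # [0, 0, 1, 1] 0.1296
```

In a word recognizer of the kind described here, one such score would be computed per word model and the word with the highest score selected; that outer loop is omitted from the sketch.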

The speech recognition process based on pattern recognition is described in more detail below.

First, every signal arriving over the telephone network is sampled at 8 kHz and sent to DSP0, which searches the input data for the start of speech. The speech recognition process is described below with reference to FIG. 1. Once speech is judged to be present, it is A/D converted and passed through pre-emphasis, which boosts the high-frequency end of the baseband signal to improve the signal-to-noise (SN) ratio. The input speech is then divided into windowed frames 20 msec long, and successive frames overlap by 10 msec to prevent loss of information at frame boundaries. That is, the previous 10 msec of data and 10 msec of newly input data form one frame, and parameters representing the frequency characteristics of the speech in each frame are computed. The parameters are obtained by LPC (Linear Predictive Coding) analysis followed by a conversion that yields cepstral coefficients. Here, linear predictive analysis is a method of analyzing the frequency content of speech by applying linear operations to sample values of a stationary random process to obtain predicted values and prediction errors, from which a spectral decomposition is derived. Cepstral coefficients are the coefficients obtained by Fourier transforming the log-scale spectrum.

In this process, the cepstral coefficients and several other coefficients representing speech features are extracted. For each type of coefficient, a vector quantization codebook prepared in advance is consulted to find the index of the codebook vector closest to the extracted values. DSP0 then passes this index to DSP1. DSP1 uses an HMM-based Viterbi search to select the most similar word from among the recognition word models already stored in the database. In addition, while feature extraction and vector quantization are in progress for one frame, the Viterbi search can proceed using the results of the preceding frames.
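The front-end steps performed on DSP0 can be sketched as follows. The 8 kHz sampling rate, 20 msec frame length, and 10 msec overlap come from the description; the pre-emphasis coefficient `alpha = 0.95`, the squared Euclidean distance used for codebook lookup, and all function names are assumptions made for this example. LPC analysis and the conversion to cepstral coefficients are left out, so the codebook lookup is shown on an arbitrary feature vector.

```python
SAMPLE_RATE_HZ = 8000
FRAME_LEN = SAMPLE_RATE_HZ * 20 // 1000    # 20 msec -> 160 samples
FRAME_SHIFT = SAMPLE_RATE_HZ * 10 // 1000  # 10 msec -> 80 samples


def pre_emphasize(samples, alpha=0.95):
    """First-order high-frequency boost: y[n] = x[n] - alpha * x[n-1].

    alpha is an assumed value; the source does not specify the filter.
    """
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]


def split_frames(samples):
    """Cut 20 msec frames that each reuse the last 10 msec of the previous one."""
    return [samples[start:start + FRAME_LEN]
            for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT)]


def vq_index(vector, codebook):
    """Return the index of the codebook vector closest to `vector`.

    Squared Euclidean distance is an assumption; the source only says
    "most similar".
    """
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vector, codebook[i]))
    return min(range(len(codebook)), key=dist)


# 50 msec of dummy input yields four overlapping frames.
frames = split_frames(list(range(400)))
print(len(frames), len(frames[0]))          # 4 160
print(frames[1][:80] == frames[0][80:])     # True

# Hypothetical three-entry codebook of two-dimensional feature vectors.
print(vq_index([0.9, 0.1], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))  # 1
```

Only the integer returned by `vq_index` needs to be sent from DSP0 to DSP1 for each frame and coefficient type, which keeps the traffic between the two processors small.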

As a result, a kind of pipeline forms between the DSP chips, as shown in Table 3 (FIG. 3). The Viterbi search steps 1, 2, …, n shown there are not independent of one another; together they make up a single search interval. The right edge of the Viterbi search interval is drawn as a dotted line in Table 3 because the time the search takes is not constant. After the word has been recognized, the required information is played to the user. During this announcement, DSP0 keeps running the start-point detection program in order to detect new speech input from the user. DSP1, by contrast, has no work to do during the announcement and sits idle.
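The overlap between the two processors can be illustrated with a simplified schedule. This sketch assumes that each stage takes exactly one time slot per frame, which the description explicitly says is not true of the search; the function name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(n_frames):
    """Return one (dsp0_frame, dsp1_frame) pair per time slot.

    In slot t, DSP0 extracts features and quantizes frame t while DSP1
    advances the Viterbi search with the index from frame t - 1.
    None means the processor has nothing to do in that slot.
    """
    slots = []
    for t in range(n_frames + 1):
        dsp0_frame = t if t < n_frames else None
        dsp1_frame = t - 1 if t >= 1 else None
        slots.append((dsp0_frame, dsp1_frame))
    return slots


for slot, (dsp0_frame, dsp1_frame) in enumerate(pipeline_schedule(3)):
    print(slot, "DSP0:", dsp0_frame, "DSP1:", dsp1_frame)
```

After the last slot the announcement starts, and in the fixed structure DSP1 stays idle until the next utterance on the same channel.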

Consequently, as the number of channels grows, resources are wasted in proportion. In particular, DSP1 needs a large amount of memory to store the probability values of the phoneme models that make up the recognition words and the phoneme tables for the words, so the memory cost of adding channels is also significant.

To eliminate this inefficiency, the present invention makes the connections between the existing DSP chips dynamic. That is, the numbers of DSP0s and DSP1s are set in a suitable ratio according to the workload each performs, and the output of a DSP0 is assigned to a DSP1 that has no work. This requires replacing the fixed connection structure between DSP0 and DSP1 with a dynamic one, for which various approaches can be used, such as a shared memory area used in common by multiple processors.

FIG. 4 shows a speech recognition system using pipeline processing in which there are four connected lines and the ratio of DSP0s to DSP1s is N:1, for example 2:1. As one example scenario, with reference to FIG. 4: when a DSP0 connected to a channel begins recognizing a user's speech, it notifies the processor that manages the system's resources. That processor then forwards the data it receives to an available DSP1. When the recognition result is produced, the corresponding information is looked up and output as speech. While the speech output is playing, the DSP1 is idle, as explained above, and can therefore process data from another DSP0. Consequently, a speech recognition system using the pipeline processing of the present invention can spread the system load evenly by determining how many DSP0s one DSP1 can serve and adjusting the DSP0-to-DSP1 ratio when channels are added.
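The role of the resource-managing processor in this scenario can be sketched as a small allocator. The class name `ResourceManager`, its methods, and the first-come queue used when every DSP1 is busy are assumptions for illustration; the description does not say how requests are handled when no DSP1 is available.

```python
from collections import deque


class ResourceManager:
    """Assigns speech from any DSP0 to whichever DSP1 is currently idle."""

    def __init__(self, n_dsp1):
        self.idle = deque(range(n_dsp1))  # DSP1 ids with no work
        self.busy = {}                    # DSP1 id -> DSP0 id being served
        self.waiting = deque()            # DSP0 ids waiting for a DSP1

    def request(self, dsp0_id):
        """A DSP0 reports the start of speech.

        Returns the DSP1 id assigned to it, or None if it had to be queued.
        """
        if self.idle:
            dsp1_id = self.idle.popleft()
            self.busy[dsp1_id] = dsp0_id
            return dsp1_id
        self.waiting.append(dsp0_id)
        return None

    def release(self, dsp1_id):
        """A DSP1 reports that its search has finished.

        Returns the waiting DSP0 id it now serves, or None if it went idle.
        """
        del self.busy[dsp1_id]
        if self.waiting:
            dsp0_id = self.waiting.popleft()
            self.busy[dsp1_id] = dsp0_id
            return dsp0_id
        self.idle.append(dsp1_id)
        return None


# Four lines (DSP0 ids 0..3) sharing two DSP1s, as in the 2:1 example.
manager = ResourceManager(n_dsp1=2)
print(manager.request(0))   # 0    -> DSP0 #0 is served by DSP1 #0
print(manager.request(1))   # 1    -> DSP0 #1 is served by DSP1 #1
print(manager.request(2))   # None -> both DSP1s busy, DSP0 #2 waits
print(manager.release(0))   # 2    -> DSP1 #0 picks up DSP0 #2
```

Because a DSP1 is released as soon as its search ends, the announcement period that left it idle in the fixed structure becomes time in which it can serve another line.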

In the description so far, DSP0 passes data directly to DSP1 over a channel, but data may also be exchanged through the common memory shown in FIG. 4. Which method is used to exchange data depends on the protocol, and the protocol should be chosen with the system's specifications in mind. Likewise, the DSP chips need not be connected by a bus if some other dynamic connection structure is available. Whatever structure is used, the connection must be dynamic so that the load can be allocated evenly.

When the system operates in this way, an idle DSP1 can process data from another DSP0 while speech output is in progress, as explained above, so that several DSP0s can be served by a single DSP1. In addition, the system load can be spread evenly by adjusting the DSP0-to-DSP1 ratio when channels are added. The system is thereby used efficiently and its cost is reduced.
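One rough way to estimate how many DSP0s a single DSP1 can serve is to compare how long a DSP1 is busy with how long a full interaction lasts. The model below is an assumption introduced for illustration and is not taken from the patent: it treats each line as repeating a cycle of one search followed by one announcement, with fixed durations and perfectly staggered calls. The function name and the example durations are hypothetical.

```python
def max_dsp0_per_dsp1(search_ms, announce_ms):
    """Upper bound on the DSP0:DSP1 ratio under an idealized cycle model.

    Each line occupies a DSP1 for search_ms out of every
    (search_ms + announce_ms) milliseconds, so one DSP1 can be shared
    by at most that many perfectly staggered lines.
    """
    if search_ms <= 0:
        raise ValueError("search_ms must be positive")
    return (search_ms + announce_ms) // search_ms


print(max_dsp0_per_dsp1(search_ms=3000, announce_ms=3000))  # 2
print(max_dsp0_per_dsp1(search_ms=2000, announce_ms=6000))  # 4
```

Real calls are not perfectly staggered and search time varies, so a deployed ratio would need to be lower than this bound or backed by queueing.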

Claims

1. A speech recognition system using a pipeline processing method, comprising: a pre-emphasis step of A/D converting input speech and emphasizing its high-frequency components; a step of dividing the input speech into windowed frames 20 msec long; a step of overlapping the frames by 10 msec to prevent loss of information at frame boundaries; a step of extracting cepstral coefficients and several other coefficients representing speech features; a step of finding, by reference to a vector quantization codebook prepared in advance for each type of coefficient, the index of the value most similar to the vector values in the codebook; a step in which DSP0 passes this index to DSP1; and a step of performing a Viterbi search using the results of preceding frames while speech feature extraction and vector quantization proceed for one frame.

2. The speech recognition system of claim 1, wherein the modules that perform the feature extraction and vector quantization are separated from the modules that perform the Viterbi search, and the two are dynamically connected across several channels.

3. The speech recognition system of claim 1, wherein the phoneme table information needed to perform the Viterbi search is shared.

4. The speech recognition system of claim 1, wherein a dynamic connection scheme is used between the step of performing feature extraction and vector quantization and the step of performing the Viterbi search.

5. The speech recognition system of claim 1, wherein fewer modules are used in the Viterbi search step than in the feature extraction and vector quantization step.

6. The speech recognition system of claim 1, wherein a pipeline processing scheme is used in which the Viterbi search step performs the search using intermediate results of the feature extraction and vector quantization step before the vector quantization has finished.