KR20070022296A

KR20070022296A - Performance prediction for an interactive speech recognition system

Info

Publication number: KR20070022296A
Application number: KR1020067025444A
Authority: KR
Inventors: 홀거 숄
Original assignee: 코닌클리케 필립스 일렉트로닉스 엔.브이.
Priority date: 2004-06-04
Filing date: 2005-05-24
Publication date: 2007-02-26

Abstract

The present invention provides an interactive speech recognition system and a corresponding method for determining the performance level of a speech recognition procedure on the basis of recorded background noise. The system makes use of speech pauses that occur before the user enters the speech that is to be recognized. Preferably, the performance prediction makes use of trained noise classification models. Furthermore, the predicted performance level is indicated to the user in order to give reliable feedback on the performance of the speech recognition procedure. In this way, the interactive speech recognition system can react to noise conditions that are unsuitable for reliable speech recognition.

Description

PERFORMANCE PREDICTION FOR AN INTERACTIVE SPEECH RECOGNITION SYSTEM

The present invention relates to the field of interactive speech recognition.

The performance and reliability of automatic speech recognition (ASR) systems depend strongly on the characteristics and level of the background noise. Several approaches exist for increasing system performance and coping with a variety of noise conditions. The general idea is to apply noise reduction and noise suppression methods in order to increase the signal-to-noise ratio (SNR) between speech and noise. In principle, this can be realized by means of suitable noise filters.

Other approaches focus on noise classification models that are specific to particular background noise scenarios. Such noise classification models may be incorporated into the acoustic models or language models used for automatic speech recognition and may require training under the particular noise conditions. By means of noise classification models, a speech recognition procedure can therefore be adapted to various predefined noise scenarios. Furthermore, explicitly noise-robust acoustic modeling that incorporates a-priori knowledge into the classification model can be applied.

However, all of these approaches attempt either to improve the speech quality or to match the various noise conditions that may occur in typical application scenarios. Irrespective of the variety and quality of these noise classification models, a large number of unpredictable noise and disturbance scenarios cannot be covered by any reasonable noise reduction and/or noise matching effort.

It is therefore of practical use to indicate the momentary noise level to the user of an automatic speech recognition system, so that the user becomes aware of a problematic recording environment that may lead to erroneous speech recognition. Most typically, a noise indicator displays the momentary energy level of the microphone input, and the user has to judge whether the indicated level lies within a range that allows speech recognition of sufficient quality.

For example, WO 02/095726 A1 discloses such a speech quality indication. There, a received speech signal is fed to a speech quality evaluator that quantifies the speech quality of the signal. The resulting speech quality measure is fed to an indicator driver, which generates an appropriate indication of the currently received speech quality. This indication is made apparent to the user of the voice communication device by an indicator. The speech quality evaluator may quantify the speech quality in various ways. Two simple examples of speech quality measures that may be used are (i) the speech signal level and (ii) the speech signal-to-noise ratio.

The speech signal level and the signal-to-noise ratio displayed to the user may serve to indicate a problematic recording environment, but in general they are not directly related to the recognition performance of an automatic speech recognition system. For example, when a particular noise signal can be filtered out sufficiently, a rather low signal-to-noise ratio does not necessarily correlate with low performance of the speech recognition system. In addition, the solutions known in the prior art are typically adapted to generate an indication signal based on the currently received speech quality. This often implies that a portion of the received speech has already been subjected to the recognition procedure. Hence, the speech quality measure is typically generated from recorded speech and/or speech signals that have already passed through the speech recognition procedure. In both cases, at least part of the speech has already been processed before the user has any chance to reduce the noise level or improve the recording conditions.

The present invention provides an interactive speech recognition system for recognizing the speech of a user. The speech recognition system comprises: means for receiving acoustic signals comprising background noise; means for selecting a noise model on the basis of the received acoustic signals; means for predicting a performance level of a speech recognition procedure on the basis of the selected noise model; and means for indicating the predicted performance level to the user. In particular, the means for receiving the acoustic signals are preferably designed to record the noise before the user provides any speech signal to the interactive speech recognition system. In this way, acoustic signals representing the background noise are obtained even before the speech signals that will be subjected to the speech recognition procedure are generated. In dialogue systems in particular, suitable speech pauses occur at certain predefined points in time and can be used effectively to record noise-specific acoustic signals.

The interactive speech recognition system of the invention is further adapted to make use of noise classification models that have been trained under the particular application conditions of the speech recognition system. Preferably, the speech recognition system has access to a variety of noise classification models, each of which represents a particular noise condition. Selecting a noise model typically means analyzing the received acoustic signals and comparing them with the stored, previously trained noise models. The particular noise model that best matches the received and analyzed acoustic signals is then selected.
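
The selection step described above can be sketched as a nearest-match search. This is only an illustrative sketch under stated assumptions: the source does not specify how a noise model is represented or compared, so the feature vectors, the Euclidean distance, and the names `select_noise_model` and `NOISE_MODELS` are all hypothetical.

```python
import math

# Hypothetical stored models: one mean feature vector per trained noise
# condition (for example, normalized energies in three frequency bands).
NOISE_MODELS = {
    "car": [0.8, 0.5, 0.2],
    "office": [0.2, 0.3, 0.1],
    "street": [0.6, 0.7, 0.6],
}


def select_noise_model(features, models):
    """Return the name of the stored model closest to the observed features.

    Euclidean distance is an assumed similarity measure; a likelihood
    under a statistical model would serve the same purpose.
    """
    return min(models, key=lambda name: math.dist(features, models[name]))


# Features extracted from the noise recorded before the user speaks.
observed = [0.75, 0.45, 0.25]
best_match = select_noise_model(observed, NOISE_MODELS)
```

With these made-up vectors the observed noise lies closest to the "car" entry, so that model would be passed on to the performance prediction.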

The performance level of the speech recognition procedure is predicted on the basis of this selected noise model. The means for predicting the performance level therefore provide an estimate of a quality measure of the speech recognition procedure even before the actual speech recognition has started. This provides an effective means of estimating and recognizing a particular noise level as early as possible in the sequence of speech recognition steps. Once the performance level of the speech recognition procedure has been predicted, the system is adapted to inform the user of the predicted performance level.

In particular, by indicating the predicted quality measure of the speech recognition procedure to the user, the user learns of insufficient speech recognition conditions as early as possible. In this way, the user can react to insufficient speech recognition conditions before actually using the speech recognition system. This functionality is particularly advantageous in dialogue systems in which the user enters control commands or requests acoustically. The speech recognition system of the invention is therefore preferably implemented in an automatic dialogue system that is adapted to process the spoken input of a user and to provide requested information, such as a public transport timetable information system.

According to a further preferred embodiment of the invention, the means for predicting the performance level are further adapted to predict the performance level on the basis of noise parameters that are determined from the received acoustic signals. These noise parameters indicate, for example, the speech recording level or the signal-to-noise ratio, and can additionally be used to predict the performance level of the speech recognition procedure. In this way, the invention provides an effective means of combining generic noise-specific parameters and the application of noise classification models into a single parameter, namely a performance level that directly represents the recognition performance of the speech recognition system.

Alternatively, the means for predicting the performance level may use either the noise model or the noise parameters on their own. However, a more reliable performance level is to be expected when the selected noise model is evaluated in combination with the separately generated noise parameters. The means for predicting the performance level can therefore make flexible use of a plurality of noise-indicating input signals in order to provide a realistic performance level that directly represents a specific error rate of the speech recognition procedure.

According to a further preferred embodiment of the invention, the interactive speech recognition system is further adapted to tune at least one speech recognition parameter of the speech recognition procedure on the basis of the predicted performance level. In this way, the predicted performance level is used not only to provide the user with appropriate performance information but also to actively improve the speech recognition procedure. A typical speech recognition parameter is, for example, the pruning level, which specifies the effective range of relevant phoneme sequences for a speech recognition procedure that is typically based on statistical methods using, for example, hidden Markov models (HMMs).

In general, increasing the pruning level reduces the error rate but requires considerably more computing power, which slows down the speech recognition procedure. The error rate refers, for example, to the word error rate (WER) or the concept error rate (CER). By tuning speech recognition parameters on the basis of the predicted performance level, the speech recognition procedure can be modified flexibly in response to the predicted performance.
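
As a rough illustration of this trade-off, the following sketch widens a search beam when the predicted error rate is high. The thresholds, scaling factors, and the use of a numeric beam width as a stand-in for the "pruning level" are assumptions made for the example; the source gives no concrete values.

```python
def choose_pruning_beam(predicted_wer, base_beam=100.0, max_beam=300.0):
    """Pick a search beam width from a predicted word error rate.

    A wider beam keeps more hypotheses alive (fewer search errors) at the
    cost of more computation. All numbers here are hypothetical.
    """
    if predicted_wer < 0.10:
        factor = 1.0   # good conditions: keep the fast default
    elif predicted_wer < 0.25:
        factor = 1.5   # moderate noise: search somewhat wider
    else:
        factor = 2.0   # poor conditions: trade speed for reliability
    return min(base_beam * factor, max_beam)


beam_quiet = choose_pruning_beam(0.05)
beam_noisy = choose_pruning_beam(0.40)
```

The cap `max_beam` reflects the point made above that a wider search costs computing time, so an implementation would normally bound it.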

According to a further preferred embodiment, the interactive speech recognition system further comprises means for switching between predefined interaction modes on the basis of the predicted performance level. In dialogue systems in particular, a plurality of interaction and communication modes exist for the speech recognition and/or dialogue system. Specifically, the speech recognition and/or dialogue system may be adapted to reproduce the recognized speech and present it to the user, who then has to confirm or reject the result of the speech recognition procedure.

The triggering of such verification prompts can be governed effectively by the predicted performance level. For example, verification prompts may be triggered very frequently when the performance level is poor, whereas they may be inserted into the dialogue only very rarely when the performance level is high. Other interaction modes may involve complete rejection of a received speech sequence, which is particularly appropriate under very poor noise conditions. In this case, the user may simply be instructed to reduce the background noise level or to repeat the speech sequence. Alternatively, when the system switches by itself to a higher pruning level that requires more computation time in order to compensate for an increased noise level, the user may simply be informed of the corresponding delay or reduced performance of the speech recognition system.
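
The mode switching described above can be sketched as a simple threshold rule. The mode names and the threshold values are hypothetical; the source only states that verification becomes more frequent as the predicted performance drops and that input may be rejected under very poor conditions.

```python
def choose_dialogue_mode(predicted_wer, reject_above=0.5, verify_above=0.2):
    """Map a predicted word error rate to an interaction mode.

    Thresholds are illustrative assumptions, not values from the source.
    """
    if predicted_wer >= reject_above:
        # Very poor conditions: refuse input, ask the user to reduce noise.
        return "reject_and_ask_to_reduce_noise"
    if predicted_wer >= verify_above:
        # Poor conditions: confirm every recognized input with the user.
        return "verify_every_input"
    # Good conditions: insert verification prompts only occasionally.
    return "verify_rarely"


mode = choose_dialogue_mode(0.3)
```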

According to a further preferred embodiment of the invention, the means for receiving the acoustic signals are further adapted to record background noise in response to receiving an activation signal generated by an activation module. The activation signal generated by the activation module triggers the means for receiving the acoustic signals. Since the means for receiving the acoustic signals are preferably adapted to record background noise before the user's utterance occurs, the activation module attempts to trigger the means for receiving the acoustic signals selectively at times when the absence of speech is expected.

This can be realized effectively by an activation button to be pressed by the user, in combination with a readiness indicator. By pressing the activation button, the user switches the speech recognition system into service, and after a short delay the speech recognition system indicates its readiness. During this delay it can be assumed that the user does not yet speak. The delay between pressing the activation button and indicating the readiness of the system can therefore be used effectively to measure and record the momentary background noise.
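
A minimal sketch of using the activation delay for noise measurement follows. It assumes audio arrives as an iterable of fixed-size sample frames and summarizes the noise as an RMS level in dB relative to full scale; the frame source, the function names, and the choice of RMS as the measure are assumptions for illustration, and no real audio API is used.

```python
import math
from itertools import islice


def rms_level_db(samples):
    """Root-mean-square level in dB relative to full scale (1.0)."""
    if not samples:
        raise ValueError("no samples recorded")
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def record_noise_during_delay(frame_source, delay_frames):
    """Consume only the frames that arrive between activation and readiness.

    The user is assumed to be silent in this window, so the frames are
    treated as pure background noise.
    """
    frames = list(islice(frame_source, delay_frames))
    samples = [s for frame in frames for s in frame]
    return rms_level_db(samples)


# Simulated microphone: constant low-level noise, 4 samples per frame.
simulated_mic = iter([[0.1, -0.1, 0.1, -0.1]] * 10)
noise_db = record_noise_during_delay(simulated_mic, delay_frames=5)
# After this call the readiness indicator would be switched on.
```

Because `islice` stops after `delay_frames` frames, the remaining frames stay in the source for the recognizer proper.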

Alternatively, instead of pressing an activation button, activation may also be performed by voice control. In such an embodiment, the speech recognition system is in a continuous listening mode based on a separate robust speech recognizer that is specifically adapted to catch particular activation phrases. Here too, the system is adapted not to react immediately to a recognized activation phrase but to use a predefined delay for gathering background noise information.

Additionally, when the system is implemented in a dialogue system, a speech pause typically occurs after a greeting message of the dialogue system. The speech recognition system of the invention therefore makes effective use of well-defined or artificially generated speech pauses in order to determine the underlying background noise sufficiently. Preferably, the determination of the background noise is incorporated by using natural speech pauses, or speech pauses that are typical of speech recognition and/or dialogue systems, so that the user does not become aware of the background noise recording step.

According to a further preferred embodiment of the invention, the means for indicating the predicted performance to the user are adapted to generate an audible and/or visual signal that indicates the predicted performance level. For example, the predicted performance level may be displayed to the user by color-encoded blinking or flashing of an LED. Different colors such as green, yellow, and red may indicate a good, medium, or low performance level. Moreover, a plurality of light spots may be arranged along a straight line, and the performance level may be indicated by the number of light spots that flash simultaneously. Additionally, the performance level may be indicated by a beeping tone, and in a more sophisticated environment the speech recognition system may instruct the user audibly by means of predefined speech sequences that it can reproduce. The latter is preferably implemented in speech-recognition-based dialogue systems that can only be accessed, for example, by telephone. Here, when the predicted performance level is low, the interactive speech recognition system may instruct the user to reduce the noise level and/or to repeat the spoken words.
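
The visual indication could be derived from the predicted error rate along the following lines. The color thresholds, the number of light spots, and the linear mapping from error rate to lit spots are assumptions chosen for the example; the source only names the colors and the idea of a row of light spots.

```python
def led_indication(predicted_wer, spots=5):
    """Return (color, number of lit spots) for a predicted word error rate.

    Thresholds and the linear spot mapping are hypothetical.
    """
    if predicted_wer < 0.1:
        color = "green"    # good performance expected
    elif predicted_wer < 0.3:
        color = "yellow"   # medium performance expected
    else:
        color = "red"      # low performance expected
    # More lit spots means better expected recognition quality.
    quality = max(0.0, min(1.0, 1.0 - predicted_wer))
    return color, round(quality * spots)


indication = led_indication(0.2)
```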

In another aspect, the invention provides a method of interactive speech recognition, comprising the steps of: receiving acoustic signals comprising background noise; selecting one noise model from a plurality of trained noise models on the basis of the received acoustic signals; predicting a performance level of a speech recognition procedure on the basis of the selected noise model; and indicating the predicted performance level to the user.

According to a further preferred embodiment of the invention, each of the trained noise models represents a particular noise and is generated by means of a first training procedure that is performed under the corresponding noise condition. This requires a dedicated training procedure for generating the plurality of noise models. For example, when the speech recognition system of the invention is applied in an automotive environment, the corresponding noise model has to be trained under automotive conditions, or at least under simulated automotive conditions.

According to a further preferred embodiment of the invention, the prediction of the performance level of the speech recognition procedure is based on a second training procedure. The second training procedure serves to train the prediction of performance levels on the basis of the selected noise conditions and the selected noise models. The second training procedure is therefore adapted to monitor the performance of the speech recognition procedure for each noise condition that corresponds to a particular noise model generated by the first training procedure. The second training procedure thus provides trained data that represent a specific error rate, such as the WER or CER, of the speech recognition procedure, measured under the particular noise condition while the speech recognition makes use of the respective noise model.
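
One simple way to picture the output of the second training procedure is a lookup table of measured error rates per noise condition. The sketch below aggregates word error counts from evaluation runs into such a table, keyed by noise model and a coarse SNR band. The record layout, the SNR banding, and all names and numbers are hypothetical; the source does not prescribe how the trained data are stored.

```python
from collections import defaultdict


def train_performance_table(runs):
    """Aggregate evaluation runs into a WER table.

    runs: iterable of (noise_model, snr_band, word_errors, word_count),
    where word_errors counts substitutions, deletions, and insertions.
    """
    totals = defaultdict(lambda: [0, 0])
    for model, snr_band, errors, words in runs:
        totals[(model, snr_band)][0] += errors
        totals[(model, snr_band)][1] += words
    return {key: errors / words for key, (errors, words) in totals.items()}


def predict_wer(table, model, snr_band, default=None):
    """Look up the error rate measured for this noise condition, if any."""
    return table.get((model, snr_band), default)


# Made-up evaluation results for illustration only.
RUNS = [
    ("car", "low_snr", 30, 100),
    ("car", "low_snr", 20, 100),
    ("car", "high_snr", 5, 100),
    ("office", "high_snr", 2, 100),
]
TABLE = train_performance_table(RUNS)
predicted = predict_wer(TABLE, "car", "low_snr")
```

Keying the table on both the selected model and a noise parameter mirrors the earlier remark that combining the two is expected to give a more reliable prediction than either alone.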

In another aspect, the invention provides a computer program product for an interactive speech recognition system. The computer program product comprises computer program means that are adapted to receive acoustic signals comprising background noise, to select a noise model on the basis of the received acoustic signals, to calculate a performance level of a speech recognition procedure on the basis of the selected noise model, and to indicate the predicted performance level to the user.

In still another aspect, the invention provides a dialogue system for providing a service to a user by processing speech input generated by the user. The dialogue system comprises the interactive speech recognition system of the invention. The speech recognition system of the invention is thus incorporated as an integral part of a dialogue system, such as an automatic timetable information system that provides public transport information.

Furthermore, it should be noted that any reference signs in the claims are not to be construed as limiting the scope of the invention.

In the following, preferred embodiments of the invention are described in detail with reference to the drawings.

Fig. 1 is a block diagram of the speech recognition system.

Fig. 2 is a detailed block diagram of the speech recognition system.

Fig. 3 is a flowchart for predicting the performance level of the speech recognition system.

Fig. 4 is a flowchart in which the performance level prediction is incorporated into the speech recognition procedure.

Fig. 1 shows a block diagram of the interactive speech recognition system 100 of the invention. The speech recognition system has a speech recognition module 102, a noise recording module 104, a noise classification module 106, a performance prediction module 108, and an indication module 110. A user 112 can interact with the speech recognition system 100 by providing speech that is to be recognized by the speech recognition system 100 and by receiving, via the indication module 110, feedback that indicates the performance of the speech recognition.

The individual modules 102, ..., 110 are designed to realize the performance prediction functionality of the speech recognition system 100. Additionally, the speech recognition system 100 comprises standard speech recognition components that are known in the prior art and are not explicitly illustrated.

The speech provided by the user 112 is input into the speech recognition system 100 by some kind of recording device, such as a microphone, that converts the acoustic signal into a corresponding electrical signal that can be processed by the speech recognition system 100. The speech recognition module 102 is the central component of the speech recognition system 100; it analyzes the recorded phonemes and maps them to word sequences or phrases provided by a language model. In principle, any speech recognition technique is applicable to the present invention. Furthermore, the speech input by the user 112 is provided directly to the speech recognition module 102 for recognition.

The noise recording and noise classification modules 104, 106, as well as the performance prediction module 108, are designed to predict the performance of the speech recognition procedure executed by the speech recognition module 102 solely on the basis of recorded background noise. The noise recording module 104 is designed to record background noise and to provide the recorded noise signals to the noise classification module 106. For example, the noise recording module 104 records a noise signal during a delay of the speech recognition system 100. Typically, the user 112 activates the speech recognition system 100, and after a predefined delay interval has elapsed, the speech recognition system indicates its readiness to the user 112. During this delay, it can be assumed that the user 112 simply waits for the readiness of the speech recognition system and therefore does not produce any speech. The acoustic signals recorded during this delay interval are therefore expected to represent background noise only.

After the noise has been recorded by the noise recording module 104, the noise classification module 106 serves to identify the recorded noise signals. Preferably, the noise classification module 106 makes use of noise classification models that are stored in the speech recognition system 100 and are specific to different background noise scenarios. These noise classification models are typically trained under the corresponding noise conditions. For example, a particular noise classification model may represent automotive background noise. When the user 112 uses the speech recognition system 100 in an automotive environment, the recorded noise signals are likely to be identified as automotive noise by the noise classification module 106, and the corresponding automotive noise classification model can be selected. The selection of a particular noise classification model is also performed by the noise classification module 106. The noise classification module 106 may further be adapted to extract and specify various noise parameters, such as the noise signal level or the signal-to-noise ratio.

In general, the selected noise classification model, together with the other noise-specific parameters determined and selected by the noise classification module 106, is provided to the performance prediction module 108. The performance prediction module 108 may additionally receive the unaltered recorded noise signals from the noise recording module 104. The performance prediction module 108 calculates the expected performance of the speech recognition module 102 on the basis of any one of the provided noise signals, the noise-specific parameters, or the selected noise classification model. Moreover, the performance prediction module 108 is adapted to determine the performance prediction using several of the provided noise-specific inputs together. For example, the performance prediction module 108 effectively combines the selected noise classification model and the noise-specific parameters in order to determine a reliable performance prediction for the speech recognition procedure. As a result, the performance prediction module 108 generates a performance level that is provided to the indication module 110 and to the speech recognition module 102.

By providing the determined performance level of the speech recognition procedure to the display module 110, the user 112 is informed of the predicted performance and reliability of the speech recognition procedure. The display module 110 can be implemented in many different ways. It may, for example, generate a blinking, color-coded output that has to be interpreted by the user 112. In a more sophisticated embodiment, the display module 110 may also be provided with speech synthesizing means in order to generate an audible output to the user 112, which may even instruct the user 112 to perform certain actions to improve the speech quality and/or to reduce the background noise.

The speech recognition module 102 is further adapted to directly receive the input signal from the user 112, the recorded noise signal from the noise recording module 104, the noise parameters and the selected noise classification model from the noise classification module 106, and the predicted performance level of the speech recognition procedure from the performance prediction module 108. By providing any of the generated parameters to the speech recognition module 102, not only can the expected performance of the speech recognition procedure be determined, but the speech recognition procedure itself can also be adapted to the present noise situation.

In particular, by providing the speech recognition module 102 with the noise model selected by the noise classification module 106 and the associated noise parameters, the underlying speech recognition procedure can make use of the selected noise model. Furthermore, by providing the speech recognition module 102 with the performance level predicted by the performance prediction module 108, the speech recognition procedure can be tuned accordingly. For example, when the performance prediction module 108 has determined a relatively high error rate, the pruning level of the speech recognition procedure can be adaptively adjusted in order to increase the reliability of the speech recognition procedure. Because shifting the pruning level towards higher values requires considerable additional computation time, the overall efficiency of the underlying speech recognition procedure may decrease substantially. As a result, the entire speech recognition procedure becomes more reliable but slower. In this case it is appropriate to use the display module 110 to indicate this kind of lower performance to the user 112.
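The trade-off between reliability and computation time can be sketched with a simple beam-pruning example. Here the "pruning level" is modelled, as an assumption, as a beam width on log scores: a wider beam keeps more hypotheses alive and therefore costs more computation. The function names, the threshold, the gain and the cap are hypothetical.

```python
def adapt_pruning_beam(base_beam, predicted_error, threshold=0.15,
                       gain=2.0, max_beam=400.0):
    """Widen the beam when the predicted error rate exceeds a threshold.

    All constants are illustrative; a real recognizer would tune them.
    """
    if predicted_error <= threshold:
        return base_beam
    widened = base_beam * (1.0 + gain * (predicted_error - threshold))
    return min(widened, max_beam)

def prune(hypotheses, beam):
    """Keep (text, log_score) hypotheses within `beam` of the best score."""
    best = max(score for _, score in hypotheses)
    return [(text, score) for text, score in hypotheses if score >= best - beam]
```

With a narrow beam fewer hypotheses survive each pruning pass, which is fast but risks discarding the correct one in noisy conditions; widening the beam reverses that trade-off.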

FIG. 2 illustrates a more sophisticated embodiment of the interactive speech recognition system 100. Compared to the embodiment shown in FIG. 1, FIG. 2 shows additional components of the interactive speech recognition system 100. Here, the speech recognition system 100 further comprises a dialogue module 114, a noise model module 116, an operation module 118, and a control module 120. Preferably, the speech recognition module 102 is connected to the various modules 104, ..., 108 as already shown in FIG. 1. The control module 120 is adapted to control the interplay and to coordinate the functions of the various modules of the interactive speech recognition system 100.

The dialogue module 114 is adapted to receive the predicted performance level from the performance prediction module 108 and to control the display module 110. Preferably, the dialogue module 114 provides various dialogue strategies that can be applied when communicating with the user 112. For example, the dialogue module 114 is adapted to trigger a verification prompt that is provided to the user 112 by the display module 110. Such a verification prompt may include a reproduction of the recognized speech of the user 112. The user 112 then has to accept or reject the reproduced speech, depending on whether it actually represents the semantic meaning of the user's original speech.

The dialogue module 114 is preferably governed by the predicted performance level of the speech recognition procedure. Depending on the predicted performance level, the triggering of verification prompts can be adapted accordingly. In extreme cases, where the performance level indicates that reliable speech recognition is not possible, the dialogue module 114 may even trigger the display module 110 to generate an appropriate user instruction, for example instructing the user 112 to reduce the background noise.
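A dialogue module governed by the predicted performance level could, under simple assumptions, be reduced to a threshold-based strategy selection as sketched below. The strategy names and threshold values are hypothetical and serve only to show how verification prompts might become more frequent as the predicted error rate rises.

```python
def choose_dialogue_strategy(predicted_error):
    """Pick a dialogue strategy from a predicted error rate in [0, 1].

    Thresholds and strategy names are illustrative assumptions.
    """
    if predicted_error >= 0.50:
        # Reliable recognition is not expected: instruct the user first.
        return "ask_user_to_reduce_noise"
    if predicted_error >= 0.20:
        return "verify_every_input"
    if predicted_error >= 0.10:
        return "verify_critical_input"
    return "no_verification"
```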

The noise model module 116 serves as a repository for the various noise classification models. The plurality of different noise classification models is preferably generated by corresponding training procedures performed under the respective noise conditions. Specifically, the noise classification module 106 accesses the noise model module 116 in order to select a particular noise model. Alternatively, the noise model selection may be performed by the noise model module 116 itself. In this case, the noise model module 116 receives the recorded noise signal from the noise recording module 104, compares a portion of the received noise signal with the various stored noise classification models, and determines at least one noise classification model that matches that portion of the recorded noise. The best-matching noise classification model is then provided to the noise classification module 106, which can generate further noise-specific parameters.

The operation module 118 serves to trigger the noise recording module 104. Preferably, the operation module 118 is implemented as a specially designed speech recognizer adapted to capture a specific activation phrase spoken by the user. In response to receiving and identifying the activation phrase, the operation module 118 activates the noise recording module 104. In addition, the operation module 118 triggers the display module 110 via the control module 120 in order to indicate readiness to the user 112. Preferably, readiness is indicated only after the noise recording module 104 has been activated. During this delay it can be assumed that the user 112 is not speaking but is waiting for the speech recognition system 100 to become ready. This delay interval is therefore ideally suited for recording an acoustic signal that represents purely the actual background noise.

Instead of implementing the operation module 118 by means of a separate speech recognition module, the operation module may also be implemented by some other kind of activation means. For example, the operation module 118 may provide an activation button that has to be pressed by the user 112 in order to activate the speech recognition system. Here too, the delay required for recording the background noise can be implemented accordingly. In particular, when the interactive speech recognition system is implemented as a telephone-based dialogue system, the operation module 118 may be adapted to start the noise recording after some kind of message of the dialogue system has been provided to the user 112. Most typically, after a welcome message has been provided to the user 112, a suitable speech pause occurs that can be used for recording the background noise.

FIG. 3 shows a flowchart for predicting the performance level of the interactive speech recognition system of the present invention. In a first step 200, an operation signal, i.e. a signal that activates the system, is received. This operation signal may refer to the pressing of a button by the user 112, to the receipt of an activation phrase spoken by the user, or, when implemented in a telephone-based dialogue system, to the end of a greeting message provided to the user 112. In response to receiving the operation signal in step 200, a noise signal is recorded in the subsequent step 202. Because the operation signal marks the beginning of a speech-free period, the recorded signal is expected to represent background noise only. After the background noise has been recorded in step 202, the recorded noise signal is evaluated by the noise classification module 106 in the following step 204. The evaluation of the noise signal comprises the selection of a particular noise model in step 206 as well as the generation of noise parameters in step 208. Steps 206 and 208 thus determine a particular noise model and its associated noise parameters.

Based on the selected noise model and the generated noise parameters, the performance level of the speech recognition procedure is predicted by the performance prediction module 108 in the following step 210. The predicted performance level is indicated to the user in step 212 by means of the display module 110. Afterwards or simultaneously, the speech recognition is performed in step 214. Because the prediction of the performance level is based on noise input that precedes the speech input, the predicted performance level can in principle be displayed to the user 112 even before the user starts to speak.

Furthermore, the predicted performance level may be generated on the basis of an additional training procedure that provides a relationship between the various noise models and noise parameters on the one hand and measured error rates on the other. The predicted performance level thus focuses on the expected output of the speech recognition procedure. The predicted performance level is preferably not only indicated to the user but is preferably also used by the speech recognition procedure in order to reduce the error rate.
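The additional training procedure can be pictured as building a lookup from noise conditions to measured error rates. The sketch below assumes, for illustration only, that measurements are triples of (noise model, SNR in dB, measured error rate) and that SNR is grouped into fixed-width buckets; the function names and the bucket width are hypothetical.

```python
from collections import defaultdict

def snr_bucket(snr_db, width=5.0):
    """Group SNR values into buckets of `width` dB."""
    return int(snr_db // width)

def train_error_table(measurements, width=5.0):
    """Average measured error rates per (noise model, SNR bucket)."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [sum of errors, count]
    for model, snr_db, error in measurements:
        entry = sums[(model, snr_bucket(snr_db, width))]
        entry[0] += error
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

def lookup_error(table, model, snr_db, width=5.0, default=None):
    """Predict the error rate for a new noise condition, if one was trained."""
    return table.get((model, snr_bucket(snr_db, width)), default)
```

At run time, the selected noise model and the measured SNR index into the table, so the prediction is available before any speech has been recognized.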

FIG. 4 shows a flowchart for making use of the predicted performance level within the speech recognition procedure. Steps 300 to 308 correspond to steps 200 to 208 already shown in FIG. 3. In step 300 the operation signal is received, in step 302 a noise signal is recorded, and in step 304 the recorded noise signal is evaluated. The evaluation of the noise signal comprises the two steps 306 and 308, in which a particular noise classification model is selected and the corresponding noise parameters are generated. After the noise-specific parameters have been generated in step 308, they are used in step 318 to tune the recognition parameters of the speech recognition procedure. After the speech recognition parameters, for example the pruning level, have been tuned in step 318, the speech recognition procedure is performed in step 320; when implemented in a dialogue system, the corresponding dialogue is also carried out in step 320. In general, steps 318 and 320 represent a prior-art approach that uses noise-specific parameters to improve the speech recognition procedure. In contrast, steps 310 to 316 represent the performance prediction of the present invention, which is based on the evaluation of the background noise.

After the noise model has been selected in step 306, step 310 checks whether the selection was successful. If no particular noise model has been selected, the method continues with step 318, in which the determined noise parameters are used to tune the recognition parameters of the speech recognition procedure. If step 310 confirms that a particular noise classification model has been selected successfully, the method proceeds to step 312, in which the performance level of the speech recognition procedure is predicted on the basis of the selected noise model. In addition, the prediction of the performance level may also incorporate the noise-specific parameters determined in step 308. After the performance level has been predicted in step 312, steps 314 to 318 are executed simultaneously or alternatively.

In step 314, the dialogue parameters for the dialogue module 114 are tuned with respect to the predicted performance level. These dialogue parameters specify the time intervals at which verification prompts have to be triggered in the dialogue system. Alternatively, the dialogue parameters may specify various dialogue scenarios between the interactive speech recognition system and the user. For example, a dialogue parameter may stipulate that the user has to reduce the background noise before the speech recognition procedure can be performed. In step 316, the determined performance level is indicated to the user by means of the display module 110. In this way, the user 112 is informed of the degree of performance and hence of the reliability of the speech recognition procedure. In addition, the tuning of the recognition parameters performed in step 318 can make use of the performance level predicted in step 312.

Steps 314, 316, and 318 may be executed simultaneously, sequentially, or only selectively. Selective execution refers to the case in which only one or two of steps 314, 316, and 318 are executed. In any case, after execution of any of steps 314, 316, and 318, the speech recognition procedure is performed in step 320.
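The control flow of FIG. 4 from step 306 onwards, including the check in step 310 and the selective execution of steps 314, 316, and 318, can be summarized in a short sketch. The function `run_fig4_flow`, the `steps` dictionary of callables, and the `enabled` set are hypothetical constructs for this example; sequential execution is shown, although the text also allows simultaneous execution.

```python
def run_fig4_flow(noise, steps, enabled=frozenset({314, 316, 318})):
    """Run steps 306-320 on an already recorded noise signal.

    `steps` maps hypothetical step names to callables; `enabled` selects
    which of steps 314, 316 and 318 run after a successful prediction.
    Returns the recognition result and the list of executed step numbers.
    """
    trace = []
    model = steps["select_model"](noise)          # step 306
    params = steps["extract_params"](noise)       # step 308
    level = None
    if model is not None:                         # step 310: selection succeeded
        level = steps["predict"](model, params)   # step 312
        trace.append(312)
        if 314 in enabled:
            steps["tune_dialogue"](level)         # step 314
            trace.append(314)
        if 316 in enabled:
            steps["indicate"](level)              # step 316
            trace.append(316)
    if model is None or 318 in enabled:
        steps["tune_recognizer"](params, level)   # step 318
        trace.append(318)
    trace.append(320)
    return steps["recognize"](), trace            # step 320
```

When no noise model is selected, the sketch falls back to the path through step 318 only, which corresponds to the prior-art tuning described for FIG. 4.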

The present invention therefore provides an effective means for estimating the performance level of a speech recognition procedure on the basis of recorded background noise. Preferably, the interactive speech recognition system of the present invention is adapted to provide appropriate performance feedback to the user 112 even before speech is input into the recognition system. Because the predicted performance level can be exploited in a number of different ways, the performance prediction of the present invention can be implemented universally in various existing speech recognition systems. In particular, the performance prediction of the present invention can be combined universally with existing noise-reducing and/or noise-level-indicating systems.

[Brief Description of Reference Numerals]

100: speech recognition system 102: speech recognition module

104: noise recording module 106: noise classification module

108: performance prediction module 110: display module

112: user 114: dialogue module

116: noise model module 118: operation module

120: control module

As described above, the present invention is applicable to interactive speech recognition.

Claims

An interactive speech recognition system (100) for recognizing speech of a user (112), comprising:

means for receiving an acoustic signal comprising background noise;

means (106) for selecting a noise model based on the received acoustic signal;

means (108) for predicting a performance level of a speech recognition procedure based on the selected noise model; and

means (110) for displaying the predicted performance level to the user.

The system of claim 1, wherein the means for predicting the performance level is further adapted to predict the performance level based on a noise parameter determined based on the received acoustic signal.

The interactive speech recognition system of claim 1, further adapted to adjust at least one speech recognition parameter of the speech recognition procedure based on the predicted performance level.

The system of claim 1, further comprising means (114) for switching a predefined conversation mode based on the predicted performance level.

The interactive speech recognition system of claim 1, wherein the means for predicting the performance level is adapted to predict the performance level before executing the speech recognition procedure.

The interactive speech recognition system of claim 1, wherein the means for receiving the acoustic signal is further adapted to record background noise in response to receiving an operation signal generated by an operation module (118).

The system of claim 1, wherein the means (110) for displaying the predicted performance level to the user (112) is adapted to generate an audio and/or video signal indicative of the predicted performance level.

An interactive speech recognition method, comprising:

receiving an acoustic signal comprising background noise;

selecting one noise model from a plurality of trained noise models based on the received acoustic signal;

predicting a performance level of a speech recognition procedure based on the selected noise model; and

displaying the predicted performance level to a user.

The method of claim 8, further comprising generating each of the noise models using a first training procedure under a corresponding noise condition.

The method of claim 8, wherein the prediction of the performance level of the speech recognition procedure is based on a second training procedure, the second training procedure being adapted to monitor the performance of the speech recognition procedure for each of the noise conditions.

A computer program product for an interactive speech recognition system, comprising computer program means adapted to:

receive an acoustic signal comprising background noise;

select a noise model based on the received acoustic signal;

calculate a performance level of a speech recognition procedure based on the selected noise model; and

present the predicted performance level to a user.

An automatic interactive system comprising the interactive speech recognition system of claim 1.