KR20200109833A

KR20200109833A - A Computer Program for Reducing Waiting Time in Automatic Speech Recognition

Info

Publication number: KR20200109833A
Application number: KR1020190029556A
Authority: KR
Inventors: 오성조; 김은직; 이수익; 박준우
Original assignee: 주식회사 포지큐브
Priority date: 2019-03-14
Filing date: 2019-03-14
Publication date: 2020-09-23

Abstract

A program that causes a computer to execute a method for reducing waiting time in automatic speech recognition. The program monitors the voice buffer to predict the "end of speech" (EOS) of a received utterance, which improves the user experience of natural-language-based human-computer interaction through faster response times and a shorter overall waiting time. It classifies incoming sound, removes noise, detects disfluent words and phrases to determine the EOS timeout, calls EOS based on weighted keywords, and shortens speech recognition waiting time by variably controlling the EOS time.

Description

A Computer Program for Reducing Waiting Time in Automatic Speech Recognition: a program for causing a computer to execute a method of reducing waiting time in automatic speech recognition.

The present invention relates to a program for causing a computer to execute a method of reducing waiting time in automatic speech recognition. The method monitors the voice buffer to predict the "end of speech" of a received utterance, thereby improving the user experience of natural-language-based human-computer interaction with faster response times and a shorter overall waiting time.

Automatic speech recognition (ASR) and natural language understanding (NLU) take a user's speech as input and, through an interpretation process, find the closest matching data in text form. A deep neural network (DNN) is used to build a probabilistic state diagram that searches for the text data closest to an established language model, and it is trained on a vast amount of data consisting of dictionary data, morphological data, training data, and the like.

An ASR system can itself indicate the end of speech (EOS), but this call may come too early to capture meaningful contextual data, or unnecessarily late. Disfluent expressions are often followed by silence, so the ASR system either returns meaningless data or discards the input entirely for lack of context, adding waiting time to the overall user experience.

Korean Patent Laid-Open Publication No. 10-2009-0054642 (Title of invention: Speech recognition method and apparatus)

Speech data may contain stretches of silence caused by disfluency. When such silence occurs, current ASR systems either judge it to be the end of speech and return meaningless data as is, or fail to determine the end of speech correctly and add unnecessary waiting time.

As a result, meaningful text data may not be extracted correctly, or user convenience is impaired by the increased waiting time.

The present invention was devised to solve the problems described above. Its object is to provide a program for causing a computer to execute a method of reducing waiting time in automatic speech recognition, in which the voice buffer is monitored to predict the "end of speech" of a received utterance, thereby improving the user experience of natural-language-based human-computer interaction with faster response times and a shorter overall waiting time.

A preferred embodiment of the present invention, as a means for solving the above problems, is a program for causing a computer to execute a method of reducing waiting time in automatic speech recognition, the method comprising: receiving speech as audio input; performing noise analysis on the incoming sound data; determining whether the received sound data has the characteristics of actual speech data; performing automatic speech recognition to convert the data into its closest equivalent in text form; determining the value of the ASR output, the speech category, and the speech disfluency; determining the duration of the end-of-speech time window based on the value of the ASR output, the speech category, and the speech disfluency; determining the moment of end of speech based on the value of subsequent ASR output, the speech category, and the speech disfluency; and calling an early EOS, ahead of the final ASR result, based on the EOS timeout and intermediate ASR results.

In addition, the noise analysis step includes observing the duration of continuous levels of the input sound to determine the noise level.

The method also includes observing the ratio of silent to non-noise portions of a sound sample to determine the value of the input data.

The method also includes maintaining a list of disfluency words in order to anticipate the high-weight keywords that follow.

It may also include a list of known speech disfluencies used to adjust the EOS time frame window.

It may also include determining the EOS based on the weighted keywords.

It may also include observing the pattern of silence and data bursts to determine the likelihood that the input data can be regarded as speech data.

It may further include calculating the ratio of silence to sound level and determining the likelihood that the input data can be regarded as speech data.

It may also include determining the requested query based on the form of a word (in this case, whether the word is the object of the sentence).

In addition, based on caller records, a longer introduction may be provided to first-time callers to help them prepare to converse with the artificial intelligence.

According to an embodiment of the present invention, the speech end timer window is adaptively varied according to the characteristics of the speech data obtained by analyzing the speech signal, so that the end-of-speech (EOS) point can be detected more quickly and accurately.

In addition, the characteristics of the speech data are determined according to at least one of the value information of the speech data, the classification information of the speech data, and the disfluency information of the speech data, each detected through analysis of the speech signal. This shortens the waiting time for a speech recognition response while maintaining a high level of accuracy, and thereby greatly improves user convenience.

For a more complete understanding of the invention, reference is now made to the following description taken in conjunction with the accompanying drawings.
FIG. 1 shows the ASR latency experienced by the user after a command is issued.
FIG. 2 shows how speech input data is perceived and how the EOS call should be detected, according to an aspect of the present specification.
FIG. 3 shows a local database for finding potential matches among previously issued commands in order to reduce waiting time, according to an aspect of the present specification.
FIG. 4 shows how the EOS call is handled when a disfluency is present, according to an aspect of the present specification.
FIG. 5 shows how a dynamic EOS window size can reduce waiting time, according to an aspect of the present specification.
FIG. 6 shows the waiting time that can be saved for utterances containing disfluencies, according to an aspect of the present specification.
FIG. 7 illustrates an automatic speech recognition method with a variable EOS timeout and context-dependent EOS calling, improved by voice activity detection and the continuity characteristics of speech data, according to an aspect of the present specification.
FIG. 8 illustrates a method of filtering non-speech data prior to ASR processing, according to an aspect of the present specification.
FIG. 9 illustrates dynamic word weight assignment, according to an aspect of the present specification.
FIG. 10 shows the overall reduction in waiting time that can be achieved, according to an aspect of the present specification.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The present invention improves the user experience by reducing the waiting time of the speech recognition response to speech input while maintaining a high level of accuracy. It does so by performing noise analysis based on the presence, sequence, and continuity of speech, and by calling EOS based on analysis of intermediate results and detection of meaningless speech (speech disfluency).

Because most client-side systems lack the hardware needed to host a complete language model locally, the speech recognition model is usually hosted on a remote server. The client sends speech and receives the speech-to-text (STT) result. The ASR system can return intermediate results and decide when to call EOS. When EOS is called, the final machine interpretation of the dialogue is returned.

A timeout mechanism is usually used by ASR systems to call EOS. After observing a predetermined number of silent time frames, the ASR system calls EOS, processes the speech data obtained, and interprets it as text data. The silent time frame that the ASR system defines for determining EOS is tied to the waiting time the user experiences. Because the user may pause mid-sentence for any reason while speaking a command, a wrongly timed EOS call can produce an unwanted final result.

FIG. 1 shows, using a log-linear model, the overall waiting time a user experiences before receiving the intended result. The overall waiting time can be expressed by the following equation.

log(Y_i) = μ + α + βX_i + υ_i

Here the response variable is the network delay (μ, 103), the independent variable is the ASR processing time (α, 102), the blocking variable is the ASR process factor due to EOS detection (β, 101), and υ is the proportion of results that differ from what the user expected. X is the utterance length (104), which is determined by the moment EOS is called. If an unexpected result forces the user to repeat the utterance, the overall X can increase.
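As a rough illustration of the log-linear form above, the sketch below solves the equation for Y by exponentiating the right-hand side. The function name and all numeric values are hypothetical; the specification gives no units or fitted coefficients, so this only shows how a longer utterance length X raises the modeled waiting time when β is positive.

```python
import math


def overall_waiting_time(mu: float, alpha: float, beta: float,
                         x: float, upsilon: float = 0.0) -> float:
    """Evaluate Y from log(Y) = mu + alpha + beta * x + upsilon.

    mu      -- network delay term
    alpha   -- ASR processing time term
    beta    -- coefficient attributed to EOS detection
    x       -- utterance length, set by when EOS is called
    upsilon -- error term (unexpected results)
    All inputs are illustrative placeholders, not values from the source.
    """
    return math.exp(mu + alpha + beta * x + upsilon)


# A repeated utterance doubles X in this toy example and increases Y.
single = overall_waiting_time(0.1, 0.2, 0.05, 4.0)
repeated = overall_waiting_time(0.1, 0.2, 0.05, 8.0)
```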

When the ASR system detects silence in the audio input, the EOS timeout starts. If audio is detected before the EOS timeout (100 seconds) elapses, the timeout is reset. If no audio is detected within the EOS timeout time frame (101), EOS is called. Once the ASR system has processed the audio, it returns the result to the user (105).
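The fixed timeout described above can be sketched as a counter of consecutive silent frames. This is a minimal illustration, assuming audio has already been reduced to one boolean per time frame; the function name and the frame-based unit of the timeout are assumptions, not part of the specification.

```python
from typing import Iterable, Optional


def find_eos_frame(frames_have_audio: Iterable[bool],
                   timeout_frames: int) -> Optional[int]:
    """Return the index of the frame at which EOS is called, or None.

    The silence counter starts at the first silent frame and is reset
    whenever audio is detected again before the timeout elapses.
    """
    silent_run = 0
    for index, has_audio in enumerate(frames_have_audio):
        if has_audio:
            silent_run = 0              # audio before the timeout resets it
        else:
            silent_run += 1
            if silent_run >= timeout_frames:
                return index            # timeout elapsed with no audio
    return None                         # stream ended before the timeout


# A mid-sentence pause of two frames is tolerated with a 3-frame timeout.
frames = [True, True, False, False, True, False, False, False]
eos_index = find_eos_frame(frames, timeout_frames=3)
```

With a shorter timeout the same pause would end the utterance prematurely, which is the failure mode the specification sets out to avoid.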

FIG. 2 shows how an ASR system perceives speech input data. As shown at 201, 208, 202, and 209, patterns of data bursts and silence may occur, but continuous data may also occupy multiple time frames, as observed in the following.

In most ASR systems, EOS is called at 207, after a complete sentence followed by several time frames of silence. Calling an early EOS at point 211 lets the desired action be carried out much sooner, reducing the waiting time perceived by the user. To compensate for misinterpretations caused by the early EOS, a local ASR solution may reside in the client, in addition to the server ASR, to access cached interpretations.

Calling EOS early can greatly reduce the waiting time perceived by the user, bringing the model closer to a log-log model. However, doing so introduces potential misinterpretations, which add to the overall waiting time the user experiences. To keep erroneous results from lowering the probability of the desired final result, a server can be run that records previous queries and caches the responses to the most frequently asked ones. This introduces an additional probability model into the ASR design and reduces the likelihood of misinterpretation.

FIG. 3 shows how locally cached data, applied to intermediate ASR results, is used to call an early EOS based on a probability model. When the user (300) provides an utterance (301), the local audio analysis module (302) looks for a match for the speech data (304) in the local operational database (307). If a match is found, the required text data or speech data (308) is supplied and the audio/text data (303) is provided to the user terminal. If the speech data (304) has no match, it is sent to the remote ASR system (305), and the response (306) produced by the remote ASR system's speech recognition is provided to the user terminal. Processing in the local audio analysis module can therefore produce the desired result (303) much faster than retrieving data from the remote ASR system (305). The locally cached data is updated based on incoming speech data: command data accumulates, command probabilities are updated, and the accuracy of the probability model increases.
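The local-first lookup of FIG. 3 can be sketched as a small cache with a remote fallback. This is an illustrative sketch only: the class and function names are hypothetical, the cache is keyed on already-recognized command text rather than raw speech data, and the "probability" is a simple relative frequency, none of which the specification prescribes.

```python
from typing import Callable, Dict, Optional, Tuple


class LocalCommandCache:
    """Hypothetical stand-in for the local operational database (307)."""

    def __init__(self) -> None:
        self._responses: Dict[str, str] = {}
        self._counts: Dict[str, int] = {}

    def lookup(self, command: str) -> Optional[str]:
        return self._responses.get(command)

    def record(self, command: str, response: str) -> None:
        # Accumulate command data so the probabilities stay up to date.
        self._responses[command] = response
        self._counts[command] = self._counts.get(command, 0) + 1

    def probability(self, command: str) -> float:
        total = sum(self._counts.values())
        return self._counts.get(command, 0) / total if total else 0.0


def answer(command: str, cache: LocalCommandCache,
           remote_asr: Callable[[str], str]) -> Tuple[str, str]:
    """Serve from the local cache when possible, else ask the remote ASR."""
    cached = cache.lookup(command)
    if cached is not None:
        cache.record(command, cached)
        return cached, "local"
    response = remote_asr(command)      # slower path via the remote system
    cache.record(command, response)
    return response, "remote"
```

The second identical request is answered without a network round trip, which is where the reduction in waiting time would come from.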

Delay can increase when disfluencies are introduced into the ASR system. The ASR system either waits for more meaningful data or simply calls EOS. In both cases an additional delay is incurred in returning meaningless data, and the user may have to repeat the voice command. In this specification, a disfluency, which is a meaningless sound, is used to anticipate contextual data and to prepare the system to increase the importance of the next utterance. Korean follows a "subject-object-verb" order, and in most cases the subject can be omitted entirely while the desired context is still conveyed. This sentence form is ideal for anticipating the request and calling an early EOS while still yielding accurate results. For this specification, three months of data were collected, and 90% of user input took the form of requests following the "subject-object-verb" or "object-verb" pattern.

FIG. 4 shows how this specification handles disfluencies, which in conventional ASR systems are directly tied to the timing of the EOS call. Most ASR systems either strip out the disfluency or pass it through the matching process, producing an interpretation with little relevance to the context. In most cases, the ASR system may call EOS at 401 or 403, depending on the preset size of the time frame. While the input is being processed and the final result returned, the ASR system often accepts no further input until interpretation is complete. This often forces the user to repeat the intended utterance, lengthening the overall waiting time the user perceives. This specification extends the EOS time window and increases the weight of the next utterance after a disfluency. Instead of reaching EOS at 403, this specification proposes a time window in which EOS is disabled at 401 and 403. When the data at 404 and 405 is received, the EOS time window is reset until no more sound data is seen in consecutive time frames. EOS is then called in a time frame within 408.

As shown in FIG. 5, because speech carrying meaningful contextual data may follow a disfluency, the EOS timeout window may also be changed dynamically according to the length of the utterance that follows the disfluency. A typical EOS timeout window expires at a fixed position. By changing the size of the EOS timeout window (411), however, the overall waiting time is reduced (412).
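One way to picture a dynamic EOS window is a function that picks the timeout from the tokens heard so far. The sketch below is an assumption-laden illustration: the disfluency list, the millisecond values, and the rule of shrinking the window by a fixed step per word after a disfluency are all hypothetical choices, not taken from the specification.

```python
from typing import Sequence

# Hypothetical English placeholders for a list of known disfluencies.
DISFLUENCIES = frozenset({"um", "uh", "er"})


def eos_window_ms(tokens: Sequence[str], base_ms: int = 800,
                  extended_ms: int = 2000, min_ms: int = 300,
                  step_ms: int = 250) -> int:
    """Choose an EOS timeout window from the tokens recognized so far."""
    last_disfluency = None
    for index, token in enumerate(tokens):
        if token.lower() in DISFLUENCIES:
            last_disfluency = index
    if last_disfluency is None:
        return base_ms                      # no disfluency: default window
    words_after = len(tokens) - 1 - last_disfluency
    if words_after == 0:
        return extended_ms                  # just hesitated: wait longer
    # Meaningful words followed the disfluency: shrink the window so that
    # EOS can be called sooner, but never below the floor.
    return max(min_ms, base_ms - step_ms * words_after)
```

Extending the window right after a hesitation avoids cutting the user off, while shrinking it as context accumulates is what would shorten the overall waiting time.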

FIG. 6 shows the waiting time that can be saved, according to this specification, for utterances containing disfluencies. 500 and 501 are disfluencies followed by 502, the actual utterance the user wants processed. In the prior art (A), EOS is called at 503 after the first EOS timeout elapses at 515. 505 represents the data transmission time and processing time added after EOS is called. 506 is the result returned from ASR system A. In the prior art (A), if the contextual data is ignored, the user realizes after receiving the data produced from the disfluency that the intended utterance was ignored, and then repeats the command (506). 508 and 509 are disfluencies, followed by 510, which contains the contextual data the ASR system must process. 518 is the moment EOS is called after the EOS timeout 511. 512 represents the data transmission time and processing time for that EOS. Calling EOS later at 518 may appear to incur more delay than calling an intermediate EOS at 503, but because further data input is blocked while the ASR system interprets and returns the data, the actual waiting time shown at 506 increases by 517. While 506 is in progress, which is affected by both 515 and 505, the utterance (502) cannot be sent to the ASR system, so the user repeats the utterance at 504. By extending the time frame of the EOS timeout, a time saving of 513 is obtained compared with the case where the disfluency is not given special handling.

To compensate for potential errors in which noise is misinterpreted and triggers an unwanted early EOS, all sound data may first pass through noise analysis and voice activity detection, so that non-contextual data is filtered out and not passed to the STT and ASR systems. Incoming sound data is first analyzed through sound level monitoring to form an initial determination of whether it should be classified as noise or as speech data. This approach filters out unnecessary transactions locally before they reach the ASR server, optimizing resource usage. Speech data often follows a burst pattern, whereas noise tends to show a continuous data level. The variety and range of the sound in the frequency domain can also be used to determine whether the incoming sound can be regarded as human speech. The number of consecutive frames in which voice is present can be used to detect voice activity.
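The burst-versus-continuous distinction above can be sketched with two simple statistics over per-frame sound levels: the share of silent frames and the number of switches between active and silent frames. The thresholds and function names below are hypothetical, and the frequency-domain check mentioned in the text is not covered.

```python
from typing import Sequence


def silence_ratio(levels: Sequence[float], silence_threshold: float) -> float:
    """Fraction of frames whose level falls below the silence threshold."""
    if not levels:
        return 1.0
    silent = sum(1 for level in levels if level < silence_threshold)
    return silent / len(levels)


def count_transitions(levels: Sequence[float], threshold: float) -> int:
    """Number of switches between active and silent frames."""
    active = [level >= threshold for level in levels]
    return sum(1 for a, b in zip(active, active[1:]) if a != b)


def looks_like_speech(levels: Sequence[float],
                      silence_threshold: float = 0.1,
                      min_transitions: int = 2,
                      min_silence: float = 0.1,
                      max_silence: float = 0.8) -> bool:
    """Heuristic: speech is bursty, steady noise holds a continuous level."""
    ratio = silence_ratio(levels, silence_threshold)
    transitions = count_transitions(levels, silence_threshold)
    return (transitions >= min_transitions
            and min_silence <= ratio <= max_silence)


steady_noise = [0.5] * 10
bursty = [0.6, 0.7, 0.0, 0.0, 0.5, 0.6, 0.05, 0.6, 0.7, 0.0]
```

In this toy data, the steady signal never crosses the threshold and is rejected, while the bursty signal alternates often enough to be treated as a speech candidate.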

FIG. 7 illustrates an automatic speech recognition method with a variable EOS timeout and context-dependent EOS calling, improved by voice activity detection and the continuity characteristics of speech data, according to an aspect of the present specification. The incoming speech data (600) first undergoes noise analysis (S101), which comprises voice activity detection (S103) and a speech data continuity check (S105). It is then determined whether the data is classified as speech data based on preset categories (S107); only if so does it enter the local automatic speech recognition (ASR) process for STT processing (S109). Next, the speech data is classified (S111), it is determined whether the data is a disfluency (S113) and whether the data is valid (S115), and the database is searched for a cached match (S117). If matching data is found, EOS is called (S121); that is, the sentence is judged to have ended, and the data processing appropriate to the response is performed. If disfluency data is present, the speech data is not valid, or no match can be confirmed, the EOS time window is extended (S125), and until the extended EOS time expires (S127) the speech data of the next frame is received and the process returns to the step of determining, through noise analysis, activity detection, and the continuity check, whether the data can be classified.
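The decision part of the FIG. 7 flow can be sketched as a single function that maps the analysis results to one of three outcomes. This is a simplified sketch under stated assumptions: the `Decision` names are hypothetical, "valid" is reduced to "contains at least one word", the cache is a plain set of command strings, and the S-numbers in the comments only indicate which step each line is meant to mirror.

```python
from enum import Enum
from typing import AbstractSet


class Decision(Enum):
    DISCARD = "discard"                # not speech: never reaches ASR
    CALL_EOS = "call_eos"              # sentence judged complete
    EXTEND_WINDOW = "extend_window"    # keep listening with a longer window


def decide(text: str, is_speech: bool,
           disfluencies: AbstractSet[str],
           cached_commands: AbstractSet[str]) -> Decision:
    """Map one recognized chunk to an EOS decision."""
    if not is_speech:                                   # S101 to S107
        return Decision.DISCARD
    normalized = text.lower().strip()
    words = normalized.split()
    has_disfluency = any(w in disfluencies for w in words)   # S113
    is_valid = bool(words)                                   # S115 (assumed)
    has_match = normalized in cached_commands                # S117
    if not has_disfluency and is_valid and has_match:
        return Decision.CALL_EOS                             # S121
    return Decision.EXTEND_WINDOW                            # S125
```

A caller would loop on `EXTEND_WINDOW`, feeding the next frame's text back in until the extended timeout expires.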

FIG. 8 illustrates a method of filtering non-speech data prior to ASR processing according to an aspect of the present specification. Erroneous speech data can affect the EOS call if it is sent to the ASR system. To avoid this scenario, data read from the audio buffer (S201) is checked by voice activity detection (S203), which detects the presence of voice on the basis of fixed time frames. Because speech data arrives as a continuous waveform, examining the voice activity within a frame (S205) provides the insight needed to set expectations for the content of the next frame. If voice presence is low at the end of a time frame, the next frame is likely to contain no data. Because speech data tends to have quiet time frames interspersed between time frames containing speech, checking the pattern of time frames with voice activity makes it possible to determine whether the data is steady noise or contains speech.
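A minimal sketch of fixed-frame voice activity detection follows. The specification does not say how voice presence is measured; this example substitutes a plain RMS energy threshold, which is a common baseline rather than the patent's method, and the function names and threshold are hypothetical.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float], frame_size: int) -> List[float]:
    """Split samples into fixed-size frames and return each frame's RMS."""
    levels = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        levels.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    return levels


def voice_activity(samples: Sequence[float], frame_size: int,
                   threshold: float) -> List[bool]:
    """Flag each frame as active when its RMS reaches the threshold."""
    return [rms >= threshold for rms in frame_rms(samples, frame_size)]


# Silence, a burst of signal, then silence again.
samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
activity = voice_activity(samples, frame_size=4, threshold=0.1)
```

The resulting per-frame flags are the kind of pattern the text suggests inspecting to separate steady noise from speech interleaved with quiet frames.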

FIG. 9 shows dynamic word weight assignment. When audio is received (S301) and can be classified by speech data classification (S303), it is recognized as speech, the EOS timer is started (S305), and the equivalent text data is obtained (S307). If the text data contains a disfluency (S309), the EOS timer is extended (S311) and the importance of the next word is increased (S313). Word importance weights can be adjusted based on the methods mentioned earlier, such as the locally stored database or the EOS time frame size. If no disfluency is detected, the text data is matched against the database using the word importance weights (S315), and it is determined whether the word is predicted to be unnecessary (S317). If it is not, it is determined whether the EOS timer expires before additional audio arrives (S319); if it expires, the output corresponding to the response result is returned and the command data is processed (S321). If the word is predicted to be unnecessary, or the EOS timer has not expired, the process returns to receiving speech data (S301).
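The weighting step of FIG. 9 can be sketched as a pass over the recognized words in which a disfluency raises the weight of the word that follows it. The base weights, the boost factor, and the choice to drop the disfluency itself from the output are hypothetical; the specification only states that the importance of the next word increases.

```python
from typing import AbstractSet, List, Mapping, Sequence, Tuple


def weight_words(words: Sequence[str],
                 base_weights: Mapping[str, float],
                 disfluencies: AbstractSet[str],
                 boost: float = 2.0,
                 default_weight: float = 1.0) -> List[Tuple[str, float]]:
    """Assign a weight to each word, boosting the word after a disfluency."""
    weighted: List[Tuple[str, float]] = []
    boost_next = False
    for word in words:
        key = word.lower()
        if key in disfluencies:
            boost_next = True          # S309 to S313: raise next word's weight
            continue                   # the disfluency itself carries no weight
        weight = base_weights.get(key, default_weight)
        if boost_next:
            weight *= boost
            boost_next = False
        weighted.append((key, weight))
    return weighted


weights = weight_words(["um", "lights", "on"], {"lights": 1.5}, {"um", "uh"})
```

A downstream matcher could then compare these weights against a threshold to decide whether a high-weight keyword justifies calling EOS early.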

Referring to FIG. 10(A), in the prior-art terminal device the end speech data is transmitted when speech ends, and the speech response is delayed while the ASR system detects EOS.

By contrast, referring to FIG. 10(B), the terminal device (100) according to an embodiment of the present invention can perform forced EOS processing through the local recognition system even when the same speech end point arrives. Because the EOS time window is varied, the final recognition result is processed early, and the speech response can be provided to the speaker faster than in the prior art.

Because a locally stored database is available for looking up commands, the processed data is analyzed as speech data, and EOS is called early at 811 (forced EOS) compared with 808 (EOS detection) in the existing system. This reduces the overall waiting time, that is, shortens the response time (809). In this way, the present invention adjusts the EOS timeout timer based on the type of speech. When the end speech data 807 of the prior art and the end speech data 812 of the present invention end at the same time, the speech response 810 of the present invention (B) arrives sooner than the speech response 806 of the prior art (A).

Meanwhile, the methods according to the various embodiments of the present invention described above may be implemented as a program and provided to servers or devices while stored on various non-transitory computer readable media. The user terminal 100 can then connect to the server or device and download the program.

A non-transitory readable medium is a medium that stores data semi-permanently and can be read by a device, as opposed to a medium that stores data only briefly, such as a register, cache, or memory. Specifically, the applications or programs described above may be provided stored on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB drive, memory card, or ROM.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Those of ordinary skill in the art to which the invention belongs may make various modifications without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or scope of the present invention.

Claims

A program for executing, on a computer, a method of reducing waiting time in automatic speech recognition, the method comprising:
Receiving speech as an audio input;
Performing noise analysis on the incoming sound data;
Determining whether the received sound data has the characteristics of actual speech data;
Performing automatic speech recognition (ASR) to convert the sound data into the nearest equivalent text data;
Determining an output value, a speech category, and a speech disfluency by the automatic speech recognition;
Variably determining an end time of the end-of-speech (EOS) timeout window based on the value of the ASR output, the speech category, and the speech disfluency;
Determining a speech timeout moment based on the value of the subsequent ASR output, the speech category, and the speech disfluency; and
Calling EOS early, before the final ASR result, based on the EOS timeout and intermediate ASR results.

The program of claim 1, wherein the noise analysis step comprises:
Checking the continuity of the speech data by observing how long the input sound stays at a continuous level, in order to determine a noise level.

The program of claim 1, wherein the method comprises:
Observing the ratio of the silent and non-noisy portions of a sound sample to determine whether the sample has the characteristics of speech data.

The program of claim 1, wherein the method comprises:
Maintaining a list of disfluency words to predict the highly weighted keywords that will follow.

The program of claim 1, wherein the method comprises:
Using a list of known speech disfluencies to adjust the EOS time frame window.

The program of claim 1, wherein the method comprises:
Determining the EOS based on the weighted keyword.

The program of claim 1, wherein the method comprises:
Observing the pattern of silence and data bursts to determine the likelihood that the input data can be considered speech data.

The program of claim 1, wherein the method comprises:
Calculating the silence-to-sound level ratio and determining the likelihood that the input data can be considered speech data.

The program of claim 1, wherein the method comprises:
Determining the requested query based on the form of the word (here, whether the word is the object of the sentence).

The program of claim 1, wherein the method comprises:
Providing, based on the caller's record, a longer introduction to help first-time callers prepare to talk to the artificial intelligence.