KR102330496B1

KR102330496B1 - An apparatus and method for speech recognition

Info

Publication number: KR102330496B1
Application number: KR1020190101940A
Authority: KR
Inventors: 조용석
Original assignee: 주식회사 포켓메모리
Priority date: 2019-08-20
Filing date: 2019-08-20
Publication date: 2021-11-24
Also published as: KR20210022434A

Abstract

The present invention relates to a speech recognition method and apparatus. A speech recognition method according to an embodiment of the present invention may comprise the steps of: (a) acquiring a user's voice signal; (b) determining, using the acquired voice signal, whether the voice signal is recognized; (c) when the voice signal is recognized, outputting first text information corresponding to the voice signal; and (d) when the voice signal is not recognized, acquiring an additional signal from the user and outputting second text information corresponding to the additional signal.

Description

Speech recognition method and apparatus {AN APPARATUS AND METHOD FOR SPEECH RECOGNITION}

The present invention relates to a speech recognition method and apparatus, and more particularly, to a speech recognition method and apparatus that, when speech recognition of a user's voice signal fails, acquire an additional signal from the user and provide text information corresponding to the acquired additional signal.

Speech recognition technology trains a probabilistic model for each phoneme in advance from previously collected speech data, then determines which phoneme the input speech data is closest to and estimates a phoneme sequence from the result. The per-phoneme probabilistic model used here is called an acoustic model, and it is one of the key factors that determine the performance of speech recognition technology.

As interest in speech recognition technology has grown in recent years, various algorithms have been proposed to facilitate speech recognition. However, existing speech recognition research addresses only how recognition should be performed, and offers no way to respond when recognition is not performed properly.

[Patent Document 1] Korean Laid-Open Patent Publication No. 10-2019-0035454

The present invention was devised to solve the above problem, and an object thereof is to provide a speech recognition method and apparatus.

Another object of the present invention is to provide a speech recognition method and apparatus that output first text information corresponding to a voice signal when the voice signal is recognized and that, even when the initial voice signal is not recognized, additionally acquire an additional signal and output, from among a plurality of candidate text information, second text information for the user's voice signal.

A further object of the present invention is to provide a speech recognition method and apparatus that transmit at least a part of the user's voice signal to a cloud server in real time, so that the user's natural-language speech (i.e., the voice signal) is analyzed in real time through the cloud server.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the description below.

To achieve the above objects, a speech recognition method according to an embodiment of the present invention may comprise the steps of: (a) acquiring a user's voice signal; (b) determining, using the acquired voice signal, whether the voice signal is recognized; (c) when the voice signal is recognized, outputting first text information corresponding to the voice signal; and (d) when the voice signal is not recognized, acquiring an additional signal from the user and outputting second text information corresponding to the additional signal.
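The control flow of steps (a) to (d) can be sketched as follows. This is only an illustration under stated assumptions: the `Recognition` type and the callables `recognize`, `acquire_additional_signal`, and `resolve` are hypothetical names introduced here, and the specification does not prescribe how any of them is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Recognition:
    """Hypothetical outcome of step (b): recognized text, or None on failure."""
    text: Optional[str]
    candidates: Sequence[str] = ()


def recognize_with_fallback(
    voice_signal: Sequence[float],
    recognize: Callable[[Sequence[float]], Recognition],
    acquire_additional_signal: Callable[[], str],
    resolve: Callable[[Sequence[str], str], str],
) -> str:
    """Run steps (b) to (d) on a voice signal already acquired in step (a)."""
    result = recognize(voice_signal)            # step (b): was the signal recognized?
    if result.text is not None:
        return result.text                      # step (c): first text information
    additional = acquire_additional_signal()    # step (d): get the additional signal
    # step (d): second text information, chosen with the help of the additional signal
    return resolve(result.candidates, additional)
```

Passing the recognizer and the fallback as callables keeps the sketch independent of any particular recognition engine or input device.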

In an embodiment, step (a) may include acquiring the user's voice signal when the level of the voice signal in decibels (dB) is equal to or greater than a threshold.

In an embodiment, step (b) may include: converting the voice signal into a sound spectrogram over the time domain and the frequency domain; extracting the first text information from the sound spectrogram; and determining, using the first text information, whether the voice signal is recognized.

In an embodiment, determining whether the voice signal is recognized using the first text information may include: extracting syntax information of the first text information from the first text information; and determining whether the extracted syntax information matches predefined syntax information.

In an embodiment, step (a) may include acquiring at least a part of the user's voice signal, and step (b) may include: transmitting the at least a part of the voice signal to a cloud server; receiving, from the cloud server, a speech recognition result for the at least a part of the voice signal; and determining, using the speech recognition result, whether the voice signal is recognized.

In an embodiment, step (d) may include: when the voice signal is not recognized, outputting a re-request signal asking for the user's voice signal again; and acquiring the user's additional signal in response to the output of the re-request signal.

In an embodiment, the user's additional signal may include at least one of a line-of-sight signal, a voice signal, and a motion signal of the user.

In an embodiment, the speech recognition method may further include, after step (c), training a speech recognition artificial intelligence learning model using the voice signal and the first text information.

In an embodiment, a speech recognition apparatus may include: an input unit that acquires a user's voice signal; a control unit that determines, using the acquired voice signal, whether the voice signal is recognized; and an output unit that outputs first text information corresponding to the voice signal when the voice signal is recognized and, when the voice signal is not recognized, acquires an additional signal from the user and outputs second text information corresponding to the additional signal.

In an embodiment, the input unit may acquire the user's voice signal when the level of the voice signal in decibels (dB) is equal to or greater than a threshold.

In an embodiment, the control unit may convert the voice signal into a sound spectrogram over the time domain and the frequency domain, extract the first text information from the sound spectrogram, and determine, using the first text information, whether the voice signal is recognized.

In an embodiment, the control unit may extract syntax information of the first text information from the first text information, and determine whether the extracted syntax information matches predefined syntax information.

In an embodiment, the input unit may acquire at least a part of the user's voice signal; the apparatus may further include a communication unit that transmits the at least a part of the voice signal to a cloud server and receives, from the cloud server, a speech recognition result for the at least a part of the voice signal; and the control unit may determine, using the speech recognition result, whether the voice signal is recognized.

In an embodiment, the output unit may output a re-request signal asking for the user's voice signal again when the voice signal is not recognized, and the input unit may acquire the user's additional signal in response to the output of the re-request signal.

In an embodiment, the user's additional signal may include at least one of a line-of-sight signal, a voice signal, and a motion signal of the user.

In an embodiment, the control unit may train a speech recognition artificial intelligence learning model using the voice signal and the first text information.

Specific details for achieving the above objects will become clear with reference to the embodiments described in detail below in conjunction with the accompanying drawings.

However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the present invention pertains (hereinafter, "a person skilled in the art") is fully informed of the scope of the invention.

According to an embodiment of the present invention, when a voice signal is recognized, first text information corresponding to the voice signal is output, so the user can directly check whether the speech recognition apparatus has correctly recognized the user's voice signal.

In addition, according to an embodiment of the present invention, even when the initial voice signal is not recognized, an additional signal is additionally acquired and second text information for the user's voice signal is output from among a plurality of candidate text information, which can raise the speech recognition success rate.

In addition, according to an embodiment of the present invention, at least a part of the user's voice signal is transmitted to a cloud server in real time, so that the user's natural-language speech (i.e., the voice signal) is analyzed in real time through the cloud server and converted into text form, which enables faster natural language processing.

The effects of the present invention are not limited to those described above, and potential effects expected from the technical features of the present invention will be clearly understood from the description below.

FIG. 1 is a diagram illustrating a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a speech recognition method according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a functional configuration of a speech recognition apparatus according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

The various features of the invention disclosed in the claims will be better understood in view of the drawings and the detailed description. The apparatuses, methods, manufacturing processes, and various embodiments disclosed in the specification are provided for illustration. The disclosed structural and functional features are intended to enable a person skilled in the art to practice the various embodiments, not to limit the scope of the invention. The terms and sentences used are intended to explain the various features of the disclosed invention in an easily understood way, not to limit the scope of the invention.

In describing the present invention, detailed descriptions of related known technology are omitted where they are judged likely to obscure the gist of the present invention unnecessarily.

Hereinafter, a speech recognition method and apparatus according to an embodiment of the present invention will be described.

FIG. 1 is a diagram illustrating a speech recognition system according to an embodiment of the present invention.

Referring to FIG. 1, the speech recognition apparatus 100 may acquire a voice signal from a user and determine, using the acquired voice signal, whether the voice signal is recognized.

In an embodiment, the speech recognition apparatus 100 may convert the user's voice signal into text information and then perform speech recognition using the converted text information to determine whether the voice signal is recognized.

In another embodiment, the speech recognition apparatus 100 may transmit at least a part of the user's voice signal to the cloud server 110, receive from the cloud server 110 a speech recognition result for the at least a part of the user's voice signal, and determine from that result whether the voice signal is recognized.

When the voice signal is recognized, the speech recognition apparatus 100 may output first text information corresponding to the voice signal. Here, the first text information may include text information converted from the user's voice signal.

That is, because the speech recognition apparatus 100 according to the present invention outputs the first text information corresponding to the voice signal when the voice signal is recognized, the user can directly check whether the speech recognition apparatus 100 has correctly recognized the user's voice signal.

On the other hand, when the voice signal is not recognized, the speech recognition apparatus 100 may acquire an additional signal from the user and output second text information corresponding to the additional signal. Here, the second text information may include one of a plurality of candidate text information for the user's voice signal.

That is, even when the initial voice signal is not recognized, the speech recognition apparatus 100 according to the present invention can raise the speech recognition success rate by additionally acquiring an additional signal and outputting, from among the plurality of candidate text information, the second text information for the user's voice signal.

In an embodiment, the speech recognition apparatus 100 may output a response message corresponding to the first text information or the second text information.

In an embodiment, the speech recognition apparatus 100 may be implemented in various forms such as a kiosk, a smartphone, a tablet PC, or a laptop computer, but is not limited thereto.

FIG. 2 is a diagram illustrating an operating method of the speech recognition apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 2, step S201 is a step of acquiring a user's voice signal. In an embodiment, the user's voice signal may be acquired when the level of the voice signal in decibels (dB) is equal to or greater than a threshold.
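A minimal sketch of the threshold check in step S201 is given below. The specification does not say how the decibel value is computed, so this example assumes an RMS level relative to a reference amplitude of 1.0 (samples normalized to the range -1.0 to 1.0); the function names and the default threshold of -30 dB are hypothetical.

```python
import math
from typing import Sequence


def level_db(frame: Sequence[float], reference: float = 1.0) -> float:
    """RMS level of one audio frame in decibels relative to `reference` (assumed)."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 20.0 * math.log10(rms / reference)


def should_acquire(frame: Sequence[float], threshold_db: float = -30.0) -> bool:
    """True when the frame level is equal to or greater than the threshold."""
    return level_db(frame) >= threshold_db
```

A frame of constant amplitude 0.1 sits at about -20 dB under these assumptions and would pass a -30 dB threshold, while a frame of amplitude 0.001 (about -60 dB) would not.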

Step S203 is a step of determining, using the acquired voice signal, whether the voice signal is recognized.

In an embodiment, the voice signal may be converted into a sound spectrogram over the time domain and the frequency domain, first text information may be extracted from the sound spectrogram, and whether the voice signal is recognized may be determined using the first text information.

In an embodiment, the sound spectrogram of the converted voice signal and the first text information corresponding to that sound spectrogram may be extracted.

In an embodiment, the voice signal may be converted into a sound spectrogram, and the sound spectrogram may be applied to a speech recognition artificial intelligence learning model to extract the first text information and perform speech recognition. Here, the sound spectrogram may refer to a graph in which the amplitude of the voice signal over the time domain and the frequency domain is represented by color.
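The conversion of a voice signal into a time-frequency representation can be illustrated with a short-time Fourier transform. The sketch below is an assumption of this example, not the method of the specification, which does not name a transform; the frame size, hop length, and Hann window are likewise illustrative choices, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math
from typing import List, Sequence


def spectrogram(signal: Sequence[float], frame_size: int = 64, hop: int = 32) -> List[List[float]]:
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size) for n in range(frame_size)]
    rows: List[List[float]] = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        row = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            row.append(abs(acc))  # amplitude, which a plot would map to color
        rows.append(row)
    return rows
```

Each value in the returned grid is the amplitude at one time frame and one frequency bin, which corresponds to the color intensity in the graph described above. A real system would normally use an FFT routine rather than this quadratic-time loop.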

In an embodiment, at least a part of the user's voice signal may be acquired in step S201, in which case the at least a part of the voice signal may be transmitted to the cloud server 110.

That is, rather than acquiring the entire voice signal and then performing speech recognition on the whole, portions of the voice signal may be transmitted to the cloud server 110 as they are acquired in real time.

Here, the entire voice signal may refer to the whole of the user's voice signal acquired while the level of the voice signal in decibels (dB) is equal to or greater than the threshold. That is, the speech recognition apparatus 100 starts acquiring the voice signal when the decibel level of the user's voice signal is equal to or greater than the threshold, and stops acquiring it when the user finishes speaking and the decibel level falls to or below the threshold; the entire voice signal is the signal acquired between those two points.
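The start and end of acquisition described above can be sketched as a scan over per-frame decibel levels. The function name and the frame-based representation are hypothetical. One boundary choice is made by this sketch rather than by the specification: a frame exactly at the threshold counts as speech, so acquisition ends only when the level drops strictly below the threshold.

```python
from typing import Optional, Sequence, Tuple


def find_utterance(levels_db: Sequence[float], threshold_db: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first utterance, with end exclusive."""
    start: Optional[int] = None
    for i, level in enumerate(levels_db):
        if start is None:
            if level >= threshold_db:
                start = i              # level reached the threshold: begin acquiring
        elif level < threshold_db:
            return (start, i)          # level dropped: the utterance has ended
    if start is None:
        return None                    # the threshold was never reached
    return (start, len(levels_db))     # input ended while the user was still speaking
```

The frames between `start` and `end` together make up what the text calls the entire voice signal.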

In addition, in an embodiment, a speech recognition result for the at least a part of the voice signal may be received from the cloud server 110, and whether the voice signal is recognized may be determined using the speech recognition result.

That is, the cloud server 110 may convert each part of the voice signal received in real time into text information, perform speech recognition on the converted text information, and transmit the speech recognition result to the speech recognition apparatus 100. Here, the speech recognition result for each part of the voice signal may be referred to as an 'intermediate recognition result' or by a term with an equivalent technical meaning.

In addition, once the cloud server 110 has received the entire voice signal, it may convert the entire voice signal into text information, perform speech recognition on the converted text information, and transmit the speech recognition result to the speech recognition apparatus 100. Here, the speech recognition result for the entire voice signal may be referred to as a 'final recognition result' or by a term with an equivalent technical meaning.
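The relationship between intermediate and final recognition results can be sketched with a generator. Everything here is illustrative: `recognize` stands in for whatever recognizer the cloud server runs, the transport between the apparatus and the server is omitted, and the labels are simply the terms used in the text.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Chunk = List[float]


def stream_recognition(
    chunks: Iterable[Chunk],
    recognize: Callable[[Chunk], str],
) -> Iterator[Tuple[str, str]]:
    """Yield one ("intermediate", text) per received part, then one ("final", text)."""
    received: Chunk = []
    for chunk in chunks:
        received.extend(chunk)                      # keep the parts to rebuild the whole signal
        yield ("intermediate", recognize(chunk))    # result for this part only
    yield ("final", recognize(received))            # result for the entire voice signal
```

Because the generator yields as soon as each part is processed, a caller can display intermediate results while the user is still speaking and replace them with the final result once the utterance ends.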

In this way, the speech recognition method according to the present invention analyzes the user's natural-language speech (i.e., the voice signal) in real time and converts it into text form. Because natural language processing is performed in real time, it can be faster than the conventional approach of analyzing and converting the speech only after the user has finished speaking.

In an embodiment, whether the voice signal is recognized may be determined by extracting syntax information of the first text information from the first text information and determining whether the extracted syntax information matches predefined syntax information.
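The specification does not define what form the syntax information takes. As one possible reading, the sketch below treats the predefined syntax information as a list of regular-expression patterns and regards the voice signal as recognized when the normalized text matches any of them. The patterns and names are hypothetical examples for a kiosk-style ordering scenario.

```python
import re
from typing import Pattern, Sequence

# Hypothetical predefined syntax: "<quantity> <item> please" or "cancel <item>".
PREDEFINED_SYNTAX: Sequence[Pattern[str]] = (
    re.compile(r"(?P<quantity>\d+) (?P<item>\w+) please"),
    re.compile(r"cancel (?P<item>\w+)"),
)


def is_recognized(text: str, patterns: Sequence[Pattern[str]] = PREDEFINED_SYNTAX) -> bool:
    """True when the text, after normalization, fits one of the predefined patterns."""
    normalized = " ".join(text.lower().split())  # lowercase and collapse whitespace
    return any(p.fullmatch(normalized) is not None for p in patterns)
```

Text that fits none of the patterns is treated as not recognized, which is the condition that triggers the additional-signal path.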

Step S205 is a step of outputting first text information corresponding to the voice signal when the voice signal is recognized.

In an embodiment, when the voice signal is recognized, a speech recognition artificial intelligence learning model may be trained using the voice signal and the first text information. That is, by training the speech recognition artificial intelligence learning model with the voice signal and its corresponding first text information as training data, speech recognition accuracy can be improved continuously.

Step S207 is a step of acquiring an additional signal from the user and outputting second text information corresponding to the additional signal when the voice signal is not recognized.

In an embodiment, when the voice signal is not recognized, a re-request signal asking for the user's voice signal again may be output, and the user's additional signal may be acquired in response to the output of the re-request signal.

For example, the re-request signal may be at least one of text information and sound information such as “Could you please say that again?” That is, when the voice signal is not recognized, the re-request signal may be displayed on a screen as text information or output through a speaker as sound information.

In an embodiment, the user's additional signal may include at least one of a line-of-sight signal, a voice signal, and a motion signal of the user.

For example, an emotional state related to the user's voice signal may be identified using at least one of gaze pattern information included in the user's line-of-sight signal and motion pattern information included in the motion information, and second text information corresponding to the additional signal may thereby be output for the voice signal.
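How the second text information is chosen from the candidate text information is left open by the specification. The sketch below swaps in a simple stand-in technique rather than the gaze or emotion analysis mentioned above: it assumes the additional signal is a repeated utterance that has already been transcribed, and picks the candidate most similar to it using the standard-library `difflib` module. The function name is hypothetical.

```python
import difflib
from typing import Sequence


def pick_second_text(candidates: Sequence[str], additional_text: str) -> str:
    """Choose the candidate text closest to the transcribed additional signal."""
    if not candidates:
        raise ValueError("no candidate text information to choose from")
    # cutoff=0.0 keeps every candidate, so the best-scoring one is always returned.
    best = difflib.get_close_matches(additional_text, candidates, n=1, cutoff=0.0)
    return best[0]
```

With gaze or motion input, the same selection step could instead map a screen region or gesture to one of the candidates; the similarity measure used here is only one of many possible scoring rules.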

FIG. 3 is a diagram illustrating a functional configuration of the speech recognition apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 3, the speech recognition apparatus 100 may include an input unit 310, a control unit 320, a storage unit 330, an output unit 340, and a communication unit 350.

The input unit 310 may acquire the user's voice signal. In an embodiment, the input unit 310 may be implemented as a microphone, and may be either integrated (all-in-one) with the speech recognition apparatus 100 or separate from it.

The control unit 320 may determine, using the acquired voice signal, whether the voice signal is recognized.

In an embodiment, the control unit 320 may include at least one processor or microprocessor, or may be a part of a processor. The control unit 320 may also be referred to as a communication processor (CP). The control unit 320 may control the operation of the speech recognition apparatus 100 according to various embodiments of the present invention.

The storage unit 330 may store training data for training the speech recognition artificial intelligence learning model.

In an embodiment, the storage unit 330 may store the predefined syntax information used to determine whether the voice signal is recognized.

In an embodiment, the storage unit 330 may store at least one of a predefined line-of-sight signal, voice signal, and motion signal of the user for comparison with the user's additional signal.

In an embodiment, the storage unit 330 may consist of volatile memory, non-volatile memory, or a combination of the two. The storage unit 330 may provide stored data at the request of the control unit 320.

출력부(340)는 음성 신호가 인식되는 경우, 음성 신호에 대응하는 제1 텍스트 정보를 출력할 수 있다. 또한, 출력부(340)는 음성 신호가 인식되지 않는 경우, 사용자의 부가 신호를 획득하고, 부가 신호에 대응하는 제2 텍스트 정보를 출력할 수 있다. When the voice signal is recognized, the output unit 340 may output first text information corresponding to the voice signal. Also, when the voice signal is not recognized, the output unit 340 may obtain an additional signal of the user and output second text information corresponding to the additional signal.

일 실시예에서, 출력부(340)는 음성인식 장치(100)에서 처리되는 정보를 화면으로 디스플레이할 수 있다. 예를 들면, 출력부(340)는 액정 디스플레이(LCD; Liquid Crystal Display), 발광 다이오드(LED; Light Emitting Diode) 디스플레이, 유기 발광 다이오드(OLED; Organic LED) 디스플레이, 마이크로 전자기계 시스템(MEMS; Micro Electro Mechanical Systems) 디스플레이 및 전자 종이(electronic paper) 디스플레이 중 적어도 어느 하나를 포함할 수 있으나, 이에 제한되지 않는다.In an embodiment, the output unit 340 may display information processed by the voice recognition apparatus 100 on a screen. For example, the output unit 340 may include a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, and a microelectromechanical system (MEMS). It may include at least one of an Electro Mechanical Systems) display and an electronic paper display, but is not limited thereto.

일 실시예에서, 출력부(340)는 음성인식 장치(100)에서 처리되는 정보를 사운드로 출력할 수 있다. 예를 들어, 출력부(340)는 스피커 또는 사운드를 출력할 수 있는 다양한 출력 단자로 구현될 수 있다. In an embodiment, the output unit 340 may output information processed by the voice recognition apparatus 100 as sound. For example, the output unit 340 may be implemented as a speaker or various output terminals capable of outputting sound.

In an embodiment, the communication unit 350 may transmit at least a portion of the voice signal to the cloud server 110 and receive, from the cloud server 110, a voice recognition result for the at least a portion of the voice signal.
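The exchange with the cloud server 110 can be sketched as follows. This is an illustration under stated assumptions only: the `CloudTransport` callable, the function name, and the byte-offset selection of the portion are hypothetical, and no real network protocol or server API is implied.

```python
from typing import Callable, Tuple

# Hypothetical transport: takes the bytes to send and returns the server's
# recognition result as text. A real wired or wireless module would sit here.
CloudTransport = Callable[[bytes], str]


def recognize_portion_via_cloud(
    voice_signal: bytes,
    start: int,
    end: int,
    transport: CloudTransport,
) -> Tuple[bytes, str]:
    """Send voice_signal[start:end] through the transport and return
    the portion that was sent together with the received result."""
    portion = voice_signal[start:end]
    if not portion:
        raise ValueError("the selected portion of the voice signal is empty")
    result = transport(portion)
    return portion, result


def fake_transport(payload: bytes) -> str:
    """Stand-in for the cloud server, used only to exercise the sketch."""
    return f"received {len(payload)} bytes"
```

Passing the transport as a parameter keeps the sketch independent of whether the module is wired or wireless.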

In an embodiment, the communication unit 350 may include at least one of a wired communication module and a wireless communication module. All or part of the communication unit 350 may be referred to as a 'transmitter', a 'receiver', or a 'transceiver'.

Referring to FIG. 3, the voice recognition apparatus 100 may include an input unit 310, a control unit 320, a storage unit 330, an output unit 340, and a communication unit 350. In various embodiments of the present invention, the components shown in FIG. 3 are not essential, so the voice recognition apparatus 100 may be implemented with more or fewer components than those shown in FIG. 3.

The above description merely illustrates the technical idea of the present invention, and those skilled in the art may make various changes and modifications without departing from the essential characteristics of the present invention.

Accordingly, the embodiments disclosed in this specification are intended to explain, not to limit, the technical idea of the present invention, and the scope of the present invention is not limited by these embodiments.

The scope of protection of the present invention should be interpreted according to the claims, and all technical ideas within a scope equivalent thereto should be understood to fall within the scope of rights of the present invention.

100: voice recognition apparatus
110: cloud server
310: input unit
320: control unit
330: storage unit
340: output unit
350: communication unit

Claims

1. A speech recognition method comprising:
(a) obtaining a voice signal of a user;
(b) determining whether the voice signal is recognized by using the obtained voice signal;
(c) outputting first text information corresponding to the voice signal when the voice signal is recognized; and
(d) when the voice signal is not recognized, identifying an emotional state related to the voice signal of the user by using at least one of gaze pattern information included in a gaze signal of the user and motion pattern information included in motion information, and outputting second text information corresponding to the emotional state.

2. The speech recognition method of claim 1, wherein step (a) comprises:
obtaining the voice signal of the user when a decibel (dB) level of the voice signal is greater than or equal to a threshold value.

3. The speech recognition method of claim 1, wherein step (b) comprises:
converting the voice signal into a sound spectrogram over a time domain and a frequency domain;
extracting the first text information from the sound spectrogram; and
determining whether the voice signal is recognized by using the first text information.

4. The speech recognition method of claim 3, wherein the determining of whether the voice signal is recognized by using the first text information comprises:
extracting syntax information from the first text information; and
determining whether the extracted syntax information matches predefined syntax information.

5. The speech recognition method of claim 1, wherein step (a) comprises:
obtaining at least a portion of the voice signal of the user,
and wherein step (b) comprises:
transmitting the at least a portion of the voice signal to a cloud server;
receiving, from the cloud server, a voice recognition result for the at least a portion of the voice signal; and
determining whether the voice signal is recognized by using the voice recognition result.

6. The speech recognition method of claim 1, wherein step (d) comprises:
outputting a re-request signal for the voice signal of the user when the voice signal is not recognized; and
obtaining an additional signal of the user in response to the output of the re-request signal.

7. (Deleted)

8. The speech recognition method of claim 1, further comprising, after step (c):
training a voice recognition artificial intelligence learning model by using the voice signal and the first text information.

9. A speech recognition apparatus comprising:
an input unit configured to obtain a voice signal of a user;
a control unit configured to determine whether the voice signal is recognized by using the obtained voice signal; and
an output unit configured to output first text information corresponding to the voice signal when the voice signal is recognized, and, when the voice signal is not recognized, to output second text information corresponding to an emotional state related to the voice signal of the user, the emotional state being determined by using at least one of gaze pattern information included in a gaze signal of the user and motion pattern information included in motion information.

10. The speech recognition apparatus of claim 9, wherein the input unit obtains the voice signal of the user when a decibel (dB) level of the voice signal is greater than or equal to a threshold value.

11. The speech recognition apparatus of claim 9, wherein the control unit:
converts the voice signal into a sound spectrogram over a time domain and a frequency domain;
extracts the first text information from the sound spectrogram; and
determines whether the voice signal is recognized by using the first text information.

12. The speech recognition apparatus of claim 11, wherein the control unit:
extracts syntax information from the first text information; and
determines whether the extracted syntax information matches predefined syntax information.

13. The speech recognition apparatus of claim 9, wherein the input unit obtains at least a portion of the voice signal of the user,
the apparatus further comprising a communication unit configured to transmit the at least a portion of the voice signal to a cloud server and to receive, from the cloud server, a voice recognition result for the at least a portion of the voice signal,
wherein the control unit determines whether the voice signal is recognized by using the voice recognition result.

14. The speech recognition apparatus of claim 9, wherein the output unit outputs a re-request signal for the voice signal of the user when the voice signal is not recognized, and
the input unit obtains an additional signal of the user in response to the output of the re-request signal.

15. (Deleted)

16. The speech recognition apparatus of claim 9, wherein the control unit trains a voice recognition artificial intelligence learning model by using the voice signal and the first text information.
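
The decibel threshold condition recited for obtaining the voice signal can be illustrated with a short Python sketch. The claims do not state how the decibel value is measured, so the RMS-based level relative to a full-scale reference of 1.0 used here is an assumption, and `level_db` and `should_acquire` are hypothetical names.

```python
import math
from typing import Sequence


def level_db(samples: Sequence[float], reference: float = 1.0) -> float:
    """Return the RMS level of the samples in dB relative to `reference`.

    Assumption: samples are normalized floats; silence maps to -infinity.
    """
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return float("-inf")
    return 20.0 * math.log10(rms / reference)


def should_acquire(samples: Sequence[float], threshold_db: float) -> bool:
    """Acquire the voice signal only when its level meets the threshold."""
    return level_db(samples) >= threshold_db
```

With this definition, a frame of silence is never acquired, regardless of how low the threshold is set.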