KR20180047597A

KR20180047597A - Method for providing of voice recognition service using information of voice signal with wakeup and apparatus thereof

Info

Publication number: KR20180047597A
Application number: KR1020160143972A
Authority: KR
Inventors: 진유광; 김영준
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2016-10-31
Filing date: 2016-10-31
Publication date: 2018-05-10

Abstract

The present invention relates to a method for providing a voice recognition service. More particularly, it relates to a service server capable of providing a voice recognition service in cooperation with a sound output apparatus that includes a microphone and a speaker. The sound output apparatus generates voice signal information for preprocessing based on the initial interval of an input voice signal, and the service server uses that voice signal information to perform voice recognition more accurately.

Description

The present invention relates to a method and apparatus for providing a voice recognition service using voice signal information.

The present invention relates to a method for providing a voice recognition service. More particularly, in a system in which a service server provides a voice recognition service in cooperation with a sound output apparatus that includes a microphone and a speaker, the sound output apparatus generates voice signal information for preprocessing based on the initial interval of an input voice signal, and the service server uses the voice signal information to perform voice recognition more accurately. The invention relates to a method for providing a voice recognition service using such voice signal information, and to an apparatus therefor.

The contents described in this section merely provide background information on the present embodiment and do not constitute prior art.

As technology has advanced, various services applying voice recognition technology have recently been introduced in many fields. Voice recognition technology can be described as a series of processes that understand speech uttered by a person and convert it into text information that a computer can handle. A voice recognition service using this technology can refer to a series of processes that recognize a user's voice and provide a suitable service corresponding to it.

However, services using voice recognition technology to date have the problem that they merely recognize the user's voice and provide only a fragmentary service corresponding to it. For example, there is a voice recognition service in which a user utters a specific word using his or her mobile phone, the mobile phone recognizes it, and a search result corresponding to the word is provided through a web browser. Such a service merely provides the corresponding search result in cooperation with a search server, and therefore cannot satisfy the convenience of diverse users.

Accordingly, there is a need to develop a voice recognition service that improves voice recognition performance so as to grasp the user's intention more accurately, and that can provide various content in cooperation with external content providing servers.

Korean Patent Laid-Open Publication No. 10-2008-0083553, published September 18, 2008 (Title: Method and apparatus for an image-synthesis call service using voice recognition)

The present invention has been proposed to solve the above-described conventional problems. An object of the invention is to provide a method for providing a voice recognition service using voice signal information, and an apparatus therefor, in which a sound output apparatus including at least one multi-channel microphone generates a voice signal when a user's uttered voice is input, and extracts voice signal information including direction, channel, and noise information by using information on the initial interval of the voice signal.

Another object of the present invention is to provide a method for providing a voice recognition service using voice signal information, and an apparatus therefor, in which voice recognition is performed more accurately by using the voice signal information received from the sound output apparatus to remove noise from, and compensate for variation in, the entire voice signal streamed from the sound output apparatus before voice recognition is performed.

However, the objects of the present invention are not limited to those described above, and other objects not mentioned will be clearly understood from the following description.

To achieve the above objects, a method for providing a voice recognition service using voice signal information according to an embodiment of the present invention may include: generating, by a sound output apparatus, a voice signal by receiving a user's uttered voice through at least one multi-channel microphone; extracting voice signal information including direction, channel, and noise information from a predetermined interval of the generated voice signal that is used for performing a wakeup process; and, when voice recognition on the predetermined interval of the voice signal shows that a wakeup word is included, sequentially streaming the generated voice signal to a service server and transmitting the extracted voice signal information to the service server.

Here, generating the voice signal may include: generating a voice signal using the voice input through each of the at least one multi-channel microphone; and dividing the generated voice signal into frames of a predetermined length and storing the frames of a predetermined interval for performing the wakeup process.
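The framing and storing step can be pictured with a minimal sketch. The patent does not give a frame length, a buffer size, or a storage policy, so `split_into_frames`, `WakeupBuffer`, the non-overlapping frames, and the rolling-window behaviour below are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence

Frame = List[float]


def split_into_frames(samples: Sequence[float], frame_len: int) -> List[Frame]:
    """Split samples into consecutive, non-overlapping frames of frame_len samples.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


class WakeupBuffer:
    """Holds at most max_frames frames: the interval examined by the wakeup process."""

    def __init__(self, max_frames: int) -> None:
        # deque(maxlen=...) discards the oldest frame once the buffer is full.
        self._frames: Deque[Frame] = deque(maxlen=max_frames)

    def push(self, frame: Frame) -> None:
        self._frames.append(frame)

    def frames(self) -> List[Frame]:
        return list(self._frames)


# Usage: frame a short signal and keep only the two most recent frames.
buffer = WakeupBuffer(max_frames=2)
for frame in split_into_frames([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6], frame_len=2):
    buffer.push(frame)
```

A rolling window suits a device that listens continuously; a device that records only from the start of an utterance could instead stop pushing once the buffer is full.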

In addition, extracting the voice signal information may include: extracting direction information using at least one of the arrival time, the arrival distance, and the intensity of the voice input through each of the at least one multi-channel microphone; identifying and extracting channel information of the multi-channel microphone; and extracting a voice feature vector for each stored frame and extracting a signal-to-noise ratio (SNR) using the extracted voice feature vectors.
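As a rough illustration of the direction and SNR extraction, the sketch below estimates the arrival-time difference between two microphone channels by brute-force cross-correlation, converts it to an angle under a far-field two-microphone model, and estimates SNR from per-frame energy. None of this is specified by the patent: the function names, the use of frame energy as the "feature", the two-microphone geometry, and the rule that the quietest 20% of frames represent noise are assumptions.

```python
import math
from typing import List, Sequence


def estimate_delay(ref: Sequence[float], other: Sequence[float], max_lag: int) -> int:
    """Return the lag in samples that best aligns other with ref.

    A positive result means other is a delayed copy of ref.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle(lag_samples: int, sample_rate: float, mic_spacing_m: float,
                   speed_of_sound: float = 343.0) -> float:
    """Convert an inter-microphone delay to an arrival angle in degrees (far-field model)."""
    tau = lag_samples / sample_rate
    ratio = max(-1.0, min(1.0, speed_of_sound * tau / mic_spacing_m))
    return math.degrees(math.asin(ratio))


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def estimate_snr_db(frames: List[Sequence[float]], noise_fraction: float = 0.2) -> float:
    """Estimate SNR in dB, treating the quietest frames as noise (assumption)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    energies = sorted(frame_energy(f) for f in frames)
    k = min(len(energies) - 1, max(1, int(len(energies) * noise_fraction)))
    noise = max(sum(energies[:k]) / k, 1e-12)
    signal = max(sum(energies[k:]) / (len(energies) - k), 1e-12)
    return 10.0 * math.log10(signal / noise)
```

The brute-force correlation is quadratic in the window length, which is acceptable for the short wakeup interval but would normally be replaced by an FFT-based method on longer signals.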

Here, the transmitting to the service server may stream the voice signal to the service server starting from the voice signal corresponding to the stored frames, or starting from the voice signal generated after the stored frames.
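The two streaming start points described here can be expressed as a single switch. The sketch below is only an illustration; `stream_frames` and its `include_buffered` flag are hypothetical names, and frames are modelled as plain lists of samples.

```python
from typing import Iterable, Iterator, List

Frame = List[float]


def stream_frames(buffered: List[Frame], live: Iterable[Frame],
                  include_buffered: bool) -> Iterator[Frame]:
    """Yield frames in transmission order.

    include_buffered=True starts from the frames stored for the wakeup
    process; False starts from the frames generated after them.
    """
    if include_buffered:
        yield from buffered
    yield from live


# Usage: the same input streamed from either start point.
stored = [[0.1, 0.2], [0.3, 0.4]]
with_wakeup = list(stream_frames(stored, iter([[0.5, 0.6]]), include_buffered=True))
without_wakeup = list(stream_frames(stored, iter([[0.5, 0.6]]), include_buffered=False))
```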

In addition, the transmitting to the service server may include the extracted voice signal information in the header information of the streamed voice signal and transmit it to the service server.
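One way to carry the voice signal information in a stream header is a length-prefixed metadata block placed ahead of the audio bytes. The patent does not define a header layout, so the format below (4-byte big-endian length followed by UTF-8 JSON), the `VoiceSignalInfo` fields, and both function names are assumptions for illustration only.

```python
import json
import struct
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass
class VoiceSignalInfo:
    """Hypothetical container for the direction, channel, and noise information."""
    direction_deg: float
    channel: int
    snr_db: float


def pack_stream_header(info: VoiceSignalInfo) -> bytes:
    """Serialize the info as: 4-byte big-endian length + JSON payload."""
    payload = json.dumps(asdict(info), sort_keys=True).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def unpack_stream_header(data: bytes) -> Tuple[VoiceSignalInfo, bytes]:
    """Parse the header and return (info, remaining audio bytes)."""
    (length,) = struct.unpack(">I", data[:4])
    payload = data[4:4 + length]
    info = VoiceSignalInfo(**json.loads(payload.decode("utf-8")))
    return info, data[4 + length:]


# Usage: the device prepends the header, the server splits it off again.
packet = pack_stream_header(VoiceSignalInfo(30.0, 2, 18.5)) + b"\x01\x02"
info, audio = unpack_stream_header(packet)
```

A length prefix lets the receiver find the start of the audio without scanning for a delimiter.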

To achieve the above objects, a method for providing a voice recognition service using voice signal information according to an embodiment of the present invention may include: receiving, by a service server, a voice signal streamed from a sound output apparatus together with voice signal information extracted from a predetermined interval of the voice signal that is used for performing a wakeup process; performing, using the voice signal information, a preprocessing process for noise removal and variation compensation on the streamed voice signal; and recognizing the preprocessed voice signal, analyzing its meaning, and delivering a corresponding voice recognition service to the sound output apparatus according to the analysis result.

Here, performing the preprocessing process may include configuring a filter for a specific frequency band for removing noise from the voice signal using the direction information and the noise information, performing noise removal using the configured filter, and compensating for channel variation of the voice signal using the channel information.

Additionally, the present invention can provide a computer-readable recording medium on which a program for executing the above-described method is recorded.

To achieve the above objects, a sound output apparatus according to an embodiment of the present invention may include: a voice signal input unit that generates a voice signal by receiving a user's uttered voice through at least one multi-channel microphone; and a device control unit that extracts voice signal information including direction, channel, and noise information from a predetermined interval, used for performing a wakeup process, of the voice signal delivered through the voice signal input unit, and that, when voice recognition on the predetermined interval of the voice signal shows that a wakeup word is included, performs control so that the generated voice signal is sequentially streamed to a service server and the extracted voice signal information is transmitted to the service server.

To achieve the above objects, a service server according to an embodiment of the present invention may include: a server communication unit that receives a voice signal streamed from a sound output apparatus and receives, from the sound output apparatus, voice signal information extracted from a predetermined interval of the voice signal that is used for performing a wakeup process; and a server control unit that performs, using the voice signal information received through the server communication unit, a preprocessing process for noise removal and variation compensation on the streamed voice signal, recognizes the preprocessed voice signal, analyzes its meaning, and performs control so that a corresponding voice recognition service is delivered to the sound output apparatus according to the analysis result.

Here, the server control unit may configure a filter for a specific frequency band for removing noise from the voice signal using the direction information and the noise information included in the voice signal information, perform noise removal using the configured filter, and perform a preprocessing process that compensates for channel variation of the voice signal using the channel information included in the voice signal information.

According to the method for providing a voice recognition service using voice signal information of the present invention and the apparatus therefor, a sound output apparatus that can generate a voice signal by receiving a user's uttered voice through at least one multi-channel microphone extracts voice signal information including direction, channel, and noise information from the voice signal while performing the wakeup process. The voice signal information can therefore be extracted more quickly and transmitted to the service server.

In addition, according to the present invention, the service server that provides a voice recognition service in cooperation with the sound output apparatus can perform the preprocessing process immediately by using the voice signal information for the initial interval of the voice signal provided by the sound output apparatus. Because no separate processing is needed to prepare for preprocessing, the voice recognition service can be provided more quickly. Furthermore, because preprocessing is performed using the voice signal information generated by the sound output apparatus, the noise and other factors present when the voice signal was generated are reflected more accurately, which improves voice recognition performance.

As a result, the present invention can increase user satisfaction in providing various services through voice recognition, and makes it possible to provide a fast and accurate voice recognition service.

In addition, various effects other than those described above may be disclosed directly or implicitly in the detailed description of the embodiments of the present invention given below.

FIG. 1 is a block diagram illustrating a voice recognition service providing system according to an embodiment of the present invention.
FIG. 2 is a data flow diagram illustrating a method for providing a voice recognition service using voice signal information according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the main configuration of a sound output apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method for providing a voice recognition service using voice signal information in the sound output apparatus according to an embodiment of the present invention.
FIG. 5 is a block diagram showing the main configuration of a service server according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method for providing a voice recognition service using voice signal information in the service server according to an embodiment of the present invention.

To make the features and advantages of the means for solving the problems of the present invention clearer, the present invention is described in more detail with reference to specific embodiments illustrated in the accompanying drawings.

However, in the following description and the accompanying drawings, detailed descriptions of well-known functions or configurations that may obscure the subject matter of the present invention are omitted. It should also be noted that the same components are denoted by the same reference numerals wherever possible throughout the drawings.

The terms and words used in the following description and drawings should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, based on the principle that an inventor may appropriately define the concepts of terms in order to describe his or her own invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications that could replace them may exist at the time of filing of this application.

In addition, terms including ordinal numbers such as first and second are used to describe various components. They are used only to distinguish one component from another and are not used to limit the components. For example, without departing from the scope of the present invention, a second component may be referred to as a first component, and similarly a first component may be referred to as a second component.

Furthermore, when a component is referred to as being "connected" or "coupled" to another component, this means that it may be logically or physically connected or coupled. In other words, a component may be directly connected or coupled to another component, but it should be understood that another component may exist in between, and that the connection or coupling may be indirect.

The terminology used in this specification is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as "include" or "have" used in this specification are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude in advance the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, a method for providing a voice recognition service using voice signal information according to an embodiment of the present invention, and an apparatus therefor, will be described.

First, a voice recognition service providing system according to an embodiment of the present invention will be described.

FIG. 1 is a block diagram illustrating a voice recognition service providing system according to an embodiment of the present invention.

Referring to FIG. 1, the voice recognition service providing system of the present invention may include a terminal 100, a sound output apparatus 200, and a service server 300. The terminal 100 and the sound output apparatus 200 may be connected by short-range communication to transmit and receive information, while the terminal 100 and the service server 300, and the sound output apparatus 200 and the service server 300, may be connected via a communication network 500.

The terminal 100 refers to a user device capable of transmitting and receiving various data at the user's request. In particular, the terminal 100 of the present invention may be connected to the sound output apparatus 200 by short-range communication and may transmit various setting information for operating the sound output apparatus 200 to the sound output apparatus 200. For example, when the sound output apparatus 200 provides a notification function, the terminal 100 may receive various information for notification settings from the user and provide it to the sound output apparatus 200.

In addition, while connected to the sound output apparatus 200 by short-range communication, the terminal 100 of the present invention may transmit its own identification information to the sound output apparatus 200, and the sound output apparatus 200 may transmit the identification information of the terminal 100 to the service server 300 through the communication network 500 to support the provision of a more personalized service.

In addition, the terminal 100 according to an embodiment of the present invention can set or change the wakeup word. In other words, the sound output apparatus 200 of the present invention recognizes a wakeup word in the user's voice signal and, once the wakeup word is recognized, may transmit the voice signal to the service server 300 or perform a designated operation; the terminal 100 can perform the process of designating or changing that wakeup word.

In addition, the terminal 100 according to an embodiment of the present invention may transmit to the service server 300 information about the content providing server 400 to whose content service it is subscribed. For example, assuming that there are a plurality of content providing servers 400 that serve the same content, the terminal 100 may subscribe to the service provided by a specific content providing server 400 among them, and may transmit the resulting subscription information to the service server 300 so that the content provided by that content providing server 400 can be used through the sound output apparatus 200.

Furthermore, the terminal 100 of the present invention may share or provide the setting information of various applications installed in the terminal 100 with the sound output apparatus 200 or the service server 300, to support more intelligent services on the sound output apparatus 200. For example, if the user utters "What appointments do I have today?", the sound output apparatus 200 generates a voice signal from the utterance and transmits it to the service server 300. The service server 300 recognizes the user's voice and, to provide the corresponding service, checks the user's schedule in cooperation with a schedule management application installed in the terminal 100. It can then transmit a response message containing the confirmed response information, for example "You have an alumni meeting at Gangnam Station at 7 o'clock", to the sound output apparatus 200.

To this end, the terminal 100 of the present invention must have a service application for controlling the sound output apparatus 200 installed in advance. The service application may be downloaded and installed through the service server 300 or a separate application store, or may be used as a cloud service by connecting to the service server 300.

The sound output apparatus 200, which provides the voice recognition service of the present invention by outputting it through a speaker, receives the voice uttered by a user within a certain radius, generates a voice signal, and, working in cooperation with the service server 300 on the basis of the generated voice signal, outputs content provided by various content providing servers 400. In particular, the sound output apparatus 200 of the present invention includes a microphone, receives the user's uttered voice through the microphone, and generates a voice signal from it. The sound output apparatus 200 divides the voice signal input through the microphone into frames of a predetermined length and performs voice recognition on the initial interval of the voice signal to check whether a pre-designated wakeup word is included. Here, the wakeup word is a word registered in advance in the sound output apparatus 200. When the wakeup word is included in the user's uttered voice, the sound output apparatus 200 transmits the generated voice signal to the service server 300 via the communication network 500. At this time, when the sound output apparatus 200 confirms that the wakeup word is present in the input voice signal, it applies power to its communication module and streams the voice signal including the wakeup word to the service server 300.
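The wakeup gating described above can be sketched as follows. This is an illustrative model, not the apparatus itself: `CommModule` is a stand-in that only tracks power state and sent frames, and `contains_wakeup_word` is a hypothetical callback standing in for whatever keyword recognizer the device uses.

```python
from typing import Callable, Iterable, List

Frame = List[float]


class CommModule:
    """Stand-in for the communication module; records power state and sent frames."""

    def __init__(self) -> None:
        self.powered = False
        self.sent: List[Frame] = []

    def power_on(self) -> None:
        self.powered = True

    def send(self, frame: Frame) -> None:
        if not self.powered:
            raise RuntimeError("communication module is not powered")
        self.sent.append(frame)


def handle_utterance(initial_frames: List[Frame],
                     later_frames: Iterable[Frame],
                     contains_wakeup_word: Callable[[List[Frame]], bool],
                     comm: CommModule) -> bool:
    """Stream the utterance only if the initial interval contains the wakeup word."""
    if not contains_wakeup_word(initial_frames):
        return False  # stay idle; the communication module remains unpowered
    comm.power_on()
    for frame in initial_frames:  # the interval that includes the wakeup word
        comm.send(frame)
    for frame in later_frames:    # the rest of the utterance, in order
        comm.send(frame)
    return True
```

Keeping the communication module off until the wakeup word is found means nothing leaves the device for utterances that are not addressed to it.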

In particular, when transmitting the voice signal to the service server 300, the sound output apparatus 200 of the present invention may perform a preprocessing process on the initial interval of the voice signal. In doing so, the sound output apparatus 200 may generate voice signal information including direction, channel, and noise information corresponding to the voice signal. Here, the direction information is information that can be extracted using at least one of the arrival time, the arrival distance, and the intensity of the voice input through each of the at least one multi-channel microphone; the channel information is the channel information of the multi-channel microphone; and the noise information is the signal-to-noise ratio of the voice signal. The sound output apparatus 200 generates this voice signal information and may transmit it to the service server 300.

Then, when response information suited to the user's intention, obtained by recognizing the voice signal, is received from the service server 300, the sound output apparatus 200 outputs it as a voice signal through its speaker. The sound output apparatus 200 may receive a response message in the form of a voice signal from the service server 300, convert the received response message into an audible frequency band that the user can perceive, and output it through the speaker.

On the other hand, if link information is received together with the response message from the service server 300, the sound output apparatus 200 of the present invention may connect to the content providing server 400 using the link information, receive the content corresponding to the link information from the content providing server 400 in streaming form, and output it through the speaker. For example, when the link information is link information for sound source A provided by content providing server B 400b of FIG. 1, the sound output apparatus 200 connects to content providing server B 400b via the communication network 500, requests sound source A, receives streaming data from content providing server B 400b, and plays it. The link information may include approval information indicating that access to the content providing server 400 is permitted. Because the sound output apparatus 200 that has received the link information is thereby authorized to access the content providing server 400, the content providing server 400 can provide the content without a separate authentication process.
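The branching between a plain response message and a response that carries link information might look like the sketch below. The patent does not define the message format, so `ServerResponse`, `LinkInfo`, the `approval_token` field, the action names, and the example URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class LinkInfo:
    """Hypothetical link information: where the content is, plus approval information."""
    url: str
    approval_token: str


@dataclass
class ServerResponse:
    """Hypothetical response from the service server."""
    message_audio: bytes
    link: Optional[LinkInfo] = None


def plan_playback(resp: ServerResponse) -> List[Tuple[str, Any]]:
    """Return the ordered actions the device would take for this response."""
    actions: List[Tuple[str, Any]] = [("play_response", resp.message_audio)]
    if resp.link is not None:
        # The approval information travels with the request, so the content
        # server can serve the stream without a separate authentication step.
        actions.append(("stream_content", {
            "url": resp.link.url,
            "approval_token": resp.link.approval_token,
        }))
    return actions


# Usage: a response that points at sound source A on a content server.
example = ServerResponse(b"ok", LinkInfo("https://content-b.example/source-a", "token-123"))
planned = plan_playback(example)
```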

Meanwhile, the service server 300 refers to an apparatus of the service provider that provides the voice recognition service according to an embodiment of the present invention. The service server 300 may interwork with the terminal 100 to transmit and receive various information for providing the service. In addition, the service server 300 may receive, from the sound output apparatus 200, a voice signal generated from the user's utterance and provide the corresponding voice recognition service.

In particular, the service server 300 may receive voice signal information together with the voice signal from the sound output apparatus 200. The voice signal information is information on the characteristics of the voice signal, generated for an initial partial interval of the voice signal, and may include direction, channel, and noise information. The voice signal information may be received over a path separate from the voice signal streamed from the sound output apparatus 200, or it may be received as part of the header information of the voice signal.
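As an illustration only, the voice signal information described above could be modeled as a small record that is serialized into a header of the streamed voice signal. The field names (`direction_deg`, `channels`, `snr_db`) and the JSON encoding are assumptions made for this sketch; the specification does not define a format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class VoiceSignalInfo:
    """Hypothetical container for the direction, channel, and noise information."""
    direction_deg: float  # estimated direction of the speaker (assumed unit: degrees)
    channels: int         # number of microphone channels (assumed representation)
    snr_db: float         # signal-to-noise ratio of the initial interval, in dB


def to_header(info: VoiceSignalInfo) -> bytes:
    """Serialize the information so it can travel in a stream header."""
    return json.dumps(asdict(info)).encode("utf-8")


def from_header(raw: bytes) -> VoiceSignalInfo:
    """Restore the information on the service server side."""
    return VoiceSignalInfo(**json.loads(raw.decode("utf-8")))
```

The same record could equally be sent over a separate path, as the paragraph above allows; only the transport would differ.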

When this voice signal information is received, the service server 300 may use it to perform preprocessing for noise removal and variation compensation. For noise removal, a filter that removes the specific frequency band corresponding to the noise is configured on the basis of the direction and noise information in the voice signal information, and the noise is then removed with the configured filter. As for channel variation, the channel characteristics of the multi-channel microphone may change while the signal is transmitted from the sound output apparatus 200 to the service server 300; the service server 300 may therefore identify the original channel from the channel information in the voice signal information and compensate for the channel variation.
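The noise-removal step can be sketched as follows, under the assumption that the noise information names a frequency band (`low_hz` to `high_hz`) to suppress. The naive DFT used here is chosen only to keep the example self-contained; it is not the filter design of the specification, and channel-variation compensation is not covered.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def remove_band(samples, sample_rate, low_hz, high_hz):
    """Zero every frequency bin whose frequency lies in [low_hz, high_hz]."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies of a real signal.
        freq = min(k, n - k) * sample_rate / n
        if low_hz <= freq <= high_hz:
            spectrum[k] = 0
    return idft(spectrum)
```

Because the band is supplied by the device, the server in this sketch does not need to estimate the noise itself, which mirrors the motivation given in the text.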

Through this process, the service server 300 can perform voice recognition more efficiently and accurately, and can thereby determine the user's intention more accurately.

The service server 300 may also interwork with a plurality of content providing servers 400 at its back end. It may determine the user's intention through voice recognition, identify a content providing server 400 able to provide suitable content, receive the content from that content providing server 400, and provide it to the sound output apparatus 200. In doing so, the service server 300 does not simply relay the content provided by the content providing server 400 to the sound output apparatus 200. Rather, it composes a response message according to the voice recognition service, converts it into a voice signal through a TTS (Text to Speech) engine, and transmits the converted voice signal to the sound output apparatus 200 in streaming form.

The operation between the sound output apparatus 200 and the service server 300 is described in more detail later. The content providing server 400 of the present invention refers to an apparatus of a content provider (CP) that transmits the corresponding service or information to the service server 300 at the request of the service server 300. In particular, the content providing server 400 may transmit different types of responses depending on the type of query from the service server 300 or the type of content it provides. For example, if content providing server A 400a is a content providing server 400 that can provide a streaming service for sound sources, it may transmit to the service server 300 link information with which the requested sound source can be streamed. On the other hand, if content providing server B 400b is a server that provides specific content such as weather or temperature, it may provide the service server 300 only with the specific information identified for the request (current weather, current temperature, and so on).

In addition, content in the present invention is a concept that covers pure content such as movies and music, specific information such as weather and time, and application programs that perform specific functions, such as game applications and schedule management applications. Such content may be transmitted to the sound output apparatus 200 in the form of streaming data or delivered to the service server 300.

The communication network 500 supporting the above process may consist of an IP-based wired communication network such as the Internet, a mobile communication network such as an LTE (Long Term Evolution) network or a WCDMA network, various kinds of wireless networks such as a Wi-Fi network, or a combination of these. The communication network 500 may include an access network, a backbone network, and an Internet network. Since various known technologies can be applied to its specific configuration and operation, a detailed description is omitted.

The communication network 500 may also include a cloud computing network that stores computing resources such as hardware and software and provides the computing resources a client needs to the corresponding terminal. Here, cloud computing refers to a computing environment in which information is stored permanently on servers on the Internet and only temporarily on client terminals such as desktops, tablet computers, notebooks, netbooks, and smartphones. That is, it is a computing environment access network in which all of a user's information is stored on servers on the Internet and can be used anytime and anywhere through various IT devices. In this case, the terminal 100 may access and use, in a cloud computing manner, the service application provided by the service server 300 in order to connect to the service server 300.

The main configuration of each apparatus for the embodiment of the present invention has been outlined above. A processor mounted in each apparatus according to an embodiment of the present invention can process program instructions for executing the method according to the present invention. In one implementation the processor may be a single-threaded processor, and in another implementation it may be a multithreaded processor. The processor is further capable of processing instructions stored in a memory or on a storage device.

Hereinafter, the method for providing a voice recognition service using voice signal information according to an embodiment of the present invention is described in more detail.

FIG. 2 is a data flow diagram illustrating a method for providing a voice recognition service using voice signal information according to an embodiment of the present invention.

Referring to FIG. 2, the sound output apparatus 200 of the present invention is in a standby state by default. In the standby state, the sound output apparatus 200 supplies power only to some modules, such as the microphone for receiving the user's voice, and cuts power to modules that are not needed at that point, such as the communication module for communicating with the service server 300.

The standby state may also refer collectively to any state before voice recognition starts. In other words, even if several users are speaking near the sound output apparatus 200, or loud noise is present, so that voice signals are continuously picked up by the microphone, the standby state is one in which the incoming voice signal is continuously discarded and kept from being delivered to the service server 300 until the wake-up word is input.

In this state, the sound output apparatus 200 receives, through its microphone, a voice uttered by a user within a certain radius (S101) and generates a voice signal from it (S103). Here, the sound output apparatus 200 may convert the user's voice input through the microphone into a voice signal in the form of an electrical signal in real time.

In particular, as a stage preceding the wake-up process, the sound output apparatus 200 may store the voice signal for a preset initial interval and perform preprocessing using the voice signal for that initial interval.

For example, when converting the voices input through each microphone of the multi-channel microphone into electrical voice signals, the sound output apparatus 200 may extract direction information using at least one of the arrival time, the travel distance, and the intensity of each input voice. The sound output apparatus 200 may also identify and extract the channel information of the multi-channel microphone. Here, it may identify the channel information of each microphone channel individually, or the channel information obtained after the channels of the multi-channel microphone have been mixed. The sound output apparatus 200 may further extract a signal-to-noise ratio for the voice signal. To do so, it may divide the voice signal into frames of a fixed length, extract a voice feature vector for each frame, and extract the signal-to-noise ratio using the voice feature vectors.
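The extraction of direction and noise information can be sketched as below. Several simplifications are assumptions of this example rather than details of the specification: direction is estimated from the arrival-time difference between just two microphones under a far-field model, the frame energy stands in for the voice feature vector, and the first few frames are assumed to contain noise only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C


def direction_from_delay(delay_s, mic_spacing_m):
    """Angle of arrival (degrees from broadside) for a two-microphone pair."""
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement error
    return math.degrees(math.asin(ratio))


def split_frames(samples, frame_len):
    """Divide the signal into non-overlapping frames of a fixed length."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def estimate_snr_db(frames, noise_frames):
    """SNR in dB, assuming the first `noise_frames` frames hold noise only."""
    noise = sum(frame_energy(f) for f in frames[:noise_frames]) / noise_frames
    speech = frames[noise_frames:]
    signal = sum(frame_energy(f) for f in speech) / len(speech)
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)
```

A zero delay maps to a source directly in front of the microphone pair, and a delay equal to the spacing divided by the speed of sound maps to a source on the axis of the pair.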

The sound output apparatus 200 then performs voice recognition (S105). Here, it may recognize the voice signal after performing noise removal or similar processing using the information extracted in step S103. Alternatively, since noise removal is performed again by the service server 300, it may recognize the voice signal without such processing.

As described above, the sound output apparatus 200 may recognize the voice signal for the preset initial interval used for the wake-up process. For example, if it is set to recognize the voice signal of the first 10 frames, the sound output apparatus 200 stores the voice signal interval for those 10 frames and performs voice recognition on the stored information. To this end, the sound output apparatus 200 may store the incoming voice signal in the device storage unit and determine, from the recognition result for those frames, whether a word corresponding to wake-up is present (S107). If no such word is present, it deletes the voice signal stored in the device storage unit. If the word is present, it may stream the voice signal, divided into frames in real time, to the service server 300 (S109). Together with the streamed voice signal, the sound output apparatus 200 transmits to the service server 300 the voice signal information extracted from the initial interval of the voice signal. The sound output apparatus 200 may transmit to the service server 300 a voice signal from which the portion corresponding to the wake-up word has been removed, or a voice signal that includes the wake-up word.
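The wake-up decision of steps S107 and S109 can be sketched as follows. The wake-up word, the `recognize` and `send` callables, and the use of text tokens in place of audio frames are all hypothetical stand-ins introduced for illustration.

```python
WAKEUP_WORD = "wakeup"   # hypothetical wake-up word
INITIAL_FRAMES = 10      # length of the preset initial interval, in frames


def handle_utterance(frames, recognize, send, include_wakeup=False):
    """Buffer the initial interval, check for the wake-up word, then stream.

    recognize: callable mapping a list of frames to recognized text.
    send: callable that transmits one frame to the service server.
    Returns True if the wake-up word was found and frames were streamed.
    """
    buffered = list(frames[:INITIAL_FRAMES])  # stored in the device storage unit
    if WAKEUP_WORD not in recognize(buffered):
        buffered.clear()                      # no wake-up word: discard (S107)
        return False
    start = 0 if include_wakeup else INITIAL_FRAMES
    for frame in frames[start:]:              # stream frame by frame (S109)
        send(frame)
    return True
```

The `include_wakeup` flag reflects the two options in the text: sending the signal with the wake-up portion removed, or sending it with the wake-up word included.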

The service server 300 may then use the voice signal information received from the sound output apparatus 200 to configure a filter for removing noise from the incoming voice signal (S111). In a conventional approach, a server receives the voice signal and performs preprocessing for noise removal using the received voice signal itself. The service server 300 of the present invention, by contrast, does not carry out its own determination and design process for noise removal. By using the voice signal information delivered from the sound output apparatus 200, it can configure the noise removal filter more quickly and remove the noise with the configured filter. The service server 300 may also compensate for channel variation of the voice signal using the channel information included in the voice signal information.

In this way, the service server 300 preprocesses the entire incoming voice signal using the voice signal information delivered from the sound output apparatus 200, without a separate determination process for noise removal and channel variation, and can therefore perform voice recognition more quickly. It can also use the channel information delivered from the sound output apparatus 200 to compensate for channel variation that may arise during communication.

The service server 300 then performs voice recognition on the preprocessed voice signal (S115). Here, the service server 300 extracts voice feature vectors from the received voice signal and identifies the phoneme corresponding to each extracted voice feature vector. A phoneme is the smallest unit that distinguishes meaning (for example, ㄱ, ㄹ, ㅓ, ㅕ, ㅁ), and phoneme recognition converts the signal into a set of such units.
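As a toy illustration of mapping a voice feature vector to a phoneme, the sketch below picks the stored template closest to the vector. The two-dimensional templates are invented for the example, and nearest-template matching is a stand-in technique; the specification does not say how the phoneme is identified, and practical recognizers rely on trained acoustic models.

```python
import math

# Hypothetical 2-D feature templates, one per phoneme.
PHONEME_TEMPLATES = {
    "ㄱ": (1.0, 0.0),
    "ㅁ": (0.0, 1.0),
    "ㅓ": (0.7, 0.7),
}


def nearest_phoneme(feature_vector):
    """Return the phoneme whose template is closest in Euclidean distance."""
    return min(PHONEME_TEMPLATES,
               key=lambda p: math.dist(feature_vector, PHONEME_TEMPLATES[p]))


def recognize_phonemes(feature_vectors):
    """Convert a sequence of per-frame feature vectors into phonemes."""
    return [nearest_phoneme(v) for v in feature_vectors]
```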

Next, the service server 300 forms word strings, each a set of phonemes, and performs semantic analysis on the word strings (S117). That is, the service server 300 analyzes the sentence made up of the word strings and determines whether each word string corresponds to an entity or to an intent (S119). Here, an entity may be a predefined proper noun, such as the name of a specific service the user wants. For example, agreed-upon proper nouns such as weather, latest songs, pop songs, time, and appointment may be extracted as entities. An intent, on the other hand, is a word string that carries the user's intention. Expressions such as 'play it' ('틀어줘'), 'what time is it' ('몇시야'), and 'how is it' ('어때'), which express the user's intention apart from the entity, are examples.

For example, when the user utters "play some music" ("음악 들려줘"), the service server 300 may analyze 'music' ('음악') as the entity and 'play' ('들려줘') as the intent. Likewise, when the user utters "turn on the air purifier" ("공기청정기 켜줘"), the service server 300 may analyze 'air purifier' ('공기청정기') as the entity and 'turn on' ('켜줘') as the intent. The service server 300 may store, in table form, information on the service providing entity corresponding to each entity, and may use the analyzed entity to identify the service providing entity, that is, the content providing server (S121). If there is no separate content providing server, the service server 300 itself may be the service providing entity.

The service server 300 may also store and manage, in table form, information on the user intention corresponding to each intent. Using the analyzed intent, it can determine the user's intention accurately and request the service by querying the corresponding content providing server 400 (S123). For example, the 'play' ('들려줘') intent may be identified as 'playback', and a service request for playback may be sent to the corresponding content providing server 400. The 'turn on' ('켜줘') intent may be identified as 'execute', and the device corresponding to the entity may be run either by the corresponding content providing server 400 or directly by the service server 300 itself.
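The table lookups of steps S119 to S123 can be sketched as two dictionaries. The entity and intent words are taken from the examples in the text; the provider names and the fallback label `service_server` are hypothetical, and the word strings are assumed to arrive already segmented.

```python
# Entity table: entity -> service providing entity (None: no separate server).
ENTITY_PROVIDERS = {
    "음악": "content_server_music",      # 'music'; provider name is hypothetical
    "날씨": "content_server_weather",    # 'weather'; provider name is hypothetical
    "공기청정기": None,                   # 'air purifier'
}

# Intent table: intent word string -> user intention.
INTENT_ACTIONS = {
    "들려줘": "play",      # 'play it'
    "켜줘": "execute",     # 'turn it on'
    "어때": "inquire",     # 'how is it'
}

SELF_PROVIDER = "service_server"  # used when no content providing server exists


def analyze(word_strings):
    """Classify each word string as entity or intent and look up the tables.

    Returns (entity, action, provider); entity or action is None if absent.
    """
    entity = action = None
    for word in word_strings:
        if word in ENTITY_PROVIDERS:
            entity = word
        elif word in INTENT_ACTIONS:
            action = INTENT_ACTIONS[word]
    provider = ENTITY_PROVIDERS.get(entity) or SELF_PROVIDER
    return entity, action, provider
```

For the utterance "음악 들려줘" this yields the music provider with the `play` action, while "공기청정기 켜줘" falls back to the service server itself.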

As another example, if the entity is 'weather' and the intent of 'how is it' ('어때') is 'inquiry', the service server 300 may identify content providing server A 400a, which can provide weather content, and send a weather query to content providing server A 400a. Here, the service server 300 may check the identification information of the sound output apparatus 200, determine the current location and time of the sound output apparatus 200, and then send a more specific query to the content providing server 400. For example, if the current location of the sound output apparatus 200 is 'Guro-gu, Seoul', the service server 300 may query the weather in Guro, Seoul at the current time, and the content providing server 400 identifies the response corresponding to the query from the service server 300 and transmits it to the service server 300 (S125).

The service server 300 then composes a response message using the received response (S127). That is, because the service server 300 has received only the response information on the content from the content providing server 400, it may process that information into a form suitable for transmission to the sound output apparatus 200.

First, if the response information 'rain' has been received from the content providing server 400, the service server 300 may compose a sentence that includes this content. The sentence may be composed taking into account the current location and current time of the sound output apparatus 200. In the example above, assuming the current location of the sound output apparatus 200 is Guro, Seoul and the current time is 2 o'clock, the service server 300 may include this information and compose a response message such as "In Guro, Seoul, it is raining as of 2 o'clock." When composing the response message, the service server 300 may also take into account information set by the user. For example, when the user's settings for response messages are delivered from the terminal 100, the response message may be composed on that basis. If the user selects English as the language, the content delivered from the content providing server 400 may be converted according to English grammar and the response message composed accordingly. If the user has set the style to 'friendly', the response message may be composed in a familiar style, and if the user has set it to honorific, the response message may be composed in an honorific style.
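Composing the response message from the content, the device location, the current time, and a user-selected style could look like the sketch below. The template strings and the style keys `polite` and `friendly` are invented for illustration; the specification only states that such settings may be taken into account.

```python
# Hypothetical sentence templates, one per user-selectable style.
TEMPLATES = {
    "polite": "In {location}, the weather as of {hour} o'clock is: {content}.",
    "friendly": "Hey, it's {hour} o'clock in {location} and the weather is {content}!",
}


def build_response(content, location, hour, style="polite"):
    """Wrap the raw content from the content server in a full sentence."""
    template = TEMPLATES.get(style, TEMPLATES["polite"])  # fall back to polite
    return template.format(content=content, location=location, hour=hour)
```

For instance, `build_response("rain", "Guro, Seoul", 2)` produces a polite sentence containing the location, the time, and the content, which would then be passed to the TTS step.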

To output the composed response message through the sound output apparatus 200, the service server 300 converts it into a voice signal (S129) and transmits the response message, now an electrical voice signal, to the sound output apparatus 200 in real-time streaming form (S131). That is, if the response message converted into a voice signal consists of 10 frames, it is transmitted to the sound output apparatus 200 one frame at a time in real time, and the sound output apparatus 200 outputs it through the speaker (S133).
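Frame-by-frame transmission of the converted response (S131) can be sketched with a generator. Treating the voice signal as a byte string and the choice of frame length are assumptions of this example.

```python
def stream_frames(voice_signal: bytes, frame_len: int):
    """Yield the voice signal one frame at a time, in order."""
    for start in range(0, len(voice_signal), frame_len):
        yield voice_signal[start:start + frame_len]


def transmit(voice_signal: bytes, frame_len: int, send) -> int:
    """Send each frame as soon as it is available; returns the frame count."""
    count = 0
    for frame in stream_frames(voice_signal, frame_len):
        send(frame)  # `send` is a hypothetical transport callable
        count += 1
    return count
```

With a 100-byte signal and a frame length of 10, the receiver gets 10 frames and can begin playback after the first one arrives.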

Meanwhile, the sound output apparatus 200 may also access the content providing server 400 using the link information transmitted from the service server 300, and receive and use the corresponding content from the content providing server 400 in a streaming manner.

The operation of the voice recognition service providing system according to an embodiment of the present invention has been described above.

Hereinafter, the method for providing a voice recognition service using voice signal information according to an embodiment of the present invention is described in more detail, focusing on the apparatuses.

First, the main configuration and operation of the sound output apparatus 200 according to an embodiment of the present invention are described.

FIG. 3 is a block diagram showing the main configuration of the sound output apparatus according to an embodiment of the present invention, and FIG. 4 is a flowchart illustrating a method for providing a voice recognition service using voice signal information in the sound output apparatus according to an embodiment of the present invention.

Referring first to FIG. 3, the sound output apparatus 200 of the present invention may include a voice signal input unit 210, a device control unit 220, a device communication unit 230, an output unit 240, and a device storage unit 250.

The voice signal input unit 210 may include an ADC 212 that converts the analog voice signal input through the microphone 211 into an electrical signal. In particular, the voice signal input unit 210 receives the voice produced by the user's utterance through a plurality of multi-channel microphones 211.

The input voice may contain ambient noise and echo arising through the multi-channel microphones. The ADC 212 converts the analog voice, including such echo and noise, into an electrical voice signal and passes it to the device control unit 220.

The device control unit 220 may include a voice recognition module 221, a wake-up determination module 222, a signal processing module 223, and a service processing module 224.

The voice recognition module 221 may perform various processes for recognizing the voice signal, in the form of an electrical signal, delivered through the ADC 212 of the voice signal input unit 210. In particular, the voice recognition module 221 may perform preprocessing and voice recognition on an initial partial interval of the voice signal. To this end, the voice recognition module 221 generates voice signal information, including direction, channel, and noise information, for the initial partial interval of the voice signal that the wake-up determination module 222 needs for its wake-up determination.

More specifically, when voice is input through the plurality of multi-channel microphones 211 of the voice signal input unit 210, the voice recognition module 221 may determine in real time the direction of the user uttering the voice, using at least one of the time, the distance, and the sound intensity (in decibels) of the input voice, and may extract direction information on that basis.

The voice recognition module 221 may also extract, from the voice signals input through the plurality of multi-channel microphones 211 and delivered via the ADC 212, channel information on the voice signal input from the microphone of each channel. Here, the voice recognition module 221 may extract channel information for each voice signal individually or, depending on the implementation, may mix the multiple channels and extract information on the mixed channel.

The voice recognition module (221) can also check the level of noise between the channels of the voice signal and ultimately determine the signal-to-noise ratio. Here, the voice recognition module (221) divides the voice signal into frames, extracts a voice feature vector for each frame, and then extracts the signal-to-noise ratio (SNR) using the spectrum of the voice feature vectors.
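As a rough illustration of the framing and SNR steps, the sketch below splits a signal into frames and estimates SNR from frame energies. It is a simplified stand-in: the description derives SNR from the spectrum of the voice feature vectors, whereas this example uses time-domain energy and assumes that the first few frames contain noise only. The function names and the `noise_frames` parameter are hypothetical.

```python
import math


def split_frames(samples, frame_len):
    """Split a sample list into consecutive, non-overlapping full frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def estimate_snr_db(samples, frame_len, noise_frames=2):
    """Rough SNR in dB, treating the first `noise_frames` frames as noise only."""
    energies = [frame_energy(f) for f in split_frames(samples, frame_len)]
    noise = sum(energies[:noise_frames]) / noise_frames
    speech = energies[noise_frames:]
    if not speech or noise <= 0:
        raise ValueError("need speech frames and a non-zero noise floor")
    # Remove the noise contribution from the speech-frame energy.
    signal = max(sum(speech) / len(speech) - noise, 1e-12)
    return 10.0 * math.log10(signal / noise)
```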

The voice recognition module (221) may then perform, frame by frame, the voice recognition needed for the wake-up determination. Here, it may perform voice recognition on a voice signal from which noise has been removed based on the noise information, direction information, and the like extracted through the process described above.

Further, the voice recognition module (221) identifies the phonemes corresponding to the voice feature vectors computed for each frame by consulting the voice recognition information (251) in the device storage unit (250). It then assembles the phonemes into a set, builds a word sequence from them, and completes a sentence, thereby carrying out voice recognition.

This process may run in real time on a frame-by-frame basis. The voice recognition module (221) performs voice recognition on the initial portion of the input voice signal, and the wake-up determination module (222) then determines whether to perform the wake-up operation.

When the voice signal is recognized in real time, the wake-up determination module (222) checks whether a wake-up word is present. If a wake-up word is present, the wake-up determination module (222) may cause power to be applied to modules in a standby state, such as the device communication unit (230) and the output unit (240).
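A minimal sketch of this wake-up check, assuming the recognizer yields a list of words and that standby modules expose a simple power flag. `StandbyModule`, `contains_wakeup_word`, and `handle_recognition` are hypothetical names used only for illustration.

```python
def contains_wakeup_word(recognized_words, wakeup_words):
    """Return True if any recognized word matches a stored wake-up word."""
    stored = {w.lower() for w in wakeup_words}
    return any(word.lower() in stored for word in recognized_words)


class StandbyModule:
    """Hypothetical stand-in for a module such as the communication or output unit."""

    def __init__(self, name):
        self.name = name
        self.powered = False


def handle_recognition(recognized_words, wakeup_words, standby_modules):
    """Power on the standby modules only when a wake-up word was recognized."""
    woke = contains_wakeup_word(recognized_words, wakeup_words)
    if woke:
        for module in standby_modules:
            module.powered = True
    return woke
```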

The service processing module (223) may then transmit the subsequent voice signal to the service server (300) through the service interworking module (231) of the device communication unit (230). In particular, when the wake-up determination module (222) determines that a wake-up word is included, the service processing module (223) may control the voice signal delivered in real time from the voice signal input unit (210) so that it is streamed frame by frame to the service server (300) through the device communication unit (230).

Here, the service processing module (223) may also begin with the voice signal that was temporarily stored in the device storage unit (250) for the wake-up determination, retrieving it and transmitting it to the service server (300).

The voice signal information may be transmitted to the service server (300) separately from the streamed voice signal, or it may be included in the header information of the voice signal and transmitted to the service server (300).
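The two delivery options can be sketched as follows. The description does not define a wire format, so the JSON encoding, the 4-byte length prefix, and the field names `direction_deg`, `channel_count`, and `snr_db` are all assumptions made for this example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class VoiceSignalInfo:
    """Hypothetical container for the direction, channel, and noise information."""
    direction_deg: float
    channel_count: int
    snr_db: float


def as_separate_message(info):
    """Option 1: send the information as its own message, apart from the audio."""
    return json.dumps({"type": "voice_signal_info", **asdict(info)}).encode("utf-8")


def with_header(info, audio_frame):
    """Option 2: prepend the information to an audio frame as a length-prefixed header."""
    header = json.dumps(asdict(info)).encode("utf-8")
    return len(header).to_bytes(4, "big") + header + audio_frame


def split_header(packet):
    """Server side: recover the information and the audio payload from a packet."""
    size = int.from_bytes(packet[:4], "big")
    info = VoiceSignalInfo(**json.loads(packet[4:4 + size]))
    return info, packet[4 + size:]
```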

When a response message is delivered from the service server (300), the service processing module (223) passes it to the output unit (240). If link information for accessing content is received together with the response message through the service interworking module (231) of the device communication unit (230), the service processing module (223) may connect to the content providing server (400) using the link information, receive streaming data for the corresponding content from the content providing server (400) through the service interworking module (231), and control its output through the output unit (240).

The device communication unit (230) includes a service interworking module (231) and a terminal interworking module (232). The service interworking module (231) exchanges information with the service server (300) or the content providing server (400) over the communication network (500), whereas the terminal interworking module (232) exchanges information with the terminal (100) over short-range communication.

The output unit (240) may include a DAC (Digital-to-Analog Converter, 241) that converts the voice signal, which is an electrical signal, into analog form, and a speaker (242) that outputs the voice signal produced by the DAC (241).

The device storage unit (250) stores and manages various information related to the operation of the sound output apparatus (200) according to an embodiment of the present invention. In particular, the device storage unit (250) may include voice recognition information (251) for recognizing the input voice signal in real time, and may store and manage wake-up information (252). That is, the wake-up information (252) can be used to check whether a recognized word sequence corresponds to a stored wake-up word. The device storage unit (250) may also store and manage various information (253) related to service processing.

Further, the device storage unit (250) may divide the voice signal input in real time through the voice signal input unit (210) into frames and store it, and, at the request of the device control unit (220), may support delivery of the stored voice signal to the device communication unit (230). It may also delete the stored voice signal at the request of the device control unit (220).
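One way to picture this store, forward, and delete behavior is a small frame buffer. The sketch below is an assumption-laden illustration: the class name `FrameBuffer`, its methods, and the choice to bound the buffer with a maximum frame count are not taken from the description.

```python
from collections import deque


class FrameBuffer:
    """Hypothetical frame store standing in for the device storage unit."""

    def __init__(self, max_frames):
        # A bounded deque silently drops the oldest frame once it is full.
        self._frames = deque(maxlen=max_frames)

    def store(self, frame):
        """Store one frame of the incoming voice signal."""
        self._frames.append(frame)

    def drain(self):
        """Hand over all stored frames in arrival order and empty the buffer."""
        frames = list(self._frames)
        self._frames.clear()
        return frames

    def clear(self):
        """Delete the stored frames without forwarding them."""
        self._frames.clear()

    def __len__(self):
        return len(self._frames)
```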

The operation of the sound output apparatus (200) according to this embodiment of the present invention will now be described with reference to FIG. 4.

Referring to FIG. 4, the sound output apparatus (200) may remain in a standby state until a wake-up word is recognized (S201).

In this state, when speech from a user within a certain radius is input through the built-in microphone (S203), the sound output apparatus (200) generates a voice signal and may store a certain range of that signal for the wake-up determination. Here, the sound output apparatus (200) receives the user's speech through the microphone, converts it into a voice signal in the form of an electrical signal, splits the converted voice signal into frames of a fixed length, stores them, and uses them to perform the voice recognition needed for the wake-up determination. In particular, the sound output apparatus (200) performs preprocessing on the section used for that voice recognition (S205) and may generate voice signal information including direction, channel, and noise information (S207).

Thereafter, the sound output apparatus (200) may remove noise using the generated voice signal information and may perform voice recognition on the noise-removed voice signal (S209).

The sound output apparatus (200) then determines whether a wake-up word is present (S211) and, if so, streams the voice signal to the service server (300). Here, the sound output apparatus (200) may transmit frames to the service server (300) sequentially, starting from the frames stored during the voice recognition step for the wake-up determination. Alternatively, it may delete those frames and transmit only the voice signal subsequently input from the microphone.

The sound output apparatus (200) also transmits the voice signal information extracted in step S207 to the service server (300), either separately from the voice signal or included in the header information of the voice signal (S213).

On receiving these, the service server (300) recognizes the voice signal, identifies the user's intention, and may then generate a response message to the voice signal uttered by the user.

The service server (300) then transmits the generated response message to the sound output apparatus (200). When the sound output apparatus (200) receives from the service server (300) the response message converted into a voice signal, which is an electrical signal (S215), it may output the message through its built-in speaker (S217). Here, the sound output apparatus (200) may convert the response message into analog form and output the analog response message.

Meanwhile, if it is determined in step S211 that no wake-up word is included, the temporarily stored voice signal is deleted, and the voice signal information extracted for that voice signal is deleted as well (S219).
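The device-side flow of FIG. 4 can be summarized in a short sketch. The callables `extract_info`, `recognize`, and `send` are hypothetical placeholders for the preprocessing, on-device recognition, and transmission steps; only the ordering of the steps and the two outcomes are taken from the description.

```python
def run_device_cycle(frames, extract_info, recognize, wakeup_words, send):
    """Run one pass through steps S203 to S219 for a buffered utterance.

    Returns True if a wake-up word was found and the data was transmitted.
    """
    stored = list(frames)            # S203: store the framed voice signal
    info = extract_info(stored)      # S205, S207: preprocess and build the information
    words = recognize(stored)        # S209: recognition on the initial section
    if any(word in wakeup_words for word in words):   # S211
        for frame in stored:
            send("audio", frame)     # stream, starting with the stored frames
        send("info", info)           # S213: transmit the voice signal information
        return True
    # S219: no wake-up word, so discard the stored frames; the extracted
    # information is dropped as well when this function returns.
    stored.clear()
    return False
```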

The main configuration and operation of the sound output apparatus (200) according to an embodiment of the present invention have been described above.

The main configuration and operation of the service server (300) according to an embodiment of the present invention are described below.

FIG. 5 is a block diagram showing the main configuration of a service server according to an embodiment of the present invention, and FIG. 6 is a flowchart illustrating a method of providing a voice recognition service using voice signal information in the service server according to an embodiment of the present invention.

Referring first to FIG. 5, the service server (300) according to an embodiment of the present invention may include a server communication unit (310), a server storage unit (320), a server control unit (330), a conversion unit (340), and a content server interworking unit (350).

Describing each component in more detail, the server communication unit (310) exchanges information with the terminal (100) and the sound output apparatus (200). In particular, the server communication unit (310) may receive from the terminal (100), via the communication network (500), various user setting information related to control of the sound output apparatus (200) and to the voice recognition service of the present invention. It may also receive identification information of the sound output apparatus (200) and subscriber information of the terminal (100) from the terminal (100). In addition, the server communication unit (310) may receive the voice signal uttered by the user from the sound output apparatus (200), and may transmit to the sound output apparatus (200) the response message and link information that the server control unit (330) generates in response to that voice signal.

The server storage unit (320) stores and manages various information related to the operation of the service server (300). In particular, the server storage unit (320) may store and manage the information needed for voice recognition and semantic analysis. For example, it may store and manage acoustic models for voice recognition, and it may store and manage information on entities and intents for semantic analysis.

Describing entities and intents in more detail, an entity may mean a predefined proper noun, such as the name of a specific service the user wants. For example, it may be a proper noun that is the target of a requested operation, such as weather, latest songs, pop songs, time, appointment, timer, air purifier, or mobile phone. The server storage unit (320) may also store and manage information on the party that is able to handle each entity.

<Table 1>
Service provider | Entity
Content providing server A | weather (날씨), temperature (온도), humidity (습도)
Content providing server B | latest tracks (최신곡), pop songs (팝송), K-pop (케이팝), pop (팝), latest songs (최신노래)
Service server | schedule (일정), appointment (약속), timer (타이머)

That is, as shown in <Table 1> above, the server storage unit (320) may store and manage information on the service provider that is able to handle each specific entity.

The server storage unit (320) may also store and manage word sequences corresponding to intents. An intent is a word sequence that carries the user's intention, for example 'play it' (틀어줘), 'what time is it' (몇시야), 'how is it' (어때), 'let me hear it' (들려줘), 'turn it on' (켜줘), or 'tell me' (알려줘), that is, a word sequence with which the user expresses what he or she wants done with an entity.

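<Table 2>
Intention | Intent
Play (재생) | 틀어줘 (play it), 들을까 (shall we listen), 듣고싶어 (I want to listen), 틀어봐 (try playing it), 시작 (start)
Stop (중지) | 그만 (enough), 안들어 (not listening), 중지 (stop), 싫어 (I don't want it)
Query (질의) | 뭐야 (what is it), 어때 (how is it), 확인해봐 (check it), 찾아봐 (look it up)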

That is, as shown in <Table 2>, the server storage unit (320) may store and manage, as intents, word sequences that can express the user's intention.

In addition, the server storage unit (320) may store and manage various information related to providing the voice recognition service according to an embodiment of the present invention.

The server control unit (330) manages the overall operation of the service server (300) according to an embodiment of the present invention, and may include a preprocessing module (331), a voice recognition module (332), a semantic analysis module (333), and a service processing module (334).

The preprocessing module (331) performs the preprocessing step that precedes voice recognition on the voice signal delivered from the sound output apparatus (200) through the server communication unit (310), for example removing noise from the voice signal. In particular, the preprocessing module (331) performs preprocessing using the voice signal information delivered from the sound output apparatus (200). That is, rather than running a separate process or computation to prepare for preprocessing, the preprocessing module (331) uses the direction and noise information in the voice signal information to construct a filter for a specific frequency band and removes noise with that filter. It can also use the channel information in the voice signal information to compensate for channel variation in the voice signal.

The voice recognition module (332) then recognizes the voice signal delivered from the preprocessing module (331). For example, the voice recognition module (332) extracts voice feature vectors from the received voice signal and identifies the phonemes corresponding to the extracted voice feature vectors. Here, a phoneme is the smallest unit that distinguishes meaning (e.g., ㄱ, ㄹ, ㅓ, ㅕ, ㅁ), and phoneme recognition converts the voice feature vectors into a set of such units. The voice recognition module (332) can then use the set of recognized phonemes to construct a sentence made up of meaningful word sequences. The voice recognition module (332) may also use the voice signal information delivered from the sound output apparatus (200) to select a suitable voice recognition model.

The semantic analysis module (333) may use the recognition result from the voice recognition module (332) to determine whether a wake-up word is included. It also analyzes each word sequence in the recognized sentence and determines whether that word sequence is an entity or an intent. For example, suppose the user utters the sentence "A-ya, how is the weather right now" ("에이야, 지금 날씨 어때") near the sound output apparatus (200). The word sequences 'right now' (지금) and 'how is' (어때) are then analyzed as intents, and the word sequence corresponding to the weather is analyzed as an entity. The semantic analysis module (333) refers to <Table 1> and <Table 2> to identify more precisely what the user intends for the entity, and passes the result to the service processing module (334).

The service processing module (334) first identifies the service provider corresponding to the entity. In the example above, it can identify content providing server A (400a) as the service provider for the 'weather' entity. The service processing module (334) then determines the user's intention from the intents 'right now' and 'how is' and may query content providing server A (400a) for the weather at the current time. Here, the service processing module (334) may check the identification information of the sound output apparatus (200) and look up the service subscription information corresponding to that identification information in the server storage unit (320). Based on that information, it can determine the current location of the sound output apparatus (200) and query the weather at that location.
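The table lookups described above might be sketched as plain dictionary searches. The data below is copied from <Table 1> and <Table 2>; the function name `analyze`, the result layout, and the rule that the first match wins are assumptions for illustration. Words that appear in neither table (such as 지금 in the example sentence) are simply ignored by this sketch.

```python
# Service providers and the entities they handle (from Table 1).
ENTITY_PROVIDERS = {
    "content providing server A": {"날씨", "온도", "습도"},
    "content providing server B": {"최신곡", "팝송", "케이팝", "팝", "최신노래"},
    "service server": {"일정", "약속", "타이머"},
}

# Intentions and the intent words that express them (from Table 2).
INTENT_WORDS = {
    "play": {"틀어줘", "들을까", "듣고싶어", "틀어봐", "시작"},
    "stop": {"그만", "안들어", "중지", "싫어"},
    "query": {"뭐야", "어때", "확인해봐", "찾아봐"},
}


def analyze(words):
    """Classify recognized words as entity or intent and pick a provider."""
    result = {"entity": None, "provider": None, "intention": None}
    for word in words:
        for provider, entities in ENTITY_PROVIDERS.items():
            if word in entities and result["entity"] is None:
                result["entity"], result["provider"] = word, provider
        for intention, intents in INTENT_WORDS.items():
            if word in intents and result["intention"] is None:
                result["intention"] = intention
    return result
```

For the weather example, `analyze(["지금", "날씨", "어때"])` would route the '날씨' entity to content providing server A with a query intention.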

When the corresponding content is subsequently delivered from content providing server A (400a) through the content server interworking unit (350), the service processing module (334) composes a response message that includes it. That is, if the content 'rain' is delivered through the content server interworking unit (350), the service processing module (334) can combine the user's current location and time information with the response information from content providing server A (400a) to compose a sentence-form response message such as 'Guro, Seoul: as of 2 o'clock, it is raining'.

The service processing module (334) then converts the response message into a voice signal using the TTS engine (341) of the conversion unit (340). When the converted response message is delivered from the conversion unit (340), the service processing module (334) may transmit it to the sound output apparatus (200) through the server communication unit (310).

In addition, if link information for accessing content has been received from the content providing server (400) through the content server interworking unit (350), the service processing module (334) may transmit the received link information, together with the response message, to the sound output apparatus (200) through the server communication unit (310).

The method of providing a voice recognition service using voice signal information in the service server (300) is described again below with reference to a flowchart.

FIG. 6 is a flowchart illustrating the main operation of the service server (300) according to an embodiment of the present invention. The service server (300) receives the voice signal information together with the voice signal from the sound output apparatus (200) (S301).

Thereafter, the service server (300) uses the received voice signal information to construct a filter for noise removal (S303). Here, the service server (300) uses the direction and noise information in the voice signal information to construct a filter for a specific frequency band, and may use that filter to remove noise from the voice signal (S305).

In addition, the service server (300) can use the channel information in the voice signal information transmitted from the sound output apparatus (200) to perform preprocessing that compensates for channel variation in the voice signal.
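The idea that the server reuses device-supplied information instead of re-estimating it can be illustrated with a deliberately simple sketch. It does not implement the frequency-band filter of steps S303 and S305: a per-channel gain correction stands in for channel compensation and an amplitude noise gate stands in for noise removal. The parameters `channel_gains` and `noise_floor` are hypothetical fields assumed to arrive in the voice signal information.

```python
def compensate_channels(channels, channel_gains):
    """Undo each channel's reported gain, then average the channels.

    `channel_gains` is an assumed field of the voice signal information.
    """
    if len(channels) != len(channel_gains):
        raise ValueError("one gain per channel is required")
    length = min(len(c) for c in channels)
    mixed = []
    for i in range(length):
        total = sum(c[i] / g for c, g in zip(channels, channel_gains))
        mixed.append(total / len(channels))
    return mixed


def noise_gate(samples, noise_floor):
    """Zero out samples whose magnitude does not exceed the reported noise floor."""
    return [s if abs(s) > noise_floor else 0.0 for s in samples]
```

Both functions only consume values computed on the device, which reflects the point that the server does not need to derive them again.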

Thereafter, the service server (300) recognizes the voice signal (S307). Here, the service server (300) extracts voice feature vectors from the received voice signal and identifies the phonemes corresponding to the extracted voice feature vectors. A phoneme is the smallest unit that distinguishes meaning (e.g., ㄱ, ㄹ, ㅓ, ㅕ, ㅁ), and phoneme recognition converts the signal into a set of such units. The service server (300) then uses the set of recognized phonemes to construct a sentence made up of meaningful word sequences, analyzes the meaning of the word sequences (S309), and determines whether each word sequence is an entity or an intent (S311).

Here, an entity may mean a predefined proper noun, such as the name of a specific service the user wants. For example, agreed-upon proper nouns such as weather, latest songs, pop songs, time, and appointment can be extracted as entities. An intent, by contrast, is a word sequence that carries the user's intention, for example 'play it' (틀어줘), 'what time is it' (몇시야), or 'how is it' (어때), that is, a word sequence other than the entity that expresses what the user wants.

Taking a sentence as an example, if the user utters "play the latest music" ("최신음악 틀어줘"), then 'latest music' (최신음악) is the entity and 'play' (틀어줘) is the intent.

The service server (300) can identify the service provider for the 'latest music' entity, that is, the content providing server (400) that serves the latest music (S313). It then queries that content providing server for the service corresponding to the intent (S315). In the example above, 'play' (틀어줘) is the intent; the service server (300) looks up the user intention corresponding to it in the intent table information and finds that the intention is 'play' (재생). The service server (300) then sends the content providing server a request for a playback service for 'latest music'.

The service server (300) may then receive the corresponding response from the content providing server (400) (S317). Thereafter, the service server (300) composes a response message to be transmitted to the sound output apparatus (200) (S319) and converts it into a voice signal (S321). The response message answers the voice signal input by the user; for example, the service server may generate the response message "I'll play the latest music for you" ("최신음악 들려드릴께요").

The service server (300) then transmits the converted response message to the sound output apparatus (200) (S323). In addition, when the service server (300) receives link information as the response information from the content providing server, it may transmit the link information to the sound output apparatus (200). If there are multiple pieces of link information, it may generate a playlist for them and transmit the link information sequentially according to the playlist.
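The playlist handling for multiple links might look like the following sketch. The playlist layout (a list of position and link pairs) and the names `build_playlist` and `send_sequentially` are assumptions; the description only states that a playlist is generated and that links are sent in playlist order.

```python
def build_playlist(links):
    """Number the received links in the order they arrived."""
    return [{"position": i, "link": link} for i, link in enumerate(links, start=1)]


def send_sequentially(playlist, send):
    """Send each link to the sound output apparatus in playlist order.

    `send` is a hypothetical callable standing in for the server communication unit.
    Returns the number of links sent.
    """
    for item in sorted(playlist, key=lambda entry: entry["position"]):
        send(item["link"])
    return len(playlist)
```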

The method for providing a voice recognition service using voice signal information according to an embodiment of the present invention has been described above.

The method for providing a voice recognition service using voice signal information of the present invention, as described above, may also be provided in the form of a computer-readable medium suitable for storing computer program instructions and data.

In particular, the computer program of the present invention can execute steps including: generating a voice signal from a user's voice uttered into at least one multi-channel microphone; extracting voice signal information, including direction, channel, and noise information, from the voice signal of a predetermined interval used to perform the wake-up process on the generated voice signal; and, when voice recognition on the voice signal of the predetermined interval finds that a wake-up word is included, sequentially streaming the generated voice signal to the service server while also transmitting the extracted voice signal information to the service server.
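
The device-side steps can be sketched in Python under some loud assumptions. The SNR here is a crude energy ratio between the loudest and quietest stored frames, and the direction is approximated by the index of the loudest microphone channel; neither is presented as the estimator used by the invention. All function names and the dictionary layout are hypothetical. Placing the voice signal information in a header follows the option recited in the claims.

```python
import math
from typing import Dict, List, Optional, Sequence


def split_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Divide a signal into fixed-length frames, dropping a trailing partial frame."""
    count = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(count)]


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def estimate_snr_db(frames: Sequence[Sequence[float]]) -> float:
    """Assumed estimator: loudest frame energy over quietest frame energy, in dB."""
    if not frames:
        raise ValueError("at least one frame is required")
    energies = [frame_energy(f) for f in frames]
    noise = max(min(energies), 1e-12)  # guard against division by zero
    return 10.0 * math.log10(max(energies) / noise)


def loudest_channel(channel_energies: Sequence[float]) -> int:
    """Assumed stand-in for direction: index of the channel with the most energy."""
    return max(range(len(channel_energies)), key=channel_energies.__getitem__)


def build_stream(frames: Sequence[Sequence[float]], wake_word_detected: bool,
                 info: Dict[str, float]) -> Optional[Dict[str, object]]:
    """Return the payload to stream, or None when no wake-up word was found."""
    if not wake_word_detected:
        return None
    # The voice signal information travels in the header of the streamed signal.
    return {"header": dict(info), "frames": [list(f) for f in frames]}


frames = split_frames([0.1] * 4 + [1.0] * 4 + [0.5], frame_len=4)
info = {"direction": loudest_channel([0.2, 0.9, 0.4]),
        "channels": 3,
        "snr_db": estimate_snr_db(frames)}
payload = build_stream(frames, wake_word_detected=True, info=info)
```

Wake-up word detection itself is outside this sketch and is passed in as a boolean.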

Computer-readable media suitable for storing computer program instructions and data include, for example, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and semiconductor memory such as ROM (Read Only Memory), RAM (Random Access Memory), flash memory, EPROM (Erasable Programmable ROM), and EEPROM (Electrically Erasable Programmable ROM). The processor and memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

The computer-readable recording medium may also be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs for implementing the present invention, together with the related code and code segments, may be readily inferred or modified by programmers in the technical field to which the present invention belongs, taking into account the system environment of the computer that reads the recording medium and executes the program.

In addition, a computer program recorded on a computer-readable recording medium as described above includes instructions that perform the functions described above, and can carry out those functions by being distributed and circulated through the recording medium and then read, installed, and executed on a specific device or a specific computer.

Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of a particular invention. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, either separately or in any suitable subcombination. Moreover, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excluded from that combination, and the claimed combination may be changed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, in order to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Likewise, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments; the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

The present invention relates to a method for providing a voice recognition service, and more particularly to a service server capable of providing a voice recognition service in cooperation with a sound output apparatus that includes a microphone and a speaker. The sound output apparatus generates voice signal information for preprocessing based on the initial interval of the input voice signal, and the service server can use that voice signal information to perform voice recognition more accurately. The present invention therefore has ample industrial applicability in various industries, including voice recognition, artificial intelligence, and multimedia content.

100: terminal 200: sound output apparatus
210: voice signal input unit 220: device control unit
230: device communication unit 240: output unit
300: service server 310: server communication unit
320: server storage unit 330: server control unit
340: conversion unit 350: content server interworking unit
400: content providing server 500: communication network

Claims

A method for providing a voice recognition service using voice signal information, the method comprising, by a sound output apparatus:
generating a voice signal by receiving a user's voice uttered through at least one multi-channel microphone;
extracting voice signal information including direction, channel, and noise information from the voice signal of a predetermined interval used to perform a wake-up process on the generated voice signal; and
when voice recognition on the voice signal of the predetermined interval finds that a wake-up word is included, sequentially streaming the generated voice signal to a service server while transmitting the extracted voice signal information to the service server.

The method according to claim 1,
wherein the generating of the voice signal comprises:
generating a voice signal from the voice input through the at least one multi-channel microphone; and
dividing the generated voice signal into frames of a predetermined length and storing the frames of a predetermined interval in order to perform the wake-up process.

The method according to claim 2,
wherein the extracting of the voice signal information comprises:
extracting direction information using at least one of an arrival time, an arrival distance, and a voice intensity of the voice input through the at least one multi-channel microphone;
identifying and extracting channel information of the multi-channel microphone; and
extracting a speech feature vector for each of the stored frames and extracting a signal-to-noise ratio (SNR) using the extracted speech feature vectors.

The method according to claim 2,
wherein the transmitting to the service server
comprises streaming the voice signal to the service server starting from the voice signal corresponding to the stored frames, or streaming to the service server the voice signal generated after the stored frames.

The method according to claim 1,
wherein the transmitting to the service server
comprises including the extracted voice signal information in header information of the streamed voice signal and transmitting it to the service server.

A method for providing a voice recognition service using voice signal information, the method comprising, by a service server:
receiving, together with a voice signal streamed from a sound output apparatus, voice signal information extracted from the voice signal of a predetermined interval used to perform a wake-up process;
performing preprocessing for noise removal and variation compensation on the streamed voice signal using the voice signal information; and
performing voice recognition on the preprocessed voice signal to analyze its meaning, and delivering the corresponding voice recognition service to the sound output apparatus according to the analysis result.

The method according to claim 6,
wherein the voice signal information
includes direction, channel, and noise information extracted from the voice signal of the predetermined interval.

The method according to claim 7,
wherein the performing of the preprocessing
comprises constructing a filter for a specific frequency band for removing noise from the voice signal using the direction information and the noise information, removing the noise using the constructed filter, and compensating for channel variation of the voice signal using the channel information.

A computer-readable recording medium storing a program for executing the method for providing a voice recognition service using voice signal information according to any one of claims 1 to 8.

A sound output apparatus comprising:
a voice signal input unit that receives a user's voice uttered through at least one multi-channel microphone and generates a voice signal; and
a device control unit that extracts voice signal information including direction, channel, and noise information from the voice signal of a predetermined interval used to perform a wake-up process on the voice signal delivered through the voice signal input unit, and that, when voice recognition on the voice signal of the predetermined interval finds that a wake-up word is included, controls the generated voice signal to be sequentially streamed to a service server and the extracted voice signal information to be transmitted to the service server.

A service server comprising:
a server communication unit that receives a voice signal streamed from a sound output apparatus and receives, from the sound output apparatus, voice signal information extracted from the voice signal of a predetermined interval used to perform a wake-up process; and
a server control unit that performs preprocessing for noise removal and variation compensation on the streamed voice signal using the voice signal information received through the server communication unit, performs voice recognition on the preprocessed voice signal to analyze its meaning, and controls the corresponding voice recognition service to be delivered to the sound output apparatus according to the analysis result.

The service server according to claim 11,
wherein the server control unit
constructs a filter for a specific frequency band for removing noise from the voice signal using direction information and noise information included in the voice signal information, removes the noise using the constructed filter,
and performs preprocessing that compensates for channel variation of the voice signal using channel information included in the voice signal information.