KR102342715B1 - System and method for providing supplementary service based on speech recognition

Info

Publication number: KR102342715B1
Application number: KR1020190110522A
Authority: KR
Inventors: 김선희; 김성일; 신광수; 박지혜
Original assignee: 주식회사 엘지유플러스
Priority date: 2019-09-06
Filing date: 2019-09-06
Publication date: 2021-12-23
Also published as: KR20210029383A

Abstract

A system for providing a supplementary service based on speech recognition according to an embodiment includes: a client that receives a voice input resulting from a user utterance; and a speech recognition platform providing apparatus that operates in conjunction with the client to derive a user utterance intent corresponding to the voice input and generates a plurality of response data based on the user utterance intent, wherein the client may synchronize the plurality of response data based on a response waiting time and output them externally.

Description

System and method for providing supplementary service based on speech recognition

The present invention relates to a system and method for providing a supplementary service based on speech recognition.

As technology has advanced, a variety of services applying speech recognition technology have recently been introduced in many fields. Speech recognition technology can be described as a series of processes that understand speech uttered by a person and convert it into text information that a computer can handle. A typical speech recognition service may include a series of processes that recognize the user's voice and provide a suitable corresponding service through an auditory interface.

However, because a typical speech recognition service is not visible to the user, it is limited in that the resolution of the information it conveys is not high. For example, when telling a user where Seoul Station is on a map, a visual interface can pinpoint the location at once, whereas an auditory interface must give location information such as an address and, if that is not enough, follow up with a supplementary description of what is nearby, so conveying the information takes considerable effort.

This limitation of speech recognition technology dilutes its advantage of making multitasking easy with little effort, and acts as a factor that holds back usage of speech recognition services.

Embodiments are intended to provide a system and method for providing a supplementary service based on speech recognition that can improve the user experience by presenting a speech recognition platform in which a visual interface is grafted onto an auditory interface.

The technical problems to be solved by the embodiments are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

One embodiment may provide a system for providing a supplementary service based on speech recognition, the system including: a client that receives a voice input resulting from a user utterance; and a speech recognition platform providing apparatus that operates in conjunction with the client to derive a user utterance intent corresponding to the voice input and generates a plurality of response data based on the user utterance intent, wherein the client synchronizes the plurality of response data based on a response waiting time and outputs them externally.

The plurality of response data may include: first response data including voice data in auditory form; and second response data including content data in visual form.

The speech recognition platform providing apparatus may include: a first server that converts the voice input into query text and derives the user utterance intent using at least one of natural language processing and named entity recognition; a second server that generates response text by searching for the user utterance intent in a pre-built database; and a third server that establishes a session connection with the client according to a prescribed protocol.

The second server may transmit the response text to the first server, and the first server may generate the first response data by converting the response text into a voice signal.

The second server may generate the second response data by extracting keywords from the response text, and may transmit the second response data to the third server.

The client may receive the first response data from the first server, and may receive the second response data from the third server while the session connection is maintained.

The response waiting time may be an output delay time between a first time point at which the voice input is received and a second time point at which both the first and second response data have been received.

Before the response waiting time elapses, the client may hold the output of either one of the first and second response data in a standby state.

The client may map the voice input and the first and second response data to one another and store them.
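By way of a non-limiting illustration, the mapped storage described above could be sketched as follows. The names `StoredExchange` and `ExchangeMemory`, the integer identifier, and the choice of `bytes` and `dict` as payload types are assumptions made for this sketch; the specification does not prescribe a storage layout.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StoredExchange:
    """One voice input together with the two responses mapped to it."""
    voice_input: bytes       # voice input resulting from the user utterance
    first_response: bytes    # first response data (auditory voice data)
    second_response: dict    # second response data (visual content data)


class ExchangeMemory:
    """Hypothetical in-memory store keyed by an auto-incrementing identifier."""

    def __init__(self) -> None:
        self._records: Dict[int, StoredExchange] = {}
        self._next_id = 0

    def store(self, voice_input: bytes, first_response: bytes,
              second_response: dict) -> int:
        """Save the three items as one mapped record and return its identifier."""
        exchange_id = self._next_id
        self._records[exchange_id] = StoredExchange(
            voice_input, first_response, second_response
        )
        self._next_id += 1
        return exchange_id

    def lookup(self, exchange_id: int) -> Optional[StoredExchange]:
        """Return the mapped record, or None if the identifier is unknown."""
        return self._records.get(exchange_id)
```

Keeping the three items in a single record means that a later lookup of the voice input returns both responses together.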

The protocol may include HTTP/2 (Hypertext Transfer Protocol Version 2).

According to at least one embodiment of the present invention, a speech recognition platform capable of conveying visual information together with auditory information is presented, so that the effect of multisensory integration can be provided to the user.

In addition, because a relay server is built between the user terminal and the speech recognition platform so that visual information can be transmitted indirectly, the stability of the system can be improved. Furthermore, by controlling the user terminal so that the output times of the auditory information and the visual information coincide, the sense of incongruity felt by the user can be reduced.

The effects obtainable from the present embodiment are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is an exemplary diagram illustrating an operating environment of a system for providing a supplementary service based on speech recognition according to an embodiment of the present invention.
FIG. 2 is a schematic block diagram of the supplementary service providing system shown in FIG. 1.
FIG. 3 is a diagram for explaining an example of response text and additional information generated by a speech recognition platform providing apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart for explaining a method for providing a supplementary service based on speech recognition according to an embodiment of the present invention.
FIG. 5 is a diagram showing a response to a user utterance implemented through a speech recognition application program installed on a client according to an embodiment of the present invention.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Because the embodiments may be modified in various ways and may take various forms, specific embodiments are illustrated in the drawings and described in detail in the text. However, this is not intended to limit the embodiments to a particular disclosed form, and the embodiments should be understood to include all modifications, equivalents, and substitutes falling within their spirit and technical scope.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used to distinguish one component from another. In addition, terms specifically defined in consideration of the configuration and operation of the embodiments serve only to describe the embodiments and do not limit their scope.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, may have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries may be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in the present application.

Hereinafter, a system for providing a supplementary service based on speech recognition according to an embodiment will be described with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram illustrating an operating environment of a system for providing a supplementary service based on speech recognition according to an embodiment of the present invention.

Referring to FIG. 1, the system 1 for providing a supplementary service based on speech recognition may include a client 10 and a speech recognition platform providing apparatus 20 that operate in conjunction with each other through a network 30.

The client 10 may receive a user utterance containing a specific command or query. For example, at the user's request the client 10 may execute an application program that supports operation of the speech recognition service, and may receive a voice input resulting from the user utterance (hereinafter referred to as 'user speech' for convenience) through an input device activated by execution of the program.

The client 10 includes various devices belonging to the Internet of Things (IoT), such as smartphones, wearable terminals, artificial intelligence speakers, robot vacuum cleaners, set-top boxes, TVs, and refrigerators, and these devices may be subscribed to and/or registered with, in advance, a mobile communication service operated by a network operator. However, this is merely an example, and the scope of the present invention is not necessarily limited thereto.

The speech recognition platform providing apparatus 20 may perform a series of processes for providing a supplementary service to the user based on speech recognition. For example, the speech recognition platform providing apparatus 20 may recognize the user speech received from the client 10 to derive the user's utterance intent, and may support a response to the user utterance by transmitting at least one information resource corresponding to the utterance intent to an output device mounted on the client 10.

The network 30 may serve to connect the client 10 and the speech recognition platform providing apparatus 20. The network 30 may include wired communication networks such as local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), and integrated services digital networks (ISDNs), and/or wireless communication networks supporting mobile communication standards such as Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), Long Term Evolution-Advanced (LTE-A), and 5G (fifth generation). However, this is merely an example, and the scope of the present invention is not necessarily limited thereto.

FIG. 2 is a schematic block diagram of the supplementary service providing system shown in FIG. 1.

Referring to FIG. 2, the speech recognition platform providing apparatus 20 operating in conjunction with the client 10 may include an intelligent speech recognition server 210 that supports the speech recognition service, a service providing server 230 that searches for at least one information resource related to the user's utterance intent, and a relay server 250 that establishes a session connection with the client 10. Below, each component of the client 10 is described first, before the speech recognition platform providing apparatus 20 is described.

The client 10 may include a microphone 110, a communication unit 120, a speaker 130, a display 140, a memory 150, and a controller 160. However, this is merely an example, and the client 10 may omit at least one of the above components or may additionally include other components.

The microphone 110 may receive a voice input resulting from a user utterance. The microphone 110 may be driven upon execution of the speech recognition application program, or may be controlled to remain driven at all times. Here, the speech recognition application program may be executed by a user operation or by a designated spoken wake-up command.

The communication unit 120 may support communication between the client 10 and external devices. For example, the communication unit 120 may establish wired or wireless communication with the intelligent speech recognition server 210 and/or the relay server 250 according to a prescribed protocol. By connecting to the network 30 over the wired or wireless communication, the communication unit 120 may exchange with the intelligent speech recognition server 210 and/or the relay server 250 at least one information resource involved in operating the speech recognition service.

The speaker 130 may output a voice signal generated inside the client 10 or received from the intelligent speech recognition server 210. For example, the speaker 130 may receive the response speech in auditory form from the intelligent speech recognition server 210 and output it externally under the control of the controller 160.

The display 140 may output various kinds of additional information under the control of the controller 160. The display 140 may output, in at least one region, additional information in visual form received from the relay server 250, including text, graphics, images, video, or a combination thereof, and the additional information may be understood as at least part of the response to the user utterance. As an example, the display 140 may form a mutual layer structure with a touch pad that receives the user's touch input so as to be configured as a touchscreen, and may output an interface related to the system settings of the client 10.

The memory 150 may store at least one information resource involved in operating the speech recognition service, or may store instructions related to the functional operation of the components of the client 10. Here, the at least one information resource involved in operating the speech recognition service may include at least one of the user speech applied to the microphone 110, the response speech received from the intelligent speech recognition server 210, and the additional information received from the relay server 250. The memory 150 may also store at least one application program related to the operation of the client 10, including the speech recognition application program described above.

The controller 160 may be electrically connected to at least one component of the client 10 described above and may perform control, computation, or data processing for that component. For example, the controller 160 may control the speaker 130 to output the response speech, and may control the display 140 to output the additional information. Alternatively, the controller 160 may store at least one information resource involved in operating the speech recognition service in the memory 150, or may load it from the memory 150 and process it.

In this regard, in connection with providing a response to the user utterance, the controller 160 may synchronize and output the response speech and the additional information based on a response waiting time (latency time). For example, the controller 160 may place the client 10 in a standby state at the time either one of the response speech and the additional information is received, and may synchronize the response speech and the additional information at the time the other is received. Here, the standby state may be understood as waiting to output, or withholding output of, whichever data the client 10 has received during the response waiting time. The response waiting time may be defined as the output delay time between the time the user speech is received and the time both the response speech and the additional information have been received.
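As a non-limiting sketch of the standby and synchronization behavior described above, the controller logic might be expressed as follows. The class name `ResponseSynchronizer`, its method names, and the injectable `clock` parameter are assumptions introduced for illustration only; the embodiment does not specify an implementation.

```python
import time
from typing import Callable, Optional, Tuple


class ResponseSynchronizer:
    """Withholds whichever response arrives first until the other arrives."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock                     # injectable so the timing can be tested
        self._input_time: Optional[float] = None
        self._speech: Optional[bytes] = None    # response speech (auditory)
        self._info: Optional[dict] = None       # additional information (visual)

    def on_user_speech(self) -> None:
        """Record the time the user speech is received and clear old state."""
        self._input_time = self._clock()
        self._speech = None
        self._info = None

    def on_response_speech(self, audio: bytes) -> Optional[Tuple[bytes, dict, float]]:
        self._speech = audio
        return self._release_if_complete()

    def on_additional_info(self, info: dict) -> Optional[Tuple[bytes, dict, float]]:
        self._info = info
        return self._release_if_complete()

    def _release_if_complete(self) -> Optional[Tuple[bytes, dict, float]]:
        # Standby state: one item is still missing, so output is withheld.
        if self._speech is None or self._info is None:
            return None
        if self._input_time is None:
            raise RuntimeError("on_user_speech() must be called first")
        # Response waiting time: from receipt of the user speech to receipt of both items.
        waiting_time = self._clock() - self._input_time
        return self._speech, self._info, waiting_time
```

In this sketch the caller would pass the returned pair to the speaker and the display at the same moment, so that neither output leads the other.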

The speech recognition platform providing apparatus 20 may include an intelligent speech recognition server 210, a service providing server 230, and a relay server 250.

The intelligent speech recognition server 210 supports the speech recognition service and may include a speech recognition unit 211, an intention determination unit 212, and a speech synthesis unit 213.

The speech recognition unit 211 may recognize the user speech received from the client 10 and convert it into query text. The speech recognition unit 211 may perform this speech-to-text (STT) conversion using an acoustic model that includes at least one piece of information related to utterance or vocalization, or a language model that includes at least one piece of unit phoneme information.

The intention determination unit 212 may derive the user's utterance intent by performing natural language processing (NLP) and named entity recognition (NER) on the query text received from the speech recognition unit 211. Here, natural language processing (NLP) refers to natural language understanding in which the query text is divided into grammatical units such as phrases, parts of speech, and morphemes, and the linguistic features of each grammatical unit are analyzed to determine meaning. Named entity recognition (NER) refers to an information retrieval technique that extracts named entities contained in the query text, such as persons, places, times, objects, and organizations, and classifies the type of each extracted entity.

By applying an artificial intelligence (AI) algorithm, such as deep learning or machine learning, to learn the user's language habits, the intention determination unit 212 can increase the accuracy of speech recognition, and it may request a response corresponding to the utterance intent from the service providing server 230 linked with the intelligent speech recognition server 210. As an example, when the query text is [What entertainment shows are on today], the intention determination unit 212 may, through natural language processing and named entity recognition, derive [entertainment program information request] corresponding to the 'time' entity [today] as the user's utterance intent and pass it to the service providing server 230.
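For illustration only, the derivation of an utterance intent and a 'time' entity from query text can be sketched with simple keyword tables. This is a deliberately simplified stand-in for the natural language processing and named entity recognition described above, not a trained model; the tables `TIME_ENTITIES` and `INTENT_KEYWORDS` and the function `derive_intent` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical lookup tables; an actual system would rely on trained NLP/NER models.
TIME_ENTITIES = {"today": "time", "tomorrow": "time", "yesterday": "time"}
INTENT_KEYWORDS = {
    "entertainment": "entertainment program information request",
    "weather": "weather information request",
}


@dataclass
class UtteranceIntent:
    intent: Optional[str]                                     # None when nothing matched
    entities: Dict[str, str] = field(default_factory=dict)    # entity text -> entity type


def derive_intent(query_text: str) -> UtteranceIntent:
    """Tokenize the query text, tag known time entities, and pick the first matching intent."""
    tokens = query_text.lower().replace("?", " ").split()
    entities = {tok: TIME_ENTITIES[tok] for tok in tokens if tok in TIME_ENTITIES}
    intent = next((INTENT_KEYWORDS[tok] for tok in tokens if tok in INTENT_KEYWORDS), None)
    return UtteranceIntent(intent=intent, entities=entities)
```

Under these assumptions, a query such as "What entertainment is on today?" yields the intent [entertainment program information request] together with the 'time' entity [today].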

The speech synthesis unit 213 may convert the natural-language response text generated by the service providing server 230, described later, into a voice signal (text-to-speech, TTS) and transmit the resulting response speech to the client 10.

The service providing server 230 generates an appropriate response based on the user's utterance intent received from the intention determination unit 212, and may include a search unit 231 and a database 232.

The search unit 231 may obtain the response the user wants by searching for the user's utterance intent through the knowledge management built into the database 232. Here, the database 232 is a collection of the basic data essential for answering user queries and may be a knowledge-based database. In the database 232, metadata of related content is classified in advance and stored by category for each domain, such as music, movies, broadcasting, weather, news, sports, traffic, finance, and shopping, and the metadata may be provided by third-party providers on the basis of a subscription to or partnership with the speech recognition service. For example, for the 'broadcast' domain, the metadata may include information on the title, cast, plot, broadcast time, viewer ratings, broadcasting station, and so on of the related content.

Based on the user's utterance intent, the search unit 231 may extract a specific domain and the metadata of each content item from the database 232, and may generate appropriate response text in natural language form using text summarization and personalized recommendation techniques. For example, when the user's utterance intent is [entertainment program information request] for programs broadcast [today], the search unit 231 may output [Today's entertainment programs include 영화가 좋다, 아는 형님, 놀면 뭐하니, and 불후의 명곡] as the response text.
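As a rough, non-limiting sketch, generating response text from domain metadata could look like the following. The in-memory `DATABASE` dictionary, its field names (`title`, `genre`, `air_day`), the "Evening News" entry, and the simple string template are all assumptions for illustration; the text summarization and personalized recommendation techniques named above are not implemented here.

```python
from typing import Dict, List

# Hypothetical stand-in for the database 232: content metadata grouped by domain.
DATABASE: Dict[str, List[dict]] = {
    "broadcast": [
        {"title": "영화가 좋다", "genre": "entertainment", "air_day": "today"},
        {"title": "아는 형님", "genre": "entertainment", "air_day": "today"},
        {"title": "Evening News", "genre": "news", "air_day": "today"},
    ],
}


def build_response_text(domain: str, genre: str, air_day: str) -> str:
    """Filter the domain's metadata and phrase the matches as one sentence."""
    titles = [
        item["title"]
        for item in DATABASE.get(domain, [])
        if item["genre"] == genre and item["air_day"] == air_day
    ]
    if not titles:
        return f"No {genre} programs were found for {air_day}."
    return f"{air_day.capitalize()}'s {genre} programs include {', '.join(titles)}."
```

Calling `build_response_text("broadcast", "entertainment", "today")` returns a sentence listing only the two entertainment titles.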

In addition, the search unit 231 may extract keywords from the response text and may generate additional information corresponding to the user's utterance intent by using a search engine (not shown) provided inside and/or outside the service providing server 230. Here, the additional information may include content data in visual form including text, graphics, images, video, or a combination thereof.

An example of the response text and additional information described above is first explained below with reference to FIG. 3.

FIG. 3 is a diagram for explaining an example of response text and additional information generated by a speech recognition platform providing apparatus according to an embodiment of the present invention.

Referring to (a) of FIG. 3, when user speech (Q) of [Tell me today's weather] is delivered from the client 10, the intelligent speech recognition server 210 derives the user's utterance intent (e.g., 'request for today's weather information'), and the service providing server 230 may generate response text (A) and additional information (B) corresponding to it.

The search unit 231 may obtain the response text (A) [The current weather is sunny] by applying a text summarization technique to the metadata stored in the 'weather' domain of the database 232.

In addition, the search unit 231 may generate the additional information (B) by combining numerical information (text), such as today's temperature (including the low and high), humidity, and fine dust level, with an image depicting the weather condition (sunny, cloudy, rain, or snow). Both are obtained by querying the database 232 and/or a search engine (not shown) with the user's utterance intent.
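As a rough illustration of how the response text (A) and the additional information (B) might be held together for the weather example, a minimal Python sketch follows. The class names, field names, the icon file name, and all weather values are hypothetical placeholders chosen for this sketch; the specification does not define a data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AdditionalInfo:
    """Visual content (B): text fields plus references to images."""
    text: Dict[str, str]
    images: List[str] = field(default_factory=list)


@dataclass
class Response:
    intent: str                       # derived user utterance intent
    response_text: str                # (A), later converted to speech
    additional_info: AdditionalInfo   # (B), shown on the display


def build_weather_response(intent: str, weather: Dict[str, str]) -> Response:
    """Combine a short summary sentence with numeric fields and an icon."""
    condition = weather["condition"]
    response_text = f"The current weather is {condition}"
    info = AdditionalInfo(
        # Everything except the condition goes into the text portion of (B).
        text={k: v for k, v in weather.items() if k != "condition"},
        images=[f"{condition}.png"],  # hypothetical icon file name
    )
    return Response(intent, response_text, info)


# Illustrative values only; none of these figures come from the source.
example = build_weather_response(
    "request for today's weather information",
    {"condition": "sunny", "temperature": "18/27 C", "humidity": "40%"},
)
```

Keeping (A) and (B) in one record until they are dispatched makes it straightforward to send each part to a different server later.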

Referring to FIG. 3(b), when a user's uttered voice (Q) of [Tell me the latest Tottenham match result] is delivered from the client 10, the intelligent voice recognition server 210 may derive the user's utterance intent (e.g., 'request for the result of Tottenham's most recent match'), and the service providing server 230 may generate the corresponding response text (A) and additional information (B).

The search unit 231 may obtain the response text (A) [Tottenham lost 0-1 to Newcastle yesterday] by applying text summarization and personalized recommendation techniques to the metadata stored in the 'sports' domain of the database 232.

In addition, the search unit 231 may generate the additional information (B) by querying a search engine (not shown) with keywords extracted from the response text (A) (e.g., Tottenham, match result, yesterday), and combining the resulting information (text), such as the sports league, stadium, match date and time, and match score, with images of the two teams' emblems.

As shown in FIG. 3(a) and (b), when additional information (B) in visual form is provided together with the response to a user's utterance, the user can understand the information intuitively, which can improve the user experience (UX) of voice recognition services. For example, according to an embodiment, because visual information is delivered together with auditory information, a multisensory integration effect arises that goes beyond compensating for the limits of hearing and can give the user a faster and more accurate experience.

Returning to FIG. 2, the service providing server 230 described above may transmit the generated response text to the intelligent voice recognition server 210 and the generated additional information to the relay server 250.

The relay server 250 may relay the transmission of the additional information between the service providing server 230 and the client 10. To this end, the relay server 250 may establish communication with the client 10 according to a prescribed protocol and set up a session between the two, thereby building the infrastructure needed to support additional services based on voice recognition. HTTP/2 (Hyper Text Transfer Protocol Version 2) may be used as an example of the protocol, but the scope of the present invention is not necessarily limited thereto.

When the relay server 250 receives a session request message from the client 10, the session connection may be established by allocating session information (for example, a session ID) to the client 10 and transmitting a session response message. The client 10 and the relay server 250 may then maintain the session by periodically exchanging status check messages (for example, pings).
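The session handling described above (allocate a session ID on request, then keep the session alive with periodic status checks) can be sketched as follows. This is a simplified in-memory model, not the relay server's actual implementation: the class name, the message dictionary layout, and the 30-second idle timeout are all assumptions made for illustration, and no real HTTP/2 transport is involved.

```python
import time
import uuid
from typing import Dict, Optional


class RelaySessionTable:
    """Hypothetical session bookkeeping for a relay server."""

    def __init__(self, timeout_s: float = 30.0) -> None:
        self.timeout_s = timeout_s  # assumed idle limit, not from the source
        self._last_seen: Dict[str, float] = {}

    def _now(self, now: Optional[float]) -> float:
        # Allow an injected clock value so the logic is easy to test.
        return time.monotonic() if now is None else now

    def open_session(self, now: Optional[float] = None) -> Dict[str, str]:
        """Handle a session request: allocate an ID and build the reply."""
        session_id = uuid.uuid4().hex
        self._last_seen[session_id] = self._now(now)
        return {"type": "session_response", "session_id": session_id}

    def ping(self, session_id: str, now: Optional[float] = None) -> bool:
        """Record a status check message; False if the session is unknown."""
        if session_id not in self._last_seen:
            return False
        self._last_seen[session_id] = self._now(now)
        return True

    def is_alive(self, session_id: str, now: Optional[float] = None) -> bool:
        """A session counts as alive while pings arrive within the timeout."""
        last = self._last_seen.get(session_id)
        return last is not None and self._now(now) - last <= self.timeout_s
```

In this model the relay server would forward additional information only to sessions for which `is_alive` returns true.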

While the session with the client 10 is maintained, the relay server 250 may receive the additional information from the service providing server 230 and transmit it to the client 10. Accordingly, even if the specification of the client 10 does not conform to the communication standard supported by the service providing server 230, the user can still receive the additional information indirectly through the relay server 250.

The relay server 250 may also serve to broaden the range of data that the voice recognition platform providing apparatus 20 can provide to the client 10. That is, the voice recognition platform providing apparatus 20 according to an embodiment sets up a separate relay server 250 in addition to the intelligent voice recognition server 210 and assigns the response uttered voice and the additional information, which differ in data type, format, or transmission standard, to two different senders. As a result, it can provide the user with various kinds or forms of data (for example, text, graphics, images, video, audio, or a combination thereof) without depending on the data transmission standard supported by the intelligent voice recognition server 210.

For example, the voice recognition platform providing apparatus 20 may assign the intelligent voice recognition server 210 as the sender of the response uttered voice, which consists of voice data in auditory form, and the relay server 250 as the sender of the additional information, which consists of content data in visual form. The sender is split according to data type because the data transmission standards supported by the intelligent voice recognition server 210 are extremely limited. To elaborate, even if there is additional information to provide to the user when the voice recognition service runs, the user cannot receive it if the intelligent voice recognition server 210 does not support a standard for delivering it to the client 10. For this reason, by setting up the relay server 250, an embodiment can provide the various kinds of data to be presented on the client 10 regardless of the standards of the intelligent voice recognition server 210.
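The two-way assignment of senders can be pictured as a small dispatch rule. The sketch below is only an illustration of that idea: the `Payload` type, the role labels `"response_voice"` and `"additional_info"`, and the enum are hypothetical names, with the enum values borrowing the reference numerals 210 and 250 purely for readability.

```python
from dataclasses import dataclass
from enum import Enum


class Sender(Enum):
    VOICE_SERVER = 210  # intelligent voice recognition server
    RELAY_SERVER = 250  # relay server


@dataclass(frozen=True)
class Payload:
    role: str        # "response_voice" or "additional_info" (assumed labels)
    media_type: str  # e.g. "audio", "text", "image", "video"


def assign_sender(payload: Payload) -> Sender:
    """Pick the sending server from the payload's role, not its media type."""
    if payload.role == "response_voice":
        return Sender.VOICE_SERVER
    if payload.role == "additional_info":
        return Sender.RELAY_SERVER
    raise ValueError(f"unknown payload role: {payload.role!r}")
```

Routing on the role rather than on the media type means new kinds of additional information can be added without touching the voice path.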

Hereinafter, a method for providing additional services based on voice recognition is described with reference to FIG. 4.

FIG. 4 is a flowchart illustrating a method for providing additional services based on voice recognition according to an embodiment of the present invention.

First, the relay server 250 may establish a session connection with the client 10 (S1). In step S1, the relay server 250 establishes communication with the client 10 according to a prescribed protocol and, upon receiving a session request message from the client 10, establishes the session by transmitting a session reply message together with the session ID allocated to the client 10. Once the session is established, the relay server 250 may keep it alive by periodically exchanging status check messages with the client 10.

Thereafter, the client 10 may receive a user utterance containing a specific command or query (S2). In step S2, the client 10 is activated when an application program supporting the voice recognition service is executed, and the program may be launched by user operation or by a designated wake-up utterance.

The client 10 may then transmit the voice input resulting from the user's utterance (hereinafter, the 'user's uttered voice') to the intelligent voice recognition server 210 (S3).

The intelligent voice recognition server 210 may recognize the user's uttered voice received from the client 10 and convert it into a query text (S4), and may derive the user's utterance intent by performing natural language processing and named entity recognition on the query text (S5). In steps S4 and S5, the intelligent voice recognition server 210 may learn the user's language habits using an artificial intelligence (AI) algorithm in order to improve the accuracy of voice recognition and intent derivation.

Thereafter, the intelligent voice recognition server 210 may request from the service providing server 230 a response corresponding to the user's utterance intent (S6).

The service providing server 230 may obtain the response the user wants by searching for the user's utterance intent through the knowledge management built on the database (S7). For example, based on the user's utterance intent, the service providing server 230 may extract metadata of related content belonging to a specific domain in the database (for example, at least one of music, movies, broadcasting, weather, news, sports, traffic, finance, and shopping) and generate an appropriate natural-language response text using text summarization and personalized recommendation techniques. The service providing server 230 may also extract keywords from the response text and use an internal and/or external search engine to generate additional information corresponding to the user's utterance intent. Here, the additional information may include visual content data comprising text, graphics, images, video, or a combination thereof, so that the user can understand the response intuitively.

Thereafter, the service providing server 230 may transmit the response text, which is one part of the response generated in step S7, to the intelligent voice recognition server 210 (S8).

The service providing server 230 may also transmit the additional information, which is the remaining part of the response generated in step S7, to the relay server 250 (S9).

After step S8, the intelligent voice recognition server 210 may convert the natural-language response text into a voice signal (text-to-speech, TTS) (S10) and transmit the response uttered voice, in auditory form, corresponding to the converted voice signal to the client 10 (S11).

After step S9, the relay server 250 may transmit the additional information to the client 10 with which it is maintaining a session (S12). Accordingly, even if the specification of the client 10 does not conform to the communication standard supported by the service providing server 230, the user can still receive the additional information indirectly through the relay server 250.

Furthermore, as steps S11 and S12 show, the voice recognition platform providing apparatus 20 sets up a separate relay server 250 in addition to the intelligent voice recognition server 210 and assigns the response uttered voice and the additional information, which differ in data type, format, or transmission standard, to two different senders. It can therefore provide the user with various kinds or forms of data (for example, text, graphics, images, video, audio, or a combination thereof) without depending on the data transmission standard supported by the intelligent voice recognition server 210. This expands the range of data and services that the voice recognition platform providing apparatus 20 can offer the client 10.

Thereafter, the client 10 may synchronize the response uttered voice and the additional information received from the intelligent voice recognition server 210 and the relay server 250, respectively, and output the response (S13). In step S13, the client 10 may align the output times of the response uttered voice and the additional information based on a response waiting time (latency). Here, the response waiting time can be understood as the output delay between a first time point (t₁), at which the user's uttered voice is received, and a third time point (t₃), at which both the response uttered voice and the additional information have been received.

The servers 210 and 250 that transmit the response uttered voice and the additional information are physically or logically separate, and each transmits data for a different sense (hearing or sight) to the client 10, so a gap can arise between the times at which the client 10 receives each piece of data. This is because the servers differ in data processing speed (or transmission speed) and in the volume of data they transmit. Accordingly, a predetermined time difference may occur between the second time point (t₂), at which the client 10 receives the response uttered voice, and the third time point (t₃), at which it receives the additional information.
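The timing quantities in this passage can be written compactly. The symbols T_wait and Δt are labels assumed here for illustration; the specification's own symbol for the time difference did not survive in this text, and the ordering of the time points is the one implied by the description (the voice input is received first, and the second piece of data arrives no earlier than the first).

```latex
T_{\text{wait}} = t_3 - t_1, \qquad
\Delta t = t_3 - t_2, \qquad
t_1 < t_2 \le t_3
```

Under this reading, holding the first-arriving data for Δt is what lets both outputs start together at t₃.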

If the client 10 were to output a response as soon as it received either the response uttered voice or the additional information, the two pieces of data would be presented separately, a predetermined time difference apart, which could undermine the user's trust in the voice recognition service. In particular, because each piece of data relies on a different one of the human senses (hearing or sight), the sense of incongruity the user feels toward the response would inevitably grow.

For this reason, the client 10 according to an embodiment holds the output in a standby state at the time point (t₂) at which it receives one of the response uttered voice and the additional information, and synchronizes the response uttered voice and the additional information at the time point (t₃) at which it receives the other, thereby aligning their output times. In doing so, the client 10 may store the user's uttered voice, the response uttered voice, and the additional information in the memory 150 (see FIG. 2), mapped to one another, and, once the response waiting time has elapsed, load the response result from the memory 150 (see FIG. 2) and process it.
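The hold-then-release behavior described above can be sketched as a small buffer keyed by request. This is a minimal sketch under assumptions: the `request_id` key, the `"voice"` and `"info"` labels, and the callback signature are hypothetical, and the dictionary merely stands in for the memory 150.

```python
from typing import Any, Callable, Dict


class ResponseSynchronizer:
    """Holds whichever response arrives first and outputs both together."""

    def __init__(self, output: Callable[[Any, Any], None]) -> None:
        self._output = output
        # request_id -> partial response; stands in for the client's memory.
        self._pending: Dict[str, Dict[str, Any]] = {}

    def on_receive(self, request_id: str, kind: str, data: Any) -> bool:
        """Store one part; return True if this call triggered the output."""
        if kind not in ("voice", "info"):
            raise ValueError(f"unknown response kind: {kind!r}")
        slot = self._pending.setdefault(request_id, {})
        slot[kind] = data
        if "voice" in slot and "info" in slot:
            # Both parts are present: release them at the same moment.
            del self._pending[request_id]
            self._output(slot["voice"], slot["info"])
            return True
        return False  # first arrival: keep waiting for the other part
```

A caller would pass a function that starts audio playback and updates the display in one step, so neither part is presented before the other has arrived. A production client would likely also need a timeout for the case where one part never arrives, which this sketch omits.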

An example of a response to a user's utterance as presented on the client 10 is described below with reference to FIG. 5.

FIG. 5 is a diagram showing a response to a user's utterance presented through a voice recognition application program installed on a client according to an embodiment of the present invention.

Referring to FIG. 5, in response to the user's utterance, the client 10 may output the response uttered voice (A) through the speaker 130 (see FIG. 2) and, at the same time, display the additional information (B) on the display 140 (see FIG. 2).

In the following description, it is assumed that the client 10 captures a user's uttered voice (Q) of [What entertainment shows are on today] from the user. This is merely an example, however, and it will be apparent to those skilled in the art that the user's uttered voice (Q) is not limited to it.

When the user's uttered voice (Q) of [What entertainment shows are on today] is input to the client 10, the client 10 may receive the response uttered voice (A) and additional information (B) generated by the voice recognition platform providing apparatus (not shown).

The response uttered voice (A) and the additional information (B) received by the client 10 may contain a response corresponding to the user's utterance intent (e.g., 'entertainment programs broadcast today') derived by the voice recognition platform providing apparatus (not shown). As an example, the response uttered voice (A) may contain auditory data saying [Today's entertainment programs include 영화가 좋다, 아는 형님, 놀면 뭐하니, and 불후의 명곡], and the additional information (B) may contain visual data comprising a thumbnail image for each of the 'entertainment programs broadcast today' and text giving each program's title, broadcaster, air time, and viewer rating.

The client 10 outputs the response uttered voice (A) through the speaker 130 (see FIG. 2) and the additional information (B) through the display 140 (see FIG. 2), aligning their output times so that both are presented simultaneously, which can improve the user experience (UX).

The method for providing additional services based on voice recognition according to the embodiments described above may be produced as a program to be executed on a computer and stored on a computer-readable recording medium. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices.

The computer-readable recording medium may be distributed across computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the method described above can be readily inferred by programmers in the technical field to which the embodiments belong.

Although only a few embodiments have been described above, various other forms of implementation are possible. The technical content of the embodiments described above may be combined in various ways unless they are mutually incompatible, and new embodiments may be realized through such combinations.

The system and method for providing additional services based on voice recognition according to the embodiments described above can be used in various Internet of Things (IoT) devices, such as smartphones, wearable devices, AI speakers, robot vacuum cleaners, set-top boxes, TVs, and refrigerators.

It will be apparent to those skilled in the art that the present invention may be embodied in other specific forms without departing from its spirit and essential characteristics. Accordingly, the above detailed description should not be construed as limiting in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

Claims

A system for providing additional services based on voice recognition, comprising:
a client that receives a voice input according to a user's utterance; and
a voice recognition platform providing apparatus that interworks with the client to derive a user utterance intent corresponding to the voice input and generates a plurality of response data based on the user utterance intent,
wherein the client synchronizes the plurality of response data based on a response waiting time and outputs them externally,
wherein the plurality of response data includes:
first response data including voice data in an auditory form; and
second response data including content data in a visual form,
wherein the voice data is obtained by converting, into a voice signal, a response text obtained by applying a text summarization technique based on the utterance intent, and
wherein the content data in the visual form is additional information formed by combining an image obtained by searching a database or a search engine for the utterance intent.

delete

The system of claim 1,
wherein the voice recognition platform providing apparatus includes:
a first server that converts the voice input into a query text and derives the user utterance intent by using at least one of natural language processing and named entity recognition;
a second server that generates a response text by searching for the user utterance intent based on a pre-established database; and
a third server that performs a session connection with the client according to a prescribed protocol.

The system of claim 3,
wherein the second server transmits the response text to the first server, and
the first server converts the response text into the voice data to generate the first response data.

The system of claim 4,
wherein the second server
extracts keywords from the response text to generate the second response data, and
transmits the second response data to the third server.

The system of claim 5,
wherein the client
receives the first response data from the first server, and
receives the second response data from the third server while the session connection is maintained.

The system of claim 6,
wherein the response waiting time is
an output delay time between a first time point at which the voice input is received and a second time point at which all of the first and second response data have been received.

The system of claim 1,
wherein the client
holds the output of either one of the first and second response data in a standby state until the response waiting time elapses.

The system of claim 6,
wherein the client
stores the voice input and the first and second response data mapped to one another.

The system of claim 3,
wherein the protocol includes HTTP/2 (Hyper Text Transfer Protocol Version 2).

A method for providing additional services based on voice recognition in an additional service providing system including a client and a voice recognition platform providing apparatus, the method comprising:
receiving, at the voice recognition platform providing apparatus, a voice input according to a user's utterance from the client;
deriving, at the voice recognition platform providing apparatus, a user utterance intent corresponding to the voice input, and generating a plurality of response data based on the user utterance intent;
receiving, at the client, the plurality of response data from the voice recognition platform providing apparatus; and
synchronizing, at the client, the plurality of response data based on a response waiting time and outputting them externally,
wherein the plurality of response data includes:
first response data including voice data in an auditory form; and
second response data including content data in a visual form,
wherein the voice data is obtained by converting, into a voice signal, a response text obtained by applying a text summarization technique based on the utterance intent, and
wherein the content data in the visual form is additional information formed by combining an image obtained by searching a database or a search engine for the utterance intent.

delete

The method of claim 11,
wherein the voice recognition platform providing apparatus includes first to third servers, the method further comprising:
converting, at the first server, the voice input into a query text, and deriving the user utterance intent by using at least one of natural language processing and named entity recognition;
generating, at the second server, a response text by searching for the user utterance intent based on a pre-established database; and
performing, at the third server, a session connection with the client according to a prescribed protocol.

The method of claim 13, further comprising:
transmitting, by the second server, the response text to the first server; and
converting, at the first server, the received response text into the voice data to generate the first response data.

The method of claim 14, further comprising:
extracting, at the second server, keywords from the response text, generating the second response data, and transmitting the second response data to the third server.

The method of claim 15, further comprising,
at the client:
receiving the first response data from the first server; and
receiving the second response data from the third server while the session connection is maintained.

The method of claim 16,
wherein the response waiting time is
an output delay time between a first time point at which the voice input is received and a second time point at which all of the first and second response data have been received.

The method of claim 11, further comprising,
at the client:
holding the output of either one of the first and second response data in a standby state until the response waiting time elapses.

The method of claim 16, further comprising,
at the client:
storing the voice input and the first and second response data mapped to one another.

The method of claim 13,
wherein the protocol includes HTTP/2 (Hyper Text Transfer Protocol Version 2).

A computer-readable recording medium on which is recorded an application program that, when executed by a processor, implements the method for providing additional services based on voice recognition according to any one of claims 11 and 13 to 20.