KR20220003050U

KR20220003050U - Electronic apparatus for providing artificial intelligence conversations

Info

Publication number: KR20220003050U
Application number: KR2020220001041U
Authority: KR
Inventors: 오병기
Original assignee: 주식회사 쓰리디팩토리
Priority date: 2021-06-21
Filing date: 2022-04-27
Publication date: 2022-12-28

Abstract

An electronic device for providing artificial intelligence conversations is provided. The electronic device receives a user's voice signal, converts the voice signal into a text signal, performs natural language processing on the text signal to extract keywords, matches the extracted keywords against a database to form a knowledge-based response signal, forms an emotion-based response signal from the knowledge-based response signal in consideration of the user's age, personality, and intimacy, and converts the emotion-based response signal into a voice output signal.

Description

Electronic device for providing artificial intelligence conversation {ELECTRONIC APPARATUS FOR PROVIDING ARTIFICIAL INTELLIGENCE CONVERSATIONS}

The present utility model relates to an electronic device for providing artificial intelligence conversations and, more specifically, to an electronic device that can provide emotion-based conversations on the basis of artificial intelligence (AI) learning.

As AI conversation technology spreads and a variety of services linked to it emerge, the technology is being used more and more widely. Users can access services in many fields through AI conversations.

However, when AI conversation technology is used, the AI system's answer to a user's inquiry is produced with a preset tone, intonation, and manner of speech, so the user does not feel close to the AI system. Young children in particular may be frightened by the AI system's answers.

One technical problem to be solved by the present utility model is to provide an electronic device for providing artificial intelligence conversations that gives an emotion-based response to a voice received from a user.

Another technical problem to be solved by the present utility model is to provide an electronic device for providing artificial intelligence conversations that gives an emotion-based response to a voice received from a user in consideration of the user's age, personality, and intimacy.

Another technical problem to be solved by the present utility model is to provide an electronic device for providing artificial intelligence conversations that updates a database by assigning a weight according to the degree of the user's reaction to a voice output.

The technical problems to be solved by the present utility model are not limited to those described above.

An electronic device for providing artificial intelligence conversations according to an embodiment of the present utility model includes one or more processors and one or more memories storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. The one or more processors may receive a user's voice signal, convert the voice signal into a text signal, perform natural language processing on the text signal to extract keywords, match the extracted keywords against a database to form a knowledge-based response signal, form an emotion-based response signal from the knowledge-based response signal in consideration of the user's age, personality, and intimacy, and convert the emotion-based response signal into a voice output signal.
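The chain of operations recited above can be pictured as a sequence of stages. The following is only a minimal sketch under stated assumptions: every function name, the `UserProfile` fields, the toy database, and the stubbed speech-to-text and text-to-speech stages are hypothetical placeholders introduced for illustration, not components disclosed in this document.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass
class UserProfile:
    age: int
    personality: str  # hypothetical label, e.g. "calm" or "impatient"
    intimacy: float   # hypothetical scale: 0.0 (stranger) to 1.0 (close)


def speech_to_text(voice_signal: bytes) -> str:
    # Stub: a real device would run a speech recognizer here.
    return voice_signal.decode("utf-8")


def extract_keywords(text: str, vocabulary: Set[str]) -> List[str]:
    # Stub NLP: strip punctuation, lowercase, keep known words only.
    words = [w.strip("?.!,").lower() for w in text.split()]
    return [w for w in words if w in vocabulary]


def lookup_answer(keywords: List[str],
                  database: Dict[FrozenSet[str], str]) -> Optional[str]:
    # Exact match of the keyword set against the toy database.
    return database.get(frozenset(keywords))


def emotion_response(answer: str, user: UserProfile) -> str:
    # Toy tone rule (assumption): formal for low intimacy, casual otherwise.
    if user.intimacy < 0.5:
        return f"Here is what I found: {answer}."
    return f"Hey, {answer}!"


def text_to_speech(text: str) -> bytes:
    # Stub: a real device would synthesize audio here.
    return text.encode("utf-8")


def converse(voice_signal: bytes, user: UserProfile,
             vocabulary: Set[str],
             database: Dict[FrozenSet[str], str]) -> bytes:
    text = speech_to_text(voice_signal)
    keywords = extract_keywords(text, vocabulary)
    answer = lookup_answer(keywords, database)
    if answer is None:
        answer = "I do not know that yet"
    return text_to_speech(emotion_response(answer, user))
```

Each stage is a plain function so that any one of them could be swapped for a real recognizer, matcher, or synthesizer without touching the others.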

In an embodiment, the one or more processors may perform natural language processing on the received voice signal by applying sentence separation, spacing correction, spelling correction, abbreviation expansion, and symbol removal to the voice signal.

In an embodiment, the one or more processors may receive a plurality of voice data items covering gender, region, age, and multi-speaker utterances, update the database using the plurality of voice data items, and match the extracted keywords against the database to form the knowledge-based response signal.

In an embodiment, the one or more processors may update the database by assigning a weight according to the degree of the user's reaction to the voice output signal.

In an embodiment, when the extracted keywords cannot be matched against the database, the one or more processors may determine that the knowledge-based response signal cannot be formed by matching, and may form the knowledge-based response signal from the extracted keywords using a machine learning algorithm.

A method for providing artificial intelligence conversations according to an embodiment of the present utility model uses an electronic device that includes one or more processors and one or more memories storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. The method may include the following steps, each performed by the one or more processors: receiving a user's voice signal; converting the voice signal into a text signal; extracting keywords by performing natural language processing on the text signal; forming a knowledge-based response signal by matching the extracted keywords against a database; forming an emotion-based response signal from the knowledge-based response signal in consideration of the user's age, personality, and intimacy; and converting the emotion-based response signal into a voice output signal.

A computer program according to an embodiment of the present utility model is stored in a computer-readable recording medium so as to be executable on a computer that includes one or more processors and one or more memories storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. The program may be stored in the computer-readable recording medium so as to carry out the following steps, each performed by the one or more processors: receiving a user's voice signal; converting the voice signal into a text signal; extracting keywords by performing natural language processing on the text signal; forming a knowledge-based response signal by matching the extracted keywords against a database; forming an emotion-based response signal from the knowledge-based response signal in consideration of the user's age, personality, and intimacy; and converting the emotion-based response signal into a voice output signal.

The electronic device and method for providing artificial intelligence conversations according to an embodiment of the present utility model, and the program stored in a computer-readable recording medium to carry out the method, can give an emotion-based response to a voice received from a user and thereby enable user-friendly conversation.

In addition, the electronic device and method for providing artificial intelligence conversations according to an embodiment of the present utility model, and the program stored in a computer-readable recording medium to carry out the method, can hold a conversation tailored to the user by considering the user's age, personality, intimacy, and the like for a voice received from the user, and can thereby provide artificial intelligence conversations that the user will want to return to.

In addition, the electronic device and method for providing artificial intelligence conversations according to an embodiment of the present utility model, and the program stored in a computer-readable recording medium to carry out the method, update the database by assigning a weight according to the degree of the user's reaction to a voice output, and can therefore provide more refined conversations while reducing the amount of computation needed to provide them.

FIG. 1 is an exemplary diagram showing the configuration of an electronic device for providing artificial intelligence conversations according to an embodiment of the present utility model.
FIG. 2 is an exemplary diagram showing the configuration of an artificial intelligence conversation providing system according to an embodiment of the present utility model.
FIG. 3 is a diagram for explaining the training of a neural network according to an embodiment of the present utility model.
FIGS. 4 and 5 are flowcharts showing the procedure of a method for providing artificial intelligence conversations according to an embodiment of the present utility model.

Hereinafter, preferred embodiments of the present utility model will be described in detail with reference to the accompanying drawings. However, the technical idea of the present utility model is not limited to the embodiments described here and may be embodied in other forms. Rather, the embodiments introduced here are provided so that the disclosure will be thorough and complete and so that the idea of the present utility model will be fully conveyed to those skilled in the art.

In this specification, when an element is described as being on another element, it may be formed directly on the other element, or a third element may be interposed between them.

In addition, although terms such as first, second, and third are used to describe various elements in the embodiments of this specification, the elements should not be limited by these terms. The terms are used only to distinguish one element from another. Accordingly, what is referred to as a first element in one embodiment may be referred to as a second element in another embodiment. Each embodiment described and illustrated here also includes its complementary embodiment. In this specification, "and/or" is used to mean including at least one of the elements listed before and after it.

In the specification, a singular expression includes the plural unless the context clearly indicates otherwise. The terms "include" and "have" are intended to specify the presence of the features, numbers, steps, elements, or combinations thereof described in the specification, and should not be understood as excluding the presence or possible addition of one or more other features, numbers, steps, elements, or combinations thereof.

In this specification, "connection" covers both indirect and direct connection of a plurality of elements. "Connection" also covers electrical connection as well as physical connection.

In the following description of the present utility model, a detailed description of a related known function or configuration will be omitted where it is judged that it could unnecessarily obscure the gist of the present utility model.

FIG. 1 is an exemplary diagram showing the configuration of an electronic device for providing artificial intelligence conversations according to an embodiment of the present utility model.

As shown in FIG. 1, the electronic device 100 for providing artificial intelligence conversations may include one or more processors 110, one or more memories 120, a transceiver 130, a microphone 140, and a speaker 150. In an embodiment, at least one of these components may be omitted from the electronic device 100, or another component may be added to it. Additionally or alternatively, some of the components may be integrated, or may be implemented as a single entity or as multiple entities. At least some of the components inside and outside the electronic device 100 may be connected to one another through a system bus, general purpose input/output (GPIO), a serial peripheral interface (SPI), a mobile industry processor interface (MIPI), or the like, and may exchange data and/or signals. In an embodiment, the electronic device 100 may form the emotion-based response signal using machine learning, in particular a deep reinforcement learning algorithm such as deep learning.

The one or more processors 110 may run software (for example, instructions or programs) to control at least one component of the electronic device 100 that is connected to the processor 110. The processor 110 may also perform various operations related to the present utility model, such as computation, processing, and data generation and manipulation. The processor 110 may also load data from the one or more memories 120 or store data in the one or more memories 120.

The one or more processors 110 may receive a user's voice signal. According to an embodiment, the one or more processors 110 may acquire the user's voice signal through the microphone 140 in the form of data packets that a computer can process. For example, the processor 110 may use a continuous speech recognition algorithm to recognize a voice signal that the user utters continuously as a sequence of words. According to an embodiment, the user's voice signal may be spontaneous speech, that is, speech uttered naturally as in conversation, and may include hesitations, interjections, breathing sounds, and the like. According to another embodiment, the processor 110 may improve convenience by operating in a speaker-independent manner, recognizing the voice of any speaker so that an unspecified number of users can use the device. For example, the processor 110 may improve the recognition rate for the user's voice through a voice training function. According to another embodiment, the processor 110 may implement speech adaptation technology that applies Google Cloud Speech to improve the accuracy of words and phrases that appear frequently in the audio data.

The one or more processors 110 may convert the received voice signal into a text signal. According to an embodiment, the one or more processors 110 may sample the received voice signal with an analog-to-digital (A/D) converter to obtain a digital signal, remove noise from the digital signal, and then convert it into a text signal. For example, the processor 110 may be developed so that various languages can be selected for the speech-to-text (STT) recognition engine and so that the audio language (and the national or regional dialect) can be specified with a BCP-47 identifier in the language code field of the request configuration. The processor 110 may activate the function with a designated button or key. With streaming speech recognition, audio is streamed to the STT engine, the streaming recognition results are turned into text in real time as the audio is processed, and the text can be saved as a translation system file. A streaming STT application programming interface (API) recognition call is designed for real-time capture and recognition of audio within a bidirectional stream: the application sends audio on the request stream and stores the interim and final recognition results received on the response stream as text in real time. An interim result represents the current recognition result for a given section of audio, and the final recognition result gives the final, highest-ranked guess for that section.

The processor 110 may apply a content delivery network (CDN) service to reduce the network speed to 1/10. The processor 110 may design and configure a cloud architecture with service operation in mind. The configuration is made redundant for scalability and stability, and can take into account service expansion for the Clova and VOD Trans services already being provided. The relevant configuration files may be uploaded to Object Storage on NAVER Cloud Platform, and HTTP/HTTPS access may be connected through CDN+ (a service that delivers content to many users quickly and safely without loss). The processor 110 may use call audio files to perform recognition and service connection. After the service is called through the API, the result value can be obtained upon checking.

The one or more processors 110 may extract keywords by performing natural language processing on the received text signal. According to an embodiment, the one or more processors 110 may perform natural language processing of the received voice signal by applying sentence separation, spacing correction, spelling correction, abbreviation expansion, and symbol removal to the text signal. For example, the processor 110 may extract every word segment (eojeol) contained in the received text signal, remove sentence-ending marks from the extracted word segments, calculate a sentence separation probability that two consecutive word segments belong to different sentences, compare the sentence separation probability with a predetermined reference probability, and, when the sentence separation probability is greater than or equal to the reference probability, separate the two consecutive word segments into different sentences.
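The sentence separation step above can be sketched as follows. This is an illustrative sketch only: the text does not say how the sentence separation probability is computed, so `boundary_prob` is passed in as a parameter, and `toy_boundary_prob`, its word list, and the default threshold of 0.5 are assumptions made for the example.

```python
from typing import Callable, List

SENTENCE_MARKS = ".?!"


def split_sentences(text: str,
                    boundary_prob: Callable[[str, str], float],
                    threshold: float = 0.5) -> List[List[str]]:
    # 1. Extract word segments and remove sentence-ending marks.
    segments = [w.strip(SENTENCE_MARKS) for w in text.split()]
    segments = [w for w in segments if w]
    if not segments:
        return []
    # 2. Walk over consecutive pairs; start a new sentence whenever the
    #    separation probability reaches the reference probability.
    sentences = [[segments[0]]]
    for prev, nxt in zip(segments, segments[1:]):
        if boundary_prob(prev, nxt) >= threshold:
            sentences.append([nxt])
        else:
            sentences[-1].append(nxt)
    return sentences


def toy_boundary_prob(prev: str, nxt: str) -> float:
    # Hypothetical stand-in for a trained model: words that often
    # end a sentence get a high separation probability.
    final_words = {"today", "please"}
    return 0.9 if prev.lower() in final_words else 0.1
```

Because the marks are stripped before the probabilities are computed, the split depends only on the model's estimate and not on punctuation that a speech recognizer may or may not emit.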

In addition, the one or more processors 110 may remove all punctuation marks from each sentence in the received text signal and check whether each extracted word segment is contained in a predetermined corpus. When a word segment is contained in the corpus, its spelling is not corrected; when a word segment is not contained in the corpus, that word segment is treated as a correction target and is replaced with a correction candidate.
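A minimal sketch of this corpus check follows. How the correction candidate is chosen is not specified in the text; the sketch assumes the closest corpus entry by string similarity, using the standard library's `difflib.get_close_matches`. The function name, the 0.6 cutoff, and the choice to leave a word unchanged when no candidate is close enough are all assumptions.

```python
import difflib
import string
from typing import Iterable, List


def correct_spelling(sentence: str, corpus: Iterable[str]) -> List[str]:
    corpus_list = sorted(set(corpus))
    known = set(corpus_list)
    # Remove punctuation (ASCII only in this sketch).
    stripped = sentence.translate(str.maketrans("", "", string.punctuation))
    corrected = []
    for word in stripped.split():
        if word in known:
            # In the corpus: keep the spelling as it is.
            corrected.append(word)
            continue
        # Not in the corpus: replace with the closest candidate, if any.
        candidates = difflib.get_close_matches(word, corpus_list, n=1, cutoff=0.6)
        corrected.append(candidates[0] if candidates else word)
    return corrected
```

For Korean text a real system would more likely compare at the syllable or jamo level; plain string similarity is used here only to keep the example self-contained.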

The one or more processors 110 may check whether each extracted word segment is contained in a predetermined thesaurus (a dictionary that records synonym or antonym relationships between different word segments). When a word segment is not contained in the thesaurus, it is left unchanged; when a word segment is contained in the thesaurus, it may be replaced with the abbreviation-expanded word segment recorded in the thesaurus.
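The thesaurus lookup amounts to a dictionary substitution. The sketch below assumes the thesaurus can be represented as a plain mapping from an abbreviated form to its expanded form; the mapping contents and the function name are hypothetical.

```python
from typing import Dict, List

# Hypothetical thesaurus entries: abbreviated form -> expanded form.
TOY_THESAURUS: Dict[str, str] = {
    "temp": "temperature",
    "tmrw": "tomorrow",
}


def expand_abbreviations(segments: List[str],
                         thesaurus: Dict[str, str]) -> List[str]:
    # Segments found in the thesaurus are replaced by their expansion;
    # all other segments pass through unchanged.
    return [thesaurus.get(seg, seg) for seg in segments]
```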

In addition, the one or more processors 110 may calculate, for each extracted word segment, the probability that it is a dependent and the probability that it is a governor, and compare the two. When the probability of being a dependent is higher than the probability of being a governor, the word segment is classified as a dependent; when the probability of being a governor is higher than the probability of being a dependent, the word segment is classified as a governor.
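The comparison rule can be sketched as below. The text does not describe how the two probabilities are obtained, so the scorer is passed in as a parameter and `toy_score` is a hypothetical lookup table. The text is also silent on ties; labelling them "undecided" is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Tuple


def classify_segments(segments: List[str],
                      score: Callable[[str], Tuple[float, float]]) -> Dict[str, str]:
    labels: Dict[str, str] = {}
    for seg in segments:
        p_dependent, p_governor = score(seg)
        if p_dependent > p_governor:
            labels[seg] = "dependent"
        elif p_governor > p_dependent:
            labels[seg] = "governor"
        else:
            labels[seg] = "undecided"  # tie handling is an assumption
    return labels


def toy_score(seg: str) -> Tuple[float, float]:
    # Hypothetical probabilities (dependent, governor) for illustration.
    table = {"tomorrow": (0.8, 0.2), "weather": (0.3, 0.7)}
    return table.get(seg, (0.5, 0.5))
```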

For example, when the one or more processors 110 receive the voice signal "How will the weather be at Mt. Seorak tomorrow?" from the user, they may perform natural language processing and extract "tomorrow", "Mt. Seorak", and "weather" as the keywords of the text signal.

According to another embodiment, the one or more processors 110 may extract features useful for speech recognition from the text signal to generate a feature vector, and may compare the generated feature vector against a dictionary of recognition-target vocabulary prepared in advance during training to extract the most similar word as a keyword. For example, a small vocabulary of 100 words or fewer, as used in ordinary chat, is used, but a large vocabulary of 10,000 words or more may also be applied if needed.
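One way to picture the dictionary lookup is a nearest-neighbor search over feature vectors. The text does not name the features or the similarity measure, so the sketch below substitutes character-bigram counts and cosine similarity as stand-ins; the function names and the `min_similarity` cutoff are likewise assumptions.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def bigram_vector(word: str) -> Counter:
    # Pad with boundary markers so word starts and ends count as features.
    padded = f"^{word.lower()}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def nearest_keyword(token: str, vocabulary: Iterable[str],
                    min_similarity: float = 0.3) -> Optional[str]:
    # Return the most similar vocabulary word, or None if nothing is
    # at least min_similarity close to the token.
    vec = bigram_vector(token)
    best_word, best_score = None, 0.0
    for word in vocabulary:
        similarity = cosine(vec, bigram_vector(word))
        if similarity > best_score:
            best_word, best_score = word, similarity
    return best_word if best_score >= min_similarity else None
```

With a vocabulary of around 100 words a linear scan like this is adequate; for the 10,000-word case mentioned above, an index over the vectors would be the more usual choice.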

The one or more processors 110 may match the extracted keywords against a database to form a knowledge-based response signal. According to an embodiment, the one or more processors 110 may receive a plurality of voice data items covering gender, region, age, and multi-speaker utterances, update the database using the plurality of voice data items, and match the text signal against the database to form the knowledge-based response signal. For example, the processor 110 may update the database by assigning a weight according to the degree of the user's reaction to the voice output signal. The processor 110 may also build a rule-based dialogue system for forming the knowledge-based response signal. That is, a conversational voice data set may be built for developing artificial intelligence technology that recognizes everyday conversation and converts voice signals into text signals in real time; 4,000 hours of original voice data and 4 million sentences of text data may be secured across categories such as gender, region, age, long-distance speech, and multi-speaker utterances; and a conversation data set may be built from the secured voice data. The processor 110 may also build a large-scale cloud-based data set that allows flexible answers, and may be enhanced so that it can give suitable answers to unexpected questions.

For example, the processor 110 may match the extracted keywords "tomorrow", "Mt. Seorak", and "weather" against the database and form the knowledge-based response signal "Temperature at Mt. Seorak 14 to 23°C, mostly cloudy and overcast, fine dust/ultrafine dust good".
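The matching step, together with the fallback used when no match is possible, might look like the sketch below. The database layout (keyword sets mapped to answers), the "most specific fully covered entry wins" rule, and the `fallback` callable standing in for the machine learning algorithm are all assumptions made for illustration; the document does not specify any of them.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional

KnowledgeBase = Dict[FrozenSet[str], str]


def match_database(keywords: Iterable[str], kb: KnowledgeBase) -> Optional[str]:
    # Pick the entry whose keyword set is fully covered by the query,
    # preferring the entry with the most keywords.
    query = set(keywords)
    best_key: Optional[FrozenSet[str]] = None
    for key in kb:
        if key <= query and (best_key is None or len(key) > len(best_key)):
            best_key = key
    return kb[best_key] if best_key is not None else None


def form_knowledge_response(keywords: Iterable[str], kb: KnowledgeBase,
                            fallback: Callable[[List[str]], str]) -> str:
    keyword_list = list(keywords)
    answer = match_database(keyword_list, kb)
    if answer is None:
        # No database match: hand the keywords to the fallback, which
        # stands in for a learned model in this sketch.
        return fallback(keyword_list)
    return answer
```

Keeping the fallback as a parameter makes the point at which matching is judged impossible explicit and keeps the rule-based path testable on its own.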

One or more processors 110 may use the knowledge base response signal to form an emotion base response signal in consideration of the user's age, personality, and intimacy. According to an embodiment, the one or more processors 110 may form the emotion base response signal so that emotion is expressed in the knowledge base response signal, in consideration of the user's age. For example, the processor 110 may extract the user's age information from the user's voice signal and adjust the tone of the knowledge base response signal using the extracted age information to form the emotion base response signal. That is, when the user's age is estimated to be in the teens, the processor 110 may adjust the tone of the knowledge base response signal "Temperature at Mt. Seorak: 14 to 23°C, mostly cloudy and overcast, fine dust/ultrafine dust: good" to form a casually worded emotion base response signal such as "It'll be 14 to 23°C at Mt. Seorak, cloudy and overcast, and the fine dust/ultrafine dust should be good." In addition, the processor 110 may extract the user's personality information from the user's voice signal and adjust the tone of the knowledge base response signal using the extracted personality information to form the emotion base response signal. That is, when the processor 110 determines that the user has an impatient personality, it may adjust the tone of the knowledge base response signal "Temperature at Mt. Seorak: 14 to 23°C, mostly cloudy and overcast, fine dust/ultrafine dust: good" to form the emotion base response signal in the most concise form possible: "Temperature at Mt. Seorak: 14 to 23°C, mostly cloudy and overcast, fine dust/ultrafine dust: good." In addition, the processor 110 may extract the user's intimacy information from the user's voice signal and adjust the tone of the knowledge base response signal using the extracted intimacy information to form the emotion base response signal. That is, when the processor 110 determines that intimacy with the user is low, it may adjust the tone of the knowledge base response signal "Temperature at Mt. Seorak: 14 to 23°C, mostly cloudy and overcast, fine dust/ultrafine dust: good" to form the emotion base response signal as formal sentences: "The temperature at Mt. Seorak is 14 to 23°C. It is expected to be mostly cloudy and overcast. Fine dust/ultrafine dust levels are expected to be good." In addition, the one or more processors 110 may perform machine learning on the formation of emotion base response signals by assigning a weight to an emotion base response signal based on the user's feedback on it. For example, the processor 110 may assign a higher weight to the emotion base response signal when the user smiles in response to it, and a lower weight when the user's facial expression does not change in response to it.
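The tone adjustment and feedback weighting described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the `UserProfile` fields, the style names, the priority order among personality, intimacy, and age, the intimacy threshold, and the fixed weight step are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    """Hypothetical user attributes inferred from the voice signal."""
    age_group: str    # e.g. "teen" or "adult" (labels are assumptions)
    impatient: bool   # personality flag
    intimacy: float   # 0.0 (stranger) to 1.0 (close); scale is an assumption


def choose_style(profile, intimacy_threshold=0.5):
    """Pick a tone for the emotion base response signal.

    The priority order (personality, then intimacy, then age) is an
    assumption; the source does not say how the factors are combined.
    """
    if profile.impatient:
        return "terse"
    if profile.intimacy < intimacy_threshold:
        return "formal"
    if profile.age_group == "teen":
        return "casual"
    return "formal"


def update_weight(weight, smiled, step=0.1):
    """Raise the weight of a response when the user smiled, lower it
    otherwise, keeping the result within [0.0, 1.0]."""
    new_weight = weight + step if smiled else weight - step
    return min(1.0, max(0.0, new_weight))
```

Under these assumptions, a teenager with high intimacy receives the casual wording, while any user with low intimacy receives the formal wording regardless of age.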

When the extracted keywords cannot be matched against the database, one or more processors 110 may determine that a knowledge base response signal cannot be formed by matching, and may instead form the knowledge base response signal from the extracted keywords using a machine learning algorithm. According to an embodiment, when the processor 110 receives the voice signal "Where is a good ski resort in Africa?" from the user, it may confirm that the database contains no response matching the keywords "Africa" and "ski resort," and may form a knowledge base response signal such as "For ski resorts in Africa, the Sahara Desert is the best." In addition, the one or more processors 110 may use the knowledge base response signal to form an emotion base response signal in consideration of the user's age, personality, and intimacy.
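The fallback from database matching to a machine-learned response can be sketched as a simple dispatch. The function names, the exact-match lookup keyed by a keyword set, and the `generate` callable standing in for the machine learning algorithm are hypothetical; the source does not specify any of them.

```python
from typing import Callable, Dict, FrozenSet, Optional, Sequence, Tuple


def retrieve(keywords: Sequence[str],
             database: Dict[FrozenSet[str], str]) -> Optional[str]:
    """Return the stored response for this exact keyword set, if any."""
    return database.get(frozenset(keywords))


def respond(keywords: Sequence[str],
            database: Dict[FrozenSet[str], str],
            generate: Callable[[Sequence[str]], str]) -> Tuple[str, str]:
    """Try database matching first; fall back to the generative model.

    Returns the response together with the name of the path that
    produced it, so that callers can tell the two cases apart.
    """
    answer = retrieve(keywords, database)
    if answer is not None:
        return answer, "retrieval"
    return generate(keywords), "generative"
```

In this sketch `generate` could be any callable, for example a stub that returns a fixed sentence during testing and a trained model in deployment.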

One or more processors 110 may derive suitable answers from voice conversation history data and perform keyword training. Flexible use of two models, a retrieval-based model (forming the knowledge base response signal by matching against the database) and a generative model (forming the knowledge base response signal by machine learning), allows the weaknesses of each model to be offset by the other, and linguistic context and physical context may be merged to form a natural response signal. To keep answers consistent, fixed knowledge and a single unified persona for the model may be pursued, and to make use of artificial intelligence, an integrated analysis platform that supports text analysis, machine learning, and deep learning algorithms in a single environment may be built. A large-scale, cloud-based data set that enables flexible answers may also be constructed.

One or more processors 110 may derive a 5G cloud immersive content architecture and, building on the existing architecture, additionally incorporate native services offered by cloud providers. The architecture may be refined so that the cloud architecture is designed and configured with actual service operation in mind. The configuration may be made redundant in consideration of scalability and stability, with servers configured redundantly in consideration of service expansion, and the architecture may be made 5G-compatible so that it can scale toward metaverse services such as AR (Augmented Reality) and VR (Virtual Reality). A load balancer (a service that distributes network traffic across multiple servers in consideration of server performance and load) may be used to stabilize service operation and provide redundancy, and auto scaling may be used to automatically increase or decrease the number of servers according to pre-registered settings so that the service remains stable.
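As a rough illustration of auto scaling "according to pre-registered settings," the sketch below computes a target server count from average CPU utilization. The proportional rule, the parameter names, and the default limits (a minimum of two servers to preserve redundancy) are assumptions and do not correspond to any particular cloud provider's API.

```python
def desired_servers(current, avg_cpu_percent, target_cpu_percent=60,
                    min_servers=2, max_servers=10):
    """Return how many servers should run so that average CPU
    utilization moves toward the target.

    All utilization values are integer percentages so that the
    ceiling division below is exact.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    # Ceiling of current * avg / target using integer arithmetic.
    wanted = (current * avg_cpu_percent + target_cpu_percent - 1) // target_cpu_percent
    # Clamp to the pre-registered minimum and maximum.
    return max(min_servers, min(max_servers, wanted))
```

With these defaults, four servers at 90% average utilization would be scaled out to six, while four servers at 10% would be scaled in only as far as the two-server minimum.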

One or more processors 110 may develop a TTS (Text to Speech) voice that suits the AI character and implement a natural TTS voice that sounds like a real human voice. A tuned voice may be implemented by reproducing a specific person's voice, using a Naver Clova TTS voice of a similar level, and performing comparative spectrum analysis. Personalized speech synthesis may be implemented using WaveNet technology and Clova TTS technology, along with a technique that directly models the raw waveform of an audio signal one sample at a time.

Beyond their original purpose of game development, game engines are used for many kinds of development, such as mobile apps, AR, and VR. The electronic device 100 for providing artificial intelligence conversations according to the present invention may be built as a general-purpose platform that can be applied to and linked with game engines such as Unity and Unreal, and may connect voice recognition, STT, and translation systems to facilitate communication among global metaverse users.

One or more processors 110 may convert the emotion base response signal into a voice output signal. According to an embodiment, the processor 110 may convert the emotion base response signal into a voice output signal and output the converted voice output signal to the user through the speaker 150.

One or more memories 120 may store various data. The data stored in the memory 120 is data obtained, processed, or used by at least one component of the electronic device 100 for providing artificial intelligence conversations, and may include software (e.g., instructions, programs, etc.). The memory 120 may include volatile and/or non-volatile memory. In the present invention, instructions or programs are software stored in the memory 120 and may include an operating system for controlling the resources of the electronic device 100 for providing artificial intelligence conversations, applications, and/or middleware that provides various functions to the applications so that the applications can use the resources of the electronic device.

One or more memories 120 may store the user's voice signal, text signal, keywords, knowledge base response signal, emotion base response signal, voice output signal, and the like described above. The one or more memories 120 may also store instructions that, when executed by the one or more processors 110, cause the one or more processors 110 to perform operations.

According to an embodiment, the electronic device 100 for providing artificial intelligence conversations may further include a transceiver 130. The transceiver 130 may perform wireless or wired communication between the electronic device 100 for providing artificial intelligence conversations and a server, a database, client devices, and/or other devices. For example, the transceiver 130 may perform wireless communication according to schemes such as eMBB (enhanced Mobile Broadband), URLLC (Ultra Reliable Low-Latency Communications), mMTC (massive Machine Type Communications), LTE (Long-Term Evolution), LTE-A (LTE-Advanced), UMTS (Universal Mobile Telecommunications System), GSM (Global System for Mobile communications), CDMA (Code Division Multiple Access), WCDMA (Wideband CDMA), WiBro (Wireless Broadband), WiFi (Wireless Fidelity), Bluetooth, NFC (Near Field Communication), GPS (Global Positioning System), or GNSS (Global Navigation Satellite System). As another example, the transceiver 130 may perform wired communication according to schemes such as USB (Universal Serial Bus), HDMI (High Definition Multimedia Interface), RS-232 (Recommended Standard 232), or POTS (Plain Old Telephone Service).

According to an embodiment, one or more processors 110 may control the transceiver 130 to obtain information from a server and a database. The information obtained from the server and the database may be stored in one or more memories 120. In one embodiment, the information obtained from the server and the database may include information required for the keyword matching used to form the knowledge base response signal.

According to an embodiment, the electronic device 100 for providing artificial intelligence conversations may take various forms. For example, the electronic device 100 for providing artificial intelligence conversations may be a portable communication device, a computer device, or a device formed by combining one or more of these devices. The electronic device 100 for providing artificial intelligence conversations according to the present invention is not limited to the devices listed above.

The various embodiments of the electronic device 100 for providing artificial intelligence conversations according to the present invention may be combined with one another. The embodiments may be combined in any of the possible combinations, and any embodiment of the electronic device 100 for providing artificial intelligence conversations produced by such a combination also falls within the scope of the present invention. In addition, the internal and external components of the electronic device 100 according to the present invention described above may be added, changed, replaced, or deleted depending on the embodiment. The internal and external components of the electronic device 100 for providing artificial intelligence conversations described above may also be implemented as hardware components.

In the present invention, artificial intelligence (AI) refers to technology that imitates human learning, reasoning, and perception abilities and implements them on a computer, and may include concepts such as machine learning and symbolic logic. Machine learning (ML) is an algorithmic technology that classifies or learns the features of input data on its own. Artificial intelligence technology may use machine learning algorithms to analyze input data, learn from the results of that analysis, and make judgments or predictions based on what has been learned. Technologies that use machine learning algorithms to mimic functions of the human brain such as cognition and judgment may also be understood as falling within the category of artificial intelligence. Examples include technical fields such as linguistic understanding, visual understanding, inference/prediction, knowledge representation, and motion control.

Machine learning may refer to the process of training a neural network model using experience gained from processing data. Through machine learning, computer software may improve its own data processing capability. A neural network model is built by modeling the correlations between data, and those correlations may be expressed by a plurality of parameters. The neural network model extracts and analyzes features from the given data to derive correlations between the data, and machine learning can be described as repeating this process to optimize the parameters of the neural network model. For example, for data given as input/output pairs, a neural network model may learn the mapping (correlation) between input and output. Alternatively, even when only input data is given, a neural network model may learn the relationships by deriving regularities among the given data.

An artificial intelligence learning model, or neural network model, may be designed to implement the structure of the human brain on a computer, and may include a plurality of weighted network nodes that simulate the neurons of a human neural network. The network nodes may be connected to one another in a way that simulates the synaptic activity of neurons, which exchange signals through synapses. In an artificial intelligence learning model, the network nodes may be located in layers of different depths and exchange data according to convolutional connection relationships. The artificial intelligence learning model may be, for example, an artificial neural network (ANN) model or a convolutional neural network (CNN) model. In one embodiment, the artificial intelligence learning model may be trained by methods such as supervised learning, unsupervised learning, or reinforcement learning. Machine learning algorithms that may be used include decision trees, Bayesian networks, support vector machines, artificial neural networks, AdaBoost, perceptrons, genetic programming, and clustering.

Among these, a CNN is a type of multilayer perceptron designed to require minimal preprocessing. A CNN consists of one or more convolutional layers with ordinary artificial neural network layers on top of them, and additionally makes use of weights and pooling layers. This structure allows a CNN to take full advantage of input data that has a two-dimensional structure. Compared with other deep learning architectures, CNNs perform well in both image and speech applications. CNNs can also be trained with standard backpropagation. CNNs tend to be easier to train than other feedforward artificial neural network techniques and have the advantage of using fewer parameters.
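To make the convolution and pooling steps concrete, the following is a minimal pure-Python sketch of a single 2D convolution (implemented, as in most deep learning libraries, as a cross-correlation without kernel flipping) followed by non-overlapping max pooling. The function names and the "valid" boundary handling are choices made for this illustration, not details taken from the source.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image ("valid" mode, stride 1) and
    return the feature map as a list of rows."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size):
    """Keep the maximum of each non-overlapping size x size window."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in cols
        ]
        for i in rows
    ]
```

Because the same small kernel is reused at every position, the layer needs only as many weights as the kernel has entries, which is the source of the parameter savings mentioned above.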

Convolutional networks are neural networks that include sets of nodes with tied parameters. Growth in the size of available training data and the availability of computing power, combined with algorithmic advances such as piecewise linear units and dropout training, have greatly improved many computer vision tasks. With extremely large data sets, such as those available for many tasks today, overfitting is not a major concern, and increasing the size of the network improves test accuracy. The optimal use of computing resources then becomes the limiting factor. To this end, a distributed, scalable implementation of deep neural networks may be used.

FIG. 2 is an exemplary diagram showing the configuration of an artificial intelligence conversation providing system according to an embodiment of the present invention.

As shown in FIG. 2, the artificial intelligence conversation providing system 200 may include a game engine 210, an A/D converter 220, a noise removal unit 230, a feature extraction unit 240, a vocabulary dictionary 250, a text conversion unit 260, a rule-based dialogue system 270, and an AI dialogue system 280.

Beyond its original purpose of game development, the game engine 210 is used for many kinds of development, such as mobile apps, AR, and VR. According to an embodiment, the electronic device 100 for providing artificial intelligence conversations according to the present invention may be built as a general-purpose platform that can be applied to and linked with game engines such as Unity and Unreal, and may connect voice recognition, STT, and translation systems to facilitate communication among global metaverse users.

The A/D converter 220 may sample the voice signal received from the user (UE) and convert it into a digital signal. According to an embodiment, the A/D converter 220 may convert the voice signal, which is an analog signal, into a digital signal that the electronic device can recognize.
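The sampling step can be sketched as evaluating a continuous signal at fixed intervals and mapping each value to an integer code. This is only an illustration: the function name, the assumed input range of [-1.0, 1.0], and the 8-bit default resolution are assumptions, and a real A/D converter performs this in hardware.

```python
def sample_and_quantize(signal, num_samples, sample_rate_hz, bits=8):
    """Sample `signal` (a function of time in seconds returning a value
    in [-1.0, 1.0]) and quantize each sample to an unsigned integer code
    in the range 0 .. 2**bits - 1."""
    levels = 2 ** bits
    codes = []
    for i in range(num_samples):
        value = signal(i / sample_rate_hz)
        value = max(-1.0, min(1.0, value))        # clip out-of-range input
        code = int((value + 1.0) / 2.0 * levels)  # map [-1, 1] onto 0..levels
        codes.append(min(levels - 1, code))       # keep +1.0 inside the range
    return codes
```

For example, a constant input of 0.0 maps to the mid-scale code 128 at 8 bits, and inputs of -1.0 and +1.0 map to 0 and 255.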

The noise removal unit 230 may remove the noise signal from the digital signal produced by the A/D converter 220. According to an embodiment, the noise removal unit 230 may remove the noise signal from the digital signal by performing band filtering.
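One simple way to realize band filtering on a digital signal is to transform it to the frequency domain, zero every component outside the pass band, and transform back. The sketch below uses a naive O(n²) discrete Fourier transform so that it needs only the standard library. The function name and parameters are assumptions, and the source does not state which filtering method is actually used.

```python
import cmath
import math


def band_pass(samples, sample_rate_hz, low_hz, high_hz):
    """Return `samples` with all frequency components outside
    [low_hz, high_hz] removed (ideal "brick-wall" band-pass)."""
    n = len(samples)
    # Forward DFT.
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Zero the bins outside the band. Bin k and bin n - k carry the same
    # physical frequency, so both are kept or dropped together.
    for k in range(n):
        freq_hz = min(k, n - k) * sample_rate_hz / n
        if not (low_hz <= freq_hz <= high_hz):
            spectrum[k] = 0j
    # Inverse DFT; the imaginary parts cancel for real input.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]
```

A production system would more likely use an FFT library and a filter with a smoother frequency response, since a brick-wall cutoff can introduce ringing on real speech.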

The feature extraction unit 240 may extract keywords (features) by removing postpositions, punctuation marks, spaces, and the like after the noise signal has been removed from the digital signal, and the extracted keywords may be converted into words stored in the vocabulary dictionary 250.
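The removal of postpositions and punctuation can be illustrated with a deliberately naive sketch for Korean text. The particle list, the stop-word set, and the example sentence are hypothetical, and stripping suffixes this way will mis-handle words that genuinely end in one of these syllables; a practical system would rely on a morphological analyzer instead.

```python
import string

# Tiny, hypothetical list of Korean postpositions (particles) to strip.
PARTICLES = ("은", "는", "이", "가", "을", "를", "의")
# Hypothetical stop words that carry no keyword content.
STOPWORDS = {"어때"}


def extract_keywords(sentence):
    """Split on whitespace, strip ASCII punctuation and one trailing
    particle per token, and drop stop words."""
    keywords = []
    for token in sentence.split():
        token = token.strip(string.punctuation)
        for particle in PARTICLES:
            # Never strip a token down to nothing.
            if len(token) > len(particle) and token.endswith(particle):
                token = token[: -len(particle)]
                break
        if token and token not in STOPWORDS:
            keywords.append(token)
    return keywords
```

For an input along the lines of "내일 설악산의 날씨는 어때?" ("How is the weather at Mt. Seorak tomorrow?"), this yields the three keywords used in the weather example: 내일 (tomorrow), 설악산 (Mt. Seorak), and 날씨 (weather).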

The text conversion unit 260 may form a text signal composed of words stored in the vocabulary dictionary 250.

The rule-based dialogue system 270 may form a knowledge base response signal by matching keywords extracted from the text signal against the database.
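Keyword matching against a database can be sketched as choosing the stored entry that shares the most keywords with the query. The data layout (a list of keyword-set and response pairs), the overlap score, and the `min_overlap` threshold are assumptions for illustration; the source does not describe the matching rule in this detail.

```python
def match_response(keywords, database, min_overlap=2):
    """Return the response whose keyword set overlaps most with the
    query keywords, or None when no entry reaches `min_overlap`.

    `database` is a list of (frozenset_of_keywords, response) pairs.
    Ties are resolved in favor of the earlier entry.
    """
    query = set(keywords)
    best_response, best_score = None, 0
    for entry_keywords, response in database:
        score = len(query & entry_keywords)
        if score > best_score:
            best_response, best_score = response, score
    return best_response if best_score >= min_overlap else None
```

Returning `None` for weak matches gives the caller a clear signal that the rule-based path has failed and that another response strategy is needed.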

The AI dialogue system 280 may use the knowledge base response signal to form an emotion base response signal in consideration of the user's age, personality, intimacy, and the like. The emotion base response signal may be converted into a voice output signal and output to the user (UE) through the speaker 150.

FIG. 3 is a diagram for explaining the training of a neural network according to an embodiment of the present invention.

As shown in FIG. 3, the learning device may train the neural network 114 to recognize feature points included in an image of an object. According to an embodiment, the learning device may be an entity separate from the electronic device 100 for providing artificial intelligence conversations, but is not limited thereto.

The neural network 114 includes an input layer 112 to which training samples are input and an output layer 116 that outputs training outputs, and may be trained based on the differences between the training outputs and labels. Here, the labels may be defined based on the degree of the user's reaction to the voice output signal. The neural network 114 consists of a plurality of nodes connected in groups, and is defined by the weights between the connected nodes and by the activation function that activates the nodes.

The learning device may train the neural network 114 using the GD (Gradient Descent) technique or the SGD (Stochastic Gradient Descent) technique. The learning device may use a loss function designed from the outputs of the neural network and the labels.

The learning device may calculate a training error using a predefined loss function. The loss function may be predefined with the labels, outputs, and parameters as input variables, where the parameters may be set by the weights in the neural network 114. For example, the loss function may be designed in an MSE (Mean Square Error) form, an entropy form, or the like, and various techniques or methods may be employed in embodiments in which the loss function is designed.
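For reference, common textbook forms of the two loss families named above can be written as follows. The notation is introduced here and is not taken from the source: $N$ is the number of training samples, $x_i$ a training input, $y_i$ its label, $f(x_i;\theta)$ the network output under weights $\theta$, and $C$ the number of classes (for example, the high/medium/low reaction levels).

```latex
% Mean squared error between labels and network outputs
L_{\mathrm{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \bigl\lVert y_i - f(x_i;\theta) \bigr\rVert^{2}

% Cross-entropy for one-hot labels y_{i,c} and predicted class probabilities f_c
L_{\mathrm{CE}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \, \log f_c(x_i;\theta)
```

Both expressions depend on the labels, the outputs, and the weights, which matches the description of the loss function's input variables.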

The learning device may use the backpropagation technique to identify the weights that affect the training error. Here, the weights represent the relationships between nodes in the neural network 114. To optimize the weights identified through backpropagation, the learning device may use the SGD technique with the labels and outputs. For example, the learning device may use the SGD technique to update the weights on which the loss function, defined based on the labels, outputs, and weights, depends.
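The SGD update can be illustrated on the smallest possible model, a single linear unit y ≈ w·x + b trained on squared error, where the gradient that backpropagation would compute can be written by hand. The function name, learning rate, epoch count, and fixed random seed are assumptions chosen for this sketch; a multi-layer network would obtain the same kind of gradients layer by layer through backpropagation.

```python
import random


def sgd_fit(samples, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b to (x, y) pairs with stochastic gradient descent.

    Each step uses one sample and moves the weights against the
    gradient of 0.5 * (prediction - label) ** 2.
    """
    rng = random.Random(seed)
    data = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)            # visit samples in random order
        for x, y in data:
            error = (w * x + b) - y  # training error for this sample
            w -= lr * error * x      # d(loss)/dw = error * x
            b -= lr * error          # d(loss)/db = error
    return w, b
```

On noise-free data generated from y = 2x + 1, this loop recovers weights close to w = 2 and b = 1.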

According to an embodiment, the learning device may obtain a voice signal of a training subject and extract training voice output signals from that voice signal. The learning device may obtain information labeled in advance for each of the training voice output signals (first labels); specifically, it may obtain first labels indicating the degree of the user's reaction (e.g., high, medium, or low) to the predefined voice output signal corresponding to each training voice output signal.

According to an embodiment, the learning device may generate first training feature vectors based on the appearance features, pattern features, and tone features of the training voice output signals. Various methods may be employed to extract the features of the training voice output signals.

According to an embodiment, the learning device may obtain training outputs by applying the first training feature vectors to the neural network 114. The learning device may train the neural network 114 based on the training outputs and the first labels. The learning device may train the neural network 114 by calculating the training errors corresponding to the training outputs and optimizing the connection relationships of the nodes in the neural network 114 so as to minimize those training errors. The electronic device 100 may use the trained neural network 114 to form a voice output signal from a voice signal.

FIGS. 4 and 5 are flowcharts showing the procedure of a method for providing artificial intelligence conversations according to an embodiment of the present invention. Although the process steps, method steps, algorithms, and the like are described in sequential order in the flowcharts of FIGS. 4 and 5, such processes, methods, and algorithms may be configured to operate in any suitable order. In other words, the steps of the processes, methods, and algorithms described in the various embodiments of the present invention need not be performed in the order described herein. In addition, although some steps are described as being performed non-simultaneously, in other embodiments those steps may be performed simultaneously. Furthermore, the illustration of a process in the drawings does not mean that the illustrated process excludes other changes and modifications to it, does not mean that the illustrated process or any of its steps is essential to one or more of the various embodiments of the present invention, and does not mean that the illustrated process is preferred.

As shown in FIG. 4, in step S410, the user's voice signal is received. For example, referring to FIGS. 1 and 2, the processor 110 of the electronic device 100 for providing artificial intelligence conversations may receive the user's voice signal. According to an embodiment, the processor 110 may receive the user's voice signal through the microphone 140 of the electronic device 100 for providing artificial intelligence conversations.

In step S420, the voice signal is converted into a text signal. For example, referring to FIGS. 1 and 2, the processor 110 of the electronic device 100 for providing artificial intelligence conversation may convert the user's voice signal received in step S410 into a text signal. According to an embodiment, the processor 110 may sample the received voice signal with an analog-to-digital (A/D) converter to obtain a digital signal, remove the noise signal from it, and then convert the result into a text signal.
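
The sampling and noise-removal part of step S420 can be illustrated with a minimal sketch. The function names, the 16-bit depth, the 16 kHz rate, and the moving-average filter are all assumptions made for illustration; the source does not specify how the A/D converter or the noise removal is implemented, and the speech-to-text conversion itself is not shown.

```python
import math


def sample_signal(signal, num_samples, sample_rate_hz):
    """Evaluate a continuous-time signal at evenly spaced sample instants."""
    return [signal(i / sample_rate_hz) for i in range(num_samples)]


def quantize(samples, bits=16):
    """Clip to [-1.0, 1.0], then map to signed integers of the given bit depth."""
    max_level = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, s)) * max_level) for s in samples]


def moving_average(samples, window=3):
    """Crude noise reduction: average each sample with its neighbours.

    This is a stand-in for the unspecified noise removal step.
    """
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        neighbours = samples[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


def tone(t):
    """Hypothetical input: a 440 Hz sine wave standing in for a voice signal."""
    return math.sin(2 * math.pi * 440 * t)


digital = quantize(sample_signal(tone, num_samples=160, sample_rate_hz=16000))
cleaned = moving_average(digital)
```

A real device would more likely use a spectral or adaptive noise filter; the moving average is used here only because it is short and easy to verify.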

In step S430, keywords are extracted by performing natural language processing on the text signal. For example, referring to FIGS. 1 to 3, the processor 110 of the electronic device 100 for providing artificial intelligence conversation may extract features useful for speech recognition from the text signal converted in step S420 to generate a feature vector, compare the generated feature vector against a dictionary of recognition-target vocabulary prepared in advance during training, and extract the most similar word as a keyword.
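
One way to picture the dictionary lookup in step S430 is the sketch below. It assumes character-bigram counts as the feature vector and cosine similarity as the similarity measure; both choices, the threshold value, and all function names are hypothetical, since the source does not say which features or similarity measure are used.

```python
import math
from collections import Counter


def bigram_vector(word):
    """Feature vector: counts of character bigrams, with boundary markers."""
    padded = f"^{word.lower()}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract_keywords(text, vocabulary, threshold=0.5):
    """Map each token to its most similar vocabulary word, if similar enough."""
    if not vocabulary:
        return []
    vocab_vectors = {word: bigram_vector(word) for word in vocabulary}
    keywords = []
    for token in text.split():
        vec = bigram_vector(token)
        best_word = max(vocab_vectors, key=lambda w: cosine(vec, vocab_vectors[w]))
        if cosine(vec, vocab_vectors[best_word]) >= threshold and best_word not in keywords:
            keywords.append(best_word)
    return keywords


print(extract_keywords("what is the wether today", ["weather", "music", "today"]))
```

Because the comparison is by similarity rather than exact match, a misrecognized token such as "wether" can still map to the dictionary word "weather".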

In step S440, a knowledge base response signal is formed by matching the extracted keywords against a database. For example, referring to FIGS. 1 to 3, the processor 110 of the electronic device 100 for providing artificial intelligence conversation may receive a plurality of voice data according to gender, region, age, and multi-party speech, update the database using the plurality of voice data, and form the knowledge base response signal by matching the text signal against the database.

In step S450, an emotion base response signal is formed from the knowledge base response signal in consideration of the user's age, personality, and intimacy. For example, referring to FIGS. 1 to 3, the processor 110 of the electronic device 100 for providing artificial intelligence conversation may form the emotion base response signal using machine learning, in particular a deep reinforcement learning algorithm such as deep learning.

In step S460, the emotion base response signal is converted into a voice output signal. For example, referring to FIGS. 1 to 3, the processor 110 of the electronic device 100 for providing artificial intelligence conversation may convert the emotion base response signal into a voice output signal using a TTS (Text to Speech) recognition engine, and may output the voice output signal to the user (UE) through the speaker 150.

As shown in FIG. 5, in step S510, a plurality of voice data is received. For example, referring to FIGS. 1 to 4, the processor 110 of the electronic device 100 for providing artificial intelligence conversation may receive a plurality of voice data according to gender, region, age, and multi-party speech. According to an embodiment, the processor 110 may receive the plurality of voice data through various channels such as the Internet and social network services (SNS).

In step S520, the database is updated. For example, referring to FIGS. 1 to 4, the processor 110 of the electronic device 100 for providing artificial intelligence conversation may update the database using the plurality of voice data received in step S510. According to an embodiment, the processor 110 may classify the received voice data by gender, region, age, multi-party speech, and the like, and continuously update the database with the classified data.
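
The classification and update of step S520 could be organized as in the following sketch. The `VoiceSample` fields, the decade-based age bands, and the dictionary-of-lists database are assumptions chosen for illustration; the source states only that voice data is classified by gender, region, age, and multi-party speech before the database is updated.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class VoiceSample:
    """Hypothetical record for one received item of voice data."""
    gender: str
    region: str
    age: int
    speaker_count: int  # more than one speaker is treated as multi-party speech
    transcript: str


def age_band(age):
    """Bucket an age into a decade label such as '20s' (assumed scheme)."""
    return f"{(age // 10) * 10}s"


def category_of(sample):
    """Classification key: gender, region, age band, multi-party flag."""
    return (sample.gender, sample.region, age_band(sample.age), sample.speaker_count > 1)


def update_database(database, samples):
    """Append each sample's transcript under its classification key."""
    for sample in samples:
        database[category_of(sample)].append(sample.transcript)
    return database


database = update_database(defaultdict(list), [
    VoiceSample("female", "Seoul", 27, 1, "hello"),
    VoiceSample("female", "Seoul", 23, 1, "hi"),
    VoiceSample("male", "Busan", 41, 3, "let's go"),
])
```

Calling `update_database` again with newly received samples extends the same categories, which corresponds to the continuous update described above.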

In step S530, a knowledge base response signal is formed. For example, referring to FIGS. 1 to 4, the processor 110 of the electronic device 100 for providing artificial intelligence conversation may form the knowledge base response signal by matching the keywords extracted in step S430 against the database updated in step S520.
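
The keyword matching of step S530 can be sketched as a simple retrieval over keyword sets. The database layout (a list of keyword-set and response pairs), the overlap score, and the function name are assumptions made for illustration and are not taken from the source.

```python
def form_knowledge_response(keywords, database):
    """Return the response whose keyword set overlaps most with `keywords`.

    Returns None when no entry shares any keyword, meaning that a
    knowledge base response cannot be formed by matching alone.
    Ties are resolved in favour of the earlier database entry.
    """
    query = set(keywords)
    best_response, best_overlap = None, 0
    for entry_keywords, response in database:
        overlap = len(query & set(entry_keywords))
        if overlap > best_overlap:
            best_response, best_overlap = response, overlap
    return best_response


# Hypothetical database contents.
database = [
    ({"weather", "today"}, "It is sunny today."),
    ({"music"}, "Playing your playlist."),
]

print(form_knowledge_response(["weather", "today"], database))
print(form_knowledge_response(["pizza"], database))
```

The `None` result marks the case in which matching fails. The claims describe using a generative model in that case; that fallback is not sketched here.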

Although the method has been described through specific embodiments, it can also be implemented as computer-readable code on a computer-readable recording medium. A computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed across computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the above embodiments can be readily inferred by programmers in the technical field to which the present invention belongs.

The present invention has been described in detail above using preferred embodiments, but its scope is not limited to any specific embodiment and should be interpreted according to the appended claims. Those of ordinary skill in the art will also understand that many modifications and variations are possible without departing from the scope of the present invention.

100: electronic device 110: processor
120: memory 130: transceiver
140: microphone 150: speaker
200: artificial intelligence conversation providing system 210: game engine
220: A/D converter 230: noise removal unit
240: feature extraction unit 250: vocabulary dictionary
260: text conversion unit 270: rule-based dialog system
280: AI dialog system 112: input layer
114: neural network 116: output layer

Claims

An electronic device for providing artificial intelligence conversation, comprising:
one or more processors; and
one or more memories storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations,
wherein the one or more processors:
receive a user's voice signal;
convert the voice signal into a text signal;
extract keywords by performing natural language processing on the text signal;
form a knowledge base response signal by matching the extracted keywords against a database;
form an emotion base response signal from the knowledge base response signal in consideration of the user's age, personality, and intimacy; and
convert the emotion base response signal into a voice output signal,
wherein the one or more processors:
assign a weight to the voice output signal according to the user's degree of response, using the user's feedback on the voice output signal;
when the user responds positively to a specific voice output signal and shows no response or a negative response to another voice output signal,
assign a relatively high weight to the specific voice output signal to which the user responded positively; and
update the database with the specific voice output signal to which the relatively high weight is assigned, so that the specific voice output signal can be considered when forming the emotion base response signal,
wherein the one or more processors derive appropriate answers from voice conversation history data and perform keyword training,
wherein the one or more processors receive a plurality of voice data according to gender, region, age, and multi-party speech, update the database using the plurality of voice data, use a retrieval-based model that forms the knowledge base response signal by matching the extracted keywords against the database, and, when the extracted keywords cannot be matched against the database, determine that the knowledge base response signal cannot be formed and use a generative model that forms the knowledge base response signal from the extracted keywords using a machine learning algorithm,
wherein the one or more processors derive a 5G cloud immersive content architecture and additionally reflect native services provided by cloud operators on the basis of the existing architecture, and
wherein the one or more processors provide text-to-speech (TTS) voice output that matches a real human voice.

The electronic device of claim 1,
wherein the one or more processors
perform natural language processing on the received voice signal through sentence separation, spacing correction, spelling correction, abbreviation expansion, and symbol removal procedures.
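
The preprocessing procedures named in this claim can be pictured with the following minimal sketch. The lookup tables, regular expressions, and function names are hypothetical, the steps are applied in an order chosen for convenience, and the spacing correction shown here only collapses whitespace; spacing correction for Korean text would require word segmentation, which is not attempted.

```python
import re

# Hypothetical lookup tables; a real system would use far larger resources.
SPELLING = {"teh": "the", "wether": "weather"}
ABBREVIATIONS = {"pls": "please", "asap": "as soon as possible"}


def split_sentences(text):
    """Sentence separation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def normalize_sentence(sentence):
    """Apply symbol removal, spacing, spelling, and abbreviation steps."""
    sentence = re.sub(r"[^\w\s]", " ", sentence)      # symbol removal
    sentence = re.sub(r"\s+", " ", sentence).strip()  # spacing correction
    words = []
    for word in sentence.lower().split():
        word = SPELLING.get(word, word)               # spelling correction
        word = ABBREVIATIONS.get(word, word)          # abbreviation expansion
        words.append(word)
    return " ".join(words)


def preprocess(text):
    """Run the full pipeline and return one normalized string per sentence."""
    return [normalize_sentence(s) for s in split_sentences(text)]


print(preprocess("Tell me teh   wether, pls!  Is it cold?"))
```

Sentence separation runs first because it relies on the punctuation that symbol removal later strips.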