KR102418465B1

KR102418465B1 - Server, method and computer program for providing voice reading service of story book

Info

Publication number: KR102418465B1
Application number: KR1020190097928A
Authority: KR
Inventors: 이가희; 차재욱; 박정석
Original assignee: 주식회사 케이티
Priority date: 2019-08-12
Filing date: 2019-08-12
Publication date: 2022-07-07
Also published as: KR20210019151A

Abstract

A server providing a fairy tale reading service includes: a registration unit that registers, from a user terminal, user voice data uttered by a user; a selection unit that receives, from the user terminal, a selection of one fairy tale content, among a plurality of pre-registered fairy tale contents, to be provided as the fairy tale reading service; a fairy tale content analysis unit that analyzes a plurality of sentences included in the selected fairy tale content and tags onomatopoeia information on at least one onomatopoeia included in the analyzed sentences; an onomatopoeic sound source selection unit that selects one onomatopoeic sound source from among a plurality of pre-stored onomatopoeic sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information; a fairy tale reading sound source generation unit that generates a fairy tale reading sound source by synthesizing the selected onomatopoeic sound source and the registered user voice data; and a providing unit that provides the generated fairy tale reading sound source to the user terminal, wherein the fairy tale reading sound source is generated such that the at least one onomatopoeia is expressed based on the user's voice.

Description

Server, method and computer program providing fairy tale reading service {SERVER, METHOD AND COMPUTER PROGRAM FOR PROVIDING VOICE READING SERVICE OF STORY BOOK}

The present invention relates to a server, a method, and a computer program for providing a fairy tale reading service.

An intelligent personal assistant is a software agent that handles tasks requested by a user and provides services specialized for that user. Based on an artificial intelligence (AI) engine and voice recognition, it collects and provides information tailored to the user and performs functions such as schedule management, sending e-mail, and restaurant reservation in response to the user's voice commands, which improves user convenience.

Such intelligent personal assistants are mainly provided on smartphones in the form of customized personal services; representative examples include Apple's Siri, Google's Now, and Samsung's Bixby. In this regard, Korean Patent Laid-Open Publication No. 2016-0071111, a prior art document, discloses a method of providing a personal assistant service in an electronic device.

Recently, it has become possible to receive a fairy tale reading service based on the user's voice through an intelligent personal assistant. However, a fairy tale reading service based on the user's voice has the drawback that the onomatopoeia of various animals is not expressed well.

To express the onomatopoeia included in fairy tale content based on the user's voice, the present invention aims to provide a server, a method, and a computer program for a fairy tale reading service that tag onomatopoeia information on at least one onomatopoeia included in the fairy tale content, select an onomatopoeic sound source based on the tagged onomatopoeia information, and synthesize the selected onomatopoeic sound source with user voice data.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means for solving the above technical problem, an embodiment of the present invention may provide a fairy tale reading service providing server including: a registration unit that registers, from a user terminal, user voice data uttered by a user; a selection unit that receives, from the user terminal, a selection of one fairy tale content, among a plurality of pre-registered fairy tale contents, to be provided as the fairy tale reading service; a fairy tale content analysis unit that analyzes a plurality of sentences included in the selected fairy tale content and tags onomatopoeia information on at least one onomatopoeia included in the analyzed sentences; an onomatopoeic sound source selection unit that selects one onomatopoeic sound source from among a plurality of pre-stored onomatopoeic sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information; a fairy tale reading sound source generation unit that generates a fairy tale reading sound source by synthesizing the selected onomatopoeic sound source and the registered user voice data; and a providing unit that provides the generated fairy tale reading sound source to the user terminal, wherein the fairy tale reading sound source is generated such that the at least one onomatopoeia is expressed based on the user's voice.

Another embodiment of the present invention may provide a method for providing a fairy tale reading service, including: registering, from a user terminal, user voice data uttered by a user; receiving, from the user terminal, a selection of one fairy tale content, among a plurality of pre-registered fairy tale contents, to be provided as the fairy tale reading service; analyzing a plurality of sentences included in the selected fairy tale content; tagging onomatopoeia information on at least one onomatopoeia included in the analyzed sentences; selecting one onomatopoeic sound source from among a plurality of pre-stored onomatopoeic sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information; generating a fairy tale reading sound source by synthesizing the selected onomatopoeic sound source and the registered user voice data; and providing the generated fairy tale reading sound source to the user terminal, wherein the fairy tale reading sound source is generated such that the at least one onomatopoeia is expressed based on the user's voice.

Yet another embodiment of the present invention may provide a computer program stored in a medium, the computer program including a sequence of instructions that, when executed by a computing device, cause the computing device to: register, from a user terminal, user voice data uttered by a user; receive, from the user terminal, a selection of one fairy tale content, among a plurality of pre-registered fairy tale contents, to be provided as the fairy tale reading service; analyze a plurality of sentences included in the selected fairy tale content; tag onomatopoeia information on at least one onomatopoeia included in the analyzed sentences; select one onomatopoeic sound source from among a plurality of pre-stored onomatopoeic sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information; generate a fairy tale reading sound source by synthesizing the selected onomatopoeic sound source and the registered user voice data; and provide the generated fairy tale reading sound source to the user terminal, wherein the fairy tale reading sound source is generated such that the at least one onomatopoeia is expressed based on the user's voice.

The above means for solving the problem are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description of the invention.

According to any one of the above means for solving the problem, the present invention can provide a server, a method, and a computer program that offer a realistic fairy tale reading service. Conventionally, the user imitated animal sounds, so the expression of animal onomatopoeia was awkward; here, user voice data and animal onomatopoeia are synthesized for the onomatopoeic portions of the fairy tale content.

The present invention can also provide a server, a method, and a computer program for a fairy tale reading service in which a user records and registers their own voice and the sounds of their own pet dog, and then receives the fairy tale content as a fairy tale reading sound source in which the user's voice and the dog's onomatopoeia are synthesized.

FIG. 1 is a block diagram of a fairy tale reading service providing server according to an embodiment of the present invention.
FIGS. 2A and 2B are exemplary diagrams for explaining a process of registering user voice data and animal onomatopoeia sound sources according to an embodiment of the present invention.
FIGS. 3A and 3B are exemplary diagrams for explaining a process of learning fairy tale content and a user's voice according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram for explaining a process of analyzing a plurality of sentences included in fairy tale content and tagging onomatopoeia information according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram illustrating selected onomatopoeic sound sources according to an embodiment of the present invention.
FIGS. 6A and 6B are exemplary diagrams for explaining a process of extracting a similar onomatopoeic sound source when no onomatopoeic sound source associated with the object type corresponding to an onomatopoeia is extracted, according to an embodiment of the present invention.
FIG. 7 is an exemplary diagram for explaining a process of synthesizing an onomatopoeic sound source and user voice data according to an embodiment of the present invention.
FIG. 8 is an exemplary diagram illustrating the spectra of an existing model, an actual animal sound, and a user-voice-based onomatopoeia according to an embodiment of the present invention.
FIG. 9 is a flowchart of a method for providing a fairy tale reading service in a server according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are denoted by similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another element interposed between them. Likewise, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it does not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by a single piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Similarly, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a fairy tale reading service providing server according to an embodiment of the present invention. Referring to FIG. 1, the fairy tale reading service providing server 100 may include a registration unit 110, a selection unit 120, a learning unit 130, a fairy tale content analysis unit 140, a sound source selection unit 150, a fairy tale reading sound source generation unit 160, and a providing unit 170.

The registration unit 110 may register, from the user terminal, user voice data uttered by the user. The registration unit 110 may also register, from the user terminal, an animal onomatopoeia sound source in which an animal's sound is recorded. The process of registering user voice data and animal onomatopoeia sound sources is described in detail with reference to FIGS. 2A and 2B.

FIGS. 2A and 2B are exemplary diagrams for explaining a process of registering user voice data and animal onomatopoeia sound sources according to an embodiment of the present invention.

Referring to FIG. 2A, the registration unit 110 may register, from the user terminal through a voice registration section 200, the voice of the user who wishes to read the fairy tale. For example, based on a preset script that specifies read-aloud-style utterances and conversational-style utterances, the registration unit 110 may register (204) from the user terminal at least one of first user voice data (201, SPK1) uttered by the user in a read-aloud style and second user voice data (202, SPK2) uttered in a conversational style, and store it in a database. As another example, the registration unit 110 may register (204), through an add button 203, other-user voice data in which another user's voice is recorded, and store it in the database.

The registration unit 110 may register, from the user terminal through an onomatopoeia registration section 210, an animal onomatopoeia sound source in which an animal's sound is recorded. For example, if the user wants to imitate the sound of a cat 211, the registration unit 110 may register (214) from the user terminal a cat onomatopoeia sound source in which the cry of the cat 211 is recorded, and store it in the database. As another example, if the user keeps a dog 212, the registration unit 110 may register (215) from the user terminal a dog onomatopoeia sound source in which the barking of the user's own dog 212 is recorded, and store it in the database. In addition, the registration unit 110 may register onomatopoeia sound sources of various other animals from the user terminal through an add button 213 and store them in the database.

Referring to FIG. 2B, the registration unit 110 may receive from the user terminal, through a voice selection section 220, a selection of the user voice data to be applied to the fairy tale reading service from among the plurality of voice data stored in the database. For example, the registration unit 110 may receive from the user terminal a selection of the first user voice data (201, SPK1), uttered in a read-aloud style, for a narration item 221, and a selection of the second user voice data (202, SPK2), uttered in a conversational style, for a dialogue item 222. As another example, the registration unit 110 may receive a selection of the user's own voice data uttered in a read-aloud style for the narration item 221, and a selection of other-user voice data uttered in a conversational style for the dialogue item 222.

The registration unit 110 may receive, through an onomatopoeia selection section 230, a selection of one of the onomatopoeia sound sources registered by the user terminal from among the plurality of animal onomatopoeia sound sources. For example, when the cat 211 is selected (231) through the onomatopoeia selection section 230, the registration unit 110 may cause the user-voice-based onomatopoeia of the cat 211 to be provided based on the cat onomatopoeia sound source registered by the user. As another example, when the dog 212 is not selected (232) through the onomatopoeia selection section 230, the registration unit 110 may cause the user-voice-based onomatopoeia of the dog 212 to be provided based on the dog onomatopoeia sound source provided by default.

Returning to FIG. 1, the selection unit 120 may receive, from the user terminal, a selection of one fairy tale content, among a plurality of pre-registered fairy tale contents, to be provided as the fairy tale reading service. For example, the selection unit 120 may receive from the user terminal a selection of "Puss in Boots" from among the pre-registered fairy tale contents.

The learning unit 130 may learn phonemes based on each text constituting the fairy tale content and train a learning model that adjusts the learned phonemes to match at least one registered user voice. The process of training the learning model is described in detail with reference to FIGS. 3A and 3B.

FIGS. 3A and 3B are exemplary diagrams for explaining a process of learning fairy tale content and a user's voice according to an embodiment of the present invention.

Referring to FIG. 3A, the learning unit 130 may extract from a text database 300 each text constituting the fairy tale content selected by the selection unit 120, learn phonemes from each extracted text through an encoding part, extract voice files matching the phonemes from a voice database 310 through a decoding part, and adjust the extracted voice files through an attention layer and a decode layer.

Referring to FIG. 3B, the learning unit 130 may extract user voice data from a user voice database 320, train Tacotron by replacing the attention layer and the decode layer with the extracted user voice data, and store the training result in the user voice database 320. Tacotron may perform the training through an encoder, a decoder, and waveform conversion using the Griffin-Lim algorithm.

In this process, the trained Tacotron may be used so that the user's voice used for reading the fairy tale can be synthesized with the text of the fairy tale content.

Returning to FIG. 1, the fairy tale content analysis unit 140 may analyze a plurality of sentences included in the selected fairy tale content. For example, the fairy tale content analysis unit 140 may tag dialogue information on dialogue sentences extracted from the analyzed sentences based on preset identification symbols (for example, double quotation marks (" ") for dialogue and single quotation marks (' ') for monologue), tag onomatopoeia information on text extracted from the analyzed sentences based on the onomatopoeia registered in a Korean dictionary, and tag narration information on the remaining sentences. The process of analyzing the fairy tale content and tagging each kind of information is described in detail with reference to FIG. 4.

FIG. 4 is an exemplary diagram for explaining a process of analyzing a plurality of sentences included in fairy tale content and tagging onomatopoeia information according to an embodiment of the present invention. Referring to FIG. 4, the fairy tale content analysis unit 140 may analyze a plurality of sentences included in the fairy tale content and tag onomatopoeia information on at least one onomatopoeia included in the analyzed sentences.

For example, when a sentence included in the fairy tale content is "The cat made a meow sound" (고양이는 야옹 소리를 내었다) (400), the fairy tale content analysis unit 140 may tag onomatopoeia information 420 on 'meow' (야옹), which corresponds to the cat onomatopoeia registered in the Korean dictionary, and tag narration information 410 on the rest of the sentence, namely 'the cat' (고양이는), 'sound' (소리를), and 'made' (내었다).
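The tagging described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the lexicon `ONOMATOPOEIA_LEXICON`, the function name `tag_sentence`, and the tag strings are all assumptions introduced here, whitespace splitting stands in for real Korean morphological analysis, and single-quoted monologue is not handled.

```python
import re

# Hypothetical stand-in for the onomatopoeia entries of a Korean dictionary.
ONOMATOPOEIA_LEXICON = {"야옹", "멍멍", "으르렁"}


def tag_sentence(sentence, lexicon=ONOMATOPOEIA_LEXICON):
    """Return (text, tag) pairs with tag in {dialogue, onomatopoeia, narration}."""
    tagged = []
    # Separate double-quoted spans (treated as dialogue) from everything else.
    for segment in re.findall(r'"[^"]*"|[^"]+', sentence):
        if segment.startswith('"'):
            tagged.append((segment.strip('"'), "dialogue"))
            continue
        # Outside quotes, a token is onomatopoeia if the lexicon lists it;
        # every other token is narration.
        for token in segment.split():
            tag = "onomatopoeia" if token in lexicon else "narration"
            tagged.append((token, tag))
    return tagged


if __name__ == "__main__":
    for text, tag in tag_sentence("고양이는 야옹 소리를 내었다"):
        print(text, tag)
```

In this sketch a quoted span is tagged as a whole, so an onomatopoeia inside dialogue would not be tagged separately; a fuller version would need to decide how the two tags interact.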

Returning to FIG. 1, the sound source selection unit 150 may select one onomatopoeic sound source from among a plurality of pre-stored onomatopoeic sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information.

To this end, the sound source selection unit 150 may determine the object type corresponding to the onomatopoeia based on the at least one tagged onomatopoeia. For example, when the onomatopoeia tagged with onomatopoeia information in a sentence analyzed by the fairy tale content analysis unit 140 is 'meow' (야옹), the sound source selection unit 150 may determine that the object type corresponding to the onomatopoeia is 'cat'.

The sound source selection unit 150 may extract and select, from among the pre-stored onomatopoeic sound sources, an onomatopoeic sound source associated with the determined object type. For example, the sound source selection unit 150 may extract cat onomatopoeia sound sources from among the pre-stored onomatopoeic sound sources. If both a first cat onomatopoeia sound source provided by the service provider and a second cat onomatopoeia sound source registered by the user terminal are extracted, the sound source selection unit 150 may preferentially select the second cat onomatopoeia sound source registered by the user terminal. The process of selecting an onomatopoeic sound source associated with an object type is described in detail with reference to FIG. 5.
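The selection logic above (onomatopoeia to object type, then a user-registered source in preference to a service-provided one) might be sketched as below. The lookup table `OBJECT_TYPE_BY_ONOMATOPOEIA`, the `SoundSource` record, the provider labels, and the file name `Cat_default.wav` are hypothetical names chosen for illustration; the source does not specify how the object type is determined or how sources are stored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SoundSource:
    onomatopoeia: str  # tagged onomatopoeia text
    object_type: str   # e.g. "cat"
    provider: str      # "user" or "service" (assumed labels)
    file: str          # audio file name


# Hypothetical mapping from an onomatopoeia to the object that makes the sound.
OBJECT_TYPE_BY_ONOMATOPOEIA = {"야옹": "cat", "멍멍": "dog", "으르렁": "lion"}


def select_sound_source(onomatopoeia: str,
                        sources: List[SoundSource]) -> Optional[SoundSource]:
    """Pick a source for the object type, preferring user-registered ones."""
    object_type = OBJECT_TYPE_BY_ONOMATOPOEIA.get(onomatopoeia)
    if object_type is None:
        return None
    candidates = [s for s in sources if s.object_type == object_type]
    if not candidates:
        return None  # the caller may fall back to a similar onomatopoeia
    # User-registered sources sort ahead of service-provided ones.
    return min(candidates, key=lambda s: 0 if s.provider == "user" else 1)
```

Returning `None` when nothing matches keeps the "no source extracted" case explicit, so a separate fallback step can handle it.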

FIG. 5 is an exemplary diagram illustrating selected onomatopoeic sound sources according to an embodiment of the present invention. Referring to FIG. 5, the sound source selection unit 150 may organize the onomatopoeic sound sources selected for providing the fairy tale reading service into a table. The table may consist of, for example, an onomatopoeia text 500, an onomatopoeia object type 501, a sound source provider 502, and an onomatopoeia file 503.

For example, when an animal onomatopoeia sound source registered by the user terminal is selected as the onomatopoeic sound source, the sound source selection unit 150 may fill in the table as follows: tagged onomatopoeia text 500: 'meow' (야옹); onomatopoeia object type 501: 'cat'; sound source provider 502: user; onomatopoeia file 503: 'Cat001.wav'.

As another example, when an animal onomatopoeia sound source registered by the service provider is selected as the onomatopoeic sound source, the sound source selection unit 150 may fill in the table as follows: tagged onomatopoeia text 500: 'growl' (으르렁); onomatopoeia object type 501: 'lion'; sound source provider 502: service provider; onomatopoeia file 503: 'Tiger100.wav'.

Returning to FIG. 1, when no onomatopoeia sound source associated with the determined object type is extracted, the sound source selection unit 150 may extract predetermined similar onomatopoeia sound sources for the text tagged with the onomatopoeia information, based on a phoneme similarity and a phonetic-value similarity, respectively.

That is, when the tagged onomatopoeia is one for which the service provider supplies no onomatopoeia sound source, it can be replaced with the most similar onomatopoeia sound source. Because an onomatopoeia is a word that imitates the sound of a person, object, or animal, two similarities are calculated: one from a direct comparison of the consonants and vowels of the onomatopoeia, and one from a comparison of the phonetic values produced when the onomatopoeia is pronounced. The process of extracting similar onomatopoeia sound sources is described in detail with reference to FIGS. 6A and 6B.

FIGS. 6A and 6B are exemplary diagrams for explaining the process of extracting a similar onomatopoeia sound source when no onomatopoeia sound source associated with the object type corresponding to the onomatopoeia is extracted, according to an embodiment of the present invention.

To determine the similarity to the word corresponding to the onomatopoeia, the sound source selection unit 150 may select similar onomatopoeias with respect to phonemes and with respect to phonetic values, respectively.

For example, with respect to phonemes, the sound source selection unit 150 may divide the consonants and vowels of an onomatopoeia into initial consonant, medial vowel, and final consonant, calculate the number of positions at which the consonants and vowels match together with the similarity of the syllable vectors, and select the five onomatopoeias with the highest degree of match. Hangul syllables, each built from an initial, a medial, and a final, number 11,172 in total, which has the drawback that too many cases must be computed. Therefore, the sound source selection unit 150 may reduce the number of cases by mapping both tense and aspirated initial consonants to plain consonants (e.g., ㄲ, ㅋ -> ㄱ), by mapping medial vowels to similar pronunciations such as {ㅏ, ㅑ}, {ㅓ, ㅕ}, {ㅗ, ㅛ}, {ㅜ, ㅠ}, {ㅐ, ㅔ, ㅖ, ㅒ}, {ㅙ, ㅚ}, and by applying to final consonants the same method as for initial consonants. This mapping reduces the inventory to 10 initial consonants, 10 medial vowels, and 10 final consonants. In addition, a final written with two or more consonants is mapped to its representative pronunciation (e.g., ㄵ -> ㄴ). As a result, syllables without a final (10*10 = 100) and syllables with a final (100*10 = 1,000) give a reduced total of 1,100 syllables.
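The decomposition and mapping described above can be sketched in a few lines. The sketch relies on the standard Unicode layout of precomposed Hangul syllables (19 initials, 21 medials, 28 finals including "no final"). Everything else is an assumption made for illustration: the names `PLAIN`, `VOWEL_REP`, `FINAL_REP`, `decompose`, and `reduce_syllable` are hypothetical, each vowel group is represented by its first member, only the vowel groups listed in the text are merged (other vowels are left unchanged, so the sketch does not by itself reach 10 medial classes), and only the one cluster-final example from the text (ㄵ -> ㄴ) is included.

```python
# Standard Unicode ordering of jamo inside precomposed Hangul syllables.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Tense and aspirated consonants collapse onto their plain counterpart.
PLAIN = {"ㄲ": "ㄱ", "ㅋ": "ㄱ", "ㄸ": "ㄷ", "ㅌ": "ㄷ", "ㅃ": "ㅂ",
         "ㅍ": "ㅂ", "ㅆ": "ㅅ", "ㅉ": "ㅈ", "ㅊ": "ㅈ"}

# Vowel groups listed in the text; the first member stands for the group.
VOWEL_GROUPS = ["ㅏㅑ", "ㅓㅕ", "ㅗㅛ", "ㅜㅠ", "ㅐㅔㅖㅒ", "ㅙㅚ"]
VOWEL_REP = {v: group[0] for group in VOWEL_GROUPS for v in group}

# Finals reuse the initial-consonant mapping; only the cluster example
# given in the text is added here.
FINAL_REP = {**PLAIN, "ㄵ": "ㄴ"}


def decompose(syllable):
    """Split one precomposed Hangul syllable into (initial, medial, final)."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 19 * 21 * 28:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(code, 21 * 28)
    medial, final = divmod(rest, 28)
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


def reduce_syllable(syllable):
    """Apply the class mappings to each position of the syllable."""
    initial, medial, final = decompose(syllable)
    return (PLAIN.get(initial, initial),
            VOWEL_REP.get(medial, medial),
            FINAL_REP.get(final, final))
```

For instance, `reduce_syllable("컹")` maps the aspirated initial ㅋ to ㄱ and leaves the rest unchanged.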

The reduced syllable set may then be used, in the form of one-hot encoding (a format in which each item is represented without overlap), as the input for calculating the phoneme similarity in the Hashed Syllable Recurrent Neural Network.
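One way to lay out the 1,100 reduced syllables as one-hot inputs is sketched below. The slot ordering (open syllables first, then closed syllables) and the names `syllable_index` and `one_hot` are assumptions made for illustration; the specification does not state how the slots are ordered.

```python
N_INITIAL = N_MEDIAL = N_FINAL = 10   # class counts after the mapping
N_OPEN = N_INITIAL * N_MEDIAL         # 100 syllables without a final
N_CLOSED = N_OPEN * N_FINAL           # 1,000 syllables with a final
VOCAB = N_OPEN + N_CLOSED             # 1,100 input units


def syllable_index(initial, medial, final=None):
    """Map class indices (0-9 each, final optional) to a slot in 0..1099."""
    base = initial * N_MEDIAL + medial
    if final is None:
        return base
    return N_OPEN + base * N_FINAL + final


def one_hot(index, size=VOCAB):
    """Return a vector with a single 1 at the given slot."""
    vec = [0] * size
    vec[index] = 1
    return vec
```

A syllable sequence would then be encoded as one such vector per syllable before being fed to the network.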

As another example, with respect to phonetic values, the sound source selection unit 150 may select the five onomatopoeias with the most similar pronunciation by using phoneme-like units based on a Korean pronunciation dictionary.

Referring to FIG. 6A, the sound source selection unit 150 may construct phoneme-based syllables by using Korean phoneme-like units, each comprising a phoneme-like unit 600 and its symbol 601. Here, 23 initial, 26 medial, and 22 final units are used, for a total of 10,626 input units. Because consonants are used with their altered pronunciations, such as voicing, only the number of vowels is reduced: the medials are mapped, as in the phoneme-based method, to {ㅏ, ㅑ}, {ㅓ, ㅕ}, {ㅗ, ㅛ}, {ㅜ, ㅠ}, {ㅐ, ㅔ, ㅖ, ㅒ}, {ㅙ, ㅚ}, which reduces the number of medials to 10 and the total number of syllables to 5,060.
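The unit counts can be checked with a few lines of arithmetic. The reduced figure of 5,060 follows directly from 23 × 10 × 22. As an observation only, the unreduced total of 10,626 equals 23 × 21 × 22, whereas 26 medial units would give 13,156; the variable names below are illustrative.

```python
initials, medials, finals = 23, 26, 22   # inventory sizes stated in the text
reduced_medials = 10                     # after merging similar vowels

reduced_units = initials * reduced_medials * finals
print(reduced_units)                     # 5060, the reduced total stated in the text

# For comparison: the product of the stated sizes, and the medial count
# that would reproduce the stated unreduced total of 10,626.
print(initials * medials * finals)       # 13156
print(10626 // (initials * finals))      # 21
```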

The reduced syllable set may likewise be used, in the form of one-hot encoding (a format in which each item is represented without overlap), as the input for calculating the phonetic-value similarity in the Hashed Syllable Recurrent Neural Network.

Referring to FIG. 6B, the hashed syllable bi-recurrent neural network may take as inputs an input sequence 620 for phoneme similarity and an input sequence 621 for phonetic-value similarity, obtained from the text-based mapping t(U) 611 and the speech-based mapping p(U) 612 of the onomatopoeia 610. To this end, the sound source selection unit 150 may define a loss function using Equation 1.

Here, 'T' denotes the number of syllables, and the hashed syllable bi-recurrent neural network may be trained in the direction that minimizes 'T'. 'N' is the number of onomatopoeias held by the service provider, 'm' is a margin parameter, 'dis' is the cosine similarity distance, $(x_i^{+}, y_i^{+})$ denotes a pair drawn from the same onomatopoeia, and $(x_i^{+}, y_i^{-})$ denotes a pair drawn from different onomatopoeias.
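Equation 1 itself is not reproduced in this text. The symbols defined above (a margin m, a cosine distance dis, matching pairs and non-matching pairs, summed over N onomatopoeias) are consistent with a standard margin-based hinge loss, so the following sketch shows that generic form. It is an assumption, not the patent's actual equation, and the function names are hypothetical.

```python
import math


def cosine_distance(x, y):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def margin_loss(anchors, positives, negatives, m=0.5):
    """Hinge loss summed over N (anchor, positive, negative) triples.

    A triple contributes zero once the positive is closer to the anchor
    than the negative by at least the margin m.
    """
    total = 0.0
    for x, y_pos, y_neg in zip(anchors, positives, negatives):
        total += max(0.0, m + cosine_distance(x, y_pos) - cosine_distance(x, y_neg))
    return total
```

Minimizing a loss of this shape pulls the two representations of the same onomatopoeia together and pushes representations of different onomatopoeias at least m apart.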

For cases with similar phonemes, the sound source selection unit 150 may define, using Equation 2, a function $w(U_i, U_i^{-})$ that brings the corresponding distance closer.

Using the network trained in this manner, the sound source selection unit 150 may register in advance, for each pre-registered onomatopoeia, the output vectors for the phoneme-similarity input sequence 620 and the phonetic-value-similarity input sequence 621.

For the text tagged with the onomatopoeia information, the sound source selection unit 150 may extract vectors based on the phoneme similarity and the phonetic-value similarity derived through the hashed syllable bi-recurrent neural network. The sound source selection unit 150 may determine the final onomatopoeia sound source by applying the extracted vectors as weights to the extracted similar onomatopoeia sound sources. The sound source selection unit 150 may determine the final onomatopoeia sound source using Equation 3.

Referring to Equation 3, the sound source selection unit 150 may apply a weight according to rank among the 5-best similar onomatopoeia sound sources and select the onomatopoeia with the highest combined value for the onomatopoeia word and the phonetic value, thereby replacing the onomatopoeia sound source that could not be extracted with the selected onomatopoeia sound source.
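Equation 3 is not reproduced in this text, so the following is only a sketch of one plausible rank-weighted combination of the two 5-best lists. The rank weights, the additive scoring rule, and the name `pick_final` are all assumptions made for illustration.

```python
RANK_WEIGHTS = [1.0, 0.8, 0.6, 0.4, 0.2]  # hypothetical weights for ranks 1 to 5


def pick_final(phoneme_best, phonetic_best, weights=RANK_WEIGHTS):
    """Combine two ranked candidate lists and return the best candidate.

    Each argument is a list of (candidate, similarity) pairs, best first.
    A candidate's score is the sum of weight * similarity over both lists.
    """
    scores = {}
    for ranked in (phoneme_best, phonetic_best):
        for weight, (candidate, similarity) in zip(weights, ranked):
            scores[candidate] = scores.get(candidate, 0.0) + weight * similarity
    return max(scores, key=scores.get)


# Toy example with two candidates per list.
choice = pick_final([("woof", 0.9), ("bark", 0.8)],
                    [("bark", 0.95), ("woof", 0.5)])
```

In the toy example, "bark" scores 0.8 × 0.8 + 1.0 × 0.95 and "woof" scores 1.0 × 0.9 + 0.8 × 0.5, so "bark" is chosen.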

The sound source selection unit 150 includes a frequency conversion unit 155, which may perform frequency conversion on the determined final onomatopoeia sound source so that it resembles the user's voice. The reason is as follows. In general, a sound source is refined into the desired sound source by passing it through a phoneme recognizer trained on human speech and excerpting phonemes. The phoneme recognizer extracts features from speech, models phonemes from the training speech using a model such as GMM-HMM, and outputs the highest-scoring result as the phoneme recognition result.

However, a phoneme recognizer trained on human speech has difficulty recognizing onomatopoeic sounds. Accordingly, in the present invention, frequency conversion is performed so that the determined final onomatopoeia sound source resembles the user's voice, which has the advantage that the onomatopoeia sound source can be applied directly to the speech recognizer.

Unlike speech, onomatopoeic sounds such as animal voices span a wide variety of frequency ranges. The frequency conversion unit 155 may therefore convert the onomatopoeic sound to resemble a human voice through frequency conversion, such as a phase vocoder, before it passes through the phoneme recognizer, thereby raising the phoneme recognition rate. Since the fundamental frequency of a typical male voice is 100 to 150 Hz and that of a typical female voice is 200 to 250 Hz, the frequency conversion unit 155 may convert the final onomatopoeia sound source toward a female voice when its pitch range is high and toward a male voice when its pitch range is low.
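The choice of conversion target can be sketched as a small calculation. The fundamental-frequency ranges come from the text; the 175 Hz boundary between "low" and "high", the use of each range's midpoint as the target, and the function names are assumptions. The resulting shift in semitones is the kind of parameter a phase-vocoder-based pitch shifter would take.

```python
import math

MALE_F0 = (100.0, 150.0)     # Hz, range given in the text
FEMALE_F0 = (200.0, 250.0)   # Hz, range given in the text
SPLIT_HZ = 175.0             # assumed boundary between "low" and "high"


def target_f0(source_f0):
    """Pick the midpoint of the female range for high sources, male for low."""
    low, high = FEMALE_F0 if source_f0 >= SPLIT_HZ else MALE_F0
    return (low + high) / 2.0


def shift_in_semitones(source_f0):
    """Pitch shift that moves source_f0 onto the chosen target."""
    return 12.0 * math.log2(target_f0(source_f0) / source_f0)
```

For example, a source with a fundamental frequency of 450 Hz would be shifted down one octave to the 225 Hz female midpoint.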

When the frequency-converted final onomatopoeia sound source is fed to the phoneme recognizer, the recognizer outputs a result for each frame (for example, every 0.01 seconds). For a cat onomatopoeia sound source, the output may look like "ㄴㄴㄴㄴ야야아아아아아아아아아오오옹옹옹옹옹옹옹옹".

Because a fairy tale can use a short onomatopoeic expression in place of a long one, the sound source selection unit 150 need not use the entire phoneme recognizer output for the onomatopoeic expression in the produced fairy tale and may instead excerpt only the necessary part. For the excerpt, the cat phoneme recognition result may be shortened in proportion to the number of phoneme recognition frames, based on the number of syllables in the onomatopoeia. For example, "냐옹" has two syllables and the phoneme recognition result has 26 frames in total (ㄴ: 4, 야: 2, 아: 9, 오: 2, 옹: 8), so shortening it proportionally by half yields a 13-frame phoneme recognition result that expresses the same onomatopoeia.
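The proportional shortening can be sketched as a run-length reduction over the frame-level output. The sketch uses the per-phoneme frame counts listed in the text (4, 2, 9, 2, 8) and a ratio of one half; rounding each run up is an assumption, chosen because it yields the 13 frames the text gives. The name `shrink_frames` is hypothetical.

```python
import math
from itertools import groupby


def shrink_frames(frames, gamma):
    """Shorten each run of identical frame labels to ceil(length * gamma)."""
    out = []
    for label, run in groupby(frames):
        count = len(list(run))
        out.append(label * math.ceil(count * gamma))
    return "".join(out)


# Frame-level recognizer output built from the counts given in the text.
frames = "ㄴ" * 4 + "야" * 2 + "아" * 9 + "오" * 2 + "옹" * 8
short = shrink_frames(frames, 0.5)
```

Each phoneme keeps at least one frame, so the shortened sequence still passes through the same phonemes in the same order.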

The sound source selection unit 150 may edit the sound source according to the frames of the edited phoneme recognition result so that it can be used as the final onomatopoeia sound source. The final onomatopoeia sound source may be edited using Equation 4.

Referring to Equation 4, the sound source selection unit 150 may search all frames for frames that contain a phoneme and select them in proportion to the number of syllables in the onomatopoeia. Here, γ is the proportionality ratio, the converted function denotes the frequency conversion, $O_t$ is the feature vector of frame t, and $\lambda_i$ denotes a phoneme model.

Returning to FIG. 1, based on the learning model, the fairy tale reading sound source generation unit 160 may synthesize the registered first user voice data, uttered in a reading style, for at least one sentence tagged with narration information, and may synthesize the registered second user voice data, uttered in a conversational style, for at least one sentence tagged with dialogue information. Both corpus-based TTS and deep-learning-based speech synthesis may be used as the speech synthesis method.

The fairy tale reading sound source generation unit 160 may generate a fairy tale reading sound source by synthesizing the selected onomatopoeia sound source and the registered user voice data. Here, the fairy tale reading sound source may be generated so that at least one onomatopoeia is expressed based on the user's voice. The process of generating the fairy tale reading sound source is described in detail with reference to FIG. 7.

FIG. 7 is an exemplary diagram for explaining the process of synthesizing an onomatopoeia sound source and user voice data according to an embodiment of the present invention. Referring to FIG. 7, the fairy tale reading sound source generation unit 160 may input the onomatopoeia text 700, derived as speech text, into a Tacotron trained on the user's voice by the learning unit 130 to obtain a mel spectrum 701 of the user's voice uttering that text, and may then input the selected onomatopoeia sound source 710 to obtain a mel spectrum 711 corresponding to the onomatopoeia.

To obtain a voice that produces the onomatopoeic sound, the fairy tale reading sound source generation unit 160 may train the VF block 720, which uses CNN layers, so that the onomatopoeic sound and the user's voice are synthesized naturally. In doing so, the fairy tale reading sound source generation unit 160 may further use the loss function derived from Equation 5 to train the VF block 720.

Referring to Equation 5, α denotes the rate at which the speaker and the phonemes are reflected, and β denotes the rate at which the style of the onomatopoeia is reflected.

When the $L_{total}$ function of the VF block 720 trained in this way falls to or below the threshold TH (730), the VF block 720 may generate a linear-scale spectrogram 740 based on the trained parameters. The fairy tale reading sound source generation unit 160 may then synthesize (760) the user voice data and the onomatopoeia sound source through Griffin-Lim reconstruction 750. Here, the threshold TH may be set differently for each onomatopoeia by automatically detecting the point at which the loss function begins to saturate.
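The text does not say how the onset of saturation is detected. One simple possibility, shown here purely as an assumption, is to take as TH the first loss value whose relative improvement over the previous value falls below a small tolerance. The name `saturation_threshold` and the tolerance `eps` are hypothetical.

```python
def saturation_threshold(losses, eps=0.05):
    """Return the loss at which relative improvement first drops below eps.

    losses is the recorded training-loss history, oldest first.
    """
    for prev, cur in zip(losses, losses[1:]):
        if prev > 0 and (prev - cur) / prev < eps:
            return cur
    return losses[-1]  # never saturated within the recorded history


history = [10.0, 5.0, 2.5, 2.4, 2.39]
th = saturation_threshold(history)
```

Because the loss history differs for each onomatopoeia, a rule of this kind naturally produces a different TH for each one.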

FIG. 8 is an exemplary diagram showing the spectra of an existing model, an actual animal sound, and an onomatopoeia based on the user's voice, according to an embodiment of the present invention. Referring to FIG. 8, for a dog onomatopoeia, the figure shows the spectrum of the existing model 800, in which the user imitates the dog onomatopoeia with his or her own voice, the spectrum of the actual animal sound 810, and the spectrum of the onomatopoeia 820 expressed based on the user's voice as implemented by the present invention.

The spectrum of the existing model 800 is a monotonous voice that lacks the prosody and timbre of the animal's voice, and therefore fails to express the animal onomatopoeia effectively. In contrast, the spectrum of the user-voice-based onomatopoeia 820 proposed in the present invention combines the user's voice with the animal's voice and can therefore express a natural animal sound.

Table 1 below gives a quantitative evaluation of the existing model 800, the actual animal sound 810, and the user-voice-based onomatopoeia 820 proposed by the present invention, as shown in FIG. 8.

| Method | Existing model | Model of the present invention |
|---|---|---|
| Correlation distance | 0.75 | 0.058 |

To evaluate each piece of voice data quantitatively, the waveform similarity between the pieces of voice data was compared. The waveform similarity was analyzed frame by frame, and the similarity of the waveforms was measured by analyzing 512 frequencies per frame. The similarity was measured using a similarity distance, which is a representative method. The closer the similarity distance is to 0, the more similar the sound is judged to be to the actual animal sound.

Referring to Table 1, when each onomatopoeia is evaluated quantitatively using the cosine similarity distance, the similarity of the user-voice-based onomatopoeia proposed in the present invention is substantially better than that of the existing model (0.058 versus 0.75).
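A frame-wise cosine distance of the kind described can be sketched as follows. Each frame is assumed to be a vector of spectral magnitudes (512 bins in the text); averaging the per-frame distances and the function names are assumptions, since the text does not state how frames are aggregated.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_frame_distance(frames_a, frames_b):
    """Average the cosine distance over aligned pairs of frames."""
    pairs = list(zip(frames_a, frames_b))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
```

Identical spectra give a distance of 0, and spectra with no overlapping energy give a distance of 1, which matches the reading that values closer to 0 indicate greater similarity to the actual animal sound.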

Returning to FIG. 1, the providing unit 170 may provide the generated fairy tale reading sound source to the user terminal.

The fairy tale reading service providing server 100 may be operated by a computer program stored in a medium and including a sequence of instructions for providing the fairy tale reading service. When executed by a computing device, the computer program may include a sequence of instructions that cause the device to: register, from a user terminal, user voice data uttered by a user; receive, from the user terminal, a selection of one fairy tale content to be provided as the fairy tale reading service from among a plurality of pre-registered fairy tale contents; analyze a plurality of sentences included in the selected fairy tale content; tag onomatopoeia information to at least one onomatopoeia included in the analyzed plurality of sentences; select one onomatopoeia sound source from among a plurality of pre-stored onomatopoeia sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information; generate a fairy tale reading sound source by synthesizing the selected onomatopoeia sound source and the registered user voice data; and provide the generated fairy tale reading sound source to the user terminal, wherein the fairy tale reading sound source is generated so that the at least one onomatopoeia is expressed based on the user's voice.

FIG. 9 is a flowchart of a method of providing a fairy tale reading service in a fairy tale reading service providing server according to an embodiment of the present invention. The method shown in FIG. 9 includes steps processed in time series by the fairy tale reading service providing server 100 according to the embodiments shown in FIGS. 1 to 8. Therefore, even where omitted below, the description of the fairy tale reading service providing server 100 according to the embodiments shown in FIGS. 1 to 8 also applies to this method.

In step S910, the fairy tale reading service providing server 100 may register, from the user terminal, user voice data uttered by the user.

In step S920, the fairy tale reading service providing server 100 may receive, from the user terminal, a selection of one fairy tale content to be provided as the fairy tale reading service from among a plurality of pre-registered fairy tale contents.

In step S930, the fairy tale reading service providing server 100 may analyze a plurality of sentences included in the selected fairy tale content.

In step S940, the fairy tale reading service providing server 100 may tag onomatopoeia information to at least one onomatopoeia included in the analyzed plurality of sentences.

In step S950, the fairy tale reading service providing server 100 may select one onomatopoeia sound source from among a plurality of pre-stored onomatopoeia sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information.

In step S960, the fairy tale reading service providing server 100 may generate a fairy tale reading sound source by synthesizing the selected onomatopoeia sound source and the registered user voice data.

In step S970, the fairy tale reading service providing server 100 may provide the generated fairy tale reading sound source to the user terminal.

In the above description, steps S910 to S950 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

The method of providing the fairy tale reading service in the fairy tale reading service providing server, described with reference to FIGS. 1 to 9, may also be implemented in the form of a computer program stored in a medium and executed by a computer, or in the form of a recording medium containing instructions executable by a computer.

A computer-readable medium may be any available medium that can be accessed by a computer, and includes both volatile and non-volatile media and both removable and non-removable media. The computer-readable medium may also include a computer storage medium. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is provided for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the claims set out below rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: fairy tale reading service providing server
110: registration unit
120: selection unit
130: learning unit
140: fairy tale content analysis unit
150: sound source selection unit
155: frequency conversion unit
160: fairy tale reading sound source generation unit
170: providing unit

Claims

1. A server for providing a fairy tale reading service, the server comprising:
a registration unit that registers, from a user terminal, user voice data uttered by a user;
a selection unit that receives, from the user terminal, a selection of one fairy tale content to be provided as the fairy tale reading service from among a plurality of pre-registered fairy tale contents;
a fairy tale content analysis unit that analyzes a plurality of sentences included in the selected fairy tale content and tags onomatopoeia information to at least one onomatopoeia included in the analyzed plurality of sentences;
an onomatopoeia sound source selection unit that selects one onomatopoeia sound source from among a plurality of pre-stored onomatopoeia sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information;
a fairy tale reading sound source generation unit that generates a fairy tale reading sound source by synthesizing the selected onomatopoeia sound source and the registered user voice data; and
a providing unit that provides the generated fairy tale reading sound source to the user terminal,
wherein the fairy tale reading sound source is generated so that the at least one onomatopoeia is expressed based on the user's voice.

2. The server of claim 1,
wherein the registration unit registers, from the user terminal, at least one of first user voice data uttered by the user in a reading style and second user voice data uttered by the user in a conversational style, based on a preset script.

3. The server of claim 1,
wherein the registration unit further registers, from the user terminal, an animal onomatopoeia sound source in which an animal onomatopoeia is recorded.

4. The server of claim 2,
wherein the fairy tale content analysis unit tags dialogue information to dialogue sentences extracted from the analyzed plurality of sentences based on a preset identification symbol, tags the onomatopoeia information to text extracted from the analyzed plurality of sentences based on onomatopoeia registered in a Korean dictionary, and tags narration information to the remaining sentences among the analyzed plurality of sentences.
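
The three-way tagging recited above can be illustrated with a minimal sketch. It is not part of the claims. It assumes that the preset identification symbol is a pair of double quotation marks and that the dictionary lookup can be approximated by a small word set; the names `ONOMATOPOEIA`, `DIALOGUE`, and `tag_sentence` are hypothetical.

```python
import re

# Hypothetical stand-in for onomatopoeia entries registered in a Korean dictionary.
ONOMATOPOEIA = {"멍멍", "야옹", "꼬끼오"}
# Assumed identification symbols: straight or curly double quotation marks.
DIALOGUE = re.compile(r'["“].*?["”]')


def tag_sentence(sentence):
    """Return (sentence tag, onomatopoeia found in the sentence)."""
    # Onomatopoeia information is attached to matching text inside the sentence.
    found = sorted(w for w in ONOMATOPOEIA if w in sentence)
    # Sentences containing the identification symbol are dialogue; the rest are narration.
    tag = "dialogue" if DIALOGUE.search(sentence) else "narration"
    return tag, found
```

In this sketch a sentence without quotation marks falls through to the narration tag, which mirrors the "remaining sentences" wording of the claim.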

5. The server of claim 4,
further comprising a learning unit that learns phonemes based on each text constituting the fairy tale content and trains a learning model for adjusting the learned phonemes to match the at least one registered user voice data.

6. The server of claim 5,
wherein the fairy tale reading sound source generation unit synthesizes, based on the learning model, the registered first user voice data for at least one sentence tagged with the narration information, and synthesizes the registered second user voice data for at least one sentence tagged with the dialogue information.

7. The server of claim 1,
wherein the onomatopoeic sound source selection unit determines an object type corresponding to the onomatopoeia based on the at least one onomatopoeia tagged with the onomatopoeia information, and extracts an onomatopoeic sound source associated with the determined object type from among the plurality of pre-stored onomatopoeic sound sources.

8. The server of claim 7,
wherein, when no onomatopoeic sound source associated with the determined object type is extracted, the onomatopoeic sound source selection unit extracts a predetermined similar onomatopoeic sound source based on each of a phoneme similarity and a phonetic similarity for the text tagged with the onomatopoeia information.
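
The lookup-with-fallback behavior of the selection unit can be sketched as follows. This is an illustration only, not the claimed implementation. The tables `OBJECT_TYPE`, `SOUND_BY_TYPE`, and `SOUND_BY_TEXT` and the file names are hypothetical, and a single string ratio over decomposed Hangul jamo is used as a stand-in for both the phoneme similarity and the phonetic similarity, which the claims do not define.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical data: onomatopoeia -> object type, and object type -> stored sound source.
OBJECT_TYPE = {"멍멍": "dog", "야옹": "cat"}
SOUND_BY_TYPE = {"dog": "dog_bark.wav", "cat": "cat_meow.wav"}
# Hypothetical fallback library keyed by onomatopoeia text.
SOUND_BY_TEXT = {"멍멍": "dog_bark.wav", "야옹": "cat_meow.wav", "꿀꿀": "pig_oink.wav"}


def jamo(text):
    """Decompose Hangul syllables into jamo so comparison works per phoneme."""
    return unicodedata.normalize("NFD", text)


def similarity(a, b):
    """Assumed similarity measure: matching ratio over jamo sequences."""
    return SequenceMatcher(None, jamo(a), jamo(b)).ratio()


def select_sound(word):
    """Pick a sound source by object type, else by the most similar stored text."""
    obj = OBJECT_TYPE.get(word)
    if obj in SOUND_BY_TYPE:
        return SOUND_BY_TYPE[obj]
    best = max(SOUND_BY_TEXT, key=lambda key: similarity(word, key))
    return SOUND_BY_TEXT[best]
```

For example, `select_sound("멍멍멍")` has no object type entry, so the sketch falls back to the stored text closest to it at the jamo level.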

9. The server of claim 8,
wherein the onomatopoeic sound source selection unit extracts a vector based on the phoneme similarity and the phonetic similarity for the text tagged with the onomatopoeia information, and determines a final onomatopoeic sound source by applying the extracted vector as a weight to the extracted predetermined similar onomatopoeic sound source.

10. The server of claim 9,
wherein the onomatopoeic sound source selection unit includes a frequency conversion unit that performs frequency conversion on the determined final onomatopoeic sound source so that it becomes similar to the user's voice.
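
The claims do not specify how the frequency conversion is carried out. As one possible illustration, the sketch below shifts pitch by naive linear-interpolation resampling, which also changes the duration of the sound. The function name `shift_pitch` and the idea of passing a source and target fundamental frequency are assumptions, not details from the claims.

```python
def shift_pitch(samples, source_hz, target_hz):
    """Resample so that a tone at source_hz plays back at target_hz.

    Assumed approach: read the input at a step of target_hz / source_hz
    and linearly interpolate between neighboring samples.
    """
    ratio = target_hz / source_hz
    length = int(len(samples) / ratio)
    out = []
    for i in range(length):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

A production system would more likely use a method that preserves duration, but the resampling version keeps the idea of matching the sound source to the user's voice frequency easy to follow.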

11. A method for providing a fairy tale reading service in a server, the method comprising:
receiving registration of user voice data uttered by a user from a user terminal;
receiving, from the user terminal, a selection of any one fairy tale content to be provided as the fairy tale reading service from among a plurality of previously registered fairy tale contents;
analyzing a plurality of sentences included in the selected fairy tale content;
tagging onomatopoeia information to at least one onomatopoeia included in the analyzed plurality of sentences;
selecting any one onomatopoeic sound source from among a plurality of pre-stored onomatopoeic sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information;
generating a fairy tale reading sound source by synthesizing the selected onomatopoeic sound source and the registered user voice data; and
providing the generated fairy tale reading sound source to the user terminal,
wherein the fairy tale reading sound source is generated such that the at least one onomatopoeia is expressed based on the user's voice.

12. The method of claim 11,
wherein the receiving of the registration of the user voice data comprises
registering, from the user terminal, at least one of first user voice data uttered by the user in a read-aloud form and second user voice data uttered by the user in a conversational form, based on a preset script.

13. The method of claim 12,
wherein the tagging of the onomatopoeia information comprises:
tagging dialogue information to dialogue sentences extracted from the analyzed plurality of sentences based on a preset identification symbol;
tagging the onomatopoeia information to text extracted from the analyzed plurality of sentences based on onomatopoeia registered in a Korean dictionary; and
tagging narration information to the remaining sentences among the analyzed plurality of sentences.

14. The method of claim 13, further comprising:
learning phonemes based on each text constituting the fairy tale content; and
training a learning model for adjusting the learned phonemes to match the at least one registered user voice data.

15. The method of claim 14,
wherein the generating of the fairy tale reading sound source comprises:
synthesizing, based on the learning model, the registered first user voice data for at least one sentence tagged with the narration information; and
synthesizing the registered second user voice data for at least one sentence tagged with the dialogue information.

16. The method of claim 11,
wherein the selecting of the onomatopoeic sound source comprises:
determining an object type corresponding to the onomatopoeia based on the at least one onomatopoeia tagged with the onomatopoeia information; and
extracting an onomatopoeic sound source associated with the determined object type from among the plurality of pre-stored onomatopoeic sound sources.

17. The method of claim 16,
wherein the selecting of the onomatopoeic sound source further comprises,
when no onomatopoeic sound source associated with the determined object type is extracted, extracting a predetermined similar onomatopoeic sound source based on each of a phoneme similarity and a phonetic similarity for the text tagged with the onomatopoeia information.

18. The method of claim 17,
wherein the selecting of the onomatopoeic sound source further comprises:
extracting a vector based on the phoneme similarity and the phonetic similarity for the text tagged with the onomatopoeia information; and
determining a final onomatopoeic sound source by applying the extracted vector as a weight to the extracted predetermined similar onomatopoeic sound source.

19. The method of claim 18,
wherein the selecting of the onomatopoeic sound source further comprises
performing frequency conversion on the determined final onomatopoeic sound source so that it becomes similar to the user's voice.

20. A computer program stored in a computer-readable recording medium and including a sequence of instructions for providing a fairy tale reading service,
wherein the computer program, when executed by a computing device, causes the computing device to:
register user voice data uttered by a user from a user terminal;
receive, from the user terminal, a selection of any one fairy tale content to be provided as the fairy tale reading service from among a plurality of previously registered fairy tale contents;
analyze a plurality of sentences included in the selected fairy tale content, and tag onomatopoeia information to at least one onomatopoeia included in the analyzed plurality of sentences;
select any one onomatopoeic sound source from among a plurality of pre-stored onomatopoeic sound sources based on the at least one onomatopoeia tagged with the onomatopoeia information;
generate a fairy tale reading sound source by synthesizing the selected onomatopoeic sound source and the registered user voice data; and
provide the generated fairy tale reading sound source to the user terminal,
wherein the sequence of instructions causes the fairy tale reading sound source to be generated such that the at least one onomatopoeia is expressed based on the user's voice.