KR20220051136A

KR20220051136A - Method, apparatus, device and computer recording medium for generating broadcast voice

Info

Publication number: KR20220051136A
Application number: KR1020217042726A
Authority: KR
Inventors: 스치앙 딩; 지저우 황; 디 우
Original assignee: 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디.
Priority date: 2020-10-15
Filing date: 2021-06-02
Publication date: 2022-04-26
Also published as: US20220406291A1; CN112269864A; JP2023502815A; CN112269864B; EP4012576A4; EP4012576A1; WO2022077927A1

Abstract

The present application discloses a method, an apparatus, a device, and a computer recording medium for generating a broadcast voice, and relates to the fields of voice technology and knowledge graph technology. In a specific implementation, a narration matching a scene is obtained from a voice package, a broadcast template preconfigured for the scene is obtained, and the broadcast template is filled with the narration to generate a broadcast voice. With the present application, the broadcast voice can better reflect the personality of the entity of the voice package, which further improves the broadcast effect.

Description

Method, apparatus, device and computer recording medium for generating broadcast voice

The present application relates to the field of computer application technology and, in particular, to a method, an apparatus, a device, and a computer recording medium for generating a broadcast voice in voice technology and knowledge graph technology.

Cross-reference to related applications

This application claims priority to Chinese patent application No. 2020111059358, filed on October 15, 2020 and entitled "method and device for generating broadcast voice, electronic equipment, and computer storage medium".

As users demand more from smart terminals, more and more application programs integrate a voice broadcast function. Users can download and install various voice packages so that the voice broadcast uses the voice of a person they like.

At present, voice broadcasting largely satisfies users in terms of sound, but because the broadcast content is fixed for each scene, the effect is unsatisfactory. For example, when navigation starts, "Start of departure" is broadcast no matter which voice package the user uses.

In view of this, the present application provides a method, an apparatus, a device, and a computer recording medium for generating a broadcast voice, so as to improve the effect of voice broadcasting.

In a first aspect, the present application provides a method for generating a broadcast voice, including:

obtaining a narration matching a scene from a voice package, and obtaining a broadcast template preconfigured for the scene; and

filling the broadcast template with the narration to generate a broadcast voice.

In a second aspect, the present application provides an apparatus for generating a broadcast voice, including:

a narration obtaining module for obtaining a narration matching a scene from a voice package;

a template obtaining module for obtaining a broadcast template preconfigured for the scene; and

a voice generating module for filling the broadcast template with the narration to generate a broadcast voice.

In a third aspect, the present application provides an electronic device, including:

at least one processor; and

a memory communicatively connected to the at least one processor,

wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to perform the method described above.

In a fourth aspect, the present application provides a non-transitory computer-readable recording medium storing computer instructions, the computer instructions causing a computer to perform the method described above.

As can be seen from the above technical solutions, the present application obtains the broadcast voice by filling a broadcast template with a narration from the voice package that matches the scene, so that the broadcast voice better reflects the personality of the entity of the voice package. This further improves the broadcast effect and gives the user the feeling that the entity of the voice package is really speaking.

Other effects of the above optional implementations are described below in conjunction with specific embodiments.

The accompanying drawings are provided for a better understanding of the present solution and do not limit the present application.
FIG. 1 is a schematic diagram of the principle of generating a broadcast voice in the prior art.
FIG. 2 shows an exemplary system architecture that may be applied to an embodiment of the present application.
FIG. 3 is a flowchart of the main method provided in an embodiment of the present application.
FIG. 4 is a schematic diagram of the principle of generating a broadcast voice provided in an embodiment of the present application.
FIG. 5 is a flowchart of a method for mining style narrations provided in an embodiment of the present application.
FIG. 6 is a flowchart of a method for mining knowledge narrations provided in an embodiment of the present application.
FIG. 7 is an exemplary diagram of part of a knowledge graph provided in an embodiment of the present application.
FIG. 8 is a structural diagram of an apparatus for generating a broadcast voice provided in an embodiment of the present application.
FIG. 9 is a block diagram of an electronic device implementing an embodiment of the present application.

Exemplary embodiments of the present application are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art should understand that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.

In the prior art, a broadcast voice may be generated on the principle shown in FIG. 1. Generating the broadcast text may involve, but is not limited to, the following two situations.

The first is dialogue-based generation of the broadcast text: after a user voice command is received, the response text generated for that command is used as the broadcast text. For example, when the user's voice command "Query coffee shop" is received, the generated response text is "I will find the cafe nearest to you. It is on floor C of the Beijing International Building on Zhongguancun South Street, 2.1 kilometers away." In this situation, the response text is generated mainly after the scene and the user's intention are analyzed on the basis of dialogue understanding.

The second is active generation of the broadcast text: while a particular function is in use, the voice broadcast is made proactively. For example, during navigation, broadcast texts such as "Start of departure" and "Turn left ahead" are broadcast proactively. In this situation, the broadcast text is generated mainly after scene analysis based on the current actual situation.

After the broadcast text is generated, voice synthesis is performed using the timbre information in the voice package to obtain the voice to be broadcast. With this prior-art approach, different voice packages broadcast the same content in the same scene and differ only in timbre. For example, whether the user uses a son's voice package or a star's voice package, the broadcast in the "Query coffee shop" scene is always "I will find the nearest cafe for you. It is located at ***."

FIG. 2 shows an exemplary system architecture to which the method or the apparatus for generating a broadcast voice in the embodiments of the present application may be applied.

As shown in FIG. 2, the system architecture may include terminal devices 101 and 102, a network 103, and a server 104. The network 103 provides the medium for communication links between the terminal devices 101, 102 and the server 104, and may include various connection types such as wired links, wireless communication links, or fiber-optic cables.

A user may use the terminal devices 101 and 102 to interact with the server 104 via the network 103. Various applications, such as voice interaction applications, map applications, web browser applications, and communication applications, may be installed on the terminal devices 101 and 102.

The terminal devices 101 and 102 may be various electronic devices that support voice broadcasting, including but not limited to smartphones, tablets, notebook computers, and smart wearable devices. The apparatus for generating a broadcast voice provided by the present application may be installed and run on the server 104 or on the terminal devices 101 and 102. It may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.

The server 104 may be a single server or a server group consisting of multiple servers. The numbers of terminal devices, networks, and servers in FIG. 2 are merely exemplary; any number of each may be provided according to implementation needs.

FIG. 3 is a flowchart of the main method provided in an embodiment of the present application. As shown in FIG. 3, the method may include the following steps.

In 301, a narration matching the scene is obtained from the voice package.

In the embodiments of the present application, the voice package contains various narration information in addition to timbre information. A "narration" can be understood as a way of speaking: the same meaning can be expressed in different ways, that is, with different narrations. In the embodiments of the present application, different voice packages may use different narrations for the same scene. A narration includes at least one of a title narration, a style narration, and a knowledge narration. A title narration is the way the user is addressed. A style narration is an expression in a particular style. A knowledge narration is an expression based on particular knowledge content.
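As a rough illustration of what such a voice package might hold, the sketch below models it as a small Python data structure. The class and field names (`VoicePackage`, `timbre_id`, `narrations_for`) are assumptions made for this example and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VoicePackage:
    """Hypothetical container for one voice package (all names are assumptions)."""
    entity: str                      # entity the package imitates, e.g. "son"
    timbre_id: str                   # handle to the timbre data used for synthesis
    title: Optional[str] = None      # title narration, e.g. "Father"; may be absent
    style: Dict[str, List[str]] = field(default_factory=dict)      # scene -> style narrations
    knowledge: Dict[str, List[str]] = field(default_factory=dict)  # scene -> knowledge narrations

    def narrations_for(self, scene: str) -> Dict[str, str]:
        """Return the narration types this package can offer for a scene."""
        found: Dict[str, str] = {}
        if self.title is not None:
            found["title"] = self.title
        if self.style.get(scene):
            found["style"] = self.style[scene][0]
        if self.knowledge.get(scene):
            found["knowledge"] = self.knowledge[scene][0]
        return found


# Two illustrative packages: one with narrations, one with timbre only.
son = VoicePackage(
    entity="son",
    timbre_id="timbre-son",
    title="Father",
    style={"speeding": ["You are speeding. Please drive safely."]},
)
star = VoicePackage(entity="star", timbre_id="timbre-star")
```

A package that carries only timbre information, like `star` here, simply returns an empty dictionary, which corresponds to the case where only basic wording is available.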

Taking the title narration as an example, when the user uses a son's voice package, the title narration may be "Father". When the user uses a wife's voice package, the title narration may be "Husband". Of course, a voice package may contain no title narration information. For example, a star's voice package may use no title narration and instead address every user with a basic form of address such as the informal "you" or the formal "you".

Taking the style narration as an example, for the same scene "speeding", when the user uses a family member's voice package, a heartwarming style is presented, and the style narration "You are speeding. Please drive safely." may be used. When the user uses a comedian's voice package, a humorous style is presented, and the style narration "We are ordinary drivers. Don't pretend to be driving in F1. Please slow down and take it easy." may be used.

Taking the knowledge narration as an example, for the scene "cafe", when the user uses Star A's voice package, the knowledge narration "Please give me a cup of xxx coffee." may be used, where "xxx" may be the coffee brand that Star A advertises. When the user uses Star B's voice package, "xxx" in the knowledge narration may be the coffee brand that Star B advertises.

How the various narrations in the voice package are generated is described in detail in subsequent embodiments.

In 302, a broadcast template preconfigured for the scene is obtained.

In the embodiments of the present application, a broadcast template may be configured for each scene in advance. A broadcast template may include a combination of at least one narration.

In 303, the broadcast template is filled with the obtained narration to generate a broadcast voice.

The broadcast template corresponds to the scene, and the voice package contains personalized narrations matching the scene. The broadcast text obtained by filling the broadcast template with such a narration therefore reflects the personality of the entity of the voice package (for example, a son, a wife, or a celebrity), which further improves the broadcast effect and gives the user the feeling that the entity of the voice package is really speaking.

After the broadcast text is obtained, voice synthesis may be performed using the timbre information in the voice package to finally generate the broadcast voice. This part is the same as in the prior art and is not described in detail.
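Steps 301 to 303 can be pictured with the following minimal sketch. It assumes that templates mark slots with square brackets and that narrations are stored per scene in plain dictionaries. These representations and all function names are illustrative, and voice synthesis is left out.

```python
import re
from typing import Dict

# Assumed slot syntax: a name in square brackets, e.g. "[title]".
SLOT = re.compile(r"\[(\w+)\]")


def fill_template(template: str, narrations: Dict[str, str]) -> str:
    """Step 303: replace every [slot] with the narration of the same name."""
    def substitute(match):
        name = match.group(1)
        if name not in narrations:
            raise KeyError(f"no narration available for slot [{name}]")
        return narrations[name]
    return SLOT.sub(substitute, template)


def generate_broadcast_text(scene: str,
                            package: Dict[str, Dict[str, str]],
                            templates: Dict[str, str]) -> str:
    narrations = package[scene]                  # step 301: narrations matching the scene
    template = templates[scene]                  # step 302: template preconfigured for the scene
    return fill_template(template, narrations)   # step 303: fill the template


# Hypothetical data for a son's voice package and the "speeding" scene.
son_package = {"speeding": {"title": "Father",
                            "style": "you are speeding, please drive safely."}}
templates = {"speeding": "[title], [style]"}

text = generate_broadcast_text("speeding", son_package, templates)
print(text)
```

The resulting text would then be passed to a speech synthesizer together with the package's timbre information.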

As shown in FIG. 4, the present application uses the narration information in the voice package in the process of generating the broadcast text. How the narration information in the voice package is generated is described in detail below in conjunction with embodiments.

The title narration in the voice package may be set by the user. In a preferred implementation, the settings interface for the voice package provides the user with a component such as an input box or options so that the user can enter or select a title narration. For example, a user of a son's voice package may be offered a settings interface for that voice package containing options for common titles such as "Father", "Mother", "Grandfather", "Husband", "Wife", "Grandmother", "Maternal grandmother", "Maternal grandfather", and "Treasure" for the user to choose from. An input box may also be provided so that users can enter a title themselves.

The style narration in the voice package may be preset content, for example content preset by research and development staff or a service provider. In a preferred implementation, however, the style narration is mined in advance through a search engine, for example using the steps shown in FIG. 5.

In 501, a preset style keyword is concatenated with a keyword of the scene to obtain a search keyword.

The style keyword may also be set by the user. For example, the settings interface for the voice package may provide the user with an input box or options for style keywords, such as "intimate", "comic", "popular", and "TikTok style", for the user to enter or select.

In 502, style narration candidates are selected from the search result text corresponding to the search keyword.

Suppose the current scene is a query for a cafe, the scene keywords are "cafe" and "coffee", and the style keyword of the voice package the user is currently using is "heartwarming". The search keywords "cafe heartwarming" and "coffee heartwarming" can then be constructed. After each is searched, search result text such as the titles and summaries of the search results is obtained. The search result texts are sorted by their relevance to the search keywords, and the top N are taken as style narration candidates, where N is a preset positive integer.

In 503, the style narration candidates are revised to obtain the style narration.

In this embodiment, the style narration candidates may be revised by research and development staff, who adjust, combine, and select among the candidates to obtain the final style narration. A title slot may also be added to the style narration. Besides manual revision, other revision methods may be used.

For example, from the style narration candidates "Coffee can keep me energized; if you want to sleep well, quit it first.", "A sip of coffee is a little bitter, but the sweetness that follows makes you forget the bitterness.", and "Life is like a cup of coffee: bitter yet sweet, sweet yet joyful.", manual revision may yield the style narration "Coffee can clear your head but may affect your sleep, so [title] should get enough rest."
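Steps 501 and 502 can be sketched as follows. The search engine is represented by a caller-supplied function that returns result texts with relevance scores. That interface, the pooling of results across queries, the removal of duplicates, and the stand-in `fake_search` data are all assumptions for illustration. Step 503 (revision) is a manual step and is not modeled.

```python
from typing import Callable, List, Tuple

# Assumed search interface: query -> [(result text, relevance score)].
SearchFn = Callable[[str], List[Tuple[str, float]]]


def build_search_keywords(style_keyword: str, scene_keywords: List[str]) -> List[str]:
    """Step 501: concatenate each scene keyword with the style keyword."""
    return [f"{scene_keyword} {style_keyword}" for scene_keyword in scene_keywords]


def select_candidates(queries: List[str], search: SearchFn, top_n: int) -> List[str]:
    """Step 502: pool the result texts, sort by relevance, keep the top N."""
    pooled: List[Tuple[str, float]] = []
    for query in queries:
        pooled.extend(search(query))
    pooled.sort(key=lambda item: item[1], reverse=True)
    seen, candidates = set(), []
    for text, _score in pooled:
        if text not in seen:          # drop texts returned by more than one query
            seen.add(text)
            candidates.append(text)
    return candidates[:top_n]


def fake_search(query: str) -> List[Tuple[str, float]]:
    """Stand-in for a real search engine, with made-up results and scores."""
    corpus = {
        "cafe heartwarming": [("Life is like a cup of coffee.", 0.7)],
        "coffee heartwarming": [("Coffee keeps me awake.", 0.9),
                                ("Life is like a cup of coffee.", 0.6)],
    }
    return corpus.get(query, [])


queries = build_search_keywords("heartwarming", ["cafe", "coffee"])
candidates = select_candidates(queries, fake_search, top_n=2)
```

The candidates returned here are raw material only; the final style narration still comes from the revision in step 503.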

The knowledge narration in the voice package may be preset content, for example content preset by research and development staff or a service provider. In a preferred implementation, however, the knowledge narration is mined in advance on the basis of a knowledge graph, for example using the steps shown in FIG. 6.

In 601, a knowledge graph related to the voice package is obtained.

A voice package usually corresponds to a particular entity and is presented in that entity's timbre. For example, if the user uses a relative's voice package, the entity corresponding to the voice package is that relative. As another example, if the user uses Star A's voice package, the entity corresponding to the voice package is Star A. Since each entity has a corresponding knowledge graph, this step can obtain the knowledge graph of the entity corresponding to the voice package.

In 602, a knowledge node matching the scene is obtained from the knowledge graph.

In the knowledge graph, each knowledge node includes both specific content and its relations to other knowledge nodes. Take the portion of a knowledge graph shown in FIG. 7 as an example. For the voice package of "Star A", the corresponding entity is "Star A", and the knowledge graph may include knowledge nodes such as "The Whistleblower (吹哨人)", "Luckin Coffee", "Central Academy of Drama", and "Hangzhou". The relation between "The Whistleblower (吹哨人)" and "Star A" is "hit movie", the relation between "Luckin Coffee" and "Star A" is "advertising model", the relation between "Central Academy of Drama" and "Star A" is "alma mater", and the relation between "Hangzhou" and "Star A" is "hometown". When a knowledge node matching the scene is obtained, the scene keywords may be matched against the content and relations of the knowledge nodes.

In 603, a knowledge narration for the scene is generated using the obtained knowledge node and the narration template corresponding to the scene.

A narration template for knowledge narrations may be preset for each scene. For example, for the scene "search for a cinema", the narration template "Come to the cinema and watch my new movie [movie name]." may be set. After the knowledge node "The Whistleblower (吹哨人)" is determined in step 602, it is filled into the slot [movie name] of the narration template to generate the knowledge narration "Come to the cinema and watch my new movie 《The Whistleblower (吹哨人)》."
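Steps 602 and 603 can be sketched as below. The sketch assumes the knowledge graph of an entity is available as a list of (node content, relation) pairs and that matching is a simple case-insensitive substring test on scene keywords. Both the data layout and the matching rule are illustrative assumptions, not the method claimed in the application.

```python
from typing import List, Optional, Tuple

# Hypothetical fragment of Star A's knowledge graph: (node content, relation to Star A).
STAR_A_GRAPH: List[Tuple[str, str]] = [
    ("The Whistleblower", "hit movie"),
    ("Luckin Coffee", "advertising model"),
    ("Central Academy of Drama", "alma mater"),
    ("Hangzhou", "hometown"),
]


def match_node(graph: List[Tuple[str, str]],
               scene_keywords: List[str]) -> Optional[str]:
    """Step 602: return the first node whose content or relation mentions a scene keyword."""
    for content, relation in graph:
        haystack = f"{content} {relation}".lower()
        if any(keyword.lower() in haystack for keyword in scene_keywords):
            return content
    return None


def knowledge_narration(graph: List[Tuple[str, str]],
                        scene_keywords: List[str],
                        template: str,
                        slot: str) -> Optional[str]:
    """Step 603: fill the scene's narration template with the matched node."""
    node = match_node(graph, scene_keywords)
    if node is None:
        return None          # no knowledge narration exists for this scene
    return template.replace(slot, node)


narration = knowledge_narration(
    STAR_A_GRAPH,
    ["movie", "cinema"],
    "Come to the cinema and watch my new movie [movie name].",
    "[movie name]",
)
```

Returning `None` when no node matches mirrors the case where a voice package has no knowledge narration for a scene.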

A voice package may contain some or all of the title narration, the style narration, and the knowledge narration. In a preferred implementation, when "a narration matching the scene is obtained from the voice package" in step 301 above, a keyword of the scene is first determined, and a narration matching that keyword is then obtained from the voice package. Matching may be based on text similarity: for example, a narration is considered a match when the text similarity between the narration and the scene keyword is greater than or equal to a preset similarity threshold. In this way, narrations that are close to the scene can be found more comprehensively.
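The threshold-based matching can be illustrated with the sketch below. The application does not specify a similarity measure, so the keyword-overlap score and the threshold of 0.5 used here are assumptions chosen for simplicity. A real system might use embeddings or another text-similarity measure instead.

```python
import re
from typing import Iterable, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(narration: str, scene_keywords: Iterable[str]) -> float:
    """Assumed measure: fraction of scene keywords that occur in the narration."""
    keywords = {keyword.lower() for keyword in scene_keywords}
    if not keywords:
        return 0.0
    return len(keywords & tokens(narration)) / len(keywords)


def match_narrations(narrations: List[str],
                     scene_keywords: List[str],
                     threshold: float = 0.5) -> List[str]:
    """Keep narrations whose similarity reaches the preset threshold."""
    return [n for n in narrations if similarity(n, scene_keywords) >= threshold]


package_narrations = [
    "Coffee can clear your head but may affect your sleep.",
    "You are speeding. Please drive safely.",
]
matched = match_narrations(package_narrations, ["cafe", "coffee"])
```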

Besides the preferred implementation above, other approaches may be used. For example, the matching relationship between narrations and scenes may be preset.

The implementation of "a broadcast template preconfigured for the scene is obtained" in step 302 and "the broadcast template is filled with the obtained narration to generate a broadcast voice" in step 303 is described below in conjunction with embodiments.

At least one broadcast template and attribute information for each broadcast template may be preconfigured for each scene. A broadcast template may include a combination of at least one narration and, in addition to the title narration, style narration, and knowledge narration described above, may also include a basic narration, which may be stored on the server side. The attribute information may include at least one of a priority and constraint rules between narrations.

As an example, six broadcast templates are configured for the scene "Query coffee shop", and their priorities and constraint rules are shown in Table 1.

사용자가 아들의 음성 패키지를 사용한다고 가정하고, "카페를 조회해."의 장면에서, 음성 패키지 내의 당해 장면에 매칭하는 화술을 획득하고, 다음과 같다. Assume that the user uses the son's voice package, and in the scene of "Search the cafe.", a narration matching the scene in the voice package is obtained, as follows.

호칭 화술: 아버지, Nickname Speech: Father,

스타일 화술: 커피를 마시면 머리를 맑게 할 수 있지만 수면에 영향을 줄 수 있으니, [호칭]은 휴식에 주의해야 합니다. Style speech: Drinking coffee can clear your head, but it can affect your sleep, so [Title] should be careful with rest.

표 1에서 나타내는 방송 템플릿에서, 우선 순위가 높은 것으로부터 낮은 것으로 선정한다. 당해 장면에 매칭하는 지식 화술이 존재하지 않기 때문에, 앞의 2개의 템플릿을 사용할 수 없다. 3번째의 템플릿의 제약 규칙에서, 제약 스타일 화술에 호칭이 있어서는 안되므로, 사용할 수 없기 때문에, 4번째의 템플릿" [기초 화술] [스타일 화술]" 을 사용할 수 있다. In the broadcast templates shown in Table 1, the ones with the highest priority are selected from the ones with the lowest priority. Since there is no knowledge narrative that matches the scene, the first two templates cannot be used. In the constraint rule of the third template, since the constraint style speech must not have a title and cannot be used, the fourth template "[basic speech] [style speech]" can be used.

The basic narration for the scene, "I will find the nearest cafe for you. It is located at ***.", is obtained from the server side, and the style narration for the scene, "Drinking coffee can clear your head, but it may affect your sleep, so [title] should take care to rest.", is obtained from the voice package. Filling the fourth template with them yields the final broadcast text: "I will find the nearest cafe for you. It is located at ***. Drinking coffee can clear your head, but it may affect your sleep, so Father should take care to rest."

After the broadcast text is obtained, speech synthesis may be performed on it based on the timbre information in the voice package to obtain the broadcast voice. A broadcast voice generated in this way sounds to the user as if it were spoken by the user's own son, which feels warm and produces a strongly personified effect.
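The selection and filling steps of the worked example above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: Table 1 is not reproduced in this section, so the six template patterns below are hypothetical, chosen only so that the first two need a knowledge narration and the third forbids a title inside the style narration, as the text states. The names `Template`, `select_template`, and `fill`, and the convention that priority 1 is the highest, are likewise assumptions.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

SLOT = re.compile(r"\[(\w+)\]")  # matches placeholders such as [basic] or [title]


@dataclass
class Template:
    priority: int  # assumption: 1 is the highest priority
    pattern: str   # combination of narration slots
    constraint: Optional[Callable[[dict], bool]] = None


def usable(template, narrations):
    """A template is usable if every slot has a narration and its constraint holds."""
    if any(slot not in narrations for slot in SLOT.findall(template.pattern)):
        return False
    return template.constraint is None or template.constraint(narrations)


def select_template(templates, narrations):
    """Try templates from highest to lowest priority and return the first usable one."""
    for template in sorted(templates, key=lambda t: t.priority):
        if usable(template, narrations):
            return template
    return None


def fill(text, narrations, max_passes=5):
    """Replace slots repeatedly so nested slots (a [title] inside the style) resolve."""
    for _ in range(max_passes):
        filled = SLOT.sub(lambda m: narrations.get(m.group(1), m.group(0)), text)
        if filled == text:
            break
        text = filled
    return text


def no_title_in_style(narrations):
    return "[title]" not in narrations["style"]


# Hypothetical stand-in for Table 1 (the real table is not shown in this section).
TEMPLATES = [
    Template(1, "[title] [basic] [knowledge] [style]"),
    Template(2, "[basic] [knowledge]"),
    Template(3, "[title] [basic] [style]", no_title_in_style),
    Template(4, "[basic] [style]"),
    Template(5, "[title] [basic]"),
    Template(6, "[basic]"),
]

narrations = {
    "title": "Father",
    "style": "Drinking coffee can clear your head, but it may affect your sleep, "
             "so [title] should take care to rest.",
    "basic": "I will find the nearest cafe for you. It is located at ***.",
}

chosen = select_template(TEMPLATES, narrations)
broadcast_text = fill(chosen.pattern, narrations)
print(chosen.priority, broadcast_text)
```

With these assumed templates the fourth one is selected, matching the outcome described in the example, and the nested `[title]` slot inside the style narration is resolved to "Father" on the second replacement pass.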

The above is a detailed description of the method provided by the present application. The apparatus provided by the present application is described in detail below.

FIG. 8 is a structural diagram of an apparatus for generating a broadcast voice according to an embodiment of the present application. The apparatus may be an application located on a local terminal, or a functional unit such as a plug-in or a software development kit (SDK) within an application on the local terminal, or it may be located on the server side. As shown in FIG. 8, the apparatus may include a narration acquisition module 00, a template acquisition module 10, and a voice generation module 20, and may further include a first mining module 30 and a second mining module 40. The main functions of each component are as follows.

The narration acquisition module 00 is used to obtain a narration matching a scene from a voice package.

In a preferred implementation, the narration acquisition module 00 may determine keywords of the scene and obtain, from the voice package, a narration matching the keywords of the scene.

The narration includes at least one of a title narration, a style narration, and a knowledge narration.
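Keyword-based lookup of this kind might look like the sketch below. The application does not specify how the voice package is stored or how keywords are extracted, so the dictionary layout, the stop-word list, the helper names, and the rule that an entry with no keywords (such as the title narration) matches every scene are all assumptions made for illustration.

```python
STOP_WORDS = {"the", "a", "an", "for", "to"}  # illustrative only


def scene_keywords(scene):
    """Naive keyword extraction: lowercase words minus punctuation and stop words."""
    words = {w.strip(".,!?").lower() for w in scene.split()}
    return words - STOP_WORDS


def match_narrations(voice_package, scene):
    """Return at most one narration per kind whose keywords overlap the scene's."""
    keys = scene_keywords(scene)
    matched = {}
    for kind, entries in voice_package.items():
        for entry in entries:
            # Assumption: an entry without keywords applies to every scene.
            if not entry["keywords"] or keys & set(entry["keywords"]):
                matched[kind] = entry["text"]
                break
    return matched


# Hypothetical voice package content.
voice_package = {
    "title": [{"keywords": [], "text": "Father"}],
    "style": [
        {"keywords": ["gas", "station"], "text": "Drive safely, [title]."},
        {"keywords": ["cafe", "coffee"],
         "text": "Drinking coffee can clear your head, but it may affect your sleep."},
    ],
    "knowledge": [{"keywords": ["museum"], "text": "I visited this museum once."}],
}

matched = match_narrations(voice_package, "Search for a cafe.")
print(matched)
```

For the cafe scene this returns a title narration and a style narration but no knowledge narration, which is the situation assumed in the worked example of the method.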

The template acquisition module 10 is used to obtain a broadcast template preconfigured for the scene.

In a preferred implementation, the template acquisition module 10 determines at least one broadcast template preconfigured for the scene and the attribute information of each broadcast template, where each broadcast template consists of a combination of at least one narration, and selects one broadcast template for the scene from the at least one broadcast template according to the attribute information of each template and the voice package.

The voice generation module 20 is used to generate a broadcast voice by filling the broadcast template with the narration.

Specifically, the voice generation module 20 may include a text generation submodule 21 and a speech synthesis submodule 22.

The text generation submodule 21 is used to generate a broadcast text by filling the broadcast template with the narration.

The speech synthesis submodule 22 is used to perform speech synthesis on the broadcast text using the timbre information in the voice package to obtain the broadcast voice.
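The split of the voice generation module into a text generation step and a speech synthesis step could be organized as in the sketch below. It only illustrates the two-stage structure: the class names are hypothetical, the template is a simple flat one without nested slots, and `EchoSynthesizer` is a placeholder that returns tagged bytes instead of audio, since the application does not name a particular synthesis engine.

```python
from dataclasses import dataclass
from typing import Protocol


class Synthesizer(Protocol):
    def synthesize(self, text: str, timbre: str) -> bytes: ...


@dataclass
class VoicePackage:
    timbre: str        # stands in for the timbre information of the package
    narrations: dict   # narration kind -> narration text


class EchoSynthesizer:
    """Placeholder for a real TTS engine: returns tagged bytes, not audio."""

    def synthesize(self, text, timbre):
        return f"<{timbre}>{text}".encode("utf-8")


class VoiceGenerator:
    def __init__(self, synthesizer):
        self.synthesizer = synthesizer

    def generate_text(self, template, narrations):
        """Text generation step: fill each [slot] with its narration."""
        text = template
        for name, value in narrations.items():
            text = text.replace(f"[{name}]", value)
        return text

    def generate_voice(self, template, package):
        """Speech synthesis step: synthesize the filled text with the package timbre."""
        text = self.generate_text(template, package.narrations)
        return self.synthesizer.synthesize(text, package.timbre)


package = VoicePackage(
    timbre="son",
    narrations={"basic": "The cafe is 200 m ahead.", "style": "Take care to rest."},
)
generator = VoiceGenerator(EchoSynthesizer())
audio = generator.generate_voice("[basic] [style]", package)
print(audio)
```

Keeping the synthesizer behind a small interface means the text generation step can be tested without any audio backend, and a real engine can be substituted later.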

The title narration in the voice package may be set by the user. In a preferred implementation, the settings interface for the voice package may provide the user with a component such as an input box or a set of options, through which the user enters or selects the title narration.

The style narration in the voice package may be preset content, for example content preset by research and development personnel or a service provider. In a preferred implementation, however, the style narration may be obtained in advance by the first mining module 30 through mining with a search engine.

The first mining module 30 mines the style narration in the voice package in advance as follows:

splicing preset style keywords with the keywords of the scene to obtain search keywords;

selecting style narration candidates from the search result text corresponding to the search keywords; and

obtaining the style narration from the result of revising the style narration candidates. In one implementation, the candidates may be revised manually.
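The three mining steps above can be sketched as follows. The application does not say how candidates are picked from the search results, so the filter used here (short sentences that mention a scene keyword) is an assumption, as are the function names. `fake_search` stands in for a real search engine call, and `revise` stands in for the manual revision step.

```python
from itertools import product


def build_queries(style_keywords, scene_keywords):
    """Splice every style keyword with every scene keyword to form search keywords."""
    return [f"{s} {k}" for s, k in product(style_keywords, scene_keywords)]


def select_candidates(result_texts, scene_keywords, max_len=50):
    """Assumed filter: keep short sentences that mention a scene keyword."""
    candidates = []
    for text in result_texts:
        for sentence in text.split("."):
            sentence = sentence.strip()
            if (sentence and len(sentence) <= max_len
                    and any(k in sentence.lower() for k in scene_keywords)):
                candidates.append(sentence)
    return candidates


def mine_style_narrations(style_keywords, scene_keywords, search, revise):
    """Run all spliced queries, collect unique candidates, then revise each one."""
    seen, narrations = set(), []
    for query in build_queries(style_keywords, scene_keywords):
        for candidate in select_candidates(search(query), scene_keywords):
            if candidate not in seen:
                seen.add(candidate)
                narrations.append(revise(candidate))
    return narrations


def fake_search(query):
    """Stand-in for a search engine: maps a query to result texts."""
    corpus = {
        "caring coffee": ["Coffee can clear your head. Too much coffee may affect your sleep."],
        "caring cafe": ["Top ten cafes this year ranked by our editors and readers in a long list."],
    }
    return corpus.get(query, [])


style_narrations = mine_style_narrations(
    ["caring"], ["cafe", "coffee"], fake_search, revise=lambda s: s + "."
)
print(style_narrations)
```

In practice the `revise` callback is where a human editor would rewrite each candidate, for example to insert a `[title]` slot or adjust the tone to the entity the voice package represents.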

The second mining module 40 mines the knowledge narration in the voice package in advance as follows:

obtaining a knowledge graph related to the voice package;

obtaining, from the knowledge graph, knowledge nodes matching the scene; and

generating a knowledge narration corresponding to the scene using the obtained knowledge nodes and a narration template corresponding to the scene.
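A minimal sketch of these steps is given below, assuming the knowledge graph is available as a list of (subject, relation, object, scene tags) tuples. That representation, the tag-overlap matching rule, the function names, and the wording of the narration template are all hypothetical; the application does not prescribe a graph format or a matching method.

```python
def matching_nodes(graph, entity, scene_keywords):
    """Return (relation, value) pairs of the entity whose tags overlap the scene keywords."""
    keys = set(scene_keywords)
    return [
        (relation, value)
        for subject, relation, value, tags in graph
        if subject == entity and keys & set(tags)
    ]


def generate_knowledge_narrations(graph, entity, scene_keywords, narration_template):
    """Fill the scene's narration template with each matching knowledge node."""
    return [
        narration_template.format(entity=entity, relation=relation, value=value)
        for relation, value in matching_nodes(graph, entity, scene_keywords)
    ]


# Hypothetical knowledge graph for the entity "A" that a voice package represents.
graph = [
    ("A", "favorite drink", "latte", ["cafe", "coffee"]),
    ("A", "hometown", "B", ["travel"]),
    ("C", "favorite drink", "tea", ["cafe"]),
]

knowledge = generate_knowledge_narrations(
    graph, "A", ["cafe"], "Did you know? {entity}'s {relation} is {value}."
)
print(knowledge)
```

Because the narrations are generated ahead of time per scene, they can be stored in the voice package and looked up at broadcast time like the other narration types.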

According to embodiments of the present application, the present application further provides an electronic device and a readable recording medium.

FIG. 9 is a block diagram of an electronic device that implements the method for generating a broadcast voice according to an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, operating platforms, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are examples only and do not limit the implementations of the present application described and/or claimed herein.

As shown in FIG. 9, the electronic device includes one or more processors 901, a memory 902, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as needed. The processor can process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (for example, a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories as needed. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multiprocessor system). FIG. 9 takes one processor 901 as an example.

The memory 902 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the method for generating a broadcast voice provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions that cause a computer to perform the method for generating a broadcast voice provided by the present application.

As a non-transitory computer-readable storage medium, the memory 902 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for generating a broadcast voice in the embodiments of the present application. By running the non-transitory software programs, instructions, and modules stored in the memory 902, the processor 901 performs the various functional applications and data processing of the server, that is, implements the method for generating a broadcast voice in the method embodiments above.

The memory 902 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the electronic device that implements the method for generating a broadcast voice. In addition, the memory 902 may include high-speed random access memory and may further include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 902 may optionally include memory located remotely from the processor 901, and such remote memory may be connected over a network to the electronic device that implements the method for generating a broadcast voice. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device may further include an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903, and the output device 904 may be connected by a bus or in other ways; FIG. 9 takes connection by a bus as an example.

The input device 903 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device that implements the method for generating a broadcast voice. Examples of the input device include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 904 may include a display device, an auxiliary lighting device (for example, an LED), a tactile feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.

Various implementations of the systems and techniques described here may be realized in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

These computer programs (also called programs, software, software applications, or code) include machine instructions for a programmable processor and may be implemented in a high-level procedural and/or object-oriented programming language and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, a magnetic disk, an optical disk, a memory, or a programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide interaction with a user, the systems and techniques described here may be implemented on a computer that has a display device for displaying information to the user (for example, a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) and a keyboard and pointing device (for example, a mouse or trackball) through which the user provides input to the computer. Other kinds of devices may also be used to provide interaction with the user. For example, the feedback provided to the user may be any form of sensory feedback (for example, visual, auditory, or tactile feedback), and input from the user may be received in any form (including sound input, voice input, or tactile input).

The systems and techniques described here may be implemented in a computing system that includes a back-end component (for example, a data server), a computing system that includes a middleware component (for example, an application server), a computing system that includes a front-end component (for example, a user computer with a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the result intended by the technical solution disclosed in the present application can be achieved; no limitation is imposed here.

The specific implementations above do not limit the scope of protection of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims

A method for generating a broadcast voice, comprising:
obtaining a narration matching a scene from a voice package, and obtaining a broadcast template preconfigured for the scene; and
generating a broadcast voice by filling the broadcast template with the narration.

The method of claim 1,
wherein the narration comprises at least one of a title narration, a style narration, and a knowledge narration.

The method of claim 1,
wherein obtaining the narration matching the scene from the voice package comprises:
determining a keyword of the scene; and
obtaining a narration matching the keyword of the scene from the voice package.

The method of claim 1,
wherein obtaining the broadcast template preconfigured for the scene comprises:
determining at least one broadcast template preconfigured for the scene and attribute information of each broadcast template, each broadcast template comprising a combination of at least one narration; and
selecting one broadcast template for the scene from the at least one broadcast template according to the attribute information of each broadcast template and the voice package.

The method of claim 1,
wherein generating the broadcast voice by filling the broadcast template with the narration comprises:
filling the broadcast template with the narration to generate a broadcast text; and
performing speech synthesis on the broadcast text using timbre information in the voice package to obtain the broadcast voice.

The method of claim 2,
wherein the style narration in the voice package is obtained by mining in advance by:
splicing preset style keywords with keywords of the scene to obtain search keywords;
selecting style narration candidates from search result text corresponding to the search keywords; and
obtaining the style narration from a result of revising the style narration candidates.

The method of claim 2,
wherein the knowledge narration in the voice package is obtained by mining in advance by:
obtaining a knowledge graph related to the voice package;
obtaining, from the knowledge graph, a knowledge node matching the scene; and
generating a knowledge narration corresponding to the scene using the obtained knowledge node and a narration template corresponding to the scene.

An apparatus for generating a broadcast voice, comprising:
a narration acquisition module configured to obtain a narration matching a scene from a voice package;
a template acquisition module configured to obtain a broadcast template preconfigured for the scene; and
a voice generation module configured to generate a broadcast voice by filling the broadcast template with the narration.

The apparatus of claim 8,
wherein the narration comprises at least one of a title narration, a style narration, and a knowledge narration.

The apparatus of claim 8,
wherein the narration acquisition module is configured to determine a keyword of the scene and to obtain a narration matching the keyword of the scene from the voice package.

The apparatus of claim 8,
wherein the template acquisition module is configured to:
determine at least one broadcast template preconfigured for the scene and attribute information of each broadcast template, each broadcast template comprising a combination of at least one narration; and
select one broadcast template for the scene from the at least one broadcast template according to the attribute information of each broadcast template and the voice package.

The apparatus of claim 11,
wherein the voice generation module comprises:
a text generation submodule configured to generate a broadcast text by filling the broadcast template with the narration; and
a speech synthesis submodule configured to perform speech synthesis on the broadcast text using timbre information in the voice package to obtain the broadcast voice.

The apparatus of claim 9,
further comprising a first mining module,
wherein the first mining module is configured to obtain the style narration in the voice package by mining in advance by:
splicing preset style keywords with keywords of the scene to obtain search keywords;
selecting style narration candidates from search result text corresponding to the search keywords; and
obtaining the style narration from a result of revising the style narration candidates.

The apparatus of claim 9,
further comprising a second mining module,
wherein the second mining module is configured to obtain the knowledge narration in the voice package by mining in advance by:
obtaining a knowledge graph related to the voice package;
obtaining, from the knowledge graph, a knowledge node matching the scene; and
generating a knowledge narration corresponding to the scene using the obtained knowledge node and a narration template corresponding to the scene.

An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor,
wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 7.

A non-transitory computer-readable recording medium storing computer instructions,
wherein the computer instructions cause a computer to perform the method of any one of claims 1 to 7.