WO2022164207A1 - Method and system for generating synthesized speech of new speaker - Google Patents


Info

Publication number
WO2022164207A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
characteristic
speech
new
speakers
Prior art date
Application number
PCT/KR2022/001414
Other languages
French (fr)
Korean (ko)
Inventor
김태수
이영근
황영태
Original Assignee
네오사피엔스 주식회사 (Neosapience Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 네오사피엔스 주식회사 (Neosapience Co., Ltd.)
Priority claimed from KR1020220011853A (KR102604932B1)
Publication of WO2022164207A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • The present disclosure relates to a method and system for generating a synthesized voice of a new speaker, and more particularly, to determining the speaker characteristic of a new speaker using the speaker characteristic of a reference speaker and vocalization characteristic change information, and to generating a synthesized voice that reflects the determined speaker characteristic using an artificial neural network text-to-speech synthesis model.
  • With the development of virtual voice generation technology and virtual image production technology, any content creator can easily produce audio content or video content.
  • In such virtual voice generation technology, a neural network voice model is trained on audio samples recorded by voice actors, and voice synthesis technology that reproduces the voice characteristics of the voice actors who recorded the audio samples is being developed.
  • the present disclosure provides a method for generating a new speaker's synthesized voice, a computer program stored in a computer-readable recording medium, and an apparatus (system) to solve the above problems.
  • the present disclosure may be implemented in various ways including a method, a system, an apparatus, or a computer program stored in a computer-readable storage medium, and a computer-readable recording medium.
  • According to an embodiment of the present disclosure, a method for generating a synthesized voice of a new speaker includes: receiving a target text; acquiring a speaker characteristic of a reference speaker; acquiring vocalization characteristic change information; determining the speaker characteristic of a new speaker using the acquired speaker characteristic of the reference speaker and the acquired vocalization characteristic change information; and inputting the target text and the determined speaker characteristic of the new speaker into an artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristic of the new speaker is reflected.
  • The artificial neural network text-to-speech synthesis model is trained based on a plurality of training text items and the speaker characteristics of a plurality of training speakers.
  • According to an embodiment, determining the speaker characteristic of the new speaker includes generating a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired vocalization characteristic change information into an artificial neural network speaker characteristic change generation model, and outputting the speaker characteristic of the new speaker by synthesizing the speaker characteristic of the reference speaker and the generated speaker characteristic change. The artificial neural network speaker characteristic change generation model is trained using the speaker characteristics of a plurality of training speakers and a plurality of vocalization characteristics included in those speaker characteristics.
  • the speech characteristic change information includes information about a change in the target speech characteristic.
  • According to an embodiment, acquiring the speaker characteristics of the reference speaker includes acquiring a plurality of speaker characteristics corresponding to a plurality of reference speakers, and acquiring the vocalization characteristic change information includes obtaining a corresponding set of weights. Determining the speaker characteristic of the new speaker then includes applying the weight included in the obtained weight set to each of the plurality of speaker characteristics.
  • According to an embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics comprising a plurality of speaker vectors. Obtaining the vocalization characteristic change information includes: normalizing each of the speaker vectors; determining a plurality of principal components by performing dimensionality reduction analysis on the normalized speaker vectors; selecting at least one principal component from among the determined principal components; and determining the vocalization characteristic change information using the selected principal component. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocalization characteristic change information, and a weight of the determined vocalization characteristic change information.
  • According to an embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics comprising a plurality of speaker vectors, where each of the plurality of speakers is assigned a label for one or more vocalization characteristics. Obtaining the vocalization characteristic change information includes obtaining the speaker vectors of a plurality of speakers having different target vocalization characteristics and determining the vocalization characteristic change information based on a difference between the obtained speaker vectors. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocalization characteristic change information, and a weight of the determined vocalization characteristic change information.
  • According to an embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics comprising a plurality of speaker vectors, where each of the plurality of speakers is assigned a label for one or more vocalization characteristics. Obtaining the vocalization characteristic change information includes: obtaining the speaker vectors of the speakers included in each of a plurality of speaker groups having different target vocalization characteristics, the groups including a first speaker group and a second speaker group; calculating the average of the speaker vectors of the speakers included in the first speaker group; calculating the average of the speaker vectors of the speakers included in the second speaker group; and determining the vocalization characteristic change information based on the difference between the average speaker vector corresponding to the first speaker group and the average speaker vector corresponding to the second speaker group. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocalization characteristic change information, and a weight of the determined vocalization characteristic change information.
  • According to an embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics comprising a plurality of speaker vectors, and the speaker characteristic of the reference speaker includes a plurality of vocalization characteristics of the reference speaker.
  • Obtaining the vocalization characteristic change information includes: inputting the speaker characteristics of the plurality of speakers into an artificial neural network vocalization characteristic prediction model and outputting the vocalization characteristics of each of the plurality of speakers; selecting, from among the speaker characteristics of the plurality of speakers, the speaker characteristic of a speaker whose output target vocalization characteristic differs from the target vocalization characteristic among the plurality of vocalization characteristics of the reference speaker; and acquiring a weight corresponding to the selected speaker characteristic. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and the weight corresponding to the speaker characteristic of the selected speaker.
  • According to an embodiment, the speaker characteristic of the new speaker includes a speaker feature vector, and the method includes: calculating a hash value corresponding to the speaker feature vector using a hash function; determining whether there is content associated with a hash value similar to the calculated hash value; and, if there is no content associated with a similar hash value, determining that the output voice associated with the speaker characteristic of the new speaker is a new output voice.
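  • As a rough illustration of this novelty check (a minimal sketch, not the specific hashing scheme of the disclosure), a speaker feature vector could be coarsely quantized and hashed, and the hash compared against hashes already associated with existing content; detecting merely similar hashes would require a locality-sensitive scheme, which is omitted here, and the function names below are illustrative.

```python
import hashlib
import numpy as np

def speaker_vector_hash(speaker_vector: np.ndarray, precision: int = 2) -> str:
    """Hash a speaker feature vector after coarse quantization, so that
    nearly identical vectors map to the same hash value."""
    quantized = np.round(speaker_vector, precision)
    return hashlib.sha256(quantized.tobytes()).hexdigest()

def is_new_voice(speaker_vector: np.ndarray, existing_hashes: set) -> bool:
    """Return True if no existing content is associated with this hash."""
    return speaker_vector_hash(speaker_vector) not in existing_hashes

# Usage: check whether a newly generated speaker vector collides with stored ones.
existing = {speaker_vector_hash(np.array([0.10, -0.52, 0.33]))}
print(is_new_voice(np.array([0.90, 0.12, -0.75]), existing))  # True -> treat as a new voice
```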
  • According to an embodiment, the speaker characteristic of the reference speaker includes a speaker vector, and obtaining the vocalization characteristic change information includes extracting a normal vector for the target vocalization characteristic using a vocalization characteristic classification model corresponding to the target vocalization characteristic (the normal vector being the normal vector of a hyperplane that classifies the target vocalization characteristic), and obtaining information indicating the degree to which the target vocalization characteristic is to be adjusted. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker based on the speaker vector, the extracted normal vector, and the degree to which the target vocalization characteristic is to be adjusted.
  • a computer program stored in a computer-readable recording medium is provided for executing the above-described method for generating a synthesized voice of a new speaker according to an embodiment of the present disclosure in a computer.
  • the speech synthesizer is trained using learning data including the synthesized voice of the new speaker generated according to the above-described method for generating the synthesized voice of the new speaker.
  • According to an embodiment, an apparatus for providing a synthesized voice includes a memory configured to store the synthesized voice of the new speaker generated according to the above-described method, and at least one processor connected to the memory and configured to execute at least one computer-readable program stored in the memory, wherein the at least one program is configured to output at least a portion of the synthesized voice of the new speaker stored in the memory.
  • According to an embodiment, a method of providing a synthesized voice of a new speaker includes storing the synthesized voice of the new speaker generated according to the above-described generation method and providing at least a portion of the stored synthesized voice of the new speaker.
  • a synthesized voice having a new voice may be generated by modifying a speaker feature vector through quantitative adjustment of vocalization features.
  • a new speaker's voice may be generated by mixing the voices of several speakers (eg, two or more speakers or three or more speakers).
  • the output voice may be generated by finely adjusting one or more vocalization characteristics from the user terminal.
  • the one or more vocal characteristics may include gender control, vocal tone control, vocal strength, male age control, female age control, pitch, tempo, and the like.
  • FIG. 1 is a diagram illustrating an example in which a synthesized voice generating system according to an embodiment of the present disclosure generates an output voice by receiving a target text and speaker characteristics of a new speaker.
  • FIG. 2 is a schematic diagram illustrating a configuration in which a plurality of user terminals and a synthesized voice generating system are communicatively connected to provide a synthetic voice generating service for text according to an embodiment of the present disclosure.
  • FIG. 3 is a block diagram illustrating an internal configuration of a user terminal and a synthesized voice generating system according to an embodiment of the present disclosure.
  • FIG. 4 is a block diagram illustrating an internal configuration of a processor of a user terminal according to an embodiment of the present disclosure.
  • FIG. 5 is a flowchart illustrating a method of generating an output voice in which a speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure.
  • FIG. 6 is a diagram illustrating an example of generating an output voice in which the speaker characteristics of a new speaker are reflected using the artificial neural network text-to-speech synthesis model according to an embodiment of the present disclosure.
  • FIG. 7 is a diagram illustrating an example of generating an output voice in which a speaker characteristic of a new speaker is reflected using an artificial neural network text-to-speech synthesis model according to another embodiment of the present disclosure.
  • FIG. 8 is an exemplary diagram illustrating a user interface for generating an output voice in which a speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure.
  • FIG. 9 is a structural diagram illustrating an artificial neural network model according to an embodiment of the present disclosure.
  • The term 'unit' or 'module' used in the specification means a software or hardware component, and a 'unit' or 'module' performs certain roles.
  • However, a 'unit' or 'module' is not meant to be limited to software or hardware.
  • A 'unit' or 'module' may be configured to reside on an addressable storage medium or may be configured to execute on one or more processors.
  • Accordingly, as an example, a 'unit' or 'module' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, or variables.
  • Components and 'units' or 'modules' may be combined into a smaller number of components and 'units' or 'modules', or further separated into additional components and 'units' or 'modules'.
  • a 'unit' or a 'module' may be implemented with a processor and a memory.
  • 'Processor' should be construed broadly to include general purpose processors, central processing units (CPUs), microprocessors, digital signal processors (DSPs), controllers, microcontrollers, state machines, and the like.
  • A 'processor' may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field programmable gate array (FPGA), or the like.
  • 'Processor' may also refer to a combination of processing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Also, 'memory' should be construed broadly to include any electronic component capable of storing electronic information.
  • For example, memory may refer to various types of processor-readable media such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), and the like.
  • a memory is said to be in electronic communication with the processor if the processor is capable of reading information from and/or writing information to the memory.
  • a memory integrated in the processor is in electronic communication with the processor.
  • a 'text item' may refer to a part or all of text, and the text may refer to a text item.
  • each of 'data item' and 'information item' may refer to at least a portion of data and at least a portion of information, and data and information may refer to a data item and information item.
  • In the present disclosure, 'each of a plurality of A' or 'each of the plurality of A' may refer to each of all components included in the plurality of A, or to each of some components included in the plurality of A.
  • For example, each of the speaker characteristics of a plurality of speakers may refer to each of all speaker characteristics included in the speaker characteristics of the plurality of speakers, or to each of some speaker characteristics included in the speaker characteristics of the plurality of speakers.
  • FIG. 1 is a diagram illustrating an example in which a synthesized voice generating system 100 according to an embodiment of the present disclosure generates an output voice 130 by receiving a target text 110 and a speaker characteristic 120 of a new speaker.
  • the synthesized voice generating system 100 may receive the target text 110 and the speaker characteristic 120 of the new speaker, and generate the output voice 130 in which the speaker characteristic 120 of the new speaker is reflected.
  • The target text 110 may include one or more paragraphs, sentences, clauses, phrases, words, phonemes, and the like.
  • the speaker characteristic 120 of the new speaker may be determined or generated using the speaker characteristic of the reference speaker and information on the change of the vocalization characteristic.
  • the speaker characteristic of the reference speaker may include the speaker characteristic of the speaker to be newly created, that is, the speaker characteristic of the speaker that is a reference in generating the speaker characteristic of the new speaker.
  • the speaker characteristic of the reference speaker may include a speaker characteristic similar to the speaker characteristic of the speaker to be newly created.
  • the speaker characteristics of the reference speaker may include speaker characteristics of a plurality of reference speakers.
  • the speaker characteristic of the reference speaker may include a speaker vector of the reference speaker.
  • The speaker vector of the reference speaker may be extracted by inputting a speaker id (e.g., a speaker one-hot vector) and a vocalization feature (e.g., a vector) into an artificial neural network speaker feature extraction model.
  • Here, the artificial neural network speaker feature extraction model may be trained to receive a plurality of training speaker ids and a plurality of training vocalization features (e.g., vectors) and to extract the (ground-truth) speaker vector of the corresponding reference speaker.
  • As another example, the speaker vector of a reference speaker may be extracted by inputting a voice recorded by a speaker and a vocalization feature (e.g., a vector) into an artificial neural network speaker feature extraction model.
  • Here, the artificial neural network speaker feature extraction model may be trained to receive voices recorded by a plurality of training speakers and a plurality of training vocalization features (e.g., vectors) and to extract the (ground-truth) speaker vector of the corresponding reference speaker.
  • the speaker vector of the reference speaker may include one or more speech characteristics (eg, tone, speech strength, speech speed, gender, age, etc.) of the reference speaker's voice.
  • the speaker id and/or the voice recorded by the speaker may be selected as the voice on which the speaker characteristics of the new speaker are based.
  • the vocalization characteristic may include a basic vocalization characteristic that will be reflected in the speaker characteristic of the new speaker.
  • That is, the speaker id, the voice recorded by the speaker, and/or the vocalization characteristic are used to generate the speaker characteristic of the reference speaker, and the speaker characteristic of the reference speaker generated in this way is synthesized with the vocalization characteristic change information to obtain the speaker characteristic of the new speaker.
  • the vocalization characteristic change information may include any information about the vocalization characteristic desired to be applied to the speaker characteristic of the new speaker.
  • the speech characteristic change information may include information about a difference between the speaker characteristic of the new speaker and the speaker characteristic of the reference speaker.
  • the new speaker characteristic may be generated by synthesizing the speaker characteristic and the speaker characteristic change of the reference speaker.
  • the speaker characteristic change may be generated by inputting the speaker characteristic and vocalization characteristic change information of the reference speaker to the artificial neural network speaker characteristic change generation model.
  • the artificial neural network speaker characteristic change generation model may be trained using speaker characteristics of a plurality of learned speakers and a plurality of speech characteristics included in the plurality of speaker characteristics.
  • the vocalization characteristic change information may include information indicating a difference between the target vocalization characteristic included in the speaker characteristic of the new speaker and the target vocalization characteristic included in the speaker characteristic of the reference speaker. That is, the speech characteristic change information may include information about a change in the target speech characteristic.
  • the speech feature change information may include a normal vector of a hyperplane that classifies the target speech feature from the speaker feature and information indicating the degree of adjusting the target speech feature.
  • the speech characteristic change information may include a weight to be applied to each of the speaker characteristics of the plurality of reference speakers.
  • The vocalization characteristic change information may include a target vocalization feature axis generated based on the target vocalization features included in the speaker characteristics of the training speakers, and a weight for the target vocalization feature.
  • The vocalization characteristic change information may include a target vocalization feature axis generated based on a difference between the speaker characteristics of speakers having different target vocalization features, and a weight for the target vocalization feature.
  • The vocalization characteristic change information may include the speaker characteristic of a speaker whose target vocalization characteristic differs from the target vocalization characteristic included in the speaker characteristic of the reference speaker, and a weight for that speaker characteristic.
  • The synthesized voice generation system 100 may generate, as a synthesized voice for the target text 110 in which the speaker characteristic 120 of the new speaker is reflected, an output voice 130 in which the target text is uttered according to the speaker characteristic of the newly created speaker.
  • The synthesized voice generation system 100 may include an artificial neural network text-to-speech synthesis model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output voices for the training text items in which the speaker characteristics of the training speakers are reflected.
  • The artificial neural network text-to-speech synthesis model may be configured to output voice data for the target text when the target text 110 and the speaker characteristic 120 of the new speaker are input.
  • In this case, the output voice data may be post-processed into a human-audible voice using a post-processor, a vocoder, or the like.
  • FIG. 2 illustrates a configuration in which a plurality of user terminals 210_1 , 210_2 , and 210_3 and a synthesized voice generating system 230 are communicatively connected to provide a synthetic voice generating service for text according to an embodiment of the present disclosure.
  • the plurality of user terminals 210_1 , 210_2 , and 210_3 may communicate with the synthesized voice generation system 230 through the network 220 .
  • the network 220 may be configured to enable communication between the plurality of user terminals 210_1 , 210_2 , and 210_3 and the synthesized voice generating system 230 .
  • Depending on the installation environment, the network 220 may be configured as, for example, a wired network such as Ethernet, a wired home network (power line communication), telephone line communication, or RS-serial communication; a wireless network such as a mobile communication network, a wireless LAN (WLAN), Wi-Fi, Bluetooth, or ZigBee; or a combination thereof.
  • The communication method is not limited, and may include not only a communication method using the network 220 but also short-range wireless communication between the user terminals 210_1, 210_2, and 210_3.
  • the network 220 may include a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), and a broadband network (BBN). , the Internet, and the like.
  • The network 220 may include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree or hierarchical network, and the like, but is not limited thereto.
  • In FIG. 2, a mobile phone or smartphone 210_1, a tablet computer 210_2, and a laptop or desktop computer 210_3 are illustrated as examples of user terminals that execute or operate a user interface providing the synthesized voice generation service, but the user terminals are not limited thereto.
  • The user terminals 210_1, 210_2, and 210_3 may be any computing device that is capable of wired and/or wireless communication and on which a web browser, a mobile browser application, or a synthesized voice generation application is installed so that the user interface providing the synthesized voice generation service can be executed.
  • For example, the user terminal 210 may include a smartphone, a mobile phone, a navigation terminal, a desktop computer, a laptop computer, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a tablet computer, a game console, a wearable device, an Internet of things (IoT) device, a virtual reality (VR) device, an augmented reality (AR) device, and the like.
  • In addition, although three user terminals 210_1, 210_2, and 210_3 are illustrated in FIG. 2 as communicating with the synthesized voice generation system 230 through the network 220, the present disclosure is not limited thereto, and a different number of user terminals may be configured to communicate with the synthesized voice generation system 230 through the network 220.
  • the user terminals 210_1, 210_2, and 210_3 provide the target text, information about the speaker characteristics of the reference speaker, and/or information indicating or selecting speech characteristics change information to the synthesized speech generation system 230.
  • the user terminals 210_1 , 210_2 , and 210_3 may receive the speaker characteristic and/or the candidate vocalization characteristic change information of the candidate reference speaker from the synthesized speech generation system 230 .
  • the user terminals 210_1, 210_2, and 210_3 may select, in response to the user input, speaker characteristics and/or speech characteristics change information of the reference speaker from the candidate reference speaker speaker characteristics and/or candidate vocal characteristics change information.
  • the user terminals 210_1 , 210_2 , and 210_3 may receive the output voice generated from the synthesized voice generating system 230 .
  • In FIG. 2, each of the user terminals 210_1, 210_2, and 210_3 and the synthesized voice generation system 230 are illustrated as separately configured elements, but the present disclosure is not limited thereto, and the synthesized voice generation system 230 may be configured to be included in each of the user terminals 210_1, 210_2, and 210_3.
  • In that case, the synthesized voice generation system 230 may include an input/output interface so that it can determine the target text, the speaker characteristic of the reference speaker, and the vocalization characteristic change information without communicating with the user terminals 210_1, 210_2, and 210_3, and output a synthesized voice for the target text in which the speaker characteristic of the new speaker is reflected.
  • The user terminal 210 may refer to any computing device capable of wired/wireless communication, for example, the mobile phone or smartphone 210_1, the tablet computer 210_2, or the laptop or desktop computer 210_3 of FIG. 2, and the like.
  • the user terminal 210 may include a memory 312 , a processor 314 , a communication module 316 , and an input/output interface 318 .
  • The synthesized voice generation system 230 may include a memory 332, a processor 334, a communication module 336, and an input/output interface 338.
  • As shown in FIG. 3, the user terminal 210 and the synthesized voice generation system 230 may be configured to communicate information and/or data through the network 220 using their respective communication modules 316 and 336.
  • the input/output device 320 may be configured to input information and/or data to the user terminal 210 through the input/output interface 318 or to output information and/or data generated from the user terminal 210 .
  • the memories 312 and 332 may include any non-transitory computer-readable recording medium.
  • The memories 312 and 332 may include a permanent mass storage device such as read-only memory (ROM), a disk drive, a solid state drive (SSD), or flash memory, in addition to random access memory (RAM).
  • As another example, a permanent mass storage device such as a ROM, an SSD, flash memory, or a disk drive may be included in the user terminal 210 and/or the synthesized voice generation system 230 as a separate persistent storage device distinct from the memory.
  • The memories 312 and 332 may store an operating system and at least one program code (e.g., code for determining the speaker characteristic of a new speaker, code for generating an output voice in which the speaker characteristic of a new speaker is reflected, etc.).
  • The separate computer-readable recording medium may include a recording medium directly connectable to the user terminal 210 and the synthesized voice generation system 230, for example, a computer-readable recording medium such as a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card.
  • the software components may be loaded into the memories 312 and 332 through a communication module rather than a computer-readable recording medium.
  • As another example, the at least one program may be loaded into the memories 312 and 332 based on a computer program (e.g., an artificial neural network text-to-speech synthesis model program) installed by files provided through the network 220 by developers or by a file distribution system that distributes installation files of applications.
  • the processors 314 and 334 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to the processor 314 , 334 by the memory 312 , 332 or the communication module 316 , 336 . For example, the processors 314 and 334 may be configured to execute received instructions according to program code stored in a recording device, such as the memories 312 and 332 .
  • The communication modules 316 and 336 may provide a configuration or function for the user terminal 210 and the synthesized voice generation system 230 to communicate with each other through the network 220, and may provide a configuration or function for the user terminal 210 and/or the synthesized voice generation system 230 to communicate with another user terminal or another system (e.g., a separate cloud system, a separate frame image generation system, etc.).
  • For example, a request (e.g., a synthesized voice generation request, a request to generate the speaker characteristic of a new speaker, etc.) generated by the processor 314 of the user terminal 210 according to program code stored in a recording device such as the memory 312 may be transmitted to the synthesized voice generation system 230 through the network 220 under the control of the communication module 316.
  • Conversely, a control signal or command provided under the control of the processor 334 of the synthesized voice generation system 230 may be received by the user terminal 210 through the communication module 316 of the user terminal 210 via the communication module 336 and the network 220.
  • the input/output interface 318 may be a means for interfacing with the input/output device 320 .
  • For example, the input device may include a device such as a keyboard, a microphone, a mouse, or a camera including an image sensor, and the output device may include a device such as a display, a speaker, or a haptic feedback device.
  • the input/output interface 318 may be a means for an interface with a device in which a configuration or function for performing input and output, such as a touch screen, is integrated into one.
  • a service screen or user interface configured using data may be displayed on the display through the input/output interface 318 .
  • In FIG. 3, the input/output device 320 is illustrated as not being included in the user terminal 210, but the present disclosure is not limited thereto, and the input/output device 320 may be configured as a single device with the user terminal 210.
  • The input/output interface 338 of the synthesized voice generation system 230 may be a means for interfacing with a device (not shown) for input or output that is connected to, or included in, the synthesized voice generation system 230.
  • In FIG. 3, the input/output interfaces 318 and 338 are illustrated as elements configured separately from the processors 314 and 334, but the present disclosure is not limited thereto, and the input/output interfaces 318 and 338 may be configured to be included in the processors 314 and 334.
  • The user terminal 210 and the synthesized voice generation system 230 may include more components than those shown in FIG. 3. However, most prior-art components need not be explicitly illustrated. According to an embodiment, the user terminal 210 may be implemented to include at least a portion of the above-described input/output device 320. In addition, the user terminal 210 may further include other components such as a transceiver, a global positioning system (GPS) module, a camera, various sensors, and a database. For example, when the user terminal 210 is a smartphone, it may further include components generally included in a smartphone, such as an acceleration sensor, a gyro sensor, a camera module, various physical buttons, buttons using a touch panel, input/output ports, and a vibrator for vibration.
  • the processor 314 of the user terminal 210 may be configured to operate a synthetic voice output application or the like.
  • a code associated with a corresponding application and/or program may be loaded into the memory 312 of the user terminal 210 .
  • The processor 314 of the user terminal 210 may receive information and/or data provided from the input/output device 320 through the input/output interface 318, or may receive information and/or data from the synthesized voice generation system 230 through the communication module 316, and the received information and/or data may be processed and stored in the memory 312.
  • such information and/or data may be provided to the synthesized voice generation system 230 through the communication module 316 .
  • The processor 314 may receive text input or selected through an input device 320, such as a touch screen or a keyboard connected to the input/output interface 318, and may store the received text in the memory 312 or provide it to the synthesized voice generation system 230 through the communication module 316 and the network 220.
  • the processor 314 may receive an input for the target text (eg, one or more paragraphs, sentences, phrases, words, phonemes, etc.) through the input device 320 .
  • the processor 314 may receive, through the input device 320 , any information indicating or selecting information about a reference speaker and/or information on change of speech characteristics.
  • the processor 314 may receive an input for the target text through the input device 320 through the input/output interface 318 .
  • the processor 314 may receive, through the input device 320 and the input/output interface 318 , an input for uploading a file in a document format including the target text through the user interface.
  • the processor 314 may receive a file in a document format corresponding to the input from the memory 312 .
  • the processor 314 may receive the target text included in the file.
  • the received target text may be provided to the synthesized speech generating system 230 through the communication module 316 .
  • The processor 314 may be configured to provide the uploaded file to the synthesized voice generation system 230 through the communication module 316 and to receive the target text contained in the file from the synthesized voice generation system 230.
  • The processor 314 may be configured to output the processed information and/or data through an output device of the user terminal 210, such as a device capable of display output (e.g., a touch screen or a display) or a device capable of audio output (e.g., a speaker).
  • The processor 314 may display information representing or selecting the target text and/or the vocalization characteristic change information, received from at least one of the input device 320, the memory 312, or the synthesized voice generation system 230, on the screen of the user terminal 210. Additionally or alternatively, the processor 314 may output the speaker characteristic of the new speaker determined or generated by the information processing system 230 through the screen of the user terminal 210. Also, the processor 314 may output the synthesized voice through a device capable of audio output, such as a speaker associated with the user terminal 210.
  • the processor 334 of the synthesized speech generation system 230 may be configured to manage, process and/or store information and/or data received from a plurality of user terminals and/or a plurality of external systems, including the user terminal 210 .
  • the information and/or data processed by the processor 334 may be provided to the user terminal 210 through the communication module 336 .
  • For example, the processor 334 may receive, from the user terminal 210, information indicating or selecting the target text, information about the reference speaker, and the vocalization characteristic change information, and may obtain or determine the corresponding speaker characteristic of the reference speaker and vocalization characteristic change information stored in the memory 332 and/or an external storage device.
  • the processor 334 may determine the speaker characteristic of the new speaker using the speaker characteristic and the vocalization characteristic change information of the reference speaker. Also, the processor 334 may generate an output voice for the target text in which the determined new speaker characteristic is reflected. For example, the processor 334 may input the target text and the speaker characteristics of the new speaker into the artificial neural network text-to-speech synthesis model to generate output speech from the artificial neural network text-to-speech synthesis model. The output voice generated in this way may be provided to the user terminal 210 through the network 220 and output through a speaker associated with the user terminal 210 .
  • the processor 334 may include a speaker characteristic determination module 410 , a synthesized speech output module 420 , a speech characteristic change information determination module 430 , and an output speech verification module 440 .
  • Each of the modules operated on the processor 334 may be configured to communicate with each other.
  • the internal configuration of the processor 334 is described separately for each function, but this does not necessarily mean that the processor 334 is physically separated.
  • The internal configuration of the processor 334 shown in FIG. 4 is only an example and does not necessarily represent only essential configurations. Accordingly, in some embodiments, the processor 334 may be implemented differently, for example by additionally including components other than the illustrated internal configuration or by omitting some of the illustrated internal components.
  • the speaker characteristic determination module 410 may acquire speaker characteristics of a reference speaker.
  • the features of the reference speaker may be extracted through the learned artificial neural network speaker feature extraction model.
  • For example, the speaker characteristic determination module 410 may extract the speaker features (e.g., vectors) of the reference speaker by inputting a speaker id (e.g., a speaker one-hot vector) and vocalization features (e.g., vectors) into the trained artificial neural network speaker feature extraction model.
  • As another example, the speaker characteristic determination module 410 may extract the speaker features (e.g., vectors) of the reference speaker by inputting a voice recorded by a speaker and vocalization features (e.g., vectors) into the trained artificial neural network speaker feature extraction model.
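  • A minimal sketch of such a speaker feature extraction model, assuming a simple PyTorch encoder that maps a speaker one-hot id and a vocalization feature vector to a fixed-size speaker vector; the module structure, dimensions, and names are illustrative assumptions, not the architecture of the disclosure.

```python
import torch
import torch.nn as nn

class SpeakerFeatureExtractor(nn.Module):
    """Illustrative speaker feature extraction model: maps a speaker one-hot id
    and a vocalization feature vector to a fixed-size speaker vector."""
    def __init__(self, num_speakers: int, vocal_dim: int, speaker_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_speakers + vocal_dim, 512),
            nn.ReLU(),
            nn.Linear(512, speaker_dim),
        )

    def forward(self, speaker_onehot: torch.Tensor, vocal_features: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([speaker_onehot, vocal_features], dim=-1))

# Usage: extract the reference speaker's vector from its id and vocalization features.
extractor = SpeakerFeatureExtractor(num_speakers=100, vocal_dim=5)
speaker_onehot = torch.nn.functional.one_hot(torch.tensor([3]), num_classes=100).float()
vocal_features = torch.tensor([[0.2, 0.8, 0.5, 1.0, 0.3]])  # e.g. tone, strength, speed, gender, age
reference_vector = extractor(speaker_onehot, vocal_features)  # shape (1, 256)
```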
  • the speaker characteristic determination module 410 may obtain speaker characteristics and vocalization characteristic change information of the reference speaker, and determine the speaker characteristic of a new speaker by using the acquired speaker characteristic of the reference speaker and the acquired vocalization characteristic change information.
  • As the speaker characteristic of the reference speaker, at least one of the speaker characteristics of a plurality of speakers stored in a storage medium may be selected.
  • The vocalization characteristic change information may be information indicating a change in the speaker characteristic of the reference speaker, information indicating a change in the speaker characteristics of at least some of the plurality of speakers stored in the storage medium, and/or information indicating a change in the vocalization characteristics included in the speaker characteristics of at least some of the plurality of speakers.
  • the speaker features of the plurality of speakers may include features inferred from the learned artificial neural network speaker feature extraction model.
  • each of the speaker characteristic and the vocalization characteristic may be expressed in a vector form.
  • the synthesized speech output module 420 may receive the target text from the user terminal and receive the speaker characteristics of the new speaker from the speaker characteristic determination module 410 .
  • the synthesized voice output module 420 may generate an output voice for the target text in which the speaker characteristics of the new speaker are reflected.
  • For example, the synthesized voice output module 420 may generate the output voice (i.e., synthesized voice) from the artificial neural network text-to-speech synthesis model by inputting the target text and the speaker characteristic of the new speaker into the trained model.
  • This artificial neural network text-to-speech synthesis model may be stored in a storage medium (e.g., the memory 332 of the information processing system 230, another storage medium accessible by the processor 334 of the information processing system 230, etc.).
  • The artificial neural network text-to-speech synthesis model may include a model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output voices for the training text items in which the speaker characteristics of the training speakers are reflected.
  • the synthesized voice output module 420 may provide the generated synthesized voice to the user terminal. Accordingly, the generated synthesized voice may be output through any speaker built into the user terminal 210 or connected via wire or wirelessly.
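  • The overall synthesis pipeline described above can be sketched as follows, assuming the text-to-speech model and the vocoder are provided as callables; this is an illustrative outline under those assumptions, not a specific library API.

```python
import numpy as np

def synthesize(target_text: str,
               new_speaker_vector: np.ndarray,
               tts_model,
               vocoder) -> np.ndarray:
    """Illustrative pipeline: the text-to-speech model produces intermediate
    voice data (e.g. a mel spectrogram) conditioned on the speaker vector,
    and a vocoder post-processes it into an audible waveform.
    `tts_model` and `vocoder` are assumed callables, not a specific library API."""
    mel = tts_model(target_text, new_speaker_vector)   # text + speaker vector -> mel frames
    waveform = vocoder(mel)                            # mel frames -> audio samples
    return waveform
```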
  • the speech characteristic change information determination module 430 may obtain speech characteristic change information from the memory 332 .
  • The vocalization characteristic change information may be determined based on information obtained through a user input at a user terminal (e.g., the user terminal 210 of FIG. 2).
  • The vocalization characteristic change information may include information on a vocalization characteristic to be changed in order to generate a new speaker, that is, a new voice.
  • The vocalization characteristic change information may include information (e.g., reflection ratio information) related to the speaker characteristic of the reference speaker.
  • Hereinafter, specific examples are described in which the vocalization characteristic change information is determined by the speaker characteristic determination module 410 and the vocalization characteristic change information determination module 430, and the speaker characteristic of a new speaker is determined using the determined vocalization characteristic change information and the speaker characteristic of the reference speaker.
  • According to an embodiment, the speaker characteristic determination module 410 may generate a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the vocalization characteristic change information into the trained artificial neural network speaker characteristic change generation model, and may output the speaker characteristic of a new speaker by synthesizing the speaker characteristic of the reference speaker and the generated speaker characteristic change.
  • When training the artificial neural network speaker characteristic change generation model, the vocalization characteristic information included in a speaker's speaker characteristic may not be used directly as an input; instead, individual vocalization characteristic information may be obtained for each speaker.
  • information on the vocalization characteristic of a given speaker may be obtained through tagging by a person.
  • the speech feature information of a given speaker may be obtained through an artificial neural network speech feature extraction model trained to infer the speech feature of the speaker from the speaker feature of the given speaker.
  • The obtained vocalization characteristic information of the speakers may be stored in a storage medium. That is, the speaker characteristic of the reference speaker can be adjusted according to a change in vocalization characteristics by using the artificial neural network speaker characteristic change generation model.
  • This artificial neural network speaker characteristic change generation model can be trained using Equation 1 below, where $r_i$ and $r_j$ denote the speaker characteristics of training speakers $i$ and $j$, $s_i$ and $s_j$ denote their vocalization characteristics, and $M$ denotes the speaker characteristic change generation model (the form of the equation is reconstructed from the surrounding description):

    Equation 1:  $\Delta\hat{r} = M(r_i,\; s_j - s_i), \qquad \mathcal{L} = \lVert (r_i + \Delta\hat{r}) - r_j \rVert^2$

  • That is, the vocalization characteristic change information determination module 430 may obtain $r_i$, $r_j$, $s_i$, and $s_j$ from the storage medium and use them to train the artificial neural network speaker characteristic change generation model, where the model is trained based on the loss given by the difference between $r_i + \Delta\hat{r}$ and $r_j$.
  • At inference time, the vocalization characteristic change information determination module 430 may determine the vocalization characteristic change information $\Delta s$ as the difference between the target vocalization characteristic and the vocalization characteristic of the reference speaker, and this may be input to the trained artificial neural network speaker characteristic change generation model together with the speaker characteristic of the reference speaker.
  • The speaker characteristic determination module 410 may then determine the speaker characteristic of the new speaker based on the determined vocalization characteristic change information $\Delta s$ and the speaker characteristic $r$ of the reference speaker. The speaker characteristic of the new speaker can be expressed as Equation 2 below (reconstructed form):

    Equation 2:  $r_{\text{new}} = r + M(r,\; \Delta s)$
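  • A minimal training and inference sketch consistent with Equations 1 and 2 as reconstructed above; the network architecture, loss, dimensions, and names are assumptions made for illustration, not the model of the disclosure.

```python
import torch
import torch.nn as nn

class SpeakerChangeGenerator(nn.Module):
    """Illustrative speaker characteristic change generation model: maps a
    reference speaker vector and a vocalization characteristic change to a
    speaker characteristic change (delta_r)."""
    def __init__(self, speaker_dim: int = 256, vocal_dim: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speaker_dim + vocal_dim, 512),
            nn.ReLU(),
            nn.Linear(512, speaker_dim),
        )

    def forward(self, ref_speaker: torch.Tensor, vocal_change: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([ref_speaker, vocal_change], dim=-1))

def training_step(model, optimizer, r_i, s_i, r_j, s_j):
    """One step of a loss consistent with Equation 1 as reconstructed above:
    the reference speaker vector plus the generated change should approximate
    the speaker vector whose vocalization features were targeted."""
    delta_r = model(r_i, s_j - s_i)
    loss = torch.mean((r_i + delta_r - r_j) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for stored speaker/vocalization data.
model = SpeakerChangeGenerator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
r_i, r_j = torch.randn(8, 256), torch.randn(8, 256)
s_i, s_j = torch.randn(8, 5), torch.randn(8, 5)
training_step(model, opt, r_i, s_i, r_j, s_j)

# Inference (Equation 2, reconstructed form): r_new = r_ref + model(r_ref, target_vocal_change)
```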
  • the vocalization characteristic change information determining module 430 may extract a normal vector for the target vocalization characteristic by using a vocalization feature classification model corresponding to the target vocalization characteristic.
  • a speech feature classification model corresponding to each of the plurality of speech features may be generated.
  • the vocal feature classification model is a hyperplane-based model, and may be implemented using, for example, a support vector machine (SVM), a linear classifier, or the like, but is not limited thereto.
  • the target vocalization characteristic may refer to a vocalization characteristic selected from among a plurality of vocalization features, which will be changed and reflected in the speaker characteristic of a new speaker.
  • the speaker's characteristic may be expressed as a speaker vector.
  • In this case as well, the vocalization characteristic information included in a speaker's speaker characteristic may not be used directly as an input; instead, individual vocalization characteristic information may be obtained for each speaker.
  • voice characteristic information of a given speaker may be obtained through tagging by a person.
  • the speech feature information of a given speaker may be obtained through an artificial neural network speech feature extraction model trained to infer the speech feature of the speaker from the speaker feature of the given speaker.
  • Here, $s_i$ denotes the i-th vocalization characteristic, $w_i$ denotes the normal vector of the hyperplane that classifies the i-th vocalization characteristic, $r$ denotes a speaker vector, and $b$ denotes a bias; the classification model can thus be written in a form such as $s_i = w_i^{\top} r + b$.
  • According to an embodiment, in order to generate the synthesized voice of the new speaker, the speaker characteristic determination module 410 may obtain, through the trained artificial neural network speaker feature extraction model, the speaker feature vector of the reference speaker that is most similar to the intended new speaker.
  • The vocalization characteristic change information determination module 430 may obtain, as the vocalization characteristic change information, the normal vector $w_i$ of the target vocalization characteristic from the trained vocalization characteristic classification model, together with information $\alpha$ indicating the degree to which the vocalization characteristic is to be adjusted. Using the speaker feature vector $r$ of the reference speaker obtained as described above, the normal vector of the target vocalization characteristic, and the degree of adjustment, the speaker characteristic $r_{\text{new}}$ of the new speaker can be generated according to Equation 4 below (reconstructed form):

    Equation 4:  $r_{\text{new}} = r + \alpha\, w_i$

  • Here, $\alpha$ applied to the normal vector of the target vocalization characteristic may represent the degree to which the vocalization characteristic is controlled.
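  • A minimal sketch of the hyperplane-based approach, assuming a linear SVM (scikit-learn's LinearSVC) is used as the vocalization characteristic classification model; the adjustment follows the Equation-4 form reconstructed above, and all names and data are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def vocal_feature_normal_vector(speaker_vectors: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Fit a hyperplane that classifies the target vocalization characteristic
    (e.g. low vs. high pitch) over speaker vectors and return its unit normal."""
    clf = LinearSVC().fit(speaker_vectors, labels)
    w = clf.coef_[0]
    return w / np.linalg.norm(w)

def adjust_speaker_vector(r_ref: np.ndarray, normal: np.ndarray, alpha: float) -> np.ndarray:
    """Equation-4 style adjustment (reconstructed form): move the reference
    speaker vector along the normal of the target vocalization characteristic
    by the degree alpha."""
    return r_ref + alpha * normal

# Toy usage: 200 speakers, 256-dim vectors, binary label for the target characteristic.
rng = np.random.default_rng(0)
R = rng.normal(size=(200, 256))
y = (R[:, 0] > 0).astype(int)                # placeholder label correlated with one direction
w = vocal_feature_normal_vector(R, y)
r_new = adjust_speaker_vector(R[0], w, alpha=1.5)
```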
  • According to an embodiment, the speaker characteristic determination module 410 may acquire a plurality of speaker characteristics corresponding to a plurality of reference speakers. Also, the vocalization characteristic change information determination module 430 may obtain a weight set corresponding to the plurality of speaker characteristics and provide the obtained weight set to the speaker characteristic determination module 410. The speaker characteristic determination module 410 may determine the speaker characteristic of a new speaker, as shown in Equation 5 below (reconstructed form), by applying the weight included in the obtained weight set to each of the plurality of speaker characteristics. That is, the voices of several speakers may be mixed to generate a new speaker's voice.

    Equation 5:  $r_{\text{new}} = \sum_i \alpha_i\, r_i$

  • Here, $r_i$ denotes the speaker vector of speaker $i$, and $\alpha_i$ denotes the weight for speaker $i$.
  • In this way, the feature vectors of multiple speakers can be mixed into the feature vector of a new speaker.
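  • A minimal sketch of the Equation-5 style mixing; whether the weights must sum to one is an assumption made here for illustration.

```python
import numpy as np

def mix_speaker_vectors(speaker_vectors: np.ndarray, weights) -> np.ndarray:
    """Equation-5 style mixing (reconstructed form): the new speaker vector is a
    weighted sum of the reference speakers' vectors, r_new = sum_i alpha_i * r_i."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # assumption: normalize weights to sum to 1
    return weights @ np.asarray(speaker_vectors)

# Usage: mix three reference speakers, weighting the first most heavily.
r = np.random.default_rng(1).normal(size=(3, 256))
r_new = mix_speaker_vectors(r, [0.6, 0.3, 0.1])
```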
  • According to an embodiment, the speaker characteristic determination module 410 may generate a new speaker feature vector by adjusting pre-computed vocalization feature axes.
  • a speaker feature includes one or more vocal features.
  • the vocalization characteristic change information determination module 430 may find the vocalization characteristic axis and adjust the vocalization characteristic axis.
  • The adjusted vocalization feature axis may be provided to the speaker characteristic determination module 410 and used to determine the speaker characteristic of a new speaker. That is, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using the speaker characteristic $r$ of the reference speaker, the vocalization feature axes $a_j$, and the weights $\beta_j$ of the vocalization characteristic change information, as shown in Equation 6 below (reconstructed form):

    Equation 6:  $r_{\text{new}} = r + \sum_j \beta_j\, a_j$

  • Here, $a_j$ denotes the j-th vocalization feature axis, and $\beta_j$ denotes the weight for the j-th vocalization feature.
  • $a_j$ may refer to an axis in the vocalization feature space that distinguishes an individual vocalization characteristic, and may have the same dimension as the speaker representation $r$.
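  • A minimal sketch of the Equation-6 style adjustment along precomputed vocalization feature axes; the axes themselves are assumed to be given (e.g., from the PCA or label-difference procedures described below), and all names are illustrative.

```python
import numpy as np

def adjust_along_axes(r_ref: np.ndarray, axes: np.ndarray, betas) -> np.ndarray:
    """Equation-6 style adjustment (reconstructed form): r_new = r_ref + sum_j beta_j * a_j,
    where each axis a_j has the same dimension as the speaker vector r_ref."""
    return r_ref + np.asarray(betas) @ np.asarray(axes)

# Usage: nudge a reference speaker along two precomputed axes (e.g. pitch and tempo).
r_ref = np.zeros(256)
axes = np.random.default_rng(2).normal(size=(2, 256))
r_new = adjust_along_axes(r_ref, axes, betas=[0.8, -0.4])
```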
  • the speech characteristic change information determining module 430 may normalize each of the speaker vectors of the plurality of speakers.
  • the speaker vectors of the plurality of speakers may be included in the speaker characteristics of the plurality of speakers.
  • For example, the vocalization characteristic change information determination module 430 may perform Z-normalization, in which the mean is subtracted from all data and the result is divided by the standard deviation, or mean normalization, in which only the mean is subtracted from all data.
  • Here, $N(\cdot)$ denotes the normalization function and $D(\cdot)$ denotes the corresponding inverse normalization function, so that the normalized set of speaker vectors can be written as $N(R)$, where $R$ denotes the speaker vectors of the plurality of speakers.
  • the speech characteristic change information determination module 430 may determine the plurality of main components by performing dimensionality reduction analysis on the speaker vectors of the plurality of normalized speakers.
  • the dimensionality reduction analysis may be performed through a conventionally known dimension reduction technique, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), or Stochastic Neighbor Embedding (t-SNE).
  • the speech characteristic change information determination module 430 may determine a plurality of main components P in Equation 8 below by performing PCA on N(R).
  • In Equation 8, each component refers to the k-th principal component, and the number of principal components may be at most the number of dimensions of the speaker representation r.
  • The vocalization characteristic change information determination module 430 may select at least one principal component from among the plurality of determined principal components. For example, principal components associated with the vocalization characteristics desired to be altered in the speaker characteristics of the new speaker may be selected.
  • The j-th vocalization characteristic axis may be obtained by applying the inverse normalization function D to the selected principal component.
  • The j-th vocalization characteristic axis determined in this way and a weight corresponding thereto are provided to the speaker characteristic determination module 410, so that the speaker characteristic of a new speaker can be generated through Equation 6 above.
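Putting the normalization, the PCA of Equation 8, and the inverse mapping together, a hedged sketch might look as follows. The exact inverse mapping of Equation 9 is not shown in this text, so scaling a principal component back to the original space (D(p_k) - D(0) = p_k * std) is only one plausible reading:

```python
import numpy as np

def vocal_feature_axes_via_pca(R, num_components=8):
    """Normalize speaker vectors, run PCA on N(R), and map selected principal
    components back through the inverse normalization D to obtain candidate
    vocalization-feature axes.

    R : array of shape (num_speakers, dim) holding the speaker vectors.
    """
    mean, std = R.mean(axis=0), R.std(axis=0) + 1e-8
    Z = (R - mean) / std                       # N(R): Z-normalized speaker vectors
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:num_components]                    # rows are principal components p_k
    axes = [p_k * std for p_k in P]            # assumed inverse mapping to the original space
    return P, axes

P, axes = vocal_feature_axes_via_pca(np.random.randn(100, 256))
```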
  • Instead of using the vocalization characteristic axes directly in Equation 6, the vocalization characteristic change information determination module 430 may use axes obtained through Equation 10, whereby interference between the vocalization characteristic axes can be removed.
  • The resulting axis may refer to a vocalization characteristic axis along which only the intended vocalization characteristics are changed.
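Equation 10 is not reproduced here; one common way to remove interference between direction vectors is Gram-Schmidt orthogonalization, so the following sketch is offered only as a plausible stand-in, not as the patent's actual formula:

```python
import numpy as np

def orthogonalize_axes(axes):
    """Gram-Schmidt style orthogonalization of vocalization-feature axes so that
    adjusting one axis interferes as little as possible with the others."""
    ortho = []
    for e in axes:
        v = np.asarray(e, dtype=float).copy()
        for u in ortho:
            v -= (v @ u) * u                 # remove the component along u
        norm = np.linalg.norm(v)
        if norm > 1e-8:                      # skip axes that are (nearly) redundant
            ortho.append(v / norm)
    return ortho
```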
  • the speech characteristic change information determination module 430 may obtain speaker vectors of a plurality of speakers having different target speech characteristics.
  • the speaker vectors of the plurality of learning speakers may be included in the speaker characteristics of the plurality of learning speakers.
  • each of the plurality of speakers is assigned a label for one or more vocal features.
  • a vocal feature label may be assigned to each of a plurality of speakers as shown in FIG.
  • The vocalization characteristics may include tone, vocal strength, vocal speed, gender, and age. Tone, vocal strength, and vocal speed may each be expressed as a discrete level (e.g., low, medium, or high), and each such level may be an element of the label l.
  • Gender may be expressed as male or female, and age may be expressed as a numerical or categorical value. For example, a label indicating that the tone is low, the vocal strength is medium, and the vocal speed is high may represent the vocalization characteristics of a 50-year-old male.
  • The speech characteristic change information determination module 430 may determine a vocalization characteristic based on the difference between the speaker vectors of a plurality of speakers having different target vocalization characteristics, as shown in Equation 11.
  • the vocal features may be included in the speech characteristic change information.
  • This speech characteristic change information is provided to the speaker characteristic determination module 410 so that the speaker characteristic of a new speaker can be determined using Equation 6 above.
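A minimal sketch of the pairwise-difference approach of Equation 11 (the equation itself is not reproduced, so the simple difference form is an assumption):

```python
import numpy as np

def axis_from_speaker_pair(r_a, r_b):
    """Difference between two speaker vectors whose labels differ only in the
    target vocalization feature (e.g., high vs. low vocal strength); the
    difference vector is used as the vocalization-feature axis."""
    return np.asarray(r_a, dtype=float) - np.asarray(r_b, dtype=float)

# averaging the difference over several such pairs can reduce speaker-specific noise
axis = axis_from_speaker_pair(np.random.randn(256), np.random.randn(256))
```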
  • the speech characteristic change information determining module 430 may determine the speech characteristic change information based on a difference between the averages of the speaker vectors of a plurality of speaker groups.
  • the speaker features of the plurality of speakers include speaker vectors of the plurality of speakers, and each of the speaker features of the plurality of speakers is assigned a label for one or more vocalization features.
  • the speech characteristic change information determination module 430 may obtain speaker vectors of speakers included in each of a plurality of speaker groups having different target speech characteristics.
  • the group of the plurality of learning speakers may include a first speaker group and a second speaker group.
  • The speech characteristic change information determination module 430 may calculate the average of the speaker vectors of the speakers included in the first speaker group, and the average of the speaker vectors of the speakers included in the second speaker group. As shown in Equation 12, a vocalization characteristic may be determined based on the difference between the average of the speaker vectors corresponding to the first speaker group and the average of the speaker vectors corresponding to the second speaker group. The determined vocalization characteristic may be included in the speech characteristic change information.
  • this speech characteristic change information is provided to the speaker characteristic determination module 410 so that the speaker characteristic of a new speaker can be determined using Equation 6 above.
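A corresponding sketch of the group-mean difference of Equation 12, under the same caveat that the exact formula is not reproduced in this text:

```python
import numpy as np

def axis_from_group_means(group_a, group_b):
    """Estimate a vocalization-feature axis from two groups of speaker vectors
    whose labels differ in the target vocalization feature.

    group_a, group_b : arrays of shape (num_speakers, dim)
    """
    mean_a = np.asarray(group_a, dtype=float).mean(axis=0)
    mean_b = np.asarray(group_b, dtype=float).mean(axis=0)
    return mean_a - mean_b                 # difference of group means as the axis

axis = axis_from_group_means(np.random.randn(20, 256), np.random.randn(30, 256))
```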
  • The speech characteristic change information determination module 430 may input the speaker characteristics of a plurality of speakers into an artificial neural network vocalization characteristic prediction model, as in Equation 13 below, and output the vocalization characteristics of each of the plurality of speakers.
  • Among the speaker characteristics of the plurality of speakers, the speech characteristic change information determination module 430 may select or determine a speaker characteristic for which the output j-th vocalization characteristic differs from the j-th vocalization characteristic of the reference speaker. The selected speaker characteristic may be provided to the speaker characteristic determination module 410.
  • the speaker characteristic determining module 410 may obtain a weight corresponding to the speaker characteristic of the selected speaker. Then, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker by using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and a weight corresponding to the speaker characteristic of the selected speaker. For example, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using Equation 14 below.
  • In Equation 14, the speaker characteristic of the new speaker is determined from the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and a weight corresponding to the speaker characteristic of the selected speaker.
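The selection-and-blending step of Equations 13 and 14 might be sketched as follows. The vocalization-feature prediction model is represented by a placeholder callable, picking the candidate with the largest difference is a selection heuristic, and the interpolation form used for Equation 14 is an assumption, since neither equation is reproduced in this text:

```python
import numpy as np

def select_and_blend(r_ref, c_ref, candidate_vectors, predict_features,
                     target_idx, weight):
    """Pick a speaker whose predicted target vocalization feature differs from the
    reference speaker's, then blend it into the reference speaker characteristic.

    predict_features : callable standing in for the neural vocalization-feature
                       prediction model; maps a speaker vector to a feature vector
    target_idx       : index of the target vocalization feature
    weight           : weight for the selected speaker characteristic
    """
    diffs = [abs(predict_features(r)[target_idx] - c_ref[target_idx])
             for r in candidate_vectors]
    r_sel = candidate_vectors[int(np.argmax(diffs))]   # largest difference in the target feature
    # one plausible reading of Equation 14: interpolate toward the selected speaker
    return r_ref + weight * (r_sel - r_ref)
```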
  • The output voice verification module 440 may determine whether the output voice associated with the speaker characteristic of the new speaker is a new output voice that is not previously stored. According to an embodiment, the output voice verification module 440 may calculate a hash value corresponding to a speaker feature (e.g., a speaker feature vector) of the new speaker by using a hash function. In another embodiment, the output voice verification module 440 may, instead of calculating a hash value from the determined speaker feature of the new speaker, extract the speaker feature of the new speaker from the new output voice and calculate a hash value using the extracted speaker feature.
  • the output voice verification module 440 may determine whether there is content associated with a hash value similar to the calculated hash value among the plurality of speaker contents stored in the storage medium. When there is no content associated with a hash value similar to the calculated hash value, the output voice verification module 440 may determine that the output voice associated with the speaker characteristic of the new speaker is the new output voice. When it is determined as the new output voice, the synthesized voice reflecting the speaker characteristics of the new speaker may be set to be used.
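A hedged sketch of such a uniqueness check follows. The patent does not specify the hash function or what counts as a "similar" hash value, so quantizing the vector before hashing (so that nearly identical vectors collide) and the use of SHA-256 are assumptions:

```python
import hashlib
import numpy as np

def speaker_hash(r_new: np.ndarray, decimals: int = 3) -> str:
    """Hash a new speaker feature vector after coarse quantization, so that
    nearly identical vectors map to the same hash value."""
    quantized = np.round(r_new, decimals).astype(np.float32)
    return hashlib.sha256(quantized.tobytes()).hexdigest()

def is_new_voice(r_new: np.ndarray, stored_hashes: set) -> bool:
    """True if no previously stored content is associated with the same hash."""
    return speaker_hash(r_new) not in stored_hashes

# usage
stored = {speaker_hash(np.ones(256))}
print(is_new_voice(np.zeros(256), stored))   # True: no matching content stored
```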
  • The method 500 for generating an output voice reflecting the speaker characteristics of a new speaker may be performed by a processor (e.g., the processor 314 of the user terminal 210 and/or the processor 334 of the synthesized voice generating system 230). As shown, the method 500 may be initiated by the processor receiving the target text (S510).
  • the processor may acquire a speaker characteristic of the reference speaker corresponding to the reference speaker ( S520 ).
  • the speaker characteristic of the reference speaker may include a speaker vector. Additionally or alternatively, the speaker characteristics of the reference speaker may include vocalization characteristics of the reference speaker.
  • the speaker characteristics of the reference speaker may include a plurality of speaker characteristics corresponding to the plurality of reference speakers.
  • the plurality of speaker features may include a plurality of speaker vectors.
  • the processor may acquire vocal feature change information ( S530 ).
  • the processor may acquire speaker characteristics of the plurality of speakers.
  • the speaker characteristics of the plurality of speakers may include a plurality of speaker vectors.
  • The processor may determine the plurality of principal components by performing normalization on each of the speaker vectors of the plurality of speakers and performing dimensionality reduction analysis on the normalized speaker vectors. At least one principal component from among the plurality of principal components thus determined may be selected. Then, the processor may determine the speech characteristic change information using the selected principal component.
  • the processor may obtain speaker vectors of a plurality of speakers having different target vocalization characteristics, and determine the speech characteristic change information based on a difference between the obtained speaker vectors of the plurality of speakers.
  • the processor may obtain a speaker vector of speakers included in each of a plurality of speaker groups having different target vocalization characteristics.
  • the plurality of speaker groups may include a first speaker group and a second speaker group. Then, the processor may calculate an average of speaker vectors of speakers included in the first speaker group, and calculate an average of speaker vectors of speakers included in the second speaker group.
  • the processor may determine the speech characteristic change information based on a difference between an average of speaker vectors corresponding to the first speaker group and an average of speaker vectors corresponding to the second speaker group.
  • The processor may input the speaker characteristics of the plurality of speakers into the artificial neural network vocalization characteristic prediction model, and output the vocalization characteristics of each of the plurality of speakers. Then, the processor may select, from among the speaker characteristics of the plurality of speakers, a speaker characteristic for which a difference exists between the target vocalization characteristic among the output vocalization characteristics and the target vocalization characteristic among the plurality of vocalization characteristics of the reference speaker, and may obtain a weight corresponding to the selected speaker characteristic.
  • the speaker characteristic of the selected speaker and the weight corresponding to the speaker characteristic of the selected speaker may be obtained as speech characteristic change information.
  • the processor may extract a normal vector for the target speech feature using a speech feature classification model corresponding to the target speech feature.
  • the normal vector may refer to a normal vector of a hyperplane that classifies the target speech feature, and information indicating the degree of adjusting the target speech feature may be obtained.
  • the extracted normal vector and information indicating the degree to which the target speech feature is adjusted may be obtained as speech feature change information.
  • the processor may determine the speaker characteristics of the new speaker by using the acquired speaker characteristics of the reference speaker and the acquired speech characteristic change information ( S540 ).
  • The processor may generate a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired speech characteristic change information into the artificial neural network speaker characteristic change generation model, and may output the speaker characteristic of the new speaker by synthesizing the speaker characteristic of the reference speaker and the generated speaker characteristic change.
  • the artificial neural network speaker characteristic change generation model may be learned by using the speaker characteristics of the plurality of learned speakers and the plurality of speech characteristics included in the speaker characteristics of the plurality of learned speakers.
  • The processor may determine the speaker characteristic of the new speaker by applying a weight included in the obtained weight set to each of the plurality of speaker characteristics. In another embodiment, the processor may determine the speaker characteristic of the new speaker by using the speaker characteristic of the reference speaker, the speech characteristic change information, and the weight of the speech characteristic change information. According to another embodiment, the processor may determine the speaker characteristic of the new speaker by using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and a weight corresponding to the speaker characteristic of the selected speaker. According to still another embodiment, the processor may determine the speaker characteristic of the new speaker based on the speaker vector of the reference speaker, the extracted normal vector, and the degree to which the target vocalization characteristic is adjusted.
  • the processor may input the target text and the determined speaker characteristics of the new speaker to the artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected (S550).
  • The artificial neural network text-to-speech synthesis model may include a model learned to output voices for a plurality of training text items, in which the speaker characteristics of the plurality of learning speakers are reflected, based on the plurality of training text items and the speaker characteristics of the plurality of learning speakers.
  • the processor may calculate a hash value corresponding to the speaker feature vector using a hash function.
  • the speaker feature vector may be included in the speaker feature of the new speaker. Then, the processor may determine whether there is content associated with a hash value similar to the calculated hash value among the plurality of speaker contents stored in the storage medium. If there is no content associated with the hash value similar to the calculated hash value, the processor may determine that the output voice associated with the speaker characteristic of the new speaker is the new output voice.
  • a speech synthesizer learned using learning data including the synthesized voice of a new speaker generated according to the above-described method for generating a synthesized voice of a new speaker may be provided.
  • the voice synthesizer may be any voice synthesizer that can be learned using learning data including the synthesized voice of a new speaker generated according to the above-described method for generating a synthesized voice of a new speaker.
  • the speech synthesizer may include any text-to-speech synthesis (TTS) model trained using this training data.
  • the TTS model may be implemented as a machine learning model or an artificial neural network model known in the art.
  • Since the speech synthesizer has learned the synthesized voice of the new speaker as training data, when the target text is input, the target text may be output as the synthesized voice of the new speaker. According to an embodiment, such a voice synthesizer may be included or implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2.
  • An apparatus for providing a synthesized voice may be provided, including a memory configured to store a synthesized voice of a new speaker generated according to the method for generating a synthesized voice of a new speaker as described above, and at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory, the at least one program including instructions for outputting at least a part of the synthesized voice of the new speaker stored in the memory.
  • the device for providing the synthesized voice may refer to any device that stores the synthesized voice of a new speaker that has been generated in advance and provides at least a part of the stored synthesized voice.
  • the apparatus for providing such a synthesized voice may be implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2 .
  • the apparatus for providing the synthesized voice is not limited thereto, but may be implemented as a video system, an ARS system, a game system, a sound pen, or the like.
  • When a device for providing such a synthesized voice is implemented in the information processing system 230, at least a part of the output synthesized voice of the new speaker may be transmitted to a user terminal connected to the information processing system 230 by wire or wirelessly.
  • the information processing system 230 may provide at least a part of the output synthesized voice of the new speaker in a streaming manner.
  • a method for providing a synthesized voice of a new speaker comprising the steps of: storing the synthesized voice of the new speaker generated according to the above-described method; and providing at least a part of the stored synthesized voice of the new speaker.
  • This method may be executed by the processor of the user terminal 210 and/or the processor of the information processing system 230 of FIG. 2 .
  • This method may be provided for a service providing a synthesized voice of a new speaker.
  • a service may be implemented as a video system, an ARS system, a game system, a sound pen, etc., but is not limited thereto.
  • the artificial neural network text-to-speech synthesis model may include an encoder 610 , an attention 620 , and a decoder 630 .
  • the encoder 610 may receive the target text 640 as an input.
  • the encoder 610 may be configured to generate pronunciation information for the input target text 640 (eg, phoneme information for the target text, a vector for each of a plurality of phonemes included in the target text, etc.).
  • The encoder 610 may convert the received target text 640 into character embeddings.
  • the generated character embeddings may be passed to a pre-net including a fully-connected layer.
  • the encoder 610 may provide the output from the pre-net to the CBHG module to output encoder hidden states.
  • the CBHG module may include a 1D convolution bank, max pooling, a highway network, and a bidirectional gated recurrent unit (GRU).
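For orientation, a simplified encoder sketch in the spirit of the description above is given below (PyTorch). The full CBHG module (convolution bank, max pooling, highway network) is replaced by a single bidirectional GRU for brevity, and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class SimpleTacotronEncoder(nn.Module):
    """Simplified encoder sketch: character embedding -> pre-net (fully connected
    layers with dropout) -> bidirectional GRU producing encoder hidden states."""

    def __init__(self, vocab_size=80, embed_dim=256, prenet_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.prenet = nn.Sequential(
            nn.Linear(embed_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        self.gru = nn.GRU(prenet_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):               # char_ids: (batch, text_len)
        x = self.embedding(char_ids)            # character embeddings
        x = self.prenet(x)                      # pre-net
        hidden_states, _ = self.gru(x)          # (batch, text_len, 2 * hidden_dim)
        return hidden_states

# usage: two texts of 40 characters each
enc = SimpleTacotronEncoder()
states = enc(torch.randint(0, 80, (2, 40)))
```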
  • the pronunciation information generated by the encoder 610 may be provided to the attention 620 , and the attention 620 may connect or combine the provided pronunciation information with voice data corresponding to the pronunciation information.
  • attention 620 may be configured to determine from which portion of the input text to generate speech.
  • the pronunciation information connected in this way and voice data corresponding to the pronunciation information may be provided to the decoder 630 .
  • the decoder 630 may be configured to generate the voice data 660 corresponding to the target text 640 based on the connected pronunciation information and the voice data corresponding to the pronunciation information.
  • The decoder 630 may receive the speaker characteristic 658 of the new speaker and generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected.
  • The speaker characteristic 658 of the new speaker may be generated through the vocalization characteristic change module 656.
  • the vocalization characteristic change module 656 may be implemented through the algorithm and/or artificial neural network model described in FIG. 4 .
  • The artificial neural network speaker feature extraction model 650 may obtain the speaker feature (r) of the reference speaker based on the reference speaker's identification information and the vocalization feature (C) 654.
  • the vocalization feature C 654 and the speaker feature r of the speaker may be expressed in a vector form.
  • The artificial neural network speaker feature extraction model 650 may be trained to receive a plurality of learning speaker ids and a plurality of learning vocalization features (e.g., vectors) to extract a ground-truth speaker vector of a reference speaker.
  • The vocalization characteristic change information may be determined through the vocalization characteristic change module 656, and further, the speaker characteristic 658 of the new speaker can be determined.
  • the input information (d) 655 associated with the speech characteristic change information may include any information desired to be reflected or changed in a new speaker.
  • The decoder 630 may include a pre-net composed of fully-connected layers, an attention recurrent neural network (RNN) including a gated recurrent unit (GRU), and a decoder RNN including a residual GRU.
  • the voice data 660 output from the decoder 630 may be expressed as a mel-scale spectrogram.
  • the output of the decoder 630 may be provided to a post-processing processor (not shown).
  • the CBHG of the post-processing processor may be configured to convert the mel-scale spectrogram of the decoder 630 into a linear-scale spectrogram.
  • the output signal of the CBHG of the post-processing processor may include a magnitude spectrogram.
  • the phase of the output signal of the CBHG of the post-processing processor may be restored through a Griffin-Lim algorithm and subjected to inverse short-time Fourier transform.
  • the post-processing processor may output a voice signal in a time domain.
  • the post-processing processor may be implemented using a GAN-based vocoder.
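A sketch of the Griffin-Lim post-processing path described above, using librosa as an assumed implementation (sample rate, FFT size, hop length, and iteration count are placeholders; a GAN-based vocoder would replace this step entirely):

```python
import numpy as np
import librosa

def mel_to_waveform(mel_power, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
    """Post-processing sketch: convert a mel-scale spectrogram to a linear-scale
    magnitude spectrogram, then restore phase with Griffin-Lim and apply the
    inverse short-time Fourier transform to obtain a time-domain voice signal.

    mel_power : (n_mels, frames) mel power spectrogram produced by the decoder
    """
    linear_mag = librosa.feature.inverse.mel_to_stft(
        mel_power, sr=sr, n_fft=n_fft, power=2.0)
    # Griffin-Lim iteratively estimates the phase and applies the inverse STFT
    return librosa.griffinlim(linear_mag, n_iter=n_iter,
                              hop_length=hop_length, win_length=n_fft)
```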
  • The processor may use a database including training text items, speaker characteristics of a plurality of learning speakers, and training voice data items corresponding to the training text items in which the speaker characteristics are reflected.
  • the processor may learn the artificial neural network text-to-speech synthesis model to output a synthesized voice reflecting the speaker characteristics of the learning speaker based on the training text item, the speaker characteristics of the training speaker, and the training voice data item corresponding to the training text item.
  • the processor may generate an output voice for the target text in which the speaker characteristics of the new speaker are reflected through the artificial neural network text-to-speech synthesis model created/learned in this way.
  • The processor may input the target text 640 and the speaker characteristic 658 of the new speaker into the model, and a synthesized voice may be generated based on the output voice data 660.
  • The synthesized voice generated in this way may reflect the speaker characteristic 658 of the new speaker and may include a voice uttering the target text 640.
  • the decoder 630 may include the attention 620 .
  • In the illustrated example, the speaker characteristic 658 of the new speaker is input to the decoder 630, but the present disclosure is not limited thereto; the speaker characteristic 658 of the new speaker may be input to the encoder 610 and/or the attention 620.
  • FIG. 7 is a diagram illustrating an example of generating an output voice in which a speaker characteristic of a new speaker is reflected, according to another embodiment of the present disclosure.
  • the encoder 710 , the attention 720 , and the decoder 730 illustrated in FIG. 7 may perform functions similar to those of the encoder 610 , the attention 620 and the decoder 630 illustrated in FIG. 6 , respectively. Accordingly, the description overlapping with FIG. 6 will be omitted.
  • the encoder 710 may receive the target text 740 as input.
  • the encoder 710 is configured to generate pronunciation information for the input target text 740 (eg, a plurality of phoneme information included in the target text, a vector for each of a plurality of phonemes included in the target text, etc.).
  • the pronunciation information generated by the encoder 710 may be provided to the attention 720 , and the attention 720 may connect the pronunciation information and voice data corresponding to the pronunciation information.
  • the pronunciation information connected as described above and voice data corresponding to the pronunciation information may be provided to the decoder 730 .
  • the decoder 730 may be configured to generate the voice data 760 corresponding to the target text 740 based on the connected pronunciation information and the voice data corresponding to the pronunciation information.
  • The decoder 730 may receive the speaker characteristic 758 of the new speaker and generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected.
  • The speaker characteristic 758 of the new speaker may be generated through the vocalization characteristic change module 756.
  • the vocal feature change module 756 may be implemented through the algorithm and/or artificial neural network model described in FIG. 4 .
  • The artificial neural network speaker feature extraction model 750 may output speaker identification information (i) 753 based on the voice 752 recorded by the speaker and the speech feature set (C) 754, and may also obtain the speaker characteristic (r) of the reference speaker.
  • the speech feature set may include one or more speech features c.
  • the speech feature set (C) 754 and the speaker feature (r) of the speaker may be expressed in a vector form.
  • The artificial neural network speaker feature extraction model may be trained to receive voices recorded by a plurality of learning speakers and a plurality of learning vocalization features (e.g., vectors) to extract a ground-truth speaker vector of a reference speaker.
  • The vocalization characteristic change module 756 may determine the vocalization characteristic change information using the generated reference speaker characteristic (r) and the input information (d) 755 associated with the vocalization characteristic change information, and furthermore, the speaker characteristic 758 of the new speaker can be determined.
  • the input information (d) 755 associated with the speech characteristic change information may include any information desired to be reflected or changed in a new speaker.
  • The processor may use a database including pairs of a plurality of training text items and training voice data items corresponding to the training text items, in which the speaker characteristics of the learning speakers are reflected.
  • The processor may learn the artificial neural network text-to-speech synthesis model to output a synthesized voice in which the speaker characteristics of the learning speaker are reflected, based on the training text items, the speaker characteristics of the learning speaker, and the training voice data items corresponding to the training text items.
  • the processor may generate the output voice 760 in which the speaker characteristics of the new speaker are reflected through the artificial neural network text-to-speech synthesis model created/learned in this way.
  • The processor may input the target text 740 and the speaker characteristic 758 of the new speaker into the model, and a synthesized voice may be generated based on the output voice data 760.
  • The synthesized voice generated in this way may reflect the speaker characteristic 758 of the new speaker and may include a voice uttering the target text 740.
  • Although the attention 720 and the decoder 730 are illustrated as separate components in FIG. 7, the present disclosure is not limited thereto.
  • the decoder 730 may include an attention 720 .
  • In the illustrated example, the speaker characteristic 758 of the new speaker is input to the decoder 730, but the present disclosure is not limited thereto; the speaker characteristic 758 of the new speaker may be input to the encoder 710 and/or the attention 720.
  • In the illustrated examples, a target text is expressed as one input data item (e.g., a vector) and one output data item (e.g., a mel-scale spectrogram) is output through the artificial neural network text-to-speech synthesis model.
  • the present invention is not limited thereto, and may be configured to output any number of output data items by inputting an arbitrary number of input data items to the artificial neural network text-to-speech synthesis model.
  • the user terminal (eg, the user terminal 210 ) may output a synthesized voice reflecting the speaker characteristics of the new speaker through the user interface 800 .
  • the user interface 800 may include a text area 810 , a speech characteristic adjustment area 820 , a speaker characteristic adjustment area 830 , and an output voice display area 840 .
  • the processor may be the processor 314 of the user terminal 210 and/or the processor 334 of the information processing system 230 .
  • the processor may receive the target text through a user input using an input interface (eg, a keyboard, a mouse, a microphone, etc.), and display the received target text through the text area 810 .
  • the processor may receive a document file including text, extract text in the document file, and display the extracted text in the text area 810 .
  • the text displayed in the text area 810 in this way may be a target to be uttered through a synthesized voice.
  • One or more reference speakers may be selected in response to a user input for selecting one or more reference speakers from among the reference speakers displayed in the speaker characteristic adjustment area 830 . Then, the processor may receive a weight (eg, a reflection ratio) for the speaker characteristics of the selected one or more reference speakers as speech characteristic change information. For example, the processor may receive a weight for each of the speaker characteristics of one or more reference speakers in Equation 5 described with reference to FIG. 4 through an input in the speaker characteristic adjustment region 830 .
  • In the speaker characteristic adjustment area 830, six reference speakers, 'Eun-Byul Ko', 'Soo-Min Kim', 'Woo-Rim Lee', 'Do-Young Song', 'Seong-Soo Shin', and 'Jin-Kyung Shin', may be given. That is, the user selects one or more reference speakers from among the six reference speakers and adjusts a reflection ratio adjustment means (e.g., a bar) corresponding to each of the selected reference speakers, so that the ratio at which the speaker characteristics of the selected reference speaker are reflected in the speaker characteristics of the new speaker may be determined. Alternatively, one or more of the six reference speakers may be randomly selected.
  • the reflection ratios for each speaker may be received so that the sum of reflection ratios corresponding to the selected one or more reference speakers becomes 100. Alternatively, even if the reflection ratio corresponding to the one or more reference speakers selected in this way is greater than or less than 100, each reflection ratio may be automatically adjusted so that the sum of the ratios becomes 100.
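A trivial sketch of the automatic re-scaling of reflection ratios mentioned above (the function name and the behavior when all ratios are zero are assumptions):

```python
def normalize_ratios(ratios):
    """Rescale the reflection ratios entered for the selected reference speakers
    so that they sum to 100, as described for the speaker characteristic
    adjustment area."""
    total = sum(ratios.values())
    if total == 0:
        raise ValueError("at least one reflection ratio must be non-zero")
    return {speaker: 100.0 * value / total for speaker, value in ratios.items()}

# usage: the user entered ratios that sum to 120
print(normalize_ratios({"speaker_a": 60, "speaker_b": 40, "speaker_c": 20}))
# -> {'speaker_a': 50.0, 'speaker_b': 33.33..., 'speaker_c': 16.66...}
```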
  • In the illustrated example, six reference speakers are used to generate the speaker characteristics of a new speaker, but the present disclosure is not limited thereto; five or fewer reference speakers, or seven or more reference speakers, may be displayed in the speaker characteristic adjustment area 830 and used to generate the speaker characteristics of the new speaker.
  • the processor may receive a weight (eg, a reflection ratio) for each of the plurality of speech features as speech feature change information through the speech feature adjustment region 820 .
  • the processor may receive a weight for each of the plurality of speech features in Equation 6 described with reference to FIG. 4 through an input in the speech feature adjustment region 820 .
  • Here, r in Equation 6 may be the speaker characteristic of the reference speaker generated according to the selection and reflection ratios of one or more reference speakers in the speaker characteristic adjustment area 830; that is, r may be the result value of Equation 5 described in FIG. 4, obtained through the input in the speaker characteristic adjustment area 830. Similarly, r in Equation 13 may be the result value of Equation 5 described in FIG. 4, obtained through the input in the speaker characteristic adjustment area 830.
  • gender, vocal tone, vocal strength, male age, female age, pitch, and tempo may be given as quantitatively adjustable vocal characteristics in the vocalization characteristic adjustment area 820 .
  • The user may adjust a ratio adjusting means (e.g., a bar) corresponding to each vocalization characteristic to determine the degree to which that vocalization characteristic is reflected in the speaker characteristic of the new speaker. When a ratio adjusting means (e.g., a bar) is set to its minimum value, the corresponding vocalization characteristic is not reflected in the speaker characteristic of the new speaker at all.
  • In the illustrated example, seven vocalization characteristics are used to generate the speaker characteristics of a new speaker, but the present disclosure is not limited thereto; six or fewer vocalization characteristics, or additional vocalization characteristics, may be displayed in the vocalization characteristic adjustment area 820 and used to generate the speaker characteristics of the new speaker.
  • The processor may receive the speaker characteristics of the one or more reference speakers selected in the speaker characteristic adjustment area 830, and may generate a speaker characteristic of a new speaker by using speech characteristic change information including the weights input through the speaker characteristic adjustment area 830 and/or the vocalization characteristic adjustment area 820.
  • One of the methods described with reference to FIG. 4 may be used as a specific method for generating the speaker characteristic of a new speaker.
  • the processor may input the target text and the generated speaker characteristics of the new speaker to the artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected.
  • When the input in the text area 810, the vocalization characteristic adjustment area 820, and the speaker characteristic adjustment area 830 is completed and the 'Create' button located below the vocalization characteristic adjustment area 820 is selected or clicked, an output voice for the target text in which the speaker characteristics of the new speaker are reflected may be generated.
  • the output voice thus generated may be output through a speaker connected to the user terminal.
  • the reproduction time and/or position of the output voice may be displayed through the output voice display area 840 .
  • the artificial neural network model 900 is a statistical learning algorithm implemented based on the structure of a biological neural network or a structure for executing the algorithm in machine learning technology and cognitive science.
  • The artificial neural network model 900 may represent a machine learning model with problem-solving ability, in which artificial neurons form a network through synaptic connections, as in a biological neural network, and the synaptic weights are repeatedly adjusted so as to learn to reduce the error between a correct output corresponding to a particular input and the inferred output.
  • the artificial neural network model 900 may include arbitrary probabilistic models, neural network models, etc.
  • The artificial neural network model 900 may include the aforementioned artificial neural network text-to-speech synthesis model, the aforementioned artificial neural network speaker characteristic change generation model, the aforementioned artificial neural network vocalization characteristic prediction model, and/or the aforementioned artificial neural network speaker feature extraction model.
  • the artificial neural network model 900 may be implemented as a multilayer perceptron (MLP) composed of multiple layers of nodes and connections between them.
  • the artificial neural network model 900 according to the present embodiment may be implemented using one of various artificial neural network structures including MLP.
  • The artificial neural network model 900 may be composed of an input layer 920 that receives an input signal or data 910 from the outside, an output layer 940 that outputs an output signal or data 950 corresponding to the input data, and n hidden layers 930_1 to 930_n located between the input layer 920 and the output layer 940, which receive signals from the input layer 920, extract characteristics, and transfer them to the output layer 940.
  • the output layer 940 may receive a signal from the hidden layers 930_1 to 930_n and output the signal to the outside.
  • The learning method of the artificial neural network model 900 may include a supervised learning method, which learns to be optimized for solving a problem using a teacher signal (correct answer), and an unsupervised learning method, which does not require a teacher signal.
  • The processor may input text information and the speaker characteristics of a new speaker into the artificial neural network model 900, and the artificial neural network model 900 may be trained end-to-end to output voice data for the text in which the speaker characteristics of the new speaker are reflected. That is, when information about the text and information about the new speaker are input into the artificial neural network model 900, the intermediate process is learned by the model itself, and a synthesized voice can be output.
  • the processor may generate the synthesized speech by converting the text information and the speaker characteristics of the new speaker into embeddings (eg, embedding vectors) through the encoding layer of the neural network model 900 .
  • the input variable of the artificial neural network model 900 may be a vector 910 composed of vector data elements representing text information and new speaker information.
  • the text information may be represented by arbitrary embeddings representing text, for example, it may be represented by character embeddings, phoneme embeddings, and the like.
  • the speaker characteristics of the new speaker may be represented by any type of embedding representing the speaker's utterance.
  • the output variable may be composed of a result vector 950 representing the synthesized voice for the target text in which the speaker characteristics of the new speaker are reflected.
  • The input layer 920 and the output layer 940 of the artificial neural network model 900 may be matched with a plurality of input variables and a plurality of output variables corresponding to each other, and by adjusting the synapse values between the nodes included in the input layer 920, the hidden layers 930_1 to 930_n (where n is a natural number equal to or greater than 2), and the output layer 940, the artificial neural network model 900 can be trained to infer the correct output corresponding to a specific input.
  • correct answer data of the analysis result may be used, and such correct answer data may be obtained as a result of an annotator's annotation work.
  • Through this learning process, the characteristics hidden in the input variables of the artificial neural network model 900 can be identified, and the synapse values (or weights) between the nodes may be adjusted so that the error between the output variable calculated from the input variable and the target output is reduced.
  • When the artificial neural network model 900 is trained, a loss function that minimizes the mutual information between the text information and the new speaker information (e.g., between the text information embedding and the new speaker information embedding) may be used.
  • For example, when the artificial neural network model 900 is the artificial neural network text-to-speech synthesis model, it may be configured to include a module (e.g., a fully-connected layer) that predicts the loss between the text information embedding and the new speaker information embedding.
  • the artificial neural network model 900 may be trained to predict and minimize mutual information between text information and speaker information.
  • the artificial neural network model 900 learned in this way may be configured to independently adjust each of the input text information and the new speaker information.
  • The processor may input the target text information and the new speaker information into the learned artificial neural network model 900, and a synthesized voice corresponding to the target text in which the speaker characteristics of the new speaker are reflected may be output.
  • voice data may be configured such that mutual information between the target text information and the new speaker information is minimized.
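One way such a mutual-information term could be realized is with a small critic network and a Donsker-Varadhan style estimate, as sketched below. This is an assumed construction, not the patent's stated objective: the text above only says that a module (e.g., a fully-connected layer) predicts the mutual information between the text and new-speaker embeddings and that the model is trained to minimize it.

```python
import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    """Small fully-connected critic that scores (text embedding, speaker embedding)
    pairs; a MINE-style lower bound on mutual information is built from its scores."""

    def __init__(self, text_dim=256, spk_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_emb, spk_emb):
        joint = self.net(torch.cat([text_emb, spk_emb], dim=-1)).squeeze(-1)
        shuffled = spk_emb[torch.randperm(spk_emb.size(0))]   # break the pairing
        marginal = self.net(torch.cat([text_emb, shuffled], dim=-1)).squeeze(-1)
        n = torch.tensor(float(marginal.numel()))
        # Donsker-Varadhan bound: E[T(joint)] - log E[exp(T(marginal))]
        return joint.mean() - (torch.logsumexp(marginal, dim=0) - torch.log(n))

# usage: estimate MI between a batch of text and speaker embeddings
mi = MIEstimator()(torch.randn(8, 256), torch.randn(8, 256))
```

During training, the critic would typically be updated to tighten the estimate while the synthesis model is updated to minimize it, for example as `total_loss = reconstruction_loss + lambda_mi * mi_estimate`; this adversarial arrangement is likewise an assumption.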
  • The learning process of the artificial neural network model 900 may also be applied, using the training data of each model, to the aforementioned artificial neural network speaker characteristic change generation model, the aforementioned artificial neural network vocalization characteristic prediction model, and/or the aforementioned artificial neural network speaker feature extraction model.
  • the artificial neural network models trained in this way may generate an inference value as output data by using data corresponding to the learning input data as input.
  • the above-described method may be provided as a computer program stored in a computer-readable recording medium for execution by a computer.
  • the medium may continuously store a computer executable program, or may be a temporary storage for execution or download.
  • The medium may be various recording means or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to any computer system, and may exist distributed on a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape, optical recording media such as CD-ROM and DVD, magneto-optical media such as floptical disks, and media configured to store program instructions, including ROM, RAM, flash memory, and the like.
  • examples of other media may include recording media or storage media managed by an app store that distributes applications, sites that supply or distribute various other software, or servers.
  • The processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this disclosure, a computer, or a combination thereof.
  • a general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, eg, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other configuration.
  • The techniques may be implemented as instructions stored on a computer-readable medium, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, compact disc (CD), or magnetic or optical data storage devices. The instructions may be executable by one or more processors, and may cause the processor(s) to perform certain aspects of the functionality described in this disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a method, performed by at least one processor, for generating synthesized speech of a new speaker. The method may comprise the steps of: receiving target text; acquiring speaker features of a reference speaker; acquiring information about changes in utterance features; determining speaker features of a new speaker by using the acquired speaker features of the reference speaker and the acquired information about changes in utterance features; and generating output speech for the target text by inputting the target text and the determined speaker features of the new speaker to an artificial neural network text-speech synthesis model, wherein the output speech reflects the determined speaker features of the new speaker. Here, the artificial neural network text-speech synthesis model can be trained on the basis of a plurality of training text items and speaker features of a plurality of training speakers to output speech for the plurality of training text items, wherein the output speech reflects the speaker features of the plurality of training speakers.

Description

새로운 화자의 합성 음성을 생성하는 방법 및 시스템Method and system for generating synthesized speech of a new speaker
본 개시는 새로운 화자의 합성 음성을 생성하는 방법 및 시스템에 관한 것으로서, 더 구체적으로, 기준 화자의 화자 특징 및 발성 특징 변화 정보를 이용하여 새로운 화자의 화자 특징을 결정하고, 인공신경망 텍스트-음성 합성 모델을 이용하여 새로운 화자의 화자 특징이 반영된 합성 음성을 생성하는 방법 및 시스템에 관한 것이다.The present disclosure relates to a method and system for generating a synthesized voice of a new speaker, and more particularly, to determine the speaker characteristic of a new speaker using the speaker characteristic and vocal characteristic change information of a reference speaker, and artificial neural network text-to-speech synthesis A method and system for generating a synthesized voice in which the speaker characteristics of a new speaker are reflected by using a model.
오디오 콘텐츠 및 비디오 콘텐츠 제작 기술의 발전에 따라, 콘텐츠 제작자는 누구나 오디오 콘텐츠 또는 비디오 콘텐츠를 쉽게 제작할 수 있게 되었다. 또한, 가상 음성 생성 기술 및 가상 영상 제작 기술의 발전으로, 성우가 녹음한 오디오 샘플을 통해 신경망 음성 모델을 학습시켜, 오디오 샘플을 녹음한 성우와 동일한 음성 특징을 갖는 음성 합성기술이 개발되고 있다.With the development of audio content and video content production technology, any content creator can easily produce audio content or video content. In addition, with the development of virtual voice generation technology and virtual image production technology, a neural network voice model is trained through audio samples recorded by voice actors, and voice synthesis technology having the same voice characteristics as voice actors recording audio samples is being developed.
그러나, 종래의 오디오 샘플 기반 음성 합성 기술은 기존에 존재하지 않던 목소리를 새롭게 생성하는 것은 기술적으로 어려우며, 남성과 여성의 목소리를 결합한 중성적인 목소리, 발음이 정확한 어린이 목소리 등 존재하지 않는 음성 특징을 갖는 목소리는 구현하기 어려운 문제가 있다. 더욱이, 새롭게 생성된 음성은, 기계적인 음성으로 인식될 정도로 퀄리티가 낮아서, 상업적으로 사용되기 어려웠다.However, in the conventional audio sample-based speech synthesis technology, it is technically difficult to create a new voice that did not exist before, and it is technically difficult to create a voice that has non-existent voice features, such as a neutral voice combining male and female voices, and a child's voice with accurate pronunciation. The voice has a problem that is difficult to implement. Moreover, the newly generated voice was of low quality enough to be recognized as a mechanical voice, making it difficult to use commercially.
본 개시는 상기와 같은 문제를 해결하기 위한 새로운 화자의 합성 음성을 생성하는 방법, 컴퓨터 판독가능한 기록 매체에 저장된 컴퓨터 프로그램 및 장치(시스템)를 제공한다.The present disclosure provides a method for generating a new speaker's synthesized voice, a computer program stored in a computer-readable recording medium, and an apparatus (system) to solve the above problems.
본 개시는 방법, 시스템, 장치 또는 컴퓨터 판독가능 저장 매체에 저장된 컴퓨터 프로그램, 컴퓨터 판독가능한 기록 매체를 포함한 다양한 방식으로 구현될 수 있다.The present disclosure may be implemented in various ways including a method, a system, an apparatus, or a computer program stored in a computer-readable storage medium, and a computer-readable recording medium.
본 개시의 일 실시예에 따르면, 적어도 하나의 프로세서에 의해 수행되는, 새로운 화자의 합성 음성을 생성하는 방법은, 대상 텍스트를 수신하는 단계, 기준 화자의 화자 특징을 획득하는 단계, 발성 특징 변화 정보를 획득하는 단계, 획득된 기준 화자의 화자 특징 및 획득된 발성 특징 변화 정보를 이용하여 새로운 화자의 화자 특징을 결정하는 단계 및 대상 텍스트 및 결정된 새로운 화자의 화자 특징을 인공신경망 텍스트-음성 합성 모델에 입력하여, 결정된 새로운 화자의 화자 특징이 반영된, 대상 텍스트에 대한 출력 음성을 생성하는 단계를 포함하고, 인공신경망 텍스트-음성 합성 모델은, 복수의 학습 텍스트 아이템 및 복수의 학습 화자의 화자 특징을 기초로, 복수의 학습 화자의 화자 특징이 반영된, 복수의 학습 텍스트 아이템에 대한 음성을 출력하도록 학습된다.According to an embodiment of the present disclosure, a method for generating a synthesized voice of a new speaker, performed by at least one processor, includes the steps of receiving a target text, acquiring speaker characteristics of a reference speaker, and changing speech characteristics. obtaining a speaker characteristic of a reference speaker and determining the speaker characteristic of a new speaker using the acquired speaker characteristic of the reference speaker and the acquired speech characteristic change information, and the target text and the speaker characteristic of the determined new speaker to the artificial neural network text-to-speech synthesis model and generating an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected, wherein the artificial neural network text-to-speech synthesis model is based on the plurality of training text items and the speaker characteristics of the plurality of learned speakers. Thus, it is learned to output voices for a plurality of learning text items in which the speaker characteristics of the plurality of learning speakers are reflected.
일 실시예에서, 새로운 화자의 화자 특징을 결정하는 단계는, 기준 화자의 화자 특징 및 획득된 발성 특징 변화 정보를 인공신경망 화자 특징 변화 생성 모델에 입력하여 화자 특징 변화를 생성하는 단계 및 기준 화자의 화자 특징 및 생성된 화자 특징 변화를 합성함으로써, 새로운 화자의 화자 특징을 출력하는 단계를 포함하고, 인공신경망 화자 특징 변화 생성 모델은, 복수의 학습 화자의 화자 특징 및 복수의 학습 화자의 화자 특징에 포함된 복수의 발성 특징을 이용하여 학습된다.In an embodiment, the determining of the speaker characteristic of the new speaker may include generating the speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired vocal characteristic change information into an artificial neural network speaker characteristic change generation model, and outputting the speaker characteristics of a new speaker by synthesizing the speaker characteristic and the generated speaker characteristic change, wherein the artificial neural network speaker characteristic change generation model is based on the speaker characteristics of the plurality of learned speakers and the speaker characteristics of the plurality of learned speakers. It is learned using a plurality of included vocal features.
일 실시예에서, 발성 특징 변화 정보는, 타겟 발성 특징의 변화에 대한 정보를 포함한다.In an embodiment, the speech characteristic change information includes information about a change in the target speech characteristic.
일 실시예에서, 기준 화자의 화자 특징을 획득하는 단계는, 복수의 기준 화자에 대응하는 복수의 화자 특징을 획득하는 단계를 포함하고, 발성 특징 변화 정보를 획득하는 단계는, 복수의 화자 특징에 대응하는 가중치 세트를 획득하는 단계를 포함하고, 새로운 화자의 화자 특징을 결정하는 단계는, 복수의 화자의 특징의 각각에 획득된 가중치 세트에 포함된 가중치를 적용함으로써, 새로운 화자의 화자 특징을 결정하는 단계를 포함한다.In an embodiment, the acquiring the speaker characteristics of the reference speaker includes acquiring a plurality of speaker characteristics corresponding to the plurality of reference speakers, and the acquiring the vocalization characteristic change information includes: obtaining a corresponding set of weights, wherein the determining of the speaker characteristic of the new speaker includes applying a weight included in the obtained weight set to each of the plurality of speaker characteristics, thereby determining the speaker characteristic of the new speaker. including the steps of
일 실시예에서, 복수의 화자의 화자 특징을 획득하는 단계 - 복수의 화자의 화자 특징은 복수의 화자 벡터를 포함함 -를 더 포함하고, 발성 특징 변화 정보를 획득하는 단계는, 복수의 화자의 화자 벡터의 각각을 정규화시키는 단계, 정규화된 복수의 화자의 화자 벡터에 대한 차원 축소 분석을 수행함으로써, 복수의 주요 성분을 결정하는 단계, 결정된 복수의 주요 성분 중 적어도 하나의 주요 성분을 선택하는 단계 및 선택된 주요 성분을 이용하여 발성 특징 변화 정보를 결정하는 단계를 포함하고, 새로운 화자의 화자 특징을 결정하는 단계는, 기준 화자의 화자 특징, 결정된 발성 특징 변화 정보 및 결정된 발성 특징 변화 정보의 가중치를 이용하여 새로운 화자의 화자 특징을 결정하는 단계를 포함한다.In an embodiment, the method further includes obtaining speaker characteristics of the plurality of speakers, wherein the speaker characteristics of the plurality of speakers include a plurality of speaker vectors, wherein the obtaining of the vocalization characteristic change information includes: Normalizing each of the speaker vectors; determining a plurality of principal components by performing dimensionality reduction analysis on the speaker vectors of the plurality of normalized speakers; selecting at least one principal component from among the determined plurality of principal components; and determining the speech characteristic change information by using the selected main component, wherein the determining of the speaker characteristic of the new speaker includes determining a weight of the speaker characteristic of the reference speaker, the determined speech characteristic change information, and the determined speech characteristic change information. and determining the speaker characteristics of the new speaker using
일 실시예에서, 복수의 화자의 화자 특징을 획득하는 단계 - 복수의 화자의 화자 특징은 복수의 화자 벡터를 포함함 -를 더 포함하고, 복수의 화자의 각각은, 하나 이상의 발성 특징에 대한 레이블이 할당되고, 발성 특징 변화 정보를 획득하는 단계는, 타겟 발성 특징이 상이한 복수의 화자의 화자 벡터를 획득하는 단계 및 획득된 복수의 화자의 화자 벡터 사이의 차이를 기초로 발성 특징 변화 정보를 결정하는 단계를 포함하고, 새로운 화자의 화자 특징을 결정하는 단계는, 기준 화자의 화자 특징, 결정된 발성 특징 변화 정보 및 결정된 발성 특징 변화 정보의 가중치를 이용하여 새로운 화자의 화자 특징을 결정하는 단계를 포함한다.In one embodiment, the method further comprises: obtaining speaker characteristics of the plurality of speakers, the speaker characteristics of the plurality of speakers comprising a plurality of speaker vectors, each of the plurality of speakers comprising: a label for one or more vocalization characteristics is assigned, and the obtaining of the speech characteristic change information includes: obtaining the speaker vectors of a plurality of speakers having different target speech characteristics, and determining the speech characteristic change information based on a difference between the obtained speaker vectors of the plurality of speakers and determining the speaker characteristic of the new speaker, wherein the determining of the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined speech characteristic change information, and the weight of the determined speech characteristic change information. do.
일 실시예에서, 복수의 화자의 화자 특징을 획득하는 단계 - 복수의 화자의 화자 특징은 복수의 화자 벡터를 포함함 -를 더 포함하고, 복수의 화자의 각각은, 하나 이상의 발성 특징에 대한 레이블이 할당되고, 발성 특징 변화 정보를 획득하는 단계는, 타겟 발성 특징이 상이한 복수의 화자 그룹의 각각에 포함된 화자들의 화자 벡터를 획득하는 단계 - 복수의 화자의 그룹은 제1 화자 그룹 및 제2 화자 그룹을 포함함 -, 제1 화자 그룹에 포함된 화자들의 화자 벡터의 평균을 산출하는 단계, 제2 화자 그룹에 포함된 화자들의 화자 벡터의 평균을 산출하는 단계 및 제1 화자 그룹에 대응하는 화자 벡터의 평균 및 제2 화자 그룹에 대응하는 화자 벡터의 평균 사이의 차이를 기초로 발성 특징 변화 정보를 결정하는 단계를 포함하고, 새로운 화자의 화자 특징을 결정하는 단계는, 기준 화자의 화자 특징, 결정된 발성 특징 변화 정보 및 결정된 발성 특징 변화 정보의 가중치를 이용하여 새로운 화자의 화자 특징을 결정하는 단계를 포함한다.In one embodiment, the method further comprises: obtaining speaker characteristics of the plurality of speakers, the speaker characteristics of the plurality of speakers comprising a plurality of speaker vectors, each of the plurality of speakers comprising: a label for one or more vocalization characteristics is allocated, and the obtaining the speech characteristic change information includes: obtaining a speaker vector of speakers included in each of a plurality of speaker groups having different target speech characteristics; including a speaker group; calculating an average of speaker vectors of speakers included in the first speaker group; calculating an average of speaker vectors of speakers included in the second speaker group; and the steps corresponding to the first speaker group and determining the speech characteristic change information based on a difference between the average of the speaker vectors and the average of the speaker vectors corresponding to the second speaker group, wherein the determining of the speaker characteristic of the new speaker includes the speaker characteristic of the reference speaker. , determining a speaker characteristic of a new speaker by using the determined speech characteristic change information and a weight of the determined speech characteristic change information.
In one embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors, and the speaker characteristic of the reference speaker including a plurality of speech characteristics of the reference speaker. Obtaining the speech characteristic change information includes inputting the speaker characteristics of the plurality of speakers into an artificial neural network speech characteristic prediction model and outputting the speech characteristics of each of the plurality of speakers; selecting, from among the speaker characteristics of the plurality of speakers, the speaker characteristic of a speaker for which a difference exists between the target speech characteristic among that speaker's output speech characteristics and the target speech characteristic among the plurality of speech characteristics of the reference speaker; and obtaining a weight corresponding to the selected speaker characteristic. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the selected speaker characteristic, and the weight corresponding to the selected speaker characteristic.
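The selection step in this embodiment can be pictured as follows: a prediction model estimates the target speech characteristic for every stored speaker vector, speakers whose predicted value differs from that of the reference speaker become candidates, and one candidate's vector is blended with the reference vector using a weight. In the sketch below, `predict_speech_characteristics` is a hypothetical stand-in for the artificial neural network prediction model, and linear interpolation is only one possible way to apply the weight.

    import numpy as np

    def predict_speech_characteristics(speaker_vector: np.ndarray) -> dict:
        """Placeholder for the neural speech characteristic prediction model."""
        # A real model would return predicted attributes such as tone, speed, age, or gender.
        return {"speech_speed": float(speaker_vector[:8].mean())}

    def select_candidates(speaker_vectors, reference_vector, target="speech_speed", min_diff=0.1):
        ref_value = predict_speech_characteristics(reference_vector)[target]
        candidates = []
        for vec in speaker_vectors:
            value = predict_speech_characteristics(vec)[target]
            if abs(value - ref_value) > min_diff:   # target characteristic differs from the reference
                candidates.append(vec)
        return candidates

    speaker_vectors = np.random.randn(100, 256)
    reference_vector = np.random.randn(256)

    candidates = select_candidates(speaker_vectors, reference_vector)
    if candidates:
        selected = candidates[0]
        weight = 0.3  # weight corresponding to the selected speaker characteristic
        new_speaker_vector = (1 - weight) * reference_vector + weight * selected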
In one embodiment, the speaker characteristic of the new speaker includes a speaker feature vector, and the method includes calculating a hash value corresponding to the speaker feature vector using a hash function; determining whether, among the content of a plurality of speakers stored in a storage medium, there is content associated with a hash value similar to the calculated hash value; and, if there is no content associated with a hash value similar to the calculated hash value, determining that the output voice associated with the speaker characteristic of the new speaker is a new output voice.
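One way to realize the hash-based check above is to quantize the speaker feature vector before hashing, so that nearby vectors map to the same hash value, and to treat the output voice as new only if no stored content shares that hash. The quantization step and the in-memory set of stored hashes below are illustrative assumptions; any locality-preserving hashing scheme could take their place.

    import hashlib
    import numpy as np

    def speaker_hash(vector: np.ndarray, precision: float = 0.1) -> str:
        """Hash a speaker feature vector after coarse quantization."""
        quantized = np.round(vector / precision).astype(np.int64)
        return hashlib.sha256(quantized.tobytes()).hexdigest()

    # Hashes of speaker vectors associated with content already in storage (illustrative data).
    stored_hashes = {speaker_hash(np.random.randn(256)) for _ in range(1000)}

    new_speaker_vector = np.random.randn(256)
    new_hash = speaker_hash(new_speaker_vector)

    # If no stored content has a matching hash, treat the associated output voice as new.
    is_new_output_voice = new_hash not in stored_hashes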
In one embodiment, the speaker characteristic of the reference speaker includes a speaker vector, and obtaining the speech characteristic change information includes extracting a normal vector for a target speech characteristic using a speech characteristic classification model corresponding to the target speech characteristic, where the normal vector refers to the normal vector of the hyperplane that classifies the target speech characteristic, and obtaining information indicating a degree to which the target speech characteristic is to be adjusted. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker based on the speaker vector of the reference speaker, the extracted normal vector, and the degree to which the target speech characteristic is to be adjusted.
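A common way to obtain such a hyperplane normal is to fit a linear classifier that separates speaker vectors by the target speech characteristic; the learned weight vector is the normal of the separating hyperplane, and moving the reference speaker's vector along it by a chosen amount adjusts that characteristic. The sketch below uses scikit-learn's LinearSVC as one possible classification model; the labels and the scaling constant `alpha` are illustrative assumptions, not the classification model of the disclosure.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Speaker vectors with binary labels for the target speech characteristic
    # (e.g. 0 = soft tone, 1 = firm tone); illustrative random data.
    speaker_vectors = np.random.randn(200, 256)
    labels = np.random.randint(0, 2, size=200)

    clf = LinearSVC().fit(speaker_vectors, labels)
    normal = clf.coef_[0]
    normal = normal / np.linalg.norm(normal)   # unit normal of the separating hyperplane

    reference_vector = np.random.randn(256)
    alpha = 1.5                                # degree to which the target characteristic is adjusted

    new_speaker_vector = reference_vector + alpha * normal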
According to an embodiment of the present disclosure, a computer program stored in a computer-readable recording medium is provided for executing, on a computer, the above-described method for generating a synthesized voice of a new speaker.

According to an embodiment of the present disclosure, a speech synthesizer is trained using training data including a synthesized voice of a new speaker generated according to the above-described method for generating a synthesized voice of a new speaker.

According to an embodiment of the present disclosure, an apparatus for providing a synthesized voice includes a memory configured to store a synthesized voice of a new speaker generated according to the above-described method, and at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory, wherein the at least one program is configured to output at least a part of the synthesized voice of the new speaker stored in the memory.

According to an embodiment of the present disclosure, a method of providing a synthesized voice of a new speaker, performed by at least one processor, includes storing a synthesized voice of a new speaker generated according to the above-described method and providing at least a part of the stored synthesized voice of the new speaker.
According to some embodiments of the present disclosure, a natural voice can be generated for the target text in which the speaker characteristics of the new speaker are reflected.

According to some embodiments of the present disclosure, a synthesized voice having a new voice can be generated by modifying the speaker feature vector through quantitative adjustment of speech characteristics.

According to some embodiments of the present disclosure, a new speaker's voice can be generated by mixing the voices of several speakers (for example, two or more speakers, or three or more speakers).

According to some embodiments of the present disclosure, an output voice can be generated by finely adjusting one or more speech characteristics from the user terminal. For example, the one or more speech characteristics may include gender adjustment, vocal tone adjustment, vocal strength, male age adjustment, female age adjustment, pitch, tempo, and the like.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description of the claims by a person of ordinary skill in the art to which the present disclosure pertains (hereinafter, a "person of ordinary skill").
Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, in which like reference numerals denote like elements, although the embodiments are not limited thereto.

FIG. 1 is a diagram illustrating an example in which a synthesized voice generation system according to an embodiment of the present disclosure receives a target text and the speaker characteristics of a new speaker and generates an output voice.

FIG. 2 is a schematic diagram illustrating a configuration in which a plurality of user terminals and a synthesized voice generation system are communicatively connected to provide a synthesized voice generation service for text according to an embodiment of the present disclosure.

FIG. 3 is a block diagram illustrating the internal configuration of a user terminal and a synthesized voice generation system according to an embodiment of the present disclosure.

FIG. 4 is a block diagram illustrating the internal configuration of a processor of a synthesized voice generation system according to an embodiment of the present disclosure.

FIG. 5 is a flowchart illustrating a method of generating an output voice in which the speaker characteristics of a new speaker are reflected, according to an embodiment of the present disclosure.

FIG. 6 is a diagram illustrating an example of generating an output voice in which the speaker characteristics of a new speaker are reflected, using an artificial neural network text-to-speech synthesis model according to an embodiment of the present disclosure.

FIG. 7 is a diagram illustrating an example of generating an output voice in which the speaker characteristics of a new speaker are reflected, using an artificial neural network text-to-speech synthesis model according to another embodiment of the present disclosure.

FIG. 8 is an exemplary diagram illustrating a user interface for generating an output voice in which the speaker characteristics of a new speaker are reflected, according to an embodiment of the present disclosure.

FIG. 9 is a structural diagram illustrating an artificial neural network model according to an embodiment of the present disclosure.
Hereinafter, specific details for carrying out the present disclosure will be described with reference to the accompanying drawings. In the following description, however, detailed descriptions of well-known functions or configurations will be omitted where they might unnecessarily obscure the gist of the present disclosure.

In the accompanying drawings, identical or corresponding components are assigned the same reference numerals. In the description of the embodiments below, duplicate descriptions of identical or corresponding components may be omitted. However, even if a description of a component is omitted, it is not intended that such a component is excluded from any embodiment.

Advantages and features of the disclosed embodiments, and methods of achieving them, will become apparent with reference to the embodiments described below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various different forms; these embodiments are provided only so that the present disclosure is complete and fully conveys the scope of the invention to a person of ordinary skill in the art.
The terms used in this specification will be briefly described, and the disclosed embodiments will then be described in detail. The terms used herein have been selected from general terms currently in wide use, in consideration of their function in the present disclosure, but they may vary depending on the intention of those skilled in the relevant field, judicial precedents, the emergence of new technology, and the like. In certain cases, a term may be arbitrarily selected by the applicant, in which case its meaning will be described in detail in the corresponding description of the invention. Therefore, the terms used in the present disclosure should be defined based on the meaning of the term and the overall content of the present disclosure, rather than simply on the name of the term.

In this specification, singular expressions include plural expressions unless the context clearly specifies the singular, and plural expressions include singular expressions unless the context clearly specifies the plural. When a part is said to include a certain component throughout the specification, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.
The term "unit" or "module" used in the specification refers to a software or hardware component, and a "unit" or "module" performs certain roles. However, "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside on an addressable storage medium or to run one or more processors. Thus, as an example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, or variables. The functions provided within components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

According to an embodiment of the present disclosure, a "unit" or "module" may be implemented as a processor and a memory. The term "processor" should be interpreted broadly to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, and the like. In some environments, "processor" may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), and the like. "Processor" may also refer to a combination of processing devices, for example a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, a combination of one or more microprocessors with a DSP core, or any other such configuration. The term "memory" should be interpreted broadly to include any electronic component capable of storing electronic information. "Memory" may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, registers, and the like. A memory is said to be in electronic communication with a processor if the processor can read information from and/or write information to the memory. A memory integrated into a processor is in electronic communication with the processor.
In the present disclosure, a "text item" may refer to part or all of a text, and a text may refer to a text item. Similarly, each of "data item" and "information item" may refer to at least part of data and at least part of information, and data and information may refer to a data item and an information item, respectively. In the present disclosure, "each of a plurality of A" may refer to each of all components included in the plurality of A, or to each of some components included in the plurality of A. For example, each of the characteristics of a plurality of speakers may refer to each of all speaker characteristics included in the characteristics of the plurality of speakers, or to each of some speaker characteristics included in the characteristics of the plurality of speakers.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present disclosure pertains can easily carry them out. In the drawings, parts irrelevant to the description are omitted in order to describe the present disclosure clearly.
FIG. 1 is a diagram illustrating an example in which a synthesized voice generation system 100 according to an embodiment of the present disclosure receives a target text 110 and the speaker characteristics 120 of a new speaker and generates an output voice 130. The synthesized voice generation system 100 may receive the target text 110 and the speaker characteristics 120 of the new speaker and generate the output voice 130 in which the speaker characteristics 120 of the new speaker are reflected. Here, the target text 110 may include one or more paragraphs, sentences, clauses, phrases, words, word segments, phonemes, and the like.

According to an embodiment, the speaker characteristics 120 of the new speaker may be determined or generated using the speaker characteristics of a reference speaker and speech characteristic change information. Here, the speaker characteristics of the reference speaker may include the speaker characteristics of a speaker that serves as a reference for generating the speaker characteristics of the speaker to be newly created, that is, the new speaker. For example, the speaker characteristics of the reference speaker may include speaker characteristics similar to those of the speaker to be newly created. As another example, the speaker characteristics of the reference speaker may include the speaker characteristics of a plurality of reference speakers.
According to an embodiment, the speaker characteristics of the reference speaker may include a speaker vector of the reference speaker. For example, the speaker vector of the reference speaker may be extracted based on a speaker id (for example, a speaker one-hot vector) and a speech characteristic (for example, a vector) using an artificial neural network speaker feature extraction model. Here, the artificial neural network speaker feature extraction model may be trained to receive a plurality of training speaker ids and a plurality of training speech characteristics (for example, vectors) and to extract a ground-truth speaker vector of a reference speaker. As another example, the speaker vector of the reference speaker may be extracted based on a voice recorded by a speaker and a speech characteristic (for example, a vector) using the artificial neural network speaker feature extraction model. In this case, the model may be trained to receive voices recorded by a plurality of training speakers and a plurality of training speech characteristics (for example, vectors) and to extract a ground-truth speaker vector of a reference speaker. The speaker vector of the reference speaker may include one or more speech characteristics of the reference speaker's voice (for example, tone, vocal strength, speaking speed, gender, age, and the like). In addition, the speaker id and/or the voice recorded by the speaker may be selected as the voice on which the speaker characteristics of the new speaker are based, and the speech characteristic may include a base speech characteristic to be reflected in the speaker characteristics of the new speaker. That is, the speaker id, the recorded voice, and/or the speech characteristic are used to generate the speaker characteristics of the reference speaker, and the speaker characteristics of the reference speaker thus generated are combined with the speech characteristic change information to generate the speaker characteristics of the new speaker.
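The speaker feature extraction model described above can be pictured as a small network that maps a speaker id (as an embedded one-hot index) together with a speech characteristic vector to a speaker vector. The PyTorch sketch below is a minimal, hypothetical architecture shown only to illustrate the input/output interface; the actual model of the disclosure may differ in structure and training.

    import torch
    import torch.nn as nn

    class SpeakerFeatureExtractor(nn.Module):
        """Maps (speaker id, speech characteristic vector) to a speaker vector."""
        def __init__(self, num_speakers: int, char_dim: int, speaker_dim: int = 256):
            super().__init__()
            self.id_embedding = nn.Embedding(num_speakers, speaker_dim)
            self.mlp = nn.Sequential(
                nn.Linear(speaker_dim + char_dim, speaker_dim),
                nn.ReLU(),
                nn.Linear(speaker_dim, speaker_dim),
            )

        def forward(self, speaker_id: torch.Tensor, speech_char: torch.Tensor) -> torch.Tensor:
            id_vec = self.id_embedding(speaker_id)                      # (batch, speaker_dim)
            return self.mlp(torch.cat([id_vec, speech_char], dim=-1))   # (batch, speaker_dim)

    # Example: extract a reference speaker's vector from an id and a speech characteristic vector.
    model = SpeakerFeatureExtractor(num_speakers=100, char_dim=16)
    speaker_id = torch.tensor([7])
    speech_char = torch.randn(1, 16)
    reference_speaker_vector = model(speaker_id, speech_char)           # (1, 256)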
The speech characteristic change information may include any information about a speech characteristic that is to be applied to the speaker characteristics of the new speaker. According to an embodiment, the speech characteristic change information may include information about a difference between the speaker characteristics of the new speaker and the speaker characteristics of the reference speaker. For example, the characteristics of the new speaker may be generated by combining the speaker characteristics of the reference speaker with a speaker characteristic change. Here, the speaker characteristic change may be generated by inputting the speaker characteristics of the reference speaker and the speech characteristic change information into an artificial neural network speaker characteristic change generation model. The artificial neural network speaker characteristic change generation model may be trained using the speaker characteristics of a plurality of training speakers and a plurality of speech characteristics included in the plurality of speaker characteristics.

For example, the speech characteristic change information may include information indicating a difference between a target speech characteristic included in the speaker characteristics of the new speaker and the target speech characteristic included in the speaker characteristics of the reference speaker; that is, it may include information about a change in the target speech characteristic. As another example, the speech characteristic change information may include the normal vector of a hyperplane that classifies the target speech characteristic from speaker characteristics, together with information indicating the degree to which the target speech characteristic is to be adjusted. As another example, the speech characteristic change information may include a weight to be applied to each of the speaker characteristics of a plurality of reference speakers. As another example, the speech characteristic change information may include a target speech characteristic generated based on an axis between target speech characteristics included in the training speakers, together with a weight for that target speech characteristic. As another example, the speech characteristic change information may include a target speech characteristic generated based on a difference between the speaker characteristics of speakers that differ in the target speech characteristic, together with a weight for that target speech characteristic. As yet another example, the speech characteristic change information may include the speaker characteristic of a speaker that differs from the target characteristic included in the speaker characteristics of the reference speaker, together with a weight for that speaker characteristic.
The synthesized voice generation system 100 may generate the output voice 130 as a synthesized voice for the target text 110 in which the speaker characteristics 120 of the new speaker are reflected, that is, a voice in which the target text is uttered according to the newly generated speaker characteristics. To this end, the synthesized voice generation system 100 may include an artificial neural network text-to-speech synthesis model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output voices for the plurality of training text items in which the speaker characteristics of the plurality of training speakers are reflected. Alternatively, the artificial neural network text-to-speech synthesis model may be configured to output voice data when the target text 110 and the speaker characteristics 120 of the new speaker are input; in this case, the output voice data may be post-processed into human-audible speech using a post-processor, a vocoder, or the like.
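At inference time, the flow described above can be summarized as: the text-to-speech model consumes the target text and the new speaker's characteristic and emits acoustic features (for example, a mel spectrogram), which a vocoder then converts into an audible waveform. The function names below (`tts_model`, `vocoder`) are placeholders standing in for the trained artificial neural network models and are not actual APIs from the disclosure; the dummy callables exist only so the sketch runs end to end.

    import numpy as np

    def synthesize(target_text: str, new_speaker_vector: np.ndarray, tts_model, vocoder) -> np.ndarray:
        """Generate a waveform for target_text in the new speaker's voice."""
        # 1) The TTS model predicts acoustic features conditioned on the new speaker's vector.
        mel_spectrogram = tts_model(target_text, new_speaker_vector)
        # 2) The vocoder post-processes the acoustic features into an audible waveform.
        waveform = vocoder(mel_spectrogram)
        return waveform

    # Purely illustrative stand-ins for the trained models.
    dummy_tts = lambda text, spk: np.random.randn(80, len(text) * 10)   # fake mel spectrogram
    dummy_vocoder = lambda mel: np.random.randn(mel.shape[1] * 256)     # fake waveform

    audio = synthesize("hello world", np.random.randn(256), dummy_tts, dummy_vocoder)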
FIG. 2 is a schematic diagram illustrating a configuration in which a plurality of user terminals 210_1, 210_2, and 210_3 and a synthesized voice generation system 230 are communicatively connected to provide a synthesized voice generation service for text according to an embodiment of the present disclosure. The plurality of user terminals 210_1, 210_2, and 210_3 may communicate with the synthesized voice generation system 230 through a network 220. The network 220 may be configured to enable communication between the plurality of user terminals 210_1, 210_2, and 210_3 and the synthesized voice generation system 230. Depending on the installation environment, the network 220 may consist of a wired network such as Ethernet, power line communication, telephone line communication, or RS-serial communication; a wireless network such as a mobile communication network, a wireless LAN (WLAN), Wi-Fi, Bluetooth, or ZigBee; or a combination thereof. The communication method is not limited and may include not only communication methods utilizing communication networks that the network 220 may include (for example, a mobile communication network, wired Internet, wireless Internet, a broadcasting network, a satellite network, and the like), but also short-range wireless communication between the user terminals 210_1, 210_2, and 210_3. For example, the network 220 may include any one or more of networks such as a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), and the Internet. In addition, the network 220 may include any one or more of network topologies including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited thereto.

In FIG. 2, a mobile phone or smartphone 210_1, a tablet computer 210_2, and a laptop or desktop computer 210_3 are shown as examples of user terminals that execute or operate a user interface providing the synthesized voice generation service, but the user terminals are not limited thereto. The user terminals 210_1, 210_2, and 210_3 may be any computing device capable of wired and/or wireless communication on which a web browser, a mobile browser application, or a synthesized voice generation application is installed so that a user interface providing the synthesized voice generation service can be executed. For example, the user terminal 210 may include a smartphone, a mobile phone, a navigation terminal, a desktop computer, a laptop computer, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a tablet computer, a game console, a wearable device, an Internet of things (IoT) device, a virtual reality (VR) device, an augmented reality (AR) device, and the like. In addition, although FIG. 2 shows three user terminals 210_1, 210_2, and 210_3 communicating with the synthesized voice generation system 230 through the network 220, the present disclosure is not limited thereto, and a different number of user terminals may be configured to communicate with the synthesized voice generation system 230 through the network 220.
In an embodiment, the user terminals 210_1, 210_2, and 210_3 may provide the synthesized voice generation system 230 with the target text, information about the speaker characteristics of the reference speaker, and/or information indicating or selecting the speech characteristic change information. Additionally or alternatively, the user terminals 210_1, 210_2, and 210_3 may receive the speaker characteristics of candidate reference speakers and/or candidate speech characteristic change information from the synthesized voice generation system 230. In response to a user input, the user terminals 210_1, 210_2, and 210_3 may select the speaker characteristics of the reference speaker and/or the speech characteristic change information from among the speaker characteristics of the candidate reference speakers and/or the candidate speech characteristic change information. In addition, the user terminals 210_1, 210_2, and 210_3 may receive the generated output voice from the synthesized voice generation system 230.

Although each of the user terminals 210_1, 210_2, and 210_3 and the synthesized voice generation system 230 are illustrated in FIG. 2 as separately configured elements, the present disclosure is not limited thereto, and the synthesized voice generation system 230 may be configured to be included in each of the user terminals 210_1, 210_2, and 210_3. Alternatively, the synthesized voice generation system 230 may include an input/output interface and may be configured to determine the target text, the speaker characteristics of the reference speaker, and the speech characteristic change information without communicating with the user terminals 210_1, 210_2, and 210_3, and to output a synthesized voice for the target text in which the speaker characteristics of the new speaker are reflected.
FIG. 3 is a block diagram illustrating the internal configuration of the user terminal 210 and the synthesized voice generation system 230 according to an embodiment of the present disclosure. The user terminal 210 may refer to any computing device capable of wired/wireless communication and may include, for example, the mobile phone or smartphone 210_1, the tablet computer 210_2, or the laptop or desktop computer 210_3 of FIG. 2. As shown, the user terminal 210 may include a memory 312, a processor 314, a communication module 316, and an input/output interface 318. Similarly, the synthesized voice generation system 230 may include a memory 332, a processor 334, a communication module 336, and an input/output interface 338. As shown in FIG. 3, the user terminal 210 and the synthesized voice generation system 230 may be configured to communicate information and/or data through the network 220 using their respective communication modules 316 and 336. In addition, an input/output device 320 may be configured to input information and/or data to the user terminal 210 through the input/output interface 318 or to output information and/or data generated by the user terminal 210.

The memories 312 and 332 may include any non-transitory computer-readable recording medium. According to an embodiment, the memories 312 and 332 may include a permanent mass storage device such as random access memory (RAM), read-only memory (ROM), a disk drive, a solid state drive (SSD), or flash memory. As another example, a permanent mass storage device such as a ROM, SSD, flash memory, or disk drive may be included in the user terminal 210 and/or the synthesized voice generation system 230 as a separate permanent storage device distinct from the memory. In addition, the memories 312 and 332 may store an operating system and at least one program code (for example, code for determining the speaker characteristics of a new speaker, code for generating an output voice in which the speaker characteristics of a new speaker are reflected, and the like).

These software components may be loaded from a computer-readable recording medium separate from the memories 312 and 332. Such a separate computer-readable recording medium may include a recording medium directly connectable to the user terminal 210 and the synthesized voice generation system 230, for example a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card. As another example, the software components may be loaded into the memories 312 and 332 through a communication module rather than a computer-readable recording medium. For example, at least one program may be loaded into the memories 312 and 332 based on a computer program (for example, an artificial neural network text-to-speech synthesis model program) installed from files provided over the network 220 by developers or by a file distribution system that distributes installation files of applications.
The processors 314 and 334 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processors 314 and 334 by the memories 312 and 332 or the communication modules 316 and 336. For example, the processors 314 and 334 may be configured to execute received instructions according to program code stored in a recording device such as the memories 312 and 332.

The communication modules 316 and 336 may provide a configuration or function that allows the user terminal 210 and the synthesized voice generation system 230 to communicate with each other through the network 220, and may provide a configuration or function that allows the user terminal 210 and/or the synthesized voice generation system 230 to communicate with another user terminal or another system (for example, a separate cloud system or a separate frame image generation system). As an example, a request generated by the processor 314 of the user terminal 210 according to program code stored in a recording device such as the memory 312 (for example, a request to generate a synthesized voice, a request to generate the speaker characteristics of a new speaker, and the like) may be transmitted to the synthesized voice generation system 230 through the network 220 under the control of the communication module 316. Conversely, a control signal or command provided under the control of the processor 334 of the synthesized voice generation system 230 may be received by the user terminal 210 through the communication module 316 of the user terminal 210 via the communication module 336 and the network 220.

The input/output interface 318 may be a means for interfacing with the input/output device 320. As an example, the input device may include a device such as a keyboard, a microphone, a mouse, or a camera including an image sensor, and the output device may include a device such as a display, a speaker, or a haptic feedback device. As another example, the input/output interface 318 may be a means for interfacing with a device, such as a touch screen, in which configurations or functions for performing input and output are integrated. For example, when the processor 314 of the user terminal 210 processes instructions of a computer program loaded in the memory 312, a service screen or user interface configured using information and/or data provided by the synthesized voice generation system 230 or another user terminal may be displayed on the display through the input/output interface 318.
Although the input/output device 320 is illustrated in FIG. 3 as not being included in the user terminal 210, the present disclosure is not limited thereto, and the input/output device 320 may be configured as a single device together with the user terminal 210. In addition, the input/output interface 338 of the synthesized voice generation system 230 may be a means for interfacing with a device (not shown) for input or output that is connected to, or may be included in, the synthesized voice generation system 230. Although the input/output interfaces 318 and 338 are illustrated in FIG. 3 as elements configured separately from the processors 314 and 334, the present disclosure is not limited thereto, and the input/output interfaces 318 and 338 may be configured to be included in the processors 314 and 334.

The user terminal 210 and the synthesized voice generation system 230 may include more components than those shown in FIG. 3. However, it is not necessary to clearly illustrate most conventional components. According to an embodiment, the user terminal 210 may be implemented to include at least a part of the input/output device 320 described above. In addition, the user terminal 210 may further include other components such as a transceiver, a Global Positioning System (GPS) module, a camera, various sensors, and a database. For example, when the user terminal 210 is a smartphone, it may include components generally included in a smartphone; for example, various components such as an acceleration sensor, a gyro sensor, a camera module, various physical buttons, buttons using a touch panel, input/output ports, and a vibrator for vibration may be further included in the user terminal 210.

According to an embodiment, the processor 314 of the user terminal 210 may be configured to operate a synthesized voice output application or the like. In this case, code associated with the application and/or program may be loaded into the memory 312 of the user terminal 210. While the application and/or program is running, the processor 314 of the user terminal 210 may receive information and/or data provided from the input/output device 320 through the input/output interface 318, or receive information and/or data from the synthesized voice generation system 230 through the communication module 316, and may process the received information and/or data and store it in the memory 312. In addition, such information and/or data may be provided to the synthesized voice generation system 230 through the communication module 316.
While a program for the synthesized voice output application or the like is running, the processor 314 may receive text entered or selected through an input device 320, such as a touch screen or keyboard, connected to the input/output interface 318, and may store the received text in the memory 312 or provide it to the synthesized voice generation system 230 through the communication module 316 and the network 220. For example, the processor 314 may receive an input of the target text (for example, one or more paragraphs, sentences, phrases, words, phonemes, and the like) through the input device 320. Additionally, the processor 314 may receive, through the input device 320, any information indicating or selecting information about the reference speaker and/or the speech characteristic change information.

According to an embodiment, the processor 314 may receive an input of the target text made through the input device 320 via the input/output interface 318. According to another embodiment, the processor 314 may receive, through the input device 320 and the input/output interface 318, an input for uploading a document-format file containing the target text through the user interface. In response to this input, the processor 314 may receive the document-format file corresponding to the input from the memory 312 and may receive the target text included in the file. The received target text may then be provided to the synthesized voice generation system 230 through the communication module 316. Alternatively, the processor 314 may be configured to provide the uploaded file to the synthesized voice generation system 230 through the communication module 316 and to receive the target text contained in the file from the synthesized voice generation system 230.

The processor 314 may be configured to output processed information and/or data through an output device of the user terminal 210, such as a display-capable device (for example, a touch screen or display) or an audio-output-capable device (for example, a speaker). For example, the processor 314 may output, on the screen of the user terminal 210, the target text and/or information indicating or selecting the speech characteristic change information received from at least one of the input device 320, the memory 312, or the synthesized voice generation system 230. Additionally or alternatively, the processor 314 may output, on the screen of the user terminal 210, the speaker characteristics of the new speaker determined or generated by the synthesized voice generation system 230. In addition, the processor 314 may output the synthesized voice through an audio-output-capable device such as a speaker.
The processor 334 of the synthesized voice generation system 230 may be configured to manage, process, and/or store information and/or data received from a plurality of user terminals, including the user terminal 210, and/or a plurality of external systems. The information and/or data processed by the processor 334 may be provided to the user terminal 210 through the communication module 336. In an embodiment, the processor 334 may receive the target text, information about the reference speaker, and information indicating or selecting the speech characteristic change information from the user terminal 210, the memory 332, and/or an external storage device, and may obtain or determine the speaker characteristics of the reference speaker and the speech characteristic change information contained in the memory 332 and/or the external storage device.

The processor 334 may then determine the speaker characteristics of the new speaker using the speaker characteristics of the reference speaker and the speech characteristic change information. The processor 334 may also generate an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected. For example, the processor 334 may input the target text and the speaker characteristics of the new speaker into the artificial neural network text-to-speech synthesis model and generate the output voice from the model. The output voice generated in this way may be provided to the user terminal 210 through the network 220 and output through a speaker associated with the user terminal 210.

FIG. 4 is a block diagram illustrating the internal configuration of the processor 334 of the synthesized voice generation system according to an embodiment of the present disclosure. As shown, the processor 334 may include a speaker characteristic determination module 410, a synthesized voice output module 420, a speech characteristic change information determination module 430, and an output voice verification module 440. Each of the modules operating on the processor 334 may be configured to communicate with the others. Although the internal configuration of the processor 334 is described in FIG. 4 by function, this does not necessarily mean that the modules are physically separated. In addition, the internal configuration of the processor 334 shown in FIG. 4 is only an example and does not depict only essential configurations. Accordingly, in some embodiments, the processor 334 may be implemented differently, for example by further including components other than those illustrated or by omitting some of the illustrated components.
The speaker characteristic determination module 410 may obtain the speaker characteristic of a reference speaker. According to an embodiment, as described with reference to FIG. 1, the characteristic of the reference speaker may be extracted through a trained artificial neural network speaker characteristic extraction model. For example, the speaker characteristic determination module 410 may input a speaker id (e.g., a speaker one-hot vector) and a vocal characteristic (e.g., a vector) to the trained artificial neural network speaker characteristic extraction model to extract the speaker characteristic (e.g., a vector) of the reference speaker. As another example, the speaker characteristic determination module 410 may input a speech recorded by a speaker and a vocal characteristic (e.g., a vector) to the trained artificial neural network speaker characteristic extraction model to extract the speaker characteristic (e.g., a vector) of the reference speaker.
The speaker characteristic determination module 410 may obtain the speaker characteristic of the reference speaker and vocal characteristic change information, and may determine the speaker characteristic of a new speaker using the obtained speaker characteristic of the reference speaker and the obtained vocal characteristic change information. Here, the speaker characteristic of the reference speaker may be selected as at least one of the speaker characteristics of a plurality of speakers stored in a storage medium. In addition, the vocal characteristic change information may be information indicating a change in the speaker characteristic of the reference speaker, information indicating a change in the speaker characteristics of at least some of the plurality of speakers stored in the storage medium, and/or information indicating a change in the vocal characteristics included in the speaker characteristics of at least some of the plurality of speakers. Here, the speaker characteristics of the plurality of speakers may include characteristics inferred from the trained artificial neural network speaker characteristic extraction model. In addition, each of the speaker characteristic and the vocal characteristic may be expressed in a vector form.
The synthesized speech output module 420 may receive the target text from the user terminal and receive the speaker characteristic of the new speaker from the speaker characteristic determination module 410. The synthesized speech output module 420 may generate an output speech for the target text in which the speaker characteristic of the new speaker is reflected. In an embodiment, the synthesized speech output module 420 may input the target text and the speaker characteristic of the new speaker to a trained artificial neural network text-to-speech synthesis model and generate the output speech (i.e., the synthesized speech) from the model, as sketched below. The artificial neural network text-to-speech synthesis model may be stored in a storage medium (e.g., the memory 332 of the information processing system 230, another storage medium accessible by the processor 334 of the information processing system 230, etc.). Here, the artificial neural network text-to-speech synthesis model may include a model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output speech for a text in which the speaker characteristics of the training speakers are reflected. Then, the synthesized speech output module 420 may provide the generated synthesized speech to the user terminal. Accordingly, the generated synthesized speech may be output through any speaker built into the user terminal 210 or connected to it by wire or wirelessly.
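The following Python sketch only illustrates this module-level data flow under stated assumptions: extract_speaker_feature, apply_change, and tts_model are hypothetical stand-ins for the trained speaker characteristic extraction model, the speaker characteristic determination logic, and the text-to-speech synthesis model described above; they are not part of the disclosure.

```python
import numpy as np

# Hypothetical stand-ins for the trained models described in the text.
# In a real system these would be neural networks; here they are simple
# placeholders so that the module-level data flow can be executed.

def extract_speaker_feature(speaker_id: int, vocal_feature: np.ndarray) -> np.ndarray:
    """Placeholder for the speaker characteristic extraction model."""
    rng = np.random.default_rng(speaker_id)           # deterministic per speaker id
    return rng.standard_normal(256) + 0.01 * vocal_feature.sum()

def apply_change(ref_feature: np.ndarray, change: np.ndarray) -> np.ndarray:
    """Placeholder for determining the new speaker characteristic (module 410)."""
    return ref_feature + change

def tts_model(text: str, speaker_feature: np.ndarray) -> np.ndarray:
    """Placeholder for the artificial neural network TTS model: returns fake audio."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return 0.1 * rng.standard_normal(22050) * np.tanh(speaker_feature[:1])

# Module-level flow: reference speaker characteristic -> change -> synthesis.
target_text = "Hello, this is a synthesized voice."
vocal_feature = np.array([1.0, 30.0, -1.0, 1.0, 1.0])    # e.g. gender/age/tone/rate/strength
r_ref = extract_speaker_feature(speaker_id=7, vocal_feature=vocal_feature)
change = 0.05 * np.ones_like(r_ref)                       # stand-in for vocal characteristic change
r_new = apply_change(r_ref, change)
audio = tts_model(target_text, r_new)
print(audio.shape)    # one second of (placeholder) audio at 22.05 kHz
```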
The vocal characteristic change information determination module 430 may obtain vocal characteristic change information from the memory 332. According to an embodiment, the vocal characteristic change information may be determined from information set through a user input via a user terminal (e.g., the user terminal 210 of FIG. 2). Here, the vocal characteristic change information may include information on the vocal characteristic to be changed in order to generate the speaker to be newly created, that is, the new speaker. Additionally or alternatively, the vocal characteristic change information may include information associated with the speaker characteristic of the reference speaker (e.g., reflection ratio information).
Hereinafter, specific examples are described in which the vocal characteristic change information is determined by the speaker characteristic determination module 410 and the vocal characteristic change information determination module 430, and the characteristic of a new speaker is determined using the determined vocal characteristic change information and the speaker characteristic of the reference speaker.
In an embodiment, the speaker characteristic determination module 410 may input the speaker characteristic of the reference speaker and the vocal characteristic change information to a trained artificial neural network speaker characteristic change generation model to generate a speaker characteristic change, and may output the speaker characteristic of the new speaker by combining the speaker characteristic of the reference speaker with the generated speaker characteristic change. When training this artificial neural network speaker characteristic change generation model, the vocal characteristic information included in each speaker's speaker characteristic is not used as an input; instead, separate vocal characteristic information may be obtained for each speaker. For example, the vocal characteristic information of a given speaker may be obtained through human tagging. As another example, the vocal characteristic information of a given speaker may be obtained through an artificial neural network vocal characteristic extraction model trained to infer a speaker's vocal characteristic from the speaker's speaker characteristic. The vocal characteristic information obtained in this way may be stored in a storage medium. That is, using the artificial neural network speaker characteristic change generation model, it is possible to adjust the speaker characteristic of the reference speaker according to a change in vocal characteristics. This artificial neural network speaker characteristic change generation model may be trained using Equation 1 below.
[Equation 1]

Here, r_ref may refer to the speaker characteristic of the reference speaker, and r_cmp may refer to the speaker characteristic of a comparison speaker. These speaker characteristics may be characteristics extracted through the trained artificial neural network speaker characteristic extraction model. Likewise, C_ref may refer to the vocal characteristic of the reference speaker, and C_cmp may refer to the vocal characteristic of the comparison speaker. These vocal characteristics may be characteristics extracted through a trained artificial neural network vocal characteristic extraction model. That is, the vocal characteristic change information determination module 430 may obtain r_ref, r_cmp, C_ref, and C_cmp from the storage medium and use them to train the artificial neural network speaker characteristic change generation model. In addition, the artificial neural network speaker characteristic change generation model may be trained based on the difference value, i.e., the loss, between the two quantities appearing in Equation 1.
Then, at inference time, the vocal characteristic change information determination module 430 may input the difference between the vocal characteristic of the reference speaker and the vocal characteristic of the comparison speaker, together with the speaker characteristic of the reference speaker, to the trained artificial neural network speaker characteristic change generation model to determine the vocal characteristic change information Δr. The speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker based on the determined vocal characteristic change information Δr and the speaker characteristic r of the reference speaker. This new speaker characteristic may be expressed as Equation 2 below.

[Equation 2]

r_new = r + Δr

Here, r_new may denote the speaker characteristic of the new speaker.
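As a minimal sketch of this inference path, assume a trained change generation model is available as a callable change_model(r_ref, delta_c) returning Δr; the linear placeholder below only stands in for such a model so that the composition of Equation 2 can be run end to end. All names and dimensions here are assumptions, not part of the disclosure.

```python
import numpy as np

DIM_SPEAKER = 256   # assumed dimensionality of a speaker characteristic vector
DIM_VOCAL = 5       # assumed dimensionality of a vocal characteristic vector

rng = np.random.default_rng(0)

# Placeholder for the trained speaker characteristic change generation model:
# it maps (reference speaker characteristic, vocal characteristic difference)
# to a speaker characteristic change. A real system would use a neural network.
W_r = 0.01 * rng.standard_normal((DIM_SPEAKER, DIM_SPEAKER))
W_c = 0.10 * rng.standard_normal((DIM_SPEAKER, DIM_VOCAL))

def change_model(r_ref: np.ndarray, delta_c: np.ndarray) -> np.ndarray:
    return np.tanh(W_r @ r_ref + W_c @ delta_c)

# Inference: the difference of vocal characteristics drives the speaker change.
r_ref = rng.standard_normal(DIM_SPEAKER)          # reference speaker characteristic
c_ref = np.array([1.0, 30.0, -1.0, 1.0, 1.0])     # reference vocal characteristic
c_cmp = np.array([1.0, 30.0, 1.0, 1.0, 1.0])      # desired vocal characteristic (higher tone)

delta_r = change_model(r_ref, c_cmp - c_ref)      # vocal characteristic change information
r_new = r_ref + delta_r                           # Equation 2: new speaker characteristic
print(r_new.shape)
```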
In an embodiment, the vocal characteristic change information determination module 430 may extract a normal vector for a target vocal characteristic using a vocal characteristic classification model corresponding to the target vocal characteristic. To this end, a vocal characteristic classification model corresponding to each of a plurality of vocal characteristics may be generated. Such a vocal characteristic classification model is a hyperplane-based model and may be implemented using, for example, a support vector machine (SVM) or a linear classifier, but is not limited thereto. In addition, the target vocal characteristic is a vocal characteristic selected from among the plurality of vocal characteristics and may refer to the vocal characteristic to be changed and reflected in the speaker characteristic of the new speaker. In addition, the speaker's characteristic may be expressed as a speaker vector.
When training such a vocal characteristic classification model, the vocal characteristic information included in each speaker's speaker characteristic is not used as an input; instead, separate vocal characteristic information may be obtained for each speaker. For example, the vocal characteristic information of a given speaker may be obtained through human tagging. As another example, the vocal characteristic information of a given speaker may be obtained through an artificial neural network vocal characteristic extraction model trained to infer a speaker's vocal characteristic from the speaker's speaker characteristic.
Such a vocal characteristic classification model may be trained through Equation 3 below.
[Equation 3]

Here, l_i may denote the label of the i-th vocal characteristic, w_i may denote the normal vector of the hyperplane that classifies the i-th vocal characteristic, and b may denote a bias.
Then, in order to generate the synthesized speech of the new speaker, the speaker characteristic determination module 410 may obtain, through the trained artificial neural network speaker characteristic extraction model, the speaker characteristic vector r of the reference speaker that is most similar to the new speaker. In addition, the vocal characteristic change information determination module 430 may obtain, from the trained vocal characteristic classification model, the normal vector of the target vocal characteristic and information indicating the degree to which the vocal characteristic is to be adjusted, as the vocal characteristic change information. Using the obtained speaker characteristic vector r of the reference speaker, the normal vector of the target vocal characteristic, and the degree of adjustment, the speaker characteristic r_new of the new speaker may be generated according to Equation 4 below.

[Equation 4]

r_new = r + α·n

Here, n may denote the normal vector of the target vocal characteristic, and α may denote the degree to which the vocal characteristic is adjusted.
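A minimal sketch of this hyperplane-based adjustment, assuming scikit-learn is available and using synthetic data: a linear SVM is fit on speaker vectors labeled with a binary target vocal characteristic, its weight vector is taken as the hyperplane normal n, and the reference speaker vector is shifted along n as in Equation 4. The data, dimensions, and the choice of LinearSVC are illustrative assumptions, not the disclosed training setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
DIM = 64                        # assumed speaker vector dimensionality

# Synthetic speaker vectors: speakers with the target vocal characteristic
# (label 1) are shifted along a hidden direction relative to those without it.
hidden_direction = rng.standard_normal(DIM)
hidden_direction /= np.linalg.norm(hidden_direction)
labels = rng.integers(0, 2, size=200)                        # 0 / 1 per speaker
speaker_vectors = rng.standard_normal((200, DIM)) + 2.0 * labels[:, None] * hidden_direction

# Fit a hyperplane separating the two groups (the vocal characteristic classifier).
clf = LinearSVC(C=1.0, max_iter=10000).fit(speaker_vectors, labels)
normal = clf.coef_[0] / np.linalg.norm(clf.coef_[0])         # unit normal vector n

# Equation 4: move the reference speaker along the normal by a chosen amount alpha.
r_ref = speaker_vectors[labels == 0][0]                      # a speaker without the characteristic
alpha = 1.5                                                  # degree of adjustment
r_new = r_ref + alpha * normal

# The decision value increases, i.e. the new speaker looks more like label 1.
print(clf.decision_function([r_ref, r_new]))
```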
In an embodiment, the speaker characteristic determination module 410 may obtain a plurality of speaker characteristics corresponding to a plurality of reference speakers. In addition, the vocal characteristic change information determination module 430 may obtain a weight set corresponding to the plurality of speaker characteristics and provide the obtained weight set to the speaker characteristic determination module 410. The speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker as in Equation 5 below by applying the weights included in the obtained weight set to each of the plurality of speaker characteristics. That is, the voices of several speakers may be mixed to generate the voice of a new speaker.
[Equation 5]

r_new = Σ_i w_i·r_i

Here, r_i may denote the speaker vector of the i-th speaker, and w_i may denote the weight for the i-th speaker. By applying the constraint Σ_i w_i = 1, the characteristic vectors of several speakers may be mixed into the characteristic vector of a new speaker.
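A short sketch of this weighted mixing under the Σ_i w_i = 1 constraint, with random vectors standing in for stored speaker characteristics; the dimensionality and the use of a softmax to enforce the constraint are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 256                                   # assumed speaker vector dimensionality

speaker_vectors = rng.standard_normal((4, DIM))   # four stored reference speakers

# Raw preference scores for each reference speaker; a softmax turns them into
# weights that are positive and sum to one (the sigma constraint).
scores = np.array([2.0, 0.5, 0.0, -1.0])
weights = np.exp(scores) / np.exp(scores).sum()

# Equation 5: the new speaker characteristic is the weighted mixture.
r_new = weights @ speaker_vectors
print(weights.sum(), r_new.shape)           # 1.0 (256,)
```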
According to an embodiment, the speaker characteristic determination module 410 may generate the characteristic vector of a new speaker through a pre-computed vocal characteristic axis adjustment scheme. For example, a speaker characteristic may include one or more vocal characteristics. The vocal characteristic change information determination module 430 may find a vocal characteristic axis and adjust it. The adjusted vocal characteristic axis may be provided to the speaker characteristic determination module 410 and used to determine the speaker characteristic of the new speaker. That is, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using the speaker characteristic r of the reference speaker, the vocal characteristic axes c_j, and the weights α_j of the vocal characteristic change information, as in Equation 6 below.
[Equation 6]

r_new = r + Σ_j α_j·c_j

Here, c_j may denote the j-th vocal characteristic axis, and α_j may denote the weight for the j-th vocal characteristic. In addition, C may denote a quantitatively expressed vocal characteristic, and c may denote one axis within the speaker characteristic space. For example, when C = {1, 30, -1, 1, 1}, the components of the vocal characteristic C may indicate female (C_gender = 1), age 30 (C_age = 30), a low tone (C_tone = -1), fast speech (C_rate = 1), and strong vocal intensity (C_strength = 1). In addition, c_j may denote one axis in the vocal characteristic space for distinguishing the individual vocal characteristic C_j, and c_j may have the same dimensionality as the speaker representation r.
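The following sketch applies Equation 6 with pre-computed vocal characteristic axes; the axes here are random unit vectors used purely as placeholders for axes obtained by the procedures described below (differences of speakers, differences of group means, or principal components), and the weights are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(3)
DIM = 256                                        # assumed speaker vector dimensionality

r = rng.standard_normal(DIM)                     # reference speaker characteristic

# Placeholder vocal characteristic axes c_j (e.g. tone, speech rate, intensity),
# each with the same dimensionality as the speaker representation.
axes = rng.standard_normal((3, DIM))
axes /= np.linalg.norm(axes, axis=1, keepdims=True)

# Example weights alpha_j: raise the tone a little, speak a bit faster,
# leave intensity unchanged.
alpha = np.array([0.8, 0.3, 0.0])

# Equation 6: r_new = r + sum_j alpha_j * c_j
r_new = r + alpha @ axes
print(r_new.shape)
```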
In order to obtain the vocal characteristic axes c_j as the vocal characteristic change information, the vocal characteristic change information determination module 430 may normalize each of the speaker vectors of the plurality of speakers. In this case, the speaker vectors of the plurality of speakers may be included in the speaker characteristics of the plurality of speakers. For example, the vocal characteristic change information determination module 430 may perform normalization on the speaker vectors of all speakers, R = {r_i | i = 0, ..., N_spk - 1}, where N_spk is the number of speakers. For example, the vocal characteristic change information determination module 430 may perform Z-normalization, which subtracts the mean from the whole data and divides by the standard deviation, or a normalization that only subtracts the mean from the whole data.
[Equation 7]

Here, N(·) denotes the normalization function, and D(·) denotes the inverse normalization function.
Then, the vocal characteristic change information determination module 430 may determine a plurality of principal components by performing a dimensionality reduction analysis on the normalized speaker vectors of the plurality of speakers. Here, the dimensionality reduction analysis may be performed through a conventionally known dimensionality reduction technique such as PCA (Principal Component Analysis), SVD (Singular Value Decomposition), or t-SNE (t-distributed Stochastic Neighbor Embedding). For example, the vocal characteristic change information determination module 430 may determine the plurality of principal components P of Equation 8 below by performing PCA on N(R).
[Equation 8]

P = PCA(N(R)) = {p_k | k = 0, ..., M-1}

Here, p_k may refer to the k-th principal component, and M may refer to the dimensionality of the speaker representation r.
Then, a speech generated using the new speaker characteristic r_new of Equation 9 below may be listened to and evaluated by a person, and a vocal characteristic label may be assigned. The vocal characteristic change information determination module 430 may select at least one principal component from among the determined plurality of principal components. For example, a principal component associated with the vocal characteristic desired to be changed in the speaker characteristic of the new speaker may be selected.

[Equation 9]

Here, r_new may denote the characteristic of the new speaker, p_k may denote the k-th principal component, and p_sel may denote the selected principal component.
That is, the j-th vocal characteristic axis c_j may be determined through the selected principal component p_sel and the inverse normalization function D. The j-th vocal characteristic axis and the weight corresponding to it may be provided to the speaker characteristic determination module 410, so that the speaker characteristic of the new speaker may be generated through Equation 6 above.
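A compact sketch of this normalize, PCA, select-component, denormalize pipeline, assuming plain NumPy: Z-normalization stands in for N(·)/D(·), the principal components are obtained with an SVD, and one selected component is added to a reference speaker before mapping back. The matrix shapes and the scaling factor are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N_SPK, DIM = 500, 64                        # assumed number of speakers and vector size

R = rng.standard_normal((N_SPK, DIM))       # speaker vectors of all speakers

# N(.) : Z-normalization over the whole speaker set; D(.) is its inverse.
mu, sigma = R.mean(axis=0), R.std(axis=0)

def normalize(x):
    return (x - mu) / sigma

def denormalize(x):
    return x * sigma + mu

# PCA on N(R) via SVD: the rows of Vt are the principal components p_k.
_, _, Vt = np.linalg.svd(normalize(R), full_matrices=False)
principal_components = Vt                   # shape (DIM, DIM); p_k = Vt[k]

# Select the component associated with the vocal characteristic to change
# (in practice chosen by listening to speech generated along each component).
p_sel = principal_components[0]
beta = 2.0                                  # how far to move along the component

r_ref = R[0]                                # reference speaker
r_new = denormalize(normalize(r_ref) + beta * p_sel)
print(r_new.shape)
```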
Additionally or alternatively, the vocal characteristic change information determination module 430 may use, instead of the axis c_j used in Equation 6, an axis c'_j obtained through Equation 10, so that interference between the vocal characteristic axes can be removed.
[Equation 10]

Here, c'_j may refer to a vocal characteristic axis obtained by changing some vocal characteristics of c_j. In addition, c_j may denote one axis in the vocal characteristic space for distinguishing the individual vocal characteristic C_j.
The vocal characteristic change information determination module 430 may obtain the speaker vectors of a plurality of speakers whose target vocal characteristics differ. In this case, the speaker vectors of the plurality of training speakers may be included in the speaker characteristics of the plurality of training speakers. In addition, each of the plurality of speakers is assigned a label for one or more vocal characteristics. In an embodiment, a vocal characteristic label l may be assigned to each of the plurality of speakers. Here, the vocal characteristics may include tone, vocal intensity, speech rate, gender, and age. Tone, vocal intensity, and speech rate may each be expressed as an element of the label l, gender may likewise be expressed as an element of the label, and age may be expressed as a numeric element. For example, a label may indicate the vocal characteristics of a 50-year-old male whose tone is low, vocal intensity is medium, and speech rate is high.
[Equation 11]

c_j = r_a - r_b

Then, as in Equation 11 above, the vocal characteristic change information determination module 430 may determine the vocal characteristic axis c_j based on the difference between the speaker vectors r_a and r_b of two speakers whose target vocal characteristics differ. Here, the vocal characteristic axis c_j may be included in the vocal characteristic change information. This vocal characteristic change information may be provided to the speaker characteristic determination module 410 so that the speaker characteristic of the new speaker can be determined using Equation 6 above.
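A minimal sketch of this pairwise construction, assuming two synthetic speaker vectors that differ only along a hidden "tone" direction; the axis is their difference as in Equation 11 and is then applied through the form of Equation 6. Dimensions and scaling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
DIM = 128                                    # assumed speaker vector dimensionality

tone_direction = rng.standard_normal(DIM)    # hidden direction (unknown to the method)

base = rng.standard_normal(DIM)
r_a = base + 1.0 * tone_direction            # speaker labeled "high tone"
r_b = base - 1.0 * tone_direction            # matched speaker labeled "low tone"

# Equation 11: the vocal characteristic axis is the difference of the two vectors.
c_tone = r_a - r_b

# Equation 6 with a single axis: push a reference speaker toward a higher tone.
r_ref = rng.standard_normal(DIM)
r_new = r_ref + 0.5 * c_tone
print(np.dot(r_new - r_ref, tone_direction) > 0)   # True: moved along the tone direction
```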
In another embodiment, the vocal characteristic change information determination module 430 may determine the vocal characteristic change information based on the difference between the average speaker vectors of a plurality of speaker groups. As described in connection with Equation 11 above, the speaker characteristics of the plurality of speakers include the speaker vectors of the plurality of speakers, and each of the speaker characteristics of the plurality of speakers is assigned a label for one or more vocal characteristics.
The vocal characteristic change information determination module 430 may obtain the speaker vectors of the speakers included in each of a plurality of speaker groups whose target vocal characteristics differ. Here, the groups of the plurality of training speakers may include a first speaker group and a second speaker group.
Then, the vocal characteristic change information determination module 430 may calculate the average of the speaker vectors of the speakers included in the first speaker group and the average of the speaker vectors of the speakers included in the second speaker group, and may determine the vocal characteristic axis c_j based on the difference between the average speaker vector corresponding to the first speaker group and the average speaker vector corresponding to the second speaker group, as in Equation 12. The determined vocal characteristic axis c_j may be included in the vocal characteristic change information.
[Equation 12]

c_j = mean(R_1) - mean(R_2)

Here, R_1 and R_2 may denote the sets of speaker vectors of the first speaker group and the second speaker group, respectively.
Then, this vocal characteristic change information may be provided to the speaker characteristic determination module 410 so that the speaker characteristic of the new speaker can be determined using Equation 6 above.
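The group-mean variant can be sketched in a few lines of NumPy under the same assumptions as above: two labeled groups of synthetic speaker vectors give an axis as the difference of their means (Equation 12), which is then applied as in Equation 6.

```python
import numpy as np

rng = np.random.default_rng(6)
DIM = 128                                     # assumed speaker vector dimensionality

strength_direction = rng.standard_normal(DIM) # hidden "vocal intensity" direction

# Two groups of speakers labeled strong (group 1) and soft (group 2).
group_1 = rng.standard_normal((40, DIM)) + 1.0 * strength_direction
group_2 = rng.standard_normal((60, DIM)) - 1.0 * strength_direction

# Equation 12: the axis is the difference of the group means.
c_strength = group_1.mean(axis=0) - group_2.mean(axis=0)

# Equation 6 with a single axis: make a reference speaker sound a bit softer.
r_ref = rng.standard_normal(DIM)
r_new = r_ref - 0.3 * c_strength
print(r_new.shape)
```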
According to an embodiment, the vocal characteristic change information determination module 430 may input the speaker characteristics r_i of a plurality of speakers to an artificial neural network vocal characteristic prediction model F, as in Equation 13 below, and output the vocal characteristic C_i of each of the plurality of speakers. Here, the vocal characteristic change information determination module 430 may select or determine, from among the speaker characteristics of the plurality of speakers, a speaker characteristic r_sel for which a difference from the reference speaker exists in the j-th vocal characteristic among the output vocal characteristics, and no difference exists in the remaining vocal characteristics. This speaker characteristic r_sel may be provided to the speaker characteristic determination module 410.

[Equation 13]
In addition, the speaker characteristic determination module 410 may obtain a weight corresponding to the speaker characteristic of the selected speaker. Then, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and the weight corresponding to the speaker characteristic of the selected speaker. For example, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using Equation 14 below.
[Equation 14]

Here, r_new is the speaker characteristic of the new speaker, r may refer to the speaker characteristic of the reference speaker, r_sel may refer to the speaker characteristic of the selected speaker, and w may refer to the weight corresponding to the speaker characteristic of the selected speaker.
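Below is a sketch of this selection-plus-weighting approach. The vocal characteristic prediction model F is replaced by a hypothetical rule-based stub, and since Equation 14 is specified here only through its inputs (r, r_sel, and w), the combination step uses an assumed weighted interpolation purely for illustration; it is not presented as the disclosed formula.

```python
import numpy as np

rng = np.random.default_rng(7)
N_SPK, DIM, N_VOCAL = 50, 64, 5        # assumed sizes: speakers, vector dim, vocal features

speaker_vectors = rng.standard_normal((N_SPK, DIM))

# Hypothetical stand-in for the vocal characteristic prediction model F:
# it projects each speaker vector onto fixed directions and keeps only the sign,
# giving coarse per-feature labels.
projection = rng.standard_normal((DIM, N_VOCAL))

def predict_vocal(r: np.ndarray) -> np.ndarray:
    return np.sign(r @ projection)

vocal = np.array([predict_vocal(r) for r in speaker_vectors])

r_ref = speaker_vectors[0]
c_ref = predict_vocal(r_ref)
j = 2                                   # index of the target vocal characteristic

# Select a speaker whose predicted vocal characteristics differ from the
# reference only in the j-th (target) characteristic.
differs_in_j = vocal[:, j] != c_ref[j]
same_elsewhere = (np.delete(vocal, j, axis=1) == np.delete(c_ref, j)).all(axis=1)
candidates = np.where(differs_in_j & same_elsewhere)[0]

if candidates.size:
    r_sel = speaker_vectors[candidates[0]]
    w = 0.5                             # weight for the selected speaker
    r_new = (1.0 - w) * r_ref + w * r_sel   # assumed combination, for illustration only
    print("selected speaker:", int(candidates[0]), "new vector shape:", r_new.shape)
else:
    print("no speaker differs only in the target vocal characteristic")
```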
The output speech verification module 440 may determine whether the output speech associated with the speaker characteristic of the new speaker is a new output speech that has not been previously stored. According to an embodiment, the output speech verification module 440 may calculate a hash value corresponding to the speaker characteristic (e.g., the speaker characteristic vector) of the new speaker using a hash function. In another embodiment, instead of calculating the hash value from the previously determined speaker characteristic of the new speaker, the output speech verification module 440 may extract the speaker characteristic from the new output speech and calculate the hash value using the extracted speaker characteristic.
Then, the output speech verification module 440 may determine whether, among the contents of the plurality of speakers stored in the storage medium, there is content associated with a hash value similar to the calculated hash value. If there is no content associated with a hash value similar to the calculated hash value, the output speech verification module 440 may determine that the output speech associated with the speaker characteristic of the new speaker is a new output speech. When it is determined to be a new output speech in this way, the synthesized speech in which the speaker characteristic of the new speaker is reflected may be set to be used.
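The disclosure does not fix a particular hash function, so the sketch below assumes a similarity-preserving scheme (random-hyperplane SimHash) so that "similar hash values" can be compared by Hamming distance; the signature length, threshold, and storage layout are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
DIM, N_BITS = 256, 64                         # assumed vector size and signature length

# Random hyperplanes shared by all hashes (SimHash-style locality-sensitive hash).
planes = rng.standard_normal((N_BITS, DIM))

def speaker_hash(r: np.ndarray) -> np.ndarray:
    """Binary signature of a speaker characteristic vector."""
    return (planes @ r > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

# Hashes of speaker characteristics already associated with stored contents.
stored = [speaker_hash(rng.standard_normal(DIM)) for _ in range(100)]

r_new = rng.standard_normal(DIM)
h_new = speaker_hash(r_new)

# The output speech is treated as new only if no stored hash is similar enough.
THRESHOLD = 10                                # max differing bits to count as "similar"
is_new = all(hamming(h_new, h) > THRESHOLD for h in stored)
print("new output speech:", is_new)
```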
FIG. 5 is a flowchart illustrating a method 500 of generating an output speech in which the speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure. In an embodiment, the method 500 may be performed by a processor (e.g., the processor 314 of the user terminal 210 and/or the processor 334 of the synthesized speech generation system 230). As illustrated, the method 500 may be initiated by the processor receiving a target text (S510).
Then, the processor may obtain the speaker characteristic of a reference speaker corresponding to the reference speaker (S520). In an embodiment, the speaker characteristic of the reference speaker may include a speaker vector. Additionally or alternatively, the speaker characteristic of the reference speaker may include the vocal characteristic of the reference speaker. According to another embodiment, the speaker characteristic of the reference speaker may include a plurality of speaker characteristics corresponding to a plurality of reference speakers. Here, the plurality of speaker characteristics may include a plurality of speaker vectors.

Then, the processor may obtain vocal characteristic change information (S530). To this end, the processor may obtain the speaker characteristics of a plurality of speakers. Here, the speaker characteristics of the plurality of speakers may include a plurality of speaker vectors.

According to an embodiment, the processor may perform normalization on each of the speaker vectors of the plurality of speakers and perform a dimensionality reduction analysis on the normalized speaker vectors, thereby determining a plurality of principal components. At least one principal component may be selected from among the plurality of principal components determined in this way. Then, the processor may determine the vocal characteristic change information using the selected principal component.

According to another embodiment, the processor may obtain the speaker vectors of a plurality of speakers whose target vocal characteristics differ and determine the vocal characteristic change information based on the difference between the obtained speaker vectors. According to yet another embodiment, the processor may obtain the speaker vectors of the speakers included in each of a plurality of speaker groups whose target vocal characteristics differ. Here, the plurality of speaker groups may include a first speaker group and a second speaker group. Then, the processor may calculate the average of the speaker vectors of the speakers included in the first speaker group and the average of the speaker vectors of the speakers included in the second speaker group. The processor may determine the vocal characteristic change information based on the difference between the average speaker vector corresponding to the first speaker group and the average speaker vector corresponding to the second speaker group.

In yet another embodiment, the processor may input the speaker characteristics of the plurality of speakers to an artificial neural network vocal characteristic prediction model and output the vocal characteristic of each of the plurality of speakers. Then, the processor may select, from among the speaker characteristics of the plurality of speakers, the speaker characteristic of a speaker for which a difference exists between the target vocal characteristic among that speaker's output vocal characteristics and the target vocal characteristic among the plurality of vocal characteristics of the reference speaker, and may obtain a weight corresponding to the speaker characteristic of the selected speaker. Here, the speaker characteristic of the selected speaker and the weight corresponding to it may be obtained as the vocal characteristic change information.

According to yet another embodiment, the processor may extract a normal vector for the target vocal characteristic using a vocal characteristic classification model corresponding to the target vocal characteristic. Here, the normal vector may refer to the normal vector of the hyperplane that classifies the target vocal characteristic, and the processor may obtain information indicating the degree to which the target vocal characteristic is to be adjusted. The extracted normal vector and the information indicating the degree of adjustment of the target vocal characteristic may be obtained as the vocal characteristic change information.
Then, the processor may determine the speaker characteristic of the new speaker using the obtained speaker characteristic of the reference speaker and the obtained vocal characteristic change information (S540).

According to an embodiment, the processor may input the speaker characteristic of the reference speaker and the obtained vocal characteristic change information to an artificial neural network speaker characteristic change generation model to generate a speaker characteristic change, and may output the speaker characteristic of the new speaker by combining the speaker characteristic of the reference speaker with the generated speaker characteristic change. Here, the artificial neural network speaker characteristic change generation model may be trained using the speaker characteristics of a plurality of training speakers and a plurality of vocal characteristics included in those speaker characteristics.

In another embodiment, the processor may determine the speaker characteristic of the new speaker by applying the weights included in the obtained weight set to each of the plurality of speaker characteristics. In yet another embodiment, the processor may determine the characteristic of the new speaker using the speaker characteristic of the reference speaker, the vocal characteristic change information, and the weights of the vocal characteristic change information. According to yet another embodiment, the processor may determine the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and the weight corresponding to the speaker characteristic of the selected speaker. According to yet another embodiment, the processor may determine the speaker characteristic of the new speaker based on the speaker vector of the reference speaker, the extracted normal vector, and the degree to which the target vocal characteristic is to be adjusted.

Then, the processor may input the target text and the determined speaker characteristic of the new speaker to an artificial neural network text-to-speech synthesis model to generate an output speech for the target text in which the determined speaker characteristic of the new speaker is reflected (S550). Here, the artificial neural network text-to-speech synthesis model may include a model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output speech for the plurality of training text items in which the speaker characteristics of the training speakers are reflected.

According to an embodiment, the processor may calculate a hash value corresponding to the speaker characteristic vector using a hash function. Here, the speaker characteristic vector may be included in the speaker characteristic of the new speaker. Then, the processor may determine whether, among the contents of the plurality of speakers stored in the storage medium, there is content associated with a hash value similar to the calculated hash value. If there is no content associated with a similar hash value, the processor may determine that the output speech associated with the speaker characteristic of the new speaker is a new output speech.
According to an embodiment, a speech synthesizer trained using training data including the synthesized speech of a new speaker generated according to the above-described method may be provided. Here, the speech synthesizer may be any speech synthesizer that can be trained using training data including the synthesized speech of the new speaker generated according to the above-described method. For example, the speech synthesizer may include any text-to-speech (TTS) model trained using such training data. Here, the TTS model may be implemented as a machine learning model or an artificial neural network model known in the art.

Since such a speech synthesizer is trained with the synthesized speech of the new speaker as training data, when a target text is input, the target text may be output as the synthesized speech of the new speaker. According to an embodiment, such a speech synthesizer may be included in or implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2.

According to an embodiment, an apparatus for providing synthesized speech may be provided, including a memory configured to store the synthesized speech of a new speaker generated according to the above-described method, and at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory, wherein the at least one program includes instructions for outputting at least a part of the synthesized speech of the new speaker stored in the memory. For example, such an apparatus for providing synthesized speech may refer to any apparatus that stores a previously generated synthesized speech of a new speaker and provides at least a part of the stored synthesized speech.

According to an embodiment, such an apparatus for providing synthesized speech may be implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2. Specifically, the apparatus for providing synthesized speech may be implemented as, but is not limited to, a video system, an ARS system, a game system, a sound pen, or the like. For example, when such an apparatus is provided in the information processing system 230, at least a part of the output synthesized speech of the new speaker may be provided to a user terminal device connected to the information processing system 230 by wire or wirelessly. Specifically, the information processing system 230 may provide at least a part of the output synthesized speech of the new speaker in a streaming manner.

According to an embodiment, a method of providing the synthesized speech of a new speaker may be provided, including storing the synthesized speech of the new speaker generated according to the above-described method and providing at least a part of the stored synthesized speech. This method may be executed by the processor of the user terminal 210 and/or the processor of the information processing system 230 of FIG. 2. This method may be provided for a service that provides the synthesized speech of a new speaker. For example, such a service may be implemented as, but is not limited to, a video system, an ARS system, a game system, a sound pen, or the like.
FIG. 6 is a diagram illustrating an example of generating an output speech in which the speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure. In an embodiment, the artificial neural network text-to-speech synthesis model may include an encoder 610, an attention 620, and a decoder 630.
The encoder 610 may receive the target text 640. The encoder 610 may be configured to generate pronunciation information for the input target text 640 (e.g., phoneme information for the target text, a vector for each of a plurality of phonemes included in the target text, etc.). In an embodiment, the encoder 610 may convert the target text 640 into character embeddings. For example, in the encoder 610, the generated character embeddings may be passed through a pre-net including a fully-connected layer. In addition, the encoder 610 may provide the output of the pre-net to a CBHG module to output encoder hidden states. For example, the CBHG module may include a 1D convolution bank, max pooling, a highway network, and a bidirectional gated recurrent unit (GRU). The pronunciation information generated by the encoder 610 may be provided to the attention 620, and the attention 620 may connect or combine the provided pronunciation information with speech data corresponding to the pronunciation information. For example, the attention 620 may be configured to determine from which part of the input text to generate speech.
The connected pronunciation information and the voice data corresponding to the pronunciation information may be provided to the decoder 630. The decoder 630 may be configured to generate voice data 660 corresponding to the target text 640 based on the connected pronunciation information and the corresponding voice data.
According to an embodiment, the decoder 630 may receive the speaker characteristic of the new speaker (658) and generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected. Here, the speaker characteristic of the new speaker (658) may be generated through the vocal characteristic change module 656. For example, the vocal characteristic change module 656 may be implemented using the algorithm and/or the artificial neural network model described with reference to FIG. 4.
According to an embodiment, the artificial neural network speaker feature extraction model 650 may obtain the speaker characteristic r of the reference speaker based on speaker identification information i (for example, a speaker one-hot vector) 652 and the speaker's vocal characteristic C 654. Here, the vocal characteristic C 654 and the speaker characteristic r may be expressed in vector form. The artificial neural network speaker feature extraction model 650 may be trained by receiving a plurality of training speaker IDs and a plurality of training vocal characteristics (e.g., vectors) to extract the ground-truth speaker vector of the corresponding reference speaker. Using the reference speaker characteristic r generated in this way and the input information d 655 associated with the vocal characteristic change information, the vocal characteristic change information is determined through the vocal characteristic change module 656, and the speaker characteristic of the new speaker (658) is then determined. The input information d 655 associated with the vocal characteristic change information may include any information that is to be reflected in, or changed for, the new speaker.
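For illustration only, one way a vocal characteristic change module could map the reference speaker characteristic r and the change information d to a new speaker characteristic is sketched below. The additive residual form and the dimensions are assumptions; FIG. 4 of the description defines the actual alternatives (weighted reference speakers, principal-component directions, hyperplane normals, etc.).

```python
import torch
import torch.nn as nn

class VocalCharacteristicChangeModule(nn.Module):
    """Hypothetical neural variant: a small MLP predicts a change of the speaker vector,
    which is added to the reference speaker characteristic r. Sizes are illustrative."""
    def __init__(self, speaker_dim=256, change_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speaker_dim + change_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, speaker_dim),
        )

    def forward(self, r, d):
        delta = self.net(torch.cat([r, d], dim=-1))  # predicted speaker characteristic change
        return r + delta                             # speaker characteristic of the new speaker

r = torch.randn(1, 256)   # reference speaker characteristic (vector form)
d = torch.randn(1, 16)    # input information associated with the desired change
new_speaker = VocalCharacteristicChangeModule()(r, d)
```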
In an embodiment, the decoder 630 may include a pre-net composed of fully-connected layers, an attention recurrent neural network (RNN) including a gated recurrent unit (GRU), and a decoder RNN including a residual GRU. The voice data 660 output from the decoder 630 may be expressed as a mel-scale spectrogram. In this case, the output of the decoder 630 may be provided to a post-processing processor (not shown). The CBHG of the post-processing processor may be configured to convert the mel-scale spectrogram of the decoder 630 into a linear-scale spectrogram. For example, the output signal of the CBHG of the post-processing processor may include a magnitude spectrogram. The phase of the output signal of the CBHG of the post-processing processor may be restored through the Griffin-Lim algorithm and subjected to an inverse short-time Fourier transform, and the post-processing processor may then output a voice signal in the time domain. As another example, the post-processing processor may be implemented using a GAN-based vocoder.
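For illustration only, the post-processing path (mel-scale spectrogram to linear-scale magnitude spectrogram, then Griffin-Lim phase restoration and inverse STFT to a time-domain signal) can be approximated with librosa as in the sketch below. The STFT parameters are assumptions, and the mel-to-linear step here uses librosa's pseudo-inverse mel filter bank rather than the trained post-processing CBHG described above.

```python
import librosa

def mel_to_waveform(mel_spec, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
    """mel_spec: (n_mels, frames) magnitude mel-scale spectrogram."""
    linear_mag = librosa.feature.inverse.mel_to_stft(
        mel_spec, sr=sr, n_fft=n_fft, power=1.0
    )  # approximate linear-scale magnitude spectrogram
    # Griffin-Lim restores the phase; the inverse short-time Fourier transform is applied internally.
    return librosa.griffinlim(
        linear_mag, n_iter=n_iter, hop_length=hop_length, win_length=n_fft
    )  # time-domain voice signal
```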
To generate or train such an artificial neural network text-to-speech synthesis model, the processor may use a database including training text items, speaker characteristics of a plurality of training speakers, and training voice data items that correspond to the training text items and reflect those speaker characteristics. Based on a training text item, the speaker characteristic of a training speaker, and the training voice data item corresponding to the training text item, the processor may train the artificial neural network text-to-speech synthesis model to output a synthesized voice in which the speaker characteristic of the training speaker is reflected.
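For illustration only, one training step of this kind of model could look like the sketch below: the database supplies (training text item, speaker characteristic, target speech) triples, and the model is optimized to reproduce the target speech conditioned on the speaker characteristic. The L1 spectrogram loss and the `tts_model(text_ids, speaker_feature)` interface are assumptions, not the disclosed training objective.

```python
import torch
import torch.nn.functional as F

def train_step(tts_model, optimizer, text_ids, speaker_feature, target_mel):
    """One optimization step.

    text_ids:        (batch, text_len)        training text item as symbol ids
    speaker_feature: (batch, spk_dim)          speaker characteristic of the training speaker
    target_mel:      (batch, frames, n_mels)   training voice data item for that text and speaker
    """
    optimizer.zero_grad()
    predicted_mel = tts_model(text_ids, speaker_feature)  # synthesized speech features
    loss = F.l1_loss(predicted_mel, target_mel)           # reconstruction loss
    loss.backward()
    optimizer.step()
    return loss.item()
```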
Through the artificial neural network text-to-speech synthesis model generated/trained in this way, the processor may generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected. In an embodiment, the processor may generate a synthesized voice based on the voice data 660 output when the target text 640 and the speaker characteristic of the new speaker (658) are input to the artificial neural network text-to-speech synthesis model. The synthesized voice generated in this way may include a voice uttering the target text 640 in which the input speaker characteristic of the new speaker (658) is reflected.
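At inference time the same model is conditioned on the newly determined speaker characteristic instead of a training speaker's characteristic. A short sketch, with the tokenized text, model, and vocoder interfaces assumed:

```python
import torch

@torch.no_grad()
def synthesize(tts_model, vocoder, text_ids, new_speaker_feature):
    mel = tts_model(text_ids, new_speaker_feature)  # output voice data reflecting the new speaker
    return vocoder(mel)                             # e.g., Griffin-Lim post-processing or a GAN-based vocoder
```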
Although FIG. 6 illustrates the attention 620 and the decoder 630 as separate components, the present disclosure is not limited thereto. For example, the decoder 630 may include the attention 620. In addition, although FIG. 6 shows the speaker characteristic of the new speaker (658) being input to the decoder 630, the present disclosure is not limited thereto. For example, the speaker characteristic of the new speaker (658) may be input to the encoder 610 and/or the attention 620.
FIG. 7 is a diagram illustrating an example of generating an output voice in which the speaker characteristic of a new speaker is reflected, according to another embodiment of the present disclosure. The encoder 710, the attention 720, and the decoder 730 illustrated in FIG. 7 may perform functions similar to those of the encoder 610, the attention 620, and the decoder 630 illustrated in FIG. 6, respectively. Accordingly, descriptions that overlap with FIG. 6 are omitted.
In an embodiment, the encoder 710 may receive the target text 740 as input and may be configured to generate pronunciation information for the input target text 740 (for example, phoneme information for the target text, a vector for each of a plurality of phonemes included in the target text, etc.). The pronunciation information generated by the encoder 710 may be provided to the attention 720, and the attention 720 may connect the pronunciation information with voice data corresponding to the pronunciation information. The connected pronunciation information and the corresponding voice data may be provided to the decoder 730, which may be configured to generate voice data 760 corresponding to the target text 740 based on them.
In an embodiment, the decoder 730 may receive the speaker characteristic of the new speaker (758) and generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected. Here, the speaker characteristic of the new speaker (758) may be generated through the vocal characteristic change module 756. For example, the vocal characteristic change module 756 may be implemented using the algorithm and/or the artificial neural network model described with reference to FIG. 4.
According to an embodiment, the artificial neural network speaker feature extraction model 750 may output speaker identification information i 753 based on speech recorded by the speaker 752 and a vocal feature set C 754, and may also obtain the speaker characteristic r of the reference speaker. Here, the vocal feature set may include one or more vocal features c, and both the vocal feature set C 754 and the speaker characteristic r may be expressed in vector form. In addition, the artificial neural network speaker feature extraction model may be trained by receiving speech recorded by a plurality of training speakers and a plurality of training vocal features (e.g., vectors) to extract the ground-truth speaker vector of the corresponding reference speaker. The vocal characteristic change module 756 determines the vocal characteristic change information using the reference speaker characteristic r generated in this way and the input information d 755 associated with the vocal characteristic change information, and then determines the speaker characteristic of the new speaker (758). The input information d 755 associated with the vocal characteristic change information may include any information that is to be reflected in, or changed for, the new speaker.
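FIG. 7 differs from FIG. 6 in that the speaker feature extraction starts from recorded speech rather than a speaker ID. For illustration only, a minimal sketch of such an extractor is given below: a d-vector-style encoder that averages frame-level states into a single speaker vector. The mel front end, the recurrent layer size, and the way the vocal feature set C is concatenated are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerFeatureExtractor(nn.Module):
    """Maps recorded speech (as a mel spectrogram) plus a vocal feature set C to a speaker vector r."""
    def __init__(self, n_mels=80, vocal_dim=16, hidden=256, speaker_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden + vocal_dim, speaker_dim)

    def forward(self, mel, vocal_features):
        # mel: (batch, frames, n_mels); vocal_features: (batch, vocal_dim)
        states, _ = self.rnn(mel)
        utterance = states.mean(dim=1)  # average over time -> utterance-level embedding
        return self.proj(torch.cat([utterance, vocal_features], dim=-1))  # speaker characteristic r
```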
To generate or train such an artificial neural network text-to-speech synthesis model, the processor may use a database including pairs of training text items and training voice data items that correspond to the training text items and reflect the speaker characteristics of a plurality of training speakers. Based on the speaker characteristic of a training speaker and the training voice data item corresponding to a training text item, the processor may train the artificial neural network text-to-speech synthesis model to output a synthesized voice 760 in which the speaker characteristic of the new speaker is reflected.
Through the artificial neural network text-to-speech synthesis model generated/trained in this way, the processor may generate the output voice 760 in which the speaker characteristic of the new speaker is reflected. In an embodiment, the processor may generate a synthesized voice based on the voice data 760 output when the target text 740 and the speaker characteristic of the new speaker (758) are input to the artificial neural network text-to-speech synthesis model. The synthesized voice generated in this way may include a voice uttering the target text 740 in accordance with the input speaker characteristic of the new speaker (758).
Although FIG. 7 illustrates the attention 720 and the decoder 730 as separate components, the present disclosure is not limited thereto. For example, the decoder 730 may include the attention 720. In addition, although FIG. 7 shows the speaker characteristic of the new speaker being input to the decoder 730, the present disclosure is not limited thereto. For example, the speaker characteristic of the new speaker may be input to the encoder 710 and/or the attention 720.
In FIGS. 6 and 7, the target text is illustrated as being expressed as a single input data item (for example, a vector), and a single output data item (for example, a mel-scale spectrogram) is shown being output through the artificial neural network text-to-speech synthesis model. However, the present disclosure is not limited thereto, and any number of input data items may be input to the artificial neural network text-to-speech synthesis model to output any number of output data items.
FIG. 8 is an exemplary diagram illustrating a user interface 800 for generating an output voice in which the speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure. A user terminal (e.g., the user terminal 210) may output a synthesized voice reflecting the speaker characteristic of the new speaker through the user interface 800. The user interface 800 may include a text area 810, a vocal characteristic adjustment area 820, a speaker characteristic adjustment area 830, and an output voice display area 840. In the following, the processor may be the processor 314 of the user terminal 210 and/or the processor 334 of the information processing system 230.
The processor may receive the target text through a user input using an input interface (for example, a keyboard, a mouse, a microphone, etc.) and display the received target text in the text area 810. Alternatively, the processor may receive a document file containing text, extract the text from the document file, and display the extracted text in the text area 810. The text displayed in the text area 810 may then be the target to be uttered through the synthesized voice.
One or more reference speakers may be selected in response to a user input selecting one or more of the reference speakers displayed in the speaker characteristic adjustment area 830. The processor may then receive weights (e.g., reflection ratios) for the speaker characteristics of the selected reference speakers as vocal characteristic change information. For example, through input in the speaker characteristic adjustment area 830, the processor may receive a weight for each of the speaker characteristics of the one or more reference speakers in Equation 5 described with reference to FIG. 4. As illustrated, six reference speakers, 'Eun-Byul Ko', 'Soo-Min Kim', 'Woo-Rim Lee', 'Do-Young Song', 'Seong-Soo Shin', and 'Jin-Kyung Shin', may be presented in the speaker characteristic adjustment area 830. That is, the user selects one or more of the six reference speakers and adjusts the reflection-ratio control (e.g., a bar) corresponding to each selected reference speaker, thereby determining the ratio at which the speaker characteristic of each selected reference speaker is reflected in the speaker characteristic of the new speaker. Alternatively, one or more of the six reference speakers may be selected at random.
The reflection ratios for the selected reference speakers may be received such that their total is 100. Alternatively, even if the reflection ratios entered for the selected reference speakers sum to more or less than 100, each ratio may be automatically adjusted so that the total becomes 100. Although six reference speakers are used in FIG. 8 to generate the speaker characteristic of the new speaker, the present disclosure is not limited thereto, and five or fewer reference speakers or seven or more reference speakers may be displayed in the speaker characteristic adjustment area 830 and used to generate the speaker characteristic of the new speaker.
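For illustration only, the behavior just described, rescaling the reflection ratios so they sum to 100 and then blending the selected reference speakers' characteristic vectors, can be sketched as follows. Blending by a simple weighted average is an assumption here; the actual combination is defined by Equation 5 in the description of FIG. 4.

```python
import numpy as np

def blend_reference_speakers(speaker_vectors, ratios):
    """speaker_vectors: list of (dim,) arrays for the selected reference speakers.
    ratios: user-entered reflection ratios; automatically rescaled so the total is 100."""
    ratios = np.asarray(ratios, dtype=float)
    ratios = 100.0 * ratios / ratios.sum()      # auto-adjust so the total becomes 100
    weights = ratios / 100.0
    return np.tensordot(weights, np.stack(speaker_vectors), axes=1)  # blended speaker characteristic

# e.g., blending two of the displayed reference speakers entered at 70 : 50 (rescaled to about 58 : 42)
r_base = blend_reference_speakers([np.random.rand(256), np.random.rand(256)], [70, 50])
```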
The processor may receive, through the vocal characteristic adjustment area 820, a weight (e.g., a reflection ratio) for each of a plurality of vocal characteristics as vocal characteristic change information. According to an embodiment, through input in the vocal characteristic adjustment area 820, the processor may receive a weight for each of the plurality of vocal characteristics in Equation 6 described with reference to FIG. 4. Here, r in Equation 6 may be the reference speaker characteristic generated according to the selection of one or more reference speakers and their reflection ratios in the speaker characteristic adjustment area 830; for example, r may be the result value of Equation 5 described in FIG. 4, obtained through the input in the speaker characteristic adjustment area 830.
In another embodiment, the vocal characteristics received through input in the vocal characteristic adjustment area 820 and the weights for those vocal characteristics may be used as the vocal characteristics for finding the corresponding term in Equation 13. Here, the speaker characteristic term in Equation 13 may be the result value of Equation 5 described in FIG. 4, obtained through the input in the speaker characteristic adjustment area 830.
In the present disclosure, gender, vocal tone, vocal strength, male age, female age, pitch, and tempo may be presented in the vocal characteristic adjustment area 820 as quantitatively adjustable vocal characteristics. By adjusting the ratio control (e.g., a bar) corresponding to each of the plurality of vocal characteristics according to user input, the ratio at which each vocal characteristic is reflected in the speaker characteristic of the new speaker can be determined. For example, if the bar corresponding to a vocal characteristic is set to 0, that vocal characteristic is not reflected in the speaker characteristic of the new speaker at all. Although seven vocal characteristics are used in FIG. 8 to generate the speaker characteristic of the new speaker, the present disclosure is not limited thereto, and six or fewer vocal characteristics, or additional vocal characteristics, may be displayed in the vocal characteristic adjustment area 820 and used to generate the speaker characteristic of the new speaker.
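For illustration only, one plausible way to apply the per-feature sliders is to move the blended speaker vector along per-feature direction vectors scaled by the slider values; a slider at 0 then leaves that vocal characteristic untouched. The direction vectors and the linear scaling are assumptions here; the description obtains the change directions as in FIG. 4 (for example, from differences of speaker-vector means or hyperplane normal vectors).

```python
import numpy as np

def apply_vocal_characteristic_sliders(base_speaker, feature_directions, slider_values):
    """base_speaker: (dim,) blended reference speaker characteristic.
    feature_directions: dict name -> (dim,) direction vector for that vocal characteristic.
    slider_values: dict name -> weight; a value of 0 contributes nothing."""
    new_speaker = base_speaker.copy()
    for name, weight in slider_values.items():
        if weight:
            new_speaker += weight * feature_directions[name]
    return new_speaker

sliders = {"pitch": 0.3, "tempo": 0.0, "vocal_tone": -0.2}  # hypothetical slider settings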
The processor may then receive the speaker characteristics of the one or more reference speakers selected in the speaker characteristic adjustment area 830, and generate the speaker characteristic of the new speaker using the vocal characteristic adjustment information, which includes the weights entered in the speaker characteristic adjustment area 830 and/or the weights entered in the vocal characteristic adjustment area 820. Any one of the methods described with reference to FIG. 4 may be used to generate the speaker characteristic of the new speaker. The processor may then input the target text and the generated speaker characteristic of the new speaker to the artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristic of the new speaker is reflected. For example, when the inputs in the text area 810, the vocal characteristic adjustment area 820, and the speaker characteristic adjustment area 830 are complete and the 'Generate' button located below the vocal characteristic adjustment area 820 is selected or clicked, an output voice for the target text in which the speaker characteristic of the new speaker is reflected may be generated. The generated output voice may be played through a speaker connected to the user terminal, and the playback time and/or position of the output voice may be displayed in the output voice display area 840.
FIG. 9 is a structural diagram illustrating an artificial neural network model 900 according to an embodiment of the present disclosure. In machine learning and cognitive science, the artificial neural network model 900 is a statistical learning algorithm implemented based on the structure of biological neural networks, or a structure that executes such an algorithm. According to an embodiment, the artificial neural network model 900 may represent a machine learning model with problem-solving ability in which nodes, that is, artificial neurons forming a network through synaptic connections as in a biological neural network, repeatedly adjust their synaptic weights so that the error between the correct output for a given input and the inferred output is reduced. For example, the artificial neural network model 900 may include any probabilistic model, neural network model, or the like used in artificial intelligence learning methods such as machine learning and deep learning. In the present disclosure, the artificial neural network model 900 may include the above-described artificial neural network text-to-speech synthesis model, the above-described artificial neural network speaker characteristic change generation model, the above-described artificial neural network vocal characteristic prediction model, and/or the above-described artificial neural network speaker feature extraction model.
The artificial neural network model 900 may be implemented as a multilayer perceptron (MLP) composed of multiple layers of nodes and the connections between them. The artificial neural network model 900 according to this embodiment may be implemented using any of various artificial neural network structures, including the MLP. As shown in FIG. 9, the artificial neural network model 900 may consist of an input layer 920 that receives an input signal or data 910 from the outside, an output layer 940 that outputs an output signal or data 950 corresponding to the input data, and n hidden layers 930_1 to 930_n located between the input layer 920 and the output layer 940, which receive signals from the input layer 920, extract features, and pass them to the output layer 940. Here, the output layer 940 may receive signals from the hidden layers 930_1 to 930_n and output them to the outside. Learning methods for the artificial neural network model 900 include supervised learning, which optimizes the model for solving a problem using teacher signals (correct answers), and unsupervised learning, which does not require teacher signals.
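For illustration only, a multilayer perceptron of the kind shown in FIG. 9, with an input layer, n hidden layers, and an output layer, can be built as in the sketch below; the layer sizes are arbitrary placeholders.

```python
import torch.nn as nn

def build_mlp(in_dim, hidden_dims, out_dim):
    layers, prev = [], in_dim                      # the first Linear consumes the input layer 920
    for h in hidden_dims:                          # hidden layers 930_1 ... 930_n
        layers += [nn.Linear(prev, h), nn.ReLU()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))        # output layer 940
    return nn.Sequential(*layers)

model = build_mlp(in_dim=128, hidden_dims=(256, 256), out_dim=80)  # illustrative sizes
```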
According to an embodiment, when the artificial neural network model 900 is an artificial neural network text-to-speech synthesis model, the processor may input text information and the speaker characteristic of a new speaker to the artificial neural network model 900, and the model may be trained end-to-end to output voice data for the text in which the new speaker characteristic is reflected. That is, given information about the text and about the new speaker, the intermediate steps are learned by the model itself, and a synthesized voice can be output. The processor may generate the synthesized voice by converting the text information and the speaker characteristic of the new speaker into embeddings (for example, embedding vectors) through the encoding layers of the artificial neural network model 900. Here, the input variable of the artificial neural network model 900 may be a vector 910 composed of vector data elements representing the text information and the new speaker information. The text information may be represented by any embedding representing text, for example, character embeddings or phoneme embeddings, and the speaker characteristic of the new speaker may be represented by any form of embedding representing the speaker's vocalization. When the artificial neural network model 900 is trained end-to-end, it may be trained in a way that reflects the dependency between the text information and the new speaker information. Under this configuration, the output variable may be a result vector 950 representing the synthesized voice for the target text in which the speaker characteristic of the new speaker is reflected.
In this way, a plurality of input variables and the corresponding plurality of output variables are matched to the input layer 920 and the output layer 940 of the artificial neural network model 900, respectively, and the synaptic values between the nodes included in the input layer 920, the hidden layers 930_1 ... 930_n (where n is a natural number of 2 or more), and the output layer 940 are adjusted so that the artificial neural network model 900 can be trained to infer the correct output for a given input. In inferring the correct output, ground-truth data for the analysis results may be used, and such ground-truth data may be obtained as the result of an annotator's annotation work. Through this learning process, characteristics hidden in the input variables of the artificial neural network model 900 can be identified, and the synaptic values (or weights) between the nodes of the artificial neural network model 900 can be adjusted so that the error between the output variable calculated from the input variables and the target output is reduced.
To address this dependency between the input information, a loss function that minimizes the mutual information between the text information and the new speaker information (for example, between the text information embedding and the new speaker information embedding) may be used when training the artificial neural network model 900. According to an embodiment, when the artificial neural network model 900 is an artificial neural network text-to-speech synthesis model, it may include a module (for example, a fully-connected layer) configured to predict the loss between the text information embedding and the new speaker information embedding. Under this configuration, the artificial neural network model 900 may be trained to estimate the mutual information between the text information and the speaker information and to minimize it. The artificial neural network model 900 trained in this way may be configured to control the input text information and the new speaker information independently of each other.
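For illustration only, one common way to realize such a mutual-information penalty is a small fully-connected critic that scores (text embedding, speaker embedding) pairs: matched pairs versus shuffled pairs give a variational (Donsker-Varadhan, MINE-style) lower bound on mutual information, which the critic maximizes and the synthesis model's encoders minimize. This is a hedged sketch of that general idea, not the specific loss module used in the disclosure.

```python
import math
import torch
import torch.nn as nn

class MICritic(nn.Module):
    """Fully-connected critic T(text_emb, speaker_emb) for a variational MI estimate."""
    def __init__(self, text_dim=128, speaker_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + speaker_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, t, s):
        return self.net(torch.cat([t, s], dim=-1)).squeeze(-1)

def mi_lower_bound(critic, text_emb, speaker_emb):
    joint = critic(text_emb, speaker_emb).mean()                 # matched (joint) pairs
    shuffled = speaker_emb[torch.randperm(speaker_emb.size(0))]  # break the pairing (marginal)
    marginal = torch.logsumexp(critic(text_emb, shuffled), dim=0) - math.log(text_emb.size(0))
    # Maximize with respect to the critic; minimize with respect to the text/speaker encoders.
    return joint - marginal
```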
Then, when the artificial neural network model 900 is an artificial neural network text-to-speech synthesis model, the processor may input the target text information and the new speaker information to the trained artificial neural network model 900 to output a synthesized voice corresponding to the target text in which the speaker characteristic of the new speaker is reflected. Such voice data may be configured so that the mutual information between the target text information and the new speaker information is minimized.
This training process for the artificial neural network model 900 may also be applied to the above-described artificial neural network speaker characteristic change generation model, the above-described artificial neural network vocal characteristic prediction model, and/or the above-described artificial neural network speaker feature extraction model, using the training data for each model. The artificial neural network models trained in this way may then take data corresponding to the training input data as input and generate inference values as output data.
The above-described method may be provided as a computer program stored in a computer-readable recording medium for execution on a computer. The medium may continuously store the computer-executable program, or may temporarily store it for execution or download. The medium may also be any of various recording means or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and media configured to store program instructions, including ROM, RAM, and flash memory. Other examples of media include recording media or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software.
The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. Those skilled in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the present disclosure may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design requirements imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In a hardware implementation, the processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this disclosure, a computer, or a combination thereof.
Accordingly, the various illustrative logic blocks, modules, and circuits described in connection with the present disclosure may be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
In a firmware and/or software implementation, the techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, compact disc (CD), or a magnetic or optical data storage device. The instructions may be executable by one or more processors and may cause the processor(s) to perform certain aspects of the functionality described in this disclosure.
Although the embodiments described above have been described as utilizing aspects of the presently disclosed subject matter in one or more standalone computer systems, the present disclosure is not limited thereto and may be implemented in conjunction with any computing environment, such as a network or distributed computing environment. Furthermore, aspects of the subject matter of this disclosure may be implemented in a plurality of processing chips or devices, and storage may similarly be distributed across a plurality of devices. Such devices may include PCs, network servers, and portable devices.
Although the present disclosure has been described herein in connection with some embodiments, various modifications and changes can be made without departing from the scope of the present disclosure, as will be understood by those skilled in the art to which the present disclosure pertains. Such modifications and changes are intended to fall within the scope of the claims appended hereto.

Claims (14)

  1. A method for generating a synthesized voice of a new speaker, performed by at least one processor, the method comprising:
    receiving a target text;
    acquiring a speaker characteristic of a reference speaker;
    acquiring vocal characteristic change information;
    determining a speaker characteristic of a new speaker using the acquired speaker characteristic of the reference speaker and the acquired vocal characteristic change information; and
    inputting the target text and the determined speaker characteristic of the new speaker to an artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristic of the new speaker is reflected,
    wherein the artificial neural network text-to-speech synthesis model is trained, based on a plurality of training text items and speaker characteristics of a plurality of training speakers, to output voices for the plurality of training text items in which the speaker characteristics of the plurality of training speakers are reflected.
  2. The method of claim 1, wherein determining the speaker characteristic of the new speaker comprises:
    generating a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired vocal characteristic change information to an artificial neural network speaker characteristic change generation model; and
    outputting the speaker characteristic of the new speaker by synthesizing the speaker characteristic of the reference speaker and the generated speaker characteristic change,
    wherein the artificial neural network speaker characteristic change generation model is trained using speaker characteristics of a plurality of training speakers and a plurality of vocal characteristics included in the speaker characteristics of the plurality of training speakers.
  3. The method of claim 2, wherein the vocal characteristic change information includes information on a change in a target vocal characteristic.
  4. The method of claim 1, wherein:
    acquiring the speaker characteristic of the reference speaker comprises acquiring a plurality of speaker characteristics corresponding to a plurality of reference speakers;
    acquiring the vocal characteristic change information comprises acquiring a set of weights corresponding to the plurality of speaker characteristics; and
    determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker by applying a weight included in the acquired set of weights to each of the plurality of speaker characteristics.
  5. The method of claim 1, further comprising acquiring speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors,
    wherein acquiring the vocal characteristic change information comprises:
    normalizing each of the speaker vectors of the plurality of speakers;
    determining a plurality of principal components by performing a dimensionality reduction analysis on the normalized speaker vectors of the plurality of speakers;
    selecting at least one principal component from among the determined plurality of principal components; and
    determining the vocal characteristic change information using the selected principal component, and
    wherein determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocal characteristic change information, and a weight for the determined vocal characteristic change information.
  6. The method of claim 1, further comprising acquiring speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors,
    wherein each of the plurality of speakers is assigned a label for one or more vocal characteristics,
    wherein acquiring the vocal characteristic change information comprises:
    acquiring speaker vectors of a plurality of speakers whose target vocal characteristics differ; and
    determining the vocal characteristic change information based on a difference between the acquired speaker vectors of the plurality of speakers, and
    wherein determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocal characteristic change information, and a weight for the determined vocal characteristic change information.
  7. The method of claim 1, further comprising acquiring speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors,
    wherein each of the plurality of speakers is assigned a label for one or more vocal characteristics,
    wherein acquiring the vocal characteristic change information comprises:
    acquiring speaker vectors of speakers included in each of a plurality of speaker groups whose target vocal characteristics differ, the plurality of speaker groups including a first speaker group and a second speaker group;
    calculating an average of the speaker vectors of the speakers included in the first speaker group;
    calculating an average of the speaker vectors of the speakers included in the second speaker group; and
    determining the vocal characteristic change information based on a difference between the average speaker vector corresponding to the first speaker group and the average speaker vector corresponding to the second speaker group, and
    wherein determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocal characteristic change information, and a weight for the determined vocal characteristic change information.
  8. The method of claim 1, further comprising acquiring speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors,
    wherein the speaker characteristic of the reference speaker includes a plurality of vocal characteristics of the reference speaker,
    wherein acquiring the vocal characteristic change information comprises:
    inputting the speaker characteristics of the plurality of speakers to an artificial neural network vocal characteristic prediction model to output the vocal characteristics of each of the plurality of speakers;
    selecting, from among the speaker characteristics of the plurality of speakers, a speaker characteristic of a speaker for which there is a difference between a target vocal characteristic among that speaker's output vocal characteristics and the target vocal characteristic among the plurality of vocal characteristics of the reference speaker; and
    acquiring a weight corresponding to the speaker characteristic of the selected speaker, and
    wherein determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and the weight corresponding to the speaker characteristic of the selected speaker.
  9. The method of claim 1, wherein the speaker characteristic of the new speaker includes a speaker characteristic vector, the method further comprising:
    calculating a hash value corresponding to the speaker characteristic vector using a hash function;
    determining whether, among content of a plurality of speakers stored in a storage medium, there is content associated with a hash value similar to the calculated hash value; and
    when there is no content associated with a hash value similar to the calculated hash value, determining that the output voice associated with the speaker characteristic of the new speaker is a new output voice.
  10. The method according to claim 1,
    wherein the speaker characteristic of the reference speaker includes a speaker vector,
    wherein the obtaining of the speech characteristic change information comprises:
    extracting a normal vector for a target vocal characteristic using a vocal characteristic classification model corresponding to the target vocal characteristic, the normal vector referring to a normal vector of a hyperplane that classifies the target vocal characteristic; and
    obtaining information indicating a degree to which the target vocal characteristic is to be adjusted, and
    wherein the determining of the speaker characteristic of the new speaker comprises:
    determining the speaker characteristic of the new speaker based on the speaker vector of the reference speaker, the extracted normal vector, and the degree to which the target vocal characteristic is to be adjusted,
    A method for generating a synthesized voice of a new speaker.
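A sketch of claim 10's hyperplane-normal adjustment. The claim names a vocal characteristic classification model but not its form; a linear SVM is used here purely for illustration because its coefficient vector is exactly the normal of the separating hyperplane, and the synthetic labels and step size are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical training data: speaker vectors labeled by a binary target
# vocal characteristic (e.g. "soft" vs. "firm" tone).
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 256))
y = (X[:, 0] + 0.1 * rng.standard_normal(500) > 0).astype(int)

clf = LinearSVC().fit(X, y)               # linear classifier for the target characteristic
normal = clf.coef_[0]                     # normal vector of the separating hyperplane
normal = normal / np.linalg.norm(normal)  # unit length so 'degree' controls the step size

def adjust_speaker(ref_vector: np.ndarray, degree: float) -> np.ndarray:
    """Shift the reference speaker vector along the hyperplane normal.

    'degree' encodes how strongly (and in which direction) the target
    vocal characteristic is adjusted.
    """
    return ref_vector + degree * normal

new_speaker = adjust_speaker(rng.standard_normal(256), degree=1.5)
```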
  11. A computer program stored in a computer-readable recording medium for executing the method according to claim 1 on a computer.
  12. A speech synthesizer,
    trained using training data that includes a synthesized voice of a new speaker generated according to the method of claim 1.
  13. An apparatus for providing a synthesized voice, comprising:
    a memory configured to store a synthesized voice of a new speaker generated according to the method of claim 1; and
    at least one processor connected to the memory and configured to execute at least one computer-readable program contained in the memory,
    wherein the at least one program includes instructions for outputting at least a part of the synthesized voice of the new speaker stored in the memory,
    An apparatus for providing a synthesized voice.
  14. A method of providing a synthesized voice of a new speaker, performed by at least one processor, the method comprising:
    storing a synthesized voice of a new speaker generated according to the method of claim 1; and
    providing at least a part of the stored synthesized voice of the new speaker,
    A method for providing a synthesized voice of a new speaker.
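A toy sketch of the store-then-provide flow of claims 13 and 14, assuming the synthesized voice is a waveform array kept in an in-memory store; the class, key format, and sample rate are illustrative assumptions only.

```python
from typing import Dict, Optional

import numpy as np

class SynthesizedVoiceStore:
    """Minimal in-memory store for synthesized voices keyed by speaker id."""

    def __init__(self) -> None:
        self._voices: Dict[str, np.ndarray] = {}

    def store(self, speaker_id: str, waveform: np.ndarray) -> None:
        self._voices[speaker_id] = waveform

    def provide(self, speaker_id: str, start: int = 0,
                end: Optional[int] = None) -> np.ndarray:
        """Return at least a part (a slice) of the stored synthesized voice."""
        return self._voices[speaker_id][start:end]

store = SynthesizedVoiceStore()
store.store("new_speaker_001", np.zeros(16000, dtype=np.float32))  # 1 s of silence at 16 kHz
clip = store.provide("new_speaker_001", start=0, end=8000)
```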
PCT/KR2022/001414 2021-01-26 2022-01-26 Method and system for generating synthesized speech of new speaker WO2022164207A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0011093 2021-01-26
KR20210011093 2021-01-26
KR10-2022-0011853 2022-01-26
KR1020220011853A KR102604932B1 (en) 2021-01-26 2022-01-26 Method and system for generating synthesis voice of a new speaker

Publications (1)

Publication Number Publication Date
WO2022164207A1 true WO2022164207A1 (en) 2022-08-04

Family

ID=82653616

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/001414 WO2022164207A1 (en) 2021-01-26 2022-01-26 Method and system for generating synthesized speech of new speaker

Country Status (1)

Country Link
WO (1) WO2022164207A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190085882A (en) * 2018-01-11 2019-07-19 네오사피엔스 주식회사 Method and computer readable storage medium for performing text-to-speech synthesis using machine learning
KR20190096877A (en) * 2019-07-31 2019-08-20 엘지전자 주식회사 Artificial intelligence(ai)-based voice sampling apparatus and method for providing speech style in heterogeneous label
US20200005763A1 (en) * 2019-07-25 2020-01-02 Lg Electronics Inc. Artificial intelligence (ai)-based voice sampling apparatus and method for providing speech style
KR20200056342A (en) * 2018-11-14 2020-05-22 네오사피엔스 주식회사 Method for retrieving content having voice identical to voice of target speaker and apparatus for performing the same
KR20200088263A (en) * 2018-05-29 2020-07-22 한국과학기술원 Method and system of text to multiple speech

Similar Documents

Publication Publication Date Title
WO2019117466A1 (en) Electronic device for analyzing meaning of speech, and operation method therefor
WO2020190054A1 (en) Speech synthesis apparatus and method therefor
WO2020190050A1 (en) Speech synthesis apparatus and method therefor
WO2019139430A1 (en) Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
WO2020189850A1 (en) Electronic device and method of controlling speech recognition by electronic device
WO2020145439A1 (en) Emotion information-based voice synthesis method and device
WO2019139431A1 (en) Speech translation method and system using multilingual text-to-speech synthesis model
WO2020105856A1 (en) Electronic apparatus for processing user utterance and controlling method thereof
WO2020246702A1 (en) Electronic device and method for controlling the electronic device thereof
WO2015005679A1 (en) Voice recognition method, apparatus, and system
WO2022045651A1 (en) Method and system for applying synthetic speech to speaker image
WO2020213842A1 (en) Multi-model structures for classification and intent determination
WO2018097439A1 (en) Electronic device for performing translation by sharing context of utterance and operation method therefor
WO2020111676A1 (en) Voice recognition device and method
WO2020116930A1 (en) Electronic device for outputting sound and operating method thereof
WO2020209647A1 (en) Method and system for generating synthetic speech for text through user interface
WO2021029642A1 (en) System and method for recognizing user's speech
WO2022265273A1 (en) Method and system for providing service for conversing with virtual person simulating deceased person
WO2022164207A1 (en) Method and system for generating synthesized speech of new speaker
WO2022004971A1 (en) Learning device and method for generating image
WO2022260432A1 (en) Method and system for generating composite speech by using style tag expressed in natural language
WO2021040490A1 (en) Speech synthesis method and apparatus
WO2021085661A1 (en) Intelligent voice recognition method and apparatus
WO2022102987A1 (en) Electronic device and control method thereof
WO2022034982A1 (en) Method for performing synthetic speech generation operation on text

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22746233

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.11.2023)