WO2022164207A1 - Method and system for generating synthesized speech of new speaker - Google Patents


Info

Publication number
WO2022164207A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
characteristic
speech
new
speakers
Prior art date
Application number
PCT/KR2022/001414
Other languages
French (fr)
Korean (ko)
Inventor
김태수
이영근
황영태
Original Assignee
네오사피엔스 주식회사 (Neosapience Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 네오사피엔스 주식회사 (Neosapience Co., Ltd.)
Priority claimed from KR1020220011853A (KR102604932B1)
Publication of WO2022164207A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • The present disclosure relates to a method and system for generating a synthesized voice of a new speaker, and more particularly, to determining the speaker characteristic of a new speaker using the speaker characteristic of a reference speaker and vocalization characteristic change information, and to generating a synthesized voice that reflects the determined speaker characteristic using an artificial neural network text-to-speech synthesis model.
  • With the development of virtual voice generation technology and virtual image production technology, any content creator can easily produce audio content or video content.
  • In such virtual voice generation technology, a neural network voice model is trained on audio samples recorded by voice actors, and voice synthesis technology that reproduces the voice characteristics of the voice actors who recorded the audio samples is being developed.
  • the present disclosure provides a method for generating a new speaker's synthesized voice, a computer program stored in a computer-readable recording medium, and an apparatus (system) to solve the above problems.
  • the present disclosure may be implemented in various ways including a method, a system, an apparatus, or a computer program stored in a computer-readable storage medium, and a computer-readable recording medium.
  • According to an embodiment of the present disclosure, a method for generating a synthesized voice of a new speaker includes: receiving a target text; acquiring a speaker characteristic of a reference speaker; acquiring vocalization characteristic change information; determining the speaker characteristic of a new speaker using the acquired speaker characteristic of the reference speaker and the acquired vocalization characteristic change information; and inputting the target text and the determined speaker characteristic of the new speaker into an artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristic of the new speaker is reflected.
  • The artificial neural network text-to-speech synthesis model is trained based on a plurality of training text items and the speaker characteristics of a plurality of training speakers.
  • According to an embodiment, determining the speaker characteristic of the new speaker includes generating a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired vocalization characteristic change information into an artificial neural network speaker characteristic change generation model, and outputting the speaker characteristic of the new speaker by synthesizing the speaker characteristic of the reference speaker and the generated speaker characteristic change. The artificial neural network speaker characteristic change generation model is trained using the speaker characteristics of a plurality of training speakers and a plurality of vocalization characteristics included in those speaker characteristics.
  • the speech characteristic change information includes information about a change in the target speech characteristic.
  • According to an embodiment, acquiring the speaker characteristics of the reference speaker includes acquiring a plurality of speaker characteristics corresponding to a plurality of reference speakers, and acquiring the vocalization characteristic change information includes obtaining a corresponding set of weights. Determining the speaker characteristic of the new speaker then includes applying the weight included in the obtained weight set to each of the plurality of speaker characteristics.
  • According to an embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics comprising a plurality of speaker vectors. Obtaining the vocalization characteristic change information includes: normalizing each of the speaker vectors; determining a plurality of principal components by performing dimensionality reduction analysis on the normalized speaker vectors; selecting at least one principal component from among the determined principal components; and determining the vocalization characteristic change information using the selected principal component. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocalization characteristic change information, and a weight of the determined vocalization characteristic change information.
  • According to an embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics comprising a plurality of speaker vectors, where each of the plurality of speakers is assigned a label for one or more vocalization characteristics. Obtaining the vocalization characteristic change information includes obtaining the speaker vectors of a plurality of speakers having different target vocalization characteristics and determining the vocalization characteristic change information based on a difference between the obtained speaker vectors. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocalization characteristic change information, and a weight of the determined vocalization characteristic change information.
  • According to an embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics comprising a plurality of speaker vectors, where each of the plurality of speakers is assigned a label for one or more vocalization characteristics. Obtaining the vocalization characteristic change information includes: obtaining the speaker vectors of the speakers included in each of a plurality of speaker groups having different target vocalization characteristics, the groups including a first speaker group and a second speaker group; calculating the average of the speaker vectors of the speakers included in the first speaker group; calculating the average of the speaker vectors of the speakers included in the second speaker group; and determining the vocalization characteristic change information based on the difference between the average speaker vector corresponding to the first speaker group and the average speaker vector corresponding to the second speaker group. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocalization characteristic change information, and a weight of the determined vocalization characteristic change information.
  • According to an embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics comprising a plurality of speaker vectors, and the speaker characteristic of the reference speaker includes a plurality of vocalization characteristics of the reference speaker.
  • Obtaining the vocalization characteristic change information includes: inputting the speaker characteristics of the plurality of speakers into an artificial neural network vocalization characteristic prediction model and outputting the vocalization characteristics of each of the plurality of speakers; selecting, from among the speaker characteristics of the plurality of speakers, the speaker characteristic of a speaker whose output target vocalization characteristic differs from the target vocalization characteristic among the plurality of vocalization characteristics of the reference speaker; and acquiring a weight corresponding to the selected speaker characteristic. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and the weight corresponding to the speaker characteristic of the selected speaker.
  • According to an embodiment, the speaker characteristic of the new speaker includes a speaker feature vector, and the method includes: calculating a hash value corresponding to the speaker feature vector using a hash function; determining whether there is content associated with a hash value similar to the calculated hash value; and, if there is no content associated with a similar hash value, determining that the output voice associated with the speaker characteristic of the new speaker is a new output voice.
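  • As a rough illustration of this novelty check (a minimal sketch, not the specific hashing scheme of the disclosure), a speaker feature vector could be coarsely quantized and hashed, and the hash compared against hashes already associated with existing content; detecting merely similar hashes would require a locality-sensitive scheme, which is omitted here, and the function names below are illustrative.

```python
import hashlib
import numpy as np

def speaker_vector_hash(speaker_vector: np.ndarray, precision: int = 2) -> str:
    """Hash a speaker feature vector after coarse quantization, so that
    nearly identical vectors map to the same hash value."""
    quantized = np.round(speaker_vector, precision)
    return hashlib.sha256(quantized.tobytes()).hexdigest()

def is_new_voice(speaker_vector: np.ndarray, existing_hashes: set) -> bool:
    """Return True if no existing content is associated with this hash."""
    return speaker_vector_hash(speaker_vector) not in existing_hashes

# Usage: check whether a newly generated speaker vector collides with stored ones.
existing = {speaker_vector_hash(np.array([0.10, -0.52, 0.33]))}
print(is_new_voice(np.array([0.90, 0.12, -0.75]), existing))  # True -> treat as a new voice
```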
  • According to an embodiment, the speaker characteristic of the reference speaker includes a speaker vector, and obtaining the vocalization characteristic change information includes extracting a normal vector for the target vocalization characteristic using a vocalization characteristic classification model corresponding to the target vocalization characteristic (the normal vector being the normal vector of a hyperplane that classifies the target vocalization characteristic), and obtaining information indicating the degree to which the target vocalization characteristic is to be adjusted. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker based on the speaker vector, the extracted normal vector, and the degree to which the target vocalization characteristic is to be adjusted.
  • a computer program stored in a computer-readable recording medium is provided for executing the above-described method for generating a synthesized voice of a new speaker according to an embodiment of the present disclosure in a computer.
  • the speech synthesizer is trained using learning data including the synthesized voice of the new speaker generated according to the above-described method for generating the synthesized voice of the new speaker.
  • According to an embodiment, an apparatus for providing a synthesized voice includes a memory configured to store the synthesized voice of the new speaker generated according to the above-described method, and at least one processor connected to the memory and configured to execute at least one computer-readable program stored in the memory, wherein the at least one program is configured to output at least a portion of the synthesized voice of the new speaker stored in the memory.
  • According to an embodiment, a method of providing a synthesized voice of a new speaker includes storing the synthesized voice of the new speaker generated according to the above-described generation method and providing at least a portion of the stored synthesized voice of the new speaker.
  • a synthesized voice having a new voice may be generated by modifying a speaker feature vector through quantitative adjustment of vocalization features.
  • a new speaker's voice may be generated by mixing the voices of several speakers (eg, two or more speakers or three or more speakers).
  • the output voice may be generated by finely adjusting one or more vocalization characteristics from the user terminal.
  • the one or more vocal characteristics may include gender control, vocal tone control, vocal strength, male age control, female age control, pitch, tempo, and the like.
  • FIG. 1 is a diagram illustrating an example in which a synthesized voice generating system according to an embodiment of the present disclosure generates an output voice by receiving a target text and speaker characteristics of a new speaker.
  • FIG. 2 is a schematic diagram illustrating a configuration in which a plurality of user terminals and a synthesized voice generating system are communicatively connected to provide a synthetic voice generating service for text according to an embodiment of the present disclosure.
  • FIG. 3 is a block diagram illustrating an internal configuration of a user terminal and a synthesized voice generating system according to an embodiment of the present disclosure.
  • FIG. 4 is a block diagram illustrating an internal configuration of a processor of a user terminal according to an embodiment of the present disclosure.
  • FIG. 5 is a flowchart illustrating a method of generating an output voice in which a speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure.
  • FIG. 6 is a diagram illustrating an example of generating an output voice in which the speaker characteristics of a new speaker are reflected using the artificial neural network text-to-speech synthesis model according to an embodiment of the present disclosure.
  • FIG. 7 is a diagram illustrating an example of generating an output voice in which a speaker characteristic of a new speaker is reflected using an artificial neural network text-to-speech synthesis model according to another embodiment of the present disclosure.
  • FIG. 8 is an exemplary diagram illustrating a user interface for generating an output voice in which a speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure.
  • FIG. 9 is a structural diagram illustrating an artificial neural network model according to an embodiment of the present disclosure.
  • The term 'unit' or 'module' used in the specification means a software or hardware component, and a 'unit' or 'module' performs certain roles.
  • However, a 'unit' or 'module' is not meant to be limited to software or hardware.
  • A 'unit' or 'module' may be configured to reside on an addressable storage medium or may be configured to execute on one or more processors.
  • Accordingly, as an example, a 'unit' or 'module' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, or variables.
  • Components and 'units' or 'modules' may be combined into a smaller number of components and 'units' or 'modules', or further separated into additional components and 'units' or 'modules'.
  • a 'unit' or a 'module' may be implemented with a processor and a memory.
  • 'Processor' should be construed broadly to include general purpose processors, central processing units (CPUs), microprocessors, digital signal processors (DSPs), controllers, microcontrollers, state machines, and the like.
  • A 'processor' may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field programmable gate array (FPGA), or the like.
  • 'Processor' may also refer to a combination of processing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Also, 'memory' should be construed broadly to include any electronic component capable of storing electronic information.
  • For example, memory may refer to various types of processor-readable media such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), and the like.
  • a memory is said to be in electronic communication with the processor if the processor is capable of reading information from and/or writing information to the memory.
  • a memory integrated in the processor is in electronic communication with the processor.
  • a 'text item' may refer to a part or all of text, and the text may refer to a text item.
  • each of 'data item' and 'information item' may refer to at least a portion of data and at least a portion of information, and data and information may refer to a data item and information item.
  • In the present disclosure, 'each of a plurality of A' or 'each of the plurality of A' may refer to each of all components included in the plurality of A, or to each of some components included in the plurality of A.
  • For example, each of the speaker characteristics of a plurality of speakers may refer to each of all speaker characteristics included in the speaker characteristics of the plurality of speakers, or to each of some speaker characteristics included in the speaker characteristics of the plurality of speakers.
  • FIG. 1 is a diagram illustrating an example in which a synthesized voice generating system 100 according to an embodiment of the present disclosure generates an output voice 130 by receiving a target text 110 and a speaker characteristic 120 of a new speaker.
  • the synthesized voice generating system 100 may receive the target text 110 and the speaker characteristic 120 of the new speaker, and generate the output voice 130 in which the speaker characteristic 120 of the new speaker is reflected.
  • The target text 110 may include one or more paragraphs, sentences, clauses, phrases, words, phonemes, and the like.
  • the speaker characteristic 120 of the new speaker may be determined or generated using the speaker characteristic of the reference speaker and information on the change of the vocalization characteristic.
  • the speaker characteristic of the reference speaker may include the speaker characteristic of the speaker to be newly created, that is, the speaker characteristic of the speaker that is a reference in generating the speaker characteristic of the new speaker.
  • the speaker characteristic of the reference speaker may include a speaker characteristic similar to the speaker characteristic of the speaker to be newly created.
  • the speaker characteristics of the reference speaker may include speaker characteristics of a plurality of reference speakers.
  • the speaker characteristic of the reference speaker may include a speaker vector of the reference speaker.
  • The speaker vector of the reference speaker may be extracted by inputting a speaker id (e.g., a speaker one-hot vector) and a vocalization feature (e.g., a vector) into an artificial neural network speaker feature extraction model.
  • Here, the artificial neural network speaker feature extraction model may be trained to receive a plurality of training speaker ids and a plurality of training vocalization features (e.g., vectors) and to extract the (ground-truth) speaker vector of the corresponding reference speaker.
  • As another example, the speaker vector of a reference speaker may be extracted by inputting a voice recorded by a speaker and a vocalization feature (e.g., a vector) into an artificial neural network speaker feature extraction model.
  • Here, the artificial neural network speaker feature extraction model may be trained to receive voices recorded by a plurality of training speakers and a plurality of training vocalization features (e.g., vectors) and to extract the (ground-truth) speaker vector of the corresponding reference speaker.
  • the speaker vector of the reference speaker may include one or more speech characteristics (eg, tone, speech strength, speech speed, gender, age, etc.) of the reference speaker's voice.
  • the speaker id and/or the voice recorded by the speaker may be selected as the voice on which the speaker characteristics of the new speaker are based.
  • the vocalization characteristic may include a basic vocalization characteristic that will be reflected in the speaker characteristic of the new speaker.
  • That is, the speaker id, the voice recorded by the speaker, and/or the vocalization characteristic are used to generate the speaker characteristic of the reference speaker, and the speaker characteristic of the reference speaker generated in this way is synthesized with the vocalization characteristic change information to obtain the speaker characteristic of the new speaker.
  • the vocalization characteristic change information may include any information about the vocalization characteristic desired to be applied to the speaker characteristic of the new speaker.
  • the speech characteristic change information may include information about a difference between the speaker characteristic of the new speaker and the speaker characteristic of the reference speaker.
  • the new speaker characteristic may be generated by synthesizing the speaker characteristic and the speaker characteristic change of the reference speaker.
  • the speaker characteristic change may be generated by inputting the speaker characteristic and vocalization characteristic change information of the reference speaker to the artificial neural network speaker characteristic change generation model.
  • the artificial neural network speaker characteristic change generation model may be trained using speaker characteristics of a plurality of learned speakers and a plurality of speech characteristics included in the plurality of speaker characteristics.
  • the vocalization characteristic change information may include information indicating a difference between the target vocalization characteristic included in the speaker characteristic of the new speaker and the target vocalization characteristic included in the speaker characteristic of the reference speaker. That is, the speech characteristic change information may include information about a change in the target speech characteristic.
  • the speech feature change information may include a normal vector of a hyperplane that classifies the target speech feature from the speaker feature and information indicating the degree of adjusting the target speech feature.
  • the speech characteristic change information may include a weight to be applied to each of the speaker characteristics of the plurality of reference speakers.
  • The vocalization characteristic change information may include a target vocalization feature axis generated based on the target vocalization features included in the speaker characteristics of the training speakers, and a weight for the target vocalization feature.
  • The vocalization characteristic change information may include a target vocalization feature axis generated based on a difference between the speaker characteristics of speakers having different target vocalization features, and a weight for the target vocalization feature.
  • The vocalization characteristic change information may include the speaker characteristic of a speaker whose target vocalization characteristic differs from the target vocalization characteristic included in the speaker characteristic of the reference speaker, and a weight for that speaker characteristic.
  • The synthesized voice generation system 100 may generate, as a synthesized voice for the target text 110 in which the speaker characteristic 120 of the new speaker is reflected, an output voice 130 in which the target text is uttered according to the speaker characteristic of the newly created speaker.
  • The synthesized voice generation system 100 may include an artificial neural network text-to-speech synthesis model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output voices for the training text items in which the speaker characteristics of the training speakers are reflected.
  • The artificial neural network text-to-speech synthesis model may be configured to output voice data for the target text when the target text 110 and the speaker characteristic 120 of the new speaker are input.
  • In this case, the output voice data may be post-processed into a human-audible voice using a post-processor, a vocoder, or the like.
  • FIG. 2 illustrates a configuration in which a plurality of user terminals 210_1 , 210_2 , and 210_3 and a synthesized voice generating system 230 are communicatively connected to provide a synthetic voice generating service for text according to an embodiment of the present disclosure.
  • the plurality of user terminals 210_1 , 210_2 , and 210_3 may communicate with the synthesized voice generation system 230 through the network 220 .
  • the network 220 may be configured to enable communication between the plurality of user terminals 210_1 , 210_2 , and 210_3 and the synthesized voice generating system 230 .
  • Depending on the installation environment, the network 220 may be configured as, for example, a wired network such as Ethernet, a wired home network (power line communication), telephone line communication, or RS-serial communication; a wireless network such as a mobile communication network, a wireless LAN (WLAN), Wi-Fi, Bluetooth, or ZigBee; or a combination thereof.
  • The communication method is not limited, and may include not only a communication method using the network 220 but also short-range wireless communication between the user terminals 210_1, 210_2, and 210_3.
  • the network 220 may include a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), and a broadband network (BBN). , the Internet, and the like.
  • The network 220 may include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree or hierarchical network, and the like, but is not limited thereto.
  • In FIG. 2, a mobile phone or smartphone 210_1, a tablet computer 210_2, and a laptop or desktop computer 210_3 are illustrated as examples of user terminals that execute or operate a user interface providing the synthesized voice generation service, but the user terminals are not limited thereto.
  • The user terminals 210_1, 210_2, and 210_3 may be any computing device that is capable of wired and/or wireless communication and on which a web browser, a mobile browser application, or a synthesized voice generation application is installed so that the user interface providing the synthesized voice generation service can be executed.
  • For example, the user terminal 210 may include a smartphone, a mobile phone, a navigation terminal, a desktop computer, a laptop computer, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a tablet computer, a game console, a wearable device, an Internet of things (IoT) device, a virtual reality (VR) device, an augmented reality (AR) device, and the like.
  • In addition, although three user terminals 210_1, 210_2, and 210_3 are illustrated in FIG. 2 as communicating with the synthesized voice generation system 230 through the network 220, the present disclosure is not limited thereto, and a different number of user terminals may be configured to communicate with the synthesized voice generation system 230 through the network 220.
  • the user terminals 210_1, 210_2, and 210_3 provide the target text, information about the speaker characteristics of the reference speaker, and/or information indicating or selecting speech characteristics change information to the synthesized speech generation system 230.
  • the user terminals 210_1 , 210_2 , and 210_3 may receive the speaker characteristic and/or the candidate vocalization characteristic change information of the candidate reference speaker from the synthesized speech generation system 230 .
  • the user terminals 210_1, 210_2, and 210_3 may select, in response to the user input, speaker characteristics and/or speech characteristics change information of the reference speaker from the candidate reference speaker speaker characteristics and/or candidate vocal characteristics change information.
  • the user terminals 210_1 , 210_2 , and 210_3 may receive the output voice generated from the synthesized voice generating system 230 .
  • In FIG. 2, each of the user terminals 210_1, 210_2, and 210_3 and the synthesized voice generation system 230 are illustrated as separately configured elements, but the present disclosure is not limited thereto, and the synthesized voice generation system 230 may be configured to be included in each of the user terminals 210_1, 210_2, and 210_3.
  • In that case, the synthesized voice generation system 230 may include an input/output interface so that it can determine the target text, the speaker characteristic of the reference speaker, and the vocalization characteristic change information without communicating with the user terminals 210_1, 210_2, and 210_3, and output a synthesized voice for the target text in which the speaker characteristic of the new speaker is reflected.
  • The user terminal 210 may refer to any computing device capable of wired/wireless communication, for example, the mobile phone or smartphone 210_1, the tablet computer 210_2, or the laptop or desktop computer 210_3 of FIG. 2, and the like.
  • the user terminal 210 may include a memory 312 , a processor 314 , a communication module 316 , and an input/output interface 318 .
  • The synthesized voice generation system 230 may include a memory 332, a processor 334, a communication module 336, and an input/output interface 338.
  • As shown in FIG. 3, the user terminal 210 and the synthesized voice generation system 230 may be configured to communicate information and/or data through the network 220 using their respective communication modules 316 and 336.
  • the input/output device 320 may be configured to input information and/or data to the user terminal 210 through the input/output interface 318 or to output information and/or data generated from the user terminal 210 .
  • the memories 312 and 332 may include any non-transitory computer-readable recording medium.
  • The memories 312 and 332 may include a permanent mass storage device such as read-only memory (ROM), a disk drive, a solid state drive (SSD), or flash memory, in addition to random access memory (RAM).
  • As another example, a permanent mass storage device such as a ROM, an SSD, flash memory, or a disk drive may be included in the user terminal 210 and/or the synthesized voice generation system 230 as a separate persistent storage device distinct from the memory.
  • The memories 312 and 332 may store an operating system and at least one program code (e.g., code for determining the speaker characteristic of a new speaker, code for generating an output voice in which the speaker characteristic of a new speaker is reflected, etc.).
  • The separate computer-readable recording medium may include a recording medium directly connectable to the user terminal 210 and the synthesized voice generation system 230, for example, a computer-readable recording medium such as a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card.
  • the software components may be loaded into the memories 312 and 332 through a communication module rather than a computer-readable recording medium.
  • As another example, the at least one program may be loaded into the memories 312 and 332 based on a computer program (e.g., an artificial neural network text-to-speech synthesis model program) installed by files provided through the network 220 by developers or by a file distribution system that distributes installation files of applications.
  • the processors 314 and 334 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to the processor 314 , 334 by the memory 312 , 332 or the communication module 316 , 336 . For example, the processors 314 and 334 may be configured to execute received instructions according to program code stored in a recording device, such as the memories 312 and 332 .
  • The communication modules 316 and 336 may provide a configuration or function for the user terminal 210 and the synthesized voice generation system 230 to communicate with each other through the network 220, and may provide a configuration or function for the user terminal 210 and/or the synthesized voice generation system 230 to communicate with another user terminal or another system (e.g., a separate cloud system, a separate frame image generation system, etc.).
  • For example, a request (e.g., a synthesized voice generation request, a request to generate the speaker characteristic of a new speaker, etc.) generated by the processor 314 of the user terminal 210 according to program code stored in a recording device such as the memory 312 may be transmitted to the synthesized voice generation system 230 through the network 220 under the control of the communication module 316.
  • Conversely, a control signal or command provided under the control of the processor 334 of the synthesized voice generation system 230 may be received by the user terminal 210 through the communication module 316 of the user terminal 210 via the communication module 336 and the network 220.
  • the input/output interface 318 may be a means for interfacing with the input/output device 320 .
  • For example, the input device may include a device such as a keyboard, a microphone, a mouse, or a camera including an image sensor, and the output device may include a device such as a display, a speaker, or a haptic feedback device.
  • the input/output interface 318 may be a means for an interface with a device in which a configuration or function for performing input and output, such as a touch screen, is integrated into one.
  • a service screen or user interface configured using data may be displayed on the display through the input/output interface 318 .
  • In FIG. 3, the input/output device 320 is illustrated as not being included in the user terminal 210, but the present disclosure is not limited thereto, and the input/output device 320 may be configured as a single device with the user terminal 210.
  • The input/output interface 338 of the synthesized voice generation system 230 may be a means for interfacing with a device (not shown) for input or output that is connected to, or included in, the synthesized voice generation system 230.
  • In FIG. 3, the input/output interfaces 318 and 338 are illustrated as elements configured separately from the processors 314 and 334, but the present disclosure is not limited thereto, and the input/output interfaces 318 and 338 may be configured to be included in the processors 314 and 334.
  • The user terminal 210 and the synthesized voice generation system 230 may include more components than those shown in FIG. 3. However, most prior-art components need not be explicitly illustrated. According to an embodiment, the user terminal 210 may be implemented to include at least a portion of the above-described input/output device 320. In addition, the user terminal 210 may further include other components such as a transceiver, a global positioning system (GPS) module, a camera, various sensors, and a database. For example, when the user terminal 210 is a smartphone, it may further include components generally included in a smartphone, such as an acceleration sensor, a gyro sensor, a camera module, various physical buttons, buttons using a touch panel, input/output ports, and a vibrator for vibration.
  • the processor 314 of the user terminal 210 may be configured to operate a synthetic voice output application or the like.
  • a code associated with a corresponding application and/or program may be loaded into the memory 312 of the user terminal 210 .
  • The processor 314 of the user terminal 210 may receive information and/or data provided from the input/output device 320 through the input/output interface 318, or may receive information and/or data from the synthesized voice generation system 230 through the communication module 316, and the received information and/or data may be processed and stored in the memory 312.
  • such information and/or data may be provided to the synthesized voice generation system 230 through the communication module 316 .
  • The processor 314 may receive text input or selected through an input device 320, such as a touch screen or a keyboard connected to the input/output interface 318, and may store the received text in the memory 312 or provide it to the synthesized voice generation system 230 through the communication module 316 and the network 220.
  • the processor 314 may receive an input for the target text (eg, one or more paragraphs, sentences, phrases, words, phonemes, etc.) through the input device 320 .
  • the processor 314 may receive, through the input device 320 , any information indicating or selecting information about a reference speaker and/or information on change of speech characteristics.
  • the processor 314 may receive an input for the target text through the input device 320 through the input/output interface 318 .
  • the processor 314 may receive, through the input device 320 and the input/output interface 318 , an input for uploading a file in a document format including the target text through the user interface.
  • the processor 314 may receive a file in a document format corresponding to the input from the memory 312 .
  • the processor 314 may receive the target text included in the file.
  • the received target text may be provided to the synthesized speech generating system 230 through the communication module 316 .
  • The processor 314 may be configured to provide the uploaded file to the synthesized voice generation system 230 through the communication module 316 and to receive the target text contained in the file from the synthesized voice generation system 230.
  • The processor 314 may be configured to output the processed information and/or data through an output device of the user terminal 210, such as a device capable of display output (e.g., a touch screen or a display) or a device capable of audio output (e.g., a speaker).
  • The processor 314 may display information representing or selecting the target text and/or the vocalization characteristic change information, received from at least one of the input device 320, the memory 312, or the synthesized voice generation system 230, on the screen of the user terminal 210. Additionally or alternatively, the processor 314 may output the speaker characteristic of the new speaker determined or generated by the information processing system 230 through the screen of the user terminal 210. Also, the processor 314 may output the synthesized voice through a device capable of audio output, such as a speaker associated with the user terminal 210.
  • the processor 334 of the synthesized speech generation system 230 may be configured to manage, process and/or store information and/or data received from a plurality of user terminals and/or a plurality of external systems, including the user terminal 210 .
  • the information and/or data processed by the processor 334 may be provided to the user terminal 210 through the communication module 336 .
  • For example, the processor 334 may receive, from the user terminal 210, information indicating or selecting the target text, information about the reference speaker, and the vocalization characteristic change information, and may obtain or determine the corresponding speaker characteristic of the reference speaker and vocalization characteristic change information stored in the memory 332 and/or an external storage device.
  • the processor 334 may determine the speaker characteristic of the new speaker using the speaker characteristic and the vocalization characteristic change information of the reference speaker. Also, the processor 334 may generate an output voice for the target text in which the determined new speaker characteristic is reflected. For example, the processor 334 may input the target text and the speaker characteristics of the new speaker into the artificial neural network text-to-speech synthesis model to generate output speech from the artificial neural network text-to-speech synthesis model. The output voice generated in this way may be provided to the user terminal 210 through the network 220 and output through a speaker associated with the user terminal 210 .
  • the processor 334 may include a speaker characteristic determination module 410 , a synthesized speech output module 420 , a speech characteristic change information determination module 430 , and an output speech verification module 440 .
  • Each of the modules operated on the processor 334 may be configured to communicate with each other.
  • the internal configuration of the processor 334 is described separately for each function, but this does not necessarily mean that the processor 334 is physically separated.
  • The internal configuration of the processor 334 shown in FIG. 4 is only an example and does not necessarily represent only essential configurations. Accordingly, in some embodiments, the processor 334 may be implemented differently, for example by additionally including components other than the illustrated internal configuration or by omitting some of the illustrated internal components.
  • the speaker characteristic determination module 410 may acquire speaker characteristics of a reference speaker.
  • the features of the reference speaker may be extracted through the learned artificial neural network speaker feature extraction model.
  • For example, the speaker characteristic determination module 410 may extract the speaker features (e.g., vectors) of the reference speaker by inputting a speaker id (e.g., a speaker one-hot vector) and vocalization features (e.g., vectors) into the trained artificial neural network speaker feature extraction model.
  • As another example, the speaker characteristic determination module 410 may extract the speaker features (e.g., vectors) of the reference speaker by inputting a voice recorded by a speaker and vocalization features (e.g., vectors) into the trained artificial neural network speaker feature extraction model.
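  • A minimal sketch of such a speaker feature extraction model, assuming a simple PyTorch encoder that maps a speaker one-hot id and a vocalization feature vector to a fixed-size speaker vector; the module structure, dimensions, and names are illustrative assumptions, not the architecture of the disclosure.

```python
import torch
import torch.nn as nn

class SpeakerFeatureExtractor(nn.Module):
    """Illustrative speaker feature extraction model: maps a speaker one-hot id
    and a vocalization feature vector to a fixed-size speaker vector."""
    def __init__(self, num_speakers: int, vocal_dim: int, speaker_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_speakers + vocal_dim, 512),
            nn.ReLU(),
            nn.Linear(512, speaker_dim),
        )

    def forward(self, speaker_onehot: torch.Tensor, vocal_features: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([speaker_onehot, vocal_features], dim=-1))

# Usage: extract the reference speaker's vector from its id and vocalization features.
extractor = SpeakerFeatureExtractor(num_speakers=100, vocal_dim=5)
speaker_onehot = torch.nn.functional.one_hot(torch.tensor([3]), num_classes=100).float()
vocal_features = torch.tensor([[0.2, 0.8, 0.5, 1.0, 0.3]])  # e.g. tone, strength, speed, gender, age
reference_vector = extractor(speaker_onehot, vocal_features)  # shape (1, 256)
```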
  • the speaker characteristic determination module 410 may obtain speaker characteristics and vocalization characteristic change information of the reference speaker, and determine the speaker characteristic of a new speaker by using the acquired speaker characteristic of the reference speaker and the acquired vocalization characteristic change information.
  • As the speaker characteristic of the reference speaker, at least one of the speaker characteristics of a plurality of speakers stored in a storage medium may be selected.
  • The vocalization characteristic change information may be information indicating a change in the speaker characteristic of the reference speaker, information indicating a change in the speaker characteristics of at least some of the plurality of speakers stored in the storage medium, and/or information indicating a change in the vocalization characteristics included in the speaker characteristics of at least some of the plurality of speakers.
  • the speaker features of the plurality of speakers may include features inferred from the learned artificial neural network speaker feature extraction model.
  • each of the speaker characteristic and the vocalization characteristic may be expressed in a vector form.
  • the synthesized speech output module 420 may receive the target text from the user terminal and receive the speaker characteristics of the new speaker from the speaker characteristic determination module 410 .
  • the synthesized voice output module 420 may generate an output voice for the target text in which the speaker characteristics of the new speaker are reflected.
  • For example, the synthesized voice output module 420 may generate the output voice (i.e., synthesized voice) from the artificial neural network text-to-speech synthesis model by inputting the target text and the speaker characteristic of the new speaker into the trained model.
  • This artificial neural network text-to-speech synthesis model may be stored in a storage medium (e.g., the memory 332 of the information processing system 230, another storage medium accessible by the processor 334 of the information processing system 230, etc.).
  • The artificial neural network text-to-speech synthesis model may include a model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output voices for the training text items in which the speaker characteristics of the training speakers are reflected.
  • the synthesized voice output module 420 may provide the generated synthesized voice to the user terminal. Accordingly, the generated synthesized voice may be output through any speaker built into the user terminal 210 or connected via wire or wirelessly.
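  • The overall synthesis pipeline described above can be sketched as follows, assuming the text-to-speech model and the vocoder are provided as callables; this is an illustrative outline under those assumptions, not a specific library API.

```python
import numpy as np

def synthesize(target_text: str,
               new_speaker_vector: np.ndarray,
               tts_model,
               vocoder) -> np.ndarray:
    """Illustrative pipeline: the text-to-speech model produces intermediate
    voice data (e.g. a mel spectrogram) conditioned on the speaker vector,
    and a vocoder post-processes it into an audible waveform.
    `tts_model` and `vocoder` are assumed callables, not a specific library API."""
    mel = tts_model(target_text, new_speaker_vector)   # text + speaker vector -> mel frames
    waveform = vocoder(mel)                            # mel frames -> audio samples
    return waveform
```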
  • the speech characteristic change information determination module 430 may obtain speech characteristic change information from the memory 332 .
  • The vocalization characteristic change information may be determined based on information obtained through a user input at a user terminal (e.g., the user terminal 210 of FIG. 2).
  • The vocalization characteristic change information may include information on a vocalization characteristic to be changed in order to generate a new speaker, that is, a new voice.
  • The vocalization characteristic change information may include information (e.g., reflection ratio information) related to the speaker characteristic of the reference speaker.
  • Hereinafter, specific examples are described in which the vocalization characteristic change information is determined by the speaker characteristic determination module 410 and the vocalization characteristic change information determination module 430, and the speaker characteristic of a new speaker is determined using the determined vocalization characteristic change information and the speaker characteristic of the reference speaker.
  • According to an embodiment, the speaker characteristic determination module 410 may generate a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the vocalization characteristic change information into the trained artificial neural network speaker characteristic change generation model, and may output the speaker characteristic of a new speaker by synthesizing the speaker characteristic of the reference speaker and the generated speaker characteristic change.
  • When training the artificial neural network speaker characteristic change generation model, the vocalization characteristic information included in a speaker's speaker characteristic may not be used directly as an input; instead, individual vocalization characteristic information may be obtained for each speaker.
  • information on the vocalization characteristic of a given speaker may be obtained through tagging by a person.
  • the speech feature information of a given speaker may be obtained through an artificial neural network speech feature extraction model trained to infer the speech feature of the speaker from the speaker feature of the given speaker.
  • The obtained vocalization characteristic information of the speakers may be stored in a storage medium. That is, the speaker characteristic of the reference speaker can be adjusted according to a change in vocalization characteristics by using the artificial neural network speaker characteristic change generation model.
  • This artificial neural network speaker characteristic change generation model can be trained using Equation 1 below, where $r_i$ and $r_j$ denote the speaker characteristics of training speakers $i$ and $j$, $s_i$ and $s_j$ denote their vocalization characteristics, and $M$ denotes the speaker characteristic change generation model (the form of the equation is reconstructed from the surrounding description):

    Equation 1:  $\Delta\hat{r} = M(r_i,\; s_j - s_i), \qquad \mathcal{L} = \lVert (r_i + \Delta\hat{r}) - r_j \rVert^2$

  • That is, the vocalization characteristic change information determination module 430 may obtain $r_i$, $r_j$, $s_i$, and $s_j$ from the storage medium and use them to train the artificial neural network speaker characteristic change generation model, where the model is trained based on the loss given by the difference between $r_i + \Delta\hat{r}$ and $r_j$.
  • At inference time, the vocalization characteristic change information determination module 430 may determine the vocalization characteristic change information $\Delta s$ as the difference between the target vocalization characteristic and the vocalization characteristic of the reference speaker, and this may be input to the trained artificial neural network speaker characteristic change generation model together with the speaker characteristic of the reference speaker.
  • The speaker characteristic determination module 410 may then determine the speaker characteristic of the new speaker based on the determined vocalization characteristic change information $\Delta s$ and the speaker characteristic $r$ of the reference speaker. The speaker characteristic of the new speaker can be expressed as Equation 2 below (reconstructed form):

    Equation 2:  $r_{\text{new}} = r + M(r,\; \Delta s)$
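  • A minimal training and inference sketch consistent with Equations 1 and 2 as reconstructed above; the network architecture, loss, dimensions, and names are assumptions made for illustration, not the model of the disclosure.

```python
import torch
import torch.nn as nn

class SpeakerChangeGenerator(nn.Module):
    """Illustrative speaker characteristic change generation model: maps a
    reference speaker vector and a vocalization characteristic change to a
    speaker characteristic change (delta_r)."""
    def __init__(self, speaker_dim: int = 256, vocal_dim: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speaker_dim + vocal_dim, 512),
            nn.ReLU(),
            nn.Linear(512, speaker_dim),
        )

    def forward(self, ref_speaker: torch.Tensor, vocal_change: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([ref_speaker, vocal_change], dim=-1))

def training_step(model, optimizer, r_i, s_i, r_j, s_j):
    """One step of a loss consistent with Equation 1 as reconstructed above:
    the reference speaker vector plus the generated change should approximate
    the speaker vector whose vocalization features were targeted."""
    delta_r = model(r_i, s_j - s_i)
    loss = torch.mean((r_i + delta_r - r_j) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for stored speaker/vocalization data.
model = SpeakerChangeGenerator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
r_i, r_j = torch.randn(8, 256), torch.randn(8, 256)
s_i, s_j = torch.randn(8, 5), torch.randn(8, 5)
training_step(model, opt, r_i, s_i, r_j, s_j)

# Inference (Equation 2, reconstructed form): r_new = r_ref + model(r_ref, target_vocal_change)
```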
  • the vocalization characteristic change information determining module 430 may extract a normal vector for the target vocalization characteristic by using a vocalization feature classification model corresponding to the target vocalization characteristic.
  • a speech feature classification model corresponding to each of the plurality of speech features may be generated.
  • the vocal feature classification model is a hyperplane-based model, and may be implemented using, for example, a support vector machine (SVM), a linear classifier, or the like, but is not limited thereto.
  • the target vocalization characteristic may refer to a vocalization characteristic selected from among a plurality of vocalization features, which will be changed and reflected in the speaker characteristic of a new speaker.
  • the speaker's characteristic may be expressed as a speaker vector.
  • In this case as well, the vocalization characteristic information included in a speaker's speaker characteristic may not be used directly as an input; instead, individual vocalization characteristic information may be obtained for each speaker.
  • voice characteristic information of a given speaker may be obtained through tagging by a person.
  • the speech feature information of a given speaker may be obtained through an artificial neural network speech feature extraction model trained to infer the speech feature of the speaker from the speaker feature of the given speaker.
  • Here, $s_i$ denotes the i-th vocalization characteristic, $w_i$ denotes the normal vector of the hyperplane that classifies the i-th vocalization characteristic, $r$ denotes a speaker vector, and $b$ denotes a bias; the classification model can thus be written in a form such as $s_i = w_i^{\top} r + b$.
  • According to an embodiment, in order to generate the synthesized voice of the new speaker, the speaker characteristic determination module 410 may obtain, through the trained artificial neural network speaker feature extraction model, the speaker feature vector of the reference speaker that is most similar to the intended new speaker.
  • The vocalization characteristic change information determination module 430 may obtain, as the vocalization characteristic change information, the normal vector $w_i$ of the target vocalization characteristic from the trained vocalization characteristic classification model, together with information $\alpha$ indicating the degree to which the vocalization characteristic is to be adjusted. Using the speaker feature vector $r$ of the reference speaker obtained as described above, the normal vector of the target vocalization characteristic, and the degree of adjustment, the speaker characteristic $r_{\text{new}}$ of the new speaker can be generated according to Equation 4 below (reconstructed form):

    Equation 4:  $r_{\text{new}} = r + \alpha\, w_i$

  • Here, $\alpha$ applied to the normal vector of the target vocalization characteristic may represent the degree to which the vocalization characteristic is controlled.
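  • A minimal sketch of the hyperplane-based approach, assuming a linear SVM (scikit-learn's LinearSVC) is used as the vocalization characteristic classification model; the adjustment follows the Equation-4 form reconstructed above, and all names and data are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def vocal_feature_normal_vector(speaker_vectors: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Fit a hyperplane that classifies the target vocalization characteristic
    (e.g. low vs. high pitch) over speaker vectors and return its unit normal."""
    clf = LinearSVC().fit(speaker_vectors, labels)
    w = clf.coef_[0]
    return w / np.linalg.norm(w)

def adjust_speaker_vector(r_ref: np.ndarray, normal: np.ndarray, alpha: float) -> np.ndarray:
    """Equation-4 style adjustment (reconstructed form): move the reference
    speaker vector along the normal of the target vocalization characteristic
    by the degree alpha."""
    return r_ref + alpha * normal

# Toy usage: 200 speakers, 256-dim vectors, binary label for the target characteristic.
rng = np.random.default_rng(0)
R = rng.normal(size=(200, 256))
y = (R[:, 0] > 0).astype(int)                # placeholder label correlated with one direction
w = vocal_feature_normal_vector(R, y)
r_new = adjust_speaker_vector(R[0], w, alpha=1.5)
```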
  • According to an embodiment, the speaker characteristic determination module 410 may acquire a plurality of speaker characteristics corresponding to a plurality of reference speakers. Also, the vocalization characteristic change information determination module 430 may obtain a weight set corresponding to the plurality of speaker characteristics and provide the obtained weight set to the speaker characteristic determination module 410. The speaker characteristic determination module 410 may determine the speaker characteristic of a new speaker, as shown in Equation 5 below (reconstructed form), by applying the weight included in the obtained weight set to each of the plurality of speaker characteristics. That is, the voices of several speakers may be mixed to generate a new speaker's voice.

    Equation 5:  $r_{\text{new}} = \sum_i \alpha_i\, r_i$

  • Here, $r_i$ denotes the speaker vector of speaker $i$, and $\alpha_i$ denotes the weight for speaker $i$.
  • In this way, the feature vectors of multiple speakers can be mixed into the feature vector of a new speaker.
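  • A minimal sketch of the Equation-5 style mixing; whether the weights must sum to one is an assumption made here for illustration.

```python
import numpy as np

def mix_speaker_vectors(speaker_vectors: np.ndarray, weights) -> np.ndarray:
    """Equation-5 style mixing (reconstructed form): the new speaker vector is a
    weighted sum of the reference speakers' vectors, r_new = sum_i alpha_i * r_i."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # assumption: normalize weights to sum to 1
    return weights @ np.asarray(speaker_vectors)

# Usage: mix three reference speakers, weighting the first most heavily.
r = np.random.default_rng(1).normal(size=(3, 256))
r_new = mix_speaker_vectors(r, [0.6, 0.3, 0.1])
```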
  • According to an embodiment, the speaker characteristic determination module 410 may generate a new speaker feature vector by adjusting pre-computed vocalization feature axes.
  • a speaker feature includes one or more vocal features.
  • the vocalization characteristic change information determination module 430 may find the vocalization characteristic axis and adjust the vocalization characteristic axis.
  • The adjusted vocalization feature axis may be provided to the speaker characteristic determination module 410 and used to determine the speaker characteristic of a new speaker. That is, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using the speaker characteristic $r$ of the reference speaker, the vocalization feature axes $a_j$, and the weights $\beta_j$ of the vocalization characteristic change information, as shown in Equation 6 below (reconstructed form):

    Equation 6:  $r_{\text{new}} = r + \sum_j \beta_j\, a_j$

  • Here, $a_j$ denotes the j-th vocalization feature axis, and $\beta_j$ denotes the weight for the j-th vocalization feature.
  • $a_j$ may refer to an axis in the vocalization feature space that distinguishes an individual vocalization characteristic, and may have the same dimension as the speaker representation $r$.
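  • A minimal sketch of the Equation-6 style adjustment along precomputed vocalization feature axes; the axes themselves are assumed to be given (e.g., from the PCA or label-difference procedures described below), and all names are illustrative.

```python
import numpy as np

def adjust_along_axes(r_ref: np.ndarray, axes: np.ndarray, betas) -> np.ndarray:
    """Equation-6 style adjustment (reconstructed form): r_new = r_ref + sum_j beta_j * a_j,
    where each axis a_j has the same dimension as the speaker vector r_ref."""
    return r_ref + np.asarray(betas) @ np.asarray(axes)

# Usage: nudge a reference speaker along two precomputed axes (e.g. pitch and tempo).
r_ref = np.zeros(256)
axes = np.random.default_rng(2).normal(size=(2, 256))
r_new = adjust_along_axes(r_ref, axes, betas=[0.8, -0.4])
```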
  • the speech characteristic change information determining module 430 may normalize each of the speaker vectors of the plurality of speakers.
  • the speaker vectors of the plurality of speakers may be included in the speaker characteristics of the plurality of speakers.
  • For example, the vocalization characteristic change information determination module 430 may perform Z-normalization, in which the mean is subtracted from all data and the result is divided by the standard deviation, or mean normalization, in which only the mean is subtracted from all data.
  • Here, $N(\cdot)$ denotes the normalization function and $D(\cdot)$ denotes the corresponding inverse normalization function, so that the normalized set of speaker vectors can be written as $N(R)$, where $R$ denotes the speaker vectors of the plurality of speakers.
  • the speech characteristic change information determination module 430 may determine the plurality of main components by performing dimensionality reduction analysis on the speaker vectors of the plurality of normalized speakers.
  • the dimensionality reduction analysis may be performed through a conventionally known dimension reduction technique, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), or Stochastic Neighbor Embedding (t-SNE).
  • the speech characteristic change information determination module 430 may determine a plurality of main components P in Equation 8 below by performing PCA on N(R).
  • In Equation 8, each component refers to the k-th principal component, and the number of principal components may be at most the number of dimensions of the speaker representation r.
  • The vocalization characteristic change information determination module 430 may select at least one principal component from among the plurality of determined principal components. For example, principal components associated with the vocalization characteristics desired to be altered in the speaker characteristics of the new speaker may be selected.
  • The j-th vocalization characteristic axis may be obtained by applying the inverse normalization function D to the selected principal component.
  • The j-th vocalization characteristic axis determined in this way and a weight corresponding thereto are provided to the speaker characteristic determination module 410, so that the speaker characteristic of a new speaker can be generated through Equation 6 above.
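Putting the normalization, the PCA of Equation 8, and the inverse mapping together, a hedged sketch might look as follows. The exact inverse mapping of Equation 9 is not shown in this text, so scaling a principal component back to the original space (D(p_k) - D(0) = p_k * std) is only one plausible reading:

```python
import numpy as np

def vocal_feature_axes_via_pca(R, num_components=8):
    """Normalize speaker vectors, run PCA on N(R), and map selected principal
    components back through the inverse normalization D to obtain candidate
    vocalization-feature axes.

    R : array of shape (num_speakers, dim) holding the speaker vectors.
    """
    mean, std = R.mean(axis=0), R.std(axis=0) + 1e-8
    Z = (R - mean) / std                       # N(R): Z-normalized speaker vectors
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:num_components]                    # rows are principal components p_k
    axes = [p_k * std for p_k in P]            # assumed inverse mapping to the original space
    return P, axes

P, axes = vocal_feature_axes_via_pca(np.random.randn(100, 256))
```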
  • Instead of using the vocalization characteristic axes directly in Equation 6, the vocalization characteristic change information determination module 430 may use axes obtained through Equation 10, whereby interference between the vocalization characteristic axes can be removed.
  • The resulting axis may refer to a vocalization characteristic axis along which only the intended vocalization characteristics are changed.
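Equation 10 is not reproduced here; one common way to remove interference between direction vectors is Gram-Schmidt orthogonalization, so the following sketch is offered only as a plausible stand-in, not as the patent's actual formula:

```python
import numpy as np

def orthogonalize_axes(axes):
    """Gram-Schmidt style orthogonalization of vocalization-feature axes so that
    adjusting one axis interferes as little as possible with the others."""
    ortho = []
    for e in axes:
        v = np.asarray(e, dtype=float).copy()
        for u in ortho:
            v -= (v @ u) * u                 # remove the component along u
        norm = np.linalg.norm(v)
        if norm > 1e-8:                      # skip axes that are (nearly) redundant
            ortho.append(v / norm)
    return ortho
```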
  • the speech characteristic change information determination module 430 may obtain speaker vectors of a plurality of speakers having different target speech characteristics.
  • the speaker vectors of the plurality of learning speakers may be included in the speaker characteristics of the plurality of learning speakers.
  • each of the plurality of speakers is assigned a label for one or more vocal features.
  • a vocal feature label may be assigned to each of a plurality of speakers as shown in FIG.
  • The vocalization characteristics may include tone, vocal strength, vocal speed, gender, and age. Tone, vocal strength, and vocal speed may each be expressed as a discrete level (e.g., low, medium, or high), and each such level may be an element of the label l.
  • Gender may be expressed as male or female, and age may be expressed as a numerical or categorical value. For example, a label indicating that the tone is low, the vocal strength is medium, and the vocal speed is high may represent the vocalization characteristics of a 50-year-old male.
  • The speech characteristic change information determination module 430 may determine a vocalization characteristic based on the difference between the speaker vectors of a plurality of speakers having different target vocalization characteristics, as shown in Equation 11.
  • the vocal features may be included in the speech characteristic change information.
  • This speech characteristic change information is provided to the speaker characteristic determination module 410 so that the speaker characteristic of a new speaker can be determined using Equation 6 above.
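A minimal sketch of the pairwise-difference approach of Equation 11 (the equation itself is not reproduced, so the simple difference form is an assumption):

```python
import numpy as np

def axis_from_speaker_pair(r_a, r_b):
    """Difference between two speaker vectors whose labels differ only in the
    target vocalization feature (e.g., high vs. low vocal strength); the
    difference vector is used as the vocalization-feature axis."""
    return np.asarray(r_a, dtype=float) - np.asarray(r_b, dtype=float)

# averaging the difference over several such pairs can reduce speaker-specific noise
axis = axis_from_speaker_pair(np.random.randn(256), np.random.randn(256))
```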
  • the speech characteristic change information determining module 430 may determine the speech characteristic change information based on a difference between the averages of the speaker vectors of a plurality of speaker groups.
  • the speaker features of the plurality of speakers include speaker vectors of the plurality of speakers, and each of the speaker features of the plurality of speakers is assigned a label for one or more vocalization features.
  • the speech characteristic change information determination module 430 may obtain speaker vectors of speakers included in each of a plurality of speaker groups having different target speech characteristics.
  • the group of the plurality of learning speakers may include a first speaker group and a second speaker group.
  • The speech characteristic change information determination module 430 may calculate the average of the speaker vectors of the speakers included in the first speaker group, and the average of the speaker vectors of the speakers included in the second speaker group. As shown in Equation 12, a vocalization characteristic may be determined based on the difference between the average of the speaker vectors corresponding to the first speaker group and the average of the speaker vectors corresponding to the second speaker group. The determined vocalization characteristic may be included in the speech characteristic change information.
  • this speech characteristic change information is provided to the speaker characteristic determination module 410 so that the speaker characteristic of a new speaker can be determined using Equation 6 above.
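A corresponding sketch of the group-mean difference of Equation 12, under the same caveat that the exact formula is not reproduced in this text:

```python
import numpy as np

def axis_from_group_means(group_a, group_b):
    """Estimate a vocalization-feature axis from two groups of speaker vectors
    whose labels differ in the target vocalization feature.

    group_a, group_b : arrays of shape (num_speakers, dim)
    """
    mean_a = np.asarray(group_a, dtype=float).mean(axis=0)
    mean_b = np.asarray(group_b, dtype=float).mean(axis=0)
    return mean_a - mean_b                 # difference of group means as the axis

axis = axis_from_group_means(np.random.randn(20, 256), np.random.randn(30, 256))
```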
  • The speech characteristic change information determination module 430 may input the speaker characteristics of a plurality of speakers into an artificial neural network vocalization characteristic prediction model, as in Equation 13 below, and output the vocalization characteristics of each of the plurality of speakers.
  • Among the speaker characteristics of the plurality of speakers, the speech characteristic change information determination module 430 may select or determine a speaker characteristic for which the output j-th vocalization characteristic differs from the j-th vocalization characteristic of the reference speaker. The selected speaker characteristic may be provided to the speaker characteristic determination module 410.
  • the speaker characteristic determining module 410 may obtain a weight corresponding to the speaker characteristic of the selected speaker. Then, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker by using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and a weight corresponding to the speaker characteristic of the selected speaker. For example, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using Equation 14 below.
  • In Equation 14, the speaker characteristic of the new speaker is determined from the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and a weight corresponding to the speaker characteristic of the selected speaker.
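The selection-and-blending step of Equations 13 and 14 might be sketched as follows. The vocalization-feature prediction model is represented by a placeholder callable, picking the candidate with the largest difference is a selection heuristic, and the interpolation form used for Equation 14 is an assumption, since neither equation is reproduced in this text:

```python
import numpy as np

def select_and_blend(r_ref, c_ref, candidate_vectors, predict_features,
                     target_idx, weight):
    """Pick a speaker whose predicted target vocalization feature differs from the
    reference speaker's, then blend it into the reference speaker characteristic.

    predict_features : callable standing in for the neural vocalization-feature
                       prediction model; maps a speaker vector to a feature vector
    target_idx       : index of the target vocalization feature
    weight           : weight for the selected speaker characteristic
    """
    diffs = [abs(predict_features(r)[target_idx] - c_ref[target_idx])
             for r in candidate_vectors]
    r_sel = candidate_vectors[int(np.argmax(diffs))]   # largest difference in the target feature
    # one plausible reading of Equation 14: interpolate toward the selected speaker
    return r_ref + weight * (r_sel - r_ref)
```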
  • The output voice verification module 440 may determine whether the output voice associated with the speaker characteristic of the new speaker is a new output voice that is not previously stored. According to an embodiment, the output voice verification module 440 may calculate a hash value corresponding to a speaker feature (e.g., a speaker feature vector) of the new speaker by using a hash function. In another embodiment, the output voice verification module 440 may, instead of calculating a hash value from the determined speaker feature of the new speaker, extract the speaker feature of the new speaker from the new output voice and calculate a hash value using the extracted speaker feature.
  • the output voice verification module 440 may determine whether there is content associated with a hash value similar to the calculated hash value among the plurality of speaker contents stored in the storage medium. When there is no content associated with a hash value similar to the calculated hash value, the output voice verification module 440 may determine that the output voice associated with the speaker characteristic of the new speaker is the new output voice. When it is determined as the new output voice, the synthesized voice reflecting the speaker characteristics of the new speaker may be set to be used.
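A hedged sketch of such a uniqueness check follows. The patent does not specify the hash function or what counts as a "similar" hash value, so quantizing the vector before hashing (so that nearly identical vectors collide) and the use of SHA-256 are assumptions:

```python
import hashlib
import numpy as np

def speaker_hash(r_new: np.ndarray, decimals: int = 3) -> str:
    """Hash a new speaker feature vector after coarse quantization, so that
    nearly identical vectors map to the same hash value."""
    quantized = np.round(r_new, decimals).astype(np.float32)
    return hashlib.sha256(quantized.tobytes()).hexdigest()

def is_new_voice(r_new: np.ndarray, stored_hashes: set) -> bool:
    """True if no previously stored content is associated with the same hash."""
    return speaker_hash(r_new) not in stored_hashes

# usage
stored = {speaker_hash(np.ones(256))}
print(is_new_voice(np.zeros(256), stored))   # True: no matching content stored
```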
  • The method 500 for generating an output voice reflecting the speaker characteristics of a new speaker may be performed by a processor (e.g., the processor 314 of the user terminal 210 and/or the processor 334 of the synthesized voice generating system 230). As shown, the method 500 may be initiated by the processor receiving the target text (S510).
  • the processor may acquire a speaker characteristic of the reference speaker corresponding to the reference speaker ( S520 ).
  • the speaker characteristic of the reference speaker may include a speaker vector. Additionally or alternatively, the speaker characteristics of the reference speaker may include vocalization characteristics of the reference speaker.
  • the speaker characteristics of the reference speaker may include a plurality of speaker characteristics corresponding to the plurality of reference speakers.
  • the plurality of speaker features may include a plurality of speaker vectors.
  • the processor may acquire vocal feature change information ( S530 ).
  • the processor may acquire speaker characteristics of the plurality of speakers.
  • the speaker characteristics of the plurality of speakers may include a plurality of speaker vectors.
  • The processor may determine the plurality of principal components by performing normalization on each of the speaker vectors of the plurality of speakers and performing dimensionality reduction analysis on the normalized speaker vectors. At least one principal component from among the plurality of principal components thus determined may be selected. Then, the processor may determine the speech characteristic change information using the selected principal component.
  • the processor may obtain speaker vectors of a plurality of speakers having different target vocalization characteristics, and determine the speech characteristic change information based on a difference between the obtained speaker vectors of the plurality of speakers.
  • the processor may obtain a speaker vector of speakers included in each of a plurality of speaker groups having different target vocalization characteristics.
  • the plurality of speaker groups may include a first speaker group and a second speaker group. Then, the processor may calculate an average of speaker vectors of speakers included in the first speaker group, and calculate an average of speaker vectors of speakers included in the second speaker group.
  • the processor may determine the speech characteristic change information based on a difference between an average of speaker vectors corresponding to the first speaker group and an average of speaker vectors corresponding to the second speaker group.
  • The processor may input the speaker characteristics of the plurality of speakers into the artificial neural network vocalization characteristic prediction model, and output the vocalization characteristics of each of the plurality of speakers. Then, the processor may select, from among the speaker characteristics of the plurality of speakers, a speaker characteristic for which a difference exists between the target vocalization characteristic among the output vocalization characteristics and the target vocalization characteristic among the plurality of vocalization characteristics of the reference speaker, and may obtain a weight corresponding to the selected speaker characteristic.
  • the speaker characteristic of the selected speaker and the weight corresponding to the speaker characteristic of the selected speaker may be obtained as speech characteristic change information.
  • the processor may extract a normal vector for the target speech feature using a speech feature classification model corresponding to the target speech feature.
  • the normal vector may refer to a normal vector of a hyperplane that classifies the target speech feature, and information indicating the degree of adjusting the target speech feature may be obtained.
  • the extracted normal vector and information indicating the degree to which the target speech feature is adjusted may be obtained as speech feature change information.
  • the processor may determine the speaker characteristics of the new speaker by using the acquired speaker characteristics of the reference speaker and the acquired speech characteristic change information ( S540 ).
  • The processor may generate a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired speech characteristic change information into the artificial neural network speaker characteristic change generation model, and may output the speaker characteristic of the new speaker by synthesizing the speaker characteristic of the reference speaker and the generated speaker characteristic change.
  • the artificial neural network speaker characteristic change generation model may be learned by using the speaker characteristics of the plurality of learned speakers and the plurality of speech characteristics included in the speaker characteristics of the plurality of learned speakers.
  • The processor may determine the speaker characteristic of the new speaker by applying a weight included in the obtained weight set to each of the plurality of speaker characteristics. In another embodiment, the processor may determine the speaker characteristic of the new speaker by using the speaker characteristic of the reference speaker, the speech characteristic change information, and the weight of the speech characteristic change information. According to another embodiment, the processor may determine the speaker characteristic of the new speaker by using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and a weight corresponding to the speaker characteristic of the selected speaker. According to still another embodiment, the processor may determine the speaker characteristic of the new speaker based on the speaker vector of the reference speaker, the extracted normal vector, and the degree to which the target vocalization characteristic is adjusted.
  • the processor may input the target text and the determined speaker characteristics of the new speaker to the artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected (S550).
  • The artificial neural network text-to-speech synthesis model may include a model learned to output voices for a plurality of training text items, in which the speaker characteristics of the plurality of learning speakers are reflected, based on the plurality of training text items and the speaker characteristics of the plurality of learning speakers.
  • the processor may calculate a hash value corresponding to the speaker feature vector using a hash function.
  • the speaker feature vector may be included in the speaker feature of the new speaker. Then, the processor may determine whether there is content associated with a hash value similar to the calculated hash value among the plurality of speaker contents stored in the storage medium. If there is no content associated with the hash value similar to the calculated hash value, the processor may determine that the output voice associated with the speaker characteristic of the new speaker is the new output voice.
  • a speech synthesizer learned using learning data including the synthesized voice of a new speaker generated according to the above-described method for generating a synthesized voice of a new speaker may be provided.
  • the voice synthesizer may be any voice synthesizer that can be learned using learning data including the synthesized voice of a new speaker generated according to the above-described method for generating a synthesized voice of a new speaker.
  • the speech synthesizer may include any text-to-speech synthesis (TTS) model trained using this training data.
  • the TTS model may be implemented as a machine learning model or an artificial neural network model known in the art.
  • Since the speech synthesizer has learned the synthesized voice of the new speaker as training data, when the target text is input, the target text may be output as the synthesized voice of the new speaker. According to an embodiment, such a voice synthesizer may be included or implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2.
  • An apparatus for providing a synthesized voice may be provided, including a memory configured to store a synthesized voice of a new speaker generated according to the method for generating a synthesized voice of a new speaker as described above, and at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory, the at least one program including instructions for outputting at least a part of the synthesized voice of the new speaker stored in the memory.
  • the device for providing the synthesized voice may refer to any device that stores the synthesized voice of a new speaker that has been generated in advance and provides at least a part of the stored synthesized voice.
  • the apparatus for providing such a synthesized voice may be implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2 .
  • the apparatus for providing the synthesized voice is not limited thereto, but may be implemented as a video system, an ARS system, a game system, a sound pen, or the like.
  • When a device for providing such a synthesized voice is implemented in the information processing system 230, at least a part of the output synthesized voice of the new speaker may be transmitted to a user terminal connected to the information processing system 230 by wire or wirelessly.
  • the information processing system 230 may provide at least a part of the output synthesized voice of the new speaker in a streaming manner.
  • a method for providing a synthesized voice of a new speaker comprising the steps of: storing the synthesized voice of the new speaker generated according to the above-described method; and providing at least a part of the stored synthesized voice of the new speaker.
  • This method may be executed by the processor of the user terminal 210 and/or the processor of the information processing system 230 of FIG. 2 .
  • This method may be provided for a service providing a synthesized voice of a new speaker.
  • a service may be implemented as a video system, an ARS system, a game system, a sound pen, etc., but is not limited thereto.
  • the artificial neural network text-to-speech synthesis model may include an encoder 610 , an attention 620 , and a decoder 630 .
  • the encoder 610 may receive the target text 640 as an input.
  • the encoder 610 may be configured to generate pronunciation information for the input target text 640 (eg, phoneme information for the target text, a vector for each of a plurality of phonemes included in the target text, etc.).
  • The encoder 610 may convert the received target text 640 into character embeddings.
  • the generated character embeddings may be passed to a pre-net including a fully-connected layer.
  • the encoder 610 may provide the output from the pre-net to the CBHG module to output encoder hidden states.
  • the CBHG module may include a 1D convolution bank, max pooling, a highway network, and a bidirectional gated recurrent unit (GRU).
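For orientation, a simplified encoder sketch in the spirit of the description above is given below (PyTorch). The full CBHG module (convolution bank, max pooling, highway network) is replaced by a single bidirectional GRU for brevity, and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class SimpleTacotronEncoder(nn.Module):
    """Simplified encoder sketch: character embedding -> pre-net (fully connected
    layers with dropout) -> bidirectional GRU producing encoder hidden states."""

    def __init__(self, vocab_size=80, embed_dim=256, prenet_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.prenet = nn.Sequential(
            nn.Linear(embed_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        self.gru = nn.GRU(prenet_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):               # char_ids: (batch, text_len)
        x = self.embedding(char_ids)            # character embeddings
        x = self.prenet(x)                      # pre-net
        hidden_states, _ = self.gru(x)          # (batch, text_len, 2 * hidden_dim)
        return hidden_states

# usage: two texts of 40 characters each
enc = SimpleTacotronEncoder()
states = enc(torch.randint(0, 80, (2, 40)))
```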
  • the pronunciation information generated by the encoder 610 may be provided to the attention 620 , and the attention 620 may connect or combine the provided pronunciation information with voice data corresponding to the pronunciation information.
  • attention 620 may be configured to determine from which portion of the input text to generate speech.
  • the pronunciation information connected in this way and voice data corresponding to the pronunciation information may be provided to the decoder 630 .
  • the decoder 630 may be configured to generate the voice data 660 corresponding to the target text 640 based on the connected pronunciation information and the voice data corresponding to the pronunciation information.
  • The decoder 630 may receive the speaker characteristic 658 of the new speaker and generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected.
  • The speaker characteristic 658 of the new speaker may be generated through the vocalization characteristic change module 656.
  • the vocalization characteristic change module 656 may be implemented through the algorithm and/or artificial neural network model described in FIG. 4 .
  • The artificial neural network speaker feature extraction model 650 may obtain the speaker feature (r) of the reference speaker based on the reference speaker's identification information and the vocalization feature (C) 654.
  • the vocalization feature C 654 and the speaker feature r of the speaker may be expressed in a vector form.
  • The artificial neural network speaker feature extraction model 650 may be trained to receive a plurality of learning speaker ids and a plurality of learning vocalization features (e.g., vectors) to extract a ground-truth speaker vector of a reference speaker.
  • The vocalization characteristic change information may be determined through the vocalization characteristic change module 656, and further, the speaker characteristic 658 of the new speaker can be determined.
  • the input information (d) 655 associated with the speech characteristic change information may include any information desired to be reflected or changed in a new speaker.
  • The decoder 630 may include a pre-net composed of fully-connected layers, an attention recurrent neural network (RNN) including a gated recurrent unit (GRU), and a decoder RNN including a residual GRU.
  • the voice data 660 output from the decoder 630 may be expressed as a mel-scale spectrogram.
  • the output of the decoder 630 may be provided to a post-processing processor (not shown).
  • the CBHG of the post-processing processor may be configured to convert the mel-scale spectrogram of the decoder 630 into a linear-scale spectrogram.
  • the output signal of the CBHG of the post-processing processor may include a magnitude spectrogram.
  • the phase of the output signal of the CBHG of the post-processing processor may be restored through a Griffin-Lim algorithm and subjected to inverse short-time Fourier transform.
  • the post-processing processor may output a voice signal in a time domain.
  • the post-processing processor may be implemented using a GAN-based vocoder.
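A sketch of the Griffin-Lim post-processing path described above, using librosa as an assumed implementation (sample rate, FFT size, hop length, and iteration count are placeholders; a GAN-based vocoder would replace this step entirely):

```python
import numpy as np
import librosa

def mel_to_waveform(mel_power, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
    """Post-processing sketch: convert a mel-scale spectrogram to a linear-scale
    magnitude spectrogram, then restore phase with Griffin-Lim and apply the
    inverse short-time Fourier transform to obtain a time-domain voice signal.

    mel_power : (n_mels, frames) mel power spectrogram produced by the decoder
    """
    linear_mag = librosa.feature.inverse.mel_to_stft(
        mel_power, sr=sr, n_fft=n_fft, power=2.0)
    # Griffin-Lim iteratively estimates the phase and applies the inverse STFT
    return librosa.griffinlim(linear_mag, n_iter=n_iter,
                              hop_length=hop_length, win_length=n_fft)
```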
  • The processor may use a database including training text items, speaker characteristics of a plurality of learning speakers, and training voice data items corresponding to the training text items in which the speaker characteristics are reflected.
  • the processor may learn the artificial neural network text-to-speech synthesis model to output a synthesized voice reflecting the speaker characteristics of the learning speaker based on the training text item, the speaker characteristics of the training speaker, and the training voice data item corresponding to the training text item.
  • the processor may generate an output voice for the target text in which the speaker characteristics of the new speaker are reflected through the artificial neural network text-to-speech synthesis model created/learned in this way.
  • The processor may input the target text 640 and the speaker characteristic 658 of the new speaker into the model, and a synthesized voice may be generated based on the output voice data 660.
  • The synthesized voice generated in this way may reflect the speaker characteristic 658 of the new speaker and may include a voice uttering the target text 640.
  • the decoder 630 may include the attention 620 .
  • In the illustrated example, the speaker characteristic 658 of the new speaker is input to the decoder 630, but the present disclosure is not limited thereto; the speaker characteristic 658 of the new speaker may be input to the encoder 610 and/or the attention 620.
  • FIG. 7 is a diagram illustrating an example of generating an output voice in which a speaker characteristic of a new speaker is reflected, according to another embodiment of the present disclosure.
  • the encoder 710 , the attention 720 , and the decoder 730 illustrated in FIG. 7 may perform functions similar to those of the encoder 610 , the attention 620 and the decoder 630 illustrated in FIG. 6 , respectively. Accordingly, the description overlapping with FIG. 6 will be omitted.
  • the encoder 710 may receive the target text 740 as input.
  • the encoder 710 is configured to generate pronunciation information for the input target text 740 (eg, a plurality of phoneme information included in the target text, a vector for each of a plurality of phonemes included in the target text, etc.).
  • the pronunciation information generated by the encoder 710 may be provided to the attention 720 , and the attention 720 may connect the pronunciation information and voice data corresponding to the pronunciation information.
  • the pronunciation information connected as described above and voice data corresponding to the pronunciation information may be provided to the decoder 730 .
  • the decoder 730 may be configured to generate the voice data 760 corresponding to the target text 740 based on the connected pronunciation information and the voice data corresponding to the pronunciation information.
  • The decoder 730 may receive the speaker characteristic 758 of the new speaker and generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected.
  • The speaker characteristic 758 of the new speaker may be generated through the vocalization characteristic change module 756.
  • the vocal feature change module 756 may be implemented through the algorithm and/or artificial neural network model described in FIG. 4 .
  • The artificial neural network speaker feature extraction model 750 may output speaker identification information (i) 753 based on the voice 752 recorded by the speaker and the speech feature set (C) 754, and may also obtain the speaker characteristic (r) of the reference speaker.
  • the speech feature set may include one or more speech features c.
  • the speech feature set (C) 754 and the speaker feature (r) of the speaker may be expressed in a vector form.
  • The artificial neural network speaker feature extraction model may be trained to receive voices recorded by a plurality of learning speakers and a plurality of learning vocalization features (e.g., vectors) to extract a ground-truth speaker vector of a reference speaker.
  • The vocalization characteristic change module 756 may determine the vocalization characteristic change information using the generated reference speaker characteristic (r) and the input information (d) 755 associated with the vocalization characteristic change information, and furthermore, the speaker characteristic 758 of the new speaker can be determined.
  • the input information (d) 755 associated with the speech characteristic change information may include any information desired to be reflected or changed in a new speaker.
  • The processor may use a database including pairs of a plurality of training text items and training voice data items corresponding to the training text items, in which the speaker characteristics of the learning speakers are reflected.
  • The processor may learn the artificial neural network text-to-speech synthesis model to output a synthesized voice in which the speaker characteristics of the learning speaker are reflected, based on the training text items, the speaker characteristics of the learning speaker, and the training voice data items corresponding to the training text items.
  • the processor may generate the output voice 760 in which the speaker characteristics of the new speaker are reflected through the artificial neural network text-to-speech synthesis model created/learned in this way.
  • The processor may input the target text 740 and the speaker characteristic 758 of the new speaker into the model, and a synthesized voice may be generated based on the output voice data 760.
  • The synthesized voice generated in this way may reflect the speaker characteristic 758 of the new speaker and may include a voice uttering the target text 740.
  • Although the attention 720 and the decoder 730 are illustrated as separate components in FIG. 7, the present disclosure is not limited thereto.
  • the decoder 730 may include an attention 720 .
  • In the illustrated example, the speaker characteristic 758 of the new speaker is input to the decoder 730, but the present disclosure is not limited thereto; the speaker characteristic 758 of the new speaker may be input to the encoder 710 and/or the attention 720.
  • In the illustrated examples, a target text is expressed as one input data item (e.g., a vector) and one output data item (e.g., a mel-scale spectrogram) is output through the artificial neural network text-to-speech synthesis model.
  • the present invention is not limited thereto, and may be configured to output any number of output data items by inputting an arbitrary number of input data items to the artificial neural network text-to-speech synthesis model.
  • the user terminal (eg, the user terminal 210 ) may output a synthesized voice reflecting the speaker characteristics of the new speaker through the user interface 800 .
  • the user interface 800 may include a text area 810 , a speech characteristic adjustment area 820 , a speaker characteristic adjustment area 830 , and an output voice display area 840 .
  • the processor may be the processor 314 of the user terminal 210 and/or the processor 334 of the information processing system 230 .
  • the processor may receive the target text through a user input using an input interface (eg, a keyboard, a mouse, a microphone, etc.), and display the received target text through the text area 810 .
  • the processor may receive a document file including text, extract text in the document file, and display the extracted text in the text area 810 .
  • the text displayed in the text area 810 in this way may be a target to be uttered through a synthesized voice.
  • One or more reference speakers may be selected in response to a user input for selecting one or more reference speakers from among the reference speakers displayed in the speaker characteristic adjustment area 830 . Then, the processor may receive a weight (eg, a reflection ratio) for the speaker characteristics of the selected one or more reference speakers as speech characteristic change information. For example, the processor may receive a weight for each of the speaker characteristics of one or more reference speakers in Equation 5 described with reference to FIG. 4 through an input in the speaker characteristic adjustment region 830 .
  • In the speaker characteristic adjustment area 830, six reference speakers, 'Eun-Byul Ko', 'Soo-Min Kim', 'Woo-Rim Lee', 'Do-Young Song', 'Seong-Soo Shin', and 'Jin-Kyung Shin', may be given. That is, the user selects one or more reference speakers from among the six reference speakers and adjusts a reflection ratio adjustment means (e.g., a bar) corresponding to each of the selected reference speakers, so that the ratio at which the speaker characteristics of the selected reference speaker are reflected in the speaker characteristics of the new speaker may be determined. Alternatively, one or more of the six reference speakers may be randomly selected.
  • the reflection ratios for each speaker may be received so that the sum of reflection ratios corresponding to the selected one or more reference speakers becomes 100. Alternatively, even if the reflection ratio corresponding to the one or more reference speakers selected in this way is greater than or less than 100, each reflection ratio may be automatically adjusted so that the sum of the ratios becomes 100.
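A trivial sketch of the automatic re-scaling of reflection ratios mentioned above (the function name and the behavior when all ratios are zero are assumptions):

```python
def normalize_ratios(ratios):
    """Rescale the reflection ratios entered for the selected reference speakers
    so that they sum to 100, as described for the speaker characteristic
    adjustment area."""
    total = sum(ratios.values())
    if total == 0:
        raise ValueError("at least one reflection ratio must be non-zero")
    return {speaker: 100.0 * value / total for speaker, value in ratios.items()}

# usage: the user entered ratios that sum to 120
print(normalize_ratios({"speaker_a": 60, "speaker_b": 40, "speaker_c": 20}))
# -> {'speaker_a': 50.0, 'speaker_b': 33.33..., 'speaker_c': 16.66...}
```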
  • In the illustrated example, six reference speakers are used to generate the speaker characteristics of a new speaker, but the present disclosure is not limited thereto; five or fewer reference speakers, or seven or more reference speakers, may be displayed in the speaker characteristic adjustment area 830 and used to generate the speaker characteristics of the new speaker.
  • the processor may receive a weight (eg, a reflection ratio) for each of the plurality of speech features as speech feature change information through the speech feature adjustment region 820 .
  • the processor may receive a weight for each of the plurality of speech features in Equation 6 described with reference to FIG. 4 through an input in the speech feature adjustment region 820 .
  • Here, r in Equation 6 may be the speaker characteristic of the reference speaker generated according to the selection and reflection ratios of one or more reference speakers in the speaker characteristic adjustment area 830; that is, r may be the result value of Equation 5 described in FIG. 4, obtained through the input in the speaker characteristic adjustment area 830. Similarly, r in Equation 13 may be the result value of Equation 5 described in FIG. 4, obtained through the input in the speaker characteristic adjustment area 830.
  • gender, vocal tone, vocal strength, male age, female age, pitch, and tempo may be given as quantitatively adjustable vocal characteristics in the vocalization characteristic adjustment area 820 .
  • The user may adjust a ratio adjusting means (e.g., a bar) corresponding to each vocalization characteristic to determine the degree to which that vocalization characteristic is reflected in the speaker characteristic of the new speaker. When a ratio adjusting means (e.g., a bar) is set to its minimum value, the corresponding vocalization characteristic is not reflected in the speaker characteristic of the new speaker at all.
  • In the illustrated example, seven vocalization characteristics are used to generate the speaker characteristics of a new speaker, but the present disclosure is not limited thereto; six or fewer vocalization characteristics, or additional vocalization characteristics, may be displayed in the vocalization characteristic adjustment area 820 and used to generate the speaker characteristics of the new speaker.
  • The processor may receive the speaker characteristics of the one or more reference speakers selected in the speaker characteristic adjustment area 830, and may generate a speaker characteristic of a new speaker by using speech characteristic change information including the weights input through the speaker characteristic adjustment area 830 and/or the vocalization characteristic adjustment area 820.
  • One of the methods described with reference to FIG. 4 may be used as a specific method for generating the speaker characteristic of a new speaker.
  • the processor may input the target text and the generated speaker characteristics of the new speaker to the artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected.
  • When the input in the text area 810, the vocalization characteristic adjustment area 820, and the speaker characteristic adjustment area 830 is completed and the 'Create' button located below the vocalization characteristic adjustment area 820 is selected or clicked, an output voice for the target text in which the speaker characteristics of the new speaker are reflected may be generated.
  • the output voice thus generated may be output through a speaker connected to the user terminal.
  • the reproduction time and/or position of the output voice may be displayed through the output voice display area 840 .
  • the artificial neural network model 900 is a statistical learning algorithm implemented based on the structure of a biological neural network or a structure for executing the algorithm in machine learning technology and cognitive science.
  • The artificial neural network model 900 may represent a machine learning model with problem-solving ability, in which artificial neurons form a network through synaptic connections, as in a biological neural network, and the synaptic weights are repeatedly adjusted so as to learn to reduce the error between a correct output corresponding to a particular input and the inferred output.
  • the artificial neural network model 900 may include arbitrary probabilistic models, neural network models, etc.
  • The artificial neural network model 900 may include the aforementioned artificial neural network text-to-speech synthesis model, the aforementioned artificial neural network speaker characteristic change generation model, the aforementioned artificial neural network vocalization characteristic prediction model, and/or the aforementioned artificial neural network speaker feature extraction model.
  • the artificial neural network model 900 may be implemented as a multilayer perceptron (MLP) composed of multiple layers of nodes and connections between them.
  • the artificial neural network model 900 according to the present embodiment may be implemented using one of various artificial neural network structures including MLP.
  • The artificial neural network model 900 may be composed of an input layer 920 that receives an input signal or data 910 from the outside, an output layer 940 that outputs an output signal or data 950 corresponding to the input data, and n hidden layers 930_1 to 930_n located between the input layer 920 and the output layer 940, which receive signals from the input layer 920, extract characteristics, and transfer them to the output layer 940.
  • the output layer 940 may receive a signal from the hidden layers 930_1 to 930_n and output the signal to the outside.
  • The learning method of the artificial neural network model 900 may include a supervised learning method, which learns to be optimized for solving a problem using a teacher signal (correct answer), and an unsupervised learning method, which does not require a teacher signal.
  • The processor may input text information and the speaker characteristics of a new speaker into the artificial neural network model 900, and the artificial neural network model 900 may be trained end-to-end to output voice data for the text in which the speaker characteristics of the new speaker are reflected. That is, when information about the text and information about the new speaker are input into the artificial neural network model 900, the intermediate process is learned by the model itself, and a synthesized voice can be output.
  • the processor may generate the synthesized speech by converting the text information and the speaker characteristics of the new speaker into embeddings (eg, embedding vectors) through the encoding layer of the neural network model 900 .
  • the input variable of the artificial neural network model 900 may be a vector 910 composed of vector data elements representing text information and new speaker information.
  • the text information may be represented by arbitrary embeddings representing text, for example, it may be represented by character embeddings, phoneme embeddings, and the like.
  • the speaker characteristics of the new speaker may be represented by any type of embedding representing the speaker's utterance.
  • the output variable may be composed of a result vector 950 representing the synthesized voice for the target text in which the speaker characteristics of the new speaker are reflected.
  • The input layer 920 and the output layer 940 of the artificial neural network model 900 may be matched with a plurality of input variables and a plurality of output variables corresponding to each other, and by adjusting the synapse values between the nodes included in the input layer 920, the hidden layers 930_1 to 930_n (where n is a natural number equal to or greater than 2), and the output layer 940, the artificial neural network model 900 can be trained to infer the correct output corresponding to a specific input.
  • correct answer data of the analysis result may be used, and such correct answer data may be obtained as a result of an annotator's annotation work.
  • Through this learning process, the characteristics hidden in the input variables of the artificial neural network model 900 can be identified, and the synapse values (or weights) between the nodes may be adjusted so that the error between the output variable calculated from the input variable and the target output is reduced.
  • When the artificial neural network model 900 is trained, a loss function that minimizes the mutual information between the text information and the new speaker information (e.g., between the text information embedding and the new speaker information embedding) may be used.
  • For example, when the artificial neural network model 900 is the artificial neural network text-to-speech synthesis model, it may be configured to include a module (e.g., a fully-connected layer) that predicts the loss between the text information embedding and the new speaker information embedding.
  • the artificial neural network model 900 may be trained to predict and minimize mutual information between text information and speaker information.
  • the artificial neural network model 900 learned in this way may be configured to independently adjust each of the input text information and the new speaker information.
  • The processor may input the target text information and the new speaker information into the learned artificial neural network model 900, and a synthesized voice corresponding to the target text in which the speaker characteristics of the new speaker are reflected may be output.
  • voice data may be configured such that mutual information between the target text information and the new speaker information is minimized.
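One way such a mutual-information term could be realized is with a small critic network and a Donsker-Varadhan style estimate, as sketched below. This is an assumed construction, not the patent's stated objective: the text above only says that a module (e.g., a fully-connected layer) predicts the mutual information between the text and new-speaker embeddings and that the model is trained to minimize it.

```python
import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    """Small fully-connected critic that scores (text embedding, speaker embedding)
    pairs; a MINE-style lower bound on mutual information is built from its scores."""

    def __init__(self, text_dim=256, spk_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_emb, spk_emb):
        joint = self.net(torch.cat([text_emb, spk_emb], dim=-1)).squeeze(-1)
        shuffled = spk_emb[torch.randperm(spk_emb.size(0))]   # break the pairing
        marginal = self.net(torch.cat([text_emb, shuffled], dim=-1)).squeeze(-1)
        n = torch.tensor(float(marginal.numel()))
        # Donsker-Varadhan bound: E[T(joint)] - log E[exp(T(marginal))]
        return joint.mean() - (torch.logsumexp(marginal, dim=0) - torch.log(n))

# usage: estimate MI between a batch of text and speaker embeddings
mi = MIEstimator()(torch.randn(8, 256), torch.randn(8, 256))
```

During training, the critic would typically be updated to tighten the estimate while the synthesis model is updated to minimize it, for example as `total_loss = reconstruction_loss + lambda_mi * mi_estimate`; this adversarial arrangement is likewise an assumption.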
  • The learning process of the artificial neural network model 900 may also be applied, using the training data of each model, to the aforementioned artificial neural network speaker characteristic change generation model, the aforementioned artificial neural network vocalization characteristic prediction model, and/or the aforementioned artificial neural network speaker feature extraction model.
  • the artificial neural network models trained in this way may generate an inference value as output data by using data corresponding to the learning input data as input.
  • the above-described method may be provided as a computer program stored in a computer-readable recording medium for execution by a computer.
  • the medium may continuously store a computer executable program, or may be a temporary storage for execution or download.
  • The medium may be various recording means or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to any computer system, and may exist distributed on a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape, optical recording media such as CD-ROM and DVD, magneto-optical media such as floptical disks, and media configured to store program instructions, including ROM, RAM, flash memory, and the like.
  • examples of other media may include recording media or storage media managed by an app store that distributes applications, sites that supply or distribute various other software, or servers.
  • The processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this disclosure, a computer, or a combination thereof.
  • a general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, eg, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other configuration.
  • The techniques may be implemented as instructions stored on a computer-readable medium, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, compact disc (CD), or magnetic or optical data storage devices. The instructions may be executable by one or more processors, and may cause the processor(s) to perform certain aspects of the functionality described in this disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a method, performed by at least one processor, for generating synthesized speech of a new speaker. The method may comprise the steps of: receiving target text; acquiring speaker features of a reference speaker; acquiring information about changes in utterance features; determining speaker features of a new speaker by using the acquired speaker features of the reference speaker and the acquired information about changes in utterance features; and generating output speech for the target text by inputting the target text and the determined speaker features of the new speaker to an artificial neural network text-speech synthesis model, wherein the output speech reflects the determined speaker features of the new speaker. Here, the artificial neural network text-speech synthesis model can be trained on the basis of a plurality of training text items and speaker features of a plurality of training speakers to output speech for the plurality of training text items, wherein the output speech reflects the speaker features of the plurality of training speakers.

Description

새로운 화자의 합성 음성을 생성하는 방법 및 시스템Method and system for generating synthesized speech of a new speaker
본 개시는 새로운 화자의 합성 음성을 생성하는 방법 및 시스템에 관한 것으로서, 더 구체적으로, 기준 화자의 화자 특징 및 발성 특징 변화 정보를 이용하여 새로운 화자의 화자 특징을 결정하고, 인공신경망 텍스트-음성 합성 모델을 이용하여 새로운 화자의 화자 특징이 반영된 합성 음성을 생성하는 방법 및 시스템에 관한 것이다.The present disclosure relates to a method and system for generating a synthesized voice of a new speaker, and more particularly, to determine the speaker characteristic of a new speaker using the speaker characteristic and vocal characteristic change information of a reference speaker, and artificial neural network text-to-speech synthesis A method and system for generating a synthesized voice in which the speaker characteristics of a new speaker are reflected by using a model.
오디오 콘텐츠 및 비디오 콘텐츠 제작 기술의 발전에 따라, 콘텐츠 제작자는 누구나 오디오 콘텐츠 또는 비디오 콘텐츠를 쉽게 제작할 수 있게 되었다. 또한, 가상 음성 생성 기술 및 가상 영상 제작 기술의 발전으로, 성우가 녹음한 오디오 샘플을 통해 신경망 음성 모델을 학습시켜, 오디오 샘플을 녹음한 성우와 동일한 음성 특징을 갖는 음성 합성기술이 개발되고 있다.With the development of audio content and video content production technology, any content creator can easily produce audio content or video content. In addition, with the development of virtual voice generation technology and virtual image production technology, a neural network voice model is trained through audio samples recorded by voice actors, and voice synthesis technology having the same voice characteristics as voice actors recording audio samples is being developed.
그러나, 종래의 오디오 샘플 기반 음성 합성 기술은 기존에 존재하지 않던 목소리를 새롭게 생성하는 것은 기술적으로 어려우며, 남성과 여성의 목소리를 결합한 중성적인 목소리, 발음이 정확한 어린이 목소리 등 존재하지 않는 음성 특징을 갖는 목소리는 구현하기 어려운 문제가 있다. 더욱이, 새롭게 생성된 음성은, 기계적인 음성으로 인식될 정도로 퀄리티가 낮아서, 상업적으로 사용되기 어려웠다.However, in the conventional audio sample-based speech synthesis technology, it is technically difficult to create a new voice that did not exist before, and it is technically difficult to create a voice that has non-existent voice features, such as a neutral voice combining male and female voices, and a child's voice with accurate pronunciation. The voice has a problem that is difficult to implement. Moreover, the newly generated voice was of low quality enough to be recognized as a mechanical voice, making it difficult to use commercially.
본 개시는 상기와 같은 문제를 해결하기 위한 새로운 화자의 합성 음성을 생성하는 방법, 컴퓨터 판독가능한 기록 매체에 저장된 컴퓨터 프로그램 및 장치(시스템)를 제공한다.The present disclosure provides a method for generating a new speaker's synthesized voice, a computer program stored in a computer-readable recording medium, and an apparatus (system) to solve the above problems.
본 개시는 방법, 시스템, 장치 또는 컴퓨터 판독가능 저장 매체에 저장된 컴퓨터 프로그램, 컴퓨터 판독가능한 기록 매체를 포함한 다양한 방식으로 구현될 수 있다.The present disclosure may be implemented in various ways including a method, a system, an apparatus, or a computer program stored in a computer-readable storage medium, and a computer-readable recording medium.
본 개시의 일 실시예에 따르면, 적어도 하나의 프로세서에 의해 수행되는, 새로운 화자의 합성 음성을 생성하는 방법은, 대상 텍스트를 수신하는 단계, 기준 화자의 화자 특징을 획득하는 단계, 발성 특징 변화 정보를 획득하는 단계, 획득된 기준 화자의 화자 특징 및 획득된 발성 특징 변화 정보를 이용하여 새로운 화자의 화자 특징을 결정하는 단계 및 대상 텍스트 및 결정된 새로운 화자의 화자 특징을 인공신경망 텍스트-음성 합성 모델에 입력하여, 결정된 새로운 화자의 화자 특징이 반영된, 대상 텍스트에 대한 출력 음성을 생성하는 단계를 포함하고, 인공신경망 텍스트-음성 합성 모델은, 복수의 학습 텍스트 아이템 및 복수의 학습 화자의 화자 특징을 기초로, 복수의 학습 화자의 화자 특징이 반영된, 복수의 학습 텍스트 아이템에 대한 음성을 출력하도록 학습된다.According to an embodiment of the present disclosure, a method for generating a synthesized voice of a new speaker, performed by at least one processor, includes the steps of receiving a target text, acquiring speaker characteristics of a reference speaker, and changing speech characteristics. obtaining a speaker characteristic of a reference speaker and determining the speaker characteristic of a new speaker using the acquired speaker characteristic of the reference speaker and the acquired speech characteristic change information, and the target text and the speaker characteristic of the determined new speaker to the artificial neural network text-to-speech synthesis model and generating an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected, wherein the artificial neural network text-to-speech synthesis model is based on the plurality of training text items and the speaker characteristics of the plurality of learned speakers. Thus, it is learned to output voices for a plurality of learning text items in which the speaker characteristics of the plurality of learning speakers are reflected.
일 실시예에서, 새로운 화자의 화자 특징을 결정하는 단계는, 기준 화자의 화자 특징 및 획득된 발성 특징 변화 정보를 인공신경망 화자 특징 변화 생성 모델에 입력하여 화자 특징 변화를 생성하는 단계 및 기준 화자의 화자 특징 및 생성된 화자 특징 변화를 합성함으로써, 새로운 화자의 화자 특징을 출력하는 단계를 포함하고, 인공신경망 화자 특징 변화 생성 모델은, 복수의 학습 화자의 화자 특징 및 복수의 학습 화자의 화자 특징에 포함된 복수의 발성 특징을 이용하여 학습된다.In an embodiment, the determining of the speaker characteristic of the new speaker may include generating the speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired vocal characteristic change information into an artificial neural network speaker characteristic change generation model, and outputting the speaker characteristics of a new speaker by synthesizing the speaker characteristic and the generated speaker characteristic change, wherein the artificial neural network speaker characteristic change generation model is based on the speaker characteristics of the plurality of learned speakers and the speaker characteristics of the plurality of learned speakers. It is learned using a plurality of included vocal features.
일 실시예에서, 발성 특징 변화 정보는, 타겟 발성 특징의 변화에 대한 정보를 포함한다.In an embodiment, the speech characteristic change information includes information about a change in the target speech characteristic.
일 실시예에서, 기준 화자의 화자 특징을 획득하는 단계는, 복수의 기준 화자에 대응하는 복수의 화자 특징을 획득하는 단계를 포함하고, 발성 특징 변화 정보를 획득하는 단계는, 복수의 화자 특징에 대응하는 가중치 세트를 획득하는 단계를 포함하고, 새로운 화자의 화자 특징을 결정하는 단계는, 복수의 화자의 특징의 각각에 획득된 가중치 세트에 포함된 가중치를 적용함으로써, 새로운 화자의 화자 특징을 결정하는 단계를 포함한다.In an embodiment, the acquiring the speaker characteristics of the reference speaker includes acquiring a plurality of speaker characteristics corresponding to the plurality of reference speakers, and the acquiring the vocalization characteristic change information includes: obtaining a corresponding set of weights, wherein the determining of the speaker characteristic of the new speaker includes applying a weight included in the obtained weight set to each of the plurality of speaker characteristics, thereby determining the speaker characteristic of the new speaker. including the steps of
일 실시예에서, 복수의 화자의 화자 특징을 획득하는 단계 - 복수의 화자의 화자 특징은 복수의 화자 벡터를 포함함 -를 더 포함하고, 발성 특징 변화 정보를 획득하는 단계는, 복수의 화자의 화자 벡터의 각각을 정규화시키는 단계, 정규화된 복수의 화자의 화자 벡터에 대한 차원 축소 분석을 수행함으로써, 복수의 주요 성분을 결정하는 단계, 결정된 복수의 주요 성분 중 적어도 하나의 주요 성분을 선택하는 단계 및 선택된 주요 성분을 이용하여 발성 특징 변화 정보를 결정하는 단계를 포함하고, 새로운 화자의 화자 특징을 결정하는 단계는, 기준 화자의 화자 특징, 결정된 발성 특징 변화 정보 및 결정된 발성 특징 변화 정보의 가중치를 이용하여 새로운 화자의 화자 특징을 결정하는 단계를 포함한다.In an embodiment, the method further includes obtaining speaker characteristics of the plurality of speakers, wherein the speaker characteristics of the plurality of speakers include a plurality of speaker vectors, wherein the obtaining of the vocalization characteristic change information includes: Normalizing each of the speaker vectors; determining a plurality of principal components by performing dimensionality reduction analysis on the speaker vectors of the plurality of normalized speakers; selecting at least one principal component from among the determined plurality of principal components; and determining the speech characteristic change information by using the selected main component, wherein the determining of the speaker characteristic of the new speaker includes determining a weight of the speaker characteristic of the reference speaker, the determined speech characteristic change information, and the determined speech characteristic change information. and determining the speaker characteristics of the new speaker using
일 실시예에서, 복수의 화자의 화자 특징을 획득하는 단계 - 복수의 화자의 화자 특징은 복수의 화자 벡터를 포함함 -를 더 포함하고, 복수의 화자의 각각은, 하나 이상의 발성 특징에 대한 레이블이 할당되고, 발성 특징 변화 정보를 획득하는 단계는, 타겟 발성 특징이 상이한 복수의 화자의 화자 벡터를 획득하는 단계 및 획득된 복수의 화자의 화자 벡터 사이의 차이를 기초로 발성 특징 변화 정보를 결정하는 단계를 포함하고, 새로운 화자의 화자 특징을 결정하는 단계는, 기준 화자의 화자 특징, 결정된 발성 특징 변화 정보 및 결정된 발성 특징 변화 정보의 가중치를 이용하여 새로운 화자의 화자 특징을 결정하는 단계를 포함한다.In one embodiment, the method further comprises: obtaining speaker characteristics of the plurality of speakers, the speaker characteristics of the plurality of speakers comprising a plurality of speaker vectors, each of the plurality of speakers comprising: a label for one or more vocalization characteristics is assigned, and the obtaining of the speech characteristic change information includes: obtaining the speaker vectors of a plurality of speakers having different target speech characteristics, and determining the speech characteristic change information based on a difference between the obtained speaker vectors of the plurality of speakers and determining the speaker characteristic of the new speaker, wherein the determining of the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined speech characteristic change information, and the weight of the determined speech characteristic change information. do.
일 실시예에서, 복수의 화자의 화자 특징을 획득하는 단계 - 복수의 화자의 화자 특징은 복수의 화자 벡터를 포함함 -를 더 포함하고, 복수의 화자의 각각은, 하나 이상의 발성 특징에 대한 레이블이 할당되고, 발성 특징 변화 정보를 획득하는 단계는, 타겟 발성 특징이 상이한 복수의 화자 그룹의 각각에 포함된 화자들의 화자 벡터를 획득하는 단계 - 복수의 화자의 그룹은 제1 화자 그룹 및 제2 화자 그룹을 포함함 -, 제1 화자 그룹에 포함된 화자들의 화자 벡터의 평균을 산출하는 단계, 제2 화자 그룹에 포함된 화자들의 화자 벡터의 평균을 산출하는 단계 및 제1 화자 그룹에 대응하는 화자 벡터의 평균 및 제2 화자 그룹에 대응하는 화자 벡터의 평균 사이의 차이를 기초로 발성 특징 변화 정보를 결정하는 단계를 포함하고, 새로운 화자의 화자 특징을 결정하는 단계는, 기준 화자의 화자 특징, 결정된 발성 특징 변화 정보 및 결정된 발성 특징 변화 정보의 가중치를 이용하여 새로운 화자의 화자 특징을 결정하는 단계를 포함한다.In one embodiment, the method further comprises: obtaining speaker characteristics of the plurality of speakers, the speaker characteristics of the plurality of speakers comprising a plurality of speaker vectors, each of the plurality of speakers comprising: a label for one or more vocalization characteristics is allocated, and the obtaining the speech characteristic change information includes: obtaining a speaker vector of speakers included in each of a plurality of speaker groups having different target speech characteristics; including a speaker group; calculating an average of speaker vectors of speakers included in the first speaker group; calculating an average of speaker vectors of speakers included in the second speaker group; and the steps corresponding to the first speaker group and determining the speech characteristic change information based on a difference between the average of the speaker vectors and the average of the speaker vectors corresponding to the second speaker group, wherein the determining of the speaker characteristic of the new speaker includes the speaker characteristic of the reference speaker. , determining a speaker characteristic of a new speaker by using the determined speech characteristic change information and a weight of the determined speech characteristic change information.
In one embodiment, the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors, and the speaker characteristic of the reference speaker including a plurality of speech characteristics of the reference speaker. Obtaining the speech characteristic change information includes inputting the speaker characteristics of the plurality of speakers into an artificial neural network speech characteristic prediction model and outputting the speech characteristics of each of the plurality of speakers; selecting, from among the speaker characteristics of the plurality of speakers, the speaker characteristic of a speaker for which a difference exists between the target speech characteristic among that speaker's output speech characteristics and the target speech characteristic among the plurality of speech characteristics of the reference speaker; and obtaining a weight corresponding to the selected speaker characteristic. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the selected speaker characteristic, and the weight corresponding to the selected speaker characteristic.
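The selection step in this embodiment can be pictured as follows: a prediction model estimates the target speech characteristic for every stored speaker vector, speakers whose predicted value differs from that of the reference speaker become candidates, and one candidate's vector is blended with the reference vector using a weight. In the sketch below, `predict_speech_characteristics` is a hypothetical stand-in for the artificial neural network prediction model, and linear interpolation is only one possible way to apply the weight.

    import numpy as np

    def predict_speech_characteristics(speaker_vector: np.ndarray) -> dict:
        """Placeholder for the neural speech characteristic prediction model."""
        # A real model would return predicted attributes such as tone, speed, age, or gender.
        return {"speech_speed": float(speaker_vector[:8].mean())}

    def select_candidates(speaker_vectors, reference_vector, target="speech_speed", min_diff=0.1):
        ref_value = predict_speech_characteristics(reference_vector)[target]
        candidates = []
        for vec in speaker_vectors:
            value = predict_speech_characteristics(vec)[target]
            if abs(value - ref_value) > min_diff:   # target characteristic differs from the reference
                candidates.append(vec)
        return candidates

    speaker_vectors = np.random.randn(100, 256)
    reference_vector = np.random.randn(256)

    candidates = select_candidates(speaker_vectors, reference_vector)
    if candidates:
        selected = candidates[0]
        weight = 0.3  # weight corresponding to the selected speaker characteristic
        new_speaker_vector = (1 - weight) * reference_vector + weight * selected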
In one embodiment, the speaker characteristic of the new speaker includes a speaker feature vector, and the method includes calculating a hash value corresponding to the speaker feature vector using a hash function; determining whether, among the content of a plurality of speakers stored in a storage medium, there is content associated with a hash value similar to the calculated hash value; and, if there is no content associated with a hash value similar to the calculated hash value, determining that the output voice associated with the speaker characteristic of the new speaker is a new output voice.
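One way to realize the hash-based check above is to quantize the speaker feature vector before hashing, so that nearby vectors map to the same hash value, and to treat the output voice as new only if no stored content shares that hash. The quantization step and the in-memory set of stored hashes below are illustrative assumptions; any locality-preserving hashing scheme could take their place.

    import hashlib
    import numpy as np

    def speaker_hash(vector: np.ndarray, precision: float = 0.1) -> str:
        """Hash a speaker feature vector after coarse quantization."""
        quantized = np.round(vector / precision).astype(np.int64)
        return hashlib.sha256(quantized.tobytes()).hexdigest()

    # Hashes of speaker vectors associated with content already in storage (illustrative data).
    stored_hashes = {speaker_hash(np.random.randn(256)) for _ in range(1000)}

    new_speaker_vector = np.random.randn(256)
    new_hash = speaker_hash(new_speaker_vector)

    # If no stored content has a matching hash, treat the associated output voice as new.
    is_new_output_voice = new_hash not in stored_hashes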
In one embodiment, the speaker characteristic of the reference speaker includes a speaker vector, and obtaining the speech characteristic change information includes extracting a normal vector for a target speech characteristic using a speech characteristic classification model corresponding to the target speech characteristic, where the normal vector refers to the normal vector of the hyperplane that classifies the target speech characteristic, and obtaining information indicating a degree to which the target speech characteristic is to be adjusted. Determining the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker based on the speaker vector of the reference speaker, the extracted normal vector, and the degree to which the target speech characteristic is to be adjusted.
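A common way to obtain such a hyperplane normal is to fit a linear classifier that separates speaker vectors by the target speech characteristic; the learned weight vector is the normal of the separating hyperplane, and moving the reference speaker's vector along it by a chosen amount adjusts that characteristic. The sketch below uses scikit-learn's LinearSVC as one possible classification model; the labels and the scaling constant `alpha` are illustrative assumptions, not the classification model of the disclosure.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Speaker vectors with binary labels for the target speech characteristic
    # (e.g. 0 = soft tone, 1 = firm tone); illustrative random data.
    speaker_vectors = np.random.randn(200, 256)
    labels = np.random.randint(0, 2, size=200)

    clf = LinearSVC().fit(speaker_vectors, labels)
    normal = clf.coef_[0]
    normal = normal / np.linalg.norm(normal)   # unit normal of the separating hyperplane

    reference_vector = np.random.randn(256)
    alpha = 1.5                                # degree to which the target characteristic is adjusted

    new_speaker_vector = reference_vector + alpha * normal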
According to an embodiment of the present disclosure, a computer program stored in a computer-readable recording medium is provided for executing, on a computer, the above-described method for generating a synthesized voice of a new speaker.

According to an embodiment of the present disclosure, a speech synthesizer is trained using training data including a synthesized voice of a new speaker generated according to the above-described method for generating a synthesized voice of a new speaker.

According to an embodiment of the present disclosure, an apparatus for providing a synthesized voice includes a memory configured to store a synthesized voice of a new speaker generated according to the above-described method, and at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory, wherein the at least one program is configured to output at least a part of the synthesized voice of the new speaker stored in the memory.

According to an embodiment of the present disclosure, a method of providing a synthesized voice of a new speaker, performed by at least one processor, includes storing a synthesized voice of a new speaker generated according to the above-described method and providing at least a part of the stored synthesized voice of the new speaker.
According to some embodiments of the present disclosure, a natural voice can be generated for the target text in which the speaker characteristics of the new speaker are reflected.

According to some embodiments of the present disclosure, a synthesized voice having a new voice can be generated by modifying the speaker feature vector through quantitative adjustment of speech characteristics.

According to some embodiments of the present disclosure, a new speaker's voice can be generated by mixing the voices of several speakers (for example, two or more speakers, or three or more speakers).

According to some embodiments of the present disclosure, an output voice can be generated by finely adjusting one or more speech characteristics from the user terminal. For example, the one or more speech characteristics may include gender adjustment, vocal tone adjustment, vocal strength, male age adjustment, female age adjustment, pitch, tempo, and the like.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description of the claims by a person of ordinary skill in the art to which the present disclosure pertains (hereinafter, a "person of ordinary skill").
Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, in which like reference numerals denote like elements, although the embodiments are not limited thereto.

FIG. 1 is a diagram illustrating an example in which a synthesized voice generation system according to an embodiment of the present disclosure receives a target text and the speaker characteristics of a new speaker and generates an output voice.

FIG. 2 is a schematic diagram illustrating a configuration in which a plurality of user terminals and a synthesized voice generation system are communicatively connected to provide a synthesized voice generation service for text according to an embodiment of the present disclosure.

FIG. 3 is a block diagram illustrating the internal configuration of a user terminal and a synthesized voice generation system according to an embodiment of the present disclosure.

FIG. 4 is a block diagram illustrating the internal configuration of a processor of a synthesized voice generation system according to an embodiment of the present disclosure.

FIG. 5 is a flowchart illustrating a method of generating an output voice in which the speaker characteristics of a new speaker are reflected, according to an embodiment of the present disclosure.

FIG. 6 is a diagram illustrating an example of generating an output voice in which the speaker characteristics of a new speaker are reflected, using an artificial neural network text-to-speech synthesis model according to an embodiment of the present disclosure.

FIG. 7 is a diagram illustrating an example of generating an output voice in which the speaker characteristics of a new speaker are reflected, using an artificial neural network text-to-speech synthesis model according to another embodiment of the present disclosure.

FIG. 8 is an exemplary diagram illustrating a user interface for generating an output voice in which the speaker characteristics of a new speaker are reflected, according to an embodiment of the present disclosure.

FIG. 9 is a structural diagram illustrating an artificial neural network model according to an embodiment of the present disclosure.
Hereinafter, specific details for carrying out the present disclosure will be described with reference to the accompanying drawings. In the following description, however, detailed descriptions of well-known functions or configurations will be omitted where they might unnecessarily obscure the gist of the present disclosure.

In the accompanying drawings, identical or corresponding components are assigned the same reference numerals. In the description of the embodiments below, duplicate descriptions of identical or corresponding components may be omitted. However, even if a description of a component is omitted, it is not intended that such a component is excluded from any embodiment.

Advantages and features of the disclosed embodiments, and methods of achieving them, will become apparent with reference to the embodiments described below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various different forms; these embodiments are provided only so that the present disclosure is complete and fully conveys the scope of the invention to a person of ordinary skill in the art.
The terms used in this specification will be briefly described, and the disclosed embodiments will then be described in detail. The terms used herein have been selected from general terms currently in wide use, in consideration of their function in the present disclosure, but they may vary depending on the intention of those skilled in the relevant field, judicial precedents, the emergence of new technology, and the like. In certain cases, a term may be arbitrarily selected by the applicant, in which case its meaning will be described in detail in the corresponding description of the invention. Therefore, the terms used in the present disclosure should be defined based on the meaning of the term and the overall content of the present disclosure, rather than simply on the name of the term.

In this specification, singular expressions include plural expressions unless the context clearly specifies the singular, and plural expressions include singular expressions unless the context clearly specifies the plural. When a part is said to include a certain component throughout the specification, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.
The term "unit" or "module" used in the specification refers to a software or hardware component, and a "unit" or "module" performs certain roles. However, "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside on an addressable storage medium or to run one or more processors. Thus, as an example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, or variables. The functions provided within components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

According to an embodiment of the present disclosure, a "unit" or "module" may be implemented as a processor and a memory. The term "processor" should be interpreted broadly to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, and the like. In some environments, "processor" may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), and the like. "Processor" may also refer to a combination of processing devices, for example a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, a combination of one or more microprocessors with a DSP core, or any other such configuration. The term "memory" should be interpreted broadly to include any electronic component capable of storing electronic information. "Memory" may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, registers, and the like. A memory is said to be in electronic communication with a processor if the processor can read information from and/or write information to the memory. A memory integrated into a processor is in electronic communication with the processor.
In the present disclosure, a "text item" may refer to part or all of a text, and a text may refer to a text item. Similarly, each of "data item" and "information item" may refer to at least part of data and at least part of information, and data and information may refer to a data item and an information item, respectively. In the present disclosure, "each of a plurality of A" may refer to each of all components included in the plurality of A, or to each of some components included in the plurality of A. For example, each of the characteristics of a plurality of speakers may refer to each of all speaker characteristics included in the characteristics of the plurality of speakers, or to each of some speaker characteristics included in the characteristics of the plurality of speakers.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present disclosure pertains can easily carry them out. In the drawings, parts irrelevant to the description are omitted in order to describe the present disclosure clearly.
FIG. 1 is a diagram illustrating an example in which a synthesized voice generation system 100 according to an embodiment of the present disclosure receives a target text 110 and the speaker characteristics 120 of a new speaker and generates an output voice 130. The synthesized voice generation system 100 may receive the target text 110 and the speaker characteristics 120 of the new speaker and generate the output voice 130 in which the speaker characteristics 120 of the new speaker are reflected. Here, the target text 110 may include one or more paragraphs, sentences, clauses, phrases, words, word segments, phonemes, and the like.

According to an embodiment, the speaker characteristics 120 of the new speaker may be determined or generated using the speaker characteristics of a reference speaker and speech characteristic change information. Here, the speaker characteristics of the reference speaker may include the speaker characteristics of a speaker that serves as a reference for generating the speaker characteristics of the speaker to be newly created, that is, the new speaker. For example, the speaker characteristics of the reference speaker may include speaker characteristics similar to those of the speaker to be newly created. As another example, the speaker characteristics of the reference speaker may include the speaker characteristics of a plurality of reference speakers.
According to an embodiment, the speaker characteristics of the reference speaker may include a speaker vector of the reference speaker. For example, the speaker vector of the reference speaker may be extracted based on a speaker id (for example, a speaker one-hot vector) and a speech characteristic (for example, a vector) using an artificial neural network speaker feature extraction model. Here, the artificial neural network speaker feature extraction model may be trained to receive a plurality of training speaker ids and a plurality of training speech characteristics (for example, vectors) and to extract a ground-truth speaker vector of a reference speaker. As another example, the speaker vector of the reference speaker may be extracted based on a voice recorded by a speaker and a speech characteristic (for example, a vector) using the artificial neural network speaker feature extraction model. In this case, the model may be trained to receive voices recorded by a plurality of training speakers and a plurality of training speech characteristics (for example, vectors) and to extract a ground-truth speaker vector of a reference speaker. The speaker vector of the reference speaker may include one or more speech characteristics of the reference speaker's voice (for example, tone, vocal strength, speaking speed, gender, age, and the like). In addition, the speaker id and/or the voice recorded by the speaker may be selected as the voice on which the speaker characteristics of the new speaker are based, and the speech characteristic may include a base speech characteristic to be reflected in the speaker characteristics of the new speaker. That is, the speaker id, the recorded voice, and/or the speech characteristic are used to generate the speaker characteristics of the reference speaker, and the speaker characteristics of the reference speaker thus generated are combined with the speech characteristic change information to generate the speaker characteristics of the new speaker.
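The speaker feature extraction model described above can be pictured as a small network that maps a speaker id (as an embedded one-hot index) together with a speech characteristic vector to a speaker vector. The PyTorch sketch below is a minimal, hypothetical architecture shown only to illustrate the input/output interface; the actual model of the disclosure may differ in structure and training.

    import torch
    import torch.nn as nn

    class SpeakerFeatureExtractor(nn.Module):
        """Maps (speaker id, speech characteristic vector) to a speaker vector."""
        def __init__(self, num_speakers: int, char_dim: int, speaker_dim: int = 256):
            super().__init__()
            self.id_embedding = nn.Embedding(num_speakers, speaker_dim)
            self.mlp = nn.Sequential(
                nn.Linear(speaker_dim + char_dim, speaker_dim),
                nn.ReLU(),
                nn.Linear(speaker_dim, speaker_dim),
            )

        def forward(self, speaker_id: torch.Tensor, speech_char: torch.Tensor) -> torch.Tensor:
            id_vec = self.id_embedding(speaker_id)                      # (batch, speaker_dim)
            return self.mlp(torch.cat([id_vec, speech_char], dim=-1))   # (batch, speaker_dim)

    # Example: extract a reference speaker's vector from an id and a speech characteristic vector.
    model = SpeakerFeatureExtractor(num_speakers=100, char_dim=16)
    speaker_id = torch.tensor([7])
    speech_char = torch.randn(1, 16)
    reference_speaker_vector = model(speaker_id, speech_char)           # (1, 256)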
The speech characteristic change information may include any information about a speech characteristic that is to be applied to the speaker characteristics of the new speaker. According to an embodiment, the speech characteristic change information may include information about a difference between the speaker characteristics of the new speaker and the speaker characteristics of the reference speaker. For example, the characteristics of the new speaker may be generated by combining the speaker characteristics of the reference speaker with a speaker characteristic change. Here, the speaker characteristic change may be generated by inputting the speaker characteristics of the reference speaker and the speech characteristic change information into an artificial neural network speaker characteristic change generation model. The artificial neural network speaker characteristic change generation model may be trained using the speaker characteristics of a plurality of training speakers and a plurality of speech characteristics included in the plurality of speaker characteristics.

For example, the speech characteristic change information may include information indicating a difference between a target speech characteristic included in the speaker characteristics of the new speaker and the target speech characteristic included in the speaker characteristics of the reference speaker; that is, it may include information about a change in the target speech characteristic. As another example, the speech characteristic change information may include the normal vector of a hyperplane that classifies the target speech characteristic from speaker characteristics, together with information indicating the degree to which the target speech characteristic is to be adjusted. As another example, the speech characteristic change information may include a weight to be applied to each of the speaker characteristics of a plurality of reference speakers. As another example, the speech characteristic change information may include a target speech characteristic generated based on an axis between target speech characteristics included in the training speakers, together with a weight for that target speech characteristic. As another example, the speech characteristic change information may include a target speech characteristic generated based on a difference between the speaker characteristics of speakers that differ in the target speech characteristic, together with a weight for that target speech characteristic. As yet another example, the speech characteristic change information may include the speaker characteristic of a speaker that differs from the target characteristic included in the speaker characteristics of the reference speaker, together with a weight for that speaker characteristic.
The synthesized voice generation system 100 may generate the output voice 130 as a synthesized voice for the target text 110 in which the speaker characteristics 120 of the new speaker are reflected, that is, a voice in which the target text is uttered according to the newly generated speaker characteristics. To this end, the synthesized voice generation system 100 may include an artificial neural network text-to-speech synthesis model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output voices for the plurality of training text items in which the speaker characteristics of the plurality of training speakers are reflected. Alternatively, the artificial neural network text-to-speech synthesis model may be configured to output voice data when the target text 110 and the speaker characteristics 120 of the new speaker are input; in this case, the output voice data may be post-processed into human-audible speech using a post-processor, a vocoder, or the like.
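At inference time, the flow described above can be summarized as: the text-to-speech model consumes the target text and the new speaker's characteristic and emits acoustic features (for example, a mel spectrogram), which a vocoder then converts into an audible waveform. The function names below (`tts_model`, `vocoder`) are placeholders standing in for the trained artificial neural network models and are not actual APIs from the disclosure; the dummy callables exist only so the sketch runs end to end.

    import numpy as np

    def synthesize(target_text: str, new_speaker_vector: np.ndarray, tts_model, vocoder) -> np.ndarray:
        """Generate a waveform for target_text in the new speaker's voice."""
        # 1) The TTS model predicts acoustic features conditioned on the new speaker's vector.
        mel_spectrogram = tts_model(target_text, new_speaker_vector)
        # 2) The vocoder post-processes the acoustic features into an audible waveform.
        waveform = vocoder(mel_spectrogram)
        return waveform

    # Purely illustrative stand-ins for the trained models.
    dummy_tts = lambda text, spk: np.random.randn(80, len(text) * 10)   # fake mel spectrogram
    dummy_vocoder = lambda mel: np.random.randn(mel.shape[1] * 256)     # fake waveform

    audio = synthesize("hello world", np.random.randn(256), dummy_tts, dummy_vocoder)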
FIG. 2 is a schematic diagram illustrating a configuration in which a plurality of user terminals 210_1, 210_2, and 210_3 and a synthesized voice generation system 230 are communicatively connected to provide a synthesized voice generation service for text according to an embodiment of the present disclosure. The plurality of user terminals 210_1, 210_2, and 210_3 may communicate with the synthesized voice generation system 230 through a network 220. The network 220 may be configured to enable communication between the plurality of user terminals 210_1, 210_2, and 210_3 and the synthesized voice generation system 230. Depending on the installation environment, the network 220 may consist of a wired network such as Ethernet, power line communication, telephone line communication, or RS-serial communication; a wireless network such as a mobile communication network, a wireless LAN (WLAN), Wi-Fi, Bluetooth, or ZigBee; or a combination thereof. The communication method is not limited and may include not only communication methods utilizing communication networks that the network 220 may include (for example, a mobile communication network, wired Internet, wireless Internet, a broadcasting network, a satellite network, and the like), but also short-range wireless communication between the user terminals 210_1, 210_2, and 210_3. For example, the network 220 may include any one or more of networks such as a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), and the Internet. In addition, the network 220 may include any one or more of network topologies including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited thereto.

In FIG. 2, a mobile phone or smartphone 210_1, a tablet computer 210_2, and a laptop or desktop computer 210_3 are shown as examples of user terminals that execute or operate a user interface providing the synthesized voice generation service, but the user terminals are not limited thereto. The user terminals 210_1, 210_2, and 210_3 may be any computing device capable of wired and/or wireless communication on which a web browser, a mobile browser application, or a synthesized voice generation application is installed so that a user interface providing the synthesized voice generation service can be executed. For example, the user terminal 210 may include a smartphone, a mobile phone, a navigation terminal, a desktop computer, a laptop computer, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a tablet computer, a game console, a wearable device, an Internet of things (IoT) device, a virtual reality (VR) device, an augmented reality (AR) device, and the like. In addition, although FIG. 2 shows three user terminals 210_1, 210_2, and 210_3 communicating with the synthesized voice generation system 230 through the network 220, the present disclosure is not limited thereto, and a different number of user terminals may be configured to communicate with the synthesized voice generation system 230 through the network 220.
In an embodiment, the user terminals 210_1, 210_2, and 210_3 may provide the synthesized voice generation system 230 with the target text, information about the speaker characteristics of the reference speaker, and/or information indicating or selecting the speech characteristic change information. Additionally or alternatively, the user terminals 210_1, 210_2, and 210_3 may receive the speaker characteristics of candidate reference speakers and/or candidate speech characteristic change information from the synthesized voice generation system 230. In response to a user input, the user terminals 210_1, 210_2, and 210_3 may select the speaker characteristics of the reference speaker and/or the speech characteristic change information from among the speaker characteristics of the candidate reference speakers and/or the candidate speech characteristic change information. In addition, the user terminals 210_1, 210_2, and 210_3 may receive the generated output voice from the synthesized voice generation system 230.

Although each of the user terminals 210_1, 210_2, and 210_3 and the synthesized voice generation system 230 are illustrated in FIG. 2 as separately configured elements, the present disclosure is not limited thereto, and the synthesized voice generation system 230 may be configured to be included in each of the user terminals 210_1, 210_2, and 210_3. Alternatively, the synthesized voice generation system 230 may include an input/output interface and may be configured to determine the target text, the speaker characteristics of the reference speaker, and the speech characteristic change information without communicating with the user terminals 210_1, 210_2, and 210_3, and to output a synthesized voice for the target text in which the speaker characteristics of the new speaker are reflected.
FIG. 3 is a block diagram illustrating the internal configuration of the user terminal 210 and the synthesized voice generation system 230 according to an embodiment of the present disclosure. The user terminal 210 may refer to any computing device capable of wired/wireless communication and may include, for example, the mobile phone or smartphone 210_1, the tablet computer 210_2, or the laptop or desktop computer 210_3 of FIG. 2. As shown, the user terminal 210 may include a memory 312, a processor 314, a communication module 316, and an input/output interface 318. Similarly, the synthesized voice generation system 230 may include a memory 332, a processor 334, a communication module 336, and an input/output interface 338. As shown in FIG. 3, the user terminal 210 and the synthesized voice generation system 230 may be configured to communicate information and/or data through the network 220 using their respective communication modules 316 and 336. In addition, an input/output device 320 may be configured to input information and/or data to the user terminal 210 through the input/output interface 318 or to output information and/or data generated by the user terminal 210.

The memories 312 and 332 may include any non-transitory computer-readable recording medium. According to an embodiment, the memories 312 and 332 may include a permanent mass storage device such as random access memory (RAM), read-only memory (ROM), a disk drive, a solid state drive (SSD), or flash memory. As another example, a permanent mass storage device such as a ROM, SSD, flash memory, or disk drive may be included in the user terminal 210 and/or the synthesized voice generation system 230 as a separate permanent storage device distinct from the memory. In addition, the memories 312 and 332 may store an operating system and at least one program code (for example, code for determining the speaker characteristics of a new speaker, code for generating an output voice in which the speaker characteristics of a new speaker are reflected, and the like).

These software components may be loaded from a computer-readable recording medium separate from the memories 312 and 332. Such a separate computer-readable recording medium may include a recording medium directly connectable to the user terminal 210 and the synthesized voice generation system 230, for example a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card. As another example, the software components may be loaded into the memories 312 and 332 through a communication module rather than a computer-readable recording medium. For example, at least one program may be loaded into the memories 312 and 332 based on a computer program (for example, an artificial neural network text-to-speech synthesis model program) installed from files provided over the network 220 by developers or by a file distribution system that distributes installation files of applications.
The processors 314 and 334 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processors 314 and 334 by the memories 312 and 332 or the communication modules 316 and 336. For example, the processors 314 and 334 may be configured to execute received instructions according to program code stored in a recording device such as the memories 312 and 332.

The communication modules 316 and 336 may provide a configuration or function that allows the user terminal 210 and the synthesized voice generation system 230 to communicate with each other through the network 220, and may provide a configuration or function that allows the user terminal 210 and/or the synthesized voice generation system 230 to communicate with another user terminal or another system (for example, a separate cloud system or a separate frame image generation system). As an example, a request generated by the processor 314 of the user terminal 210 according to program code stored in a recording device such as the memory 312 (for example, a request to generate a synthesized voice, a request to generate the speaker characteristics of a new speaker, and the like) may be transmitted to the synthesized voice generation system 230 through the network 220 under the control of the communication module 316. Conversely, a control signal or command provided under the control of the processor 334 of the synthesized voice generation system 230 may be received by the user terminal 210 through the communication module 316 of the user terminal 210 via the communication module 336 and the network 220.

The input/output interface 318 may be a means for interfacing with the input/output device 320. As an example, the input device may include a device such as a keyboard, a microphone, a mouse, or a camera including an image sensor, and the output device may include a device such as a display, a speaker, or a haptic feedback device. As another example, the input/output interface 318 may be a means for interfacing with a device, such as a touch screen, in which configurations or functions for performing input and output are integrated. For example, when the processor 314 of the user terminal 210 processes instructions of a computer program loaded in the memory 312, a service screen or user interface configured using information and/or data provided by the synthesized voice generation system 230 or another user terminal may be displayed on the display through the input/output interface 318.
Although the input/output device 320 is illustrated in FIG. 3 as not being included in the user terminal 210, the present disclosure is not limited thereto, and the input/output device 320 may be configured as a single device together with the user terminal 210. In addition, the input/output interface 338 of the synthesized voice generation system 230 may be a means for interfacing with a device (not shown) for input or output that is connected to, or may be included in, the synthesized voice generation system 230. Although the input/output interfaces 318 and 338 are illustrated in FIG. 3 as elements configured separately from the processors 314 and 334, the present disclosure is not limited thereto, and the input/output interfaces 318 and 338 may be configured to be included in the processors 314 and 334.

The user terminal 210 and the synthesized voice generation system 230 may include more components than those shown in FIG. 3. However, it is not necessary to clearly illustrate most conventional components. According to an embodiment, the user terminal 210 may be implemented to include at least a part of the input/output device 320 described above. In addition, the user terminal 210 may further include other components such as a transceiver, a Global Positioning System (GPS) module, a camera, various sensors, and a database. For example, when the user terminal 210 is a smartphone, it may include components generally included in a smartphone; for example, various components such as an acceleration sensor, a gyro sensor, a camera module, various physical buttons, buttons using a touch panel, input/output ports, and a vibrator for vibration may be further included in the user terminal 210.

According to an embodiment, the processor 314 of the user terminal 210 may be configured to operate a synthesized voice output application or the like. In this case, code associated with the application and/or program may be loaded into the memory 312 of the user terminal 210. While the application and/or program is running, the processor 314 of the user terminal 210 may receive information and/or data provided from the input/output device 320 through the input/output interface 318, or receive information and/or data from the synthesized voice generation system 230 through the communication module 316, and may process the received information and/or data and store it in the memory 312. In addition, such information and/or data may be provided to the synthesized voice generation system 230 through the communication module 316.
While a program for the synthesized voice output application or the like is running, the processor 314 may receive text entered or selected through an input device 320, such as a touch screen or keyboard, connected to the input/output interface 318, and may store the received text in the memory 312 or provide it to the synthesized voice generation system 230 through the communication module 316 and the network 220. For example, the processor 314 may receive an input of the target text (for example, one or more paragraphs, sentences, phrases, words, phonemes, and the like) through the input device 320. Additionally, the processor 314 may receive, through the input device 320, any information indicating or selecting information about the reference speaker and/or the speech characteristic change information.

According to an embodiment, the processor 314 may receive an input of the target text made through the input device 320 via the input/output interface 318. According to another embodiment, the processor 314 may receive, through the input device 320 and the input/output interface 318, an input for uploading a document-format file containing the target text through the user interface. In response to this input, the processor 314 may receive the document-format file corresponding to the input from the memory 312 and may receive the target text included in the file. The received target text may then be provided to the synthesized voice generation system 230 through the communication module 316. Alternatively, the processor 314 may be configured to provide the uploaded file to the synthesized voice generation system 230 through the communication module 316 and to receive the target text contained in the file from the synthesized voice generation system 230.

The processor 314 may be configured to output processed information and/or data through an output device of the user terminal 210, such as a display-capable device (for example, a touch screen or display) or an audio-output-capable device (for example, a speaker). For example, the processor 314 may output, on the screen of the user terminal 210, the target text and/or information indicating or selecting the speech characteristic change information received from at least one of the input device 320, the memory 312, or the synthesized voice generation system 230. Additionally or alternatively, the processor 314 may output, on the screen of the user terminal 210, the speaker characteristics of the new speaker determined or generated by the synthesized voice generation system 230. In addition, the processor 314 may output the synthesized voice through an audio-output-capable device such as a speaker.
The processor 334 of the synthesized voice generation system 230 may be configured to manage, process, and/or store information and/or data received from a plurality of user terminals, including the user terminal 210, and/or a plurality of external systems. The information and/or data processed by the processor 334 may be provided to the user terminal 210 through the communication module 336. In an embodiment, the processor 334 may receive the target text, information about the reference speaker, and information indicating or selecting the speech characteristic change information from the user terminal 210, the memory 332, and/or an external storage device, and may obtain or determine the speaker characteristics of the reference speaker and the speech characteristic change information contained in the memory 332 and/or the external storage device.

The processor 334 may then determine the speaker characteristics of the new speaker using the speaker characteristics of the reference speaker and the speech characteristic change information. The processor 334 may also generate an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected. For example, the processor 334 may input the target text and the speaker characteristics of the new speaker into the artificial neural network text-to-speech synthesis model and generate the output voice from the model. The output voice generated in this way may be provided to the user terminal 210 through the network 220 and output through a speaker associated with the user terminal 210.

FIG. 4 is a block diagram illustrating the internal configuration of the processor 334 of the synthesized voice generation system according to an embodiment of the present disclosure. As shown, the processor 334 may include a speaker characteristic determination module 410, a synthesized voice output module 420, a speech characteristic change information determination module 430, and an output voice verification module 440. Each of the modules operating on the processor 334 may be configured to communicate with the others. Although the internal configuration of the processor 334 is described in FIG. 4 by function, this does not necessarily mean that the modules are physically separated. In addition, the internal configuration of the processor 334 shown in FIG. 4 is only an example and does not depict only essential configurations. Accordingly, in some embodiments, the processor 334 may be implemented differently, for example by further including components other than those illustrated or by omitting some of the illustrated components.
The speaker characteristic determination module 410 may obtain the speaker characteristic of a reference speaker. According to an embodiment, as described with reference to FIG. 1, the characteristic of the reference speaker may be extracted through a trained artificial neural network speaker characteristic extraction model. For example, the speaker characteristic determination module 410 may input a speaker id (e.g., a speaker one-hot vector) and a vocal characteristic (e.g., a vector) to the trained artificial neural network speaker characteristic extraction model to extract the speaker characteristic (e.g., a vector) of the reference speaker. As another example, the speaker characteristic determination module 410 may input a speech recorded by a speaker and a vocal characteristic (e.g., a vector) to the trained artificial neural network speaker characteristic extraction model to extract the speaker characteristic (e.g., a vector) of the reference speaker.
The speaker characteristic determination module 410 may obtain the speaker characteristic of the reference speaker and vocal characteristic change information, and may determine the speaker characteristic of a new speaker using the obtained speaker characteristic of the reference speaker and the obtained vocal characteristic change information. Here, the speaker characteristic of the reference speaker may be selected as at least one of the speaker characteristics of a plurality of speakers stored in a storage medium. In addition, the vocal characteristic change information may be information indicating a change in the speaker characteristic of the reference speaker, information indicating a change in the speaker characteristics of at least some of the plurality of speakers stored in the storage medium, and/or information indicating a change in the vocal characteristics included in the speaker characteristics of at least some of the plurality of speakers. Here, the speaker characteristics of the plurality of speakers may include characteristics inferred from the trained artificial neural network speaker characteristic extraction model. In addition, each of the speaker characteristic and the vocal characteristic may be expressed in a vector form.
The synthesized speech output module 420 may receive the target text from the user terminal and receive the speaker characteristic of the new speaker from the speaker characteristic determination module 410. The synthesized speech output module 420 may generate an output speech for the target text in which the speaker characteristic of the new speaker is reflected. In an embodiment, the synthesized speech output module 420 may input the target text and the speaker characteristic of the new speaker to a trained artificial neural network text-to-speech synthesis model and generate the output speech (i.e., the synthesized speech) from the model, as sketched below. The artificial neural network text-to-speech synthesis model may be stored in a storage medium (e.g., the memory 332 of the information processing system 230, another storage medium accessible by the processor 334 of the information processing system 230, etc.). Here, the artificial neural network text-to-speech synthesis model may include a model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output speech for a text in which the speaker characteristics of the training speakers are reflected. Then, the synthesized speech output module 420 may provide the generated synthesized speech to the user terminal. Accordingly, the generated synthesized speech may be output through any speaker built into the user terminal 210 or connected to it by wire or wirelessly.
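The following Python sketch only illustrates this module-level data flow under stated assumptions: extract_speaker_feature, apply_change, and tts_model are hypothetical stand-ins for the trained speaker characteristic extraction model, the speaker characteristic determination logic, and the text-to-speech synthesis model described above; they are not part of the disclosure.

```python
import numpy as np

# Hypothetical stand-ins for the trained models described in the text.
# In a real system these would be neural networks; here they are simple
# placeholders so that the module-level data flow can be executed.

def extract_speaker_feature(speaker_id: int, vocal_feature: np.ndarray) -> np.ndarray:
    """Placeholder for the speaker characteristic extraction model."""
    rng = np.random.default_rng(speaker_id)           # deterministic per speaker id
    return rng.standard_normal(256) + 0.01 * vocal_feature.sum()

def apply_change(ref_feature: np.ndarray, change: np.ndarray) -> np.ndarray:
    """Placeholder for determining the new speaker characteristic (module 410)."""
    return ref_feature + change

def tts_model(text: str, speaker_feature: np.ndarray) -> np.ndarray:
    """Placeholder for the artificial neural network TTS model: returns fake audio."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return 0.1 * rng.standard_normal(22050) * np.tanh(speaker_feature[:1])

# Module-level flow: reference speaker characteristic -> change -> synthesis.
target_text = "Hello, this is a synthesized voice."
vocal_feature = np.array([1.0, 30.0, -1.0, 1.0, 1.0])    # e.g. gender/age/tone/rate/strength
r_ref = extract_speaker_feature(speaker_id=7, vocal_feature=vocal_feature)
change = 0.05 * np.ones_like(r_ref)                       # stand-in for vocal characteristic change
r_new = apply_change(r_ref, change)
audio = tts_model(target_text, r_new)
print(audio.shape)    # one second of (placeholder) audio at 22.05 kHz
```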
The vocal characteristic change information determination module 430 may obtain vocal characteristic change information from the memory 332. According to an embodiment, the vocal characteristic change information may be determined from information set through a user input via a user terminal (e.g., the user terminal 210 of FIG. 2). Here, the vocal characteristic change information may include information on the vocal characteristic to be changed in order to generate the speaker to be newly created, that is, the new speaker. Additionally or alternatively, the vocal characteristic change information may include information associated with the speaker characteristic of the reference speaker (e.g., reflection ratio information).
Hereinafter, specific examples are described in which the vocal characteristic change information is determined by the speaker characteristic determination module 410 and the vocal characteristic change information determination module 430, and the characteristic of a new speaker is determined using the determined vocal characteristic change information and the speaker characteristic of the reference speaker.
In an embodiment, the speaker characteristic determination module 410 may input the speaker characteristic of the reference speaker and the vocal characteristic change information to a trained artificial neural network speaker characteristic change generation model to generate a speaker characteristic change, and may output the speaker characteristic of the new speaker by combining the speaker characteristic of the reference speaker with the generated speaker characteristic change. When training this artificial neural network speaker characteristic change generation model, the vocal characteristic information included in each speaker's speaker characteristic is not used as an input; instead, separate vocal characteristic information may be obtained for each speaker. For example, the vocal characteristic information of a given speaker may be obtained through human tagging. As another example, the vocal characteristic information of a given speaker may be obtained through an artificial neural network vocal characteristic extraction model trained to infer a speaker's vocal characteristic from the speaker's speaker characteristic. The vocal characteristic information obtained in this way may be stored in a storage medium. That is, using the artificial neural network speaker characteristic change generation model, it is possible to adjust the speaker characteristic of the reference speaker according to a change in vocal characteristics. This artificial neural network speaker characteristic change generation model may be trained using Equation 1 below.
[Equation 1]

Here, r_ref may refer to the speaker characteristic of the reference speaker, and r_cmp may refer to the speaker characteristic of a comparison speaker. These speaker characteristics may be characteristics extracted through the trained artificial neural network speaker characteristic extraction model. Likewise, C_ref may refer to the vocal characteristic of the reference speaker, and C_cmp may refer to the vocal characteristic of the comparison speaker. These vocal characteristics may be characteristics extracted through a trained artificial neural network vocal characteristic extraction model. That is, the vocal characteristic change information determination module 430 may obtain r_ref, r_cmp, C_ref, and C_cmp from the storage medium and use them to train the artificial neural network speaker characteristic change generation model. In addition, the artificial neural network speaker characteristic change generation model may be trained based on the difference value, i.e., the loss, between the two quantities appearing in Equation 1.
Then, at inference time, the vocal characteristic change information determination module 430 may input the difference between the vocal characteristic of the reference speaker and the vocal characteristic of the comparison speaker, together with the speaker characteristic of the reference speaker, to the trained artificial neural network speaker characteristic change generation model to determine the vocal characteristic change information Δr. The speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker based on the determined vocal characteristic change information Δr and the speaker characteristic r of the reference speaker. This new speaker characteristic may be expressed as Equation 2 below.

[Equation 2]

r_new = r + Δr

Here, r_new may denote the speaker characteristic of the new speaker.
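As a minimal sketch of this inference path, assume a trained change generation model is available as a callable change_model(r_ref, delta_c) returning Δr; the linear placeholder below only stands in for such a model so that the composition of Equation 2 can be run end to end. All names and dimensions here are assumptions, not part of the disclosure.

```python
import numpy as np

DIM_SPEAKER = 256   # assumed dimensionality of a speaker characteristic vector
DIM_VOCAL = 5       # assumed dimensionality of a vocal characteristic vector

rng = np.random.default_rng(0)

# Placeholder for the trained speaker characteristic change generation model:
# it maps (reference speaker characteristic, vocal characteristic difference)
# to a speaker characteristic change. A real system would use a neural network.
W_r = 0.01 * rng.standard_normal((DIM_SPEAKER, DIM_SPEAKER))
W_c = 0.10 * rng.standard_normal((DIM_SPEAKER, DIM_VOCAL))

def change_model(r_ref: np.ndarray, delta_c: np.ndarray) -> np.ndarray:
    return np.tanh(W_r @ r_ref + W_c @ delta_c)

# Inference: the difference of vocal characteristics drives the speaker change.
r_ref = rng.standard_normal(DIM_SPEAKER)          # reference speaker characteristic
c_ref = np.array([1.0, 30.0, -1.0, 1.0, 1.0])     # reference vocal characteristic
c_cmp = np.array([1.0, 30.0, 1.0, 1.0, 1.0])      # desired vocal characteristic (higher tone)

delta_r = change_model(r_ref, c_cmp - c_ref)      # vocal characteristic change information
r_new = r_ref + delta_r                           # Equation 2: new speaker characteristic
print(r_new.shape)
```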
In an embodiment, the vocal characteristic change information determination module 430 may extract a normal vector for a target vocal characteristic using a vocal characteristic classification model corresponding to the target vocal characteristic. To this end, a vocal characteristic classification model corresponding to each of a plurality of vocal characteristics may be generated. Such a vocal characteristic classification model is a hyperplane-based model and may be implemented using, for example, a support vector machine (SVM) or a linear classifier, but is not limited thereto. In addition, the target vocal characteristic is a vocal characteristic selected from among the plurality of vocal characteristics and may refer to the vocal characteristic to be changed and reflected in the speaker characteristic of the new speaker. In addition, the speaker's characteristic may be expressed as a speaker vector.
When training such a vocal characteristic classification model, the vocal characteristic information included in each speaker's speaker characteristic is not used as an input; instead, separate vocal characteristic information may be obtained for each speaker. For example, the vocal characteristic information of a given speaker may be obtained through human tagging. As another example, the vocal characteristic information of a given speaker may be obtained through an artificial neural network vocal characteristic extraction model trained to infer a speaker's vocal characteristic from the speaker's speaker characteristic.
Such a vocal characteristic classification model may be trained through Equation 3 below.
[Equation 3]

Here, l_i may denote the label of the i-th vocal characteristic, w_i may denote the normal vector of the hyperplane that classifies the i-th vocal characteristic, and b may denote a bias.
Then, in order to generate the synthesized speech of the new speaker, the speaker characteristic determination module 410 may obtain, through the trained artificial neural network speaker characteristic extraction model, the speaker characteristic vector r of the reference speaker that is most similar to the new speaker. In addition, the vocal characteristic change information determination module 430 may obtain, from the trained vocal characteristic classification model, the normal vector of the target vocal characteristic and information indicating the degree to which the vocal characteristic is to be adjusted, as the vocal characteristic change information. Using the obtained speaker characteristic vector r of the reference speaker, the normal vector of the target vocal characteristic, and the degree of adjustment, the speaker characteristic r_new of the new speaker may be generated according to Equation 4 below.

[Equation 4]

r_new = r + α·n

Here, n may denote the normal vector of the target vocal characteristic, and α may denote the degree to which the vocal characteristic is adjusted.
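A minimal sketch of this hyperplane-based adjustment, assuming scikit-learn is available and using synthetic data: a linear SVM is fit on speaker vectors labeled with a binary target vocal characteristic, its weight vector is taken as the hyperplane normal n, and the reference speaker vector is shifted along n as in Equation 4. The data, dimensions, and the choice of LinearSVC are illustrative assumptions, not the disclosed training setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
DIM = 64                        # assumed speaker vector dimensionality

# Synthetic speaker vectors: speakers with the target vocal characteristic
# (label 1) are shifted along a hidden direction relative to those without it.
hidden_direction = rng.standard_normal(DIM)
hidden_direction /= np.linalg.norm(hidden_direction)
labels = rng.integers(0, 2, size=200)                        # 0 / 1 per speaker
speaker_vectors = rng.standard_normal((200, DIM)) + 2.0 * labels[:, None] * hidden_direction

# Fit a hyperplane separating the two groups (the vocal characteristic classifier).
clf = LinearSVC(C=1.0, max_iter=10000).fit(speaker_vectors, labels)
normal = clf.coef_[0] / np.linalg.norm(clf.coef_[0])         # unit normal vector n

# Equation 4: move the reference speaker along the normal by a chosen amount alpha.
r_ref = speaker_vectors[labels == 0][0]                      # a speaker without the characteristic
alpha = 1.5                                                  # degree of adjustment
r_new = r_ref + alpha * normal

# The decision value increases, i.e. the new speaker looks more like label 1.
print(clf.decision_function([r_ref, r_new]))
```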
In an embodiment, the speaker characteristic determination module 410 may obtain a plurality of speaker characteristics corresponding to a plurality of reference speakers. In addition, the vocal characteristic change information determination module 430 may obtain a weight set corresponding to the plurality of speaker characteristics and provide the obtained weight set to the speaker characteristic determination module 410. The speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker as in Equation 5 below by applying the weights included in the obtained weight set to each of the plurality of speaker characteristics. That is, the voices of several speakers may be mixed to generate the voice of a new speaker.
[Equation 5]

r_new = Σ_i w_i·r_i

Here, r_i may denote the speaker vector of the i-th speaker, and w_i may denote the weight for the i-th speaker. By applying the constraint Σ_i w_i = 1, the characteristic vectors of several speakers may be mixed into the characteristic vector of a new speaker.
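A short sketch of this weighted mixing under the Σ_i w_i = 1 constraint, with random vectors standing in for stored speaker characteristics; the dimensionality and the use of a softmax to enforce the constraint are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 256                                   # assumed speaker vector dimensionality

speaker_vectors = rng.standard_normal((4, DIM))   # four stored reference speakers

# Raw preference scores for each reference speaker; a softmax turns them into
# weights that are positive and sum to one (the sigma constraint).
scores = np.array([2.0, 0.5, 0.0, -1.0])
weights = np.exp(scores) / np.exp(scores).sum()

# Equation 5: the new speaker characteristic is the weighted mixture.
r_new = weights @ speaker_vectors
print(weights.sum(), r_new.shape)           # 1.0 (256,)
```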
According to an embodiment, the speaker characteristic determination module 410 may generate the characteristic vector of a new speaker through a pre-computed vocal characteristic axis adjustment scheme. For example, a speaker characteristic may include one or more vocal characteristics. The vocal characteristic change information determination module 430 may find a vocal characteristic axis and adjust it. The adjusted vocal characteristic axis may be provided to the speaker characteristic determination module 410 and used to determine the speaker characteristic of the new speaker. That is, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using the speaker characteristic r of the reference speaker, the vocal characteristic axes c_j, and the weights α_j of the vocal characteristic change information, as in Equation 6 below.
[Equation 6]

r_new = r + Σ_j α_j·c_j

Here, c_j may denote the j-th vocal characteristic axis, and α_j may denote the weight for the j-th vocal characteristic. In addition, C may denote a quantitatively expressed vocal characteristic, and c may denote one axis within the speaker characteristic space. For example, when C = {1, 30, -1, 1, 1}, the components of the vocal characteristic C may indicate female (C_gender = 1), age 30 (C_age = 30), a low tone (C_tone = -1), fast speech (C_rate = 1), and strong vocal intensity (C_strength = 1). In addition, c_j may denote one axis in the vocal characteristic space for distinguishing the individual vocal characteristic C_j, and c_j may have the same dimensionality as the speaker representation r.
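The following sketch applies Equation 6 with pre-computed vocal characteristic axes; the axes here are random unit vectors used purely as placeholders for axes obtained by the procedures described below (differences of speakers, differences of group means, or principal components), and the weights are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(3)
DIM = 256                                        # assumed speaker vector dimensionality

r = rng.standard_normal(DIM)                     # reference speaker characteristic

# Placeholder vocal characteristic axes c_j (e.g. tone, speech rate, intensity),
# each with the same dimensionality as the speaker representation.
axes = rng.standard_normal((3, DIM))
axes /= np.linalg.norm(axes, axis=1, keepdims=True)

# Example weights alpha_j: raise the tone a little, speak a bit faster,
# leave intensity unchanged.
alpha = np.array([0.8, 0.3, 0.0])

# Equation 6: r_new = r + sum_j alpha_j * c_j
r_new = r + alpha @ axes
print(r_new.shape)
```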
In order to obtain the vocal characteristic axes c_j as the vocal characteristic change information, the vocal characteristic change information determination module 430 may normalize each of the speaker vectors of the plurality of speakers. In this case, the speaker vectors of the plurality of speakers may be included in the speaker characteristics of the plurality of speakers. For example, the vocal characteristic change information determination module 430 may perform normalization on the speaker vectors of all speakers, R = {r_i | i = 0, ..., N_spk - 1}, where N_spk is the number of speakers. For example, the vocal characteristic change information determination module 430 may perform Z-normalization, which subtracts the mean from the whole data and divides by the standard deviation, or a normalization that only subtracts the mean from the whole data.
[Equation 7]

Here, N(·) denotes the normalization function, and D(·) denotes the inverse normalization function.
Then, the vocal characteristic change information determination module 430 may determine a plurality of principal components by performing a dimensionality reduction analysis on the normalized speaker vectors of the plurality of speakers. Here, the dimensionality reduction analysis may be performed through a conventionally known dimensionality reduction technique such as PCA (Principal Component Analysis), SVD (Singular Value Decomposition), or t-SNE (t-distributed Stochastic Neighbor Embedding). For example, the vocal characteristic change information determination module 430 may determine the plurality of principal components P of Equation 8 below by performing PCA on N(R).
[Equation 8]

P = PCA(N(R)) = {p_k | k = 0, ..., M-1}

Here, p_k may refer to the k-th principal component, and M may refer to the dimensionality of the speaker representation r.
Then, a speech generated using the new speaker characteristic r_new of Equation 9 below may be listened to and evaluated by a person, and a vocal characteristic label may be assigned. The vocal characteristic change information determination module 430 may select at least one principal component from among the determined plurality of principal components. For example, a principal component associated with the vocal characteristic desired to be changed in the speaker characteristic of the new speaker may be selected.

[Equation 9]

Here, r_new may denote the characteristic of the new speaker, p_k may denote the k-th principal component, and p_sel may denote the selected principal component.
That is, the j-th vocal characteristic axis c_j may be determined through the selected principal component p_sel and the inverse normalization function D. The j-th vocal characteristic axis and the weight corresponding to it may be provided to the speaker characteristic determination module 410, so that the speaker characteristic of the new speaker may be generated through Equation 6 above.
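A compact sketch of this normalize, PCA, select-component, denormalize pipeline, assuming plain NumPy: Z-normalization stands in for N(·)/D(·), the principal components are obtained with an SVD, and one selected component is added to a reference speaker before mapping back. The matrix shapes and the scaling factor are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N_SPK, DIM = 500, 64                        # assumed number of speakers and vector size

R = rng.standard_normal((N_SPK, DIM))       # speaker vectors of all speakers

# N(.) : Z-normalization over the whole speaker set; D(.) is its inverse.
mu, sigma = R.mean(axis=0), R.std(axis=0)

def normalize(x):
    return (x - mu) / sigma

def denormalize(x):
    return x * sigma + mu

# PCA on N(R) via SVD: the rows of Vt are the principal components p_k.
_, _, Vt = np.linalg.svd(normalize(R), full_matrices=False)
principal_components = Vt                   # shape (DIM, DIM); p_k = Vt[k]

# Select the component associated with the vocal characteristic to change
# (in practice chosen by listening to speech generated along each component).
p_sel = principal_components[0]
beta = 2.0                                  # how far to move along the component

r_ref = R[0]                                # reference speaker
r_new = denormalize(normalize(r_ref) + beta * p_sel)
print(r_new.shape)
```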
Additionally or alternatively, the vocal characteristic change information determination module 430 may use, instead of the axis c_j used in Equation 6, an axis c'_j obtained through Equation 10, so that interference between the vocal characteristic axes can be removed.
[Equation 10]

Here, c'_j may refer to a vocal characteristic axis obtained by changing some vocal characteristics of c_j. In addition, c_j may denote one axis in the vocal characteristic space for distinguishing the individual vocal characteristic C_j.
The vocal characteristic change information determination module 430 may obtain the speaker vectors of a plurality of speakers whose target vocal characteristics differ. In this case, the speaker vectors of the plurality of training speakers may be included in the speaker characteristics of the plurality of training speakers. In addition, each of the plurality of speakers is assigned a label for one or more vocal characteristics. In an embodiment, a vocal characteristic label l may be assigned to each of the plurality of speakers. Here, the vocal characteristics may include tone, vocal intensity, speech rate, gender, and age. Tone, vocal intensity, and speech rate may each be expressed as an element of the label l, gender may likewise be expressed as an element of the label, and age may be expressed as a numeric element. For example, a label may indicate the vocal characteristics of a 50-year-old male whose tone is low, vocal intensity is medium, and speech rate is high.
[Equation 11]

c_j = r_a - r_b

Then, as in Equation 11 above, the vocal characteristic change information determination module 430 may determine the vocal characteristic axis c_j based on the difference between the speaker vectors r_a and r_b of two speakers whose target vocal characteristics differ. Here, the vocal characteristic axis c_j may be included in the vocal characteristic change information. This vocal characteristic change information may be provided to the speaker characteristic determination module 410 so that the speaker characteristic of the new speaker can be determined using Equation 6 above.
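A minimal sketch of this pairwise construction, assuming two synthetic speaker vectors that differ only along a hidden "tone" direction; the axis is their difference as in Equation 11 and is then applied through the form of Equation 6. Dimensions and scaling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
DIM = 128                                    # assumed speaker vector dimensionality

tone_direction = rng.standard_normal(DIM)    # hidden direction (unknown to the method)

base = rng.standard_normal(DIM)
r_a = base + 1.0 * tone_direction            # speaker labeled "high tone"
r_b = base - 1.0 * tone_direction            # matched speaker labeled "low tone"

# Equation 11: the vocal characteristic axis is the difference of the two vectors.
c_tone = r_a - r_b

# Equation 6 with a single axis: push a reference speaker toward a higher tone.
r_ref = rng.standard_normal(DIM)
r_new = r_ref + 0.5 * c_tone
print(np.dot(r_new - r_ref, tone_direction) > 0)   # True: moved along the tone direction
```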
In another embodiment, the vocal characteristic change information determination module 430 may determine the vocal characteristic change information based on the difference between the average speaker vectors of a plurality of speaker groups. As described in connection with Equation 11 above, the speaker characteristics of the plurality of speakers include the speaker vectors of the plurality of speakers, and each of the speaker characteristics of the plurality of speakers is assigned a label for one or more vocal characteristics.
The vocal characteristic change information determination module 430 may obtain the speaker vectors of the speakers included in each of a plurality of speaker groups whose target vocal characteristics differ. Here, the groups of the plurality of training speakers may include a first speaker group and a second speaker group.
Then, the vocal characteristic change information determination module 430 may calculate the average of the speaker vectors of the speakers included in the first speaker group and the average of the speaker vectors of the speakers included in the second speaker group, and may determine the vocal characteristic axis c_j based on the difference between the average speaker vector corresponding to the first speaker group and the average speaker vector corresponding to the second speaker group, as in Equation 12. The determined vocal characteristic axis c_j may be included in the vocal characteristic change information.
[Equation 12]

c_j = mean(R_1) - mean(R_2)

Here, R_1 and R_2 may denote the sets of speaker vectors of the first speaker group and the second speaker group, respectively.
Then, this vocal characteristic change information may be provided to the speaker characteristic determination module 410 so that the speaker characteristic of the new speaker can be determined using Equation 6 above.
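The group-mean variant can be sketched in a few lines of NumPy under the same assumptions as above: two labeled groups of synthetic speaker vectors give an axis as the difference of their means (Equation 12), which is then applied as in Equation 6.

```python
import numpy as np

rng = np.random.default_rng(6)
DIM = 128                                     # assumed speaker vector dimensionality

strength_direction = rng.standard_normal(DIM) # hidden "vocal intensity" direction

# Two groups of speakers labeled strong (group 1) and soft (group 2).
group_1 = rng.standard_normal((40, DIM)) + 1.0 * strength_direction
group_2 = rng.standard_normal((60, DIM)) - 1.0 * strength_direction

# Equation 12: the axis is the difference of the group means.
c_strength = group_1.mean(axis=0) - group_2.mean(axis=0)

# Equation 6 with a single axis: make a reference speaker sound a bit softer.
r_ref = rng.standard_normal(DIM)
r_new = r_ref - 0.3 * c_strength
print(r_new.shape)
```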
According to an embodiment, the vocal characteristic change information determination module 430 may input the speaker characteristics r_i of a plurality of speakers to an artificial neural network vocal characteristic prediction model F, as in Equation 13 below, and output the vocal characteristic C_i of each of the plurality of speakers. Here, the vocal characteristic change information determination module 430 may select or determine, from among the speaker characteristics of the plurality of speakers, a speaker characteristic r_sel for which a difference from the reference speaker exists in the j-th vocal characteristic among the output vocal characteristics, and no difference exists in the remaining vocal characteristics. This speaker characteristic r_sel may be provided to the speaker characteristic determination module 410.

[Equation 13]
In addition, the speaker characteristic determination module 410 may obtain a weight corresponding to the speaker characteristic of the selected speaker. Then, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and the weight corresponding to the speaker characteristic of the selected speaker. For example, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using Equation 14 below.
[Equation 14]

Here, r_new is the speaker characteristic of the new speaker, r may refer to the speaker characteristic of the reference speaker, r_sel may refer to the speaker characteristic of the selected speaker, and w may refer to the weight corresponding to the speaker characteristic of the selected speaker.
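Below is a sketch of this selection-plus-weighting approach. The vocal characteristic prediction model F is replaced by a hypothetical rule-based stub, and since Equation 14 is specified here only through its inputs (r, r_sel, and w), the combination step uses an assumed weighted interpolation purely for illustration; it is not presented as the disclosed formula.

```python
import numpy as np

rng = np.random.default_rng(7)
N_SPK, DIM, N_VOCAL = 50, 64, 5        # assumed sizes: speakers, vector dim, vocal features

speaker_vectors = rng.standard_normal((N_SPK, DIM))

# Hypothetical stand-in for the vocal characteristic prediction model F:
# it projects each speaker vector onto fixed directions and keeps only the sign,
# giving coarse per-feature labels.
projection = rng.standard_normal((DIM, N_VOCAL))

def predict_vocal(r: np.ndarray) -> np.ndarray:
    return np.sign(r @ projection)

vocal = np.array([predict_vocal(r) for r in speaker_vectors])

r_ref = speaker_vectors[0]
c_ref = predict_vocal(r_ref)
j = 2                                   # index of the target vocal characteristic

# Select a speaker whose predicted vocal characteristics differ from the
# reference only in the j-th (target) characteristic.
differs_in_j = vocal[:, j] != c_ref[j]
same_elsewhere = (np.delete(vocal, j, axis=1) == np.delete(c_ref, j)).all(axis=1)
candidates = np.where(differs_in_j & same_elsewhere)[0]

if candidates.size:
    r_sel = speaker_vectors[candidates[0]]
    w = 0.5                             # weight for the selected speaker
    r_new = (1.0 - w) * r_ref + w * r_sel   # assumed combination, for illustration only
    print("selected speaker:", int(candidates[0]), "new vector shape:", r_new.shape)
else:
    print("no speaker differs only in the target vocal characteristic")
```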
The output speech verification module 440 may determine whether the output speech associated with the speaker characteristic of the new speaker is a new output speech that has not been previously stored. According to an embodiment, the output speech verification module 440 may calculate a hash value corresponding to the speaker characteristic (e.g., the speaker characteristic vector) of the new speaker using a hash function. In another embodiment, instead of calculating the hash value from the previously determined speaker characteristic of the new speaker, the output speech verification module 440 may extract the speaker characteristic from the new output speech and calculate the hash value using the extracted speaker characteristic.
Then, the output speech verification module 440 may determine whether, among the contents of the plurality of speakers stored in the storage medium, there is content associated with a hash value similar to the calculated hash value. If there is no content associated with a hash value similar to the calculated hash value, the output speech verification module 440 may determine that the output speech associated with the speaker characteristic of the new speaker is a new output speech. When it is determined to be a new output speech in this way, the synthesized speech in which the speaker characteristic of the new speaker is reflected may be set to be used.
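The disclosure does not fix a particular hash function, so the sketch below assumes a similarity-preserving scheme (random-hyperplane SimHash) so that "similar hash values" can be compared by Hamming distance; the signature length, threshold, and storage layout are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
DIM, N_BITS = 256, 64                         # assumed vector size and signature length

# Random hyperplanes shared by all hashes (SimHash-style locality-sensitive hash).
planes = rng.standard_normal((N_BITS, DIM))

def speaker_hash(r: np.ndarray) -> np.ndarray:
    """Binary signature of a speaker characteristic vector."""
    return (planes @ r > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

# Hashes of speaker characteristics already associated with stored contents.
stored = [speaker_hash(rng.standard_normal(DIM)) for _ in range(100)]

r_new = rng.standard_normal(DIM)
h_new = speaker_hash(r_new)

# The output speech is treated as new only if no stored hash is similar enough.
THRESHOLD = 10                                # max differing bits to count as "similar"
is_new = all(hamming(h_new, h) > THRESHOLD for h in stored)
print("new output speech:", is_new)
```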
FIG. 5 is a flowchart illustrating a method 500 of generating an output speech in which the speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure. In an embodiment, the method 500 may be performed by a processor (e.g., the processor 314 of the user terminal 210 and/or the processor 334 of the synthesized speech generation system 230). As illustrated, the method 500 may be initiated by the processor receiving a target text (S510).
Then, the processor may obtain the speaker characteristic of a reference speaker corresponding to the reference speaker (S520). In an embodiment, the speaker characteristic of the reference speaker may include a speaker vector. Additionally or alternatively, the speaker characteristic of the reference speaker may include the vocal characteristic of the reference speaker. According to another embodiment, the speaker characteristic of the reference speaker may include a plurality of speaker characteristics corresponding to a plurality of reference speakers. Here, the plurality of speaker characteristics may include a plurality of speaker vectors.

Then, the processor may obtain vocal characteristic change information (S530). To this end, the processor may obtain the speaker characteristics of a plurality of speakers. Here, the speaker characteristics of the plurality of speakers may include a plurality of speaker vectors.

According to an embodiment, the processor may perform normalization on each of the speaker vectors of the plurality of speakers and perform a dimensionality reduction analysis on the normalized speaker vectors, thereby determining a plurality of principal components. At least one principal component may be selected from among the plurality of principal components determined in this way. Then, the processor may determine the vocal characteristic change information using the selected principal component.

According to another embodiment, the processor may obtain the speaker vectors of a plurality of speakers whose target vocal characteristics differ and determine the vocal characteristic change information based on the difference between the obtained speaker vectors. According to yet another embodiment, the processor may obtain the speaker vectors of the speakers included in each of a plurality of speaker groups whose target vocal characteristics differ. Here, the plurality of speaker groups may include a first speaker group and a second speaker group. Then, the processor may calculate the average of the speaker vectors of the speakers included in the first speaker group and the average of the speaker vectors of the speakers included in the second speaker group. The processor may determine the vocal characteristic change information based on the difference between the average speaker vector corresponding to the first speaker group and the average speaker vector corresponding to the second speaker group.

In yet another embodiment, the processor may input the speaker characteristics of the plurality of speakers to an artificial neural network vocal characteristic prediction model and output the vocal characteristic of each of the plurality of speakers. Then, the processor may select, from among the speaker characteristics of the plurality of speakers, the speaker characteristic of a speaker for which a difference exists between the target vocal characteristic among that speaker's output vocal characteristics and the target vocal characteristic among the plurality of vocal characteristics of the reference speaker, and may obtain a weight corresponding to the speaker characteristic of the selected speaker. Here, the speaker characteristic of the selected speaker and the weight corresponding to it may be obtained as the vocal characteristic change information.

According to yet another embodiment, the processor may extract a normal vector for the target vocal characteristic using a vocal characteristic classification model corresponding to the target vocal characteristic. Here, the normal vector may refer to the normal vector of the hyperplane that classifies the target vocal characteristic, and the processor may obtain information indicating the degree to which the target vocal characteristic is to be adjusted. The extracted normal vector and the information indicating the degree of adjustment of the target vocal characteristic may be obtained as the vocal characteristic change information.
Then, the processor may determine the speaker characteristic of the new speaker using the obtained speaker characteristic of the reference speaker and the obtained vocal characteristic change information (S540).

According to an embodiment, the processor may input the speaker characteristic of the reference speaker and the obtained vocal characteristic change information to an artificial neural network speaker characteristic change generation model to generate a speaker characteristic change, and may output the speaker characteristic of the new speaker by combining the speaker characteristic of the reference speaker with the generated speaker characteristic change. Here, the artificial neural network speaker characteristic change generation model may be trained using the speaker characteristics of a plurality of training speakers and a plurality of vocal characteristics included in those speaker characteristics.

In another embodiment, the processor may determine the speaker characteristic of the new speaker by applying the weights included in the obtained weight set to each of the plurality of speaker characteristics. In yet another embodiment, the processor may determine the characteristic of the new speaker using the speaker characteristic of the reference speaker, the vocal characteristic change information, and the weights of the vocal characteristic change information. According to yet another embodiment, the processor may determine the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and the weight corresponding to the speaker characteristic of the selected speaker. According to yet another embodiment, the processor may determine the speaker characteristic of the new speaker based on the speaker vector of the reference speaker, the extracted normal vector, and the degree to which the target vocal characteristic is to be adjusted.

Then, the processor may input the target text and the determined speaker characteristic of the new speaker to an artificial neural network text-to-speech synthesis model to generate an output speech for the target text in which the determined speaker characteristic of the new speaker is reflected (S550). Here, the artificial neural network text-to-speech synthesis model may include a model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output speech for the plurality of training text items in which the speaker characteristics of the training speakers are reflected.

According to an embodiment, the processor may calculate a hash value corresponding to the speaker characteristic vector using a hash function. Here, the speaker characteristic vector may be included in the speaker characteristic of the new speaker. Then, the processor may determine whether, among the contents of the plurality of speakers stored in the storage medium, there is content associated with a hash value similar to the calculated hash value. If there is no content associated with a similar hash value, the processor may determine that the output speech associated with the speaker characteristic of the new speaker is a new output speech.
According to an embodiment, a speech synthesizer trained using training data including the synthesized speech of a new speaker generated according to the above-described method may be provided. Here, the speech synthesizer may be any speech synthesizer that can be trained using training data including the synthesized speech of the new speaker generated according to the above-described method. For example, the speech synthesizer may include any text-to-speech (TTS) model trained using such training data. Here, the TTS model may be implemented as a machine learning model or an artificial neural network model known in the art.

Since such a speech synthesizer is trained with the synthesized speech of the new speaker as training data, when a target text is input, the target text may be output as the synthesized speech of the new speaker. According to an embodiment, such a speech synthesizer may be included in or implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2.

According to an embodiment, an apparatus for providing synthesized speech may be provided, including a memory configured to store the synthesized speech of a new speaker generated according to the above-described method, and at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory, wherein the at least one program includes instructions for outputting at least a part of the synthesized speech of the new speaker stored in the memory. For example, such an apparatus for providing synthesized speech may refer to any apparatus that stores a previously generated synthesized speech of a new speaker and provides at least a part of the stored synthesized speech.

According to an embodiment, such an apparatus for providing synthesized speech may be implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2. Specifically, the apparatus for providing synthesized speech may be implemented as, but is not limited to, a video system, an ARS system, a game system, a sound pen, or the like. For example, when such an apparatus is provided in the information processing system 230, at least a part of the output synthesized speech of the new speaker may be provided to a user terminal device connected to the information processing system 230 by wire or wirelessly. Specifically, the information processing system 230 may provide at least a part of the output synthesized speech of the new speaker in a streaming manner.

According to an embodiment, a method of providing the synthesized speech of a new speaker may be provided, including storing the synthesized speech of the new speaker generated according to the above-described method and providing at least a part of the stored synthesized speech. This method may be executed by the processor of the user terminal 210 and/or the processor of the information processing system 230 of FIG. 2. This method may be provided for a service that provides the synthesized speech of a new speaker. For example, such a service may be implemented as, but is not limited to, a video system, an ARS system, a game system, a sound pen, or the like.
FIG. 6 is a diagram illustrating an example of generating an output speech in which the speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure. In an embodiment, the artificial neural network text-to-speech synthesis model may include an encoder 610, an attention 620, and a decoder 630.
The encoder 610 may receive the target text 640. The encoder 610 may be configured to generate pronunciation information for the input target text 640 (e.g., phoneme information for the target text, a vector for each of a plurality of phonemes included in the target text, etc.). In an embodiment, the encoder 610 may convert the target text 640 into character embeddings. For example, in the encoder 610, the generated character embeddings may be passed through a pre-net including a fully-connected layer. In addition, the encoder 610 may provide the output of the pre-net to a CBHG module to output encoder hidden states. For example, the CBHG module may include a 1D convolution bank, max pooling, a highway network, and a bidirectional gated recurrent unit (GRU). The pronunciation information generated by the encoder 610 may be provided to the attention 620, and the attention 620 may connect or combine the provided pronunciation information with speech data corresponding to the pronunciation information. For example, the attention 620 may be configured to determine from which part of the input text to generate speech.
The connected pronunciation information and the voice data corresponding to the pronunciation information may be provided to the decoder 630. The decoder 630 may be configured to generate voice data 660 corresponding to the target text 640 based on the connected pronunciation information and the corresponding voice data.
According to an embodiment, the decoder 630 may receive the speaker characteristic of the new speaker (658) and generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected. Here, the speaker characteristic of the new speaker (658) may be generated through the vocal characteristic change module 656. For example, the vocal characteristic change module 656 may be implemented using the algorithm and/or the artificial neural network model described with reference to FIG. 4.
According to an embodiment, the artificial neural network speaker feature extraction model 650 may obtain the speaker characteristic r of the reference speaker based on speaker identification information i (for example, a speaker one-hot vector) 652 and the speaker's vocal characteristic C 654. Here, the vocal characteristic C 654 and the speaker characteristic r may be expressed in vector form. The artificial neural network speaker feature extraction model 650 may be trained by receiving a plurality of training speaker IDs and a plurality of training vocal characteristics (e.g., vectors) to extract the ground-truth speaker vector of the corresponding reference speaker. Using the reference speaker characteristic r generated in this way and the input information d 655 associated with the vocal characteristic change information, the vocal characteristic change information is determined through the vocal characteristic change module 656, and the speaker characteristic of the new speaker (658) is then determined. The input information d 655 associated with the vocal characteristic change information may include any information that is to be reflected in, or changed for, the new speaker.
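For illustration only, one way a vocal characteristic change module could map the reference speaker characteristic r and the change information d to a new speaker characteristic is sketched below. The additive residual form and the dimensions are assumptions; FIG. 4 of the description defines the actual alternatives (weighted reference speakers, principal-component directions, hyperplane normals, etc.).

```python
import torch
import torch.nn as nn

class VocalCharacteristicChangeModule(nn.Module):
    """Hypothetical neural variant: a small MLP predicts a change of the speaker vector,
    which is added to the reference speaker characteristic r. Sizes are illustrative."""
    def __init__(self, speaker_dim=256, change_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speaker_dim + change_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, speaker_dim),
        )

    def forward(self, r, d):
        delta = self.net(torch.cat([r, d], dim=-1))  # predicted speaker characteristic change
        return r + delta                             # speaker characteristic of the new speaker

r = torch.randn(1, 256)   # reference speaker characteristic (vector form)
d = torch.randn(1, 16)    # input information associated with the desired change
new_speaker = VocalCharacteristicChangeModule()(r, d)
```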
In an embodiment, the decoder 630 may include a pre-net composed of fully-connected layers, an attention recurrent neural network (RNN) including a gated recurrent unit (GRU), and a decoder RNN including a residual GRU. The voice data 660 output from the decoder 630 may be expressed as a mel-scale spectrogram. In this case, the output of the decoder 630 may be provided to a post-processing processor (not shown). The CBHG of the post-processing processor may be configured to convert the mel-scale spectrogram of the decoder 630 into a linear-scale spectrogram. For example, the output signal of the CBHG of the post-processing processor may include a magnitude spectrogram. The phase of the output signal of the CBHG of the post-processing processor may be restored through the Griffin-Lim algorithm and subjected to an inverse short-time Fourier transform, and the post-processing processor may then output a voice signal in the time domain. As another example, the post-processing processor may be implemented using a GAN-based vocoder.
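For illustration only, the post-processing path (mel-scale spectrogram to linear-scale magnitude spectrogram, then Griffin-Lim phase restoration and inverse STFT to a time-domain signal) can be approximated with librosa as in the sketch below. The STFT parameters are assumptions, and the mel-to-linear step here uses librosa's pseudo-inverse mel filter bank rather than the trained post-processing CBHG described above.

```python
import librosa

def mel_to_waveform(mel_spec, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
    """mel_spec: (n_mels, frames) magnitude mel-scale spectrogram."""
    linear_mag = librosa.feature.inverse.mel_to_stft(
        mel_spec, sr=sr, n_fft=n_fft, power=1.0
    )  # approximate linear-scale magnitude spectrogram
    # Griffin-Lim restores the phase; the inverse short-time Fourier transform is applied internally.
    return librosa.griffinlim(
        linear_mag, n_iter=n_iter, hop_length=hop_length, win_length=n_fft
    )  # time-domain voice signal
```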
To generate or train such an artificial neural network text-to-speech synthesis model, the processor may use a database including training text items, speaker characteristics of a plurality of training speakers, and training voice data items that correspond to the training text items and reflect those speaker characteristics. Based on a training text item, the speaker characteristic of a training speaker, and the training voice data item corresponding to the training text item, the processor may train the artificial neural network text-to-speech synthesis model to output a synthesized voice in which the speaker characteristic of the training speaker is reflected.
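For illustration only, one training step of this kind of model could look like the sketch below: the database supplies (training text item, speaker characteristic, target speech) triples, and the model is optimized to reproduce the target speech conditioned on the speaker characteristic. The L1 spectrogram loss and the `tts_model(text_ids, speaker_feature)` interface are assumptions, not the disclosed training objective.

```python
import torch
import torch.nn.functional as F

def train_step(tts_model, optimizer, text_ids, speaker_feature, target_mel):
    """One optimization step.

    text_ids:        (batch, text_len)        training text item as symbol ids
    speaker_feature: (batch, spk_dim)          speaker characteristic of the training speaker
    target_mel:      (batch, frames, n_mels)   training voice data item for that text and speaker
    """
    optimizer.zero_grad()
    predicted_mel = tts_model(text_ids, speaker_feature)  # synthesized speech features
    loss = F.l1_loss(predicted_mel, target_mel)           # reconstruction loss
    loss.backward()
    optimizer.step()
    return loss.item()
```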
Through the artificial neural network text-to-speech synthesis model generated/trained in this way, the processor may generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected. In an embodiment, the processor may generate a synthesized voice based on the voice data 660 output when the target text 640 and the speaker characteristic of the new speaker (658) are input to the artificial neural network text-to-speech synthesis model. The synthesized voice generated in this way may include a voice uttering the target text 640 in which the input speaker characteristic of the new speaker (658) is reflected.
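At inference time the same model is conditioned on the newly determined speaker characteristic instead of a training speaker's characteristic. A short sketch, with the tokenized text, model, and vocoder interfaces assumed:

```python
import torch

@torch.no_grad()
def synthesize(tts_model, vocoder, text_ids, new_speaker_feature):
    mel = tts_model(text_ids, new_speaker_feature)  # output voice data reflecting the new speaker
    return vocoder(mel)                             # e.g., Griffin-Lim post-processing or a GAN-based vocoder
```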
Although FIG. 6 illustrates the attention 620 and the decoder 630 as separate components, the present disclosure is not limited thereto. For example, the decoder 630 may include the attention 620. In addition, although FIG. 6 shows the speaker characteristic of the new speaker (658) being input to the decoder 630, the present disclosure is not limited thereto. For example, the speaker characteristic of the new speaker (658) may be input to the encoder 610 and/or the attention 620.
FIG. 7 is a diagram illustrating an example of generating an output voice in which the speaker characteristic of a new speaker is reflected, according to another embodiment of the present disclosure. The encoder 710, the attention 720, and the decoder 730 illustrated in FIG. 7 may perform functions similar to those of the encoder 610, the attention 620, and the decoder 630 illustrated in FIG. 6, respectively. Accordingly, descriptions that overlap with FIG. 6 are omitted.
In an embodiment, the encoder 710 may receive the target text 740 as input and may be configured to generate pronunciation information for the input target text 740 (for example, phoneme information for the target text, a vector for each of a plurality of phonemes included in the target text, etc.). The pronunciation information generated by the encoder 710 may be provided to the attention 720, and the attention 720 may connect the pronunciation information with voice data corresponding to the pronunciation information. The connected pronunciation information and the corresponding voice data may be provided to the decoder 730, which may be configured to generate voice data 760 corresponding to the target text 740 based on them.
In an embodiment, the decoder 730 may receive the speaker characteristic of the new speaker (758) and generate an output voice for the target text in which the speaker characteristic of the new speaker is reflected. Here, the speaker characteristic of the new speaker (758) may be generated through the vocal characteristic change module 756. For example, the vocal characteristic change module 756 may be implemented using the algorithm and/or the artificial neural network model described with reference to FIG. 4.
According to an embodiment, the artificial neural network speaker feature extraction model 750 may output speaker identification information i 753 based on speech recorded by the speaker 752 and a vocal feature set C 754, and may also obtain the speaker characteristic r of the reference speaker. Here, the vocal feature set may include one or more vocal features c, and both the vocal feature set C 754 and the speaker characteristic r may be expressed in vector form. In addition, the artificial neural network speaker feature extraction model may be trained by receiving speech recorded by a plurality of training speakers and a plurality of training vocal features (e.g., vectors) to extract the ground-truth speaker vector of the corresponding reference speaker. The vocal characteristic change module 756 determines the vocal characteristic change information using the reference speaker characteristic r generated in this way and the input information d 755 associated with the vocal characteristic change information, and then determines the speaker characteristic of the new speaker (758). The input information d 755 associated with the vocal characteristic change information may include any information that is to be reflected in, or changed for, the new speaker.
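FIG. 7 differs from FIG. 6 in that the speaker feature extraction starts from recorded speech rather than a speaker ID. For illustration only, a minimal sketch of such an extractor is given below: a d-vector-style encoder that averages frame-level states into a single speaker vector. The mel front end, the recurrent layer size, and the way the vocal feature set C is concatenated are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerFeatureExtractor(nn.Module):
    """Maps recorded speech (as a mel spectrogram) plus a vocal feature set C to a speaker vector r."""
    def __init__(self, n_mels=80, vocal_dim=16, hidden=256, speaker_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden + vocal_dim, speaker_dim)

    def forward(self, mel, vocal_features):
        # mel: (batch, frames, n_mels); vocal_features: (batch, vocal_dim)
        states, _ = self.rnn(mel)
        utterance = states.mean(dim=1)  # average over time -> utterance-level embedding
        return self.proj(torch.cat([utterance, vocal_features], dim=-1))  # speaker characteristic r
```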
To generate or train such an artificial neural network text-to-speech synthesis model, the processor may use a database including pairs of training text items and training voice data items that correspond to the training text items and reflect the speaker characteristics of a plurality of training speakers. Based on the speaker characteristic of a training speaker and the training voice data item corresponding to a training text item, the processor may train the artificial neural network text-to-speech synthesis model to output a synthesized voice 760 in which the speaker characteristic of the new speaker is reflected.
Through the artificial neural network text-to-speech synthesis model generated/trained in this way, the processor may generate the output voice 760 in which the speaker characteristic of the new speaker is reflected. In an embodiment, the processor may generate a synthesized voice based on the voice data 760 output when the target text 740 and the speaker characteristic of the new speaker (758) are input to the artificial neural network text-to-speech synthesis model. The synthesized voice generated in this way may include a voice uttering the target text 740 in accordance with the input speaker characteristic of the new speaker (758).
Although FIG. 7 illustrates the attention 720 and the decoder 730 as separate components, the present disclosure is not limited thereto. For example, the decoder 730 may include the attention 720. In addition, although FIG. 7 shows the speaker characteristic of the new speaker being input to the decoder 730, the present disclosure is not limited thereto. For example, the speaker characteristic of the new speaker may be input to the encoder 710 and/or the attention 720.
In FIGS. 6 and 7, the target text is illustrated as being expressed as a single input data item (for example, a vector), and a single output data item (for example, a mel-scale spectrogram) is shown being output through the artificial neural network text-to-speech synthesis model. However, the present disclosure is not limited thereto, and any number of input data items may be input to the artificial neural network text-to-speech synthesis model to output any number of output data items.
FIG. 8 is an exemplary diagram illustrating a user interface 800 for generating an output voice in which the speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure. A user terminal (e.g., the user terminal 210) may output a synthesized voice reflecting the speaker characteristic of the new speaker through the user interface 800. The user interface 800 may include a text area 810, a vocal characteristic adjustment area 820, a speaker characteristic adjustment area 830, and an output voice display area 840. In the following, the processor may be the processor 314 of the user terminal 210 and/or the processor 334 of the information processing system 230.
The processor may receive the target text through a user input using an input interface (for example, a keyboard, a mouse, a microphone, etc.) and display the received target text in the text area 810. Alternatively, the processor may receive a document file containing text, extract the text from the document file, and display the extracted text in the text area 810. The text displayed in the text area 810 may then be the target to be uttered through the synthesized voice.
One or more reference speakers may be selected in response to a user input selecting one or more of the reference speakers displayed in the speaker characteristic adjustment area 830. The processor may then receive weights (e.g., reflection ratios) for the speaker characteristics of the selected reference speakers as vocal characteristic change information. For example, through input in the speaker characteristic adjustment area 830, the processor may receive a weight for each of the speaker characteristics of the one or more reference speakers in Equation 5 described with reference to FIG. 4. As illustrated, six reference speakers, 'Eun-Byul Ko', 'Soo-Min Kim', 'Woo-Rim Lee', 'Do-Young Song', 'Seong-Soo Shin', and 'Jin-Kyung Shin', may be presented in the speaker characteristic adjustment area 830. That is, the user selects one or more of the six reference speakers and adjusts the reflection-ratio control (e.g., a bar) corresponding to each selected reference speaker, thereby determining the ratio at which the speaker characteristic of each selected reference speaker is reflected in the speaker characteristic of the new speaker. Alternatively, one or more of the six reference speakers may be selected at random.
The reflection ratios for the selected reference speakers may be received such that their total is 100. Alternatively, even if the reflection ratios entered for the selected reference speakers sum to more or less than 100, each ratio may be automatically adjusted so that the total becomes 100. Although six reference speakers are used in FIG. 8 to generate the speaker characteristic of the new speaker, the present disclosure is not limited thereto, and five or fewer reference speakers or seven or more reference speakers may be displayed in the speaker characteristic adjustment area 830 and used to generate the speaker characteristic of the new speaker.
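For illustration only, the behavior just described, rescaling the reflection ratios so they sum to 100 and then blending the selected reference speakers' characteristic vectors, can be sketched as follows. Blending by a simple weighted average is an assumption here; the actual combination is defined by Equation 5 in the description of FIG. 4.

```python
import numpy as np

def blend_reference_speakers(speaker_vectors, ratios):
    """speaker_vectors: list of (dim,) arrays for the selected reference speakers.
    ratios: user-entered reflection ratios; automatically rescaled so the total is 100."""
    ratios = np.asarray(ratios, dtype=float)
    ratios = 100.0 * ratios / ratios.sum()      # auto-adjust so the total becomes 100
    weights = ratios / 100.0
    return np.tensordot(weights, np.stack(speaker_vectors), axes=1)  # blended speaker characteristic

# e.g., blending two of the displayed reference speakers entered at 70 : 50 (rescaled to about 58 : 42)
r_base = blend_reference_speakers([np.random.rand(256), np.random.rand(256)], [70, 50])
```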
The processor may receive, through the vocal characteristic adjustment area 820, a weight (e.g., a reflection ratio) for each of a plurality of vocal characteristics as vocal characteristic change information. According to an embodiment, through input in the vocal characteristic adjustment area 820, the processor may receive a weight for each of the plurality of vocal characteristics in Equation 6 described with reference to FIG. 4. Here, r in Equation 6 may be the reference speaker characteristic generated according to the selection of one or more reference speakers and their reflection ratios in the speaker characteristic adjustment area 830; for example, r may be the result value of Equation 5 described in FIG. 4, obtained through the input in the speaker characteristic adjustment area 830.
In another embodiment, the vocal characteristics received through input in the vocal characteristic adjustment area 820 and the weights for those vocal characteristics may be used as the vocal characteristics for finding the corresponding term in Equation 13. Here, the speaker characteristic term in Equation 13 may be the result value of Equation 5 described in FIG. 4, obtained through the input in the speaker characteristic adjustment area 830.
In the present disclosure, gender, vocal tone, vocal strength, male age, female age, pitch, and tempo may be presented in the vocal characteristic adjustment area 820 as quantitatively adjustable vocal characteristics. By adjusting the ratio control (e.g., a bar) corresponding to each of the plurality of vocal characteristics according to user input, the ratio at which each vocal characteristic is reflected in the speaker characteristic of the new speaker can be determined. For example, if the bar corresponding to a vocal characteristic is set to 0, that vocal characteristic is not reflected in the speaker characteristic of the new speaker at all. Although seven vocal characteristics are used in FIG. 8 to generate the speaker characteristic of the new speaker, the present disclosure is not limited thereto, and six or fewer vocal characteristics, or additional vocal characteristics, may be displayed in the vocal characteristic adjustment area 820 and used to generate the speaker characteristic of the new speaker.
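For illustration only, one plausible way to apply the per-feature sliders is to move the blended speaker vector along per-feature direction vectors scaled by the slider values; a slider at 0 then leaves that vocal characteristic untouched. The direction vectors and the linear scaling are assumptions here; the description obtains the change directions as in FIG. 4 (for example, from differences of speaker-vector means or hyperplane normal vectors).

```python
import numpy as np

def apply_vocal_characteristic_sliders(base_speaker, feature_directions, slider_values):
    """base_speaker: (dim,) blended reference speaker characteristic.
    feature_directions: dict name -> (dim,) direction vector for that vocal characteristic.
    slider_values: dict name -> weight; a value of 0 contributes nothing."""
    new_speaker = base_speaker.copy()
    for name, weight in slider_values.items():
        if weight:
            new_speaker += weight * feature_directions[name]
    return new_speaker

sliders = {"pitch": 0.3, "tempo": 0.0, "vocal_tone": -0.2}  # hypothetical slider settings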
The processor may then receive the speaker characteristics of the one or more reference speakers selected in the speaker characteristic adjustment area 830, and generate the speaker characteristic of the new speaker using the vocal characteristic adjustment information, which includes the weights entered in the speaker characteristic adjustment area 830 and/or the weights entered in the vocal characteristic adjustment area 820. Any one of the methods described with reference to FIG. 4 may be used to generate the speaker characteristic of the new speaker. The processor may then input the target text and the generated speaker characteristic of the new speaker to the artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristic of the new speaker is reflected. For example, when the inputs in the text area 810, the vocal characteristic adjustment area 820, and the speaker characteristic adjustment area 830 are complete and the 'Generate' button located below the vocal characteristic adjustment area 820 is selected or clicked, an output voice for the target text in which the speaker characteristic of the new speaker is reflected may be generated. The generated output voice may be played through a speaker connected to the user terminal, and the playback time and/or position of the output voice may be displayed in the output voice display area 840.
FIG. 9 is a structural diagram illustrating an artificial neural network model 900 according to an embodiment of the present disclosure. In machine learning and cognitive science, the artificial neural network model 900 is a statistical learning algorithm implemented based on the structure of biological neural networks, or a structure that executes such an algorithm. According to an embodiment, the artificial neural network model 900 may represent a machine learning model with problem-solving ability in which nodes, that is, artificial neurons forming a network through synaptic connections as in a biological neural network, repeatedly adjust their synaptic weights so that the error between the correct output for a given input and the inferred output is reduced. For example, the artificial neural network model 900 may include any probabilistic model, neural network model, or the like used in artificial intelligence learning methods such as machine learning and deep learning. In the present disclosure, the artificial neural network model 900 may include the above-described artificial neural network text-to-speech synthesis model, the above-described artificial neural network speaker characteristic change generation model, the above-described artificial neural network vocal characteristic prediction model, and/or the above-described artificial neural network speaker feature extraction model.
The artificial neural network model 900 may be implemented as a multilayer perceptron (MLP) composed of multiple layers of nodes and the connections between them. The artificial neural network model 900 according to this embodiment may be implemented using any of various artificial neural network structures, including the MLP. As shown in FIG. 9, the artificial neural network model 900 may consist of an input layer 920 that receives an input signal or data 910 from the outside, an output layer 940 that outputs an output signal or data 950 corresponding to the input data, and n hidden layers 930_1 to 930_n located between the input layer 920 and the output layer 940, which receive signals from the input layer 920, extract features, and pass them to the output layer 940. Here, the output layer 940 may receive signals from the hidden layers 930_1 to 930_n and output them to the outside. Learning methods for the artificial neural network model 900 include supervised learning, which optimizes the model for solving a problem using teacher signals (correct answers), and unsupervised learning, which does not require teacher signals.
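For illustration only, a multilayer perceptron of the kind shown in FIG. 9, with an input layer, n hidden layers, and an output layer, can be built as in the sketch below; the layer sizes are arbitrary placeholders.

```python
import torch.nn as nn

def build_mlp(in_dim, hidden_dims, out_dim):
    layers, prev = [], in_dim                      # the first Linear consumes the input layer 920
    for h in hidden_dims:                          # hidden layers 930_1 ... 930_n
        layers += [nn.Linear(prev, h), nn.ReLU()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))        # output layer 940
    return nn.Sequential(*layers)

model = build_mlp(in_dim=128, hidden_dims=(256, 256), out_dim=80)  # illustrative sizes
```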
According to an embodiment, when the artificial neural network model 900 is an artificial neural network text-to-speech synthesis model, the processor may input text information and the speaker characteristic of a new speaker to the artificial neural network model 900, and the model may be trained end-to-end to output voice data for the text in which the new speaker characteristic is reflected. That is, given information about the text and about the new speaker, the intermediate steps are learned by the model itself, and a synthesized voice can be output. The processor may generate the synthesized voice by converting the text information and the speaker characteristic of the new speaker into embeddings (for example, embedding vectors) through the encoding layers of the artificial neural network model 900. Here, the input variable of the artificial neural network model 900 may be a vector 910 composed of vector data elements representing the text information and the new speaker information. The text information may be represented by any embedding representing text, for example, character embeddings or phoneme embeddings, and the speaker characteristic of the new speaker may be represented by any form of embedding representing the speaker's vocalization. When the artificial neural network model 900 is trained end-to-end, it may be trained in a way that reflects the dependency between the text information and the new speaker information. Under this configuration, the output variable may be a result vector 950 representing the synthesized voice for the target text in which the speaker characteristic of the new speaker is reflected.
In this way, a plurality of input variables and the corresponding plurality of output variables are matched to the input layer 920 and the output layer 940 of the artificial neural network model 900, respectively, and the synaptic values between the nodes included in the input layer 920, the hidden layers 930_1 ... 930_n (where n is a natural number of 2 or more), and the output layer 940 are adjusted so that the artificial neural network model 900 can be trained to infer the correct output for a given input. In inferring the correct output, ground-truth data for the analysis results may be used, and such ground-truth data may be obtained as the result of an annotator's annotation work. Through this learning process, characteristics hidden in the input variables of the artificial neural network model 900 can be identified, and the synaptic values (or weights) between the nodes of the artificial neural network model 900 can be adjusted so that the error between the output variable calculated from the input variables and the target output is reduced.
To address this dependency between the input information, a loss function that minimizes the mutual information between the text information and the new speaker information (for example, between the text information embedding and the new speaker information embedding) may be used when training the artificial neural network model 900. According to an embodiment, when the artificial neural network model 900 is an artificial neural network text-to-speech synthesis model, it may include a module (for example, a fully-connected layer) configured to predict the loss between the text information embedding and the new speaker information embedding. Under this configuration, the artificial neural network model 900 may be trained to estimate the mutual information between the text information and the speaker information and to minimize it. The artificial neural network model 900 trained in this way may be configured to control the input text information and the new speaker information independently of each other.
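For illustration only, one common way to realize such a mutual-information penalty is a small fully-connected critic that scores (text embedding, speaker embedding) pairs: matched pairs versus shuffled pairs give a variational (Donsker-Varadhan, MINE-style) lower bound on mutual information, which the critic maximizes and the synthesis model's encoders minimize. This is a hedged sketch of that general idea, not the specific loss module used in the disclosure.

```python
import math
import torch
import torch.nn as nn

class MICritic(nn.Module):
    """Fully-connected critic T(text_emb, speaker_emb) for a variational MI estimate."""
    def __init__(self, text_dim=128, speaker_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + speaker_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, t, s):
        return self.net(torch.cat([t, s], dim=-1)).squeeze(-1)

def mi_lower_bound(critic, text_emb, speaker_emb):
    joint = critic(text_emb, speaker_emb).mean()                 # matched (joint) pairs
    shuffled = speaker_emb[torch.randperm(speaker_emb.size(0))]  # break the pairing (marginal)
    marginal = torch.logsumexp(critic(text_emb, shuffled), dim=0) - math.log(text_emb.size(0))
    # Maximize with respect to the critic; minimize with respect to the text/speaker encoders.
    return joint - marginal
```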
Then, when the artificial neural network model 900 is an artificial neural network text-to-speech synthesis model, the processor may input the target text information and the new speaker information to the trained artificial neural network model 900 to output a synthesized voice corresponding to the target text in which the speaker characteristic of the new speaker is reflected. Such voice data may be configured so that the mutual information between the target text information and the new speaker information is minimized.
This training process for the artificial neural network model 900 may also be applied to the above-described artificial neural network speaker characteristic change generation model, the above-described artificial neural network vocal characteristic prediction model, and/or the above-described artificial neural network speaker feature extraction model, using the training data for each model. The artificial neural network models trained in this way may then take data corresponding to the training input data as input and generate inference values as output data.
The above-described method may be provided as a computer program stored in a computer-readable recording medium for execution on a computer. The medium may continuously store the computer-executable program, or may temporarily store it for execution or download. The medium may also be any of various recording means or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and media configured to store program instructions, including ROM, RAM, and flash memory. Other examples of media include recording media or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software.
The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. Those skilled in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the present disclosure may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design requirements imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In a hardware implementation, the processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this disclosure, a computer, or a combination thereof.
Accordingly, the various illustrative logic blocks, modules, and circuits described in connection with the present disclosure may be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
In a firmware and/or software implementation, the techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, compact disc (CD), or a magnetic or optical data storage device. The instructions may be executable by one or more processors and may cause the processor(s) to perform certain aspects of the functionality described in this disclosure.
Although the embodiments described above have been described as utilizing aspects of the presently disclosed subject matter in one or more standalone computer systems, the present disclosure is not limited thereto and may be implemented in conjunction with any computing environment, such as a network or distributed computing environment. Furthermore, aspects of the subject matter of this disclosure may be implemented in a plurality of processing chips or devices, and storage may similarly be distributed across a plurality of devices. Such devices may include PCs, network servers, and portable devices.
Although the present disclosure has been described herein in connection with some embodiments, various modifications and changes can be made without departing from the scope of the present disclosure, as will be understood by those skilled in the art to which the present disclosure pertains. Such modifications and changes are intended to fall within the scope of the claims appended hereto.

Claims (14)

  1. A method for generating a synthesized voice of a new speaker, performed by at least one processor, the method comprising:
    receiving a target text;
    acquiring a speaker characteristic of a reference speaker;
    acquiring vocal characteristic change information;
    determining a speaker characteristic of a new speaker using the acquired speaker characteristic of the reference speaker and the acquired vocal characteristic change information; and
    inputting the target text and the determined speaker characteristic of the new speaker to an artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristic of the new speaker is reflected,
    wherein the artificial neural network text-to-speech synthesis model is trained, based on a plurality of training text items and speaker characteristics of a plurality of training speakers, to output voices for the plurality of training text items in which the speaker characteristics of the plurality of training speakers are reflected.
  2. The method of claim 1, wherein determining the speaker characteristic of the new speaker comprises:
    generating a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired vocal characteristic change information to an artificial neural network speaker characteristic change generation model; and
    outputting the speaker characteristic of the new speaker by synthesizing the speaker characteristic of the reference speaker and the generated speaker characteristic change,
    wherein the artificial neural network speaker characteristic change generation model is trained using speaker characteristics of a plurality of training speakers and a plurality of vocal characteristics included in the speaker characteristics of the plurality of training speakers.
  3. The method of claim 2, wherein the vocal characteristic change information includes information on a change in a target vocal characteristic.
  4. The method of claim 1, wherein:
    acquiring the speaker characteristic of the reference speaker comprises acquiring a plurality of speaker characteristics corresponding to a plurality of reference speakers;
    acquiring the vocal characteristic change information comprises acquiring a set of weights corresponding to the plurality of speaker characteristics; and
    determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker by applying a weight included in the acquired set of weights to each of the plurality of speaker characteristics.
  5. The method of claim 1, further comprising acquiring speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors,
    wherein acquiring the vocal characteristic change information comprises:
    normalizing each of the speaker vectors of the plurality of speakers;
    determining a plurality of principal components by performing a dimensionality reduction analysis on the normalized speaker vectors of the plurality of speakers;
    selecting at least one principal component from among the determined plurality of principal components; and
    determining the vocal characteristic change information using the selected principal component, and
    wherein determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocal characteristic change information, and a weight for the determined vocal characteristic change information.
  6. The method of claim 1, further comprising acquiring speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors,
    wherein each of the plurality of speakers is assigned a label for one or more vocal characteristics,
    wherein acquiring the vocal characteristic change information comprises:
    acquiring speaker vectors of a plurality of speakers whose target vocal characteristics differ; and
    determining the vocal characteristic change information based on a difference between the acquired speaker vectors of the plurality of speakers, and
    wherein determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocal characteristic change information, and a weight for the determined vocal characteristic change information.
  7. The method of claim 1, further comprising acquiring speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors,
    wherein each of the plurality of speakers is assigned a label for one or more vocal characteristics,
    wherein acquiring the vocal characteristic change information comprises:
    acquiring speaker vectors of speakers included in each of a plurality of speaker groups whose target vocal characteristics differ, the plurality of speaker groups including a first speaker group and a second speaker group;
    calculating an average of the speaker vectors of the speakers included in the first speaker group;
    calculating an average of the speaker vectors of the speakers included in the second speaker group; and
    determining the vocal characteristic change information based on a difference between the average speaker vector corresponding to the first speaker group and the average speaker vector corresponding to the second speaker group, and
    wherein determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined vocal characteristic change information, and a weight for the determined vocal characteristic change information.
  8. The method of claim 1, further comprising acquiring speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors,
    wherein the speaker characteristic of the reference speaker includes a plurality of vocal characteristics of the reference speaker,
    wherein acquiring the vocal characteristic change information comprises:
    inputting the speaker characteristics of the plurality of speakers to an artificial neural network vocal characteristic prediction model to output the vocal characteristics of each of the plurality of speakers;
    selecting, from among the speaker characteristics of the plurality of speakers, a speaker characteristic of a speaker for which there is a difference between a target vocal characteristic among that speaker's output vocal characteristics and the target vocal characteristic among the plurality of vocal characteristics of the reference speaker; and
    acquiring a weight corresponding to the speaker characteristic of the selected speaker, and
    wherein determining the speaker characteristic of the new speaker comprises determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and the weight corresponding to the speaker characteristic of the selected speaker.
  9. The method of claim 1, wherein the speaker characteristic of the new speaker includes a speaker characteristic vector, the method further comprising:
    calculating a hash value corresponding to the speaker characteristic vector using a hash function;
    determining whether, among content of a plurality of speakers stored in a storage medium, there is content associated with a hash value similar to the calculated hash value; and
    when there is no content associated with a hash value similar to the calculated hash value, determining that the output voice associated with the speaker characteristic of the new speaker is a new output voice.
  10. The method according to claim 1,
    wherein the speaker characteristic of the reference speaker includes a speaker vector,
    wherein the obtaining of the speech characteristic change information comprises:
    extracting a normal vector for a target vocal characteristic using a vocal characteristic classification model corresponding to the target vocal characteristic, the normal vector referring to a normal vector of a hyperplane that classifies the target vocal characteristic; and
    obtaining information indicating a degree to which the target vocal characteristic is to be adjusted, and
    wherein the determining of the speaker characteristic of the new speaker comprises:
    determining the speaker characteristic of the new speaker based on the speaker vector of the reference speaker, the extracted normal vector, and the degree to which the target vocal characteristic is to be adjusted,
    A method for generating a synthesized voice of a new speaker.
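A sketch of claim 10's hyperplane-normal adjustment. The claim names a vocal characteristic classification model but not its form; a linear SVM is used here purely for illustration because its coefficient vector is exactly the normal of the separating hyperplane, and the synthetic labels and step size are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical training data: speaker vectors labeled by a binary target
# vocal characteristic (e.g. "soft" vs. "firm" tone).
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 256))
y = (X[:, 0] + 0.1 * rng.standard_normal(500) > 0).astype(int)

clf = LinearSVC().fit(X, y)               # linear classifier for the target characteristic
normal = clf.coef_[0]                     # normal vector of the separating hyperplane
normal = normal / np.linalg.norm(normal)  # unit length so 'degree' controls the step size

def adjust_speaker(ref_vector: np.ndarray, degree: float) -> np.ndarray:
    """Shift the reference speaker vector along the hyperplane normal.

    'degree' encodes how strongly (and in which direction) the target
    vocal characteristic is adjusted.
    """
    return ref_vector + degree * normal

new_speaker = adjust_speaker(rng.standard_normal(256), degree=1.5)
```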
  11. A computer program stored in a computer-readable recording medium for executing the method according to claim 1 on a computer.
  12. A speech synthesizer,
    trained using training data that includes a synthesized voice of a new speaker generated according to the method of claim 1.
  13. An apparatus for providing a synthesized voice, comprising:
    a memory configured to store a synthesized voice of a new speaker generated according to the method of claim 1; and
    at least one processor connected to the memory and configured to execute at least one computer-readable program contained in the memory,
    wherein the at least one program includes instructions for outputting at least a part of the synthesized voice of the new speaker stored in the memory,
    An apparatus for providing a synthesized voice.
  14. A method of providing a synthesized voice of a new speaker, performed by at least one processor, the method comprising:
    storing a synthesized voice of a new speaker generated according to the method of claim 1; and
    providing at least a part of the stored synthesized voice of the new speaker,
    A method for providing a synthesized voice of a new speaker.
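A toy sketch of the store-then-provide flow of claims 13 and 14, assuming the synthesized voice is a waveform array kept in an in-memory store; the class, key format, and sample rate are illustrative assumptions only.

```python
from typing import Dict, Optional

import numpy as np

class SynthesizedVoiceStore:
    """Minimal in-memory store for synthesized voices keyed by speaker id."""

    def __init__(self) -> None:
        self._voices: Dict[str, np.ndarray] = {}

    def store(self, speaker_id: str, waveform: np.ndarray) -> None:
        self._voices[speaker_id] = waveform

    def provide(self, speaker_id: str, start: int = 0,
                end: Optional[int] = None) -> np.ndarray:
        """Return at least a part (a slice) of the stored synthesized voice."""
        return self._voices[speaker_id][start:end]

store = SynthesizedVoiceStore()
store.store("new_speaker_001", np.zeros(16000, dtype=np.float32))  # 1 s of silence at 16 kHz
clip = store.provide("new_speaker_001", start=0, end=8000)
```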
PCT/KR2022/001414 2021-01-26 2022-01-26 Method and system for generating synthesized speech of new speaker WO2022164207A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0011093 2021-01-26
KR20210011093 2021-01-26
KR10-2022-0011853 2022-01-26
KR1020220011853A KR102604932B1 (en) 2021-01-26 2022-01-26 Method and system for generating synthesis voice of a new speaker

Publications (1)

Publication Number Publication Date
WO2022164207A1 true WO2022164207A1 (en) 2022-08-04

Family

ID=82653616

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/001414 WO2022164207A1 (en) 2021-01-26 2022-01-26 Method and system for generating synthesized speech of new speaker

Country Status (1)

Country Link
WO (1) WO2022164207A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190085882A (en) * 2018-01-11 2019-07-19 네오사피엔스 주식회사 Method and computer readable storage medium for performing text-to-speech synthesis using machine learning
KR20190096877A (en) * 2019-07-31 2019-08-20 엘지전자 주식회사 Artificial intelligence(ai)-based voice sampling apparatus and method for providing speech style in heterogeneous label
US20200005763A1 (en) * 2019-07-25 2020-01-02 Lg Electronics Inc. Artificial intelligence (ai)-based voice sampling apparatus and method for providing speech style
KR20200056342A (en) * 2018-11-14 2020-05-22 네오사피엔스 주식회사 Method for retrieving content having voice identical to voice of target speaker and apparatus for performing the same
KR20200088263A (en) * 2018-05-29 2020-07-22 한국과학기술원 Method and system of text to multiple speech

Similar Documents

Publication Publication Date Title
WO2019117466A1 (en) Electronic device for analyzing meaning of speech, and operation method therefor
WO2020190054A1 (en) Speech synthesis apparatus and method therefor
WO2020190050A1 (en) Speech synthesis apparatus and method therefor
WO2019139430A1 (en) Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
WO2020189850A1 (en) Electronic device and method of controlling speech recognition by electronic device
WO2020145439A1 (en) Emotion information-based voice synthesis method and device
WO2019139431A1 (en) Speech translation method and system using multilingual text-to-speech synthesis model
WO2020105856A1 (en) Electronic apparatus for processing user utterance and controlling method thereof
WO2020246702A1 (en) Electronic device and method for controlling the electronic device thereof
WO2015005679A1 (en) Voice recognition method, apparatus, and system
WO2022045651A1 (en) Method and system for applying synthetic speech to speaker image
WO2020213842A1 (en) Multi-model structures for classification and intent determination
WO2018097439A1 (en) Electronic device for performing translation by sharing context of utterance and operation method therefor
WO2020111676A1 (en) Voice recognition device and method
WO2020116930A1 (en) Electronic device for outputting sound and operating method thereof
WO2020209647A1 (en) Method and system for generating synthetic speech for text through user interface
WO2021029642A1 (en) System and method for recognizing user's speech
WO2022265273A1 (en) Method and system for providing service for conversing with virtual person simulating deceased person
WO2022164207A1 (en) Method and system for generating synthesized speech of new speaker
WO2022004971A1 (en) Learning device and method for generating image
WO2022260432A1 (en) Method and system for generating composite speech by using style tag expressed in natural language
WO2021040490A1 (en) Speech synthesis method and apparatus
WO2021085661A1 (en) Intelligent voice recognition method and apparatus
WO2022102987A1 (en) Electronic device and control method thereof
WO2022034982A1 (en) Method for performing synthetic speech generation operation on text

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22746233

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.11.2023)