WO2022164207A1 - Method and system for generating synthesized speech of a new speaker - Google Patents

Method and system for generating synthesized speech of a new speaker

Info

Publication number
WO2022164207A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
characteristic
speech
new
speakers
Prior art date
Application number
PCT/KR2022/001414
Other languages
English (en)
Korean (ko)
Inventor
김태수
이영근
황영태
Original Assignee
네오사피엔스 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 네오사피엔스 주식회사
Priority claimed from KR1020220011853A external-priority patent/KR102604932B1/ko
Publication of WO2022164207A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • the present disclosure relates to a method and system for generating a synthesized voice of a new speaker, and more particularly, to determining the speaker characteristic of a new speaker using the speaker characteristic and speech characteristic change information of a reference speaker, and to generating a synthesized voice for a target text using an artificial neural network text-to-speech synthesis model.
  • with virtual voice generation technology and virtual image production technology, any content creator can easily produce audio content or video content.
  • in virtual voice generation technology, a neural network voice model is trained on audio samples recorded by voice actors, and voice synthesis technology that reproduces the voice characteristics of the voice actors who recorded the audio samples is being developed.
  • the present disclosure provides a method for generating a new speaker's synthesized voice, a computer program stored in a computer-readable recording medium, and an apparatus (system) to solve the above problems.
  • the present disclosure may be implemented in various ways including a method, a system, an apparatus, or a computer program stored in a computer-readable storage medium, and a computer-readable recording medium.
  • a method for generating a synthesized voice of a new speaker includes: receiving a target text; acquiring a speaker characteristic of a reference speaker; acquiring speech characteristic change information; determining the speaker characteristic of a new speaker using the acquired speaker characteristic of the reference speaker and the acquired speech characteristic change information; and inputting the target text and the determined speaker characteristic of the new speaker into an artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristic of the new speaker is reflected.
  • the artificial neural network text-to-speech synthesis model is trained based on a plurality of training text items and the speaker characteristics of a plurality of training speakers.
  • the determining of the speaker characteristic of the new speaker may include generating a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired speech characteristic change information into an artificial neural network speaker characteristic change generation model, and outputting the speaker characteristic of the new speaker by synthesizing the speaker characteristic of the reference speaker and the generated speaker characteristic change, wherein the artificial neural network speaker characteristic change generation model is trained using the speaker characteristics of a plurality of training speakers and a plurality of speech characteristics included in those speaker characteristics.
  • the speech characteristic change information includes information about a change in the target speech characteristic.
  • the acquiring of the speaker characteristic of the reference speaker includes acquiring a plurality of speaker characteristics corresponding to a plurality of reference speakers.
  • the acquiring of the speech characteristic change information includes obtaining a weight set corresponding to the plurality of speaker characteristics, and the determining of the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker by applying a weight included in the obtained weight set to each of the plurality of speaker characteristics.
  • the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors, wherein the obtaining of the speech characteristic change information includes: normalizing each of the speaker vectors of the plurality of speakers; determining a plurality of principal components by performing dimensionality reduction analysis on the normalized speaker vectors; selecting at least one principal component from among the determined plurality of principal components; and determining the speech characteristic change information using the selected principal component, and wherein the determining of the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined speech characteristic change information, and a weight of the determined speech characteristic change information.
  • the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors, each of the plurality of speakers being assigned a label for one or more speech characteristics, wherein the obtaining of the speech characteristic change information includes obtaining the speaker vectors of a plurality of speakers having different target speech characteristics and determining the speech characteristic change information based on a difference between the obtained speaker vectors, and wherein the determining of the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined speech characteristic change information, and a weight of the determined speech characteristic change information.
  • the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors, each of the plurality of speakers being assigned a label for one or more speech characteristics, wherein the obtaining of the speech characteristic change information includes: obtaining the speaker vectors of speakers included in each of a plurality of speaker groups having different target speech characteristics, the plurality of speaker groups including a first speaker group and a second speaker group; calculating an average of the speaker vectors of the speakers included in the first speaker group; calculating an average of the speaker vectors of the speakers included in the second speaker group; and determining the speech characteristic change information based on a difference between the average of the speaker vectors corresponding to the first speaker group and the average of the speaker vectors corresponding to the second speaker group, and wherein the determining of the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the determined speech characteristic change information, and a weight of the determined speech characteristic change information.
  • the method further includes obtaining speaker characteristics of a plurality of speakers, the speaker characteristics of the plurality of speakers including a plurality of speaker vectors, wherein the speaker characteristic of the reference speaker includes a plurality of speech characteristics of the reference speaker.
  • the obtaining of the speech characteristic change information includes: inputting the speaker characteristics of the plurality of speakers into an artificial neural network speech characteristic prediction model and outputting the speech characteristics of each of the plurality of speakers; selecting, from among the speaker characteristics of the plurality of speakers, the speaker characteristic of a speaker for which a difference exists between a target speech characteristic among the output speech characteristics and the target speech characteristic among the plurality of speech characteristics of the reference speaker; and acquiring a weight corresponding to the selected speaker characteristic, wherein the determining of the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and the weight corresponding to the speaker characteristic of the selected speaker.
  • the speaker characteristic of the new speaker includes a speaker feature vector, and the method further includes: calculating a hash value corresponding to the speaker feature vector using a hash function; determining whether there is content associated with a hash value similar to the calculated hash value; and, if there is no content associated with a hash value similar to the calculated hash value, determining that the output voice associated with the speaker characteristic of the new speaker is a new output voice.
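As a rough illustration of the hash-based novelty check in the item above, the sketch below hashes a coarsely quantized speaker feature vector and looks it up in a registry of previously seen hashes. The quantization step, the SHA-256 choice, and the registry structure are assumptions for illustration; the claim's "similar hash" comparison is approximated here by an exact match after quantization.

```python
import hashlib
import numpy as np

def speaker_hash(speaker_vec: np.ndarray, decimals: int = 3) -> str:
    """Hash a speaker feature vector after coarse rounding, so that nearly
    identical vectors map to the same digest (assumed stand-in for the
    'similar hash' comparison)."""
    quantized = np.round(speaker_vec.astype(np.float64), decimals=decimals)
    return hashlib.sha256(quantized.tobytes()).hexdigest()

def is_new_output_voice(speaker_vec: np.ndarray, registered_hashes: set) -> bool:
    """Return True if no existing content is associated with this hash,
    i.e. the output voice tied to this speaker characteristic is new."""
    return speaker_hash(speaker_vec) not in registered_hashes

# usage sketch
registry = {speaker_hash(np.zeros(256))}
candidate = np.random.default_rng(0).normal(size=256)
print(is_new_output_voice(candidate, registry))  # True -> treat as a new voice
```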
  • the speaker characteristic of the reference speaker includes a speaker vector, and the obtaining of the speech characteristic change information includes: extracting a normal vector for the target speech characteristic using a speech characteristic classification model corresponding to the target speech characteristic, the normal vector being the normal vector of a hyperplane that classifies the target speech characteristic; and obtaining information indicating a degree to which the target speech characteristic is to be adjusted.
  • the determining of the speaker characteristic of the new speaker includes determining the speaker characteristic of the new speaker based on the speaker vector, the extracted normal vector, and the degree to which the target speech characteristic is to be adjusted.
  • a computer program stored in a computer-readable recording medium is provided for executing the above-described method for generating a synthesized voice of a new speaker according to an embodiment of the present disclosure in a computer.
  • the speech synthesizer is trained using learning data including the synthesized voice of the new speaker generated according to the above-described method for generating the synthesized voice of the new speaker.
  • an apparatus for providing a synthesized voice includes a memory configured to store a synthesized voice of a new speaker generated according to the above-described method for generating a synthesized voice of a new speaker, and at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory, wherein the at least one program is configured to output at least a portion of the synthesized voice of the new speaker stored in the memory.
  • a method of providing a synthesized voice of a new speaker includes a synthesized voice of a new speaker generated according to the above-described method of generating a synthesized voice of a new speaker. storing and providing at least a portion of the stored synthesized voice of the new speaker.
  • a synthesized voice having a new voice may be generated by modifying a speaker feature vector through quantitative adjustment of vocalization features.
  • a new speaker's voice may be generated by mixing the voices of several speakers (eg, two or more speakers or three or more speakers).
  • the output voice may be generated by finely adjusting one or more vocalization characteristics from the user terminal.
  • the one or more vocal characteristics may include gender control, vocal tone control, vocal strength, male age control, female age control, pitch, tempo, and the like.
  • FIG. 1 is a diagram illustrating an example in which a synthesized voice generating system according to an embodiment of the present disclosure generates an output voice by receiving a target text and speaker characteristics of a new speaker.
  • FIG. 2 is a schematic diagram illustrating a configuration in which a plurality of user terminals and a synthesized voice generating system are communicatively connected to provide a synthetic voice generating service for text according to an embodiment of the present disclosure.
  • FIG. 3 is a block diagram illustrating an internal configuration of a user terminal and a synthesized voice generating system according to an embodiment of the present disclosure.
  • FIG. 4 is a block diagram illustrating an internal configuration of a processor of a user terminal according to an embodiment of the present disclosure.
  • FIG. 5 is a flowchart illustrating a method of generating an output voice in which a speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure.
  • FIG. 6 is a diagram illustrating an example of generating an output voice in which the speaker characteristics of a new speaker are reflected using the artificial neural network text-to-speech synthesis model according to an embodiment of the present disclosure.
  • FIG. 7 is a diagram illustrating an example of generating an output voice in which a speaker characteristic of a new speaker is reflected using an artificial neural network text-to-speech synthesis model according to another embodiment of the present disclosure.
  • FIG. 8 is an exemplary diagram illustrating a user interface for generating an output voice in which a speaker characteristic of a new speaker is reflected, according to an embodiment of the present disclosure.
  • FIG. 9 is a structural diagram illustrating an artificial neural network model according to an embodiment of the present disclosure.
  • 'unit' or 'module' used in the specification means a software or hardware component, and 'module' performs certain roles.
  • 'unit' or 'module' is not meant to be limited to software or hardware.
  • a 'unit' or 'module' may be configured to reside on an addressable storage medium or may be configured to run on one or more processors.
  • a 'unit' or 'module' refers to components such as software components, object-oriented software components, class components, and task components, and includes at least one of processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, or variables.
  • Components and 'units' or 'modules' may be combined into a smaller number of components and 'units' or 'modules', or may be further separated into additional components and 'units' or 'modules'.
  • a 'unit' or a 'module' may be implemented with a processor and a memory.
  • 'Processor' should be construed broadly to include general purpose processors, central processing units (CPUs), microprocessors, digital signal processors (DSPs), controllers, microcontrollers, state machines, and the like.
  • a 'processor' may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field programmable gate array (FPGA), or the like.
  • 'Processor' may refer to a combination of processing devices, such as a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, a combination of one or more microprocessors with a DSP core, or any other such configuration. Also, 'memory' should be construed broadly to include any electronic component capable of storing electronic information.
  • for example, 'memory' may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), and erasable programmable read-only memory (EPROM).
  • a memory is said to be in electronic communication with the processor if the processor is capable of reading information from and/or writing information to the memory.
  • a memory integrated in the processor is in electronic communication with the processor.
  • a 'text item' may refer to a part or all of text, and the text may refer to a text item.
  • each of 'data item' and 'information item' may refer to at least a portion of data and at least a portion of information, and data and information may refer to a data item and information item.
  • 'each of a plurality of A' may refer to each of all components included in the plurality of As, or may refer to each of some components included in the plurality of As.
  • for example, each of the speaker characteristics of a plurality of speakers may refer to each of all speaker characteristics included in the speaker characteristics of the plurality of speakers, or to each of some speaker characteristics included therein.
  • FIG. 1 is a diagram illustrating an example in which a synthesized voice generating system 100 according to an embodiment of the present disclosure generates an output voice 130 by receiving a target text 110 and a speaker characteristic 120 of a new speaker.
  • the synthesized voice generating system 100 may receive the target text 110 and the speaker characteristic 120 of the new speaker, and generate the output voice 130 in which the speaker characteristic 120 of the new speaker is reflected.
  • the target text 110 may include one or more paragraphs, sentences, clauses, phrases, words, phonemes, and the like.
  • the speaker characteristic 120 of the new speaker may be determined or generated using the speaker characteristic of the reference speaker and information on the change of the vocalization characteristic.
  • the speaker characteristic of the reference speaker may include the speaker characteristic of a speaker that serves as a reference in generating the speaker characteristic of the speaker to be newly created, that is, the new speaker.
  • the speaker characteristic of the reference speaker may include a speaker characteristic similar to the speaker characteristic of the speaker to be newly created.
  • the speaker characteristics of the reference speaker may include speaker characteristics of a plurality of reference speakers.
  • the speaker characteristic of the reference speaker may include a speaker vector of the reference speaker.
  • the speaker vector of the reference speaker may be extracted based on the speaker id (eg, speaker one-hot vector, etc.) and the vocalization feature (eg, vector) using the neural network speaker feature extraction model.
  • the artificial neural network speaker feature extraction model may be trained to receive the speaker ids of a plurality of training speakers and a plurality of training vocalization features (eg, vectors) and to extract the speaker vector (ground truth) of the corresponding reference speaker.
  • alternatively, the speaker vector of the reference speaker may be extracted based on a voice recorded by a speaker and vocalization features (eg, vectors), using an artificial neural network speaker feature extraction model.
  • in this case, the artificial neural network speaker feature extraction model may be trained to receive the voices recorded by a plurality of training speakers and a plurality of training vocalization features (eg, vectors) and to extract the speaker vector (ground truth) of the corresponding reference speaker.
  • the speaker vector of the reference speaker may include one or more speech characteristics (eg, tone, speech strength, speech speed, gender, age, etc.) of the reference speaker's voice.
  • the speaker id and/or the voice recorded by the speaker may be selected as the voice on which the speaker characteristics of the new speaker are based.
  • the vocalization characteristic may include a basic vocalization characteristic that will be reflected in the speaker characteristic of the new speaker.
  • the speaker id, the voice recorded by the speaker, and/or the vocalization characteristics are used to generate the speaker characteristic of the reference speaker, and the speaker characteristic of the reference speaker generated in this way is combined with the speech characteristic change information to obtain the speaker characteristic of the new speaker.
  • the vocalization characteristic change information may include any information about the vocalization characteristic desired to be applied to the speaker characteristic of the new speaker.
  • the speech characteristic change information may include information about a difference between the speaker characteristic of the new speaker and the speaker characteristic of the reference speaker.
  • the new speaker characteristic may be generated by synthesizing the speaker characteristic and the speaker characteristic change of the reference speaker.
  • the speaker characteristic change may be generated by inputting the speaker characteristic and vocalization characteristic change information of the reference speaker to the artificial neural network speaker characteristic change generation model.
  • the artificial neural network speaker characteristic change generation model may be trained using speaker characteristics of a plurality of learned speakers and a plurality of speech characteristics included in the plurality of speaker characteristics.
  • the vocalization characteristic change information may include information indicating a difference between the target vocalization characteristic included in the speaker characteristic of the new speaker and the target vocalization characteristic included in the speaker characteristic of the reference speaker. That is, the speech characteristic change information may include information about a change in the target speech characteristic.
  • the speech feature change information may include a normal vector of a hyperplane that classifies the target speech feature from the speaker feature and information indicating the degree of adjusting the target speech feature.
  • the speech characteristic change information may include a weight to be applied to each of the speaker characteristics of the plurality of reference speakers.
  • the speech characteristic change information may include a target speech characteristic axis generated based on the target speech characteristics included in the training speakers, and a weight for the target speech characteristic.
  • the speech characteristic change information may include a target speech characteristic axis generated based on a difference between the speaker characteristics of speakers having different target speech characteristics, and a weight for the target speech characteristic.
  • the speech characteristic change information may include a speaker characteristic of a speaker having a difference from a target speaker characteristic included in the speaker characteristic of the reference speaker and a weight of the corresponding speaker characteristic.
  • the synthesized voice generation system 100 may generate, as a synthesized voice for the target text 110 in which the speaker characteristic 120 of the new speaker is reflected, an output voice 130 in which the target text is uttered according to the speaker characteristic of the newly created speaker.
  • the synthesized voice generation system 100 may include an artificial neural network text-to-speech synthesis model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output voices for the plurality of training text items in which the speaker characteristics of the plurality of training speakers are reflected.
  • the artificial neural network text-to-speech synthesis model may be configured to output voice data when the target text 110 and the speaker characteristic 120 of the new speaker are input. In this case, the output voice data may be post-processed into a human-audible voice using a post-processing processor, a vocoder, or the like.
  • FIG. 2 illustrates a configuration in which a plurality of user terminals 210_1 , 210_2 , and 210_3 and a synthesized voice generating system 230 are communicatively connected to provide a synthetic voice generating service for text according to an embodiment of the present disclosure.
  • the plurality of user terminals 210_1 , 210_2 , and 210_3 may communicate with the synthesized voice generation system 230 through the network 220 .
  • the network 220 may be configured to enable communication between the plurality of user terminals 210_1 , 210_2 , and 210_3 and the synthesized voice generating system 230 .
  • depending on the installation environment, the network 220 may consist of, for example, a wired network such as Ethernet, a wired home network (power line communication), a telephone line communication device, or RS-serial communication; a wireless network such as a mobile communication network, a wireless LAN (WLAN), Wi-Fi, Bluetooth, or ZigBee; or a combination thereof.
  • the communication method is not limited thereto, and short-range wireless communication between the user terminals 210_1, 210_2, and 210_3 may also be included.
  • the network 220 may include a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), and a broadband network (BBN). , the Internet, and the like.
  • the network 220 may include any one or more of network topologies including a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree or hierarchical network, and the like, but is not limited thereto.
  • the mobile phone or smart phone 210_1, the tablet computer 210_2, and the laptop or desktop computer 210_3 are illustrated as examples of a user terminal that executes or operates a user interface that provides a synthetic voice generation service, but is not limited thereto.
  • the user terminals 210_1, 210_2, and 210_3 may be any computing device capable of wired and/or wireless communication in which a web browser, a mobile browser application, or a synthesized voice generation application is installed so that a user interface providing the synthesized voice generation service can be executed.
  • for example, the user terminal 210 may include a smartphone, a mobile phone, a navigation terminal, a desktop computer, a laptop computer, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a tablet computer, a game console, a wearable device, an Internet of things (IoT) device, a virtual reality (VR) device, an augmented reality (AR) device, and the like.
  • in FIG. 2, three user terminals 210_1, 210_2, and 210_3 are illustrated as communicating with the synthesized speech generation system 230 through the network 220, but the present disclosure is not limited thereto, and a different number of user terminals may be configured to communicate with the synthesized speech generation system 230 through the network 220.
  • the user terminals 210_1, 210_2, and 210_3 may provide the target text, information indicating or selecting the speaker characteristic of the reference speaker, and/or the speech characteristic change information to the synthesized speech generation system 230.
  • the user terminals 210_1 , 210_2 , and 210_3 may receive the speaker characteristic and/or the candidate vocalization characteristic change information of the candidate reference speaker from the synthesized speech generation system 230 .
  • the user terminals 210_1, 210_2, and 210_3 may select, in response to the user input, speaker characteristics and/or speech characteristics change information of the reference speaker from the candidate reference speaker speaker characteristics and/or candidate vocal characteristics change information.
  • the user terminals 210_1 , 210_2 , and 210_3 may receive the output voice generated from the synthesized voice generating system 230 .
  • in FIG. 2, each of the user terminals 210_1, 210_2, and 210_3 and the synthesized voice generation system 230 are illustrated as separately configured elements, but the present disclosure is not limited thereto, and the synthesized voice generation system 230 may be configured to be included in each of the user terminals 210_1, 210_2, and 210_3.
  • according to an embodiment, the synthesized speech generation system 230 may include an input/output interface and may be configured to determine the target text, the speaker characteristic of the reference speaker, and the speech characteristic change information without communicating with the user terminals 210_1, 210_2, and 210_3, and to output a synthesized voice for the target text in which the speaker characteristic of the new speaker is reflected.
  • the user terminal 210 may refer to any computing device capable of wired/wireless communication, for example, the mobile phone or smartphone 210_1, the tablet computer 210_2, or the laptop or desktop computer 210_3 of FIG. 2, and the like.
  • the user terminal 210 may include a memory 312 , a processor 314 , a communication module 316 , and an input/output interface 318 .
  • the synthesized speech generation system 230 may include a memory 332, a processor 334, a communication module 336, and an input/output interface 338.
  • as shown in FIG. 3, the user terminal 210 and the synthesized voice generation system 230 may be configured to communicate information and/or data via the network 220 using their respective communication modules 316 and 336.
  • the input/output device 320 may be configured to input information and/or data to the user terminal 210 through the input/output interface 318 or to output information and/or data generated from the user terminal 210 .
  • the memories 312 and 332 may include any non-transitory computer-readable recording medium.
  • the memories 312 and 332 may include a permanent mass storage device such as random access memory (RAM), read-only memory (ROM), a disk drive, a solid state drive (SSD), or flash memory.
  • as another example, a permanent mass storage device such as a ROM, an SSD, flash memory, or a disk drive may be included in the user terminal 210 and/or the synthesized voice generation system 230 as a separate permanent storage device distinct from the memory.
  • the memories 312 and 332 may store an operating system and at least one program code (eg, code for determining the speaker characteristic of a new speaker, code for generating an output voice in which the speaker characteristic of the new speaker is reflected, etc.).
  • these software components may be loaded from a computer-readable recording medium separate from the memories 312 and 332.
  • the separate computer-readable recording medium may include a recording medium directly connectable to the user terminal 210 and the synthesized voice generation system 230, for example, a computer-readable recording medium such as a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card.
  • the software components may be loaded into the memories 312 and 332 through a communication module rather than a computer-readable recording medium.
  • the at least one program may be loaded into the memories 312 and 332 based on a computer program (eg, an artificial neural network text-to-speech synthesis model program) installed by files provided through the network 220 by developers or by a file distribution system that distributes installation files of applications.
  • the processors 314 and 334 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to the processor 314 , 334 by the memory 312 , 332 or the communication module 316 , 336 . For example, the processors 314 and 334 may be configured to execute received instructions according to program code stored in a recording device, such as the memories 312 and 332 .
  • the communication modules 316 and 336 may provide a configuration or function for the user terminal 210 and the synthesized voice generation system 230 to communicate with each other via the network 220, and may provide a configuration or function for the user terminal 210 and/or the synthesized voice generation system 230 to communicate with another user terminal or another system (eg, a separate cloud system, a separate frame image generation system, etc.).
  • for example, a request (eg, a synthesized voice generation request, a request to generate the speaker characteristic of a new speaker, etc.) generated by the processor 314 of the user terminal 210 according to program code stored in a recording device such as the memory 312 may be transmitted to the synthesized voice generation system 230 through the network 220 under the control of the communication module 316.
  • conversely, a control signal or command provided under the control of the processor 334 of the synthesized voice generation system 230 may be received by the user terminal 210 through the communication module 316 of the user terminal 210 via the communication module 336 and the network 220.
  • the input/output interface 318 may be a means for interfacing with the input/output device 320 .
  • the input device may include a device such as a keyboard, a microphone, a mouse, a camera including an image sensor, and the like.
  • the output device may include a device such as a display, a speaker, a haptic feedback device, and the like.
  • the input/output interface 318 may be a means for an interface with a device in which a configuration or function for performing input and output, such as a touch screen, is integrated into one.
  • a service screen or user interface configured using data may be displayed on the display through the input/output interface 318 .
  • the input/output device 320 is illustrated not to be included in the user terminal 210 , but the present invention is not limited thereto, and may be configured as a single device with the user terminal 210 .
  • the input/output interface 338 of the synthesized voice generation system 230 may be a means for interfacing with a device (not shown) for input or output that is connected to, or included in, the synthesized voice generation system 230.
  • in FIG. 3, the input/output interfaces 318 and 338 are illustrated as elements configured separately from the processors 314 and 334, but the present disclosure is not limited thereto, and the input/output interfaces 318 and 338 may be configured to be included in the processors 314 and 334.
  • the user terminal 210 and the synthesized voice generation system 230 may include more components than those of FIG. 3. However, most of the prior art components need not be clearly illustrated. According to an embodiment, the user terminal 210 may be implemented to include at least a portion of the above-described input/output device 320. In addition, the user terminal 210 may further include other components such as a transceiver, a global positioning system (GPS) module, a camera, various sensors, and a database. For example, when the user terminal 210 is a smartphone, it may include components generally included in a smartphone; various components such as an acceleration sensor, a gyro sensor, a camera module, various physical buttons, buttons using a touch panel, input/output ports, and a vibrator for vibration may be implemented to be further included in the user terminal 210.
  • the processor 314 of the user terminal 210 may be configured to operate a synthetic voice output application or the like.
  • a code associated with a corresponding application and/or program may be loaded into the memory 312 of the user terminal 210 .
  • the processor 314 of the user terminal 210 receives information and/or data provided from the input/output device 320 through the input/output interface 318 or through the communication module 316 .
  • Information and/or data may be received from the synthesized speech generation system 230 , and the received information and/or data may be processed and stored in the memory 312 .
  • such information and/or data may be provided to the synthesized voice generation system 230 through the communication module 316 .
  • the processor 314 may receive text input or selected through an input device 320 such as a touch screen or a keyboard connected to the input/output interface 318, and may store the received text in the memory 312 or provide it to the synthesized speech generation system 230 through the communication module 316 and the network 220.
  • the processor 314 may receive an input for the target text (eg, one or more paragraphs, sentences, phrases, words, phonemes, etc.) through the input device 320 .
  • the processor 314 may receive, through the input device 320 , any information indicating or selecting information about a reference speaker and/or information on change of speech characteristics.
  • the processor 314 may receive an input for the target text through the input device 320 through the input/output interface 318 .
  • the processor 314 may receive, through the input device 320 and the input/output interface 318 , an input for uploading a file in a document format including the target text through the user interface.
  • the processor 314 may receive a file in a document format corresponding to the input from the memory 312 .
  • the processor 314 may receive the target text included in the file.
  • the received target text may be provided to the synthesized speech generating system 230 through the communication module 316 .
  • the processor 314 may be configured to provide the uploaded file to the synthesized speech generation system 230 via the communication module 316 and to receive the target text contained within the file from the synthesized speech generation system 230.
  • the processor 314 may output the processed information and/or data through an output device of the user terminal 210, such as a device capable of displaying output (eg, a touch screen, a display, etc.) or a device capable of outputting audio (eg, a speaker).
  • the processor 314 may display information representing or selecting the target text and/or the speech characteristic change information, received from at least one of the input device 320, the memory 312, or the synthesized speech generation system 230, through the screen of the user terminal 210. Additionally or alternatively, the processor 314 may output the speaker characteristic of the new speaker determined or generated by the information processing system 230 through the screen of the user terminal 210. Also, the processor 314 may output the synthesized voice through a voice output capable device such as a speaker.
  • the processor 334 of the synthesized speech generation system 230 may be configured to manage, process and/or store information and/or data received from a plurality of user terminals and/or a plurality of external systems, including the user terminal 210 .
  • the information and/or data processed by the processor 334 may be provided to the user terminal 210 through the communication module 336 .
  • the processor 334 may receive, from the user terminal 210, the memory 332, and/or an external storage device, information indicating or selecting the target text, the reference speaker, and the speech characteristic change information, and may obtain or determine the speaker characteristic of the reference speaker and the speech characteristic change information stored in the memory 332 and/or the external storage device.
  • the processor 334 may determine the speaker characteristic of the new speaker using the speaker characteristic and the vocalization characteristic change information of the reference speaker. Also, the processor 334 may generate an output voice for the target text in which the determined new speaker characteristic is reflected. For example, the processor 334 may input the target text and the speaker characteristics of the new speaker into the artificial neural network text-to-speech synthesis model to generate output speech from the artificial neural network text-to-speech synthesis model. The output voice generated in this way may be provided to the user terminal 210 through the network 220 and output through a speaker associated with the user terminal 210 .
  • the processor 334 may include a speaker characteristic determination module 410 , a synthesized speech output module 420 , a speech characteristic change information determination module 430 , and an output speech verification module 440 .
  • Each of the modules operated on the processor 334 may be configured to communicate with each other.
  • the internal configuration of the processor 334 is described separately for each function, but this does not necessarily mean that the processor 334 is physically separated.
  • the internal configuration of the processor 334 shown in FIG. 4 is only an example, and configurations other than the essential ones may be omitted from the illustration. Accordingly, in some embodiments, the processor 334 may be implemented differently, such as by additionally including components other than the illustrated internal configuration or by omitting some of the illustrated internal components.
  • the speaker characteristic determination module 410 may acquire speaker characteristics of a reference speaker.
  • the features of the reference speaker may be extracted through the learned artificial neural network speaker feature extraction model.
  • for example, the speaker feature determination module 410 may input the speaker id (eg, a speaker one-hot vector, etc.) and the vocalization features (eg, vectors) into the trained artificial neural network speaker feature extraction model and extract the speaker feature (eg, a vector) of the reference speaker.
  • as another example, the speaker feature determination module 410 may input a voice recorded by a speaker and the vocalization features (eg, vectors) into the trained artificial neural network speaker feature extraction model and extract the speaker feature (eg, a vector) of the reference speaker.
  • the speaker characteristic determination module 410 may obtain speaker characteristics and vocalization characteristic change information of the reference speaker, and determine the speaker characteristic of a new speaker by using the acquired speaker characteristic of the reference speaker and the acquired vocalization characteristic change information.
  • as the speaker characteristic of the reference speaker, at least one of the speaker characteristics of a plurality of speakers stored in a storage medium may be selected.
  • the speech characteristic change information may be information indicating a change in the speaker characteristic of the reference speaker, information indicating a change in the speaker characteristics of at least some of the plurality of speakers stored in the storage medium, and/or information indicating a change in the speech characteristics included in the speaker characteristics of at least some of the plurality of speakers.
  • the speaker features of the plurality of speakers may include features inferred from the learned artificial neural network speaker feature extraction model.
  • each of the speaker characteristic and the vocalization characteristic may be expressed in a vector form.
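A minimal sketch of the speaker feature extraction step described above, assuming a PyTorch-style model that maps a speaker id plus a vocalization feature vector to a speaker vector; the class name, layer sizes, and the simple embedding-plus-projection architecture are illustrative assumptions, not the patent's actual network.

```python
import torch
import torch.nn as nn

class SpeakerFeatureExtractor(nn.Module):
    """Assumed sketch: speaker id + vocalization features -> speaker vector."""
    def __init__(self, num_speakers: int, vocal_dim: int, speaker_dim: int = 256):
        super().__init__()
        self.id_embedding = nn.Embedding(num_speakers, speaker_dim)
        self.proj = nn.Linear(speaker_dim + vocal_dim, speaker_dim)

    def forward(self, speaker_id: torch.Tensor, vocal_feat: torch.Tensor) -> torch.Tensor:
        h = torch.cat([self.id_embedding(speaker_id), vocal_feat], dim=-1)
        return self.proj(h)  # speaker vector (eg, of the reference speaker)

# usage sketch: extract a speaker vector for speaker id 3
extractor = SpeakerFeatureExtractor(num_speakers=100, vocal_dim=8)
speaker_vec = extractor(torch.tensor([3]), torch.randn(1, 8))
```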
  • the synthesized speech output module 420 may receive the target text from the user terminal and receive the speaker characteristics of the new speaker from the speaker characteristic determination module 410 .
  • the synthesized voice output module 420 may generate an output voice for the target text in which the speaker characteristics of the new speaker are reflected.
  • the synthesized speech output module 420 may input the target text and the speaker characteristic of the new speaker into the trained artificial neural network text-to-speech synthesis model and generate an output voice (ie, a synthesized voice) from the artificial neural network text-to-speech synthesis model.
  • this artificial neural network text-to-speech synthesis model may be stored in a storage medium (eg, the memory 332 of the information processing system 230, another storage medium accessible by the processor 334 of the information processing system 230, etc.).
  • the artificial neural network text-to-speech synthesis model may include a model trained, based on a plurality of training text items and the speaker characteristics of a plurality of training speakers, to output voices for the training text items in which the speaker characteristics of the plurality of training speakers are reflected.
  • the synthesized voice output module 420 may provide the generated synthesized voice to the user terminal. Accordingly, the generated synthesized voice may be output through any speaker built into the user terminal 210 or connected via wire or wirelessly.
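The inference path of the synthesized speech output module can be summarized as below. The `tts_model` and `vocoder` objects and their `infer(...)` methods are hypothetical placeholders standing in for the trained artificial neural network text-to-speech synthesis model and the post-processing vocoder mentioned in the text; they are assumptions, not a real API.

```python
import numpy as np

def synthesize_for_new_speaker(tts_model, vocoder, target_text: str,
                               new_speaker_vec: np.ndarray) -> np.ndarray:
    """Sketch: condition the text-to-speech model on the new speaker vector,
    then turn the resulting acoustic features into an audible waveform."""
    acoustic_features = tts_model.infer(text=target_text, speaker=new_speaker_vec)
    waveform = vocoder.infer(acoustic_features)
    return waveform
```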
  • the speech characteristic change information determination module 430 may obtain speech characteristic change information from the memory 332 .
  • the speech characteristic change information may be determined based on information determined through a user input via a user terminal (eg, the user terminal 210 of FIG. 2).
  • the speech characteristic change information may include information on a speech characteristic to be changed in order to generate a new speaker.
  • the vocalization characteristic change information may include information (eg, reflection ratio information) related to the speaker characteristic of the reference speaker.
  • hereinafter, specific examples are described in which the speech characteristic change information is determined by the speaker characteristic determination module 410 and the speech characteristic change information determination module 430, and the speaker characteristic of a new speaker is determined using the determined speech characteristic change information and the speaker characteristic of the reference speaker.
  • the speaker characteristic determination module 410 may generate a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the speech characteristic change information into the trained artificial neural network speaker characteristic change generation model, and may output the speaker characteristic of a new speaker by combining the speaker characteristic of the reference speaker and the generated speaker characteristic change.
  • when training the artificial neural network speaker characteristic change generation model, individual speech characteristic information may be obtained for each speaker rather than using the speech characteristic information included in the speaker characteristic of the speaker as an input.
  • information on the vocalization characteristic of a given speaker may be obtained through tagging by a person.
  • the speech feature information of a given speaker may be obtained through an artificial neural network speech feature extraction model trained to infer the speech feature of the speaker from the speaker feature of the given speaker.
  • the obtained speaker's speech characteristic information may be stored in a storage medium. That is, it is possible to adjust the speaker characteristics of the reference speaker according to the change in the vocalization characteristics by using the artificial neural network speaker characteristic change generation model.
  • This artificial neural network speaker feature change generation model can be learned using Equation 1 below.
  • the speech characteristic change information determination module 430 may obtain the speaker characteristics and speech characteristic information required for training from the storage medium and use them to train the artificial neural network speaker characteristic change generation model. In addition, the artificial neural network speaker characteristic change generation model may be trained based on a loss defined as the difference between the output of the model and the corresponding target value.
  • at inference time, the speech characteristic change information determination module 430 may determine the speech characteristic change information by inputting, into the trained artificial neural network speaker characteristic change generation model, the speaker characteristic of the reference speaker and the difference between the target speech characteristic and the speech characteristic of the reference speaker.
  • the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker based on the determined speech characteristic change information and the speaker characteristic of the reference speaker. This new speaker characteristic can be expressed as Equation 2 below.
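Equation 2 itself is not reproduced in this text; a minimal sketch, assuming the combination of the reference speaker characteristic and the generated speaker characteristic change is plain vector addition, would look like this.

```python
import numpy as np

def new_speaker_characteristic(ref_speaker_vec: np.ndarray,
                               speaker_change: np.ndarray) -> np.ndarray:
    """Assumed form of Equation 2: new speaker = reference speaker + change."""
    return ref_speaker_vec + speaker_change
```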
  • the vocalization characteristic change information determining module 430 may extract a normal vector for the target vocalization characteristic by using a vocalization feature classification model corresponding to the target vocalization characteristic.
  • a speech feature classification model corresponding to each of the plurality of speech features may be generated.
  • the vocal feature classification model is a hyperplane-based model, and may be implemented using, for example, a support vector machine (SVM), a linear classifier, or the like, but is not limited thereto.
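A hedged sketch of the hyperplane-based vocalization characteristic classification model: a linear SVM (one of the options the text names) is fit on speaker vectors labeled with one target vocalization characteristic, and the unit normal of its separating hyperplane is taken as the normal vector. The binary labels and the random placeholder data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

def vocal_feature_normal(speaker_vectors: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Fit a linear classifier for one target vocalization characteristic and
    return the unit normal vector of its separating hyperplane."""
    clf = LinearSVC().fit(speaker_vectors, labels)
    normal = clf.coef_[0]
    return normal / np.linalg.norm(normal)

# usage sketch with placeholder data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))       # speaker vectors
y = (X[:, 0] > 0).astype(int)        # stand-in labels for the target characteristic
n_i = vocal_feature_normal(X, y)
```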
  • the target vocalization characteristic may refer to a vocalization characteristic selected from among a plurality of vocalization features, which will be changed and reflected in the speaker characteristic of a new speaker.
  • the speaker's characteristic may be expressed as a speaker vector.
  • rather than using the speech characteristic information included in the speaker characteristic of the speaker as an input, individual speech characteristic information may be obtained for each speaker.
  • voice characteristic information of a given speaker may be obtained through tagging by a person.
  • the speech feature information of a given speaker may be obtained through an artificial neural network speech feature extraction model trained to infer the speech feature of the speaker from the speaker feature of the given speaker.
  • here, l_i denotes the i-th vocalization characteristic, n_i denotes the normal vector of the hyperplane that classifies the i-th vocalization characteristic, and b denotes a bias.
  • to generate the synthesized voice of the new speaker, the speaker feature determination module 410 may obtain, through the trained artificial neural network speaker feature extraction model, the speaker feature vector of the reference speaker that is most similar to the new speaker.
  • the speech characteristic change information determination module 430 may obtain, as the speech characteristic change information, the normal vector of the target vocalization characteristic from the trained vocalization characteristic classification model and information indicating the degree to which the target vocalization characteristic is to be adjusted. Using the speaker feature vector of the reference speaker obtained in this way, the normal vector of the target vocalization characteristic, and the degree of adjustment, the speaker characteristic of the new speaker can be generated according to Equation 4 below.
  • here, the coefficient applied to the normal vector of the target vocalization characteristic may refer to the degree to which the vocalization characteristic is controlled.
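Equation 4 is not reproduced here; a minimal sketch consistent with the description is to move the reference speaker vector along the extracted normal vector by the chosen degree of adjustment, with the coefficient name alpha being an assumption.

```python
import numpy as np

def adjust_target_characteristic(ref_speaker_vec: np.ndarray,
                                 normal_vec: np.ndarray,
                                 alpha: float) -> np.ndarray:
    """Assumed form of Equation 4: shift the speaker vector along the hyperplane
    normal of the target vocalization characteristic by the degree alpha."""
    return ref_speaker_vec + alpha * normal_vec
```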
  • the speaker characteristic determining module 410 may acquire a plurality of speaker characteristics corresponding to a plurality of reference speakers. Also, the speech characteristic change information determination module 430 may obtain a weight set corresponding to a plurality of speaker characteristics and provide the obtained weight set to the speaker characteristic determination module 410 . The speaker characteristic determination module 410 may determine the speaker characteristic of a new speaker as shown in Equation 5 below by applying a weight included in the obtained weight set to each of the plurality of speaker characteristics. That is, the voices of several speakers may be mixed to generate a new speaker's voice.
  • in Equation 5, the speaker vector of the i-th speaker is combined using a weight for speaker i.
  • in this way, the feature vectors of multiple speakers can be mixed into the feature vector of a new speaker, as sketched below.
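Equation 5 is not reproduced here; the sketch below mixes several reference speaker vectors with the weights from the obtained weight set. Normalizing the weights to sum to 1 is an added assumption.

```python
import numpy as np

def mix_speaker_vectors(speaker_vecs: np.ndarray, weights) -> np.ndarray:
    """Assumed form of Equation 5: weighted sum of the reference speakers'
    feature vectors yields the new speaker's feature vector."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # assumed normalization
    return (w[:, None] * speaker_vecs).sum(axis=0)

# usage sketch: mix two reference speakers 70/30
vecs = np.random.default_rng(1).normal(size=(2, 64))
new_speaker_vec = mix_speaker_vectors(vecs, [0.7, 0.3])
```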
  • the speaker characteristic determination module 410 may generate a new speaker characteristic vector by adjusting a pre-calculated vocalization characteristic axis.
  • a speaker feature includes one or more vocal features.
  • the vocalization characteristic change information determination module 430 may find the vocalization characteristic axis and adjust the vocalization characteristic axis.
  • the adjusted vocalization characteristic axis may be provided to the speaker characteristic determination module 410 and used to determine the speaker characteristic of the new speaker. That is, as shown in Equation 6 below, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using the speaker characteristic r of the reference speaker, the vocalization characteristic axis, and the weight of the speech characteristic change information.
  • in Equation 6, the j-th vocalization characteristic axis is scaled by a weight for the j-th vocalization characteristic.
  • each vocalization characteristic axis means one axis in the vocalization characteristic space that distinguishes an individual vocalization characteristic, and it may have the same dimension as the speaker representation.
  • the speech characteristic change information determining module 430 may normalize each of the speaker vectors of the plurality of speakers.
  • the speaker vectors of the plurality of speakers may be included in the speaker characteristics of the plurality of speakers.
  • for example, the speech characteristic change information determination module 430 may perform Z-normalization, in which the mean is subtracted from all data and the result is divided by the standard deviation, or normalization in which only the mean is subtracted from all data.
  • here, N(·) denotes a normalization function and D(·) denotes an inverse normalization (denormalization) function.
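A small sketch of the normalization function N(·) and its inverse D(·) described above, assuming per-dimension Z-normalization over the set of speaker vectors.

```python
import numpy as np

class SpeakerVectorNormalizer:
    """N(.): subtract the per-dimension mean and divide by the standard
    deviation; D(.): the inverse operation."""
    def fit(self, speaker_vectors: np.ndarray) -> "SpeakerVectorNormalizer":
        self.mean = speaker_vectors.mean(axis=0)
        self.std = speaker_vectors.std(axis=0) + 1e-8   # avoid division by zero
        return self

    def normalize(self, vecs: np.ndarray) -> np.ndarray:    # N(.)
        return (vecs - self.mean) / self.std

    def denormalize(self, vecs: np.ndarray) -> np.ndarray:  # D(.)
        return vecs * self.std + self.mean
```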
  • the speech characteristic change information determination module 430 may determine the plurality of main components by performing dimensionality reduction analysis on the speaker vectors of the plurality of normalized speakers.
• the dimensionality reduction analysis may be performed through a conventionally known dimensionality reduction technique, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), or t-distributed Stochastic Neighbor Embedding (t-SNE).
  • the speech characteristic change information determination module 430 may determine a plurality of main components P in Equation 8 below by performing PCA on N(R).
• in Equation 8, each principal component corresponds to one direction in the speaker vector space, and the number of principal components may be up to the number of dimensions of the speaker representation r.
• the vocalization characteristic change information determination module 430 may select at least one principal component from among the plurality of determined principal components. For example, principal components associated with the vocalization characteristics desired to be altered in the speaker characteristics of the new speaker may be selected.
• the j-th vocalization characteristic axis may then be obtained by applying the inverse normalization function D to the selected principal component.
• the j-th vocalization feature axis and a weight corresponding thereto are provided to the speaker characteristic determination module 410, so that the speaker characteristic of a new speaker can be generated through Equation 6 above.
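A minimal sketch of this normalize–PCA–select–denormalize pipeline, using scikit-learn. The choice of Z-normalization, the number of components, and the way the selected component is mapped back through the inverse scaling are assumptions, not the patent's exact Equations 7 to 9.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def vocal_axis_from_pca(speaker_vectors, component_index=0):
    """Derive a candidate vocalization characteristic axis from training speaker vectors."""
    R = np.stack(speaker_vectors)                 # shape: (num_speakers, dim)
    scaler = StandardScaler()                     # N(.): Z-normalization
    R_norm = scaler.fit_transform(R)
    pca = PCA(n_components=min(R.shape))          # dimensionality reduction analysis
    pca.fit(R_norm)
    component = pca.components_[component_index]  # selected principal component
    # D(.): undo the per-dimension scaling so the axis lives in the original speaker space.
    axis = component * scaler.scale_
    return axis / np.linalg.norm(axis)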
• instead of the vocalization characteristic axis used in Equation 6, the vocalization characteristic change information determination module 430 may use an axis obtained through Equation 10, whereby interference between the vocalization feature axes can be removed.
• the axis obtained in this way may refer to a vocalization characteristic axis along which only the intended vocalization characteristics are changed.
  • the speech characteristic change information determination module 430 may obtain speaker vectors of a plurality of speakers having different target speech characteristics.
  • the speaker vectors of the plurality of learning speakers may be included in the speaker characteristics of the plurality of learning speakers.
  • each of the plurality of speakers is assigned a label for one or more vocal features.
  • a vocal feature label may be assigned to each of a plurality of speakers as shown in FIG.
• the vocalization characteristics may include tone, vocal strength, vocal speed, gender, and age. Tone, vocal strength, and vocal speed may each be expressed as a discrete level, and each such level may be an element of the label l.
• gender may be expressed as male or female, and age may be expressed as a numerical value. For example, a label indicating a low tone, medium vocal strength, and high vocal speed may represent the vocalization characteristics of a 50-year-old male.
• as shown in Equation 11 above, the speech characteristic change information determination module 430 may determine a vocalization characteristic based on the difference between the speaker vectors of a plurality of speakers having different target vocalization characteristics.
  • the vocal features may be included in the speech characteristic change information.
  • This speech characteristic change information is provided to the speaker characteristic determination module 410 so that the speaker characteristic of a new speaker can be determined using Equation 6 above.
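A simple sketch of this difference-based construction, treating Equation 11 as a normalized difference between two speaker vectors whose labels differ only in the target vocalization characteristic; the exact form in the patent is not reproduced here.

```python
import numpy as np

def axis_from_speaker_pair(r_with_feature, r_without_feature):
    """Estimate a vocalization characteristic axis from two labeled speakers.

    r_with_feature / r_without_feature: speaker vectors of speakers whose labels
    differ only in the target vocalization characteristic (e.g. high vs. low tone).
    """
    diff = np.asarray(r_with_feature, float) - np.asarray(r_without_feature, float)
    return diff / np.linalg.norm(diff)  # unit-length axis pointing toward the feature
```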
  • the speech characteristic change information determining module 430 may determine the speech characteristic change information based on a difference between the averages of the speaker vectors of a plurality of speaker groups.
  • the speaker features of the plurality of speakers include speaker vectors of the plurality of speakers, and each of the speaker features of the plurality of speakers is assigned a label for one or more vocalization features.
  • the speech characteristic change information determination module 430 may obtain speaker vectors of speakers included in each of a plurality of speaker groups having different target speech characteristics.
• the plurality of speaker groups may include a first speaker group and a second speaker group.
• the speech characteristic change information determination module 430 may calculate the average of the speaker vectors of the speakers included in the first speaker group and the average of the speaker vectors of the speakers included in the second speaker group.
• then, as in Equation 12, a vocalization characteristic may be determined based on the difference between the average of the speaker vectors corresponding to the first speaker group and the average of the speaker vectors corresponding to the second speaker group. The determined vocalization characteristic may be included in the speech characteristic change information.
  • this speech characteristic change information is provided to the speaker characteristic determination module 410 so that the speaker characteristic of a new speaker can be determined using Equation 6 above.
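Under the same caveat, the group-mean version described around Equation 12 could be sketched as follows; the grouping criterion and normalization are assumptions.

```python
import numpy as np

def axis_from_speaker_groups(group_a_vectors, group_b_vectors):
    """Vocalization characteristic axis as the difference of group means.

    group_a_vectors: speaker vectors labeled with the target characteristic (e.g. 'fast').
    group_b_vectors: speaker vectors without it (e.g. 'slow').
    """
    mean_a = np.mean(np.stack(group_a_vectors), axis=0)
    mean_b = np.mean(np.stack(group_b_vectors), axis=0)
    diff = mean_a - mean_b
    return diff / np.linalg.norm(diff)
```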
• as in Equation 13 below, the speech characteristic change information determination module 430 may input the speaker characteristics of a plurality of speakers into an artificial neural network vocalization feature prediction model and output the vocalization characteristic of each of the plurality of speakers.
• then, from among the speaker characteristics of the plurality of speakers, the speech characteristic change information determination module 430 may select or determine a speaker characteristic for which the j-th vocalization feature among the output vocalization characteristics has a difference value from the corresponding vocalization feature of the reference speaker. The selected speaker characteristic may be provided to the speaker characteristic determination module 410.
  • the speaker characteristic determining module 410 may obtain a weight corresponding to the speaker characteristic of the selected speaker. Then, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker by using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and a weight corresponding to the speaker characteristic of the selected speaker. For example, the speaker characteristic determination module 410 may determine the speaker characteristic of the new speaker using Equation 14 below.
• in Equation 14, the speaker characteristic of the new speaker may be determined from the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and a weight corresponding to the speaker characteristic of the selected speaker.
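The exact form of Equation 14 is not reproduced in this text. One plausible reading, sketched below as an assumption rather than the patent's definitive formula, interpolates the reference speaker toward the selected speaker by the given weight.

```python
import numpy as np

def move_toward_selected(r_reference, r_selected, weight):
    """Blend the reference speaker with a speaker selected for its vocalization feature.

    weight in [0, 1]: 0 keeps the reference speaker, 1 fully adopts the selected speaker.
    """
    r_ref = np.asarray(r_reference, float)
    r_sel = np.asarray(r_selected, float)
    return r_ref + weight * (r_sel - r_ref)
```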
• the output voice verification module 440 may determine whether the output voice associated with the speaker characteristic of the new speaker is a new output voice that is not previously stored. According to an embodiment, the output voice verification module 440 may calculate a hash value corresponding to a speaker feature (eg, a speaker feature vector) of the new speaker by using a hash function. In another embodiment, instead of calculating a hash value from the speaker feature of the new speaker directly, the output voice verification module 440 may extract the speaker feature from the new output voice and calculate a hash value using the extracted speaker feature of the new speaker.
  • the output voice verification module 440 may determine whether there is content associated with a hash value similar to the calculated hash value among the plurality of speaker contents stored in the storage medium. When there is no content associated with a hash value similar to the calculated hash value, the output voice verification module 440 may determine that the output voice associated with the speaker characteristic of the new speaker is the new output voice. When it is determined as the new output voice, the synthesized voice reflecting the speaker characteristics of the new speaker may be set to be used.
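As a rough illustration of this verification step (the hashing scheme, the rounding precision, and the notion of a "similar" hash are assumptions; the patent does not specify them), one could do:

```python
import hashlib
import numpy as np

def speaker_hash(speaker_vector, decimals=3):
    """Hash a speaker feature vector after rounding, so near-identical vectors collide."""
    rounded = np.round(np.asarray(speaker_vector, float), decimals)
    return hashlib.sha256(rounded.tobytes()).hexdigest()

def is_new_output_voice(speaker_vector, stored_hashes):
    """Return True if no previously stored content shares the same hash."""
    return speaker_hash(speaker_vector) not in stored_hashes
```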
• the method 500 for generating an output voice reflecting the speaker characteristics of the new speaker may be performed by a processor (eg, the processor 314 of the user terminal 210 and/or the processor 334 of the synthesized voice generating system 230). As shown, the method 500 may be initiated by the processor receiving the target text (S510).
  • the processor may acquire a speaker characteristic of the reference speaker corresponding to the reference speaker ( S520 ).
  • the speaker characteristic of the reference speaker may include a speaker vector. Additionally or alternatively, the speaker characteristics of the reference speaker may include vocalization characteristics of the reference speaker.
  • the speaker characteristics of the reference speaker may include a plurality of speaker characteristics corresponding to the plurality of reference speakers.
  • the plurality of speaker features may include a plurality of speaker vectors.
  • the processor may acquire vocal feature change information ( S530 ).
  • the processor may acquire speaker characteristics of the plurality of speakers.
  • the speaker characteristics of the plurality of speakers may include a plurality of speaker vectors.
• the processor may determine the plurality of principal components by performing normalization on each of the speaker vectors of the plurality of speakers and performing dimensionality reduction analysis on the normalized speaker vectors of the plurality of speakers. At least one principal component from among the plurality of determined principal components may be selected. Then, the processor may determine the speech characteristic change information using the selected principal component.
  • the processor may obtain speaker vectors of a plurality of speakers having different target vocalization characteristics, and determine the speech characteristic change information based on a difference between the obtained speaker vectors of the plurality of speakers.
  • the processor may obtain a speaker vector of speakers included in each of a plurality of speaker groups having different target vocalization characteristics.
  • the plurality of speaker groups may include a first speaker group and a second speaker group. Then, the processor may calculate an average of speaker vectors of speakers included in the first speaker group, and calculate an average of speaker vectors of speakers included in the second speaker group.
  • the processor may determine the speech characteristic change information based on a difference between an average of speaker vectors corresponding to the first speaker group and an average of speaker vectors corresponding to the second speaker group.
• the processor may input the speaker characteristics of the plurality of speakers to the artificial neural network vocalization characteristic prediction model and output the vocalization characteristics of each of the plurality of speakers. Then, from among the speaker characteristics of the plurality of speakers, the processor may select a speaker characteristic for which a difference exists between the target vocalization characteristic among the output vocalization characteristics of that speaker and the target vocalization characteristic among the plurality of vocalization characteristics of the reference speaker, and may obtain a weight corresponding to the selected speaker characteristic.
  • the speaker characteristic of the selected speaker and the weight corresponding to the speaker characteristic of the selected speaker may be obtained as speech characteristic change information.
  • the processor may extract a normal vector for the target speech feature using a speech feature classification model corresponding to the target speech feature.
  • the normal vector may refer to a normal vector of a hyperplane that classifies the target speech feature, and information indicating the degree of adjusting the target speech feature may be obtained.
  • the extracted normal vector and information indicating the degree to which the target speech feature is adjusted may be obtained as speech feature change information.
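A sketch of how such a normal vector might be extracted from a linear classifier trained on the target vocalization feature. scikit-learn's LogisticRegression is used here only as a stand-in for the patent's vocalization feature classification model; the classifier choice and the adjustment step are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def normal_vector_for_feature(speaker_vectors, labels):
    """Train a linear classifier for a target vocalization feature and return the
    unit normal vector of its separating hyperplane."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.stack(speaker_vectors), labels)  # labels: e.g. 1 = 'high tone', 0 = 'low tone'
    normal = clf.coef_[0]
    return normal / np.linalg.norm(normal)

def adjust_speaker(r_reference, normal, degree):
    """Move the reference speaker vector along the normal by the requested degree."""
    return np.asarray(r_reference, float) + degree * normal
```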
  • the processor may determine the speaker characteristics of the new speaker by using the acquired speaker characteristics of the reference speaker and the acquired speech characteristic change information ( S540 ).
• the processor may generate a speaker characteristic change by inputting the speaker characteristic of the reference speaker and the acquired speech characteristic change information into the artificial neural network speaker characteristic change generation model, and may output the speaker characteristic of the new speaker by combining the speaker characteristic of the reference speaker with the generated speaker characteristic change.
  • the artificial neural network speaker characteristic change generation model may be learned by using the speaker characteristics of the plurality of learned speakers and the plurality of speech characteristics included in the speaker characteristics of the plurality of learned speakers.
• the processor may determine the speaker characteristic of the new speaker by applying a weight included in the obtained weight set to each of the plurality of speaker characteristics. In another embodiment, the processor may determine the speaker characteristic of the new speaker by using the speaker characteristic of the reference speaker, the speech characteristic change information, and the weights included in the speech characteristic change information. According to another embodiment, the processor may determine the speaker characteristic of the new speaker by using the speaker characteristic of the reference speaker, the speaker characteristic of the selected speaker, and a weight corresponding to the speaker characteristic of the selected speaker. According to another embodiment, the processor may determine the speaker characteristic of the new speaker based on the speaker vector of the reference speaker, the extracted normal vector, and the degree to which the target vocalization characteristic is adjusted.
  • the processor may input the target text and the determined speaker characteristics of the new speaker to the artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected (S550).
  • the artificial neural network text-to-speech synthesis model learns to output voices for a plurality of learning text items, in which the speaker characteristics of the plurality of learning speakers are reflected, based on the plurality of learning text items and the speaker characteristics of the plurality of learning speakers. model may be included.
  • the processor may calculate a hash value corresponding to the speaker feature vector using a hash function.
  • the speaker feature vector may be included in the speaker feature of the new speaker. Then, the processor may determine whether there is content associated with a hash value similar to the calculated hash value among the plurality of speaker contents stored in the storage medium. If there is no content associated with the hash value similar to the calculated hash value, the processor may determine that the output voice associated with the speaker characteristic of the new speaker is the new output voice.
  • a speech synthesizer learned using learning data including the synthesized voice of a new speaker generated according to the above-described method for generating a synthesized voice of a new speaker may be provided.
  • the voice synthesizer may be any voice synthesizer that can be learned using learning data including the synthesized voice of a new speaker generated according to the above-described method for generating a synthesized voice of a new speaker.
  • the speech synthesizer may include any text-to-speech synthesis (TTS) model trained using this training data.
  • TTS text-to-speech synthesis
  • the TTS model may be implemented as a machine learning model or an artificial neural network model known in the art.
• since the speech synthesizer has been trained with the synthesized voice of the new speaker as training data, when the target text is input, the target text may be output as the synthesized voice of the new speaker. According to an embodiment, such a voice synthesizer may be included or implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2.
• An apparatus for providing a synthesized voice may be provided, including a memory configured to store a synthesized voice of a new speaker generated according to the method for generating a synthesized voice of a new speaker described above, and at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory, the at least one program including instructions for outputting at least a part of the synthesized voice of the new speaker stored in the memory.
  • the device for providing the synthesized voice may refer to any device that stores the synthesized voice of a new speaker that has been generated in advance and provides at least a part of the stored synthesized voice.
  • the apparatus for providing such a synthesized voice may be implemented in the user terminal 210 of FIG. 2 and/or the information processing system 230 of FIG. 2 .
  • the apparatus for providing the synthesized voice is not limited thereto, but may be implemented as a video system, an ARS system, a game system, a sound pen, or the like.
• when the apparatus for providing such a synthesized voice is implemented in the information processing system 230, at least a part of the output synthesized voice of the new speaker may be transmitted to a user terminal connected to the information processing system 230 by wire or wirelessly.
  • the information processing system 230 may provide at least a part of the output synthesized voice of the new speaker in a streaming manner.
  • a method for providing a synthesized voice of a new speaker comprising the steps of: storing the synthesized voice of the new speaker generated according to the above-described method; and providing at least a part of the stored synthesized voice of the new speaker.
  • This method may be executed by the processor of the user terminal 210 and/or the processor of the information processing system 230 of FIG. 2 .
  • This method may be provided for a service providing a synthesized voice of a new speaker.
  • a service may be implemented as a video system, an ARS system, a game system, a sound pen, etc., but is not limited thereto.
  • the artificial neural network text-to-speech synthesis model may include an encoder 610 , an attention 620 , and a decoder 630 .
  • the encoder 610 may receive the target text 640 as an input.
  • the encoder 610 may be configured to generate pronunciation information for the input target text 640 (eg, phoneme information for the target text, a vector for each of a plurality of phonemes included in the target text, etc.).
• the encoder 610 may convert the target text 640 into character embeddings.
  • the generated character embeddings may be passed to a pre-net including a fully-connected layer.
  • the encoder 610 may provide the output from the pre-net to the CBHG module to output encoder hidden states.
  • the CBHG module may include a 1D convolution bank, max pooling, a highway network, and a bidirectional gated recurrent unit (GRU).
  • the pronunciation information generated by the encoder 610 may be provided to the attention 620 , and the attention 620 may connect or combine the provided pronunciation information with voice data corresponding to the pronunciation information.
  • attention 620 may be configured to determine from which portion of the input text to generate speech.
  • the pronunciation information connected in this way and voice data corresponding to the pronunciation information may be provided to the decoder 630 .
  • the decoder 630 may be configured to generate the voice data 660 corresponding to the target text 640 based on the connected pronunciation information and the voice data corresponding to the pronunciation information.
• the decoder 630 may receive the speaker characteristics 658 of the new speaker and generate an output voice for the target text reflecting the speaker characteristics of the new speaker.
• the speaker characteristics 658 of the new speaker may be generated through the vocalization characteristic change module 656.
  • the vocalization characteristic change module 656 may be implemented through the algorithm and/or artificial neural network model described in FIG. 4 .
• the artificial neural network speaker feature extraction model 650 may obtain the speaker feature (r) of the reference speaker.
  • the vocalization feature C 654 and the speaker feature r of the speaker may be expressed in a vector form.
• the artificial neural network speaker feature extraction model 650 may be trained to receive a plurality of training speaker ids and a plurality of training vocalization features (eg, vectors) and extract a speaker vector (ground truth) of the reference speaker.
• the vocalization characteristic change information may be determined through the vocalization characteristic change module 656, and further, the speaker characteristic 658 of the new speaker can be determined.
  • the input information (d) 655 associated with the speech characteristic change information may include any information desired to be reflected or changed in a new speaker.
• the decoder 630 may include a pre-net composed of fully-connected layers, an attention recurrent neural network (RNN) including a gated recurrent unit (GRU), and a decoder RNN including a residual GRU.
  • the voice data 660 output from the decoder 630 may be expressed as a mel-scale spectrogram.
  • the output of the decoder 630 may be provided to a post-processing processor (not shown).
  • the CBHG of the post-processing processor may be configured to convert the mel-scale spectrogram of the decoder 630 into a linear-scale spectrogram.
  • the output signal of the CBHG of the post-processing processor may include a magnitude spectrogram.
  • the phase of the output signal of the CBHG of the post-processing processor may be restored through a Griffin-Lim algorithm and subjected to inverse short-time Fourier transform.
  • the post-processing processor may output a voice signal in a time domain.
  • the post-processing processor may be implemented using a GAN-based vocoder.
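For the non-neural post-processing path described above (mel-scale spectrogram to linear-scale spectrogram, then Griffin-Lim phase reconstruction and inverse short-time Fourier transform), a rough sketch with librosa might look like the following; the sample rate, FFT size, and iteration count are assumptions, and a GAN-based vocoder could be used instead as noted above.

```python
import librosa
import numpy as np

def mel_to_waveform(mel_spectrogram, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
    """Approximate waveform reconstruction from a mel-scale magnitude spectrogram."""
    # Invert the mel filterbank to a linear-scale magnitude spectrogram.
    linear = librosa.feature.inverse.mel_to_stft(
        np.asarray(mel_spectrogram), sr=sr, n_fft=n_fft, power=1.0
    )
    # Recover phase with the Griffin-Lim algorithm and transform back to the time domain.
    return librosa.griffinlim(linear, n_iter=n_iter, hop_length=hop_length, n_fft=n_fft)
```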
• the processor may use a database including training text items, the speaker characteristics of a plurality of training speakers, and training voice data items corresponding to the training text items in which the speaker characteristics are reflected.
  • the processor may learn the artificial neural network text-to-speech synthesis model to output a synthesized voice reflecting the speaker characteristics of the learning speaker based on the training text item, the speaker characteristics of the training speaker, and the training voice data item corresponding to the training text item.
  • the processor may generate an output voice for the target text in which the speaker characteristics of the new speaker are reflected through the artificial neural network text-to-speech synthesis model created/learned in this way.
• the processor may input the target text 640 and the speaker characteristics 658 of the new speaker, and a synthesized voice may be generated based on the output voice data 660.
• the synthesized speech generated in this way may reflect the speaker characteristics 658 of the new speaker and may include a voice uttering the target text 640.
  • the decoder 630 may include the attention 620 .
• the speaker characteristics 658 of the new speaker are illustrated as being input to the decoder 630, but the present disclosure is not limited thereto.
• the speaker characteristics 658 of the new speaker may be input to the encoder 610 and/or the attention 620.
  • FIG. 7 is a diagram illustrating an example of generating an output voice in which a speaker characteristic of a new speaker is reflected, according to another embodiment of the present disclosure.
  • the encoder 710 , the attention 720 , and the decoder 730 illustrated in FIG. 7 may perform functions similar to those of the encoder 610 , the attention 620 and the decoder 630 illustrated in FIG. 6 , respectively. Accordingly, the description overlapping with FIG. 6 will be omitted.
  • the encoder 710 may receive the target text 740 as input.
  • the encoder 710 is configured to generate pronunciation information for the input target text 740 (eg, a plurality of phoneme information included in the target text, a vector for each of a plurality of phonemes included in the target text, etc.).
  • the pronunciation information generated by the encoder 710 may be provided to the attention 720 , and the attention 720 may connect the pronunciation information and voice data corresponding to the pronunciation information.
  • the pronunciation information connected as described above and voice data corresponding to the pronunciation information may be provided to the decoder 730 .
  • the decoder 730 may be configured to generate the voice data 760 corresponding to the target text 740 based on the connected pronunciation information and the voice data corresponding to the pronunciation information.
• the decoder 730 may receive the speaker characteristics 758 of the new speaker and generate an output voice for the target text reflecting the speaker characteristics of the new speaker.
• the speaker characteristics 758 of the new speaker may be generated through the vocalization characteristic change module 756.
  • the vocal feature change module 756 may be implemented through the algorithm and/or artificial neural network model described in FIG. 4 .
• the artificial neural network speaker feature extraction model 750 may output speaker identification information (i) 753 based on the voice 752 recorded by the speaker and the vocalization feature set (C) 754, and may also obtain the speaker characteristic (r) of the reference speaker.
  • the speech feature set may include one or more speech features c.
  • the speech feature set (C) 754 and the speaker feature (r) of the speaker may be expressed in a vector form.
• the artificial neural network speaker feature extraction model may be trained to receive voices recorded by a plurality of training speakers and a plurality of training vocalization features (eg, vectors) and extract a speaker vector (ground truth) of the reference speaker.
• the vocalization characteristic change module 756 may determine the vocalization characteristic change information using the generated reference speaker characteristic (r) and the input information (d) 755 associated with the vocalization characteristic change information, and furthermore, the speaker characteristic of the new speaker can be determined.
  • the input information (d) 755 associated with the speech characteristic change information may include any information desired to be reflected or changed in a new speaker.
• the processor may use a database including pairs of a plurality of training text items and training voice data items corresponding to the training text items in which the speaker characteristics of the training speakers are reflected.
  • the processor may learn the artificial neural network text-to-speech synthesis model to output the synthesized voice 760 in which the speaker characteristics of the new speaker are reflected, based on the speaker characteristics of the training speaker and the training voice data item corresponding to the training text item.
  • the processor may generate the output voice 760 in which the speaker characteristics of the new speaker are reflected through the artificial neural network text-to-speech synthesis model created/learned in this way.
• the processor may input the target text 740 and the speaker characteristics 758 of the new speaker, and a synthesized voice may be generated based on the output voice data 760.
• the synthesized speech generated in this way may reflect the speaker characteristics 758 of the new speaker and may include a voice uttering the target text 740.
• although the attention 720 and the decoder 730 are illustrated as separate components in FIG. 7, the present disclosure is not limited thereto.
  • the decoder 730 may include an attention 720 .
• the speaker characteristics of the new speaker are illustrated as being input to the decoder 730, but the present disclosure is not limited thereto.
• the speaker characteristics of the new speaker may be input to the encoder 710 and/or the attention 720.
  • a target text is expressed as one input data item (eg, a vector) and one output data item (eg, a melscale spectrogram) is output through an artificial neural network text-to-speech synthesis model.
  • the present invention is not limited thereto, and may be configured to output any number of output data items by inputting an arbitrary number of input data items to the artificial neural network text-to-speech synthesis model.
  • the user terminal (eg, the user terminal 210 ) may output a synthesized voice reflecting the speaker characteristics of the new speaker through the user interface 800 .
  • the user interface 800 may include a text area 810 , a speech characteristic adjustment area 820 , a speaker characteristic adjustment area 830 , and an output voice display area 840 .
  • the processor may be the processor 314 of the user terminal 210 and/or the processor 334 of the information processing system 230 .
  • the processor may receive the target text through a user input using an input interface (eg, a keyboard, a mouse, a microphone, etc.), and display the received target text through the text area 810 .
  • the processor may receive a document file including text, extract text in the document file, and display the extracted text in the text area 810 .
  • the text displayed in the text area 810 in this way may be a target to be uttered through a synthesized voice.
  • One or more reference speakers may be selected in response to a user input for selecting one or more reference speakers from among the reference speakers displayed in the speaker characteristic adjustment area 830 . Then, the processor may receive a weight (eg, a reflection ratio) for the speaker characteristics of the selected one or more reference speakers as speech characteristic change information. For example, the processor may receive a weight for each of the speaker characteristics of one or more reference speakers in Equation 5 described with reference to FIG. 4 through an input in the speaker characteristic adjustment region 830 .
• in the speaker characteristic adjustment area 830, six reference speakers, 'Eun-Byul Ko', 'Soo-Min Kim', 'Woo-Rim Lee', 'Do-Young Song', 'Seong-Soo Shin', and 'Jin-Kyung Shin', may be given. That is, the user may select one or more reference speakers from among the six reference speakers and adjust a reflection ratio adjustment means (eg, a bar) corresponding to each selected reference speaker, thereby determining the ratio at which the speaker characteristics of the selected reference speaker are reflected in the speaker characteristics of the new speaker. Alternatively, one or more of the six reference speakers may be randomly selected.
  • the reflection ratios for each speaker may be received so that the sum of reflection ratios corresponding to the selected one or more reference speakers becomes 100. Alternatively, even if the reflection ratio corresponding to the one or more reference speakers selected in this way is greater than or less than 100, each reflection ratio may be automatically adjusted so that the sum of the ratios becomes 100.
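A trivial sketch of the automatic adjustment mentioned above, rescaling whatever ratios the user enters so that they sum to 100; the handling of an all-zero input and the absence of rounding are assumptions.

```python
def normalize_reflection_ratios(ratios):
    """Rescale user-entered reflection ratios so they sum to 100."""
    total = sum(ratios)
    if total == 0:
        return [100.0 / len(ratios)] * len(ratios)  # fall back to an even split
    return [r * 100.0 / total for r in ratios]

# e.g. normalize_reflection_ratios([30, 30, 60]) -> [25.0, 25.0, 50.0]
```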
• in the illustrated example, six reference speakers are used to generate the speaker characteristics of a new speaker, but the present disclosure is not limited thereto; five or fewer reference speakers, or seven or more reference speakers, may be displayed in the speaker characteristic adjustment area 830 and used to generate the speaker characteristics of the new speaker.
  • the processor may receive a weight (eg, a reflection ratio) for each of the plurality of speech features as speech feature change information through the speech feature adjustment region 820 .
  • the processor may receive a weight for each of the plurality of speech features in Equation 6 described with reference to FIG. 4 through an input in the speech feature adjustment region 820 .
• r in Equation 6 may be the speaker characteristic of the reference speaker generated according to the selection of one or more reference speakers and their reflection ratios in the speaker characteristic adjustment area 830.
• that is, r may be the result value of Equation 5 described with reference to FIG. 4, obtained through the input in the speaker characteristic adjustment area 830.
  • gender, vocal tone, vocal strength, male age, female age, pitch, and tempo may be given as quantitatively adjustable vocal characteristics in the vocalization characteristic adjustment area 820 .
• a ratio adjusting means (eg, a bar) may be provided for each vocalization characteristic, and when the ratio adjusting means for a vocalization characteristic is not adjusted, the corresponding vocalization characteristic may not be reflected in the speaker characteristic of the new speaker at all.
• in the illustrated example, seven vocalization characteristics are used to generate the speaker characteristics of a new speaker, but the present disclosure is not limited thereto; six or fewer vocalization characteristics, or additional vocalization characteristics, may be displayed in the vocalization characteristic adjustment area 820 and used to generate the speaker characteristics of the new speaker.
• the processor may receive the speaker characteristics of the one or more reference speakers selected in the speaker characteristic adjustment area 830, together with the weights input through the speaker characteristic adjustment area 830 and/or the vocalization characteristic adjustment area 820, and may generate the speaker characteristic of a new speaker using the speech characteristic change information including these weights.
  • One of the methods described with reference to FIG. 4 may be used as a specific method for generating the speaker characteristic of a new speaker.
  • the processor may input the target text and the generated speaker characteristics of the new speaker to the artificial neural network text-to-speech synthesis model to generate an output voice for the target text in which the determined speaker characteristics of the new speaker are reflected.
• when the input in the text area 810, the vocalization characteristic adjustment area 820, and the speaker characteristic adjustment area 830 is completed and the 'Create' button located below the vocalization characteristic adjustment area 820 is selected or clicked, an output voice for the target text in which the speaker characteristics of the new speaker are reflected may be generated.
  • the output voice thus generated may be output through a speaker connected to the user terminal.
  • the reproduction time and/or position of the output voice may be displayed through the output voice display area 840 .
  • the artificial neural network model 900 is a statistical learning algorithm implemented based on the structure of a biological neural network or a structure for executing the algorithm in machine learning technology and cognitive science.
• the artificial neural network model 900 may represent a machine learning model with problem-solving ability, in which artificial neurons forming a network through synaptic connections, as in a biological neural network, learn by repeatedly adjusting the synaptic weights so as to reduce the error between the correct output corresponding to a specific input and the inferred output.
  • the artificial neural network model 900 may include arbitrary probabilistic models, neural network models, etc.
• the artificial neural network model 900 may include the aforementioned artificial neural network text-to-speech synthesis model, the aforementioned artificial neural network speaker characteristic change generation model, the aforementioned artificial neural network vocalization feature prediction model, and/or the aforementioned artificial neural network speaker feature extraction model.
  • the artificial neural network model 900 may be implemented as a multilayer perceptron (MLP) composed of multiple layers of nodes and connections between them.
  • the artificial neural network model 900 according to the present embodiment may be implemented using one of various artificial neural network structures including MLP.
• the artificial neural network model 900 may be composed of an input layer 920 that receives an input signal or data 910 from the outside, an output layer 940 that outputs an output signal or data 950 corresponding to the input data, and n hidden layers 930_1 to 930_n that are located between the input layer 920 and the output layer 940, receive a signal from the input layer 920, extract characteristics, and transfer the characteristics to the output layer 940.
  • the output layer 940 may receive a signal from the hidden layers 930_1 to 930_n and output the signal to the outside.
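As a minimal sketch of the layered structure described above, using PyTorch; the layer sizes, activation functions, and dimensions are illustrative assumptions, not the model actually used for speech synthesis.

```python
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    """Input layer -> n hidden layers -> output layer, as in the generic model 900."""
    def __init__(self, in_dim=512, hidden_dim=256, out_dim=80, n_hidden=3):
        super().__init__()
        layers = [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
        for _ in range(n_hidden - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers.append(nn.Linear(hidden_dim, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```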
• the learning method of the artificial neural network model 900 may include a supervised learning method, which learns to be optimized for solving a problem by means of a teacher signal (correct answer), and an unsupervised learning method, which does not require a teacher signal.
• the processor may input text information and the speaker characteristics of a new speaker into the artificial neural network model 900, and the artificial neural network model 900 may be trained end-to-end to output voice data for the text in which the speaker characteristics of the new speaker are reflected. That is, when information about the text and information about the new speaker are input to the artificial neural network model 900, the intermediate process is learned by the model itself and a synthesized voice can be output.
  • the processor may generate the synthesized speech by converting the text information and the speaker characteristics of the new speaker into embeddings (eg, embedding vectors) through the encoding layer of the neural network model 900 .
  • the input variable of the artificial neural network model 900 may be a vector 910 composed of vector data elements representing text information and new speaker information.
  • the text information may be represented by arbitrary embeddings representing text, for example, it may be represented by character embeddings, phoneme embeddings, and the like.
  • the speaker characteristics of the new speaker may be represented by any type of embedding representing the speaker's utterance.
  • the output variable may be composed of a result vector 950 representing the synthesized voice for the target text in which the speaker characteristics of the new speaker are reflected.
• the input layer 920 and the output layer 940 of the artificial neural network model 900 may be matched with a plurality of input variables and a plurality of corresponding output variables, respectively, and by adjusting the synapse values between the nodes included in the input layer 920, the hidden layers 930_1 to 930_n (where n is a natural number equal to or greater than 2), and the output layer 940, the artificial neural network model 900 can be trained to infer the correct output corresponding to a specific input.
  • correct answer data of the analysis result may be used, and such correct answer data may be obtained as a result of an annotator's annotation work.
• through this learning process, the characteristics hidden in the input variables of the artificial neural network model 900 can be identified, and the synapse values (or weights) between nodes may be adjusted so that the error between the output variable calculated based on the input variables and the target output is reduced.
• when the artificial neural network model 900 is trained, a loss function that minimizes the mutual information between the text information and the new speaker information (eg, between the text information embedding and the new speaker information embedding) may be used.
• for example, when the artificial neural network model 900 is an artificial neural network text-to-speech synthesis model, it may include a component (eg, a fully-connected layer) configured to predict the mutual information between the text information embedding and the new speaker information embedding, which is used as part of the loss.
  • the artificial neural network model 900 may be trained to predict and minimize mutual information between text information and speaker information.
  • the artificial neural network model 900 learned in this way may be configured to independently adjust each of the input text information and the new speaker information.
• the processor may input target text information and new speaker information to the learned artificial neural network model 900, and a synthesized voice corresponding to the target text in which the speaker characteristics of the new speaker are reflected may be output.
  • voice data may be configured such that mutual information between the target text information and the new speaker information is minimized.
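The patent does not spell out this loss in the text available here. A hedged sketch of one common way to penalize dependence between a text embedding and a speaker embedding is shown below: a small fully-connected predictor tries to recover the speaker embedding from the text embedding, and the synthesis model is penalized when it succeeds. The network sizes and the adversarial training split are assumptions, not the patent's definitive formulation.

```python
import torch
import torch.nn as nn

class SpeakerFromTextPredictor(nn.Module):
    """Auxiliary head that tries to recover the speaker embedding from the text embedding."""
    def __init__(self, text_dim=256, speaker_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, speaker_dim)
        )

    def forward(self, text_emb):
        return self.net(text_emb)

def predictor_loss(predictor, text_emb, speaker_emb):
    """Trains the auxiliary predictor: lower means it recovers the speaker better."""
    return nn.functional.mse_loss(predictor(text_emb.detach()), speaker_emb.detach())

def synthesis_penalty(predictor, text_emb, speaker_emb):
    """Added to the synthesis loss: the synthesis model is rewarded when the predictor
    fails, discouraging the text embedding from carrying speaker information."""
    return -nn.functional.mse_loss(predictor(text_emb), speaker_emb)
```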
• the learning process of the artificial neural network model 900 may also be applied, using the training data of each model, to the aforementioned artificial neural network speaker characteristic change generation model, the aforementioned artificial neural network vocalization feature prediction model, and/or the aforementioned artificial neural network speaker feature extraction model.
  • the artificial neural network models trained in this way may generate an inference value as output data by using data corresponding to the learning input data as input.
  • the above-described method may be provided as a computer program stored in a computer-readable recording medium for execution by a computer.
  • the medium may continuously store a computer executable program, or may be a temporary storage for execution or download.
• the medium may be various recording means or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to any computer system, and may exist distributed on a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape, optical recording media such as CD-ROM and DVD, magneto-optical media such as floptical disks, and media configured to store program instructions, including ROM, RAM, flash memory, and the like.
  • examples of other media may include recording media or storage media managed by an app store that distributes applications, sites that supply or distribute various other software, or servers.
• the processing units used to perform the techniques may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this disclosure, a computer, or a combination thereof.
  • a general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, eg, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other configuration.
• the techniques may be implemented as instructions stored on computer-readable media such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, compact disc (CD), magnetic or optical data storage devices, and the like. The instructions may be executable by one or more processors, and may cause the processor(s) to perform certain aspects of the functionality described in this disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present disclosure relates to a method, performed by at least one processor, for generating a synthesized voice of a new speaker. The method may include the steps of: receiving a target text; acquiring speaker characteristics of a reference speaker; acquiring speech characteristic change information; determining speaker characteristics of a new speaker using the acquired speaker characteristics of the reference speaker and the acquired speech characteristic change information; and generating an output voice for the target text by inputting the target text and the determined speaker characteristics of the new speaker into an artificial neural network text-to-speech synthesis model, the output voice reflecting the determined speaker characteristics of the new speaker. The artificial neural network text-to-speech synthesis model may be trained, based on a plurality of training text items and speaker characteristics of a plurality of training speakers, to output voices for the plurality of training text items that reflect the speaker characteristics of the plurality of training speakers.
PCT/KR2022/001414 2021-01-26 2022-01-26 Procédé et système permettant de générer une parole synthétisée d'un nouveau locuteur WO2022164207A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0011093 2021-01-26
KR20210011093 2021-01-26
KR10-2022-0011853 2022-01-26
KR1020220011853A KR102604932B1 (ko) 2021-01-26 2022-01-26 새로운 화자의 합성 음성을 생성하는 방법 및 시스템

Publications (1)

Publication Number Publication Date
WO2022164207A1 true WO2022164207A1 (fr) 2022-08-04

Family

ID=82653616

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/001414 WO2022164207A1 (fr) 2021-01-26 2022-01-26 Procédé et système permettant de générer une parole synthétisée d'un nouveau locuteur

Country Status (1)

Country Link
WO (1) WO2022164207A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190085882A (ko) * 2018-01-11 2019-07-19 네오사피엔스 주식회사 기계학습을 이용한 텍스트-음성 합성 방법, 장치 및 컴퓨터 판독가능한 저장매체
KR20190096877A (ko) * 2019-07-31 2019-08-20 엘지전자 주식회사 이종 레이블 간 발화 스타일 부여를 위한 인공지능 기반의 음성 샘플링 장치 및 방법
US20200005763A1 (en) * 2019-07-25 2020-01-02 Lg Electronics Inc. Artificial intelligence (ai)-based voice sampling apparatus and method for providing speech style
KR20200056342A (ko) * 2018-11-14 2020-05-22 네오사피엔스 주식회사 대상 화자 음성과 동일한 음성을 가진 컨텐츠를 검색하는 방법 및 이를 실행하기 위한 장치
KR20200088263A (ko) * 2018-05-29 2020-07-22 한국과학기술원 텍스트- 다중 음성 변환 방법 및 시스템

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190085882A (ko) * 2018-01-11 2019-07-19 네오사피엔스 주식회사 기계학습을 이용한 텍스트-음성 합성 방법, 장치 및 컴퓨터 판독가능한 저장매체
KR20200088263A (ko) * 2018-05-29 2020-07-22 한국과학기술원 텍스트- 다중 음성 변환 방법 및 시스템
KR20200056342A (ko) * 2018-11-14 2020-05-22 네오사피엔스 주식회사 대상 화자 음성과 동일한 음성을 가진 컨텐츠를 검색하는 방법 및 이를 실행하기 위한 장치
US20200005763A1 (en) * 2019-07-25 2020-01-02 Lg Electronics Inc. Artificial intelligence (ai)-based voice sampling apparatus and method for providing speech style
KR20190096877A (ko) * 2019-07-31 2019-08-20 엘지전자 주식회사 이종 레이블 간 발화 스타일 부여를 위한 인공지능 기반의 음성 샘플링 장치 및 방법

Similar Documents

Publication Publication Date Title
WO2020246702A1 (fr) Dispositif électronique et procédé de commande de dispositif électronique associé
WO2019117466A1 (fr) Dispositif électronique pour analyser la signification de la parole, et son procédé de fonctionnement
WO2020190054A1 (fr) Appareil de synthèse de la parole et procédé associé
WO2020190050A1 (fr) Appareil de synthèse vocale et procédé associé
WO2020027619A1 (fr) Procédé, dispositif et support d'informations lisible par ordinateur pour la synthèse vocale à l'aide d'un apprentissage automatique sur la base d'une caractéristique de prosodie séquentielle
WO2019139430A1 (fr) Procédé et appareil de synthèse texte-parole utilisant un apprentissage machine, et support de stockage lisible par ordinateur
WO2019139431A1 (fr) Procédé et système de traduction de parole à l'aide d'un modèle de synthèse texte-parole multilingue
WO2020145439A1 (fr) Procédé et dispositif de synthèse vocale basée sur des informations d'émotion
WO2020213842A1 (fr) Structures multi-modèles pour la classification et la détermination d'intention
WO2020105856A1 (fr) Appareil électronique pour traitement d'énoncé utilisateur et son procédé de commande
WO2015005679A1 (fr) Procédé, appareil et système de reconnaissance vocale
WO2022045651A1 (fr) Procédé et système pour appliquer une parole synthétique à une image de haut-parleur
WO2020116930A1 (fr) Dispositif électronique permettant de délivrer en sortie un son et procédé de fonctionnement associé
WO2022260432A1 (fr) Procédé et système pour générer une parole composite en utilisant une étiquette de style exprimée en langage naturel
WO2020209647A1 (fr) Procédé et système pour générer une synthèse texte-parole par l'intermédiaire d'une interface utilisateur
WO2022265273A1 (fr) Procédé et système pour fournir un service de conversation avec une personne virtuelle simulant une personne décédée
WO2021085661A1 (fr) Procédé et appareil de reconnaissance vocale intelligent
WO2021029642A1 (fr) Système et procédé pour reconnaître la voix d'un utilisateur
WO2020060311A1 (fr) Procédé de fourniture ou d'obtention de données pour l'apprentissage et dispositif électronique associé
WO2022164207A1 (fr) Procédé et système permettant de générer une parole synthétisée d'un nouveau locuteur
WO2021040490A1 (fr) Procédé et appareil de synthèse de la parole
WO2020171545A1 (fr) Dispositif électronique et système de traitement de saisie d'utilisateur et procédé associé
WO2022108040A1 (fr) Procédé de conversion d'une caractéristique vocale de la voix
WO2020180000A1 (fr) Procédé d'expansion de langues utilisées dans un modèle de reconnaissance vocale et dispositif électronique comprenant un modèle de reconnaissance vocale
WO2022102987A1 (fr) Dispositif électronique et procédé de commande associé

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22746233

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.11.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 22746233

Country of ref document: EP

Kind code of ref document: A1