WO2023085635A1 - Method and system for providing a voice synthesis service - Google Patents
Method and system for providing a voice synthesis service
- Publication number
- WO2023085635A1 (PCT/KR2022/015990)
- Authority
- WIPO (PCT)
- Prior art keywords
- voice, speaker, text, speech synthesis, model
Classifications
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
Definitions
- the present disclosure relates to a method and system for providing a voice synthesis service based on tone or timbre conversion.
- Voice recognition technology, which started in smartphones, has a structure that selects the optimal answer to a user's question by utilizing an enormous database.
- Speech synthesis technology is a technology that automatically converts input text into a voice waveform containing the corresponding phonological information, and is widely used in various voice application fields such as automated response systems (ARS) and computer games.
- Representative voice synthesis technologies include a voice synthesis technology based on a corpus-based unit concatenation (audio link) method and a voice synthesis technology based on a hidden Markov model (HMM) parameter method.
- An object of the present disclosure is to provide a method and system for providing a user's own voice synthesis service based on voice conversion.
- According to one aspect of the present disclosure, a method for providing a speech synthesis service includes: receiving, through a speech synthesis service platform providing a development toolkit, sound source data in which a speaker's speech is recorded for a plurality of predefined first texts; performing tone conversion learning on the speaker's sound source data using a pre-generated tone conversion base model; generating a voice synthesis model for the speaker through the tone conversion learning; receiving a second text; and generating a synthesized voice through speech synthesis inference based on the voice synthesis model for the speaker and the second text.
- An artificial intelligence-based speech synthesis service system according to the present disclosure includes: an artificial intelligence device; and a computing device that exchanges data with the artificial intelligence device. The computing device receives, through a voice synthesis service platform that provides a development toolkit, sound source data in which a speaker's voice is recorded for a plurality of predefined first texts, performs tone conversion learning on the speaker's sound source data using a pre-generated tone conversion base model to generate a voice synthesis model for the speaker, and, when a second text is input, generates a synthesized voice through speech synthesis inference based on the voice synthesis model for the speaker and the second text.
- According to the present disclosure, a personalized voice synthesizer can be used even in a virtual space such as a metaverse or for a virtual character such as a digital human.
- FIG. 1 is a diagram for explaining a voice system according to an embodiment of the present invention.
- FIG. 2 is a block diagram for explaining the configuration of an artificial intelligence device according to an embodiment of the present disclosure.
- FIG. 3 is a block diagram for explaining the configuration of a voice service server according to an embodiment of the present invention.
- FIG. 4 is a diagram illustrating an example of converting a voice signal into a power spectrum according to an embodiment of the present invention.
- FIG. 5 is a block diagram illustrating the configuration of a processor for voice recognition and synthesis of an artificial intelligence device according to an embodiment of the present invention.
- FIG. 6 is a configuration block diagram of a voice service system for voice synthesis according to an embodiment of the present disclosure.
- FIG. 7 is a configuration block diagram of an artificial intelligence device according to another embodiment of the present disclosure.
- FIG. 8 is a diagram for explaining a method of registering a user-defined remote trigger word in a voice service system according to an embodiment of the present disclosure.
- FIG. 9 is a flowchart illustrating a voice synthesis service process according to an embodiment of the present disclosure.
- FIGS. 10A to 15D are diagrams illustrating a process of using a voice synthesis service on a service platform using a development toolkit according to an embodiment of the present disclosure.
- The 'artificial intelligence device' described in this specification may include a mobile phone, a smartphone, a laptop computer, an artificial intelligence device for digital broadcasting, a personal digital assistant (PDA), a portable multimedia player (PMP), a navigation device, a slate PC, a tablet PC, an ultrabook, and a wearable device (for example, a watch-type artificial intelligence device (smartwatch), a glass-type artificial intelligence device (smart glasses), or a head mounted display (HMD)).
- However, the present disclosure is not limited to the above examples.
- AI devices may also include stationary AI devices such as smart TVs, desktop computers, digital signage, refrigerators, washing machines, air conditioners, and dishwashers.
- the artificial intelligence device 10 can be applied to a fixed or movable robot.
- the artificial intelligence device 10 may perform or support a function of a voice agent.
- the voice agent may be a program that recognizes a user's voice and outputs, as voice, a response suitable for the recognized voice, or a program that outputs a synthesized voice for text through voice synthesis based on the recognized user's voice.
- FIG. 1 is a diagram for explaining a voice service system according to an embodiment of the present invention.
- the voice service may include at least one of voice recognition and voice synthesis services.
- the voice recognition and synthesis process may include converting the speaker's (or user's) voice data into text data, analyzing the speaker's intent based on the converted text data, converting text data corresponding to the analyzed intent into synthesized voice data, and outputting the converted synthesized voice data.
- For the voice recognition and synthesis process, a voice service system as shown in FIG. 1 can be used.
- the voice service system may include an artificial intelligence device 10, a speech-to-text (STT) server 20, a natural language processing (NLP) server 30, and a speech synthesis server 40.
- the plurality of AI agent servers 50-1 to 50-3 communicate with the NLP server 30 and may be included in the voice service system.
- the STT server 20, the NLP server 30, and the voice synthesis server 40 may exist as separate servers, respectively, as shown, or may exist included in one server.
- the plurality of AI agent servers 50-1 to 50-3 may also exist as separate servers or may be included in the NLP server 30.
- the artificial intelligence device 10 may transmit a voice signal corresponding to the speaker's voice received through the microphone 122 to the STT server 20.
- the STT server 20 may convert voice data received from the artificial intelligence device 10 into text data.
- the STT server 20 may increase the accuracy of voice-to-text conversion by using a language model.
- the language model may refer to a model capable of calculating a probability of a sentence or a probability of a next word given previous words.
- the language model may include probabilistic language models such as a Unigram model, a Bigram model, and an N-gram model.
- the unigram model is a model that assumes that all words are used completely independently of each other, and calculates the probability of a word sequence as the product of the probabilities of the individual words.
- the bigram model is a model that assumes that the use of a word depends only on the immediately preceding word.
- the N-gram model is a model that assumes that the use of a word depends on the preceding (n-1) words.
- the STT server 20 may determine whether text data converted from voice data is appropriately converted using a language model, and through this, accuracy of conversion into text data may be increased.
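- As an illustration only (not part of the disclosed embodiments), a minimal bigram model of the kind described above can be sketched in Python as follows; the class and method names are assumptions for the example:

```python
from collections import defaultdict

class BigramModel:
    """Minimal bigram model: P(w_i | w_(i-1)) estimated from counts."""

    def __init__(self):
        self.bigram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)

    def train(self, sentences):
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            for prev, curr in zip(tokens, tokens[1:]):
                self.bigram_counts[(prev, curr)] += 1
                self.context_counts[prev] += 1

    def sentence_probability(self, sentence):
        # Probability of a sentence as the product of bigram probabilities,
        # assuming each word depends only on the immediately preceding word.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        prob = 1.0
        for prev, curr in zip(tokens, tokens[1:]):
            if self.context_counts[prev] == 0:
                return 0.0
            prob *= self.bigram_counts[(prev, curr)] / self.context_counts[prev]
        return prob

model = BigramModel()
model.train(["turn on the light", "turn off the light"])
# An STT server could prefer the candidate transcription with the higher score.
print(model.sentence_probability("turn on the light"))   # 0.5
print(model.sentence_probability("turn the on light"))   # 0.0
```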
- the NLP server 30 may receive text data from the STT server 20.
- the STT server 20 may be included in the NLP server 30.
- the NLP server 30 may perform intention analysis on the text data based on the received text data.
- the NLP server 30 may transmit intention analysis information indicating a result of performing the intention analysis to the artificial intelligence device 10.
- the NLP server 30 may transmit intent analysis information to the voice synthesis server 40.
- the voice synthesis server 40 may generate a synthesized voice based on the intention analysis information and transmit the generated synthesized voice to the artificial intelligence device 10.
- the NLP server 30 may generate intention analysis information by sequentially performing a morpheme analysis step, a syntax analysis step, a dialogue act analysis step, and a dialog processing step on text data.
- the morpheme analysis step is a step of classifying text data corresponding to a voice uttered by a user into morpheme units, which are the smallest units having meaning, and determining what parts of speech each classified morpheme has.
- the syntactic analysis step is a step of classifying the text data into noun phrases, verb phrases, adjective phrases, etc. using the result of the morpheme analysis step, and determining what kind of relationship exists between the classified phrases.
- Through the syntactic analysis step, the subject, object, and modifiers of the voice uttered by the user may be determined.
- the dialogue act analysis step is a step of analyzing the intention of the voice uttered by the user by using the result of the syntax analysis step. Specifically, the dialogue act analysis step is a step of determining the intent of the sentence, such as whether the user asks a question, makes a request, or simply expresses emotion.
- the dialog processing step is a step of determining, by using the result of the dialogue act analysis step, whether to answer the user's utterance, respond to it, or ask a question requesting additional information.
- the NLP server 30 may generate intention analysis information including one or more of an answer, a response, and an inquiry for additional information about the intention uttered by the user.
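- To make the flow of the last two steps concrete, the following is a purely illustrative Python sketch; the rules and labels are assumptions of the example, not the disclosed analysis method:

```python
from dataclasses import dataclass

@dataclass
class IntentionAnalysis:
    dialogue_act: str  # e.g. "question", "request", "statement"
    action: str        # e.g. "answer", "respond", "ask_for_more_info"

def dialogue_act_analysis(text: str) -> str:
    # Crude placeholder rules standing in for a real dialogue-act classifier.
    if text.rstrip().endswith("?"):
        return "question"
    first = text.split()[0].lower() if text.split() else ""
    if first in {"please", "turn", "play", "set"}:
        return "request"
    return "statement"

def dialog_processing(act: str) -> str:
    # Decide whether to answer, respond, or ask for additional information.
    return {"question": "answer", "request": "respond"}.get(act, "ask_for_more_info")

act = dialogue_act_analysis("What is the weather tomorrow?")
print(IntentionAnalysis(dialogue_act=act, action=dialog_processing(act)))
```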
- the NLP server 30 may transmit a search request to a search server (not shown) and receive search information corresponding to the search request in order to search for information suitable for the user's speech intention.
- the search information may include information about the searched content.
- the NLP server 30 transmits search information to the artificial intelligence device 10, and the artificial intelligence device 10 may output the search information.
- In another example, the NLP server 30 may receive text data from the artificial intelligence device 10. In this case, the artificial intelligence device 10 may itself convert the voice data into text data and transmit the converted text data to the NLP server 30.
- the voice synthesis server 40 may generate a synthesized voice by combining pre-stored voice data.
- the voice synthesis server 40 may record the voice of one person selected as a model and divide the recorded voice into syllables or words.
- the speech synthesis server 40 may store the divided speech in units of syllables or words in an internal or external database.
- the voice synthesizing server 40 may search a database for syllables or words corresponding to given text data, synthesize a combination of the searched syllables or words, and generate a synthesized voice.
- the speech synthesis server 40 may store a plurality of speech language groups corresponding to each of a plurality of languages.
- the voice synthesis server 40 may include a first voice language group recorded in Korean and a second voice language group recorded in English.
- the speech synthesis server 40 may translate text data of the first language into text of the second language and generate synthesized speech corresponding to the translated text of the second language by using the second speech language group.
- the voice synthesis server 40 may transmit the generated synthesized voice to the artificial intelligence device 10.
- the speech synthesis server 40 may receive analysis information from the NLP server 30.
- the analysis information may include information obtained by analyzing the intention of the voice spoken by the user.
- the voice synthesis server 40 may generate a synthesized voice reflecting the user's intention based on the analysis information.
- the artificial intelligence device 10 may include one or more processors.
- Each of the plurality of AI agent servers 50-1 to 50-3 may transmit search information to the NLP server 30 or the artificial intelligence device 10 at the request of the NLP server 30.
- the NLP server 30 may transmit a content search request to one or more of the plurality of AI agent servers 50-1 to 50-3 and receive content search results from the corresponding server.
- the NLP server 30 may transmit the received search results to the artificial intelligence device 10.
- FIG. 2 is a block diagram for explaining the configuration of an artificial intelligence device 10 according to an embodiment of the present disclosure.
- the artificial intelligence device 10 may include a communication unit 110, an input unit 120, a learning processor 130, a sensing unit 140, an output unit 150, a memory 170, and a processor 180.
- the communication unit 110 may transmit/receive data with external devices using wired/wireless communication technology.
- the communication unit 110 may transmit/receive sensor information, a user input, a learning model, a control signal, and the like with external devices.
- Communication technologies used by the communication unit 110 include Global System for Mobile communication (GSM), Code Division Multiple Access (CDMA), Long Term Evolution (LTE), LTE-Advanced (LTE-A), 5G, wireless LAN (WLAN), Wireless-Fidelity (Wi-Fi), Bluetooth™, Radio Frequency Identification (RFID), Infrared Data Association (IrDA), ZigBee, Near Field Communication (NFC), and the like.
- the input unit 120 may acquire various types of data.
- the input unit 120 may include a camera for inputting a video signal, a microphone for receiving an audio signal, and a user input unit for receiving information from a user.
- a camera or microphone may be treated as a sensor, and signals obtained from the camera or microphone may be referred to as sensing data or sensor information.
- the input unit 120 may obtain learning data for model learning and input data to be used when obtaining an output using the learning model.
- the input unit 120 may obtain raw input data, and in this case, the processor 180 or the learning processor 130 may extract input features as preprocessing of the input data.
- the input unit 120 may include a camera 121 for inputting a video signal, a microphone 122 for receiving an audio signal, and a user input unit 123 for receiving information from a user.
- Voice data or image data collected by the input unit 120 may be analyzed and processed as a user's control command.
- the input unit 120 is for inputting video information (or signals), audio information (or signals), data, or information input from a user.
- One or more cameras 121 may be provided in the artificial intelligence device 10.
- the camera 121 processes an image frame such as a still image or a moving image obtained by an image sensor in a video call mode or a photographing mode.
- the processed image frame may be displayed on the display unit 151 or stored in the memory 170.
- the microphone 122 processes external sound signals into electrical voice data.
- the processed voice data may be utilized in various ways according to the function (or application program) being executed by the artificial intelligence device 10. Meanwhile, various noise cancellation algorithms may be applied to the microphone 122 to remove noise generated in the process of receiving an external sound signal.
- the user input unit 123 is for receiving information from a user, and when information is input through the user input unit 123, the processor 180 can control the operation of the artificial intelligence device 10 to correspond to the input information.
- the user input unit 123 may include a mechanical input means (or a mechanical key, for example, a button located on the front, rear, or side of the terminal 100, a dome switch, a jog wheel, or a jog switch) and a touch input means.
- the touch input means may consist of a virtual key, soft key, or visual key displayed on a touch screen through software processing, or a touch key disposed on a part other than the touch screen.
- the learning processor 130 may learn a model composed of an artificial neural network using training data.
- the learned artificial neural network may be referred to as a learning model.
- the learning model may be used to infer a result value for new input data other than learning data, and the inferred value may be used as a basis for a decision to perform a certain operation.
- the learning processor 130 may include a memory integrated or implemented in the artificial intelligence device 10.
- the learning processor 130 may be implemented using the memory 170, an external memory directly coupled to the artificial intelligence device 10, or a memory maintained in an external device.
- the sensing unit 140 may obtain at least one of internal information of the artificial intelligence device 10, surrounding environment information of the artificial intelligence device 10, and user information by using various sensors.
- the sensors included in the sensing unit 140 include a proximity sensor, an illuminance sensor, an acceleration sensor, a magnetic sensor, a gyro sensor, an inertial sensor, an RGB sensor, an IR sensor, a fingerprint recognition sensor, an ultrasonic sensor, an optical sensor, a microphone, a LiDAR sensor, a radar, and the like.
- the output unit 150 may generate an output related to sight, hearing, or touch.
- the output unit 150 may include at least one of a display unit 151, a sound output unit 152, a haptic module 153, and a light output unit 154.
- the display unit 151 displays (outputs) information processed by the artificial intelligence device 10.
- the display unit 151 may display execution screen information of an application program driven by the artificial intelligence device 10 or UI (User Interface) and GUI (Graphic User Interface) information according to such execution screen information.
- the display unit 151 may implement a touch screen by forming a mutual layer structure with a touch sensor or being formed integrally with it.
- a touch screen may function as a user input unit 123 providing an input interface between the artificial intelligence device 10 and the user, and may provide an output interface between the terminal 100 and the user.
- the audio output unit 152 may output audio data received from the communication unit 110 or stored in the memory 170 in a call signal reception mode, a call or recording mode, a voice recognition mode, or a broadcast reception mode.
- the sound output unit 152 may include at least one of a receiver, a speaker, and a buzzer.
- the haptic module 153 generates various tactile effects that a user can feel.
- a representative example of the tactile effect generated by the haptic module 153 may be vibration.
- the light output unit 154 outputs a signal for notifying the occurrence of an event using light from a light source of the artificial intelligence device 10.
- Examples of events generated by the artificial intelligence device 10 may include message reception, call signal reception, missed calls, alarms, schedule notifications, e-mail reception, and information reception through applications.
- the memory 170 may store data supporting various functions of the artificial intelligence device 10 .
- the memory 170 may store input data obtained from the input unit 120, learning data, a learning model, a learning history, and the like.
- the processor 180 may determine at least one executable operation of the artificial intelligence device 10 based on information determined or generated using a data analysis algorithm or a machine learning algorithm. Also, the processor 180 may perform the determined operation by controlling components of the artificial intelligence device 10 .
- the processor 180 may request, retrieve, receive, or utilize data from the learning processor 130 or the memory 170, and may control the components of the artificial intelligence device 10 to execute a predicted operation or an operation determined to be desirable among the at least one executable operation.
- the processor 180 may generate a control signal for controlling the external device and transmit the generated control signal to the external device when the connection of the external device is required to perform the determined operation.
- the processor 180 may obtain intention information for a user input and determine a user's requirement based on the obtained intention information.
- the processor 180 may obtain intent information corresponding to the user input by using at least one of an STT engine for converting a voice input into a character string and an NLP engine for obtaining intent information of a natural language.
- At least one of the STT engine and the NLP engine may include at least a part of an artificial neural network trained according to a machine learning algorithm. At least one of the STT engine and the NLP engine may be trained by the learning processor 130, trained by the learning processor 240 of the AI server 200, or trained through distributed processing thereof.
- the processor 180 may collect history information including user feedback on the operation contents or operation of the artificial intelligence device 10 and store it in the memory 170 or the learning processor 130, or transmit it to an external device such as the AI server 200.
- the collected history information can be used to update the learning model.
- the processor 180 may control at least some of the components of the artificial intelligence device 10 in order to drive an application program stored in the memory 170. Furthermore, the processor 180 may combine and operate two or more of the components included in the artificial intelligence device 10 to drive the application program.
- FIG. 3 is a block diagram for explaining the configuration of a voice service server 200 according to an embodiment of the present invention.
- the voice service server 200 may include one or more of the STT server 20, the NLP server 30, and the voice synthesis server 40 shown in FIG. 1.
- the voice service server 200 may be referred to as a server system.
- the voice service server 200 may include a pre-processing unit 220, a controller 230, a communication unit 270, and a database 290.
- the pre-processing unit 220 may pre-process the voice received through the communication unit 270 or the voice stored in the database 290.
- the pre-processing unit 220 may be implemented as a chip separate from the controller 230 or as a chip included in the controller 230 .
- the pre-processing unit 220 may receive a voice signal (uttered by a user) and filter a noise signal from the voice signal before converting the received voice signal into text data.
- When the pre-processing unit 220 is provided in the artificial intelligence device 10, it may recognize an activation word for activating the voice recognition of the artificial intelligence device 10.
- the pre-processor 220 converts the activation word received through the microphone 122 into text data, and when the converted text data corresponds to a pre-stored activation word, it may determine that the activation word has been recognized.
- the pre-processor 220 may convert the noise-removed voice signal into a power spectrum.
- the power spectrum may be a parameter indicating which frequency components are included in the waveform of the voice signal that fluctuates over time and in what magnitude.
- the power spectrum shows the distribution of amplitude squared values according to the frequency of the waveform of the speech signal.
- FIG. 4 is a diagram illustrating an example of converting a voice signal into a power spectrum according to an embodiment of the present invention.
- the voice signal 410 may be received from an external device or may be a signal previously stored in the memory 170.
- the x-axis of the audio signal 410 may represent time, and the y-axis may represent amplitude.
- the power spectrum processor 225 may convert the audio signal 410, of which the x-axis is the time axis, into the power spectrum 430, the x-axis of which is the frequency axis.
- the power spectrum processing unit 225 may transform the voice signal 410 into a power spectrum 430 using Fast Fourier Transform (FFT).
- the x-axis of the power spectrum 430 represents the frequency, and the y-axis represents the square of the amplitude.
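- For example (an illustrative sketch, not the disclosed implementation), the conversion described above can be reproduced with an FFT in Python as follows; function and variable names are assumptions:

```python
import numpy as np

def power_spectrum(signal: np.ndarray, sample_rate: int):
    """Convert a time-domain voice signal into a power spectrum via FFT."""
    spectrum = np.fft.rfft(signal)                 # frequency-domain coefficients
    power = np.abs(spectrum) ** 2                  # y-axis: square of the amplitude
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)  # x-axis: frequency (Hz)
    return freqs, power

# Example: a 150 Hz tone sampled at 16 kHz for one second.
sr = 16000
t = np.arange(sr) / sr
freqs, power = power_spectrum(np.sin(2 * np.pi * 150 * t), sr)
print(freqs[np.argmax(power)])  # 150.0
```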
- the functions of the pre-processing unit 220 and the controller 230 described in FIG. 3 may be performed in the NLP server 30 as well.
- the pre-processing unit 220 may include a wave processing unit 221, a frequency processing unit 223, a power spectrum processing unit 225, and an STT conversion unit 227.
- the wave processing unit 221 may extract a voice waveform.
- the frequency processing unit 223 may extract a frequency band of voice.
- the power spectrum processing unit 225 may extract a power spectrum of voice.
- the power spectrum may be a parameter indicating which frequency components are included in the waveform and at what magnitude when a temporally varying waveform is given.
- the STT conversion unit 227 may convert voice into text.
- the STT conversion unit 227 may convert voice of a specific language into text of the corresponding language.
- the controller 230 may control overall operations of the voice service server 200 .
- the controller 230 may include a voice analysis unit 231, a text analysis unit 232, a feature clustering unit 233, a text mapping unit 234, and a voice synthesis unit 235.
- the voice analysis unit 231 may extract characteristic information of the voice by using at least one of a voice waveform, a frequency band of the voice, and a power spectrum of the voice preprocessed by the preprocessor 220.
- Voice characteristic information may include one or more of the speaker's gender, the speaker's voice (or timbre), pitch, tone of voice, speech speed, and emotion.
- the voice characteristic information may further include a speaker's timbre.
- the text analyzer 232 may extract main expression phrases from the text converted by the voice-to-text converter 227 .
- When the text analyzer 232 detects a change in tone between phrases in the converted text, it may extract the phrase whose tone changes as a main expression phrase.
- the text analyzer 232 may determine that the tone has changed when a frequency band between phrases is changed by more than a preset band.
- the text analyzer 232 may extract main words from phrases of the converted text.
- the main word may be a noun present in a phrase, but this is only an example.
- the feature clustering unit 233 may classify the speech type of the speaker using the voice characteristic information extracted by the voice analysis unit 231.
- the feature clustering unit 233 may classify the speaker's speech type by assigning a weight to each of the type items constituting the voice feature information.
- the feature clustering unit 233 may classify a speaker's speech type using an attention technique of a deep learning model.
- the text mapping unit 234 may translate the text converted in the first language into text of the second language.
- the text mapping unit 234 may map the text translated into the second language to the text of the first language.
- the text mapping unit 234 may map the main expression phrase constituting the text of the first language to the corresponding phrase of the second language.
- the text mapping unit 234 may map the utterance type corresponding to the main expression phrase constituting the text of the first language to the phrase of the second language. This is to apply the classified utterance type to the phrase of the second language.
- the voice synthesis unit 235 may generate a synthesized voice by applying the utterance type classified by the feature clustering unit 233 and the speaker's timbre to the main expression phrases of the text translated into the second language by the text mapping unit 234.
- the controller 230 may determine the speech characteristics of the user by using one or more of the transmitted text data or the power spectrum 430 .
- the user's speech characteristics may include the user's gender, the user's voice pitch, the user's tone, the user's speech subject, the user's speech speed, and the user's voice volume.
- the controller 230 may obtain a frequency of the voice signal 410 and an amplitude corresponding to the frequency using the power spectrum 430 .
- the controller 230 may use the frequency band of the power spectrum 430 to determine the gender of the user who uttered the voice.
- When the frequency band of the power spectrum 430 is within a preset first frequency band range, the controller 230 may determine the gender of the user as male.
- the controller 230 may determine the gender of the user as female when the frequency band of the power spectrum 430 is within a preset second frequency band range.
- the second frequency band range may be higher than the first frequency band range.
- the controller 230 may determine the pitch of the voice using a frequency band of the power spectrum 430.
- the controller 230 may determine the level of pitch of a sound within a specific frequency band range according to the magnitude of the amplitude.
- the controller 230 may determine a user's tone using a frequency band of the power spectrum 430.
- the controller 230 may determine a frequency band having an amplitude greater than or equal to a predetermined level among frequency bands of the power spectrum 430 as the user's main sound range, and determine the determined main sound range as the user's tone color.
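- The band-based decisions described above might be sketched as follows; the preset frequency band ranges and the threshold ratio are assumptions for illustration, since their values are not specified in the text:

```python
import numpy as np

def main_sound_range(freqs: np.ndarray, power: np.ndarray, threshold_ratio: float = 0.5):
    """Frequency range whose power stays above a fraction of the peak; a rough
    stand-in for the 'main sound range' used to determine the user's tone color."""
    mask = power >= threshold_ratio * power.max()
    return float(freqs[mask].min()), float(freqs[mask].max())

def estimate_gender(freqs: np.ndarray, power: np.ndarray,
                    male_band=(85.0, 180.0), female_band=(165.0, 255.0)) -> str:
    # Hypothetical preset first/second frequency band ranges (not disclosed).
    lo, hi = main_sound_range(freqs, power)
    if male_band[0] <= lo and hi <= male_band[1]:
        return "male"
    if female_band[0] <= lo and hi <= female_band[1]:
        return "female"
    return "unknown"

freqs = np.array([100.0, 120.0, 140.0, 300.0])
power = np.array([1.0, 5.0, 2.0, 0.1])
print(estimate_gender(freqs, power))  # "male": dominant band around 120 Hz
```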
- the controller 230 may determine the user's speech speed from the converted text data through the number of syllables spoken per unit time.
- the controller 230 may determine the subject of the user's speech using the Bag-Of-Word Model technique with respect to the converted text data.
- the Bag-Of-Word Model technique is a technique for extracting frequently used words based on the number of word frequencies in a sentence.
- the Bag-Of-Word Model technique is a technique for determining the characteristics of an utterance subject by extracting a unique word from a sentence and expressing the frequency count of each extracted word as a vector.
- For example, when words related to exercise appear frequently in the text data, the subject of the user's speech may be classified as exercise.
- the controller 230 may determine a user's utterance subject from text data using a known text categorization technique.
- the controller 230 may extract keywords from text data to determine the subject of the user's speech.
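- A minimal illustration of the Bag-of-Words frequency counting described above (the topic keyword lists are invented for the example; a real system would learn them):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Frequency count of each word in the utterance."""
    return Counter(text.lower().split())

# Hypothetical keyword lists per topic.
TOPIC_KEYWORDS = {"exercise": {"run", "running", "gym", "stamina"},
                  "weather": {"rain", "sunny", "temperature"}}

def classify_topic(text: str) -> str:
    counts = bag_of_words(text)
    scores = {topic: sum(counts[w] for w in words)
              for topic, words in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify_topic("I went running at the gym to build stamina"))  # exercise
```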
- the controller 230 may determine the user's voice volume in consideration of amplitude information over the entire frequency band.
- For example, the controller 230 may determine the user's voice volume based on an average or a weighted average of the amplitudes in each frequency band of the power spectrum.
- the communication unit 270 may perform wired or wireless communication with an external server.
- the database 290 may store the voice of the first language included in the content.
- the database 290 may store synthesized voices obtained by converting voices of the first language into voices of the second language.
- the database 290 may store first text corresponding to speech in a first language and second text obtained by translating the first text into a second language.
- the database 290 may store various learning models required for speech recognition.
- the processor 180 of the artificial intelligence device 10 shown in FIG. 2 may include the pre-processing unit 220 and the controller 230 shown in FIG. 3.
- the processor 180 of the artificial intelligence device 10 may perform the function of the pre-processing unit 220 and the function of the controller 230.
- FIG. 5 is a block diagram illustrating the configuration of a processor for voice recognition and synthesis of the artificial intelligence device 10 according to an embodiment of the present invention.
- the voice recognition and synthesis process of FIG. 5 may be performed by the learning processor 130 or the processor 180 of the artificial intelligence device 10 without going through a server.
- the processor 180 of the artificial intelligence device 10 may include an STT engine 510, an NLP engine 530, and a speech synthesis engine 550.
- Each engine can be either hardware or software.
- the STT engine 510 may perform the functions of the STT server 20 of FIG. 1 . That is, the STT engine 510 may convert voice data into text data.
- the NLP engine 530 may perform the function of the NLP server 30 of FIG. 1. That is, the NLP engine 530 may obtain intention analysis information representing the speaker's intention from the converted text data.
- the voice synthesis engine 550 may perform the function of the voice synthesis server 40 of FIG. 1.
- the speech synthesis engine 550 may search a database for syllables or words corresponding to given text data, synthesize a combination of the searched syllables or words, and generate synthesized speech.
- the speech synthesis engine 550 may include a pre-processing engine 551 and a TTS engine 553.
- the pre-processing engine 551 may pre-process text data before generating synthesized speech.
- the pre-processing engine 551 performs tokenization by dividing text data into tokens, which are meaningful units.
- the preprocessing engine 551 may perform a cleansing operation to remove unnecessary characters and symbols to remove noise.
- the pre-processing engine 551 may generate the same word token by combining word tokens having different expression methods.
- the preprocessing engine 551 may remove meaningless word tokens (stopwords).
- the TTS engine 553 may synthesize a voice corresponding to the preprocessed text data and generate a synthesized voice.
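- A minimal sketch of these pre-processing steps in Python (the token rules, spelling map, and stopword list are illustrative assumptions):

```python
import re

STOPWORDS = {"a", "an", "the", "is", "to"}          # illustrative stopword list
NORMALIZE = {"colour": "color", "grey": "gray"}     # merge tokens with variant spellings

def preprocess(text: str) -> list[str]:
    text = re.sub(r"[^\w\s]", " ", text.lower())    # cleansing: strip symbols/noise
    tokens = text.split()                           # tokenization into word units
    tokens = [NORMALIZE.get(t, t) for t in tokens]  # unify differently-spelled word tokens
    return [t for t in tokens if t not in STOPWORDS]  # remove meaningless stopwords

print(preprocess("The colour, the grey!"))  # ['color', 'gray']
```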
- Hereinafter, a voice service system that provides a voice synthesis service based on tone conversion, and an operating method of the artificial intelligence device 10 therefor, will be described.
- the voice service system or artificial intelligence device 10 may generate and use a unique TTS model for voice synthesis service.
- a voice service system may provide a platform for a voice synthesis service.
- the voice synthesis service platform may provide a voice synthesis service development toolkit (Voice Agent Development Toolkit).
- the voice synthesis service development toolkit may refer to a development toolkit provided so that non-experts in voice synthesis technology can more easily develop a voice agent or use the voice synthesis service according to the present disclosure.
- the voice synthesis service development toolkit may be a web-based development tool for developing a voice agent.
- the development toolkit can be used by accessing a web service through the artificial intelligence device 10, and various user interface screens related to the development toolkit can be provided on the screen of the artificial intelligence device 10.
- the voice synthesis function may include an emotional voice synthesis and a tone conversion function.
- the voice conversion function may indicate a function in which a user of the development toolkit can directly register his or her own voice to generate a voice (synthetic voice) for an arbitrary text.
- A user can use the voice synthesis service by creating his or her own TTS model using the development toolkit, and thus convenience and satisfaction of use can be greatly improved.
- Voice synthesis based on tone conversion (voice change) makes it possible to express a speaker's tone and vocalization habits with a relatively small amount of learning data compared to the prior art.
- FIG. 6 is a configuration block diagram of a voice service system for voice synthesis according to another embodiment of the present disclosure.
- a voice service system for voice synthesis may include an artificial intelligence device 10 and a voice service server 200 .
- the artificial intelligence device 10 may include a communication unit (not shown), an output unit 150, and a processing unit 600 in order to process the voice synthesis service through the voice synthesis service platform provided by the voice service server 200, but is not necessarily limited thereto.
- the communication unit may support communication between the artificial intelligence device 10 and the voice service server 200 . Through this, the communication unit may exchange various data through the voice synthesis service platform provided by the voice service server 200 .
- the output unit 150 may provide various user interface screens related to or including the development toolkit provided by the speech synthesis service platform.
- the output unit 150 may provide an input interface for receiving target data for voice synthesis, that is, an arbitrary text input, and may output the synthesized voice data generated for the text received through the provided input interface.
- the processing unit 600 may include a memory 610 and a processor 620.
- the processing unit 600 may process various data of the user and the voice service server 200 on the voice synthesis service platform.
- the memory 610 may store various data received or processed by the artificial intelligence device 10 .
- the memory 610 may store various voice synthesis-related data that is processed by the processing unit 600 or exchanged through a voice synthesis service platform or received from the voice service server 200 .
- when speech synthesis data finally generated through the speech synthesis service platform (including data such as the input for speech synthesis) is received, the processor 620 controls it to be stored in the memory 610, generates and stores link information (or linkage information) between the stored voice synthesis data and the target user of that data, and may transmit the corresponding information to the voice service server 200.
- the processor 620 may control the output unit 150 to receive synthesized voice data for an arbitrary text based on the link information from the voice service server 200 and provide the synthesized voice data to the user.
- the processor 620 may provide not only the received synthesized voice data, but also information related to recommended information, a recommended function, or the like, or output a guide.
- the voice service server 200 may include the STT server 20, the NLP server 30, and the voice synthesis server 40 shown in FIG. 1.
- At least a part or function of the voice service server 200 shown in FIG. 1 may be replaced by an engine within the artificial intelligence device 10 as shown in FIG. 5 .
- the processor 620 may be the processor 180 of FIG. 2 or may be a separate component.
- the artificial intelligence device 10 may be replaced with or included in the voice service server 200 depending on the context.
- FIG. 7 is a schematic diagram illustrating a voice synthesis service based on tone conversion according to an embodiment of the present disclosure.
- Voice synthesis based on timbre conversion may largely include a learning process (or training process) and an inference process.
- the learning process may be performed as follows.
- the voice synthesis service platform may pre-generate and retain a voice conversion base model for providing a voice conversion function.
- When a user's sound source data is uploaded, the voice synthesis service platform can train on it, together with the tone conversion base model, in the tone conversion learning module.
- the amount of sound source data required for learning may be small, for example, about 3 to 7 minutes of audio, and learning may be completed in about 3 to 7 hours.
- an inference process may be performed as follows.
- the reasoning process shown in (b) of FIG. 7 may be performed after learning in the above-described tone conversion learning module, for example.
- the voice synthesis service platform may generate a user voice synthesis model for each user through, for example, the learning process of FIG. 7 (a).
- When text data is received, the speech synthesis service platform determines a target user for the text data, and performs an inference process in the speech synthesis inference module, based on the previously generated user voice synthesis model for the determined target user, to generate synthesized voice data in the target user's voice.
- the learning process of (a) of FIG. 7 and the reasoning process of (b) of FIG. 7 according to an embodiment of the present disclosure are not limited to the foregoing.
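- As a hedged sketch of the learning process in (a) of FIG. 7 (PyTorch is an assumption; the disclosure does not name a framework, loss function, or data format), speaker transfer learning on a pre-generated base model might look like the following:

```python
import torch

def speaker_transfer_learning(base_model: torch.nn.Module, speaker_batches, epochs: int = 10):
    """Fine-tune a pre-trained tone conversion base model on a small amount of
    one speaker's sound source data, yielding a per-speaker synthesis model."""
    model = base_model  # start from the common base model rather than from scratch
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.L1Loss()  # e.g. spectrogram reconstruction loss (assumed)
    model.train()
    for _ in range(epochs):
        for features, target_spectrogram in speaker_batches:
            optimizer.zero_grad()
            pred = model(features)
            loss = loss_fn(pred, target_spectrogram)
            loss.backward()
            optimizer.step()
    return model  # becomes the user voice synthesis model
```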
- FIG. 8 is a diagram for explaining the configuration of a voice synthesis service platform according to an embodiment of the present disclosure.
- the speech synthesis service platform may be formed as a hierarchical structure composed of a database layer, a storage layer, an engine layer, a framework layer, and a service layer. However, it is not limited thereto.
- At least one or more layers in the hierarchical structure shown in FIG. 8 constituting the voice synthesis service platform may be omitted or combined to be implemented as one layer.
- At least one or more layers not shown in FIG. 8 may be further included to form a voice synthesis service platform.
- each layer constituting the speech synthesis service platform will be described as follows.
- the database layer may hold (or include) a user voice data DB and a user model management DB in order to provide a voice synthesis service in the voice synthesis service platform.
- the user voice data DB is a space for storing user voices, and may individually store each user's sound source (that is, voice). Depending on the embodiment, a plurality of spaces in the user voice data DB may be allocated to one user, or vice versa. In the former case, the plurality of spaces may be allocated based on a plurality of voice synthesis models generated for one user or on the text data requested for voice synthesis.
- For example, when each user's sound source (voice) is registered through the development toolkit provided in the service layer, that is, when the user's sound source data is uploaded, it can be stored in the space for the corresponding user in the user voice data DB.
- the sound source data may be directly received and uploaded from the artificial intelligence device 10 or indirectly uploaded through the artificial intelligence device 10 through a remote control device (not shown).
- the remote control device may include a remote control, a mobile device such as a smartphone in which an application related to a voice synthesis service, an application programming interface (API), and a plug-in are installed. However, it is not limited thereto.
- the user model management DB can store information (target data, related operation control information, etc.) generated when a user voice model is created, trained, or deleted by a user through the development toolkit provided in the service layer.
- the user model management DB may store information about a sound source managed by a user, a model, a learning progress state, and the like.
- the user model management DB may store related information when a request for adding or deleting a speaker is made by a user through, for example, a development toolkit provided in a service layer. Accordingly, it is possible to manage the model of the corresponding user through the user model management DB.
- the storage layer may include a voice conversion base model and a user voice synthesis model.
- the tone conversion base model may indicate a basic model (common model) used for tone conversion.
- the user voice synthesis model may represent a voice synthesis model generated for the user through learning in the voice conversion learning module.
- the engine layer may represent an engine that includes a voice conversion learning module and a voice synthesis inference module, and performs the learning and inference process as shown in FIG. 7 described above.
- the module (engine) belonging to the engine layer may be written based on Python, for example. However, it is not limited thereto.
- Data learned through the tone conversion learning module belonging to the engine layer may be transmitted to a user voice synthesis model of the storage layer and a user model management DB of the database layer, respectively.
- the voice conversion learning module may start learning based on the voice conversion base model of the storage layer and user voice data of the database layer.
- the voice conversion learning module may perform speaker transfer learning to fit a new user voice based on the voice conversion base model.
- the voice conversion learning module may generate a user voice synthesis model as a learning result.
- the voice conversion learning module may generate a plurality of user voice synthesis models for one user.
- the tone conversion learning module may generate a model similar to the generated user voice synthesis model according to a request or setting.
- the similar model may be one obtained by arbitrarily modifying or changing a part previously defined in the initial user voice synthesis model.
- the voice conversion learning module may generate a new voice synthesis model by combining it with another previously generated user voice synthesis model of the corresponding user.
- Various new voice synthesis models may be generated by such combinations, depending on the user's previously created voice synthesis models.
- the tone conversion learning module may store learning completion status information in a user model management DB.
- the speech synthesis reasoning module may receive a speech synthesis request for text together with text from a user through a speech synthesis function of a development toolkit of a service layer.
- When a voice synthesis request is received, the voice synthesis inference module may generate a synthesized voice using the user voice synthesis model on the storage layer, that is, the user voice synthesis model generated through the tone conversion learning module, and return or deliver it to the user through the development toolkit. Here, 'delivered through the development toolkit' may mean provided to the user through the screen of the artificial intelligence device 10.
- the framework layer may be implemented by including a voice conversion framework and a voice conversion learning framework, but is not limited thereto.
- the tone conversion framework is based on Java and can perform a function of transmitting commands, data, etc. between a development toolkit, an engine, and a database layer.
- the tone conversion framework may utilize a RESTful API for command transmission, but is not limited thereto.
- For example, when a speaker's sound source is uploaded through the development toolkit, the tone conversion framework can transfer it to the user voice data DB of the database layer; when a model creation, learning, or deletion request is made, the framework can transfer the request to the user model management DB of the database layer; and when a voice synthesis request is received, the framework can transmit it to the voice synthesis inference module of the engine layer, which may in turn pass it on to the user voice synthesis model of the storage layer.
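- For illustration, a command sent over the RESTful API that the tone conversion framework may utilize (see below) could look like the following Python sketch; the endpoint, request fields, and return format are purely hypothetical, not a disclosed API:

```python
import requests

def request_voice_synthesis(base_url: str, speaker_id: str, text: str) -> bytes:
    """Send a voice synthesis command and receive the synthesized audio bytes."""
    resp = requests.post(f"{base_url}/synthesis",              # hypothetical route
                         json={"speakerId": speaker_id, "text": text},
                         timeout=30)
    resp.raise_for_status()
    return resp.content  # synthesized audio returned to the development toolkit
```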
- the tone conversion learning framework may periodically check whether a learning request has been received from a user.
- the tone conversion learning framework may automatically start learning if there is a model to be trained.
- the tone conversion learning framework may transmit a confirmation signal to the user model management DB of the database layer as to whether or not the learning request has been received.
- the tone conversion learning framework may control the tone conversion learning module of the engine layer to start learning according to the contents returned from the user model management DB regarding the transmission of the above-described confirmation signal for whether or not the learning request is received.
- the voice conversion learning module may transfer the learning result to the user voice synthesis model of the storage layer and the user model management of the database layer, as described above.
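- The polling behaviour described above might be sketched as follows; `model_db` and `engine` are hypothetical interfaces introduced for the example, not a disclosed API:

```python
import time

def learning_scheduler(model_db, engine, poll_interval_s: int = 60):
    """Periodically check the user model management DB for pending learning
    requests and automatically start training for any model awaiting learning."""
    while True:
        pending = model_db.fetch_pending_requests()   # confirm whether requests exist
        for request in pending:
            model_db.mark_in_progress(request)        # record the learning progress state
            engine.start_tone_conversion_learning(request)
        time.sleep(poll_interval_s)
```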
- the service layer may provide a development toolkit (user interface) of the aforementioned voice synthesis service platform.
- the development toolkit may be provided on the screen of the artificial intelligence device 10 when the user uses the voice synthesis service platform through the artificial intelligence device 10.
- FIG. 9 is a flowchart illustrating a voice synthesis service process according to an embodiment of the present disclosure.
- the voice synthesis service according to the present disclosure is performed through a voice synthesis service platform, but various data may be transmitted/received between the artificial intelligence device 10, which is hardware, and the server 200 in the process.
- FIG. 9 is described based on the operation of the server 200 through the voice synthesis service platform, but is not limited thereto.
- the server 200 may provide a development toolkit for the user's convenience in using the speech synthesis service to be output on the artificial intelligence device 10 through the speech synthesis service platform. At least one of the processes shown in FIG. 9 may be performed on a development toolkit or through a development toolkit.
- the server 200 can check the registered learning request (S105) and start learning (S107).
- the server 200 may check the status of the created learning model (S111).
- When a speech synthesis request is received through the speech synthesis service platform after step S111 (S113), the server 200 performs speech synthesis based on the user speech synthesis model and the speech synthesis inference model and transmits the synthesized speech (S115).
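- The following sketch strings the steps of FIG. 9 together; every method on the hypothetical `server` object is assumed for illustration, and only the step order (S105 through S115) comes from the flowchart.

```python
def run_service_flow(server):
    """End-to-end sketch of the FIG. 9 flow with hypothetical server methods."""
    request = server.check_registered_learning_request()       # S105
    server.start_learning(request.speaker_id)                  # S107
    status = server.check_model_status(request.speaker_id)     # S111
    if status == "COMPLETED":
        synth_req = server.receive_synthesis_request()         # S113
        voice = server.synthesize(synth_req.speaker_id,
                                  synth_req.text)              # S115
        server.transmit(voice)
```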
- FIGS. 10A to 15D are diagrams illustrating a process of using a voice synthesis service on a service platform using a development toolkit according to an embodiment of the present disclosure.
- Hereinafter, the development toolkit is referred to as a user interface for convenience.
- FIG. 10A illustrates a user interface for functions available through a development toolkit for a speech synthesis service according to an embodiment of the present disclosure.
- various functions such as speaker information management, speaker voice registration, speaker voice confirmation, speaker model management, and speaker model voice synthesis are available through the development toolkit.
- FIGS. 10B and 10C illustrate user interfaces for the speaker information management function among the functions available through the development toolkit of FIG. 10A.
- the speaker information may include information about a speaker ID (or identifier), a speaker name, and a speaker registration date.
- FIG. 10C may show a screen for registering a new speaker through the registration button of FIG. 10B. As described above, one speaker can register a plurality of speakers.
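- A minimal data-structure sketch for the speaker information above; the class and field names are assumptions, and only the fields themselves (ID, name, registration date) and the one-to-many registration come from the description.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SpeakerInfo:
    # Fields named in the description: speaker ID, speaker name, registration date.
    speaker_id: str
    name: str
    registered_on: date

@dataclass
class SpeakerRegistry:
    # One account may hold a plurality of registered speakers.
    speakers: list[SpeakerInfo] = field(default_factory=list)

    def register(self, speaker: SpeakerInfo) -> None:
        self.speakers.append(speaker)
```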
- FIG. 11A shows a designated text list for registering a speaker's sound source (voice) for voice synthesis.
- The test text list shown in FIG. 11A may be registered by the speaker recording sound sources in the listed order.
- FIG. 11B shows a recording process for the test text list selected in FIG. 11A.
- When a test text is selected in FIG. 11A, the screen of FIG. 11B may be provided.
- Alternatively, a test text in the list may be automatically selected, with the display switching directly to the screen of FIG. 11B.
- Recording may be requested when the record button is activated, and when the speaker completes recording, an item for uploading the recorded file to the server 200 may be provided.
- When the recording function is activated and the speaker utters the given test text, information on the recording time and the speaker's recorded sound source waveform may be provided.
- Text data recognized from the utterance may also be provided so that it can be checked whether the text uttered by the speaker matches the provided test text.
- the server 200 may request a speaker to repeat utterances of the test text a plurality of times. Through this, the server 200 can determine whether the sound source waveforms according to the speaker's utterance match each time.
- the server 200 may request the speaker to utter different nuances for the same test text, or may request utterances of the same nuance.
- The server 200 may compare the sound source waveforms obtained from the speaker's utterances of the same test text, and utterances whose waveforms differ by more than a threshold value may be excluded from the count or not adopted.
- the server 200 may calculate an average value of sound source waveforms obtained when a speaker utters the same test text a predefined number of times.
- the server 200 may define a maximum allowable value and a minimum allowable value based on the calculated average value. When the average value, the maximum allowable value, and the minimum allowable value are defined in this way, the server 200 may reconfirm the defined values through a test on the values.
- The server 200 may redefine the predefined average value, maximum allowable value, and minimum allowable value when, based on the defined average value, the tested sound source waveforms deviate from the maximum or minimum allowable value consecutively more than a predetermined number of times.
- The server 200 may generate a reference sound source waveform reflecting the maximum and minimum allowable values around the average value for the text data, and may compare waveforms by overlaying the test sound source waveform on this reference waveform.
- The server 200 may filter out portions corresponding to silence or sound source waveforms below a predefined magnitude, and use only meaningful sound source waveforms as the comparison target when determining whether the waveforms match.
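- A sketch of this validation step, assuming per-take RMS energy as the compared waveform statistic and an arbitrary tolerance band; the description names the average, the max/min allowable values, and silence filtering, but not the exact measures.

```python
import numpy as np

def trim_silence(waveform: np.ndarray, floor: float = 1e-3) -> np.ndarray:
    """Drop samples whose magnitude is below a predefined size (treated as
    silence), so only meaningful portions are compared."""
    return waveform[np.abs(waveform) >= floor]

def validate_takes(takes: list[np.ndarray], tolerance: float = 0.2):
    """Average repeated takes of one test text and reject outliers.

    Per-take energy stands in for the 'sound source waveform' statistic, and
    `tolerance` stands in for the max/min allowable band; neither choice is
    specified in the description.
    """
    energies = np.array([np.mean(trim_silence(t) ** 2) for t in takes])
    avg = float(energies.mean())
    max_allowed, min_allowed = avg * (1 + tolerance), avg * (1 - tolerance)
    accepted = [t for t, e in zip(takes, energies)
                if min_allowed <= e <= max_allowed]
    return avg, accepted  # outlier takes are excluded / not adopted
```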
- When the speaker's sound source registration for one test text is completed, the server 200 may provide information on whether the sound source state is good and provide a service so that the speaker can upload the corresponding sound source information.
- The server 200 may provide an error message as shown in FIG. 12A when the speaker utters text such as 'Hello' rather than the given test text.
- When the volume of the recorded voice is below a threshold, an error message may be provided as shown in FIG. 12B.
- the threshold may be -30 dB, for example. However, it is not limited thereto.
- In this case, the server 200 may provide a 'Low Volume' error message.
- The intensity of the recorded voice, that is, the volume level, may be measured, for example, as an RMS (Root Mean Square) value.
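- Since the description names RMS and the example -30 dB threshold, the 'Low Volume' check could be sketched as follows; sample normalization to [-1.0, 1.0] is an assumption.

```python
import numpy as np

LOW_VOLUME_THRESHOLD_DB = -30.0  # the example threshold given above

def rms_db(waveform: np.ndarray) -> float:
    """Volume level as RMS in dB relative to full scale, assuming samples
    normalized to [-1.0, 1.0]."""
    rms = float(np.sqrt(np.mean(np.square(waveform))))
    return 20.0 * np.log10(max(rms, 1e-10))  # guard against log(0)

def is_low_volume(waveform: np.ndarray) -> bool:
    """True if the recording should trigger the 'Low Volume' error message."""
    return rms_db(waveform) < LOW_VOLUME_THRESHOLD_DB
```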
- the server 200 may provide information so that the speaker can clearly recognize what kind of error has occurred.
- Thereafter, a notification may be provided that the corresponding speaker's sound source for the test text has been uploaded to the server 200.
- If the speaker's sound source file exists on another device, the speaker may call (import) it and register it through the service platform.
- In this case, legal problems such as theft of sound sources may occur, and appropriate protection measures need to be devised.
- For example, the server 200 may determine whether the uploaded sound source corresponds to the test text and, if it does, may request the speaker's sound source for that test text again and determine whether the waveform of the newly requested sound source and that of the uploaded sound source match, or at least differ by less than a threshold value.
- As another example, before the speaker calls a sound source file stored on another device through the service platform, the server 200 may provide a legally effective notice in advance and allow the sound source file to be uploaded only when the speaker agrees.
- The server 200 may also register the voice of a person other than the speaker if it is called and uploaded through the service platform, provided there is no legal issue such as theft of the sound source.
- When sound source files for each test text are generated through the service platform, the server 200 may control the service so that the files can be uploaded all at once in bulk, or only a selected subset can be uploaded.
- the server 200 may perform service control such that sound source files of a plurality of speakers are uploaded and registered for each test text through the service platform.
- Each of the plurality of uploaded files may have a different sound source waveform depending on the speaker's emotional state or nuance with respect to the same test text.
- the user interface of FIG. 13A shows a list of sound sources registered by a speaker.
- The server 200 can control the service so that the speaker can play back or delete each registered sound source that the speaker directly uploaded and registered for each test text.
- the server 200 may provide a playback bar for reproducing the corresponding test text and sound source.
- the server 200 may provide a service so that a speaker can check a sound source registered by the speaker through a play bar.
- According to the confirmation result, the server 200 may provide a service so that the speaker can re-record, re-upload, and re-register the sound source for the test text through the above-described process, or delete it immediately as shown in FIG. 13C.
- the speaker model management may be, for example, a user interface for managing a speaker voice synthesis model.
- the server 200 may start model learning with each speaker ID, and may delete already learned models or registered sound sources.
- the server 200 may provide a service so that the speaker can check the progress status through confirmation of the learning progress status of the speaker's voice synthesis model.
- Referring to FIG. 14B, the learning progress of the model may be displayed as follows.
- FIG. 14B shows that a service may be provided to enable checking of statuses such as an INITIATE status indicating that no learning data exists in the initially registered state, a READY status indicating that learning data exists, a REQUESTED status indicating that learning has been requested, a PROCESSING status indicating that learning is in progress, a COMPLETED status indicating that learning is complete, a DELETED status indicating that the model has been removed, and a FAILED status indicating that an error occurred during learning.
- For example, the server 200 may provide a service so that a 'READY' status is displayed if there is learning data for the speaker whose speaker ID is 'Hong Gil-dong'.
- the server 200 may provide a guide message as shown in FIG. 14C.
- The guide message may vary according to the status of the speaker having the corresponding speaker ID.
- For example, the server 200 may provide a guide message so that the speaker with speaker ID 'Hong Gil-dong', currently in the 'READY' state, can request the next step, that is, the start of learning.
- When the server 200 receives a learning start request from the speaker through the corresponding guide message, the server 200 changes the state of the corresponding speaker from 'READY' to 'REQUESTED' and starts learning in the tone conversion learning module; while learning is in progress, the state may be changed to 'PROCESSING'.
- When learning is completed, the server 200 may automatically change the state of the corresponding speaker from 'PROCESSING' to 'COMPLETED'.
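- The status lifecycle above can be summarized as a small state machine; the transition table is inferred from the description, and the FAILED-to-REQUESTED retry edge is an assumption.

```python
from enum import Enum, auto

class ModelStatus(Enum):
    INITIATE = auto()    # first registered, no learning data
    READY = auto()       # learning data exists
    REQUESTED = auto()   # learning requested
    PROCESSING = auto()  # learning in progress
    COMPLETED = auto()   # learning complete
    DELETED = auto()     # model removed
    FAILED = auto()      # error occurred during learning

# Transitions implied by the description; FAILED -> REQUESTED (retry) is assumed.
ALLOWED = {
    ModelStatus.INITIATE: {ModelStatus.READY},
    ModelStatus.READY: {ModelStatus.REQUESTED},
    ModelStatus.REQUESTED: {ModelStatus.PROCESSING},
    ModelStatus.PROCESSING: {ModelStatus.COMPLETED, ModelStatus.FAILED},
    ModelStatus.COMPLETED: {ModelStatus.DELETED},
    ModelStatus.FAILED: {ModelStatus.REQUESTED},
}

def advance(current: ModelStatus, nxt: ModelStatus) -> ModelStatus:
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```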
- The user interface for speaker model voice synthesis may be used, for example, to perform voice synthesis once learning by the voice conversion learning module has been completed (COMPLETED).
- the illustrated user interface may be for at least one speaker ID that has been learned through the above process.
- The user interface may include at least one of: an item for selecting a speaker ID (or speaker name); an item for selecting/changing text for speech synthesis; a synthesis request item; a synthesis method control item; and playback, download, and deletion items.
- FIG. 15A is a user interface screen related to selecting a speaker who can start voice synthesis.
- The server 200 may provide, as selectable, at least one speaker ID for which learning by the tone conversion learning module has been completed, so that voice synthesis can be started.
- FIG. 15B is a user interface screen for selecting or changing the speech synthesis text desired by the speaker having the corresponding speaker ID, that is, the target text for speech synthesis.
- 'Kanadaramabasa' displayed in the corresponding item is only an example of a text item and is not limited thereto.
- the server 200 may activate a text item to provide a text input window as shown in FIG. 15B.
- In the text input window, the server 200 may provide a blank field so that the speaker can input text directly, or may provide any one of text set as a default and text selected at random from among texts frequently used for voice synthesis. Meanwhile, when the text input window is activated, an interface for voice input as well as an interface for text input such as a keyboard may be provided, and voice input may be converted through an STT process and entered into the text input window.
- While text is being entered, the server 200 may recommend and provide related keywords or text through autocompletion.
- the server 200 may control text selection for voice synthesis to be completed through selection of a change or close button.
- Thereafter, the server 200 may provide a guide message as shown in FIG. 15C, and voice synthesis may start according to the speaker's selection.
- The process of FIG. 15D described below may be performed between the steps of FIGS. 15B and 15C, or after the process of FIG. 15C; for convenience, the latter case is described.
- As shown in FIG. 15D, the speaker may listen to the synthesized voice by selecting the play button, download a sound source of the synthesized voice by selecting the download button, or delete the sound source generated for the synthesized voice by selecting the delete button.
- The server 200 may provide a service so that the speaker can adjust the synthesized voice of the text for which voice synthesis has been completed.
- For example, the volume level, pitch, or speed of the synthesized voice may be adjusted.
- The volume level may be set to an intermediate value by default (e.g., 5 when the volume range is 1-10), and the volume level (5) of the first synthesized voice can then be arbitrarily adjusted within the control range (1-10).
- Convenience of volume adjustment can be increased by immediately playing the synthesized voice at the adjusted volume level.
- For the pitch, a medium value may be set as the default for the first synthesized voice, but it may be changed to an arbitrary value (one of Lowest, Low, High, and Highest).
- Convenience of pitch adjustment can likewise be increased by providing the pitch-adjusted synthesized voice simultaneously with the adjustment.
- For the speed, a medium value may be set as the default for the first synthesized voice, but it may be adjusted to an arbitrary speed value (one of Very Slow, Slow, Fast, and Very Fast).
- Unlike the example above, the volume level may instead be provided so as to be selectable in a non-numerical manner.
- Likewise, the pitch and speed control values may instead be provided in numerical form.
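- The adjustment controls and their defaults might be modeled as follows; treating 'Medium' as the unnamed middle label for pitch and speed is an assumption, since the text only names the non-default values.

```python
from dataclasses import dataclass

PITCH_VALUES = ("Lowest", "Low", "Medium", "High", "Highest")
SPEED_VALUES = ("Very Slow", "Slow", "Medium", "Fast", "Very Fast")

@dataclass
class SynthesisControls:
    """Defaults per the description; 'Medium' as the middle label is assumed."""
    volume: int = 5        # midpoint of the 1-10 range
    pitch: str = "Medium"
    speed: str = "Medium"

    def set_volume(self, level: int) -> None:
        if not 1 <= level <= 10:
            raise ValueError("volume must be within 1-10")
        self.volume = level

    def set_pitch(self, value: str) -> None:
        if value not in PITCH_VALUES:
            raise ValueError(f"pitch must be one of {PITCH_VALUES}")
        self.pitch = value
```

- Per the description, these defaults could also be pre-mapped per speaker ID and applied upon a synthesis request.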
- A synthesized voice adjusted according to a request to adjust at least one of the volume, pitch, and speed of the first synthesized voice may be stored separately from, but linked to, the first synthesized voice.
- The adjusted synthesized voice may be applied only when played on the service platform, while for download only the first synthesized voice having the default values may be provided. However, it is not limited thereto; the adjustment may also be applied when downloading.
- the values of the basic volume, basic pitch, and basic speed may be changed according to preset values prior to a synthesis request.
- Each of the above values may be arbitrarily selected or changed.
- each value may be pre-mapped according to a speaker ID and may be applied upon a synthesis request.
- a user may have his/her own voice synthesis model, and through this, it may be utilized on various social media or personal broadcasting platforms.
- a personalized voice synthesizer can be used for virtual spaces or virtual characters such as digital humans or metaverses.
- The above-described method can be implemented as processor-readable code on a medium on which a program is recorded.
- Examples of processor-readable media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices.
- The artificial intelligence device described above is not limited to the configurations and methods of the above-described embodiments; all or part of the embodiments may be selectively combined so that various modifications can be made.
- Since the voice service system provides a personalized voice synthesis model and allows a user's unique synthesized voice to be used in various media environments, it has industrial applicability.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Claims (15)
- A method of providing a speech synthesis service, comprising: receiving sound source data for a speaker's speech synthesis for a plurality of predefined first texts through a speech synthesis service platform providing a development toolkit; performing tone conversion learning on the speaker's sound source data using a pre-generated tone conversion base model; generating a speech synthesis model for the speaker through the tone conversion learning; receiving a second text; generating a speech synthesis model through speech synthesis inference based on the speech synthesis model for the speaker and the second text; and generating a synthesized voice using the speech synthesis model.
- The method of claim 1, wherein receiving the sound source data for the speaker's speech synthesis for the plurality of predefined first texts comprises: receiving the speaker's sound source a plurality of times for each first text; and generating the sound source data for the speaker's speech synthesis based on the speaker's sound sources received the plurality of times.
- The method of claim 2, wherein the sound source data for the speaker's speech synthesis is an average value of the speaker's sound sources received the plurality of times.
- The method of claim 3, wherein performing the tone conversion learning comprises performing speaker transfer learning based on the tone conversion base model.
- The method of claim 1, wherein a plurality of the speech synthesis models are generated for the speaker.
- The method of claim 1, wherein only a first text selected from among the plurality of predefined first texts is used for the speech synthesis.
- The method of claim 1, further comprising: receiving a speaker ID and a third text; calling the speech synthesis model generated for the speaker corresponding to the speaker ID; performing speech synthesis on the third text based on the called speech synthesis model; and generating a synthesized voice for the third text.
- The method of claim 7, further comprising: receiving an input for at least one of a volume level, a pitch, and a speed of the generated synthesized voice; and adjusting one of the volume level, the pitch, and the speed of the generated synthesized voice based on the received input.
- An artificial intelligence-based speech synthesis service system comprising: an artificial intelligence device; and a computing device that exchanges data with the artificial intelligence device, wherein the computing device includes a processor that receives sound source data for a speaker's speech synthesis for a plurality of predefined first texts through a speech synthesis service platform providing a development toolkit, performs tone conversion learning on the speaker's sound source data using a pre-generated tone conversion base model to generate a speech synthesis model for the speaker, generates, when a second text is input, a speech synthesis model through speech synthesis inference based on the speech synthesis model for the speaker and the second text, and generates a synthesized voice using the speech synthesis model.
- The system of claim 9, wherein the processor receives the speaker's sound source a plurality of times for each first text and generates the sound source data for the speaker's speech synthesis based on the speaker's sound sources received the plurality of times.
- The system of claim 10, wherein the processor sets the sound source data for the speaker's speech synthesis to an average value of the speaker's sound sources received the plurality of times.
- The system of claim 11, wherein the processor performs the tone conversion learning by performing speaker transfer learning based on the tone conversion base model.
- The system of claim 9, wherein the processor generates a plurality of the speech synthesis models for the speaker and uses only a first text selected from among the plurality of predefined first texts for the speech synthesis.
- The system of claim 9, wherein, when a speaker ID and a third text are input, the processor calls the speech synthesis model generated for the speaker corresponding to the speaker ID and performs speech synthesis on the third text based on the called speech synthesis model to generate a synthesized voice for the third text.
- The system of claim 14, wherein the processor receives an input for at least one of a volume level, a pitch, and a speed of the generated synthesized voice and adjusts one of the volume level, the pitch, and the speed of the generated synthesized voice based on the received input.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020247015553A KR20240073991A (ko) | 2021-11-09 | 2022-10-20 | 음성 합성 서비스 제공 방법 및 그 시스템 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20210153451 | 2021-11-09 | ||
KR10-2021-0153451 | 2021-11-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023085635A1 true WO2023085635A1 (ko) | 2023-05-19 |
Family
ID=86336358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2022/015990 WO2023085635A1 (ko) | 2021-11-09 | 2022-10-20 | 음성 합성 서비스 제공 방법 및 그 시스템 |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20240073991A (ko) |
WO (1) | WO2023085635A1 (ko) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101665882B1 (ko) * | 2015-08-20 | 2016-10-13 | 한국과학기술원 | 음색변환과 음성dna를 이용한 음성합성 기술 및 장치 |
KR20190085882A (ko) * | 2018-01-11 | 2019-07-19 | 네오사피엔스 주식회사 | 기계학습을 이용한 텍스트-음성 합성 방법, 장치 및 컴퓨터 판독가능한 저장매체 |
US20200058288A1 (en) * | 2018-08-16 | 2020-02-20 | National Taiwan University Of Science And Technology | Timbre-selectable human voice playback system, playback method thereof and computer-readable recording medium |
KR20200048620A (ko) * | 2018-10-30 | 2020-05-08 | 주식회사 셀바스에이아이 | 음성 합성 모델의 학습용 데이터 생성 방법 및 음성 합성 모델의 학습 방법 |
WO2020246641A1 (ko) * | 2019-06-07 | 2020-12-10 | 엘지전자 주식회사 | 복수의 화자 설정이 가능한 음성 합성 방법 및 음성 합성 장치 |
- 2022
- 2022-10-20 WO PCT/KR2022/015990 patent/WO2023085635A1/ko active Application Filing
- 2022-10-20 KR KR1020247015553A patent/KR20240073991A/ko unknown
Also Published As
Publication number | Publication date |
---|---|
KR20240073991A (ko) | 2024-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020017849A1 (en) | Electronic device and method for providing artificial intelligence services based on pre-gathered conversations | |
WO2020222444A1 (en) | Server for determining target device based on speech input of user and controlling target device, and operation method of the server | |
WO2019039834A1 (en) | METHOD FOR PROCESSING VOICE DATA AND ELECTRONIC DEVICE SUPPORTING SAID METHOD | |
WO2017160073A1 (en) | Method and device for accelerated playback, transmission and storage of media files | |
WO2019078588A1 (ko) | 전자 장치 및 그의 동작 방법 | |
WO2020246634A1 (ko) | 다른 기기의 동작을 제어할 수 있는 인공 지능 기기 및 그의 동작 방법 | |
WO2019078615A1 (en) | METHOD AND ELECTRONIC DEVICE FOR TRANSLATING A VOICE SIGNAL | |
WO2020196955A1 (ko) | 인공 지능 기기 및 인공 지능 기기의 동작 방법 | |
WO2020085794A1 (en) | Electronic device and method for controlling the same | |
WO2020218650A1 (ko) | 전자기기 | |
WO2021045447A1 (en) | Apparatus and method for providing voice assistant service | |
WO2020230926A1 (ko) | 인공 지능을 이용하여, 합성 음성의 품질을 평가하는 음성 합성 장치 및 그의 동작 방법 | |
WO2021029627A1 (en) | Server that supports speech recognition of device, and operation method of the server | |
WO2020218635A1 (ko) | 인공 지능을 이용한 음성 합성 장치, 음성 합성 장치의 동작 방법 및 컴퓨터로 판독 가능한 기록 매체 | |
AU2019319322B2 (en) | Electronic device for performing task including call in response to user utterance and operation method thereof | |
WO2020263016A1 (ko) | 사용자 발화를 처리하는 전자 장치와 그 동작 방법 | |
WO2021086065A1 (en) | Electronic device and operating method thereof | |
WO2018174445A1 (ko) | 파셜 랜딩 후 사용자 입력에 따른 동작을 수행하는 전자 장치 | |
WO2021085661A1 (ko) | 지능적 음성 인식 방법 및 장치 | |
EP3841460A1 (en) | Electronic device and method for controlling the same | |
WO2023085584A1 (en) | Speech synthesis device and speech synthesis method | |
WO2019039873A1 (ko) | Tts 모델을 생성하는 시스템 및 전자 장치 | |
WO2020222338A1 (ko) | 화상 정보를 제공하는 인공 지능 장치 및 그 방법 | |
WO2023085635A1 (ko) | 음성 합성 서비스 제공 방법 및 그 시스템 | |
WO2020256170A1 (ko) | 인공 지능을 이용한 음성 합성 장치, 음성 합성 장치의 동작 방법 및 컴퓨터로 판독 가능한 기록 매체 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22893058 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 20247015553 Country of ref document: KR Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022893058 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2022893058 Country of ref document: EP Effective date: 20240607 |