CN117546233A - Electronic apparatus and control method thereof - Google Patents

Electronic apparatus and control method thereof

Info

Publication number
CN117546233A
Authority
CN
China
Prior art keywords
phoneme
acoustic feature
feature information
information
text
Legal status
Pending
Application number
CN202280043868.2A
Other languages
Chinese (zh)
Inventor
朴相俊
朱基岘
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority claimed from KR1020210194532A (KR20220170330A)
Application filed by Samsung Electronics Co Ltd
Priority claimed from PCT/KR2022/006304 (WO2022270752A1)
Publication of CN117546233A

Abstract

A method for controlling an electronic device, comprising: obtaining text; obtaining acoustic feature information corresponding to the text and alignment information in which each frame of the acoustic feature information is matched with each phoneme included in the text by inputting the text into the first neural network model; identifying speech speed of the acoustic feature information based on the alignment information; identifying a reference speaking speed of each phoneme included in the acoustic feature information based on the text and the acoustic feature information; obtaining speech speed adjustment information based on the speech speed of the acoustic feature information and the reference speech speed of each phoneme; and obtaining speech data corresponding to the text by inputting the acoustic feature information into the second neural network model based on the speech speed adjustment information.

Description

Electronic apparatus and control method thereof
Technical Field
The present disclosure relates generally to an electronic device and a control method thereof. More particularly, the present disclosure relates to an electronic device that performs speech synthesis using an artificial intelligence model and a control method thereof.
Background
With the development of electronic technology, various types of devices have been developed and distributed, and in particular, devices that perform speech synthesis have become widespread.
Speech synthesis, also referred to as text-to-speech (TTS), is a technology that generates human-like speech from text, and in recent years a neural TTS that uses a neural network model has been under development.
For example, the neural TTS may include a prosodic neural network model and a neural vocoder neural network model. The prosodic neural network model may receive text and output acoustic feature information, and the neural vocoder neural network model may receive acoustic feature information and output voice data (waveforms).
Among TTS models, the prosodic neural network model reflects the speech characteristics of the speaker used for training. In other words, the output of the prosodic neural network model may be acoustic feature information that includes the speech features and the speech speed features of a particular speaker.
In the related art, with the development of artificial intelligence models, a personalized TTS model has been proposed that outputs voice data including the voice features of a user of an electronic device. The personalized TTS model is a TTS model trained based on speech data of an individual user, and outputs speech data including the speech features and speech speed features of the user whose data was used in training.
The sound quality of the individual user's utterances used in training the personalized TTS model is generally lower than that of the data used to train a general TTS model, and thus a problem may occur with respect to the speech speed of the speech data output from the personalized TTS model.
Disclosure of Invention
Technical problem
An adaptive speech rate adjustment method for a text-to-speech (TTS) model is provided.
Technical proposal
According to an aspect of an exemplary embodiment, a method for controlling an electronic device may include: obtaining text; obtaining acoustic feature information corresponding to the text and alignment information in which each frame of the acoustic feature information is matched with each phoneme included in the text by inputting the text into the first neural network model; identifying speech speed of the acoustic feature information based on the alignment information; identifying a reference speaking speed of each phoneme included in the acoustic feature information based on the text and the acoustic feature information; obtaining speech speed adjustment information based on the speech speed of the acoustic feature information and the reference speech speed of each phoneme; and obtaining speech data corresponding to the text by inputting the acoustic feature information into the second neural network model based on the speech speed adjustment information.
Identifying the speaking speed of the acoustic feature information may include identifying the speaking speed corresponding to the first phoneme included in the acoustic feature information based on the alignment information. Identifying the reference speech speed for each phoneme may include: the first phoneme included in the acoustic feature information is recognized based on the acoustic feature information, and a reference speech speed corresponding to the first phoneme is recognized based on the text.
Identifying the reference speech speed corresponding to the first phoneme may include: obtaining a first reference speech speed corresponding to the first phoneme based on the text and sample data used for training the first neural network model.
Identifying the reference speech speed corresponding to the first phoneme may include: the method includes obtaining evaluation information of sample data for training a first neural network model, and identifying a second reference speech speed corresponding to the first phoneme based on the first reference speech speed corresponding to the first phoneme and the evaluation information. The evaluation information may be obtained by a user of the electronic device.
The method may include identifying a reference speech speed corresponding to the first phoneme based on one of the first reference speech speed and the second reference speech speed.
Identifying the speech speed corresponding to the first phoneme may include: the average speech speed corresponding to the first phoneme is recognized based on the speech speed corresponding to the first phoneme and the speech speed corresponding to at least one phoneme preceding the first phoneme in the acoustic feature information. Obtaining the speech speed adjustment information may include obtaining the speech speed adjustment information corresponding to the first phoneme based on the average speech speed corresponding to the first phoneme and the reference speech speed corresponding to the first phoneme.
The second neural network model may include an encoder configured to receive an input of the acoustic feature information, and a decoder configured to receive an input of the vector information output from the encoder. Obtaining the voice data may include: identifying the number of cycles of the decoder included in the second neural network model based on the speech speed adjustment information corresponding to the first phoneme when at least one frame corresponding to the first phoneme in the acoustic feature information is input to the second neural network model; and obtaining, based on the at least one frame corresponding to the first phoneme being input to the second neural network model, a plurality of pieces of first speech data corresponding to the at least one frame and the number of cycles. The first speech data may include speech data corresponding to the first phoneme.
A plurality of pieces of second voice data corresponding to the number of loops may be obtained based on one of at least one frame corresponding to the first phoneme in the acoustic feature information input to the second neural network model.
The decoder may be configured to obtain speech data of a first frequency based on acoustic feature information in which the shift size is a first time interval. Based on the value of the speech speed adjustment information being the reference value, a frame included in the acoustic feature information may be input to the second neural network model, and a plurality of pieces of second voice data corresponding to the product of the first time interval and the first frequency may be obtained.
The speaking speed adjustment information may include information on a ratio of the speaking speed of the acoustic feature information to the reference speaking speed of each phoneme.
According to an aspect of an exemplary embodiment, an electronic device may include: a memory configured to store instructions; and a processor configured to execute the instructions to: obtaining text; obtaining acoustic feature information corresponding to the text and alignment information in which each frame of the acoustic feature information is matched with each phoneme included in the text by inputting the text into the first neural network model; identifying speech speed of the acoustic feature information based on the alignment information; identifying a reference speaking speed of each phoneme included in the acoustic feature information based on the text and the acoustic feature information; obtaining speech speed adjustment information based on the speech speed of the acoustic feature information and the reference speech speed of each phoneme; and obtaining speech data corresponding to the text by inputting the acoustic feature information into the second neural network model based on the speech speed adjustment information.
The processor may be further configured to execute the instructions to: the speech speed corresponding to the first phoneme included in the acoustic characteristic information is recognized based on the alignment information, the first phoneme included in the acoustic characteristic information is recognized based on the acoustic characteristic information, and the reference speech speed corresponding to the first phoneme is recognized based on the text.
The processor may be further configured to execute the instructions to: obtain a first reference speech speed corresponding to the first phoneme based on the text and sample data used for training the first neural network model.
The processor may be further configured to execute the instructions to: obtaining evaluation information of sample data for training a first neural network model; and identifying a second reference speech speed corresponding to the first phoneme based on the first reference speech speed corresponding to the first phoneme and the evaluation information. The evaluation information is obtained by a user of the electronic device.
The processor may be further configured to execute the instructions to: a reference speech speed corresponding to the first phoneme is identified based on one of the first reference speech speed and the second reference speech speed.
Drawings
The above and other aspects, features, and advantages of certain embodiments of the present disclosure will become more apparent from the following description taken in conjunction with the accompanying drawings in which:
Fig. 1 is a block diagram showing a configuration of an electronic device according to an example embodiment.
Fig. 2 is a block diagram showing a configuration of a text-to-speech (TTS) model according to an example embodiment.
Fig. 3 is a block diagram showing a configuration of a neural network model in a TTS model according to an example embodiment.
Fig. 4 is a diagram illustrating a method for obtaining voice data with improved speech speed according to an example embodiment.
Fig. 5 is a diagram illustrating alignment information in which each frame of acoustic feature information is matched with each phoneme included in text according to an example embodiment.
Fig. 6 is a diagram illustrating a method for recognizing a reference speaking speed of each phoneme included in acoustic feature information according to an example embodiment.
Fig. 7 is a mathematical expression for describing an embodiment of recognizing an average speaking speed of each phoneme by an Exponential Moving Average (EMA) method according to an embodiment.
Fig. 8 is a diagram illustrating a method for recognizing a reference utterance speed according to an example embodiment.
Fig. 9 is a flowchart illustrating an operation of an electronic device according to an example embodiment.
Fig. 10 is a block diagram showing a configuration of an electronic device according to an example embodiment.
Detailed Description
Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a block diagram showing a configuration of an electronic device according to an example embodiment.
Referring to fig. 1, an electronic device 100 may include a memory 110 and a processor 120. In accordance with the present disclosure, the electronic device 100 may be implemented as various types of electronic devices, such as a smartphone, Augmented Reality (AR) glasses, a tablet Personal Computer (PC), a mobile phone, a video phone, an electronic book reader, a television (TV), a desktop PC, a laptop PC, a netbook computer, a workstation, a camera, a smartwatch, or a server.
The memory 110 may store at least one instruction or data regarding at least one of the other elements of the electronic device 100. In particular, the memory 110 may be implemented as a non-volatile memory, a flash memory, a Hard Disk Drive (HDD), or a Solid State Drive (SSD). The memory 110 is accessible by the processor 120, and reading, recording, correction, deletion, updating, and the like of data may be performed by the processor 120.
According to the present disclosure, the term memory may include the memory 110, Read Only Memory (ROM) and Random Access Memory (RAM) in the processor 120, and a memory card (not shown) (e.g., a micro Secure Digital (SD) card or a memory stick) attached to the electronic device 100.
As described above, the memory 110 may store at least one instruction. Herein, the instruction may be an instruction for controlling the electronic device 100. The memory 110 may store instructions related to functions for changing the operation mode according to the dialog situation of the user. In particular, according to the present disclosure, the memory 110 may include a plurality of constituent elements (or modules) for changing an operation mode according to a dialog situation of a user, which will be described below.
The memory 110 may store data, which is information in units of bits or bytes capable of representing characters, numbers, images, etc. For example, the memory 110 may store the first neural network model 10 and the second neural network model 20. Here, the first neural network model may be a prosodic neural network model, and the second neural network model may be a neural vocoder neural network model.
The processor 120 may be electrically connected to the memory 110 to control general operation and functions of the electronic device 100.
According to an embodiment, the processor 120 may be implemented as a Digital Signal Processor (DSP), a microprocessor, a Time Controller (TCON), or the like. However, the processor is not limited thereto, and may include one or more of a Central Processing Unit (CPU), a Micro Controller Unit (MCU), a Micro Processing Unit (MPU), a controller, an Application Processor (AP), a Communication Processor (CP), or an ARM processor, or may be defined by the corresponding term. Further, the processor 120 may be implemented as a System on Chip (SoC) or a Large Scale Integrated circuit (LSI) including a processing algorithm, or may be implemented in the form of a Field Programmable Gate Array (FPGA).
The one or more processors may perform control to process input data according to predefined action rules or an artificial intelligence model stored in the memory 110. The predefined action rules or the artificial intelligence model are formed through training. Here, being formed through training may mean that a predefined action rule or an artificial intelligence model with a desired characteristic is formed by applying a learning algorithm to a plurality of pieces of learning data. Such training may be performed in the device itself in which the artificial intelligence according to the present disclosure is performed, or by a separate server and/or system.
The artificial intelligence model may include a plurality of neural network layers. Each layer has a plurality of weight values, and the operation of a layer is performed using the operation result of the previous layer and the plurality of weights. Examples of neural networks may include Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), bidirectional recurrent deep neural networks, and Deep Q-Networks, but unless otherwise noted, the neural networks of the present disclosure are not limited to the above examples.
For example, the processor 120 may control a plurality of hardware or software elements connected to the processor 120 by driving an operating system or application programs, and perform various data processing and operations. Further, the processor 120 may load commands or data received from at least one of the other elements into a volatile memory, process them, and store various data in a non-volatile memory.
In particular, the processor 120 may provide an adaptive speech speed adjustment function when synthesizing speech data. Referring to fig. 1, the adaptive speech speed adjustment function according to the present disclosure may include a text obtaining module 121, an acoustic feature information obtaining module 122, a speaking speed obtaining module 123, a reference speech speed obtaining module 124, a speaking speed adjustment information obtaining module 125, and a voice data obtaining module 126, each of which may be stored in the memory 110. In an example, the adaptive speech speed adjustment function may adjust the speech speed by adjusting the number of cycles of the second neural network model 20 included in the text-to-speech (TTS) model 200 shown in fig. 2.
Fig. 2 is a block diagram showing a configuration of a TTS model according to an example embodiment. Fig. 3 is a block diagram showing a configuration of a neural network model (e.g., a neural vocoder neural network model) in a TTS model according to an example embodiment.
The TTS model 200 shown in fig. 2 may include a first neural network model 10 and a second neural network model 20.
The first neural network model 10 may be a constituent element for receiving text 210 and outputting acoustic feature information 220 corresponding to the text 210. In an example, the first neural network model 10 may be implemented as a prosodic neural network model.
The prosodic neural network model may be a neural network model in which a relationship between a plurality of sample texts and a plurality of pieces of sample acoustic feature information corresponding to the plurality of sample texts, respectively, has been learned. Specifically, the prosodic neural network model may learn a relationship between one sample text and sample acoustic feature information obtained from sample speech data corresponding to that sample text, and such processing may be performed on a plurality of sample texts to train the prosodic neural network model. Further, in an example, the prosodic neural network model may include a language processor for performance enhancement, and the language processor may include a text normalization module, a grapheme-to-phoneme (G2P) conversion module, and the like. The acoustic feature information 220 output from the first neural network model 10 may include speech features of the speaker used in the training of the first neural network model 10. In other words, the acoustic feature information 220 output from the first neural network model 10 may include speech features of a particular speaker (e.g., a speaker corresponding to data used in training of the first neural network model).
The second neural network model 20 is a neural network model for converting the acoustic feature information 220 into the speech data 230, and may be implemented as a neural vocoder neural network model. According to the present disclosure, the neural vocoder neural network model may receive the acoustic feature information 220 output from the first neural network model 10 and output voice data 230 corresponding to the acoustic feature information 220. Specifically, the second neural network model 20 may be a neural network model in which the relationship between a plurality of pieces of sample acoustic feature information and the sample speech data corresponding to each piece of sample acoustic feature information has been learned.
Further, referring to fig. 3, the second neural network model 20 may include an encoder 20-1 receiving input of acoustic feature information 220 and a decoder 20-2 receiving input of vector information output from the encoder 20-1 and outputting voice data 230, and the second neural network model 20 will be described below with reference to fig. 3.
Returning to fig. 1, the plurality of modules 121 to 126 may be loaded to a memory (e.g., a volatile memory) included in the processor 120 in order to perform an adaptive speech speed adjustment function. In other words, to perform the adaptive speech speed adjustment function, the processor 120 may perform the function of each of the plurality of modules 121-126 by loading the plurality of modules 121-126 from the non-volatile memory to the volatile memory. Loading may refer to an operation of calling data stored in a non-volatile memory to a volatile memory and storing the data therein to enable the processor 120 to access the data.
In the embodiment according to the present disclosure, referring to fig. 1, the adaptive speaking speed adjusting function may be implemented through a plurality of modules 121 to 126 stored in the memory 110, but is not limited thereto, and may be implemented through an external device connected to the electronic device 100.
The plurality of modules 121 to 126 according to the present disclosure may each be implemented as software, but are not limited thereto, and some modules may be implemented as a combination of hardware and software. In another embodiment, the plurality of modules 121 to 126 may be implemented as one piece of software. In addition, some modules may be implemented in the electronic device 100 and other modules may be implemented in an external device.
The text obtaining module 121 may be a module for obtaining text to be converted into voice data. In an example, the text obtained by the text obtaining module 121 may be text corresponding to a response to a voice command of the user. In an example, the text may be text displayed on a display of the electronic device 100. In an example, the text may be text entered by a user of the electronic device 100. In an example, the text may be text provided from a speech recognition system (e.g., Bixby). In an example, the text may be text received from an external server. In other words, according to the present disclosure, the text may be any of various texts to be converted into voice data.
The acoustic feature information obtaining module 122 may be a constituent element for obtaining acoustic feature information corresponding to the text obtained by the text obtaining module 121.
The acoustic feature information obtaining module 122 may input the text obtained by the text obtaining module 121 to the first neural network model 10 and output acoustic feature information corresponding to the input text.
According to the present disclosure, the acoustic feature information may include information (e.g., intonation information, rhythm information, and speech speed information) about the voice features of a specific speaker. Such acoustic feature information may be input to the second neural network model 20, which will be described below, to output voice data corresponding to the text.
In this context, acoustic feature information may refer to speech features within short intervals (e.g., frames) of speech data, and acoustic feature information for each interval may be obtained through short-time analysis of the speech data. The frame length of the acoustic feature information may be set to 10 to 20 msec, but may be set to any other time interval. Examples of acoustic feature information may include the spectrum, Mel spectrum, cepstrum, pitch lag, pitch correlation, and the like, and one of these or a combination thereof may be used.
For example, the acoustic feature information may be configured as a 257-dimensional spectrum, an 80-dimensional Mel spectrum, or a combination of a cepstrum (20-dimensional), a pitch lag (one-dimensional), and a pitch correlation (one-dimensional). More specifically, for example, in the case where the shift size is 10 msec and an 80-dimensional Mel spectrum is used as the acoustic feature information, [100, 80]-dimensional acoustic feature information may be obtained from 1 second of voice data, and [T, D] herein has the following meaning.
[T, D]: T frames of D-dimensional acoustic feature information.
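As a rough illustration of this frame layout, the sketch below (not part of the disclosed embodiments) extracts [T, D] Mel-spectrum frames with a common audio library; the use of librosa, the 24 kHz sampling rate, and the parameter values are assumptions chosen to match the 10 msec shift and 80-dimensional example above.

```python
# Illustrative sketch only: extracting [T, D] acoustic feature frames as described
# above. The library choice (librosa) and all parameter values are assumptions.
import numpy as np
import librosa

SAMPLE_RATE = 24000   # assumed sampling rate (Hz)
HOP_LENGTH = 240      # 240 samples at 24 kHz = 10 msec shift size
N_MELS = 80           # 80-dimensional Mel spectrum

def extract_mel_frames(waveform: np.ndarray) -> np.ndarray:
    """Return acoustic feature information of shape [T, D] = [frames, 80]."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=SAMPLE_RATE, n_fft=1024, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    return mel.T  # [T, 80]

# 1 second of audio -> roughly 100 frames of 80-dimensional features, i.e., [100, 80]
one_second = np.zeros(SAMPLE_RATE, dtype=np.float32)
print(extract_mel_frames(one_second).shape)
```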
In addition, the acoustic feature information obtaining module 122 may obtain alignment information in which each frame of acoustic feature information output from the first neural network model 10 is matched with each phoneme included in the input text. Specifically, the acoustic feature information obtaining module 122 may obtain acoustic feature information corresponding to the text by inputting the text to the first neural network model 10, and obtain alignment information in which each frame of the acoustic feature information is matched with each phoneme included in the text input to the first neural network model 10.
According to the present disclosure, the alignment information may be matrix information for alignment between input/output sequences in a sequence-to-sequence model. In particular, the alignment information indicates from which input each time step of the output sequence is predicted. Further, according to the present disclosure, the alignment information obtained from the first neural network model 10 may be alignment information in which the phonemes corresponding to the text input to the first neural network model 10 are matched with the frames of the acoustic feature information output from the first neural network model 10, and the alignment information will be described below with reference to fig. 5.
The speaking speed obtaining module 123 is a constituent element for recognizing the speaking speed of the acoustic feature information obtained from the acoustic feature information obtaining module 122 based on the alignment information obtained from the acoustic feature information obtaining module 122.
The speaking speed obtaining module 123 may identify the speaking speed corresponding to each phoneme included in the acoustic feature information obtained from the acoustic feature information obtaining module 122 based on the alignment information obtained from the acoustic feature information obtaining module 122.
Specifically, the speaking speed obtaining module 123 may identify the speaking speed of each phoneme included in the acoustic feature information obtained from the acoustic feature information obtaining module 122 based on the alignment information obtained from the acoustic feature information obtaining module 122. According to the present disclosure, since the alignment information matches the phonemes corresponding to the text input to the first neural network model 10 with the frames of the acoustic feature information output from the first neural network model 10, a first phoneme is found to be uttered more slowly as the number of frames of acoustic feature information corresponding to the first phoneme in the alignment information becomes larger. In an example, when the number of frames of acoustic feature information corresponding to the first phoneme is identified as three and the number of frames of acoustic feature information corresponding to a second phoneme is identified as five based on the alignment information, the speaking speed of the first phoneme is found to be relatively higher than the speaking speed of the second phoneme.
When obtaining the speaking speed of each phoneme included in the text, the speaking speed obtaining module 123 may obtain the average speaking speed of a specific phoneme in consideration of the speaking speed corresponding to the specific phoneme and at least one phoneme preceding that phoneme in the text. In an example, the utterance speed obtaining module 123 may recognize an average utterance speed corresponding to a first phoneme based on the utterance speed corresponding to the first phoneme included in the text and the utterance speed corresponding to each of at least one phoneme preceding the first phoneme.
However, since the speaking speed of one phoneme is the speed of a short section, when the speaking speed of an extremely short section is predicted, the length difference between phonemes may be reduced, thereby producing an unnatural result. In addition, when the speaking speed of the extremely short section is predicted, the speaking speed predicted value changes excessively fast on the time axis, thereby producing an unnatural result. Accordingly, in the present disclosure, an average speaking speed corresponding to a phoneme considering the speaking speed of a phoneme before the phoneme may be recognized, and the recognized average speaking speed may be used as the speaking speed of the corresponding phoneme.
However, when the average speech speed of an overly long section is predicted in speech speed prediction, it is difficult to reflect whether slow utterances and fast utterances coexist within the text. Further, in a streaming structure, in which the speaking speed is predicted from speech that has already been output, a delay in the speech speed adjustment may occur, and thus it is necessary to provide a method for measuring the average speech speed over an appropriate section.
According to an embodiment, the average speech speed may be recognized by a simple moving average method or an Exponential Moving Average (EMA) method, and this will be described in detail below with reference to fig. 6 and 7.
The reference utterance speed obtaining module 124 is a constituent element for recognizing the reference utterance speed of each phoneme included in the acoustic feature information. According to the present disclosure, the reference speaking speed may refer to an optimal speaking speed perceived as an appropriate speed for each phoneme included in the acoustic feature information.
In a first embodiment, the reference speech speed obtaining module 124 may obtain a first reference speech speed corresponding to a first phoneme included in the acoustic feature information based on sample data (e.g., sample text and sample speech data) for training the first neural network model 10.
In an example, when the number of vowels in the phoneme sequence including the first phoneme is large, the first reference utterance speed corresponding to the first phoneme may be relatively slow. In addition, when the number of consonants in the phoneme sequence including the first phoneme is large, the first reference utterance speed corresponding to the first phoneme may be relatively fast. Further, when the word including the first phoneme is a word to be emphasized, the corresponding word will be uttered slowly, and thus the first reference speaking speed corresponding to the first phoneme may be relatively slow.
In an example, the reference speech speed obtaining module 124 may obtain the first reference speech speed corresponding to the first phoneme using a third neural network model that estimates the reference speech speed. In particular, the reference utterance speed obtaining module 124 may identify the first phoneme according to the alignment information obtained from the acoustic feature information obtaining module 122. Further, the reference speech speed obtaining module 124 may obtain the first reference speech speed corresponding to the first phoneme by inputting information about the recognized first phoneme and the text obtained from the text obtaining module 121 to the third neural network model.
In an example, the third neural network model may be trained based on the sample data (e.g., sample text and sample speech data) used in the training of the first neural network model 10. In other words, the third neural network model may be trained to estimate the interval average speech speed of the sample acoustic feature information based on the sample acoustic feature information and the sample text corresponding to the sample acoustic feature information. Here, the third neural network model may be implemented as a statistical model capable of estimating an interval average utterance speed, such as a hidden Markov model (HMM) or a DNN. Data for training the third neural network model will be described below with reference to fig. 8.
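The disclosure leaves the concrete form of the third neural network model open (e.g., an HMM or a DNN). Purely as a hedged sketch, a small regression network of the kind that could estimate a per-phoneme reference speaking speed might look as follows; the feature layout, dimensions, and the use of PyTorch are assumptions for illustration, not part of the disclosed embodiments.

```python
# Hedged sketch: a small DNN regressor standing in for the "third neural network
# model" that estimates a reference speaking speed per phoneme. All dimensions,
# feature choices, and the use of PyTorch are assumptions.
import torch
import torch.nn as nn

class ReferenceSpeedEstimator(nn.Module):
    def __init__(self, phoneme_feat_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(phoneme_feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # reference speaking speed (phonemes/second)
        )

    def forward(self, phoneme_features: torch.Tensor) -> torch.Tensor:
        # phoneme_features: [num_phonemes, phoneme_feat_dim], derived from the text
        return self.net(phoneme_features).squeeze(-1)

# The training target would be the interval-average speaking speed measured from
# the sample speech data used to train the first neural network model.
```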
In the above-described embodiment, obtaining the first reference utterance speed corresponding to the first phoneme using the third neural network model is described, but the disclosure is not limited thereto. In other words, the reference speech speed obtaining module 124 may obtain the first reference speech speed corresponding to the first phoneme using a rule-based prediction method or a decision-based prediction method instead of the third neural network model.
In a second embodiment, the reference speech speed obtaining module 124 may obtain a second reference speech speed, which is a speech speed subjectively determined by a user listening to the voice data. Specifically, the reference utterance speed obtaining module 124 may obtain evaluation information of sample data used in training of the first neural network model 10. In an example, the reference speech speed obtaining module 124 may obtain evaluation information of the sample speech data used in the training of the first neural network model 10 by the user. Here, the evaluation information may be evaluation information of a speed perceived by a user who listens to the sample voice data. In an example, the evaluation information may be obtained by receiving user input through a UI displayed on a display of the electronic device 100.
In an example, if a user listening to the sample voice data feels that the speaking speed of the sample voice data is slightly slow, the reference speaking speed obtaining module 124 may obtain, from the user, first evaluation information for setting the speaking speed of the sample voice data faster (e.g., 1.1 times). In an example, if a user listening to the sample voice data feels that the speaking speed of the sample voice data is slightly fast, the reference speaking speed obtaining module 124 may obtain, from the user, second evaluation information for setting the speaking speed of the sample voice data slower (e.g., 0.95 times).
Further, the reference utterance speed obtaining module 124 may obtain a second reference utterance speed obtained by applying the evaluation information to the first reference utterance speed corresponding to the first phoneme. In an example, when the first evaluation information is obtained, the reference utterance speed obtaining module 124 may identify an utterance speed corresponding to 1.1 times a first reference utterance speed corresponding to the first phoneme as a second reference utterance speed corresponding to the first phoneme. In an example, when the second evaluation information is obtained, the reference utterance speed obtaining module 124 may recognize an utterance speed corresponding to 0.95 times the first reference utterance speed corresponding to the first phoneme as the second reference utterance speed corresponding to the first phoneme.
In the third embodiment, the reference utterance speed obtaining module 124 may obtain a third reference utterance speed based on evaluation information of reference sample data. Herein, the reference sample data may include a plurality of sample texts and a plurality of pieces of sample voice data obtained by a reference speaker uttering each of the plurality of sample texts. In an example, first reference sample data may include a plurality of pieces of sample voice data obtained by a particular voice actor uttering each of the plurality of sample texts, and second reference sample data may include a plurality of pieces of sample voice data obtained by another voice actor uttering each of the plurality of sample texts. Further, the reference speech speed obtaining module 124 may obtain the third reference speech speed based on the user's evaluation information of the reference sample data. In an example, when the first evaluation information is obtained for the first reference sample data, the reference speech speed obtaining module 124 may identify a speed 1.1 times the speech speed of the first phoneme corresponding to the first reference sample data as the third reference speech speed corresponding to the first phoneme. In an example, when the second evaluation information is obtained for the first reference sample data, the reference utterance speed obtaining module 124 may identify a speed 0.95 times the utterance speed of the first phoneme corresponding to the first reference sample data as the third reference utterance speed corresponding to the first phoneme.
Further, the reference utterance speed obtaining module 124 may identify one of a first reference utterance speed corresponding to a first phoneme, a second reference utterance speed corresponding to the first phoneme, and a third reference utterance speed corresponding to the first phoneme as the reference utterance speed corresponding to the first phoneme.
The speaking speed adjustment information obtaining module 125 is a constituent element for obtaining speaking speed adjustment information based on the speaking speed corresponding to the first phoneme obtained by the speaking speed obtaining module 123 and the reference speaking speed corresponding to the first phoneme obtained by the reference speech speed obtaining module 124.
Specifically, when the speaking speed corresponding to the nth phoneme obtained by the speaking speed obtaining module 123 is defined as Xn and the reference speaking speed corresponding to the nth phoneme obtained by the reference speaking speed obtaining module 124 is defined as Xrefn, the speaking speed adjustment information Sn corresponding to the nth phoneme may be defined as (Xrefn/Xn). In an example, when the current predicted speaking speed X1 corresponding to the first phoneme is 20 (phonemes/second) and the reference speaking speed Xref1 corresponding to the first phoneme is 18 (phonemes/second), the speaking speed adjustment information S1 corresponding to the first phoneme may be 0.9.
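For clarity, the ratio described above amounts to a one-line computation; a minimal sketch, assuming both speeds are expressed in phonemes per second:

```python
# Minimal sketch of the speech speed adjustment information S_n = Xref_n / X_n.
def speech_speed_adjustment(predicted_speed: float, reference_speed: float) -> float:
    """Both speeds in phonemes/second, as in the example above."""
    return reference_speed / predicted_speed

print(speech_speed_adjustment(20.0, 18.0))  # 0.9, matching the example
```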
The voice data obtaining module 126 is a constituent element for obtaining voice data corresponding to text.
Specifically, the voice data obtaining module 126 may obtain voice data corresponding to the text by inputting acoustic feature information corresponding to the text obtained from the acoustic feature information obtaining module 122 to the second neural network model 20 set based on the speech speed adjustment information.
When at least one frame corresponding to the first phoneme in the acoustic feature information 220 is input to the second neural network model 20, the speech data obtaining module 126 may identify the number of cycles of the decoder 20-2 in the second neural network model 20 based on the speech speed adjustment information corresponding to the first phoneme. Further, the voice data obtaining module 126 may obtain pieces of first voice data corresponding to the number of cycles from the decoder 20-2 while at least one frame corresponding to the first phoneme is input to the second neural network model 20.
When one of at least one frame corresponding to the first phoneme in the acoustic feature information is input to the second neural network model 20, a plurality of pieces of second speech sample data, the number of which corresponds to the number of cycles, can be obtained. Further, the set of second speech sample data obtained by inputting each of at least one frame corresponding to the first phoneme to the second neural network model 20 may be the first speech data. Here, the plurality of pieces of first voice data may be voice data corresponding to the first phonemes.
In other words, the number of samples of the voice data to be output can be adjusted by adjusting the number of cycles of the decoder 20-2, and accordingly, the speaking speed of the voice data can be adjusted by adjusting the number of cycles of the decoder 20-2. The speaking speed adjusting method by the second neural network model 20 will be described below with reference to fig. 3.
The voice data obtaining module 126 may obtain voice data corresponding to text by inputting each of a plurality of phonemes included in the acoustic feature information to the second neural network model 20, and in the second neural network model 20, the number of cycles of the decoder 20-2 is set based on the speaking speed adjustment information corresponding to each of the plurality of phonemes.
Fig. 3 is a block diagram showing a configuration of a neural network model (e.g., a neural vocoder neural network model) in a TTS model according to an example embodiment.
Referring to fig. 3, the encoder 20-1 of the second neural network model 20 may receive the acoustic feature information 220 and output vector information 225 corresponding to the acoustic feature information 220. From the perspective of the second neural network model 20, the vector information 225 is data output from a hidden layer and thus may be referred to as a hidden representation.
When at least one frame corresponding to the first phoneme in the acoustic feature information 220 is input to the second neural network model 20, the voice data obtaining module 126 may recognize the number of cycles of the decoder 20-2 based on the speaking speed adjustment information corresponding to the first phoneme. Further, the voice data obtaining module 126 may obtain a plurality of pieces of first voice data corresponding to the number of cycles recognized from the decoder 20-2 while at least one frame corresponding to the first phoneme is input to the second neural network model 20.
In other words, when one of at least one frame corresponding to the first phoneme in the acoustic feature information is input to the second neural network model 20, a plurality of pieces of second speech sample data, the number of which corresponds to the number of cycles, can be obtained. In an example, when one of at least one frame corresponding to the first phoneme in the acoustic feature information 220 is input to the encoder 20-1 of the second neural network model 20, vector information corresponding thereto may be output. Further, the vector information is input to the decoder 20-2, and the decoder 20-2 may operate in N cycles, i.e., N cycles per frame of the acoustic feature information 220, and output N pieces of voice data.
Further, the set of second speech data obtained by inputting each of at least one frame corresponding to the first phoneme to the second neural network model 20 may be the first speech data. Here, the plurality of pieces of first voice data may be voice data corresponding to the first phonemes.
In an embodiment in which voice data of a first frequency (kHz) is obtained from the decoder 20-2 based on acoustic feature information whose shift size is a first time interval (sec), when the value of the speech speed adjustment information is the reference value (e.g., 1), one frame included in the acoustic feature information is input to the second neural network model 20, and the decoder 20-2 may operate for a number of cycles corresponding to the product of the first time interval and the first frequency, thereby obtaining pieces of voice data whose number corresponds to that number of cycles. In an example, when voice data of 24 kHz is obtained from the decoder 20-2 based on acoustic feature information whose shift size is 10 msec, and the value of the speech speed adjustment information is the reference value (e.g., 1), one frame included in the acoustic feature information is input to the second neural network model 20, and the decoder 20-2 may operate for 240 cycles, thereby obtaining 240 pieces of voice data.
Further, in an embodiment in which voice data of a first frequency is obtained from the decoder 20-2 based on acoustic feature information whose shift size is a first time interval, one frame included in the acoustic feature information is input to the second neural network model 20, and the decoder 20-2 may operate for a number of cycles corresponding to the product of the first time interval, the first frequency, and the speech speed adjustment information, thereby obtaining pieces of voice data whose number corresponds to that number of cycles. In an example, when voice data of 24 kHz is obtained from the decoder 20-2 based on acoustic feature information whose shift size is 10 msec, and the value of the speech speed adjustment information is 1.1, one frame included in the acoustic feature information is input to the second neural network model 20, and the decoder 20-2 may operate for 264 cycles, thereby obtaining 264 pieces of voice data.
Herein, the number of pieces of voice data (e.g., 264) obtained when the value of the speech speed adjustment information is 1.1 may be greater than the number of pieces of voice data (e.g., 240) obtained when the value of the speech speed adjustment information is the reference value. In other words, when the value of the speech speed adjustment information is adjusted to 1.1, the voice data corresponding to the previous shift size of 10 msec is output over 11 msec, and therefore the speech speed can be adjusted to be slower than in the case where the value of the speech speed adjustment information is the reference value.
In other words, when the reference value of the speech speed adjustment information is 1, if the value of the speech speed adjustment information is defined as S, the number of cycles N' of the decoder 20-2 may be as shown in equation (1).
Equation (1): N'_n = N × S_n
In equation (1), N'_n may represent the number of cycles of the decoder 20-2 for speech speed adjustment at the nth phoneme, and N may represent the reference number of cycles of the decoder 20-2. In addition, S_n is the value of the speech speed adjustment information at the nth phoneme; therefore, when S_n is 1.1, voice data lengthened by 10% can be obtained.
Further, as shown in equation (1), the speaking speed adjustment information may be differently set for each phoneme included in the acoustic feature information 220 input to the second neural network model 20. In other words, in the present disclosure, based on equation (1), the speech data having the speech speed adjusted in real time may be obtained by using an adaptive speech speed adjusting method for adjusting the speech speed differently for each phoneme included in the acoustic feature information 220.
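The cycle-count arithmetic described around equation (1) can be illustrated as follows; this is a sketch that assumes the relationship N'_n = N × S_n and integer rounding of the cycle count, which the disclosure does not spell out.

```python
# Sketch of the per-phoneme decoder cycle count of equation (1): N'_n = N x S_n,
# where N is the reference number of cycles (shift size x output frequency).
# Rounding to an integer cycle count is an assumption for illustration.
def decoder_cycles(shift_sec: float, output_hz: int, s_n: float) -> int:
    base_cycles = round(shift_sec * output_hz)  # e.g., 0.010 s x 24000 Hz = 240
    return round(base_cycles * s_n)             # e.g., 240 x 1.1 = 264

print(decoder_cycles(0.010, 24000, 1.0))  # 240 samples per frame (reference value)
print(decoder_cycles(0.010, 24000, 1.1))  # 264 samples per frame (10% slower speech)
```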
Fig. 4 is a diagram illustrating a method for obtaining speech data with improved speech speed by an electronic device according to an example embodiment.
Referring to fig. 4, the electronic device 100 may obtain text 210. Here, the text 210 is text to be converted into voice data, and the method for obtaining the text is not limited. In other words, the text 210 may include various texts, such as text input from a user of the electronic device 100, text provided from a voice recognition system (e.g., Bixby) of the electronic device 100, and text received from an external server.
Further, the electronic device 100 may obtain the acoustic feature information 220 and the alignment information 400 by inputting the text 210 to the first neural network model 10. Here, the acoustic feature information 220 may be information including a speech feature and a speaking speed feature corresponding to the text 210 of a specific speaker (e.g., a specific speaker corresponding to the first neural network model). The alignment information 400 may be alignment information in which phonemes included in the text 210 match each frame of the acoustic feature information 220.
Further, the electronic device 100 may obtain, through the speaking speed obtaining module 123, the speaking speed 410 corresponding to the acoustic feature information 220 based on the alignment information 400. Here, the speaking speed 410 may be information about the actual speaking speed that results when the acoustic feature information 220 is converted into the speech data 230. In addition, the speaking speed 410 may include speaking speed information for each phoneme included in the acoustic feature information 220.
In addition, the electronic device 100 may obtain the reference utterance speed 420 based on the text 210 and the alignment information 400 through the reference speech speed obtaining module 124. Here, the reference speaking speed 420 may refer to an optimal speaking speed of the phonemes included in the text 210. In addition, the reference speaking speed 420 may include reference speaking speed information for each phoneme included in the acoustic feature information 220.
In addition, the electronic device 100 may obtain the utterance speed adjustment information 430 based on the utterance speed 410 and the reference utterance speed 420 through the utterance speed adjustment information obtaining module 125. Here, the speaking speed adjustment information 430 may be information for adjusting the speaking speed of each phoneme included in the acoustic feature information 220. For example, if the speaking speed 410 of the mth phoneme is 20 (phonemes/second) and the reference speaking speed 420 of the mth phoneme is 18 (phonemes/second), the speaking speed adjustment information 430 of the mth phoneme may be recognized as 0.9 (18/20).
Further, the electronic device 100 may obtain the voice data 230 corresponding to the text 210 by inputting the acoustic feature information 220 to the second neural network model 20 set based on the speaking speed adjustment information 430.
In an embodiment, when at least one frame corresponding to the mth phoneme in the acoustic feature information 220 is input to the encoder 20-1 of the second neural network model 20, the electronic device 100 may identify the number of cycles of the decoder 20-2 of the second neural network model 20 based on the speaking speed adjustment information 430 corresponding to the mth phoneme. In an example, when the speaking speed adjustment information 430 for the mth phoneme is 0.9, the number of cycles of the decoder 20-2 while a frame corresponding to the mth phoneme in the acoustic feature information 220 is input to the encoder 20-1 may be (the basic number of cycles/the speaking speed adjustment information corresponding to the mth phoneme). In other words, if the basic number of cycles is 240, the number of cycles of the decoder 20-2 may be 264 while a frame corresponding to the mth phoneme in the acoustic feature information 220 is input to the encoder 20-1.
When the number of cycles is identified, the electronic device 100 may operate the decoder 20-2 for the number of cycles corresponding to the mth phoneme while each frame corresponding to the mth phoneme in the acoustic feature information 220 is input to the second neural network model 20, and may obtain, per frame of the acoustic feature information 220, pieces of voice data whose number corresponds to the number of cycles for the mth phoneme. Further, the electronic device 100 may obtain the voice data 230 corresponding to the text 210 by performing such processing on all phonemes included in the text 210.
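Read end to end, the flow of fig. 4 can be summarized in the sketch below; every name passed into the function is a hypothetical placeholder for one of the modules 121 to 126 described above, not an API defined by the disclosure.

```python
# Hedged end-to-end sketch of the fig. 4 flow. The callables passed in stand for
# the modules 121 to 126 described above; their names and signatures are
# hypothetical placeholders, not APIs defined by the disclosure.
def synthesize(text, first_nn, per_phoneme_speed, reference_speed_model,
               group_frames_by_phoneme, second_nn, cycles_for):
    acoustic_frames, alignment = first_nn(text)            # module 122
    speeds = per_phoneme_speed(alignment)                  # module 123
    ref_speeds = reference_speed_model(text, alignment)    # module 124
    speech_samples = []
    for n, frames in group_frames_by_phoneme(acoustic_frames, alignment):
        s_n = ref_speeds[n] / speeds[n]                    # module 125
        cycles = cycles_for(s_n)                           # per equation (1)
        for frame in frames:
            hidden = second_nn.encoder(frame)
            speech_samples.extend(second_nn.decoder(hidden, num_cycles=cycles))
    return speech_samples                                  # module 126
```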
Fig. 5 is a diagram illustrating alignment information in which each frame of acoustic feature information is matched with each phoneme included in text according to an example embodiment.
Referring to fig. 5, alignment information of each frame of acoustic feature information matched with each phoneme included in text may have a size of (N, T). Herein, N may represent the number of all phonemes included in the text 210, and T may represent the number of frames of the acoustic feature information 220 corresponding to the text 210.
When Λ_{n,t} is defined as the weight between the nth phoneme and the tth frame of the acoustic feature information 220, Σ_n Λ_{n,t} = 1 may be satisfied.
The phoneme P_t mapped to the tth frame in the alignment information may be given by equation (2).
Equation (2): P_t = argmax_n Λ_{n,t}
In other words, referring to equation (2), the phoneme P_t mapped to the tth frame may be the phoneme having the maximum weight Λ_{n,t} for the tth frame.
The length of the phoneme corresponding to P_t may be identified from the frames satisfying P_t = n ≠ P_{t+1} = n+1. In other words, when the length of the nth phoneme is defined as d_n, the length of the nth phoneme may be as shown in equation (3).
Equation (3): d_n = Σ_t 1(P_t = n), i.e., the number of frames mapped to the nth phoneme
In other words, referring to equation (3), d_1 in the alignment information of fig. 5 may be 2 and d_2 may be 3.
Phonemes that are not mapped to the maximum of any frame may exist, as in the boxed region of fig. 5. In an example, a special symbol may be used as a phoneme in the TTS model using the first neural network model 10; in this case, the special symbol may produce a pause, but may affect only the preceding and following prosody and may not actually be sounded. In this case, phonemes that are not mapped to any frame may exist, as in the boxed region of fig. 5.
In this case, the lengths of the unmapped phonemes may be allocated as in equation (4). In other words, among the frames satisfying P_t = n ≠ P_{t+1} = n+δ, the lengths of the nth through (n+δ−1)th phonemes may be allocated as in equation (4). Herein, δ may be a value greater than 1.
Equation (4): d_k = (Σ_t 1(P_t = n)) / δ, for k = n, …, n+δ−1
Referring to equation (4), d_7 in the alignment information of fig. 5 may be 0.5 and d_8 may be 0.5.
As described above, by the alignment information, the length of the phonemes included in the acoustic feature information 220 can be recognized, and the speaking speed of each phoneme can be recognized by the length of the phonemes.
Specifically, the speaking speed x_n of the nth phoneme included in the acoustic feature information 220 may be as shown in equation (5).
Equation (5): x_n = 1 / (r × d_n × frame length)
In equation (5), r may be a reduction factor of the first neural network model 10. In an example, when r is 1 and the frame length is 10 ms, x_1 may be 50 and x_2 may be 33.3.
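A compact sketch of equations (2), (3), and (5) is shown below, assuming the alignment information is available as an (N, T) NumPy array; the handling of unmapped phonemes in equation (4) is omitted.

```python
# Sketch of equations (2), (3), and (5): map each frame to its argmax phoneme,
# count frames per phoneme, and convert frame counts to speaking speeds.
# Assumes every phoneme is mapped to at least one frame (equation (4) omitted).
import numpy as np

def phoneme_speeds(alignment: np.ndarray, frame_sec: float = 0.010, r: int = 1):
    num_phonemes = alignment.shape[0]
    p_t = alignment.argmax(axis=0)                               # equation (2)
    d = np.array([(p_t == n).sum() for n in range(num_phonemes)], dtype=float)
    speeds = 1.0 / (r * d * frame_sec)                           # equation (5)
    return d, speeds

# With d_1 = 2 frames and d_2 = 3 frames at a 10 msec frame length:
# x_1 = 1 / (1 * 2 * 0.01) = 50, x_2 = 1 / (1 * 3 * 0.01) ≈ 33.3
```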
However, since the speaking speed of one phoneme is the speed of a short section, when the speaking speed of an extremely short section is predicted, the length difference between phonemes may be reduced, thereby producing an unnatural result. In addition, when the speaking speed of an extremely short section is predicted, the speaking speed prediction changes excessively fast on the time axis, thereby producing an unnatural result. Further, when the average speech speed of an overly long section is predicted, it is difficult to reflect whether slow utterances and fast utterances coexist within the text. Further, in a streaming structure, in which the speaking speed is predicted from speech that has already been output, a delay in the speech speed adjustment may occur, and thus it is necessary to provide a method for measuring the average speaking speed over an appropriate section; this will be described below with reference to fig. 6 and 7.
Fig. 6 is a diagram illustrating a method for recognizing an average speaking speed of each phoneme included in acoustic feature information according to an example embodiment.
Referring to embodiment 610 of fig. 6, electronic device 100 may calculate an average of utterance speeds of the last M phonemes included in acoustic feature information 220. In an example, if n < M, the average speech speed may be calculated by averaging only the corresponding elements.
In addition, when M is 5, as in embodiment 620 of fig. 6, the average speaking speed of the third phoneme may be calculated as the average of x_1, x_2, and x_3. Furthermore, the average speaking speed of the fifth phoneme may be calculated as the average of x_1 to x_5.
The method of calculating the average speaking speed of each phoneme in embodiments 610 and 620 of fig. 6 may be referred to as a simple moving average (SMA) method.
Fig. 7 is a mathematical expression for describing an embodiment of recognizing the average speaking speed of each phoneme by an exponential moving average (EMA) method according to an embodiment.
In other words, according to the EMA method expressed in fig. 7, the weight applied to the speaking speed of a phoneme decreases exponentially the farther that phoneme is from the current phoneme, and thus the average speed over an appropriate section can be calculated.
Here, when the α value of fig. 7 is large, the average speaking speed over a short section may be calculated, and when the α value is small, the average speaking speed over a long section may be calculated. Thus, the electronic device 100 may calculate the current average speaking speed in real time by selecting an appropriate α value according to circumstances.
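As a rough illustration of the two averaging schemes described above (the SMA of fig. 6 and the EMA of fig. 7), the sketch below computes a windowed average over the last M phoneme speeds and an exponentially weighted average with smoothing factor α. The exact expression of fig. 7 is not reproduced here, so the recursive EMA form x̄_n = α·x_n + (1 − α)·x̄_{n−1} is an assumption consistent with the description.

```python
def sma_speeds(speeds, m=5):
    """Simple moving average over the last m phoneme speeds (embodiments 610/620).
    For the first few phonemes (n < m) only the available speeds are averaged."""
    out = []
    for i in range(len(speeds)):
        window = speeds[max(0, i - m + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

def ema_speeds(speeds, alpha=0.3):
    """Exponential moving average: a larger alpha tracks a shorter section,
    a smaller alpha averages over a longer section."""
    if not speeds:
        return []
    avg = speeds[0]
    out = []
    for x in speeds:
        avg = alpha * x + (1.0 - alpha) * avg
        out.append(avg)
    return out
```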
Fig. 8 is a diagram illustrating a method for recognizing a reference utterance speed according to an embodiment.
Fig. 8 is a diagram illustrating a method for training a third neural network model, according to an embodiment, to obtain a reference utterance speed corresponding to each of phonemes included in acoustic feature information 220.
In an example, the third neural network model may be trained based on sample data (e.g., sample text and sample speech data). In an example, the sample data may be sample data used in the training of the first neural network model 10.
Acoustic feature information corresponding to the sample voice data may be extracted based on the sample voice data, and the speaking speed of each phoneme included in the sample voice data may be recognized as in fig. 8. Further, the third neural network model may be trained based on the sample text and the speech speed of each phoneme included in the sample speech data.
In other words, the third neural network model may be trained to estimate the interval-average speaking speed of the sample acoustic feature information based on the sample acoustic feature information and the sample text corresponding to the sample acoustic feature information. Here, the third neural network model may be implemented as a statistical model capable of estimating the interval-average speaking speed, such as a hidden Markov model (HMM) or a deep neural network (DNN).
The electronic device 100 may recognize the reference speaking speed of each phoneme included in the acoustic feature information 220 by using the trained third neural network model, the text 210, and the alignment information 400.
Fig. 9 is a flowchart showing an operation of the electronic device according to the embodiment.
Referring to fig. 9, the electronic device 100 may obtain text in operation S910. In this context, the text may include various texts such as text input from a user of the electronic device 100, text provided from a voice recognition system (e.g., bixby) of the electronic device, and text received from an external server.
Further, in operation S920, the electronic device 100 may obtain, by inputting the text to the first neural network model, acoustic feature information corresponding to the text and alignment information in which each frame of the acoustic feature information matches each phoneme included in the text. In an example, the alignment information may be matrix information having a size of (N, T), as shown in fig. 5.
In operation S930, the electronic device 100 may recognize the speaking speed of the acoustic feature information based on the obtained alignment information. Specifically, the electronic device 100 may recognize the speaking speed of each phoneme included in the acoustic feature information based on the obtained alignment information. Herein, the speaking speed of each phoneme may be the speaking speed corresponding to one phoneme, but is not limited thereto. In other words, the speaking speed of each phoneme may be an average speaking speed obtained by further considering the speaking speed corresponding to each of at least one phoneme preceding the corresponding phoneme.
Further, the electronic device 100 may recognize a reference speaking speed of each phoneme included in the acoustic feature information based on the text and the acoustic feature information in operation S940. Here, the reference utterance speed may be recognized by various methods as described with reference to fig. 1.
In an example, the electronic device 100 may obtain the first reference speaking speed of each phoneme included in the acoustic feature information based on the obtained text and sample data used in the training of the first neural network model.
In an example, the electronic device 100 may obtain evaluation information of the sample data used in training of the first neural network model. In an example, the electronic device 100 may provide the user with the speech data in the sample data and then receive evaluation information as feedback on it. The electronic device 100 may obtain a second reference speaking speed of each phoneme included in the acoustic feature information based on the first reference speaking speed and the evaluation information.
The electronic device 100 may recognize the reference utterance speed of each phoneme included in the acoustic feature information based on at least one of the first reference utterance speed and the second reference utterance speed.
In operation S950, the electronic device 100 may obtain the speaking speed adjustment information based on the speaking speed of the acoustic feature information and the reference speaking speed. Specifically, when the speaking speed corresponding to the n-th phoneme is defined as X_n and the reference speaking speed corresponding to the n-th phoneme is defined as X_ref,n, the speaking speed adjustment information S_n corresponding to the n-th phoneme may be defined as S_n = X_ref,n / X_n.
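As a one-line sketch of this ratio, with illustrative names: given the (average) speaking speed and the reference speaking speed of each phoneme, the adjustment value is simply their quotient.

```python
def speed_adjustment(speeds, ref_speeds):
    """S_n = X_ref,n / X_n for each phoneme; S_n < 1 indicates the phoneme was
    synthesized faster than the reference, S_n > 1 slower than the reference."""
    return [x_ref / x for x, x_ref in zip(speeds, ref_speeds)]
```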
In operation S960, the electronic device 100 may obtain voice data corresponding to the text by inputting the acoustic feature information to the second neural network model, which is set based on the obtained speaking speed adjustment information.
In particular, the second neural network model may include an encoder that receives the acoustic feature information as input and a decoder that receives the vector information output from the encoder as input and outputs voice data. When at least one frame corresponding to a specific phoneme included in the acoustic feature information is input to the second neural network model, the electronic device 100 may identify the number of cycles of the decoder included in the second neural network model based on the speaking speed adjustment information corresponding to that phoneme. The electronic device 100 may then obtain first voice data corresponding to that phoneme by operating the decoder for the identified number of cycles on the at least one frame input to the second neural network model.
Specifically, when one of at least one frame corresponding to a specific phoneme in the acoustic feature information is input to the second neural network model, a plurality of pieces of second speech data, the number of which corresponds to the number of recognized cycles, may be obtained. Further, the set of the plurality of second voice data obtained by at least one frame corresponding to a specific phoneme in the acoustic feature information may be first voice data corresponding to the specific phoneme. In other words, the second voice data may be voice data corresponding to one frame of acoustic feature information, and the first voice data may be voice data corresponding to one specific phoneme.
In an example, the decoder obtains voice data of a first frequency based on acoustic feature information whose offset (frame shift) size is a first time interval. When the value of the speaking speed adjustment information is the reference value, inputting one frame included in the acoustic feature information to the second neural network model yields pieces of second voice data whose number corresponds to the product of the first time interval and the first frequency.
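A hedged sketch of the relationship just described: at the reference adjustment value, one acoustic-feature frame yields hop_length × sample_rate output samples (e.g. 0.010 s × 24 000 Hz = 240 samples). The 24 kHz rate and the inverse scaling of the cycle count by the adjustment value S_n are illustrative assumptions, not values fixed by the patent.

```python
import math

def decoder_cycles_per_frame(hop_s, sample_rate, s_n=1.0, samples_per_cycle=1):
    """Number of decoder cycles needed for one acoustic-feature frame.

    At the reference value s_n = 1.0 one frame yields hop_s * sample_rate
    samples of second voice data. Dividing by s_n (an assumed reading) emits
    more samples for a phoneme that was too fast (s_n < 1), slowing it down,
    and fewer for one that was too slow (s_n > 1).
    """
    samples = hop_s * sample_rate / s_n
    return math.ceil(samples / samples_per_cycle)

# e.g. decoder_cycles_per_frame(0.010, 24000)       -> 240
#      decoder_cycles_per_frame(0.010, 24000, 0.8)  -> 300
```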
Fig. 10 is a block diagram showing a configuration of an electronic device according to an example embodiment. Referring to fig. 10, the electronic device 100 may include a memory 110, a processor 120, a microphone 130, a display 140, a speaker 150, a communication interface 160, and a user interface 170. The memory 110 and the processor 120 shown in fig. 10 overlap with the processor 120 and the memory 110 shown in fig. 1, and thus a description thereof will not be repeated. Further, according to an implementation example of the electronic device 100, some of the constituent elements of fig. 10 may be removed or other constituent elements may be added.
The microphone 130 is a constituent element through which the electronic device 100 receives an input voice signal. In particular, the microphone 130 may receive an external voice signal and process it into electrical voice data. In this case, the microphone 130 may transmit the processed voice data to the processor 120.
The display 140 is a constituent element through which the electronic device 100 visually provides information. The electronic device 100 may include one or more displays 140, and text to be converted into voice data, a UI for obtaining evaluation information from a user, and the like may be displayed through the displays 140. In this case, the display 140 may be implemented as a Liquid Crystal Display (LCD), a Plasma Display Panel (PDP), an Organic Light Emitting Diode (OLED) display, a Transparent OLED (TOLED) display, a micro LED display, or the like. Further, the display 140 may be implemented as a touch screen capable of sensing a user's touch manipulation, and may also be implemented as a flexible display capable of being folded or bent. In particular, the display 140 may visually provide a response corresponding to a command included in a voice signal.
The speaker 150 is a constituent element for the electronic device 100 to acoustically provide information. The electronic device 100 may include one or more speakers 150, and voice data obtained according to the present disclosure is output as an audio signal through the speakers 150. The constituent element for outputting an audio signal may be implemented as the speaker 150, but this is just one embodiment, and may also be implemented as an output terminal.
The communication interface 160 is a constituent element capable of communicating with an external device. The communication interface 160 may communicate with the external device via a third device (e.g., a repeater, a hub, an access point, a server, or a gateway). For example, the wireless communication may include cellular communication using at least one of Long Term Evolution (LTE), LTE-Advanced (LTE-A), Code Division Multiple Access (CDMA), Wideband CDMA (WCDMA), Universal Mobile Telecommunications System (UMTS), Wireless Broadband (WiBro), and Global System for Mobile Communications (GSM). According to an embodiment, the wireless communication may include, for example, at least one of Wireless Fidelity (WiFi), Bluetooth Low Energy (BLE), ZigBee, Near Field Communication (NFC), magnetic secure transmission, Radio Frequency (RF), or Body Area Network (BAN). The wired communication may include, for example, at least one of Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), Recommended Standard 232 (RS-232), power line communication, or Plain Old Telephone Service (POTS). The network for wireless and wired communication may include a telecommunications network, for example, at least one of a computer network (e.g., a LAN or WAN), the Internet, or a telephone network.
In particular, the communication interface 160 may provide voice recognition functionality to the electronic device 100 by communicating with an external server. However, the present disclosure is not limited thereto, and the electronic device 100 may provide a voice recognition function within the electronic device 100 without communicating with an external server.
The user interface 170 is a constituent element for receiving a user command for controlling the electronic device 100. In particular, the user interface 170 may be implemented as a device such as a button, a touch pad, a mouse, and a keyboard, and may also be implemented as a touch screen capable of performing a display function and manipulating an input function. Here, the buttons may be various types of buttons such as mechanical buttons, touch pads, or wheels formed in any area of the front, side, or rear of the exterior of the body of the electronic device 100.
It should be understood that the present disclosure includes various modifications, equivalents, and/or alternatives to the embodiments of the present disclosure. With respect to the description of the drawings, like reference numerals may be used for like constituent elements.
In this disclosure, terms such as "comprising," "possibly comprising," "consisting of …," or "possibly consisting of …" are used herein to specify the presence of corresponding features (e.g., constituent elements, such as numbers, functions, operations, or components), but do not preclude the presence of additional features.
In the specification, the term "a or B", "at least one of a or/and B" or "one or more of a or/and B" may include all possible combinations of items listed together. For example, the term "a or B" or "at least one of a or/and B" may denote (1) at least one a, (2) at least one B, or (3) at least one a and at least one B. In the specification, the terms "first, second, etc. are used to describe different constituent elements regardless of their order and/or importance and to distinguish one constituent element from another, but are not limited to the corresponding constituent elements.
If an element (e.g., a first element) is recited as being "operably coupled/connected" or "connected to" another element (e.g., a second element), it is understood that the element can be directly connected to the other element or through the other element (e.g., a third element). On the other hand, if an element (e.g., a first element) is described as being "directly coupled to" or "directly connected to" another element (e.g., a second element), it is understood that there are no elements (e.g., third element) between the element and the other element.
In the specification, the term "configured to" may be changed to, for example, "adapted to", "capable of", "designed to", "adapted to", "manufactured to" or "capable of" in some cases. The term "configured to (set to)" does not necessarily mean "specially designed to" at the hardware level. In some cases, the term "configured to" may refer to a device that is "capable of doing something with another device or component. For example, the phrase "a unit or processor configured (or arranged) to perform A, B and C" may refer to, for example, a dedicated processor (e.g., an embedded processor), a general-purpose processor (e.g., a Central Processing Unit (CPU) or an application processor), etc. for performing the corresponding operations, which may perform the corresponding operations by executing one or more software programs stored in a memory device.
The term "unit" or "module" as used herein includes a unit consisting of hardware, software or firmware and may be used interchangeably with terms such as logic, logic blocks, components or circuitry. "units" or "modules" may be an integrally formed component or may be a minimal unit or portion thereof for performing one or more functions. For example, a module may be implemented as an Application Specific Integrated Circuit (ASIC).
Various embodiments of the present disclosure may be implemented as software comprising instructions stored in a machine (e.g., computer) readable storage medium. The machine is a device capable of invoking instructions stored in a storage medium and operating according to the invoked instructions, and may include the electronic device according to the disclosed embodiments. Where the instructions are executed by a processor, the processor may perform functions corresponding to the instructions directly or using other elements under the control of the processor. The instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Herein, a "non-transitory" storage medium is tangible and may not include a signal, and it does not distinguish whether data is semi-permanently or temporarily stored in the storage medium.
According to an embodiment, a method according to various embodiments disclosed in the present disclosure may be provided in a computer program product. The computer program product may be exchanged between the seller and the buyer as a commercially available product. The computer program product may be distributed in the form of a machine-readable storage medium, such as a compact disc read only memory (CD-ROM), or online through an application store, such as playstore. In the case of online distribution, at least a portion of the computer program product may be at least temporarily stored or temporarily generated in a storage medium such as a memory of a manufacturer's server, a server of an application store, or a relay server.
Each element (e.g., module or program) according to the various embodiments described above may include a single entity or multiple entities, and some of the sub-elements described above may be omitted or other sub-elements may be further included in various embodiments. Alternatively or additionally, some elements (e.g., modules or programs) may be integrated into one entity to perform the same or similar functions performed by each respective element prior to integration. According to various embodiments, operations performed by modules, programs, or other elements may be performed sequentially, in parallel, repeatedly, or heuristically, or at least some operations may be performed in a different order, omitted, or different operations may be added.

Claims (15)

1. A method for controlling an electronic device, the method comprising:
obtaining text;
obtaining acoustic feature information corresponding to the text and alignment information in which each frame of the acoustic feature information is matched with each phoneme included in the text by inputting the text into the first neural network model;
identifying speech speed of the acoustic feature information based on the alignment information;
identifying a reference speaking speed of each phoneme included in the acoustic feature information based on the text and the acoustic feature information;
obtaining speech speed adjustment information based on the speech speed of the acoustic feature information and the reference speech speed of each phoneme; and
based on the speech speed adjustment information, speech data corresponding to the text is obtained by inputting acoustic feature information into the second neural network model.
2. The method of claim 1, wherein recognizing the speaking speed of the acoustic feature information includes recognizing the speaking speed corresponding to the first phoneme included in the acoustic feature information based on the alignment information, and
wherein the identifying the reference speech speed for each phoneme comprises:
identifying a first phoneme included in the acoustic feature information based on the acoustic feature information; and
A reference speech speed corresponding to the first phoneme is identified based on the text.
3. The method of claim 2, wherein identifying a reference speech speed corresponding to the first phoneme comprises:
obtaining a first reference speech speed corresponding to the first phoneme based on the text and sample data used for training the first neural network model.
4. The method of claim 3, wherein identifying a reference speech speed corresponding to the first phoneme further comprises:
obtaining evaluation information of sample data for training a first neural network model; and
identifying a second reference speech speed corresponding to the first phoneme based on the first reference speech speed corresponding to the first phoneme and the evaluation information, and
wherein the evaluation information is obtained by a user of the electronic device.
5. The method of claim 4, further comprising:
a reference speech speed corresponding to the first phoneme is identified based on one of the first reference speech speed and the second reference speech speed.
6. The method according to claim 2,
wherein identifying the speech speed corresponding to the first phoneme further comprises: identifying an average speech speed corresponding to the first phoneme based on the speech speed corresponding to the first phoneme and the speech speed corresponding to at least one phoneme preceding the first phoneme among the acoustic feature information, and
Wherein obtaining the speech speed adjustment information includes obtaining the speech speed adjustment information corresponding to the first phoneme based on the average speech speed corresponding to the first phoneme and the reference speech speed corresponding to the first phoneme.
7. The method of claim 2, wherein the second neural network model includes an encoder configured to receive input of acoustic feature information and a decoder configured to receive input of vector information output from the encoder,
wherein obtaining speech data comprises:
identifying a number of cycles of a decoder included in the second neural network model based on the utterance speed adjustment information corresponding to the first phoneme when at least one frame corresponding to the first phoneme among the acoustic feature information is input to the second neural network model; and
obtaining at least one frame corresponding to the first phoneme and a plurality of pieces of first voice data corresponding to the number of loops based on the input of the at least one frame corresponding to the first phoneme to the second neural network model, and
wherein the first speech data includes speech data corresponding to the first phoneme.
8. The method of claim 7, wherein a plurality of pieces of second voice data corresponding to the number of loops are obtained based on one of at least one frame corresponding to the first phoneme among the acoustic feature information being input to the second neural network model.
9. The method according to claim 7,
wherein the decoder is configured to obtain speech data of a first frequency based on the acoustic feature information in which the offset size is a first time interval, and
wherein a frame included in the acoustic feature information is input to the second neural network model based on the value of the speech speed adjustment information as a reference value, and a plurality of pieces of second voice data corresponding to the product of the first time interval and the first frequency are obtained.
10. The method of claim 1, wherein the speaking speed adjustment information includes information on a ratio of a speaking speed of the acoustic feature information to a reference speaking speed of each phoneme.
11. An electronic device, comprising:
a memory configured to store instructions; and
a processor configured to execute instructions to:
obtaining text;
obtaining acoustic feature information corresponding to the text and alignment information in which each frame of the acoustic feature information is matched with each phoneme included in the text by inputting the text into the first neural network model;
identifying speech speed of the acoustic feature information based on the alignment information;
identifying a reference speaking speed of each phoneme included in the acoustic feature information based on the text and the acoustic feature information;
Obtaining speech speed adjustment information based on the speech speed of the acoustic feature information and the reference speech speed of each phoneme; and
based on the speech speed adjustment information, speech data corresponding to the text is obtained by inputting acoustic feature information into the second neural network model.
12. The electronic device of claim 11, wherein the processor is further configured to execute the instructions to:
identifying a speaking speed corresponding to the first phoneme included in the acoustic feature information based on the alignment information;
identifying a first phoneme included in the acoustic feature information based on the acoustic feature information; and
a reference speech speed corresponding to the first phoneme is identified based on the text.
13. The electronic device of claim 12, wherein the processor is further configured to execute the instructions to:
obtaining a first reference speech speed corresponding to the first phoneme based on the text and sample data used for training the first neural network model.
14. The electronic device of claim 13, wherein the processor is further configured to execute the instructions to:
obtaining evaluation information of sample data for training a first neural network model; and
Identifying a second reference speech speed corresponding to the first phoneme based on the first reference speech speed corresponding to the first phoneme and the evaluation information, and
wherein the evaluation information is obtained by a user of the electronic device.
15. The electronic device of claim 14, wherein the processor is further configured to execute the instructions to:
a reference speech speed corresponding to the first phoneme is identified based on one of the first reference speech speed and the second reference speech speed.
CN202280043868.2A 2021-06-22 2022-05-03 Electronic apparatus and control method thereof Pending CN117546233A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0081109 2021-06-22
KR10-2021-0194532 2021-12-31
KR1020210194532A KR20220170330A (en) 2021-06-22 2021-12-31 Electronic device and method for controlling thereof
PCT/KR2022/006304 WO2022270752A1 (en) 2021-06-22 2022-05-03 Electronic device and method for controlling same

Publications (1)

Publication Number Publication Date
CN117546233A true CN117546233A (en) 2024-02-09

Family

ID=89782767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280043868.2A Pending CN117546233A (en) 2021-06-22 2022-05-03 Electronic apparatus and control method thereof

Country Status (1)

Country Link
CN (1) CN117546233A (en)

Similar Documents

Publication Publication Date Title
JP7106680B2 (en) Text-to-Speech Synthesis in Target Speaker&#39;s Voice Using Neural Networks
CN108573693B (en) Text-to-speech system and method, and storage medium therefor
CN106688034B (en) Text-to-speech conversion with emotional content
US8571871B1 (en) Methods and systems for adaptation of synthetic speech in an environment
EP3504709B1 (en) Determining phonetic relationships
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
JP2012037619A (en) Speaker-adaptation device, speaker-adaptation method and program for speaker-adaptation
US8600744B2 (en) System and method for improving robustness of speech recognition using vocal tract length normalization codebooks
JP2016537662A (en) Bandwidth extension method and apparatus
JPWO2018159612A1 (en) Voice conversion device, voice conversion method and program
US11468892B2 (en) Electronic apparatus and method for controlling electronic apparatus
CN110379411B (en) Speech synthesis method and device for target speaker
EP3796316A1 (en) Electronic device and control method thereof
KR102489498B1 (en) A method and a system for communicating with a virtual person simulating the deceased based on speech synthesis technology and image synthesis technology
CN113963679A (en) Voice style migration method and device, electronic equipment and storage medium
US11404045B2 (en) Speech synthesis method and apparatus
KR102198598B1 (en) Method for generating synthesized speech signal, neural vocoder, and training method thereof
US11848004B2 (en) Electronic device and method for controlling thereof
CN117546233A (en) Electronic apparatus and control method thereof
EP4207192A1 (en) Electronic device and method for controlling same
KR20220170330A (en) Electronic device and method for controlling thereof
CN114708876A (en) Audio processing method and device, electronic equipment and storage medium
JP2021099454A (en) Speech synthesis device, speech synthesis program, and speech synthesis method
KR20210135917A (en) Electronic device and operating method for generating a speech signal corresponding to at least one text
EP4207805A1 (en) Electronic device and control method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination