WO2020173134A1 - Method and device for speech synthesis based on an attention mechanism - Google Patents
- Publication number
- WO2020173134A1 (PCT/CN2019/117785)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- text
- matrix
- target text
- target
- Prior art date
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01L—MEASURING FORCE, STRESS, TORQUE, WORK, MECHANICAL POWER, MECHANICAL EFFICIENCY, OR FLUID PRESSURE
- G01L13/00—Devices or apparatus for measuring differences of two or more fluid pressure values
- G01L13/02—Devices or apparatus for measuring differences of two or more fluid pressure values using elastically-deformable members or pistons as sensing elements
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Definitions
- This application relates to the technical field of speech synthesis, and in particular to a speech synthesis method and device based on an attention mechanism.
- Speech synthesis, also called Text to Speech (TTS), is a technology that generates artificial speech through mechanical and electronic methods; it converts text information generated by a computer or input from outside into intelligible, fluently spoken Chinese.
- the original algorithm in existing attention-based speech synthesis technology can give the attention mechanism the property of forcibly aligning text and speech signals that have obvious position deviations, but its complexity is too high, and it usually requires a large amount of sample data and training time to achieve the desired effect. How to reduce the difficulty of the algorithm while still ensuring the forced alignment of text and speech signals with obvious position deviations is a problem to be solved at present.
- the present application is proposed in order to provide an attention mechanism-based speech synthesis method and device that overcomes or at least partially solves the above-mentioned problems.
- the embodiment of the present application provides a speech synthesis method based on an attention mechanism, which may include: determining a text coding matrix and an audio coding matrix according to the target text;
- the audio decoding matrix is determined by the function L(A) according to the text encoding matrix and the audio encoding matrix, where the function L(A) is the attention mechanism loss function determined according to A nt and W nt ; A nt is used to transform the text encoding matrix. If the alignment strength of the target text is less than the alignment strength threshold, W nt changes linearly. The alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which the nth character in the target text is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point of the total pronunciation of the target text; the Mel cepstrum coefficients are then determined according to the audio decoding matrix, and the target audio is determined according to the Mel cepstrum coefficients.
- An embodiment of the present application provides a speech synthesis device based on an attention mechanism, which may include: a first determining unit configured to determine a text coding matrix and an audio coding matrix according to a target text;
- the second determining unit is used to determine the audio decoding matrix through the function L(A) according to the text encoding matrix and the audio encoding matrix, where the function L(A) is the attention mechanism loss function determined according to A nt and W nt ; A nt is used to transform the text encoding matrix. If the alignment strength of the target text is less than the alignment strength threshold, W nt changes linearly. The alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which the nth character in the target text is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point of the total pronunciation of the target text;
- the third determining unit is used to determine the Mel cepstrum coefficient according to the audio decoding matrix, and determine the target audio according to the Mel cepstrum coefficient.
- the application embodiment provides a computer-readable storage medium that stores program instructions; when the program instructions are executed by a processor, the processor executes any of the above-mentioned attention-based speech synthesis methods.
- the embodiment of the application provides a speech synthesis device based on an attention mechanism, including a storage component, a processing component, and a communication component that are connected to one another.
- the storage component is used to store data processing code; the communication component is used for information interaction with external devices; the processing component is configured to call the program code and execute any of the speech synthesis methods based on the attention mechanism, which will not be repeated here.
- the embodiment of the application provides a speech synthesis method based on the attention mechanism.
- the audio decoding matrix is determined by the function L(A) according to the text encoding matrix and the audio encoding matrix of the target text; the Mel cepstrum coefficients can then be further determined according to the audio decoding matrix, and the target audio determined according to the Mel cepstrum coefficients.
- the function L(A) is the attention mechanism loss function determined according to A nt and W nt , where A nt is used to transform the text encoding matrix.
- if the alignment strength of the target text is less than the alignment strength threshold, W nt changes linearly; further, the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which the nth character in the target text is pronounced. Making W nt change linearly when the alignment strength is below the threshold not only greatly reduces the difficulty of the algorithm in the original attention mechanism, but also removes the need for a large amount of sample data and training time, while still ensuring that text and speech signals with obvious position deviations are forcibly aligned. This helps the attention mechanism matrix achieve approximate alignment faster, so that the speech synthesis is better organized.
- FIG. 1 is a schematic diagram of a speech synthesis system architecture based on an attention mechanism provided by an embodiment of the present application
- FIG. 2 is a schematic diagram of a terminal interface when synthesizing speech provided by an embodiment of the present application
- FIG. 3A is a schematic diagram of the process of a speech synthesis method based on an attention mechanism provided by an embodiment of the present application;
- FIG. 3B is a schematic diagram of the framework of a speech synthesis method based on an improved attention mechanism provided by an embodiment of the present application;
- FIG. 4 is a schematic diagram of another speech synthesis method based on an attention mechanism provided by an embodiment of the present application;
- FIG. 5 is a schematic structural diagram of a speech synthesis device based on an attention mechanism provided by an embodiment of the present application
- FIG. 6 is a schematic diagram of a physical device structure of a simplified speech synthesis device based on an attention mechanism provided by an embodiment of the present application.
- the term "server" used in this application denotes computer-related entities: hardware, firmware, a combination of hardware and software, software, or software in execution.
- the server may be, but is not limited to, a processor, a data processing platform, a computing device, a computer, two or more computers, etc.
- Speech synthesis takes a text as input and outputs speech corresponding to that text. It is also a technology that generates artificial speech through mechanical and electronic methods.
- TTS technology: also known as text-to-speech technology.
- Attention Mechanism is derived from the study of human vision. In cognitive science, due to the bottleneck of information processing, humans will selectively focus on a part of all information while ignoring other visible information. The above mechanism is usually referred to as the attention mechanism. Different parts of the human retina have different degrees of information processing capabilities, namely acuity, and only the fovea has the strongest acuity. In order to make rational use of the limited visual information processing resources, humans need to select a specific part of the visual area and then focus on it. For example, when people are reading, usually only a few words to be read will be paid attention to and processed. In summary, the attention mechanism mainly has two aspects: decide which part of the input needs to be paid attention to; and allocate limited information processing resources to important parts.
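The two aspects above (choosing which part of the input to attend to, and allocating limited processing resources to it) can be illustrated with a minimal softmax attention sketch. This is a generic illustration only; the feature dimensions and values are invented for the example and are not taken from the application.

```python
import numpy as np

def attention_weights(query, keys):
    """Score each input position against the query, then normalize with
    softmax so the limited 'processing resources' (the weight mass,
    which sums to 1) concentrate on the most relevant positions."""
    scores = keys @ query / np.sqrt(len(query))  # relevance per position
    scores -= scores.max()                       # numerical stability
    w = np.exp(scores)
    return w / w.sum()

# Four input positions with 3-dimensional features; the query points
# squarely at position 2, so attention should concentrate there.
keys = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [1.0, 1.0, 0]])
query = np.array([0.0, 0.0, 4.0])
w = attention_weights(query, keys)
print(w.round(3), int(w.argmax()))  # weights sum to 1; position 2 dominates
```

The softmax here is exactly the "allocation" step: every position receives some weight, but the relevant one receives most of it.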
- the short-time Fourier transform (STFT) is a variant of the Fourier transform used to determine the sinusoidal frequency and phase of a local part of a signal as it changes over time.
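A minimal STFT can be written directly from the definition: slide a window over the signal and take the FFT of each frame. The frame length, hop size, and the 440 Hz test tone below are illustrative assumptions, not parameters from the application.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Short-time Fourier transform: slide a Hann window over the signal
    and take the FFT of each frame, giving a time-frequency matrix."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape: (n_frames, frame_len//2 + 1)

# A steady 440 Hz tone sampled at 8 kHz: its energy stays in the same
# frequency bin of every frame, since the tone does not change over time.
sr = 8000
t = np.arange(sr) / sr
spec = stft(np.sin(2 * np.pi * 440 * t))
peak_bin = int(np.abs(spec).mean(axis=0).argmax())
print(spec.shape, peak_bin)   # bin spacing is sr/frame_len = 31.25 Hz
```

With 31.25 Hz per bin, the 440 Hz tone lands nearest bin 14, which is where the averaged magnitude peaks.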
- FIG. 1 is a schematic diagram of an attention mechanism-based speech synthesis system architecture provided by an embodiment of the present application, including: an attention mechanism-based speech synthesis device 101 and a terminal device 102.
- the speech synthesis apparatus 101 based on the attention mechanism may be a server, where the server may be, but is not limited to, a processor, a data processing platform, a computing device, a computer, two or more computers, etc.
- the server is a convenient service device that rapidly acquires, processes, and analyzes massive and diversified interactive data, extracts value from it, and thereby provides various services to third parties.
- the speech synthesis device 101 based on the attention mechanism can determine the text encoding matrix and the audio encoding matrix according to the target text; according to the text encoding matrix and the audio encoding matrix, the audio decoding matrix is determined by the function L(A), where the function L(A) is the attention mechanism loss function determined according to A nt and W nt , and A nt is used to transform the text encoding matrix.
- if the alignment strength of the target text is less than the alignment strength threshold, W nt changes linearly;
- the alignment strength is determined by the position of the nth character in the target text and the time point t at which the nth character in the target text is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point of the total pronunciation of the target text;
- the Mel cepstrum coefficient is determined according to the audio decoding matrix, and the target audio is determined according to the Mel cepstrum coefficient.
- the terminal device 102 may be a device at the periphery of a computer network, such as a communication terminal, a portable terminal, a mobile device, a user terminal, a mobile terminal, a wireless communication device, a user agent, a user device, a service device, or a user equipment (User Equipment, UE).
- the device is mainly used for data input and for outputting or displaying processing results, etc.; it can also be a software client, application, etc. installed or running on any of the above-mentioned devices.
- the client can be a smartphone, computer, or tablet device used by the target user or the current user, or a software client, application, etc. installed or running on a smartphone, computer, or tablet device. Please refer to FIG. 2.
- when the terminal device 102 is a computer, it can be used to send target text to the attention-mechanism-based speech synthesis device 101, and to receive and play the target audio sent by the attention-mechanism-based speech synthesis device 101.
- the speech synthesis device 101 based on the attention mechanism can simultaneously receive different target texts sent by multiple different terminal devices 102.
- FIG. 3A is a schematic diagram of a process of a speech synthesis method based on an attention mechanism provided by an embodiment of the present application. It can be applied to the system in FIG. 1 described above.
- the following describes the method from the side of the attention-mechanism-based speech synthesis device 101, in conjunction with FIG. 3A.
- the method may include the following steps S301-S303.
- Step S301 Determine the text coding matrix and the audio coding matrix according to the target text.
- specifically, the target text may be obtained, where the target text includes text of N characters; the offset audio is obtained, where the offset audio includes audio of the target text with duration T; the text encoding matrix is then determined according to the target text, and the audio encoding matrix is determined according to the offset audio.
- the target text can be the sample text "Ping An Technology Co., Ltd." input by the user, with the word order of the input text marked.
- obtaining the offset audio may be determining the offset audio by matching, according to the target text, the audio corresponding to the target text in a speech library. For example, each of the ten characters of the sample text "Ping An Technology Co., Ltd." can be matched with its corresponding audio in the speech library, and these audio segments together form the offset audio.
- the target text is obtained, where the target text includes text of N characters; the offset audio is obtained, where the offset audio includes audio of the target text with duration T; the text encoding matrix is determined according to the target text; and the audio encoding matrix is determined according to the offset audio. The sequence of these four steps is not specifically limited.
- for example, the target text may be acquired first and the text encoding matrix determined according to the target text; then the offset audio may be acquired, and finally the audio encoding matrix determined according to the offset audio.
- Step S302 According to the text encoding matrix and the audio encoding matrix, the audio decoding matrix is determined through the function L(A).
- Figure 3B is a schematic diagram of the framework of an improved attention mechanism-based speech synthesis technology method provided by an embodiment of the present application, including: text encoding module, audio encoding module, attention matrix module, audio decoding module, and Short-time Fourier spectrum module.
- the target text and the offset audio are input to the text encoding module and the audio encoding module respectively, to obtain the corresponding text encoding matrix and audio encoding matrix; after the audio encoding matrix is forcibly aligned with the text encoding matrix by the attention matrix module, the result is input into the audio decoding module to obtain the target audio corresponding to the target text.
- Encoder-Decoder is a very general computing framework.
- the specific model functions used by Encoder and Decoder there is no limitation.
- convolutional neural networks (CNN), recurrent neural networks (RNN), bidirectional long short-term memory recurrent neural networks (BiRNN), gated recurrent units (GRU), long short-term memory networks (LSTM), etc. can all be used as the model functions of the Encoder and Decoder.
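To make the generality of the Encoder-Decoder framework concrete, here is a hypothetical minimal sketch in which trivial stand-ins (an embedding lookup and a linear projection) occupy the Encoder and Decoder slots; any of the networks listed above could replace them without changing the surrounding plumbing. All shapes and values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

class Encoder:
    """Stand-in encoder: a random embedding lookup. A CNN, RNN, GRU,
    or LSTM could fill this slot instead."""
    def __init__(self, vocab, dim):
        self.emb = rng.normal(size=(vocab, dim))
    def __call__(self, ids):
        return self.emb[ids]              # (N, dim) text encoding matrix

class Decoder:
    """Stand-in decoder: one linear projection from context to output."""
    def __init__(self, dim, out_dim):
        self.w = rng.normal(size=(dim, out_dim))
    def __call__(self, context):
        return context @ self.w           # (T, out_dim) decoded features

enc, dec = Encoder(vocab=30, dim=8), Decoder(dim=8, out_dim=4)
text_matrix = enc(np.array([3, 7, 7, 1]))    # N = 4 characters
attention = np.full((5, 4), 0.25)            # T = 5 uniform attention rows
out = dec(attention @ text_matrix)           # attention mixes the encodings
print(out.shape)
```

The attention matrix is the only bridge between the two halves, which is why the framework places no restriction on the models inside each half.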
- the function L(A) is an attention mechanism loss function determined according to A nt and W nt , where A nt is used to transform the text encoding matrix. If the alignment strength of the target text is less than the alignment strength threshold, then W nt changes linearly; the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which the nth character in the target text is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point of the total pronunciation of the target text.
- the preset function L(A) is applied to the attention mechanism matrix A in the attention mechanism module described in FIG. 3B, where A ∈ R^(N×T). Its meaning is to evaluate the correspondence between the nth character and time t; that is, the nth character is related to the tth time frame S (1:F,t). A nt expresses that when the attention mechanism module looks at the nth character at time t, at the subsequent time t+1 it will look at the nth character, the (n+1)th character, or the characters around them, where d is a preset parameter related to the length of the text. That is, L(A) can use the normalized exponential function (Softmax function) to obtain, through the attention mechanism, the weight of the sound feature of the nth character during training, and then normalize after summation.
- Softmax function: the normalized exponential function.
- λ is the alignment strength threshold
- N is the total number of characters in the N characters of the target text
- n is the position of the nth character among the N characters
- T is the time point when the Nth character of the target text is pronounced
- t is the time point when the nth character is pronounced.
- W nt is a piecewise function related to the alignment strength of the target text. If the alignment strength of the target text is less than the alignment strength threshold, W nt decreases as the position in the target text increases.
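The exact formulas for L(A) and W nt are given by expressions not reproduced in this text. As a hedged illustration only, the guided-attention loss of Tachibana et al. (DCTTS) has the same character described above: a weight matrix W that is small near the diagonal n/N ≈ t/T and penalizes attention mass that strays far from it. The width parameter g below is an assumption standing in for the text-length-related parameter d.

```python
import numpy as np

def guided_attention_loss(A, g=0.2):
    """L(A) = mean(A_nt * W_nt). W_nt is near 0 on the diagonal
    n/N ≈ t/T (weak penalty for well-aligned attention) and grows
    toward 1 as the position/time deviation grows."""
    N, T = A.shape
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    W = 1.0 - np.exp(-((n - t) ** 2) / (2 * g ** 2))
    return float((A * W).mean())

# A near-diagonal attention matrix is penalized far less than a flat one,
# which pushes training toward the forced alignment discussed above.
N, T = 10, 40
diag = np.zeros((N, T))
diag[(np.arange(T) * N) // T, np.arange(T)] = 1.0   # n ≈ (N/T) * t
flat = np.full((N, T), 1.0 / N)                     # attention spread evenly
print(guided_attention_loss(diag) < guided_attention_loss(flat))
```

Because the penalty is a simple elementwise product, this loss is cheap to compute, which matches the stated goal of reducing algorithmic difficulty.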
- Step S303 Determine the Mel cepstrum coefficient according to the audio decoding matrix, and determine the target audio according to the Mel cepstrum coefficient.
- determining the Mel cepstrum coefficients according to the audio decoding matrix and determining the target audio according to the Mel cepstrum coefficients is specifically: determining the Mel cepstrum coefficients according to the audio decoding matrix, performing the short-time Fourier transform, and then determining the target audio according to the short-time Fourier spectrum.
- the Mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the non-linear mel scale of sound frequency.
- Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum; they are derived from the cepstrum of an audio segment.
- the difference between the cepstrum and the Mel-frequency cepstrum is that the band division of the Mel-frequency cepstrum is equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum.
- such a non-linear representation can provide a better representation of the sound signal in many fields.
- the determination of the corresponding Mel cepstrum coefficients can be as follows: pre-emphasis, framing, and windowing are applied to the speech of the audio decoding matrix; for each short-term analysis window, the corresponding frequency spectrum is obtained via the fast Fourier transform (Fast Fourier Transformation, FFT) algorithm; the above frequency spectrum is then passed through a Mel filter bank to obtain the Mel spectrum.
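The pipeline just described (pre-emphasis → framing → windowing → FFT → Mel filter bank → log) can be sketched as below. All numeric parameters (the 0.97 pre-emphasis coefficient, frame/hop sizes, 20 filters) are conventional illustrative choices, not values from the application.

```python
import numpy as np

def hz_to_mel(f):  return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m):  return 700 * (10 ** (m / 2595) - 1)

def log_mel_spectrum(signal, sr=8000, frame_len=256, hop=128, n_mels=20):
    # 1. Pre-emphasis boosts high frequencies.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing + windowing.
    win = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    # 3. FFT power spectrum of each analysis window.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 4. Triangular Mel filter bank, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. Log of the Mel-filtered energies; a DCT of each row would
    #    then yield the Mel cepstrum coefficients.
    return np.log(power @ fbank.T + 1e-10)

sig = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)
mels = log_mel_spectrum(sig)
print(mels.shape)   # (n_frames, n_mels)
```

Note the filter bank edges are placed on a linear mel grid, which is exactly the "equally spaced on the mel scale" property contrasted with the normal cepstrum above.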
- the Mel spectrum can then be converted into a spectrogram describing the speech signal, and the spectrogram can represent the target audio. It can be understood that the embodiment of the present invention does not specifically limit the manner of determining the target audio according to the Mel cepstrum coefficients.
- the Mel cepstrum coefficients are further determined according to the audio decoding matrix, and the target audio is determined according to the Mel cepstrum coefficients.
- the function L(A) is the attention mechanism loss function determined according to A nt and W nt , where A nt is used to transform the text encoding matrix.
- if the alignment strength of the target text is less than the alignment strength threshold, W nt changes linearly; further, the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which the nth character in the target text is pronounced. Therefore, when the alignment strength of the target text is less than the alignment strength threshold, the linear change of W nt not only greatly reduces the difficulty of the algorithm in the original attention mechanism, but also ensures that text and speech signals with obvious position deviations are forcibly aligned, which helps the attention mechanism matrix achieve approximate alignment faster, so that the speech synthesis is better organized.
- FIG. 4 is a schematic diagram of another speech synthesis method based on an attention mechanism provided by an embodiment of the present application. It can be applied to the system in FIG. 1 described above; the following description is from the side of the attention-mechanism-based speech synthesis device 101, in conjunction with FIG. 4.
- the method may include the following steps S401 to S404.
- Step S401 Perform function L(A) model training according to the sample text and sample speech, and determine the alignment strength threshold λ of the function L(A).
- the embodiment of the present application can be applied to a speech synthesis scene based on a directed attention mechanism.
- the positions of text and audio signal segments are roughly related. Therefore, when a person speaks a sentence, the position n of the character and the time point t have an approximate linear relationship, that is, n ≈ at, where a ≈ N/T.
- λ is a linearly adjustable alignment strength threshold, used to indicate the preset alignment strength between the position of the nth character and the time point t at which the nth character is pronounced; the value range of λ is λ ∈ (0,1).
- the closer the threshold λ is to zero, the stronger the correspondence between character position and speech time, and the higher the correspondence strength between speech and text.
- Step S402 Determine the text coding matrix and the audio coding matrix according to the target text.
- Step S403 According to the text encoding matrix and the audio encoding matrix, the audio decoding matrix is determined by the function L(A).
- Step S404 Determine the Mel cepstrum coefficient according to the audio decoding matrix, and determine the target audio according to the Mel cepstrum coefficient.
- step S402-step S404 may correspond to the related descriptions of step S301-step S303 in FIG. 3A, and details are not repeated here.
- the function L(A) is the attention mechanism loss function determined according to A nt and W nt , where A nt is used to transform the text encoding matrix.
- if the alignment strength of the target text is less than the alignment strength threshold, then W nt changes linearly; further, the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which the nth character in the target text is pronounced.
- as for the size of the threshold λ: the closer the threshold λ is to zero, the closer the alignment strength of the target text is to the threshold λ, which proves that the correspondence between character position and speech time is stronger, and the correspondence strength between speech and text is higher. Therefore, the speech synthesis technology using the improved attention mechanism model can ensure the forced alignment of text and speech signals with obvious position deviations, while reducing the difficulty of the algorithm and greatly reducing the time for speech synthesis.
- the linear change of W nt not only greatly reduces the difficulty of the algorithm in the original attention mechanism, but also ensures that text and speech signals with obvious position deviations are forcibly aligned, which helps the attention mechanism matrix achieve approximate alignment faster, so that the speech synthesis is better organized.
- this application can determine the audio decoding matrix through the function L(A) according to the text encoding matrix and the audio encoding matrix of the target text, then further determine the Mel cepstrum coefficients according to the audio decoding matrix, and determine the target audio according to the Mel cepstrum coefficients.
- the attention mechanism loss function L(A) changes with the alignment strength of the target text.
- because W nt changes linearly below the alignment strength threshold, adjusting the calculation method of the loss function W nt gives the loss function of the attention mechanism matrix a linearly adjustable λ threshold, which makes the loss function linear. This not only greatly reduces the difficulty of the algorithm in the original attention mechanism, but also removes the need for a large amount of sample data and training time, while ensuring that text and speech signals with obvious position deviations are forcibly aligned.
- the following provides a speech synthesis device based on the attention mechanism related to the embodiments of the application.
- the speech synthesis device based on the attention mechanism can be a convenient service device that rapidly acquires, processes, and analyzes massive and diversified interactive data, extracts value from it, and thereby provides services to third parties.
- FIG. 5 is a schematic structural diagram of a speech synthesis device based on an attention mechanism provided by an embodiment of the present application. It may include a first determining unit 501, a second determining unit 502, and a third determining unit 503, and may also include a fourth determining unit 504.
- the first determining unit 501 is configured to determine a text encoding matrix and an audio encoding matrix according to the target text;
- the second determining unit 502 is configured to determine the audio decoding matrix through the function L(A) according to the text encoding matrix and the audio encoding matrix, where the function L(A) is the attention mechanism loss function determined according to A nt and W nt , and A nt is used to transform the text encoding matrix.
- if the alignment strength of the target text is less than the alignment strength threshold, W nt changes linearly;
- the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which the nth character in the target text is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point of the total pronunciation of the target text;
- the third determining unit 503 is configured to determine the Mel cepstrum coefficient according to the audio decoding matrix, and determine the target audio according to the Mel cepstrum coefficient.
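Taken together, units 501 and 502 form a matrix pipeline from the two encodings to an audio decoding matrix. The sketch below assumes ordinary dot-product attention for A nt; the patent only states that A nt transforms the text encoding matrix, so the scoring function is an assumption:

```python
import numpy as np

def decode(K, Q):
    """Hedged sketch of unit 502's matrix path: an attention matrix A_nt
    compares every audio frame encoding with every character encoding,
    then transforms the text encoding matrix K into an audio decoding
    matrix aligned with the audio time axis (dot-product attention is
    assumed; the patent does not specify the scoring function)."""
    scores = Q @ K.T / np.sqrt(K.shape[1])           # T x N
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                # softmax over characters
    return A @ K                                     # audio decoding matrix, T x d

rng = np.random.default_rng(0)
R = decode(rng.standard_normal((5, 16)), rng.standard_normal((12, 16)))
assert R.shape == (12, 16)
```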
- the first determining unit 501 is specifically configured to: obtain a target text, the target text including N characters; obtain an offset audio, the offset audio being audio of duration T corresponding to the target text; determine the text encoding matrix according to the target text; and determine the audio encoding matrix according to the offset audio.
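A minimal illustration of unit 501's inputs and outputs, with random projections standing in for the learned text and audio encoders (the patent does not fix the encoder architecture, so every weight and dimension here is a placeholder):

```python
import numpy as np

def encode(target_text, offset_mel, d=64):
    """Placeholder for the first determining unit 501: maps a target text
    of N characters to a text encoding matrix (N x d) and an offset audio
    of T frames to an audio encoding matrix (T x d). Random projections
    stand in for the learned encoders."""
    rng = np.random.default_rng(0)
    char_ids = np.array([ord(c) % 256 for c in target_text])
    embed = rng.standard_normal((256, d))
    K = embed[char_ids]                      # text encoding matrix,  N x d
    W_audio = rng.standard_normal((offset_mel.shape[1], d))
    Q = offset_mel @ W_audio                 # audio encoding matrix, T x d
    return K, Q

# 5-character text, 12-frame offset audio with 80 mel channels
K, Q = encode("hello", np.zeros((12, 80)))
assert K.shape == (5, 64) and Q.shape == (12, 64)
```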
- λ is a linearly adjustable alignment intensity threshold, which is used to represent the preset alignment strength between the position of the nth character and the time point t at which the nth character is pronounced, and the value of λ ranges over λ∈(0,1).
- the device further includes a fourth determining unit 504, configured to, before the audio decoding matrix is determined through the function L(A) according to the text encoding matrix and the audio encoding matrix, train the function L(A) model on sample text and sample speech and determine the alignment intensity threshold λ of the function L(A).
- the fourth determining unit 504 is specifically configured to: change λ automatically from 0 to 1 at preset intervals; for each changed value of λ, train the function L(A) on the sample text and the sample speech; and determine, as the alignment intensity threshold λ, the value of λ that yields the shortest speech synthesis time once the alignment strength of the sample text and the sample speech reaches a first threshold.
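The threshold search of unit 504 can be sketched as a simple sweep. `train_fn` is a hypothetical callback standing in for training L(A) on the sample text and sample speech; it is assumed to report the achieved alignment strength and the resulting synthesis time for each candidate λ:

```python
def pick_threshold(train_fn, step=0.1, first_threshold=0.9):
    """Sketch of the fourth determining unit 504: sweep lambda over (0, 1)
    at a preset interval, train for each candidate, and keep the lambda
    whose alignment strength reaches the first threshold with the
    shortest synthesis time. `train_fn(lam)` is a hypothetical callback
    returning (alignment_strength, synthesis_time)."""
    best_lam, best_time = None, float("inf")
    lam = step
    while lam < 1.0:                       # lambda in (0, 1)
        strength, synth_time = train_fn(lam)
        # keep only candidates whose alignment reaches the first threshold
        if strength >= first_threshold and synth_time < best_time:
            best_lam, best_time = lam, synth_time
        lam = round(lam + step, 10)
    return best_lam

# Toy stand-in: alignment succeeds for mid-range lambdas, and synthesis
# is fastest at lambda = 0.3.
fake = lambda lam: (0.95 if 0.2 <= lam <= 0.5 else 0.5, abs(lam - 0.3) + 1.0)
assert pick_threshold(fake) == 0.3
```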
- λ is the alignment strength threshold;
- N is the total number of characters in the target text;
- n is the index of the nth character among the N characters;
- T is the time point at which the Nth character of the target text is pronounced;
- t is the time point at which the nth character is pronounced.
- the third determining unit 503 is specifically configured to determine the Mel cepstrum coefficients according to the audio decoding matrix, perform a short-time Fourier transform on the Mel cepstrum coefficients, and then determine the target audio according to the short-time Fourier spectrum.
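Once a short-time Fourier magnitude spectrum is available, a waveform can be recovered with the classical Griffin-Lim iteration. This is a generic reconstruction sketch rather than the patent's specific procedure, and it starts from a magnitude spectrogram rather than from the Mel cepstral coefficients themselves:

```python
import numpy as np

def griffin_lim(mag, n_fft=256, hop=64, n_iter=30):
    """Recover a time-domain waveform from a magnitude spectrogram by
    iteratively enforcing the magnitudes while re-estimating the phase
    (Griffin-Lim). In the described device this step would follow the
    transformation of the Mel cepstral coefficients back into a
    short-time Fourier spectrum."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    window = np.hanning(n_fft)
    for _ in range(n_iter):
        spec = mag * phase
        # inverse STFT by overlap-add
        frames = np.fft.irfft(spec, n=n_fft, axis=0)
        length = hop * (mag.shape[1] - 1) + n_fft
        signal = np.zeros(length)
        for i in range(mag.shape[1]):
            signal[i * hop:i * hop + n_fft] += frames[:, i] * window
        # forward STFT to re-estimate the phase
        stft = np.stack([np.fft.rfft(signal[i * hop:i * hop + n_fft] * window)
                         for i in range(mag.shape[1])], axis=1)
        phase = np.exp(1j * np.angle(stft))
    return signal

# 129 frequency bins (n_fft // 2 + 1) x 20 frames of random magnitudes
audio = griffin_lim(np.abs(np.random.default_rng(1).standard_normal((129, 20))))
assert audio.shape == (64 * 19 + 256,)
```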
- for ease of understanding and illustration, FIG. 6 is a simplified physical structure diagram of an attention-mechanism-based speech synthesis device provided by an embodiment of the present application.
- the device 60 in FIG. 6 may include one or more of the following components: a storage component 601, a processing component 602, and a communication component 603.
- the storage component 601 may include one or more storage units, and each unit may include one or more memories.
- the storage component can be used to store programs and various data, and can access those programs or data automatically and at high speed during the operation of the device 60.
- a physical device with two stable states can be used to store information, and the two stable states are represented as "0" and "1" respectively.
- the storage component can be used to store target text, target audio, and other related data.
- the processing component 602 may also be called a processor, a processing unit, a processing board, a processing module, a processing device, and so on.
- the processing component may be a central processing unit (CPU), a network processor (NP) or a combination of CPU and NP.
- the processing component 602 is used to call the data of the storage component 601 to execute the methods described above with reference to FIG. 3A to FIG. 4; related descriptions are not repeated here.
- the communication component 603 may also be called a transceiver, a transceiver unit, or the like, and may include a unit for wireless, wired, or other communication methods.
- within component 603, the device implementing the receiving function can be regarded as the receiving unit and the device implementing the sending function as the sending unit; that is, component 603 can receive target text or send target audio.
- the units described as separate parts may or may not be physically separated, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
- the functional components in the various embodiments of the present application may be integrated into one component, or each component may exist alone physically, or two or more components may be integrated into one component.
- the above-mentioned integrated components can be implemented in the form of hardware or software functional units.
- when the integrated component is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the corresponding descriptions of the method embodiments shown in FIG. 3A and FIG. 4.
- the technical solution of this application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
- the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or other media that can store program code.
- the size of the sequence numbers of the above-mentioned processes does not imply their order of execution; the execution order of each process should be determined by its function and internal logic, and shall not constitute any limitation on the implementation processes of the embodiments of the present application.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
A speech synthesis method and device based on an attention mechanism. The method comprises: determining a text encoding matrix and an audio encoding matrix according to a target text (S301); determining an audio decoding matrix according to the text encoding matrix and the audio encoding matrix by means of a function L(A) (S302), the function L(A) being an attention mechanism loss function determined according to A nt and W nt; and determining a Mel-frequency cepstral coefficient according to the audio decoding matrix, and determining a target audio according to the Mel-frequency cepstral coefficient (S303). The method can cause W nt to change linearly when the alignment strength of the target text is below an alignment strength threshold. The method can greatly reduce the algorithmic difficulty of conventional attention mechanisms, and does not require large amounts of sample data or training time in order to force the alignment of text and a clearly offset audio signal.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910149065.5A CN109767752B (zh) | 2019-02-27 | 2019-02-27 | 一种基于注意力机制的语音合成方法及装置 |
CN201910149065.5 | 2019-02-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020173134A1 true WO2020173134A1 (fr) | 2020-09-03 |
Family
ID=66457333
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/117785 WO2020173134A1 (fr) | 2019-02-27 | 2019-11-13 | Procédé et dispositif de synthèse vocale fondée sur un mécanisme d'attention |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109767752B (fr) |
WO (1) | WO2020173134A1 (fr) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112257407A (zh) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | 音频中的文本对齐方法、装置、电子设备及可读存储介质 |
CN113112987A (zh) * | 2021-04-14 | 2021-07-13 | 北京地平线信息技术有限公司 | 语音合成方法、语音合成模型的训练方法及装置 |
CN113539232A (zh) * | 2021-07-10 | 2021-10-22 | 东南大学 | 一种基于慕课语音数据集的语音合成方法 |
CN115410550A (zh) * | 2022-06-02 | 2022-11-29 | 柯登峰 | 一种细粒度韵律可控的情感语音合成方法、系统及存储介质 |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109767752B (zh) * | 2019-02-27 | 2023-05-26 | 平安科技(深圳)有限公司 | 一种基于注意力机制的语音合成方法及装置 |
CN110264991B (zh) * | 2019-05-20 | 2023-12-22 | 平安科技(深圳)有限公司 | 语音合成模型的训练方法、语音合成方法、装置、设备及存储介质 |
CN112133279B (zh) * | 2019-06-06 | 2024-06-21 | Tcl科技集团股份有限公司 | 车载信息播报方法、装置及终端设备 |
US11183201B2 (en) * | 2019-06-10 | 2021-11-23 | John Alexander Angland | System and method for transferring a voice from one body of recordings to other recordings |
CN110264987A (zh) * | 2019-06-18 | 2019-09-20 | 王子豪 | 基于深度学习的和弦进行生成方法 |
CN111508466A (zh) * | 2019-09-12 | 2020-08-07 | 马上消费金融股份有限公司 | 一种文本处理方法、装置、设备及计算机可读存储介质 |
CN110808027B (zh) * | 2019-11-05 | 2020-12-08 | 腾讯科技(深圳)有限公司 | 语音合成方法、装置以及新闻播报方法、系统 |
CN111133506A (zh) * | 2019-12-23 | 2020-05-08 | 深圳市优必选科技股份有限公司 | 语音合成模型的训练方法、装置、计算机设备及存储介质 |
CN111259188B (zh) * | 2020-01-19 | 2023-07-25 | 成都潜在人工智能科技有限公司 | 一种基于seq2seq网络的歌词对齐方法及系统 |
CN113314096A (zh) * | 2020-02-25 | 2021-08-27 | 阿里巴巴集团控股有限公司 | 语音合成方法、装置、设备和存储介质 |
CN111524503B (zh) * | 2020-04-15 | 2023-01-17 | 上海明略人工智能(集团)有限公司 | 音频数据的处理方法、装置、音频识别设备和存储介质 |
CN111862934B (zh) * | 2020-07-24 | 2022-09-27 | 思必驰科技股份有限公司 | 语音合成模型的改进方法和语音合成方法及装置 |
US11798527B2 (en) | 2020-08-19 | 2023-10-24 | Zhejiang Tonghu Ashun Intelligent Technology Co., Ltd. | Systems and methods for synthesizing speech |
CN112466272B (zh) * | 2020-10-23 | 2023-01-17 | 浙江同花顺智能科技有限公司 | 一种语音合成模型的评价方法、装置、设备及存储介质 |
CN112837673B (zh) * | 2020-12-31 | 2024-05-10 | 平安科技(深圳)有限公司 | 基于人工智能的语音合成方法、装置、计算机设备和介质 |
CN112908294B (zh) * | 2021-01-14 | 2024-04-05 | 杭州倒映有声科技有限公司 | 一种语音合成方法以及语音合成系统 |
CN113345413B (zh) * | 2021-06-01 | 2023-12-29 | 平安科技(深圳)有限公司 | 基于音频特征提取的语音合成方法、装置、设备及介质 |
CN113299268A (zh) * | 2021-07-28 | 2021-08-24 | 成都启英泰伦科技有限公司 | 一种基于流生成模型的语音合成方法 |
CN113707127B (zh) * | 2021-08-30 | 2023-12-15 | 中国科学院声学研究所 | 一种基于线性自注意力的语音合成方法及系统 |
CN115691476B (zh) * | 2022-06-06 | 2023-07-04 | 腾讯科技(深圳)有限公司 | 语音识别模型的训练方法、语音识别方法、装置及设备 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101651788A (zh) * | 2008-12-26 | 2010-02-17 | 中国科学院声学研究所 | 一种在线语音文本对齐系统及方法 |
US20180330713A1 (en) * | 2017-05-14 | 2018-11-15 | International Business Machines Corporation | Text-to-Speech Synthesis with Dynamically-Created Virtual Voices |
CN109036371A (zh) * | 2018-07-19 | 2018-12-18 | 北京光年无限科技有限公司 | 用于语音合成的音频数据生成方法及系统 |
CN109767752A (zh) * | 2019-02-27 | 2019-05-17 | 平安科技(深圳)有限公司 | 一种基于注意力机制的语音合成方法及装置 |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4291755B2 (ja) * | 2004-08-13 | 2009-07-08 | 京セラ株式会社 | 携帯端末装置及び音声信号の出力方法 |
JP2008225254A (ja) * | 2007-03-14 | 2008-09-25 | Canon Inc | 音声合成装置及び方法並びにプログラム |
JP6716397B2 (ja) * | 2016-08-31 | 2020-07-01 | 株式会社東芝 | 音声処理装置、音声処理方法およびプログラム |
CN107943405A (zh) * | 2016-10-13 | 2018-04-20 | 广州市动景计算机科技有限公司 | 语音播报装置、方法、浏览器及用户终端 |
-
2019
- 2019-02-27 CN CN201910149065.5A patent/CN109767752B/zh active Active
- 2019-11-13 WO PCT/CN2019/117785 patent/WO2020173134A1/fr active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101651788A (zh) * | 2008-12-26 | 2010-02-17 | 中国科学院声学研究所 | 一种在线语音文本对齐系统及方法 |
US20180330713A1 (en) * | 2017-05-14 | 2018-11-15 | International Business Machines Corporation | Text-to-Speech Synthesis with Dynamically-Created Virtual Voices |
CN109036371A (zh) * | 2018-07-19 | 2018-12-18 | 北京光年无限科技有限公司 | 用于语音合成的音频数据生成方法及系统 |
CN109767752A (zh) * | 2019-02-27 | 2019-05-17 | 平安科技(深圳)有限公司 | 一种基于注意力机制的语音合成方法及装置 |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112257407A (zh) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | 音频中的文本对齐方法、装置、电子设备及可读存储介质 |
CN112257407B (zh) * | 2020-10-20 | 2024-05-14 | 网易(杭州)网络有限公司 | 音频中的文本对齐方法、装置、电子设备及可读存储介质 |
CN113112987A (zh) * | 2021-04-14 | 2021-07-13 | 北京地平线信息技术有限公司 | 语音合成方法、语音合成模型的训练方法及装置 |
CN113112987B (zh) * | 2021-04-14 | 2024-05-03 | 北京地平线信息技术有限公司 | 语音合成方法、语音合成模型的训练方法及装置 |
CN113539232A (zh) * | 2021-07-10 | 2021-10-22 | 东南大学 | 一种基于慕课语音数据集的语音合成方法 |
CN113539232B (zh) * | 2021-07-10 | 2024-05-14 | 东南大学 | 一种基于慕课语音数据集的语音合成方法 |
CN115410550A (zh) * | 2022-06-02 | 2022-11-29 | 柯登峰 | 一种细粒度韵律可控的情感语音合成方法、系统及存储介质 |
CN115410550B (zh) * | 2022-06-02 | 2024-03-26 | 北京听见科技有限公司 | 一种细粒度韵律可控的情感语音合成方法、系统及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN109767752A (zh) | 2019-05-17 |
CN109767752B (zh) | 2023-05-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020173134A1 (fr) | Procédé et dispositif de synthèse vocale fondée sur un mécanisme d'attention | |
CN110111775B (zh) | 一种流式语音识别方法、装置、设备及存储介质 | |
WO2020215666A1 (fr) | Procédé et appareil de synthèse de la parole, dispositif informatique et support de stockage | |
CN104732977B (zh) | 一种在线口语发音质量评价方法和系统 | |
CN110246488B (zh) | 半优化CycleGAN模型的语音转换方法及装置 | |
WO2016150257A1 (fr) | Programme de synthèse vocale | |
US12027165B2 (en) | Computer program, server, terminal, and speech signal processing method | |
CN110600013B (zh) | 非平行语料声音转换数据增强模型训练方法及装置 | |
WO2020098269A1 (fr) | Procédé de synthèse de la parole et dispositif de synthèse de la parole | |
US20220383876A1 (en) | Method of converting speech, electronic device, and readable storage medium | |
US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium | |
Chaudhary et al. | Feature extraction methods for speaker recognition: A review | |
CN112185363B (zh) | 音频处理方法及装置 | |
CN114023300A (zh) | 一种基于扩散概率模型的中文语音合成方法 | |
CN111192659A (zh) | 用于抑郁检测的预训练方法和抑郁检测方法及装置 | |
WO2023142454A1 (fr) | Procédés de traduction vocale et d'entraînement de modèle, appareil, dispositif électronique et support de stockage | |
CN112017690B (zh) | 一种音频处理方法、装置、设备和介质 | |
Mian Qaisar | Isolated speech recognition and its transformation in visual signs | |
CN114255740A (zh) | 语音识别方法、装置、计算机设备和存储介质 | |
Priyadarshani et al. | Dynamic time warping based speech recognition for isolated Sinhala words | |
CN113963679A (zh) | 一种语音风格迁移方法、装置、电子设备及存储介质 | |
Anees | Speech coding techniques and challenges: A comprehensive literature survey | |
US20230368777A1 (en) | Method And Apparatus For Processing Audio, Electronic Device And Storage Medium | |
Shankarappa et al. | A faster approach for direct speech to speech translation | |
CN116434736A (zh) | 语音识别方法、交互方法、系统和设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19917374 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19917374 Country of ref document: EP Kind code of ref document: A1 |