WO2020173134A1 - Method and device for speech synthesis based on an attention mechanism - Google Patents

Method and device for speech synthesis based on an attention mechanism

Info

Publication number
WO2020173134A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
text
matrix
target text
target
Prior art date
Application number
PCT/CN2019/117785
Other languages
English (en)
Chinese (zh)
Inventor
房树明
程宁
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020173134A1 publication Critical patent/WO2020173134A1/fr

Links

Images

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01L MEASURING FORCE, STRESS, TORQUE, WORK, MECHANICAL POWER, MECHANICAL EFFICIENCY, OR FLUID PRESSURE
    • G01L 13/00 Devices or apparatus for measuring differences of two or more fluid pressure values
    • G01L 13/02 Devices or apparatus for measuring differences of two or more fluid pressure values using elastically-deformable members or pistons as sensing elements
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • This application relates to the technical field of speech synthesis, and in particular to a speech synthesis method and device based on an attention mechanism.
  • Speech synthesis is a technology that generates artificial speech through mechanical and electronic methods; it is also called Text to Speech (TTS). TTS technology converts text information generated by a computer or input from an external source into intelligible, fluent spoken Chinese.
  • The original algorithm in existing attention-based speech synthesis technology can give the attention mechanism the property of forcibly aligning text and speech signals that have obvious position deviations, but its complexity is too high, and it usually requires a large amount of sample data and training time to achieve the desired effect. How to reduce the difficulty of the algorithm while still ensuring forced alignment of text and speech signals with obvious position deviations is therefore a problem to be solved at present.
  • In view of this, the present application is proposed in order to provide an attention-mechanism-based speech synthesis method and device that overcome, or at least partially solve, the above-mentioned problems.
  • An embodiment of the present application provides a speech synthesis method based on an attention mechanism, which may include: determining a text encoding matrix and an audio encoding matrix according to a target text;
  • determining an audio decoding matrix from the text encoding matrix and the audio encoding matrix through a function L(A), where L(A) is an attention-mechanism loss function determined according to A_nt and W_nt, and A_nt is used to transform the text encoding matrix; if the alignment strength of the target text is less than an alignment strength threshold, W_nt changes linearly; the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which that nth character is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point at which pronunciation of the whole target text ends;
  • and determining Mel cepstrum coefficients according to the audio decoding matrix, and determining the target audio according to the Mel cepstrum coefficients.
  • An embodiment of the present application provides a speech synthesis device based on an attention mechanism, which may include: a first determining unit configured to determine a text encoding matrix and an audio encoding matrix according to a target text;
  • a second determining unit configured to determine an audio decoding matrix through the function L(A) according to the text encoding matrix and the audio encoding matrix, where L(A) is the attention-mechanism loss function determined according to A_nt and W_nt, and A_nt is used to transform the text encoding matrix; if the alignment strength of the target text is less than the alignment strength threshold, W_nt changes linearly; the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which that nth character is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point at which pronunciation of the whole target text ends;
  • and a third determining unit configured to determine Mel cepstrum coefficients according to the audio decoding matrix, and determine the target audio according to the Mel cepstrum coefficients.
  • An embodiment of the application provides a computer-readable storage medium that stores program instructions; when the program instructions are executed by a processor, the processor executes any of the above-mentioned attention-mechanism-based speech synthesis methods.
  • An embodiment of the application provides a speech synthesis device based on an attention mechanism, including a storage component, a processing component and a communication component that are connected to one another.
  • The storage component is used to store data processing code; the communication component is used for information interaction with external devices; and the processing component is configured to call the program code and execute any of the attention-mechanism-based speech synthesis methods described above, which will not be repeated here.
  • An embodiment of the application provides a speech synthesis method based on the attention mechanism: the audio decoding matrix is determined by the function L(A) according to the text encoding matrix and the audio encoding matrix of the target text; Mel cepstrum coefficients can then be further determined according to the audio decoding matrix, and the target audio is determined according to the Mel cepstrum coefficients.
  • The function L(A) is the attention-mechanism loss function determined according to A_nt and W_nt, where A_nt is used to transform the text encoding matrix. If the alignment strength of the target text is less than the alignment strength threshold, W_nt changes linearly; further, the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which that character is pronounced. Making W_nt change linearly when the alignment strength is below the threshold not only greatly reduces the difficulty of the algorithm in the original attention mechanism and removes the need for large amounts of sample data and training time, but also ensures that text and speech signals with obvious position deviations are forcibly aligned, which helps the attention matrix reach approximate alignment faster and makes the synthesized speech better organized.
  • FIG. 1 is a schematic diagram of a speech synthesis system architecture based on an attention mechanism provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a terminal interface when synthesizing speech provided by an embodiment of the present application
  • FIG. 3A is a schematic diagram of the process of a speech synthesis method based on an attention mechanism provided by an embodiment of the present application
  • FIG. 3B is a schematic diagram of the framework of a speech synthesis method based on an improved attention mechanism, provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of another method for speech synthesis based on attention mechanism provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a speech synthesis device based on an attention mechanism provided by an embodiment of the present application
  • FIG. 6 is a schematic diagram of a physical device structure of a simplified speech synthesis device based on an attention mechanism provided by an embodiment of the present application.
  • The term "server" used in this application is used to denote a computer-related entity: hardware, firmware, a combination of hardware and software, software, or software in execution.
  • the server may be, but is not limited to, a processor, a data processing platform, a computing device, a computer, two or more computers, etc.
  • Speech synthesis takes a text as input and outputs speech corresponding to that text; it is also a technology that generates artificial speech through mechanical and electronic methods.
  • TTS technology also known as text-to-speech technology
  • The Attention Mechanism is derived from the study of human vision. In cognitive science, because of bottlenecks in information processing, humans selectively focus on part of the available information while ignoring the rest; this is usually referred to as the attention mechanism. Different parts of the human retina have different information processing capabilities (acuity), and only the fovea has the strongest acuity. To make rational use of limited visual processing resources, humans select a specific part of the visual field and focus on it; for example, when reading, usually only the few words currently being read receive attention and processing. In summary, the attention mechanism has two main aspects: deciding which part of the input to attend to, and allocating limited information-processing resources to the important parts.
  • The short-time Fourier transform is a variant of the Fourier transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time.
  • STFT short-time Fourier transform
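  • For illustration, a short SciPy sketch of an STFT; the sampling rate and window parameters are arbitrary example values, not taken from the patent:

```python
import numpy as np
from scipy.signal import stft

sr = 16000                                   # example sampling rate
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)           # 1 s of a 440 Hz tone

# frequencies (Hz), segment times (s), and complex STFT coefficients
f, seg_t, Z = stft(wave, fs=sr, nperseg=512, noverlap=384)
print(Z.shape)                               # (257, number of frames): 257 frequency bins per frame
```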
  • FIG. 1 is a schematic diagram of an attention mechanism-based speech synthesis system architecture provided by an embodiment of the present application, including: an attention mechanism-based speech synthesis device 101 and a terminal device 102.
  • the speech synthesis apparatus 101 based on the attention mechanism may be a server, where the server may be, but is not limited to, a processor, a data processing platform, a computing device, a computer, two or more computers, etc.
  • The server is a service device that rapidly acquires, processes, and analyzes massive and diversified data, extracts valuable information from it, and, based on interactive data, provides convenient services to third parties.
  • The speech synthesis device 101 based on the attention mechanism can determine the text encoding matrix and the audio encoding matrix according to the target text; according to the text encoding matrix and the audio encoding matrix, the audio decoding matrix is determined by the function L(A), where L(A) is the attention-mechanism loss function determined according to A_nt and W_nt, and A_nt is used to transform the text encoding matrix.
  • If the alignment strength of the target text is less than the alignment strength threshold, W_nt changes linearly;
  • the alignment strength is determined by the position of the nth character in the target text and the time point t at which that nth character is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point at which pronunciation of the whole target text ends;
  • the Mel cepstrum coefficients are then determined according to the audio decoding matrix, and the target audio is determined according to the Mel cepstrum coefficients.
  • The terminal device 102 may be a device at the periphery of a computer network, such as a communication terminal, a portable terminal, a mobile device, a user terminal, a mobile terminal, a wireless communication device, a user agent, a user device, a service device, or user equipment (User Equipment, UE).
  • Such a device is mainly used for data input and for outputting or displaying processing results; the terminal may also be a software client or application installed or running on any of the above-mentioned devices.
  • The client can be a smartphone, computer, or tablet device used by the target user or the current rental user, or a software client or application installed or running on a smartphone, computer, or tablet device. Please refer to FIG. 2: when the terminal device 102 is a computer, it can be used to send the target text to the attention-mechanism-based speech synthesis device 101, and to receive and play the target audio sent back by the speech synthesis device 101.
  • the speech synthesis device 101 based on the attention mechanism can simultaneously receive different target texts sent by multiple different terminal devices 102.
  • FIG. 3A is a schematic diagram of a process of a speech synthesis method based on an attention mechanism provided by an embodiment of the present application. It can be applied to the system in FIG. 1 described above.
  • The following describes the method from the side of the attention-mechanism-based speech synthesis device 101, in conjunction with FIG. 3A.
  • the method may include the following steps S301-S303.
  • Step S301 Determine the text coding matrix and the audio coding matrix according to the target text.
  • Specifically, the target text may be obtained, where the target text includes text of N characters; offset audio is obtained, where the offset audio includes audio of the target text with a duration of T; the text encoding matrix is then determined according to the target text, and the audio encoding matrix is determined according to the offset audio.
  • For example, the target text can be the sample text "Ping An Technology Co., Ltd." input by the user, with the word order of the input text marked.
  • Obtaining the offset audio may mean matching, in a speech library, the audio corresponding to the target text. For example, for the sample text "Ping An Technology Co., Ltd.", the audio corresponding to each of its ten characters can be matched and concatenated to form the offset audio.
  • It should be noted that the order of the four steps of obtaining the target text (which includes text of N characters), obtaining the offset audio (which includes the audio of the target text with duration T), determining the text encoding matrix according to the target text, and determining the audio encoding matrix according to the offset audio is not specifically limited.
  • For example, the target text may be acquired first and the text encoding matrix determined according to it, and then the offset audio may be acquired and the audio encoding matrix determined according to it.
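  • A rough sketch of step S301, assuming a toy character-embedding text encoder and a Mel-spectrogram audio encoder; the patent leaves the concrete encoders to the Encoder modules, so the vocabulary, embedding size and Mel settings below are illustrative only:

```python
import numpy as np
import librosa

def text_encoding_matrix(text, dim=64, seed=0):
    """Toy text 'encoding': one row per character (N x dim), here a random
    embedding lookup standing in for the text encoding module."""
    vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(len(vocab), dim))
    return table[[vocab[ch] for ch in text]]          # (N, dim)

def audio_encoding_matrix(wave, sr=22050, n_mels=80, hop=256):
    """Toy audio 'encoding': one row per frame (T x n_mels), a log-Mel spectrogram
    standing in for the audio encoding module."""
    mel = librosa.feature.melspectrogram(y=wave, sr=sr, n_mels=n_mels, hop_length=hop)
    return np.log(mel + 1e-10).T                      # (T, n_mels)

text_mat = text_encoding_matrix("Ping An Technology Co Ltd")
audio_mat = audio_encoding_matrix(np.random.randn(22050) * 0.01)
print(text_mat.shape, audio_mat.shape)                # (N, 64) and (T, 80)
```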
  • Step S302 According to the text encoding matrix and the audio encoding matrix, the audio decoding matrix is determined through the function L(A).
  • Figure 3B is a schematic diagram of the framework of an improved attention mechanism-based speech synthesis technology method provided by an embodiment of the present application, including: text encoding module, audio encoding module, attention matrix module, audio decoding module, and Short-time Fourier spectrum module.
  • The target text and the offset audio are input into the text encoding module and the audio encoding module respectively, to obtain the corresponding text encoding matrix and audio encoding matrix.
  • The audio encoding matrix and the text encoding matrix are forcibly aligned by the attention matrix module, and the aligned matrix is then input into the audio decoding module (followed by the short-time Fourier spectrum module) to obtain the target audio corresponding to the target text.
  • Encoder-Decoder is a very general computing framework.
  • There is no limitation on the specific model functions used by the Encoder and the Decoder.
  • A convolutional neural network (CNN), a recurrent neural network (RNN), a bidirectional long short-term memory recurrent neural network (BiRNN), a gated recurrent unit (GRU), a long short-term memory network (LSTM), and so on can all be used as the model functions of the Encoder and the Decoder.
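  • As an illustration only (this is not the specific architecture disclosed in the patent), a minimal PyTorch sketch of an encoder-decoder with an attention matrix over N characters and T audio frames, assuming GRU encoders and arbitrary example dimensions:

```python
import torch
import torch.nn as nn

class MinimalAttentionTTS(nn.Module):
    """Illustrative encoder-decoder with an attention matrix.
    All hyper-parameters (vocab_size, dims) are placeholders, not from the patent."""
    def __init__(self, vocab_size=100, n_mels=80, dim=128):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim)
        self.text_encoder = nn.GRU(dim, dim, batch_first=True)      # -> text encoding matrix
        self.audio_encoder = nn.GRU(n_mels, dim, batch_first=True)  # -> audio encoding matrix
        self.audio_decoder = nn.GRU(2 * dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, chars, mels):
        K, _ = self.text_encoder(self.char_embed(chars))   # (B, N, dim)
        Q, _ = self.audio_encoder(mels)                    # (B, T, dim)
        # Attention matrix: for each audio frame t, a distribution over the N characters.
        A = torch.softmax(Q @ K.transpose(1, 2) / K.size(-1) ** 0.5, dim=-1)  # (B, T, N)
        context = A @ K                                     # (B, T, dim) attended text
        dec, _ = self.audio_decoder(torch.cat([context, Q], dim=-1))
        return self.to_mel(dec), A                          # predicted mel frames, attention matrix

# usage sketch
model = MinimalAttentionTTS()
chars = torch.randint(0, 100, (1, 12))      # 12 characters
mels = torch.randn(1, 50, 80)               # 50 audio frames of 80 mel bins
mel_out, A = model(chars, mels)
print(mel_out.shape, A.shape)               # torch.Size([1, 50, 80]) torch.Size([1, 50, 12])
```

Here the attention matrix is computed per batch item with shape (T, N); the patent's A ∈ R^{N×T} is simply its transpose.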
  • The function L(A) is an attention-mechanism loss function determined according to A_nt and W_nt, where A_nt is used to transform the text encoding matrix. If the alignment strength of the target text is less than the alignment strength threshold, W_nt changes linearly; the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which the nth character in the target text is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point at which pronunciation of the whole target text ends.
  • The preset function L(A) is applied to the attention matrix A in the attention matrix module described in Figure 3B, where A ∈ R^{N×T}; its meaning is to evaluate the correspondence between the nth character and time t, that is, how strongly the nth character is related to the tth time frame S_{1:F,t} (the F frequency bins of frame t). A_nt expresses that if the attention matrix module attends to the nth character at time t, then at the subsequent time t+1 it will attend to the nth character, the (n+1)th character, or the characters around them, where d is a preset parameter related to the length of the text. That is, during training L(A) can use the standard normalized exponential function (Softmax function) to obtain, through the attention mechanism, the weight of the sound feature of the nth character, and then normalize after summation.
  • Softmax function normalized exponential function
  • where the alignment strength threshold is a linearly adjustable preset value with a value range of (0, 1);
  • N is the total number of characters in the target text;
  • n is the index of the nth character among the N characters;
  • T is the time point at which the Nth (last) character of the target text is pronounced;
  • t is the time point at which the nth character is pronounced.
  • W_nt is a piecewise function related to the alignment strength of the target text: if the alignment strength of the target text is less than the alignment strength threshold, W_nt decreases as the target text's alignment strength increases.
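  • Purely as a hedged illustration of this family of losses (in the spirit of guided-attention losses, with an assumed piecewise W_nt; not the formula published in the patent):

```python
import numpy as np

def attention_loss(A, threshold=0.8):
    """Illustrative loss L(A); the piecewise W_nt below is an assumption for
    illustration, NOT the patent's published formula.

    A: attention matrix of shape (N, T); A[n, t] = weight of the nth character at frame t.
    threshold: stand-in for the alignment strength threshold, 0 < threshold < 1.
    """
    N, T = A.shape
    dev = np.abs(np.arange(N)[:, None] / N - np.arange(T)[None, :] / T)  # distance from the diagonal
    strength = 1.0 - dev                         # assumed "alignment strength" of the pair (n, t)
    # below the threshold, W_nt is linear in the alignment strength (decreasing as it grows);
    # at or above the threshold, a smooth guided-attention-style penalty is used instead
    W = np.where(strength < threshold,
                 1.0 - strength,
                 1.0 - np.exp(-dev ** 2 / (2 * (1.0 - threshold) ** 2)))
    return float(np.mean(A * W))                 # small when attention mass stays near the diagonal

# usage sketch: a near-diagonal attention matrix is penalized less than a uniform one
N, T = 10, 50
grid = np.arange(N)[:, None] / N - np.arange(T)[None, :] / T
diag = np.exp(-grid ** 2 / 0.005)
diag /= diag.sum(axis=0, keepdims=True)          # each frame's weights sum to 1
print(attention_loss(diag), attention_loss(np.full((N, T), 1.0 / N)))
```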
  • Step S303 Determine the Mel cepstrum coefficient according to the audio decoding matrix, and determine the target audio according to the Mel cepstrum coefficient.
  • Determining the Mel cepstrum coefficients according to the audio decoding matrix and determining the target audio according to the Mel cepstrum coefficients specifically means: determining the Mel cepstrum coefficients according to the audio decoding matrix, performing a short-time Fourier transform on them, and then determining the target audio according to the resulting short-time Fourier spectrum.
  • Mel-Frequency Cepstrum is a linear transformation of the logarithmic energy spectrum based on the non-linear mel scale of sound frequency.
  • Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum; they are derived from the cepstrum of an audio fragment.
  • The difference between the cepstrum and the Mel-frequency cepstrum is that the band division of the Mel-frequency cepstrum is equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum.
  • Such a non-linear representation allows the sound signal to be represented better in multiple fields.
  • The corresponding Mel cepstrum coefficients can be determined by pre-emphasizing, framing, and windowing the speech represented by the audio decoding matrix; for each short-time analysis window, the fast Fourier transform (FFT) is used to obtain the corresponding frequency spectrum, and that spectrum is then passed through a Mel filter bank to obtain the Mel spectrum.
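  • A minimal numpy sketch of the pre-emphasis, framing, windowing, FFT and Mel filter bank chain described above; the frame sizes, pre-emphasis coefficient and the use of librosa's filter bank are illustrative assumptions rather than values taken from the patent:

```python
import numpy as np
import librosa  # used only for the Mel filter bank; an illustrative choice

def mel_spectrum(wave, sr=22050, n_fft=1024, hop=256, n_mels=80, preemph=0.97):
    """Pre-emphasis -> framing -> Hamming window -> FFT -> Mel filter bank."""
    wave = np.append(wave[0], wave[1:] - preemph * wave[:-1])           # pre-emphasis
    n_frames = 1 + max(0, (len(wave) - n_fft) // hop)
    frames = np.stack([wave[i * hop: i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)                                  # windowing
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2                 # power spectrum per frame
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)      # (n_mels, n_fft//2 + 1)
    return np.log(spectrum @ mel_fb.T + 1e-10)                           # log-Mel spectrum, (T, n_mels)

# usage sketch on one second of low-level noise
mel = mel_spectrum(np.random.randn(22050) * 0.01)
print(mel.shape)   # roughly (83, 80)
```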
  • The spectrogram is a sound spectrum diagram describing the speech signal, and it can represent the target audio. It can be understood that the embodiment of the present application does not specifically limit the manner of determining the target audio according to the Mel cepstrum coefficients.
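  • As one common, purely illustrative choice (the patent does not specify the reconstruction method), a waveform can be recovered from the log-Mel spectrum above by an approximate filter-bank inversion followed by Griffin-Lim:

```python
import numpy as np
import librosa

def spectrum_to_audio(log_mel, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Invert the Mel filter bank approximately, then run Griffin-Lim on the
    linear magnitude spectrogram. (librosa.feature.inverse.mel_to_audio offers
    a one-call equivalent.)"""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    power = np.maximum(1e-10, np.linalg.pinv(mel_fb) @ np.exp(log_mel).T)   # (n_fft//2 + 1, T)
    return librosa.griffinlim(np.sqrt(power), n_iter=32, hop_length=hop, win_length=n_fft)
```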
  • In the embodiment of the present application, the audio decoding matrix is determined through the function L(A) according to the text encoding matrix and the audio encoding matrix of the target text; the Mel cepstrum coefficients are then further determined according to the audio decoding matrix, and the target audio is determined according to the Mel cepstrum coefficients.
  • The function L(A) is the attention-mechanism loss function determined according to A_nt and W_nt, where A_nt is used to transform the text encoding matrix. If the alignment strength of the target text is less than the alignment strength threshold, W_nt changes linearly; further, the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which that character is pronounced. Therefore, when the alignment strength of the target text is less than the alignment strength threshold, the linear change of W_nt not only greatly reduces the difficulty of the algorithm in the original attention mechanism, but also ensures that text and speech signals with obvious position deviations are forcibly aligned, which helps the attention matrix reach approximate alignment faster and makes the synthesized speech better organized.
  • FIG. 4 is a schematic diagram of another speech synthesis method based on an attention mechanism provided by an embodiment of the present application. It can be applied to the system in FIG. 1 described above; the following describes it from the side of the attention-mechanism-based speech synthesis device 101, in conjunction with FIG. 4.
  • The method may include the following steps S401 to S404.
  • Step S401: Perform model training of the function L(A) according to sample text and sample speech, and determine the alignment strength threshold of the function L(A).
  • the embodiment of the present application can be applied to a speech synthesis scene based on a directed attention mechanism.
  • The positions of text characters and audio signal segments are roughly correlated; therefore, when a person speaks a sentence, the position n of a character and the time point t of its pronunciation have an approximately linear relationship, that is, n ≈ a·t, where a ≈ N/T.
  • The alignment strength threshold is a linearly adjustable value used to indicate the preset alignment strength between the position of the nth character and the time point t at which the nth character is pronounced; its value range is (0, 1).
  • The closer the threshold is to zero, the stronger the correspondence it enforces between character position and speech time, and the higher the correspondence between speech and text.
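  • A tiny sketch of this near-linear relation and of comparing an alignment strength against the threshold; the specific definition of alignment strength used here (one minus the normalized deviation from the diagonal) is an assumption for illustration only:

```python
import numpy as np

N, T = 10, 5.0                      # 10 characters spoken over 5 seconds
a = N / T                           # slope of the approximate linear relation n ~ a * t
threshold = 0.8                     # stand-in for the alignment strength threshold in (0, 1)

for n, t in [(3, 1.5), (3, 4.5)]:
    strength = 1.0 - abs(n / N - t / T)          # assumed definition, for illustration only
    regime = "W_nt changes linearly" if strength < threshold else "W_nt follows the other branch"
    print(f"character {n} at t={t}s: expected position ~{a * t:.1f}, strength {strength:.2f} -> {regime}")
```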
  • Step S402 Determine the text coding matrix and the audio coding matrix according to the target text.
  • Step S403 According to the text encoding matrix and the audio encoding matrix, the audio decoding matrix is determined by the function L(A).
  • Step S404 Determine the Mel cepstrum coefficient according to the audio decoding matrix, and determine the target audio according to the Mel cepstrum coefficient.
  • step S402-step S404 may correspond to the related descriptions of step S301-step S303 in FIG. 3A, and details are not repeated here.
  • The function L(A) is the attention-mechanism loss function determined according to A_nt and W_nt, where A_nt is used to transform the text encoding matrix.
  • If the alignment strength of the target text is less than the alignment strength threshold, W_nt changes linearly; further, the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which that character is pronounced.
  • By adjusting the size of the threshold, it is found that the closer the threshold is to zero and the closer the alignment strength of the target text is to the threshold, the stronger the correspondence between character position and speech time, and the higher the correspondence between speech and text. Therefore, speech synthesis based on the improved attention mechanism model can ensure forced alignment of text and speech signals with obvious position deviations while reducing the difficulty of the algorithm and greatly reducing the time needed for speech synthesis.
  • When the alignment strength of the target text is less than the alignment strength threshold, the linear change of W_nt not only greatly reduces the difficulty of the algorithm in the original attention mechanism, but also ensures that text and speech signals with obvious position deviations are forcibly aligned, which helps the attention matrix reach approximate alignment faster and makes the synthesized speech better organized.
  • In summary, this application determines the audio decoding matrix through the function L(A) according to the text encoding matrix and the audio encoding matrix of the target text, then further determines the Mel cepstrum coefficients according to the audio decoding matrix, and determines the target audio according to the Mel cepstrum coefficients.
  • The attention-mechanism loss function L(A) changes with the alignment strength of the target text.
  • When the alignment strength is below the alignment strength threshold, W_nt changes linearly; by adjusting the way the loss term W_nt is calculated, the loss function of the attention matrix is given a linearly adjustable threshold, which makes the loss function linear in this regime. This not only greatly reduces the difficulty of the algorithm in the original attention mechanism, but also removes the need for large amounts of sample data and training time while ensuring that text and speech signals with obvious position deviations are forcibly aligned.
  • The following describes a speech synthesis device based on the attention mechanism according to an embodiment of the application.
  • The attention-mechanism-based speech synthesis device can be a service device that rapidly acquires, processes, and analyzes massive and diversified data, extracts valuable information from it, and, based on interactive data, provides various convenient services to third parties.
  • FIG. 5 is a schematic structural diagram of a speech synthesis device based on an attention mechanism provided by an embodiment of the present application. It may include a first determining unit 501, a second determining unit 502, and a third determining unit 503, and may also include a fourth determining unit 504.
  • the first determining unit 501 is configured to determine a text encoding matrix and an audio encoding matrix according to the target text;
  • The second determining unit 502 is configured to determine the audio decoding matrix through the function L(A) according to the text encoding matrix and the audio encoding matrix, where L(A) is the attention-mechanism loss function determined according to A_nt and W_nt, and A_nt is used to transform the text encoding matrix.
  • If the alignment strength of the target text is less than the alignment strength threshold, W_nt changes linearly;
  • the alignment strength of the target text is determined by the position of the nth character in the target text and the time point t at which that nth character is pronounced, where n is greater than 0 and less than or equal to the number of characters in the target text, and t is greater than 0 and less than or equal to the time point at which pronunciation of the whole target text ends;
  • the third determining unit 503 is configured to determine the Mel cepstrum coefficient according to the audio decoding matrix, and determine the target audio according to the Mel cepstrum coefficient.
  • The first determining unit 501 is specifically configured to: obtain the target text, where the target text includes text of N characters; obtain the offset audio, where the offset audio includes the audio of the target text with a duration of T; determine the text encoding matrix according to the target text; and determine the audio encoding matrix according to the offset audio.
  • The alignment strength threshold is a linearly adjustable value used to represent the preset alignment strength between the position of the nth character and the time point t at which the nth character is pronounced; its value range is (0, 1).
  • The device further includes a fourth determining unit 504, configured to, before the audio decoding matrix is determined through the function L(A) according to the text encoding matrix and the audio encoding matrix, perform model training of the function L(A) according to sample text and sample speech, and determine the alignment strength threshold of the function L(A).
  • The fourth determining unit 504 is specifically configured to: vary the threshold automatically from 0 to 1 at preset intervals; for each changed value, train the function L(A) based on the sample text and the sample speech; and determine, among the values for which the alignment strength of the sample text and the sample speech reaches a first threshold, the value that yields the shortest speech synthesis time as the alignment strength threshold.
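  • A schematic sketch of this sweep; train_fn and eval_fn are hypothetical stand-ins, since the patent does not publish the training and evaluation routines:

```python
import numpy as np

def choose_alignment_threshold(train_fn, eval_fn, step=0.1, min_strength=0.9):
    """Sweep candidate thresholds in (0, 1) at a preset interval and keep the one
    that reaches the required alignment strength in the shortest synthesis time.

    train_fn(threshold) -> trained model                        (hypothetical helper)
    eval_fn(model)      -> (alignment_strength, synthesis_time) (hypothetical helper)
    """
    best, best_time = None, np.inf
    for threshold in np.arange(step, 1.0, step):
        model = train_fn(threshold)
        strength, synth_time = eval_fn(model)
        if strength >= min_strength and synth_time < best_time:
            best, best_time = threshold, synth_time
    return best

# usage sketch with dummy stand-ins for training and evaluation
best = choose_alignment_threshold(
    train_fn=lambda th: th,                              # pretend the "model" is just th
    eval_fn=lambda th: (0.95, abs(th - 0.3) + 1.0),      # fake strength/time curve
)
print(best)   # about 0.3 under these dummy functions
```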
  • where the alignment strength threshold is a linearly adjustable preset value with a value range of (0, 1);
  • N is the total number of characters in the target text;
  • n is the index of the nth character among the N characters;
  • T is the time point at which the Nth (last) character of the target text is pronounced;
  • t is the time point at which the nth character is pronounced.
  • The third determining unit 503 is specifically configured to determine the Mel cepstrum coefficients according to the audio decoding matrix, perform a short-time Fourier transform on the Mel cepstrum coefficients, and then determine the target audio according to the short-time Fourier spectrum.
  • FIG. 6 is a simplified physical device structure diagram of an attention mechanism-based speech synthesis device provided by an embodiment of the present application, which is easy to understand and easy to illustrate.
  • The device 60 in FIG. 6 may include one or more of the following components: a storage component 601, a processing component 602, and a communication component 603.
  • the storage component 601 may include one or more storage units, and each unit may include one or more memories.
  • The storage component can be used to store programs and various data, and can provide high-speed, automatic access to those programs or data during the operation of the device 60.
  • a physical device with two stable states can be used to store information, and the two stable states are represented as "0" and "1" respectively.
  • the storage component can be used to store target text, target audio, and other related data.
  • the processing component 602 may also be called a processor, a processing unit, a processing board, a processing module, a processing device, and so on.
  • the processing component may be a central processing unit (CPU), a network processor (NP) or a combination of CPU and NP.
  • CPU central processing unit
  • NP network processor
  • The processing component 602 is used to call the data of the storage component 601 to execute the methods described above in FIG. 3A to FIG. 4; related descriptions are not repeated here.
  • The communication component 603 may also be called a transceiver or transceiver unit, etc., and may include units for wireless, wired, or other communication methods.
  • The part of the communication component 603 that implements the receiving function can be regarded as a receiving unit, and the part that implements the sending function as a sending unit; that is, the communication component 603 can receive the target text and send the target audio.
  • The units described as separate parts may or may not be physically separated, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units.
  • Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional components in the various embodiments of the present application may be integrated into one component, or each component may exist alone physically, or two or more components may be integrated into one component.
  • the above-mentioned integrated components can be implemented in the form of hardware or software functional units.
  • the integrated component is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • an embodiment of the present application provides a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement FIGS. 3A and 4 Corresponding description of the method embodiment shown.
  • The technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program code .
  • The size of the sequence numbers of the above-mentioned processes does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation process of the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and device for speech synthesis based on an attention mechanism. The method comprises: determining, according to a target text, a text encoding matrix and an audio encoding matrix (S301); determining an audio decoding matrix according to the text encoding matrix and the audio encoding matrix by means of a function L(A) (S302), the function L(A) being a loss function of an attention mechanism determined according to A_nt and W_nt; and determining a Mel-frequency cepstral coefficient according to the audio decoding matrix, and determining a target audio according to the Mel-frequency cepstral coefficient (S303). The method can cause W_nt to change linearly when the alignment strength of a target text is less than an alignment strength threshold. The invention can considerably reduce the algorithmic difficulty of conventional attention mechanisms, and does not require large amounts of sample data and training time to perform forced alignment of text and an obviously offset audio signal.
PCT/CN2019/117785 2019-02-27 2019-11-13 Method and device for speech synthesis based on an attention mechanism WO2020173134A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910149065.5A CN109767752B (zh) 2019-02-27 2019-02-27 一种基于注意力机制的语音合成方法及装置
CN201910149065.5 2019-02-27

Publications (1)

Publication Number Publication Date
WO2020173134A1 true WO2020173134A1 (fr) 2020-09-03

Family

ID=66457333

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117785 WO2020173134A1 (fr) 2019-02-27 2019-11-13 Procédé et dispositif de synthèse vocale fondée sur un mécanisme d'attention

Country Status (2)

Country Link
CN (1) CN109767752B (fr)
WO (1) WO2020173134A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257407A (zh) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 音频中的文本对齐方法、装置、电子设备及可读存储介质
CN113112987A (zh) * 2021-04-14 2021-07-13 北京地平线信息技术有限公司 语音合成方法、语音合成模型的训练方法及装置
CN113539232A (zh) * 2021-07-10 2021-10-22 东南大学 一种基于慕课语音数据集的语音合成方法
CN115410550A (zh) * 2022-06-02 2022-11-29 柯登峰 一种细粒度韵律可控的情感语音合成方法、系统及存储介质

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767752B (zh) * 2019-02-27 2023-05-26 平安科技(深圳)有限公司 一种基于注意力机制的语音合成方法及装置
CN110264991B (zh) * 2019-05-20 2023-12-22 平安科技(深圳)有限公司 语音合成模型的训练方法、语音合成方法、装置、设备及存储介质
CN112133279B (zh) * 2019-06-06 2024-06-21 Tcl科技集团股份有限公司 车载信息播报方法、装置及终端设备
US11183201B2 (en) * 2019-06-10 2021-11-23 John Alexander Angland System and method for transferring a voice from one body of recordings to other recordings
CN110264987A (zh) * 2019-06-18 2019-09-20 王子豪 基于深度学习的和弦进行生成方法
CN111508466A (zh) * 2019-09-12 2020-08-07 马上消费金融股份有限公司 一种文本处理方法、装置、设备及计算机可读存储介质
CN110808027B (zh) * 2019-11-05 2020-12-08 腾讯科技(深圳)有限公司 语音合成方法、装置以及新闻播报方法、系统
CN111133506A (zh) * 2019-12-23 2020-05-08 深圳市优必选科技股份有限公司 语音合成模型的训练方法、装置、计算机设备及存储介质
CN111259188B (zh) * 2020-01-19 2023-07-25 成都潜在人工智能科技有限公司 一种基于seq2seq网络的歌词对齐方法及系统
CN113314096A (zh) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 语音合成方法、装置、设备和存储介质
CN111524503B (zh) * 2020-04-15 2023-01-17 上海明略人工智能(集团)有限公司 音频数据的处理方法、装置、音频识别设备和存储介质
CN111862934B (zh) * 2020-07-24 2022-09-27 思必驰科技股份有限公司 语音合成模型的改进方法和语音合成方法及装置
US11798527B2 (en) 2020-08-19 2023-10-24 Zhejiang Tonghu Ashun Intelligent Technology Co., Ltd. Systems and methods for synthesizing speech
CN112466272B (zh) * 2020-10-23 2023-01-17 浙江同花顺智能科技有限公司 一种语音合成模型的评价方法、装置、设备及存储介质
CN112837673B (zh) * 2020-12-31 2024-05-10 平安科技(深圳)有限公司 基于人工智能的语音合成方法、装置、计算机设备和介质
CN112908294B (zh) * 2021-01-14 2024-04-05 杭州倒映有声科技有限公司 一种语音合成方法以及语音合成系统
CN113345413B (zh) * 2021-06-01 2023-12-29 平安科技(深圳)有限公司 基于音频特征提取的语音合成方法、装置、设备及介质
CN113299268A (zh) * 2021-07-28 2021-08-24 成都启英泰伦科技有限公司 一种基于流生成模型的语音合成方法
CN113707127B (zh) * 2021-08-30 2023-12-15 中国科学院声学研究所 一种基于线性自注意力的语音合成方法及系统
CN115691476B (zh) * 2022-06-06 2023-07-04 腾讯科技(深圳)有限公司 语音识别模型的训练方法、语音识别方法、装置及设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101651788A (zh) * 2008-12-26 2010-02-17 中国科学院声学研究所 一种在线语音文本对齐系统及方法
US20180330713A1 (en) * 2017-05-14 2018-11-15 International Business Machines Corporation Text-to-Speech Synthesis with Dynamically-Created Virtual Voices
CN109036371A (zh) * 2018-07-19 2018-12-18 北京光年无限科技有限公司 用于语音合成的音频数据生成方法及系统
CN109767752A (zh) * 2019-02-27 2019-05-17 平安科技(深圳)有限公司 一种基于注意力机制的语音合成方法及装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4291755B2 (ja) * 2004-08-13 2009-07-08 京セラ株式会社 携帯端末装置及び音声信号の出力方法
JP2008225254A (ja) * 2007-03-14 2008-09-25 Canon Inc 音声合成装置及び方法並びにプログラム
JP6716397B2 (ja) * 2016-08-31 2020-07-01 株式会社東芝 音声処理装置、音声処理方法およびプログラム
CN107943405A (zh) * 2016-10-13 2018-04-20 广州市动景计算机科技有限公司 语音播报装置、方法、浏览器及用户终端

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101651788A (zh) * 2008-12-26 2010-02-17 中国科学院声学研究所 一种在线语音文本对齐系统及方法
US20180330713A1 (en) * 2017-05-14 2018-11-15 International Business Machines Corporation Text-to-Speech Synthesis with Dynamically-Created Virtual Voices
CN109036371A (zh) * 2018-07-19 2018-12-18 北京光年无限科技有限公司 用于语音合成的音频数据生成方法及系统
CN109767752A (zh) * 2019-02-27 2019-05-17 平安科技(深圳)有限公司 一种基于注意力机制的语音合成方法及装置

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257407A (zh) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 音频中的文本对齐方法、装置、电子设备及可读存储介质
CN112257407B (zh) * 2020-10-20 2024-05-14 网易(杭州)网络有限公司 音频中的文本对齐方法、装置、电子设备及可读存储介质
CN113112987A (zh) * 2021-04-14 2021-07-13 北京地平线信息技术有限公司 语音合成方法、语音合成模型的训练方法及装置
CN113112987B (zh) * 2021-04-14 2024-05-03 北京地平线信息技术有限公司 语音合成方法、语音合成模型的训练方法及装置
CN113539232A (zh) * 2021-07-10 2021-10-22 东南大学 一种基于慕课语音数据集的语音合成方法
CN113539232B (zh) * 2021-07-10 2024-05-14 东南大学 一种基于慕课语音数据集的语音合成方法
CN115410550A (zh) * 2022-06-02 2022-11-29 柯登峰 一种细粒度韵律可控的情感语音合成方法、系统及存储介质
CN115410550B (zh) * 2022-06-02 2024-03-26 北京听见科技有限公司 一种细粒度韵律可控的情感语音合成方法、系统及存储介质

Also Published As

Publication number Publication date
CN109767752A (zh) 2019-05-17
CN109767752B (zh) 2023-05-26

Similar Documents

Publication Publication Date Title
WO2020173134A1 (fr) Method and device for speech synthesis based on an attention mechanism
CN110111775B (zh) 一种流式语音识别方法、装置、设备及存储介质
WO2020215666A1 (fr) Procédé et appareil de synthèse de la parole, dispositif informatique et support de stockage
CN104732977B (zh) 一种在线口语发音质量评价方法和系统
CN110246488B (zh) 半优化CycleGAN模型的语音转换方法及装置
WO2016150257A1 (fr) Programme de synthèse vocale
US12027165B2 (en) Computer program, server, terminal, and speech signal processing method
CN110600013B (zh) 非平行语料声音转换数据增强模型训练方法及装置
WO2020098269A1 (fr) Procédé de synthèse de la parole et dispositif de synthèse de la parole
US20220383876A1 (en) Method of converting speech, electronic device, and readable storage medium
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
Chaudhary et al. Feature extraction methods for speaker recognition: A review
CN112185363B (zh) 音频处理方法及装置
CN114023300A (zh) 一种基于扩散概率模型的中文语音合成方法
CN111192659A (zh) 用于抑郁检测的预训练方法和抑郁检测方法及装置
WO2023142454A1 (fr) Procédés de traduction vocale et d'entraînement de modèle, appareil, dispositif électronique et support de stockage
CN112017690B (zh) 一种音频处理方法、装置、设备和介质
Mian Qaisar Isolated speech recognition and its transformation in visual signs
CN114255740A (zh) 语音识别方法、装置、计算机设备和存储介质
Priyadarshani et al. Dynamic time warping based speech recognition for isolated Sinhala words
CN113963679A (zh) 一种语音风格迁移方法、装置、电子设备及存储介质
Anees Speech coding techniques and challenges: A comprehensive literature survey
US20230368777A1 (en) Method And Apparatus For Processing Audio, Electronic Device And Storage Medium
Shankarappa et al. A faster approach for direct speech to speech translation
CN116434736A (zh) 语音识别方法、交互方法、系统和设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19917374

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19917374

Country of ref document: EP

Kind code of ref document: A1