WO2021227707A1 - Audio synthesis method and apparatus, computer-readable medium, and electronic device - Google Patents

音频合成方法、装置、计算机可读介质及电子设备 (Audio synthesis method and apparatus, computer-readable medium, and electronic device)

Info

Publication number
WO2021227707A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
mixed
text
text information
feature
Prior art date
Application number
PCT/CN2021/085862
Other languages
English (en)
French (fr)
Inventor
林诗伦
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2021227707A1
Priority to US17/703,136 (US12106746B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/263 Language identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/086 Detection of language

Definitions

  • This application relates to the field of artificial intelligence technology, specifically to audio synthesis technology.
  • Speech synthesis technology, also called text-to-speech (TTS), converts text generated by the computer itself or supplied as external input into fluent speech that the user can understand, and plays it back.
  • the purpose of this application is to provide an audio synthesis method, an audio synthesis device, a computer-readable medium, and an electronic device, which can solve the technical problem of timbre differences due to the presence of different language types in the synthesized audio to a certain extent.
  • According to one aspect of the embodiments of this application, an audio synthesis method is provided, executed by an electronic device, and the method includes:
  • acquiring mixed-language text information, where the mixed-language text information includes text characters corresponding to at least two language types; performing text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information; acquiring a target timbre feature corresponding to a target timbre subject, and decoding the intermediate semantic encoding feature based on the target timbre feature to obtain an acoustic feature; and performing acoustic encoding processing on the acoustic feature to obtain audio corresponding to the mixed-language text information.
  • an audio information synthesis device which includes:
  • An information acquisition module for acquiring mixed-language text information, where the mixed-language text information includes text characters corresponding to at least two language types;
  • An information encoding module configured to perform text encoding processing on the mixed-language text information based on the at least two language types to obtain the intermediate semantic encoding features of the mixed-language text information;
  • An information decoding module configured to obtain a target timbre feature corresponding to the target timbre subject, and decode the intermediate semantic coding feature based on the target timbre feature to obtain an acoustic feature;
  • the acoustic coding module is configured to perform acoustic coding processing on the acoustic features to obtain audio corresponding to the mixed language text information.
  • According to one aspect of the embodiments of this application, an electronic device is provided, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the executable instructions to perform the audio synthesis method in the above technical solution.
  • a computer program product including instructions, which when run on a computer, cause the computer to execute the audio synthesis method in the above technical solution.
  • In the technical solution provided by the embodiments of this application, the mixed-language text information is encoded by encoders corresponding to multiple language types, and the encoded features are then decoded by a decoder that incorporates the timbre feature of the target timbre subject, so as to generate audio information corresponding to a single timbre and multiple language types. This solves the timbre jump problem caused by language differences in existing mixed-language audio synthesis technology, and mixed-language audio that is natural, smooth, and uniform in timbre can be output stably.
  • the embodiments of this application can be deployed in the cloud to provide general audio synthesis services for various devices, and can also customize exclusive tones according to different application requirements.
  • Because monolingual audio databases of different target timbre subjects can be used, mixed synthesis of audio in multiple language types can be realized, which greatly reduces the cost of training data collection.
  • the embodiment of the present application can be compatible with the recorded monolingual audio database, so that the usable tone colors are more abundant.
  • Figure 1 shows an exemplary system architecture diagram of the technical solution of the present application in an application scenario
  • Figure 2 shows an exemplary system architecture and a customized audio synthesis service process of the technical solution of the present application in another application scenario
  • Figure 3 shows a flow chart of the steps of the audio synthesis method provided by an embodiment of the present application
  • FIG. 4 shows a flowchart of the method steps for performing encoding processing through a multi-channel encoder in an embodiment of the present application
  • FIG. 5 shows a flowchart of a method for encoding processing based on an attention mechanism (Attention) in an embodiment of the present application
  • Fig. 6 shows a schematic diagram of the principle of audio information synthesis for Chinese-English mixed text based on an embodiment of the present application
  • Figure 7 shows a block diagram of the audio synthesis device provided by an embodiment of the present application.
  • Fig. 8 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
  • This application has a wide range of application scenarios. The audio synthesis solution for mixing multiple language types can be configured as a cloud service and, as a basic technology, empower users of that cloud service; the solution can also be used for personalized scenarios in vertical fields. For example, it can be applied to scenarios such as intelligent reading in reading apps, intelligent customer service, news broadcasting, and smart device interaction, to realize intelligent audio synthesis in various scenarios.
  • Fig. 1 shows an exemplary system architecture diagram of the technical solution of the present application in an application scenario.
  • the system architecture 100 may include a client 110, a network 120 and a server 130.
  • the client 110 can be carried on various terminal devices such as smart phones, smart robots, smart speakers, tablet computers, notebook computers, and desktop computers.
  • the server 130 may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server that provides cloud services.
  • the network 120 may be a communication medium of various connection types that can provide a communication link between the client 110 and the server 130, for example, a wired communication link, a wireless communication link, and so on.
  • the technical solutions provided in the embodiments of the present application may be applied to the client 110, may also be applied to the server 130, or may be implemented by the client 110 and the server 130 in collaboration, which is not specifically limited in this application.
  • various smart devices such as smart robots and smart phones can access the mixed-language audio synthesis service provided by the cloud server through the wireless network, such as the Chinese-English mixed speech synthesis service.
  • the client 110 sends the Chinese-English mixed text to be synthesized to the server 130 via the network 120.
  • After the server 130 performs speech synthesis, the corresponding synthesized audio can be sent to the client 110 in streaming form or returned as a whole sentence.
  • a complete speech synthesis process may include, for example:
  • the client 110 uploads the Chinese-English mixed text to be synthesized to the server 130, and the server 130 performs corresponding regularization processing after receiving the Chinese-English mixed text;
  • the server 130 inputs the normalized text information into the Chinese-English mixed speech synthesis system, and quickly synthesizes the audio corresponding to the text information through the Chinese-English mixed speech synthesis system, and completes post-processing operations such as audio compression;
  • the server 130 returns the audio to the client 110 by streaming or whole sentence return, and the client 110 can play the audio smoothly and naturally after receiving the audio.
  • In the above speech synthesis process, the speech synthesis service provided by the server 130 has a small delay, and the client 110 can essentially obtain the returned result immediately. Users can hear the required content in a short time, freeing their eyes, and the interaction is natural and convenient.
  • Fig. 2 shows an exemplary system architecture and a customized audio synthesis service process of the technical solution of the present application in another application scenario.
  • the system architecture and process can be applied to vertical fields such as novel reading, news broadcasting, etc., which require customized voice synthesis services.
  • the process of implementing customized audio synthesis services under this system architecture may include:
  • the front-end demander 210 submits a list of timbre requirements of the speech synthesis service required by its product, such as the gender of the speaker, the timbre type, and other requirements.
  • After receiving the timbre requirement list, the back-end server 220 collects corresponding audio according to the required timbre, builds an audio database, and trains the corresponding audio synthesis model 230.
  • the back-end server 220 uses the audio synthesis model 230 to synthesize the sample. After the sample is delivered to the front-end demander 210 for verification and confirmation, the customized audio synthesis model 230 can be deployed online;
  • the application of the front-end demander 210 (such as a reading app or news client) sends the text for which audio needs to be synthesized to the audio synthesis model 230 deployed on the back-end server 220; users of the front-end demander 210 can then hear, in the application, the text content read aloud with the corresponding customized timbre.
  • The specific audio synthesis process is the same as the online synthesis service used in the system architecture shown in Figure 1.
  • In this application scenario, the back-end server 220 only needs to collect a speaker audio database of one language type (such as Chinese) that meets the demand, combine it with an existing audio database of another language type (such as English) from other speakers, perform customized training of the language-mixable audio synthesis model 230, and finally carry out mixed-language audio synthesis with the timbre required by the front-end demander 210. This greatly reduces the cost of customized audio synthesis services.
  • Fig. 3 shows a flow chart of the steps of the audio synthesis method provided by an embodiment of the present application.
  • The execution subject of the audio synthesis method is an electronic device.
  • the electronic device can be various terminal devices such as smart phones and smart speakers that carry clients, or various server devices such as physical servers and cloud servers that serve as servers.
  • the audio synthesis method mainly includes steps S310 to S340:
  • Step S310 Obtain mixed-language text information, where the mixed-language text information includes text characters corresponding to at least two language types.
  • The mixed-language text information is composed of any number of text characters, and these text characters correspond to at least two different language types.
  • the mixed language text information may be a text composed of a mixture of Chinese characters and English characters.
  • the mixed-language text information input by the user through the input device can be obtained by real-time reception, or the mixed-language text information can be extracted sentence by sentence or paragraph by paragraph from a file containing text information by collecting item by item.
  • In addition, this step can also perform speech recognition on speech information input by the user that contains two or more different language types, and obtain mixed-language text information including at least two language types based on the speech recognition result. For example, this step can use a pre-trained speech recognition model to perform speech recognition processing on the received speech information containing at least two language types to obtain the corresponding mixed-language text information, and then perform audio synthesis on the mixed-language text information through the subsequent steps, so as to achieve an overall timbre conversion effect and realize timbre-uniform voice changing for one or more speakers.
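  • For illustration only (this sketch is not part of the patent text), the snippet below shows one simple way such mixed-language text could be split into characters tagged with a language type, using Unicode ranges; the function name and the "zh"/"en" labels are assumptions made for this example.

```python
def tag_language(text: str) -> list:
    """Label each character of a mixed Chinese-English string with a language tag."""
    tagged = []
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:        # CJK Unified Ideographs -> treat as Chinese
            tagged.append((ch, "zh"))
        elif ch.isascii() and ch.isalpha():     # Latin letters -> treat as English
            tagged.append((ch, "en"))
        else:                                   # digits, punctuation, whitespace
            tagged.append((ch, "other"))
    return tagged

print(tag_language("今天的meeting改到3点"))
# [('今', 'zh'), ('天', 'zh'), ('的', 'zh'), ('m', 'en'), ..., ('点', 'zh')]
```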
  • Step S320 Perform text encoding processing on the mixed language text information based on at least two language types to obtain intermediate semantic encoding features of the mixed language text information.
  • a pre-trained encoder can be used to perform text encoding processing on the mixed language text information to obtain intermediate semantic encoding features related to the natural semantics of the mixed language text information.
  • the number and type of encoders can correspond to the language types included in the mixed-language text information.
  • For example, if the mixed-language text information contains both Chinese characters and English characters, this step can use Chinese and English two-path encoders to perform text encoding processing on the mixed-language text information to obtain the intermediate semantic encoding feature.
  • In subsequent steps, the intermediate semantic encoding feature can be decoded by a decoder corresponding to the encoders, finally forming natural language in audio form that the user can understand.
  • The encoder can be a model trained on various types of neural networks, such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory network (LSTM), or a Gated Recurrent Unit (GRU).
  • A CNN is a feed-forward neural network whose neurons respond to units within a receptive field; a CNN is usually composed of multiple convolutional layers and a fully connected layer at the top, and it reduces the number of model parameters through parameter sharing, which has led to its wide use in image and speech recognition.
  • RNN is a type of recursive neural network that takes sequence data as input, recursively in the evolution direction of the sequence, and all nodes (cyclic units) are connected in a chain.
  • An LSTM is a recurrent neural network that adds to the algorithm a unit for judging whether information is useful.
  • Input, forget, and output gates are placed in this unit; after information enters the LSTM, it is judged according to the rules, and only information that passes the algorithm's check is retained, while non-conforming information is forgotten through the forget gate. LSTMs are suitable for processing and predicting important events with relatively long intervals and delays in a time series.
  • A GRU is also a kind of recurrent neural network.
  • Like the LSTM, the GRU was proposed to solve long-term memory and gradient problems in back-propagation; compared with the LSTM, the GRU has one less internal gate and fewer parameters, and in most cases it can achieve an effect comparable to the LSTM while effectively reducing computation time.
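  • To connect the encoder description above to a concrete building block, the following is a minimal sketch of a GRU-based character encoder in PyTorch; the class name, layer sizes, and the choice of a bidirectional GRU are illustrative assumptions rather than the architecture specified by the patent.

```python
import torch
import torch.nn as nn

class GRUTextEncoder(nn.Module):
    """Minimal GRU text encoder: character IDs -> one encoding vector per character."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, embed_dim)   # character embedding matrix
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.char_embed(char_ids)        # (batch, seq_len, embed_dim)
        encodings, _ = self.gru(x)           # (batch, seq_len, 2 * hidden_dim)
        return encodings

encoder = GRUTextEncoder(vocab_size=6000)
print(encoder(torch.randint(0, 6000, (1, 12))).shape)   # torch.Size([1, 12, 512])
```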
  • Step S330 Obtain the target timbre feature corresponding to the target timbre subject, and decode the intermediate semantic coding feature based on the target timbre feature to obtain the acoustic feature.
  • the target timbre subject is a subject object used to determine the characteristics of the synthesized audio timbre, and the subject object may be a speaker corresponding to an audio database storing voice samples.
  • the target timbre subject can be a real physical object, for example, it can be a real character such as an anchor or a voice actor with obvious timbre characteristics; in addition, the target timbre subject can also be a virtual object synthesized by computer simulation, for example, It is a virtual character such as Hatsune Miku and Luo Tianyi generated by the speech synthesis software VOCALOID.
  • This step can obtain in advance the timbre characteristics required by the user, such as a male voice or an emotionally expressive voice, and then select a target timbre subject that meets these timbre characteristics. For a determined target timbre subject, a target timbre feature that can reflect and identify its timbre characteristics can be obtained by means of feature extraction or mapping.
  • a pre-trained decoder can be used to decode the intermediate semantic coding feature obtained in step S320 to obtain the corresponding acoustic feature.
  • Acoustic features may be, for example, spectrograms or other forms of features with timbre characteristics and sound content.
  • A frequency spectrum is the representation of a time-domain signal in the frequency domain and can be obtained by applying a Fourier transform to the sound signal; the result is two plots, with amplitude and phase respectively on the vertical axis and frequency on the horizontal axis. In speech synthesis applications, the phase information is usually omitted and only the amplitude information at each frequency is retained.
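  • As a hedged sketch of this decoding step (an illustration, not the patent's exact decoder), the code below conditions a simple GRU decoder on a speaker (target timbre) embedding and emits mel-spectrogram frames; for brevity it produces one frame per input step, whereas a real attention-based decoder would generate a longer frame sequence.

```python
import torch
import torch.nn as nn

class TimbreConditionedDecoder(nn.Module):
    """Decode intermediate semantic encodings into mel frames, conditioned on a speaker embedding."""

    def __init__(self, enc_dim: int = 512, spk_dim: int = 64, n_mels: int = 80, n_speakers: int = 4):
        super().__init__()
        self.speaker_embed = nn.Embedding(n_speakers, spk_dim)   # one target timbre feature per timbre subject
        self.gru = nn.GRU(enc_dim + spk_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, semantic_enc: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # semantic_enc: (batch, seq_len, enc_dim); speaker_id: (batch,)
        spk = self.speaker_embed(speaker_id).unsqueeze(1)         # (batch, 1, spk_dim)
        spk = spk.expand(-1, semantic_enc.size(1), -1)            # broadcast over the sequence
        hidden, _ = self.gru(torch.cat([semantic_enc, spk], dim=-1))
        return self.to_mel(hidden)                                # (batch, seq_len, n_mels) acoustic feature

decoder = TimbreConditionedDecoder()
mel = decoder(torch.randn(1, 12, 512), torch.tensor([2]))
print(mel.shape)   # torch.Size([1, 12, 80])
```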
  • Step S340 Perform acoustic encoding processing on the acoustic features to obtain audio corresponding to the mixed language text information.
  • the acoustic feature can be input to a vocoder, and the acoustic feature can be converted by the vocoder to form audio that can be output and played through audio output devices such as speakers.
  • the vocoder is derived from the abbreviation of Voice Encoder, and is also called a speech signal analysis and synthesis system. The function of the vocoder is to convert acoustic features into sound.
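  • The patent does not name a particular vocoder; purely as a stand-in, the sketch below turns a mel spectrogram back into a waveform with librosa's Griffin-Lim based inversion, one conventional non-neural way of converting acoustic features to sound (the sample rate and mel settings are assumptions).

```python
import numpy as np
import librosa
import soundfile as sf

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)                  # stand-in waveform so the demo is self-contained

# In the pipeline, `mel` would come from the decoder; here it is derived from the stand-in audio.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

# Invert: mel -> approximate linear spectrogram -> waveform (Griffin-Lim phase estimation).
waveform = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
sf.write("reconstructed.wav", waveform, sr)
```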
  • In the embodiments of this application, the mixed-language text information is encoded by the encoders corresponding to the multiple language types, and the encoded features are then decoded by the decoder that incorporates the timbre feature of the target timbre subject, so as to generate audio information corresponding to a single timbre and multiple language types. This solves the timbre jump problem caused by language differences in existing mixed-language audio synthesis technology, and mixed-language audio that is natural, smooth, and uniform in timbre can be output stably.
  • the embodiments of this application can be deployed in the cloud to provide general audio synthesis services for various devices, and can also customize exclusive tones according to different application requirements.
  • Because monolingual audio databases of different target timbre subjects can be used, mixed synthesis of audio in multiple language types can be realized, which greatly reduces the cost of training data collection.
  • the embodiment of the present application can be compatible with the recorded monolingual audio database, so that the usable tone colors are more abundant.
  • Fig. 4 shows a flow chart of the steps of a method for encoding processing by a multi-channel encoder in an embodiment of the present application.
  • As shown in Fig. 4, on the basis of the above embodiment, step S320 (performing encoding processing on the mixed-language text information to obtain the intermediate semantic encoding feature of the mixed-language text information) may include the following steps S410 to S430:
  • Step S410 Perform text encoding processing on the mixed-language text information through the respective monolingual text encoders corresponding to each language type to obtain at least two monolingual encoding features of the mixed-language text information.
  • the mixed-language text information can be mapped and transformed in advance to form vector features that can be recognized by the encoder.
  • the mapping transformation method may be, for example, by performing mapping transformation processing on the mixed language text information through the character embedding matrix corresponding to each language type, respectively, to obtain at least two embedded character features of the mixed language text information.
  • the number and types of character embedding matrices can be one-to-one corresponding to the language types.
  • For example, if the mixed-language text information contains both Chinese characters and English characters, this step can map and transform the mixed-language text information through the character embedding matrix corresponding to the Chinese characters to obtain the embedded character feature corresponding to the Chinese characters, and at the same time map and transform the mixed-language text information through the character embedding matrix corresponding to the English characters to obtain the embedded character feature corresponding to the English characters.
  • the mixed language text information can be linearly mapped first, and then the activation function or other methods can be used to perform nonlinear transformation on the linear mapping result to obtain the corresponding embedded character features.
  • For example, suppose the mixed-language text information contains both Chinese characters and English characters.
  • After the embedded character feature corresponding to the Chinese characters is obtained, it can be encoded by the monolingual text encoder corresponding to the Chinese language to obtain the monolingual encoding feature corresponding to the Chinese language; likewise, after the embedded character feature corresponding to the English characters is obtained, it can be encoded by the monolingual text encoder corresponding to the English language to obtain the monolingual encoding feature corresponding to the English language.
  • the monolingual text encoder used in the embodiments of the present application may be an encoder with a residual network structure.
  • the characteristic of the residual network is that it is easy to optimize, and the accuracy can be improved by adding a considerable depth.
  • the embedded character features can be separately encoded with the corresponding monolingual text encoders for each language type to obtain at least two residual encoding features of the mixed language text information; then the embedded character features can be compared with Each residual coding feature is fused to obtain at least two monolingual coding features of the mixed-language text information.
  • the residual coding feature is the difference between the input data and the output data of the encoder.
  • the single language coding feature can be obtained by fusing the residual coding feature with the input embedded character feature.
  • The fusion here can be a direct addition of the residual coding feature and the embedded character feature.
  • The encoding approach of the residual network structure is more sensitive to changes in the encoded output data; during training, changes in the encoded output data have a greater effect on the adjustment of the network weights, so a better training effect can be obtained.
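  • Pulling the pieces of step S410 together, the following hedged sketch shows two monolingual encoder paths, each with its own character embedding matrix and a residual connection that adds the encoder output back onto the embedded characters; all class names, sizes, and the use of a GRU as the encoder body are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MonolingualResEncoder(nn.Module):
    """One language path: character embedding matrix + encoder with a residual connection."""

    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim)   # per-language character embedding matrix
        self.encoder = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.char_embed(char_ids)              # embedded character features
        residual, _ = self.encoder(embedded)              # residual encoding features
        return embedded + residual                        # fusion: monolingual encoding features

# Two paths over the same mixed-language character sequence (e.g. a Chinese path and an English path).
zh_path = MonolingualResEncoder(vocab_size=6000)
en_path = MonolingualResEncoder(vocab_size=6000)          # a shared index space is assumed for simplicity
char_ids = torch.randint(0, 6000, (1, 12))
zh_feat, en_feat = zh_path(char_ids), en_path(char_ids)
print(zh_feat.shape, en_feat.shape)                       # torch.Size([1, 12, 256]) each
```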
  • Step S420 Perform fusion processing on at least two monolingual coding features to obtain mixed-language coding features of the mixed-language text information.
  • the mixed-language coding feature of the mixed-language text information can be obtained according to the single-language coding features outputted by each single-language text encoder by means of fusion processing. For example, for two monolingual coding features, vector calculations can be performed on them, for example, the mixed-language coding feature can be obtained by direct addition. In addition, the two monolingual coding features can also be spliced, and then the features obtained by the splicing process can be mapped through a fully connected layer or other network structure to obtain a mixed-language coding feature. The embodiment of the application does not make any special limitation on this.
  • In some embodiments of this application, within the monolingual text encoders corresponding to the different language types, based on the residual network structure, each residual coding feature can be fused with the embedded character feature to obtain the monolingual coding features, and the monolingual coding features can then be fused to obtain the mixed-language coding feature of the mixed-language text information.
  • In other embodiments, within the monolingual encoders corresponding to the different language types, based on the residual network structure, only residual encoding of each embedded character feature is performed to obtain the residual coding features,
  • that is, the residual coding features are used directly as the monolingual coding features output by the monolingual text encoders, and each monolingual coding feature is then fused with the embedded character features to obtain the mixed-language coding feature of the mixed-language text information.
  • Compared with the previous implementation, this approach saves one fusion operation, so the computational efficiency can be improved and the computational cost reduced.
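  • The two fusion options mentioned above, direct addition and concatenation followed by a fully connected projection, might look like the following; this is an illustrative assumption about how step S420 could be realized, not a form mandated by the patent.

```python
import torch
import torch.nn as nn

def fuse_by_addition(zh_feat: torch.Tensor, en_feat: torch.Tensor) -> torch.Tensor:
    """Direct element-wise addition of two monolingual encoding features."""
    return zh_feat + en_feat

class FuseByConcat(nn.Module):
    """Concatenate the monolingual features and project back with a fully connected layer."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, zh_feat: torch.Tensor, en_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([zh_feat, en_feat], dim=-1))

zh_feat, en_feat = torch.randn(1, 12, 256), torch.randn(1, 12, 256)
mixed = fuse_by_addition(zh_feat, en_feat)                # or: FuseByConcat()(zh_feat, en_feat)
print(mixed.shape)                                        # torch.Size([1, 12, 256])
```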
  • Step S430 Determine the intermediate semantic coding feature of the mixed language text information according to the mixed language coding feature.
  • the mixed-language coding feature can be directly determined as the intermediate semantic coding feature of the mixed-language text information, or the intermediate semantic coding feature can be obtained by transforming the mixed-language coding feature through a preset function.
  • the identification information of the language type may be embedded in the mixed language text information to obtain the intermediate semantic coding feature of the mixed language text information.
  • For example, the mixed-language text information can be mapped and transformed through a language embedding matrix based on the at least two language types to obtain the embedded language feature of the mixed-language text information; the mixed-language coding feature and the embedded language feature are then fused to obtain the intermediate semantic coding feature of the mixed-language text information.
  • The mapping transformation performed on the mixed-language text information through the language embedding matrix can linearly map the mixed-language text information according to matrix parameters preset in the language embedding matrix, and then apply a non-linear transformation to the linear mapping result using an activation function or other means, so as to obtain the corresponding embedded language feature.
  • For example, if the mixed-language text information is a character sequence with a certain number of characters, the embedded language feature obtained after mapping and transforming it can be a feature vector with the same sequence length as the character sequence, where each element of the feature vector corresponds to the language type of the corresponding character in the character sequence.
  • the fusion processing of the mixed language coding feature and the embedded language feature can be a vector calculation of the two, such as obtaining the intermediate semantic coding feature of the mixed language text information by a direct addition method.
  • the mixed language coding feature and the embedded language feature can also be spliced, and then the splicing processing result can be mapped through the fully connected layer or other network structure to obtain the intermediate semantic coding feature of the mixed language text information.
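  • As a hedged illustration of the language embedding described above: each character's language label is looked up in a language embedding matrix and the result is fused with the mixed-language encoding feature; the lookup-table form, the dimensions, and the addition-based fusion are assumptions for this sketch.

```python
import torch
import torch.nn as nn

N_LANGS, DIM = 2, 256                                    # e.g. 0 = Chinese, 1 = English
language_embed = nn.Embedding(N_LANGS, DIM)              # language embedding matrix

# One language label per character, same length as the character sequence.
lang_ids = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
embedded_language = language_embed(lang_ids)             # (1, 12, DIM) embedded language feature

mixed_feat = torch.randn(1, 12, DIM)                     # mixed-language encoding feature from step S420
intermediate_semantic = mixed_feat + embedded_language   # addition-based fusion (concat + projection also possible)
print(intermediate_semantic.shape)                       # torch.Size([1, 12, 256])
```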
  • By performing steps S410 to S430, the monolingual text encoders corresponding to each language type can independently encode the mixed-language text information using the mutually independent symbol sets of the different languages, and after fusion processing an intermediate semantic coding feature containing language type information is obtained.
  • FIG. 5 shows a flowchart of the method steps for encoding processing based on the attention mechanism (Attention) in an embodiment of the present application.
  • As shown in Fig. 5, on the basis of the above embodiments, step S320 (encoding the mixed-language text information based on at least two language types to obtain the intermediate semantic encoding feature of the mixed-language text information) may include the following steps S510 to S530:
  • Step S510 Perform text encoding processing on each text character in the mixed language text information based on at least two language types to obtain character encoding features corresponding to each text character.
  • The mixed-language text information is a character sequence composed of multiple text characters.
  • When the encoding methods provided by the above embodiments are used, each text character can be encoded in turn to obtain the character encoding feature corresponding to each text character.
  • Step S520 Obtain the attention distribution weight corresponding to each text character.
  • Besides differences in character semantics, the text characters in the mixed-language text information are affected by several other factors that influence semantic encoding and decoding; therefore, this step can determine the attention distribution weight corresponding to each text character according to influencing factors of different dimensions.
  • Step S530 Perform weighted mapping on the character encoding feature corresponding to each text character according to the respective attention distribution weights corresponding to each text character to obtain the intermediate semantic encoding feature of the mixed language text information.
  • The magnitude of the attention distribution weight determines the semantic importance of each text character in the encoding and decoding process. Therefore, performing weighted mapping on the character encoding features of the text characters according to the attention distribution weights can improve the semantic expressiveness of the resulting intermediate semantic encoding feature.
  • an attention dimension may be the sequence position information of each text character in the mixed language text information.
  • the embodiment of the present application may first obtain the sequence position information of each text character in the mixed language text information, and then determine the corresponding positional attention distribution weight of each text character according to the sequence position information.
  • On this basis, the embodiment of the present application can also obtain the language type information of each text character, determine the language attention distribution weight corresponding to each text character according to the language type information, and then determine the multiple attention distribution weights corresponding to each text character according to the positional attention distribution weight and the language attention distribution weight.
  • Further, the embodiment of the present application can also obtain the timbre identification information of the target timbre subject corresponding to each text character, determine the timbre attention distribution weight corresponding to each text character according to the timbre identification information, and then determine the multiple attention distribution weights corresponding to each text character according to the positional attention distribution weight, the language attention distribution weight, and the timbre attention distribution weight.
  • By performing steps S510 to S530, an encoding effect based on the attention mechanism can be achieved; in particular, through the multiple attention mechanism, a variety of different influencing factors can be introduced into the encoding process of the mixed-language text information, improving the semantic expressiveness of the encoding result.
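  • To make the multiple-attention weighting of steps S510 to S530 more concrete, here is a hedged sketch in which content, position, language, and timbre scores are summed into per-character attention weights that then pool the character encodings into a single context vector; summing the scores before a softmax, and pooling to one vector rather than one per decoding step, are simplifying assumptions of this example.

```python
import torch
import torch.nn as nn

class MultiAttentionPooling(nn.Module):
    """Combine content, position, language, and timbre cues into per-character attention weights."""

    def __init__(self, dim: int = 256, max_len: int = 512, n_langs: int = 2, n_speakers: int = 4):
        super().__init__()
        self.content_score = nn.Linear(dim, 1)
        self.pos_score = nn.Embedding(max_len, 1)      # positional attention term
        self.lang_score = nn.Embedding(n_langs, 1)     # language attention term
        self.spk_score = nn.Embedding(n_speakers, 1)   # timbre attention term

    def forward(self, char_enc, lang_ids, speaker_id):
        # char_enc: (batch, seq_len, dim); lang_ids: (batch, seq_len); speaker_id: (batch,)
        positions = torch.arange(char_enc.size(1), device=char_enc.device)
        scores = (self.content_score(char_enc).squeeze(-1)    # (batch, seq_len)
                  + self.pos_score(positions).squeeze(-1)     # (seq_len,) broadcast
                  + self.lang_score(lang_ids).squeeze(-1)     # (batch, seq_len)
                  + self.spk_score(speaker_id))               # (batch, 1) broadcast
        weights = torch.softmax(scores, dim=-1)               # multiple attention distribution weights
        return torch.einsum("bs,bsd->bd", weights, char_enc)  # weighted mapping of character encodings

pool = MultiAttentionPooling()
ctx = pool(torch.randn(1, 12, 256), torch.zeros(1, 12, dtype=torch.long), torch.tensor([1]))
print(ctx.shape)   # torch.Size([1, 256])
```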
  • In step S330, the target timbre feature corresponding to the target timbre subject is acquired, and the intermediate semantic encoding feature is decoded based on the target timbre feature to obtain the acoustic feature.
  • audio databases corresponding to different timbre subjects can be pre-configured, and corresponding timbre identification information can be assigned to different timbre subjects by way of numbering or the like.
  • the timbre identification information of the target timbre body can be acquired first, and then the timbre identification information is mapped and transformed through the timbre embedding matrix to obtain the target timbre characteristic of the target timbre body.
  • the target timbre feature and the intermediate semantic coding feature can be jointly input into the decoder, and the decoder performs the decoding process to obtain the acoustic feature with the timbre characteristic of the target timbre subject.
  • In the decoding process, a multiple attention mechanism similar to that used on the encoder side in the above embodiment can also be adopted.
  • In the embodiments of this application, an attention-based RNN network structure can be used as the encoder-decoder model to realize the encoding and decoding of the mixed-language text information.
  • Alternatively, a Transformer can be used as the encoder-decoder model for encoding and decoding.
  • The Transformer model is based on a full-attention network structure, which can improve the parallelism of the model.
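  • If a Transformer is chosen as the encoder-decoder backbone instead of an attention-based RNN, PyTorch's built-in module gives the general shape; the dimensions below are placeholders, not values taken from the patent.

```python
import torch
import torch.nn as nn

# Full-attention encoder-decoder backbone as an alternative to the attention-based RNN.
model = nn.Transformer(d_model=256, nhead=4, num_encoder_layers=3,
                       num_decoder_layers=3, batch_first=True)

src = torch.randn(1, 12, 256)   # e.g. intermediate semantic encodings of the text
tgt = torch.randn(1, 30, 256)   # e.g. previous acoustic frames projected to d_model
out = model(src, tgt)           # (1, 30, 256), later projected to mel bins
print(out.shape)
```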
  • In some embodiments, after the acoustic encoding processing is performed on the acoustic features in step S340 to obtain the audio corresponding to the mixed-language text information, a timbre conversion model trained with timbre data samples of the target timbre subject can also be obtained, and the timbre conversion model can then perform timbre conversion processing on the audio to obtain audio corresponding to the target timbre subject.
  • In this way, the timbre of the mixed-language audio can be made more uniform without increasing the cost of data collection.
  • Fig. 6 shows a schematic diagram of the principle of audio synthesis of Chinese-English mixed text based on an embodiment of the present application.
  • As shown in Fig. 6, the overall system for audio synthesis mainly includes four parts: a multi-path residual encoder 610, a language embedding generator 620, a multiple attention mechanism module 630, and a speaker embedding generator 640; it also includes a decoder 650 and a vocoder 660.
  • The multi-path residual encoder 610 (Multipath-Res-Encoder) can perform residual encoding on the input mixed-language text through the Chinese and English two-path encoders and add the encoding result back onto its input to obtain the text encoding representation (Encode Representation), which enhances the distinguishability of the text encoding representation while reducing the separation between the Chinese and English languages.
  • the language embedding generator 620 may map and non-linearly transform the category of each character in the input mixed language text through language embedding (Language Embedding) to obtain language embedding. In this way, each input character has a corresponding language embedded to mark it, and combined with the text encoding characterization, the distinguishability of the output result of the encoder can be further enhanced.
  • the multi-attention mechanism module 630 (Multi-Attention) not only pays attention to the text encoding representation, but also pays attention to language embedding.
  • the attention mechanism serves as a bridge connecting the multi-channel residual encoder 610 and the decoder 650, and accurately determines which position in the encoding at each decoding moment plays a decisive role in the final synthesis quality.
  • the multiple attention mechanism not only pays attention to the text encoding representation, but also has a clear understanding of the current content that needs to be decoded. At the same time, it also pays attention to language embedding, and has a clear judgment on which language the current decoded content belongs to. The combination of the two can make the decoding more stable and smooth.
  • the speaker embedding generator 640 (Speaker Embedding) obtains speaker embedding information through mapping and non-linear transformation of the speaker serial numbers to which different audio databases belong, and participates in each decoding moment. Since the function of the decoder 650 is to convert the text encoding representation into an acoustic feature, it plays a key role in the timbre of the final synthesized audio. Introducing the speaker embedding into each decoding moment can effectively control the audio characteristic attributes output by the decoder 650, and then control the final synthesized audio timbre to the timbre of the corresponding speaker.
  • a mixed Chinese and English audio corresponding to the mixed language text can be obtained.
  • The system retains the benefits of end-to-end learning, and through the refined design of the encoding and decoding ends of the model, it ensures that the synthesized Chinese-English mixed audio is natural and smooth with a consistent timbre.
  • Fig. 7 shows a block diagram of the audio synthesis device provided by an embodiment of the present application.
  • the audio synthesis device 700 may include:
  • the information acquisition module 710 is configured to acquire mixed-language text information, where the mixed-language text information includes text characters corresponding to at least two language types;
  • the information encoding module 720 is configured to perform text encoding processing on the mixed-language text information based on at least two language types to obtain intermediate semantic encoding features of the mixed-language text information;
  • the information decoding module 730 is configured to obtain the target timbre feature corresponding to the target timbre subject, and decode the intermediate semantic coding feature based on the target timbre feature to obtain the acoustic feature;
  • the acoustic coding module 740 is configured to perform acoustic coding processing on the acoustic features to obtain audio information corresponding to the mixed language text information.
  • the information encoding module 720 includes:
  • the monolingual coding unit is used to separately encode the mixed-language text information through the corresponding monolingual text encoder of each language type to obtain at least two monolingual coding features of the mixed-language text information;
  • the coding feature fusion unit is used to perform fusion processing on at least two monolingual coding features to obtain the mixed-language coding feature of the mixed-language text information;
  • the coding feature determining unit is used to determine the intermediate semantic coding feature of the mixed language text information according to the coding feature of the mixed language.
  • the monolingual coding unit includes:
  • the character embedding subunit is used for mapping and transforming the mixed-language text information through the character embedding matrix corresponding to each language type to obtain at least two embedded character features of the mixed-language text information;
  • the embedded coding subunit is used to perform text coding processing on embedded character features through respective monolingual text encoders corresponding to each language type to obtain at least two monolingual coding features of mixed language text information.
  • the embedded coding subunit is specifically used for:
  • performing residual encoding on the embedded character features through the monolingual text encoders corresponding to each language type to obtain at least two residual coding features of the mixed-language text information; and fusing the embedded character features with each residual coding feature to obtain at least two monolingual coding features of the mixed-language text information.
  • the monolingual coding feature is the residual coding feature obtained by performing residual coding on the embedded character feature;
  • the coding feature fusion unit includes:
  • the coding feature fusion subunit is used to perform fusion processing on at least two monolingual coding features and embedded character features to obtain the mixed language coding feature of the mixed language text information.
  • the coding feature determination unit includes:
  • the language embedding subunit is used for mapping and transforming the mixed language text information through a language embedding matrix based on at least two language types to obtain the embedded language characteristics of the mixed language text information;
  • the language fusion subunit is used to merge the mixed language coding features and embedded language features to obtain the intermediate semantic coding features of the mixed language text information.
  • the information encoding module 720 includes:
  • the character encoding unit is used to perform text encoding processing on each text character in the mixed-language text information based on the at least two language types to obtain the character encoding feature corresponding to each text character;
  • the weight obtaining unit is used to obtain the attention distribution weight corresponding to each text character
  • the feature weighting unit is used for weighting and mapping the respective character encoding features of each text character according to the corresponding attention distribution weight of each text character to obtain the intermediate semantic encoding feature of the mixed language text information.
  • the weight obtaining unit includes:
  • the sequence position obtaining subunit is used to obtain the sequence position information of each text character in the mixed language text information
  • the first weight determination sub-unit is used to determine the corresponding positional attention distribution weight of each text character according to the sequence position information.
  • the weight obtaining unit further includes:
  • the language type acquisition subunit is used to acquire the language type information of each text character
  • the language weight determination subunit is used to determine the language attention distribution weight corresponding to each text character according to the language type information
  • the second weight determination subunit is used to determine the multiple attention distribution weights corresponding to each text character according to the location attention distribution weight and the language attention distribution weight.
  • the second weight determination subunit is specifically used for: acquiring the timbre identification information of the target timbre subject corresponding to each text character;
  • determining the timbre attention distribution weight corresponding to each text character according to the timbre identification information;
  • and determining the multiple attention distribution weights corresponding to each text character according to the positional attention distribution weight, the language attention distribution weight, and the timbre attention distribution weight.
  • the information decoding module 730 includes:
  • the timbre identification acquiring unit is used to acquire the timbre identification information of the target timbre subject
  • the timbre identification embedding unit is used for mapping and transforming the timbre identification information through the timbre embedding matrix to obtain the target timbre characteristic of the target timbre subject.
  • the audio synthesis device 700 further includes:
  • the model acquisition module is used to acquire the timbre conversion model obtained by training using the timbre data sample of the target timbre body;
  • the timbre conversion module is used to perform timbre conversion processing on the audio information through the timbre conversion model to obtain audio corresponding to the target timbre subject.
  • Fig. 8 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
  • As shown in Fig. 8, the computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage part 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for system operation.
  • the CPU 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804.
  • An input/output (Input/Output, I/O) interface 805 is also connected to the bus 804.
  • The following components are connected to the I/O interface 805: an input part 806 including a keyboard, a mouse, and the like; an output part 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage part 808 including a hard disk and the like; and a communication part 809 including a network interface card such as a LAN (Local Area Network) card and a modem.
  • the communication section 809 performs communication processing via a network such as the Internet.
  • the driver 810 is also connected to the I/O interface 805 as needed.
  • a removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 810 as needed, so that the computer program read from it is installed into the storage section 808 as needed.
  • the processes described in the flowcharts of the various methods may be implemented as computer software programs.
  • the embodiments of the present application include a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from the network through the communication part 809, and/or installed from the removable medium 811.
  • the computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or a combination of any of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, and a computer-readable program code is carried therein.
  • This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device .
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wireless, wired, etc., or any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

This application belongs to the field of artificial intelligence technology and relates to machine learning technology. Specifically, this application relates to an audio synthesis method, an audio synthesis apparatus, a computer-readable medium, and an electronic device. The method includes: acquiring mixed-language text information that includes text characters of at least two language types; performing text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information; acquiring a target timbre feature corresponding to a target timbre subject, and decoding the intermediate semantic encoding feature based on the target timbre feature to obtain an acoustic feature; and performing acoustic encoding processing on the acoustic feature to obtain audio corresponding to the mixed-language text information. The method solves the timbre jump problem caused by language differences in existing mixed-language audio synthesis technology and can stably output mixed-language audio that is natural, smooth, and uniform in timbre.

Description

Audio synthesis method and apparatus, computer-readable medium, and electronic device
This application claims priority to the Chinese patent application No. 202010402599.7, entitled "Audio information synthesis method and apparatus, computer-readable medium, and electronic device" and filed with the Chinese Patent Office on May 13, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and specifically to audio synthesis technology.
Background
With the rapid development of artificial intelligence technology and intelligent hardware devices (such as smartphones and smart speakers), voice interaction technology, as a natural mode of interaction, is being applied more and more widely. As an important part of voice interaction technology, speech synthesis technology has also made considerable progress. Speech synthesis technology is also called text-to-speech (TTS); its function is to convert text generated by the computer itself or supplied as external input into fluent speech that the user can understand, and to play it back.
In applications of speech synthesis technology, speech mixing multiple language types is often encountered, for example Chinese sentences containing English words or phrases. In such cases, switching between speech of the two language types generally produces a large timbre difference, causing an obvious timbre jump in the synthesized speech and degrading its playback quality. Therefore, how to overcome the timbre differences caused by mixing speech of multiple language types is a problem that urgently needs to be solved.
Summary
The purpose of this application is to provide an audio synthesis method, an audio synthesis apparatus, a computer-readable medium, and an electronic device that can, to a certain extent, solve the technical problem of timbre differences caused by the presence of speech of different language types in synthesized audio.
Other features and advantages of this application will become apparent from the following detailed description, or will be learned in part through practice of this application.
According to one aspect of the embodiments of this application, an audio synthesis method is provided, executed by an electronic device, the method including:
acquiring mixed-language text information, the mixed-language text information including text characters corresponding to at least two language types;
performing text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information;
acquiring a target timbre feature corresponding to a target timbre subject, and decoding the intermediate semantic encoding feature based on the target timbre feature to obtain an acoustic feature; and
performing acoustic encoding processing on the acoustic feature to obtain audio corresponding to the mixed-language text information.
According to one aspect of the embodiments of this application, an audio information synthesis apparatus is provided, the apparatus including:
an information acquisition module, configured to acquire mixed-language text information, the mixed-language text information including text characters corresponding to at least two language types;
an information encoding module, configured to perform text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information;
an information decoding module, configured to acquire a target timbre feature corresponding to a target timbre subject, and to decode the intermediate semantic encoding feature based on the target timbre feature to obtain an acoustic feature; and
an acoustic encoding module, configured to perform acoustic encoding processing on the acoustic feature to obtain audio corresponding to the mixed-language text information.
According to one aspect of the embodiments of this application, an electronic device is provided, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the audio synthesis method in the above technical solution by executing the executable instructions.
According to one aspect of the embodiments of this application, a computer program product is provided, including instructions that, when run on a computer, cause the computer to perform the audio synthesis method in the above technical solution.
In the technical solution provided by the embodiments of this application, the mixed-language text information is encoded by encoders corresponding to the multiple language types, and the encoded features are then decoded by a decoder that incorporates the timbre feature of the target timbre subject, thereby generating audio information corresponding to a single timbre and multiple language types. This solves the timbre jump problem caused by language differences in existing mixed-language audio synthesis technology and can stably output mixed-language audio that is natural, smooth, and uniform in timbre. The embodiments of this application can be deployed in the cloud to provide a general audio synthesis service for various devices, and exclusive timbres can also be customized according to different application requirements. Because monolingual audio databases of different target timbre subjects can be used, mixed synthesis of audio in multiple language types can be achieved, which greatly reduces the cost of training data collection. At the same time, the embodiments of this application are compatible with already-recorded monolingual audio databases, making richer timbres available.
Brief Description of the Drawings
Fig. 1 shows an exemplary system architecture diagram of the technical solution of this application in one application scenario;
Fig. 2 shows an exemplary system architecture of the technical solution of this application in another application scenario, together with a customized audio synthesis service process;
Fig. 3 shows a flowchart of the steps of the audio synthesis method provided by an embodiment of this application;
Fig. 4 shows a flowchart of the steps of a method for performing encoding processing through a multi-path encoder in an embodiment of this application;
Fig. 5 shows a flowchart of the steps of a method for performing encoding processing based on an attention mechanism in an embodiment of this application;
Fig. 6 shows a schematic diagram of the principle of synthesizing audio information for Chinese-English mixed text based on an embodiment of this application;
Fig. 7 shows a block diagram of the audio synthesis apparatus provided by an embodiment of this application;
Fig. 8 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of this application.
Detailed Description
This application has a wide range of application scenarios. The audio synthesis solution for mixing multiple language types can be configured as a cloud service and, as a basic technology, empower users of that cloud service; the solution can also be used for personalized scenarios in vertical fields. For example, it can be applied to scenarios such as intelligent reading in reading apps, intelligent customer service, news broadcasting, and smart device interaction, realizing intelligent audio synthesis in various scenarios.
Fig. 1 shows an exemplary system architecture diagram of the technical solution of this application in one application scenario.
As shown in Fig. 1, the system architecture 100 may include a client 110, a network 120, and a server 130. The client 110 can be carried on various terminal devices such as smartphones, intelligent robots, smart speakers, tablet computers, notebook computers, and desktop computers. The server 130 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud services. The network 120 may be a communication medium of various connection types capable of providing a communication link between the client 110 and the server 130, for example a wired communication link or a wireless communication link.
According to implementation needs, the technical solution provided by the embodiments of this application may be applied to the client 110, may be applied to the server 130, or may be implemented by the client 110 and the server 130 in collaboration, which is not specifically limited in this application.
For example, various smart devices such as intelligent robots and smartphones can access the mixed-language audio synthesis service provided by a cloud server through a wireless network, such as a Chinese-English mixed speech synthesis service. The client 110 sends the Chinese-English mixed text to be synthesized to the server 130 via the network 120; after the server 130 performs speech synthesis, the corresponding synthesized audio can be sent to the client 110 in streaming form or returned as a whole sentence. A complete speech synthesis process may include, for example:
the client 110 uploads the Chinese-English mixed text to be synthesized to the server 130, and the server 130 performs corresponding normalization processing after receiving the Chinese-English mixed text;
the server 130 inputs the normalized text information into a Chinese-English mixed speech synthesis system, quickly synthesizes the audio corresponding to the text information through that system, and completes post-processing operations such as audio compression;
the server 130 returns the audio to the client 110 in streaming form or as a whole sentence, and after receiving the audio the client 110 can play it smoothly and naturally.
In the above speech synthesis process, the delay of the speech synthesis service provided by the server 130 is very small, and the client 110 can essentially obtain the returned result immediately. Users can hear the required content in a short time, freeing their eyes, and the interaction is natural and convenient.
Fig. 2 shows an exemplary system architecture of the technical solution of this application in another application scenario, together with a customized audio synthesis service process. This system architecture and process can be applied to vertical fields that require customized, exclusive-timbre speech synthesis services, such as novel reading and news broadcasting.
Under this system architecture, the process of implementing a customized audio synthesis service may include:
the front-end demander 210 submits a list of timbre requirements for the speech synthesis service needed by its product, such as the speaker's gender, timbre type, and other requirements;
after receiving the timbre requirement list, the back-end service provider 220 collects corresponding audio according to the required timbre, builds an audio database, and trains a corresponding audio synthesis model 230;
the back-end service provider 220 uses the audio synthesis model 230 to synthesize samples, and after the samples are delivered to the front-end demander 210 for inspection and confirmation, the customized audio synthesis model 230 can be deployed online;
the application of the front-end demander 210 (such as a reading app or a news client) sends the text for which audio needs to be synthesized to the audio synthesis model 230 deployed on the back-end service provider 220; users of the front-end demander 210 can then hear, in the application, the text content read aloud with the corresponding customized timbre. The specific audio synthesis process is the same as the online synthesis service used in the system architecture shown in Fig. 1.
In this application scenario, after the front-end demander 210 provides its requirements, the back-end service provider 220 only needs to collect a speaker audio database of one language type (such as Chinese) that meets the requirements and combine it with an existing audio database of another language type (such as English) from other speakers to perform customized training of the language-mixable audio synthesis model 230, and finally perform mixed-language audio synthesis with the timbre required by the front-end demander 210. This greatly reduces the cost of customized audio synthesis services.
The technical solution provided by this application is described in detail below with reference to specific embodiments.
Fig. 3 shows a flowchart of the steps of the audio synthesis method provided by an embodiment of this application. The execution subject of the audio synthesis method is an electronic device, which may be any of various terminal devices carrying a client, such as a smartphone or smart speaker, or any of various server devices acting as the server, such as a physical server or cloud server. As shown in Fig. 3, the audio synthesis method mainly includes steps S310 to S340:
Step S310. Acquire mixed-language text information, the mixed-language text information including text characters corresponding to at least two language types.
The mixed-language text information is composed of any number of text characters, and these text characters correspond to at least two different language types. For example, the mixed-language text information may be text composed of a mixture of Chinese characters and English characters.
In this step, the mixed-language text information input by the user through an input device can be acquired by real-time reception, or the mixed-language text information can be extracted sentence by sentence or paragraph by paragraph, item by item, from a file containing text information.
In addition, this step may also perform speech recognition on speech information input by the user that contains two or more different language types, and obtain mixed-language text information including at least two language types based on the speech recognition result; for example, this step may use a pre-trained speech recognition model to perform speech recognition processing on received speech information containing at least two language types to obtain the corresponding mixed-language text information, and then perform audio synthesis on the mixed-language text information through the subsequent steps, achieving an overall timbre conversion effect and realizing timbre-uniform voice changing for one or more speakers.
Step S320. Perform text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information.
This step may use pre-trained encoders to perform text encoding processing on the mixed-language text information to obtain an intermediate semantic encoding feature related to the natural semantics of the mixed-language text information. The number and types of the encoders may correspond one to one to the language types included in the mixed-language text information; for example, if the mixed-language text information contains both Chinese characters and English characters, this step may use Chinese and English two-path encoders to perform text encoding processing on the mixed-language text information to obtain the intermediate semantic encoding feature. In subsequent steps, the intermediate semantic encoding feature can be decoded by decoders corresponding to the encoders, finally forming natural language in audio form that the user can understand.
The encoder may be a model trained on various types of neural networks, such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory network (LSTM), or a Gated Recurrent Unit (GRU). A CNN is a feed-forward neural network whose neurons respond to units within a receptive field; a CNN usually consists of multiple convolutional layers and a fully connected layer at the top, and it reduces the number of model parameters through parameter sharing, which has led to its wide use in image and speech recognition. An RNN is a class of recursive neural networks that take sequence data as input, recurse along the evolution direction of the sequence, and connect all nodes (recurrent units) in a chain. An LSTM is a recurrent neural network that adds to the algorithm a unit for judging whether information is useful; input, forget, and output gates are placed in this unit. After information enters the LSTM, it is judged according to the rules: only information that passes the algorithm's check is retained, and non-conforming information is forgotten through the forget gate. LSTMs are suitable for processing and predicting important events with relatively long intervals and delays in a time series. A GRU is also a recurrent neural network and, like the LSTM, was proposed to address long-term memory and gradient problems in back-propagation; compared with the LSTM, the GRU has one less internal gate and fewer parameters, yet in most cases it can achieve results comparable to the LSTM while effectively reducing computation time.
Step S330. Acquire a target timbre feature corresponding to a target timbre subject, and decode the intermediate semantic encoding feature based on the target timbre feature to obtain an acoustic feature.
The target timbre subject is a subject object used to determine the timbre characteristics of the synthesized audio, and this subject object may be the speaker corresponding to an audio database in which voice samples are stored. In some embodiments, the target timbre subject may be a real, physical object, for example a real person with distinctive timbre characteristics such as a broadcaster or voice actor; the target timbre subject may also be a virtual object synthesized by computer simulation, for example a virtual character such as Hatsune Miku or Luo Tianyi generated with the speech synthesis software VOCALOID.
This step may obtain in advance the timbre characteristics required by the user, such as a male voice or an emotionally expressive voice, and then select a target timbre subject that matches these characteristics. For a determined target timbre subject, a target timbre feature that reflects and identifies its timbre characteristics can be obtained by feature extraction, mapping, or other means. Based on this target timbre feature, a pre-trained decoder can then be used to decode the intermediate semantic encoding feature obtained in step S320 to obtain the corresponding acoustic feature.
The acoustic feature may be, for example, a feature with timbre characteristics and sound content presented as a spectrogram or in another form. A frequency spectrum is the representation of a time-domain signal in the frequency domain and can be obtained by applying a Fourier transform to the sound signal; the result is two plots with amplitude and phase respectively on the vertical axis and frequency on the horizontal axis. In speech synthesis applications, the phase information is usually omitted and only the amplitude information at each frequency is retained.
Step S340. Perform acoustic encoding processing on the acoustic feature to obtain audio corresponding to the mixed-language text information.
In this step, the acoustic feature can be input to a vocoder, which converts the acoustic feature into audio that can be output and played through an audio output device such as a speaker. The term vocoder comes from the abbreviation of voice encoder; a vocoder is also called a speech signal analysis and synthesis system, and its function is to convert acoustic features into sound.
In the audio synthesis method provided by the embodiments of this application, the mixed-language text information is encoded by encoders corresponding to the multiple language types, and the encoded features are then decoded by a decoder that incorporates the timbre feature of the target timbre subject, thereby generating audio information corresponding to a single timbre and multiple language types. This solves the timbre jump problem caused by language differences in existing mixed-language audio synthesis technology and can stably output mixed-language audio that is natural, smooth, and uniform in timbre. The embodiments of this application can be deployed in the cloud to provide a general audio synthesis service for various devices, and exclusive timbres can also be customized according to different application requirements. Because monolingual audio databases of different target timbre subjects can be used, mixed synthesis of audio in multiple language types can be achieved, which greatly reduces the cost of training data collection. At the same time, the embodiments of this application are compatible with already-recorded monolingual audio databases, making richer timbres available.
下面结合图4至图5对以上实施例中部分步骤的实现方式做出详细说明。
FIG. 4 shows a flowchart of the steps of a method for performing encoding processing through multipath encoders in an embodiment of this application. As shown in FIG. 4, on the basis of the above embodiment, step S320, performing encoding processing on the mixed-language text information based on the at least two language types to obtain the intermediate semantic encoding feature of the mixed-language text information, may include the following steps S410 to S430:
Step S410. Perform text encoding processing on the mixed-language text information separately through the monolingual text encoders corresponding to the respective language types, to obtain at least two monolingual encoding features of the mixed-language text information.
In this step, the mixed-language text information may first be mapped and transformed into vector features that the encoders can recognize. The mapping transformation may, for example, be performed by applying the character embedding matrices corresponding to the respective language types to the mixed-language text information, yielding at least two embedded character features of the mixed-language text information. The number and types of character embedding matrices may correspond one-to-one to the language types. For example, if the mixed-language text information contains both Chinese and English characters, this step may use the character embedding matrix corresponding to Chinese characters to map and transform the mixed-language text information, obtaining an embedded character feature corresponding to Chinese characters, and may likewise use the character embedding matrix corresponding to English characters to map and transform the mixed-language text information, obtaining an embedded character feature corresponding to English characters. The character embedding matrix may first apply a linear mapping to the mixed-language text information, after which an activation function or another method applies a nonlinear transformation to the linear mapping result to obtain the corresponding embedded character feature.
However many language types the mixed-language text information includes, this step can use the same number of corresponding monolingual text encoders. By separately encoding the embedded character features with the monolingual text encoders corresponding to the respective language types, at least two monolingual encoding features of the mixed-language text information can be obtained. For example, if the mixed-language text information contains both Chinese and English characters, after the embedded character feature corresponding to Chinese characters is obtained, it can be encoded by the monolingual text encoder for Chinese to obtain the monolingual encoding feature for Chinese; similarly, after the embedded character feature corresponding to English characters is obtained, it can be encoded by the monolingual text encoder for English to obtain the monolingual encoding feature for English.
The monolingual text encoders used in the embodiments of this application may be encoders with a residual network structure; residual networks are easy to optimize and can improve accuracy by adding considerable depth. On this basis, the embedded character features can be residual-encoded separately by the monolingual text encoders corresponding to the respective language types to obtain at least two residual encoding features of the mixed-language text information; the embedded character features are then fused with the respective residual encoding features to obtain at least two monolingual encoding features of the mixed-language text information.
The residual encoding feature is the difference between the encoder's input data and output data; fusing the residual encoding feature with the input embedded character feature yields the monolingual encoding feature, and the fusion here may simply be a direct addition of the residual encoding feature and the embedded character feature. An encoding scheme with a residual structure is more sensitive to changes in the encoder's output data: during training, changes in the output have a stronger effect on the adjustment of the network weights, so a better training result can be obtained.
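As a sketch of the residual encoding described above (PyTorch assumed; the convolutional body, layer sizes, and vocabulary sizes are illustrative assumptions, not the exact architecture of the embodiments), the learned residual is added back onto the embedded character feature:

    import torch
    import torch.nn as nn

    class ResidualTextEncoder(nn.Module):
        """Illustrative residual encoder: output = embedded characters + learned residual."""
        def __init__(self, vocab_size: int, dim: int = 256):
            super().__init__()
            self.char_embedding = nn.Embedding(vocab_size, dim)          # character embedding matrix
            self.body = nn.Sequential(                                   # residual branch
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.Tanh(),
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
            )

        def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
            embedded = self.char_embedding(char_ids)                      # (batch, seq, dim)
            residual = self.body(embedded.transpose(1, 2)).transpose(1, 2)
            return embedded + residual                                    # fuse by direct addition

    # Usage sketch: one encoder per language type, each fed the full mixed-language
    # character sequence mapped through its own character-id table.
    zh_encoder = ResidualTextEncoder(vocab_size=6000)
    en_encoder = ResidualTextEncoder(vocab_size=100)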
Step S420. Fuse the at least two monolingual encoding features to obtain a mixed-language encoding feature of the mixed-language text information.
The mixed-language encoding feature of the mixed-language text information can be obtained by fusing the monolingual encoding features output by the individual monolingual text encoders. For example, for two monolingual encoding features, a vector computation can be performed on them, such as direct addition, to obtain the mixed-language encoding feature. Alternatively, the two monolingual encoding features can be concatenated, and the concatenated feature can then be mapped through a fully connected layer or another network structure to obtain the mixed-language encoding feature. The embodiments of this application impose no particular limitation on this.
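Both fusion options mentioned here can be sketched in a few lines (PyTorch assumed; the batch size and dimensions are illustrative):

    import torch
    import torch.nn as nn

    dim = 256
    zh_feat = torch.randn(2, 5, dim)          # monolingual encoding feature for Chinese
    en_feat = torch.randn(2, 5, dim)          # monolingual encoding feature for English

    # Option 1: fuse by direct element-wise addition.
    mixed_by_sum = zh_feat + en_feat

    # Option 2: fuse by concatenation followed by a fully connected projection.
    projection = nn.Linear(2 * dim, dim)
    mixed_by_concat = projection(torch.cat([zh_feat, en_feat], dim=-1))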
In some embodiments of this application, based on the residual network structure, the residual encoding features and the embedded character features may be fused within the monolingual text encoders corresponding to the different language types to obtain the monolingual encoding features, and the monolingual encoding features are then fused to obtain the mixed-language encoding feature of the mixed-language text information.
In other embodiments of this application, based on the residual network structure, the monolingual encoders corresponding to the different language types may perform only residual encoding on the embedded character features to obtain the residual encoding features, that is, the residual encoding features themselves serve as the monolingual encoding features output by the monolingual text encoders; the monolingual encoding features are then fused together with the embedded character features to obtain the mixed-language encoding feature of the mixed-language text information. Compared with the previous implementation, this approach saves one fusion operation and can therefore improve computational efficiency and reduce computational cost.
Step S430. Determine the intermediate semantic encoding feature of the mixed-language text information according to the mixed-language encoding feature.
In some embodiments of this application, the mixed-language encoding feature may be used directly as the intermediate semantic encoding feature of the mixed-language text information, or the mixed-language encoding feature may be transformed by a preset function to obtain the intermediate semantic encoding feature.
In other embodiments of this application, identification information of the language types may be embedded into the mixed-language text information to obtain the intermediate semantic encoding feature of the mixed-language text information.
For example, this step may apply a language embedding matrix based on the at least two language types to map and transform the mixed-language text information, obtaining an embedded language feature of the mixed-language text information; the mixed-language encoding feature and the embedded language feature are then fused to obtain the intermediate semantic encoding feature of the mixed-language text information.
The mapping transformation applied to the mixed-language text information by the language embedding matrix may consist of a linear mapping according to preset matrix parameters in the language embedding matrix, followed by a nonlinear transformation of the linear mapping result via an activation function or another method, yielding the corresponding embedded language feature. For example, if the mixed-language text information is a character sequence with a certain number of characters, the embedded language feature obtained by the mapping transformation may be a feature vector whose sequence length equals that of the character sequence, with each element indicating the language type of the corresponding character in the character sequence.
The fusion of the mixed-language encoding feature and the embedded language feature may be a vector computation on the two, such as direct addition, yielding the intermediate semantic encoding feature of the mixed-language text information. Alternatively, the mixed-language encoding feature and the embedded language feature may be concatenated, and the concatenated result is then mapped through a fully connected layer or another network structure to obtain the intermediate semantic encoding feature of the mixed-language text information.
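A minimal sketch of the language embedding and its fusion with the mixed-language encoding feature (PyTorch assumed; the two-language setup, the tanh nonlinearity, and the dimensions are illustrative assumptions):

    import torch
    import torch.nn as nn

    dim, num_languages = 256, 2
    language_embedding = nn.Embedding(num_languages, dim)    # language embedding matrix

    # 0 = Chinese, 1 = English for each character position of the mixed-language text.
    language_ids = torch.tensor([[0, 0, 1, 1, 0]])            # (batch, seq_len)
    embedded_language = torch.tanh(language_embedding(language_ids))   # nonlinear transform

    mixed_encoding = torch.randn(1, 5, dim)                   # mixed-language encoding feature
    intermediate_semantic = mixed_encoding + embedded_language          # fuse by addition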
By performing steps S410 to S430, the mixed-language text information is encoded independently by the monolingual text encoders corresponding to the respective language types, using mutually independent symbol sets of the different languages, and the results are fused to obtain an intermediate semantic encoding feature that contains language-type information.
FIG. 5 shows a flowchart of the steps of a method for performing encoding processing based on an attention mechanism in an embodiment of this application. As shown in FIG. 5, on the basis of the above embodiments, step S320, performing encoding processing on the mixed-language text information based on the at least two language types to obtain the intermediate semantic encoding feature of the mixed-language text information, may include the following steps S510 to S530:
Step S510. Perform text encoding processing on each text character in the mixed-language text information based on the at least two language types, to obtain a character encoding feature corresponding to each text character.
The mixed-language text information is a character sequence composed of multiple text characters. When the mixed-language text information is text-encoded with the encoding methods provided in the above embodiments, each text character in it can be encoded in turn to obtain the character encoding feature corresponding to each text character.
Step S520. Obtain the attention allocation weight corresponding to each text character.
Apart from differences in character semantics, the text characters in the mixed-language text information differ in several other respects that affect semantic encoding and decoding; therefore, this step may determine the attention allocation weight of each text character according to influencing factors in different dimensions.
Step S530. According to the attention allocation weight corresponding to each text character, apply weighted mapping to the character encoding feature corresponding to each text character, to obtain the intermediate semantic encoding feature of the mixed-language text information.
The magnitude of the attention allocation weight determines the semantic importance of each text character during encoding and decoding; accordingly, applying weighted mapping to the character encoding features of the text characters according to the attention allocation weights improves the semantic expressiveness of the resulting intermediate semantic encoding feature.
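A minimal illustration of this weighted mapping follows (PyTorch assumed; the dot-product scoring against a decoder query is a simple choice made for the sketch, not necessarily the scoring function of the embodiments):

    import torch
    import torch.nn.functional as F

    char_features = torch.randn(1, 5, 256)        # character encoding features (batch, seq, dim)
    decoder_query = torch.randn(1, 256)            # state of the current decoding step

    # Attention allocation weights: one scalar per text character, normalized over the sequence.
    scores = torch.einsum("bsd,bd->bs", char_features, decoder_query)
    attention_weights = F.softmax(scores, dim=-1)   # (batch, seq)

    # Weighted mapping of the character features into one intermediate semantic vector.
    intermediate_semantic = torch.einsum("bs,bsd->bd", attention_weights, char_features)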
In some embodiments of this application, one attention dimension may be the sequence position information of each text character within the mixed-language text information. For example, the embodiments of this application may first obtain the sequence position information of each text character in the mixed-language text information, and then determine a position attention allocation weight for each text character according to the sequence position information.
On this basis, the embodiments of this application may further obtain the language type information of each text character, determine a language attention allocation weight for each text character according to the language type information, and then determine a multiple attention allocation weight for each text character according to the position attention allocation weight and the language attention allocation weight.
Further, the embodiments of this application may obtain the timbre identification information of the target timbre subject corresponding to each text character, determine a timbre attention allocation weight for each text character according to the timbre identification information, and then determine a multiple attention allocation weight for each text character according to the position attention allocation weight, the language attention allocation weight, and the timbre attention allocation weight.
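One straightforward way to combine the three kinds of weights into a multiple attention allocation weight is shown below; multiplicative combination followed by renormalization is an assumption made purely for the example, as the embodiments do not fix a specific combination formula:

    import torch

    # Per-character weights from three attention dimensions (batch, seq_len), already normalized.
    position_weights = torch.tensor([[0.1, 0.2, 0.3, 0.2, 0.2]])
    language_weights = torch.tensor([[0.3, 0.3, 0.1, 0.1, 0.2]])
    timbre_weights   = torch.tensor([[0.2, 0.2, 0.2, 0.2, 0.2]])

    # Combine the dimensions and renormalize so the multiple attention weights sum to 1.
    combined = position_weights * language_weights * timbre_weights
    multiple_attention_weights = combined / combined.sum(dim=-1, keepdim=True)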
By performing steps S510 to S530, attention-based encoding is achieved; in particular, the multiple attention mechanism introduces several different influencing factors into the encoding of the mixed-language text information, improving the semantic expressiveness of the encoding result.
In step S330, the target timbre feature corresponding to the target timbre subject is obtained, and the intermediate semantic encoding feature is decoded based on the target timbre feature to obtain the acoustic feature.
In this step, audio databases corresponding to different timbre subjects may be configured in advance, and corresponding timbre identification information may be assigned to the different timbre subjects, for example by numbering. This step may first obtain the timbre identification information of the target timbre subject and then map and transform the timbre identification information through a timbre embedding matrix to obtain the target timbre feature of the target timbre subject. The target timbre feature and the intermediate semantic encoding feature are then fed together into the decoder, which performs decoding to obtain an acoustic feature carrying the timbre characteristics of the target timbre subject.
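A hedged sketch of this conditioning step (PyTorch assumed; the GRU-cell decoder, the 80-dimensional mel output, and all sizes are illustrative assumptions): the speaker id is looked up in a timbre embedding matrix and concatenated with the intermediate semantic feature at every decoding step:

    import torch
    import torch.nn as nn

    class ConditionedDecoder(nn.Module):
        """Illustrative decoder conditioned on a target timbre (speaker) embedding."""
        def __init__(self, semantic_dim=256, speaker_count=10, speaker_dim=64, mel_dim=80):
            super().__init__()
            self.timbre_embedding = nn.Embedding(speaker_count, speaker_dim)  # timbre embedding matrix
            self.cell = nn.GRUCell(semantic_dim + speaker_dim, 512)
            self.to_mel = nn.Linear(512, mel_dim)

        def forward(self, semantic_steps: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
            # semantic_steps: (batch, steps, semantic_dim); speaker_id: (batch,)
            timbre = self.timbre_embedding(speaker_id)                         # target timbre feature
            state = semantic_steps.new_zeros(semantic_steps.size(0), 512)
            frames = []
            for t in range(semantic_steps.size(1)):                            # one frame per decoding step
                step_input = torch.cat([semantic_steps[:, t], timbre], dim=-1)
                state = self.cell(step_input, state)
                frames.append(self.to_mel(state))
            return torch.stack(frames, dim=1)                                  # acoustic features (batch, steps, mel_dim)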
When decoding with the decoder, a multiple attention mechanism similar to that used with the encoders in the above embodiments may also be used. For example, in steps S320 and S330, an attention-based RNN network structure may serve as the encoder-decoder model for encoding and decoding the mixed-language text information; alternatively, a Transformer may be used as the encoder-decoder model for encoding and decoding. The Transformer is a network structure based entirely on attention and can improve the parallelism of the model.
After step S340, in which acoustic encoding processing is performed on the acoustic feature to obtain the audio corresponding to the mixed-language text information, the embodiments of this application may further obtain a timbre conversion model trained with timbre data samples of the target timbre subject, and then perform timbre conversion on the audio through the timbre conversion model to obtain audio corresponding to the target timbre subject.
By training a timbre conversion model and using it to convert the timbre of the output audio, the timbre of the mixed-language audio can be made more uniform without increasing the cost of data collection.
FIG. 6 is a schematic diagram of the principle of synthesizing audio from Chinese-English mixed text based on an embodiment of this application. As shown in FIG. 6, the overall audio synthesis system mainly includes four parts: a multipath residual encoder 610, a language embedding generator 620, a multi-attention module 630, and a speaker embedding generator 640, together with a decoder 650 and a vocoder 660.
The multipath residual encoder 610 (Multipath-Res-Encoder) performs residual encoding on the input mixed-language text through two encoders, one for Chinese and one for English, and adds the encoding result to the input mixed-language text to obtain the text encoding representation (Encode Representation). This enhances the distinguishability of the text encoding representation while reducing the discontinuity at the boundaries between Chinese and English.
The language embedding generator 620 maps and nonlinearly transforms the category to which each character of the input mixed-language text belongs, producing a language embedding (Language Embedding). In this way, every input character is marked by a corresponding language embedding; combined with the text encoding representation, this further enhances the distinguishability of the encoder output.
The multi-attention module 630 (Multi-Attention) attends to the language embedding in addition to the text encoding representation. The attention mechanism serves as the bridge between the multipath residual encoder 610 and the decoder 650, accurately determining which position in the encoding plays a decisive role in the final synthesis quality at each decoding instant. The multiple attention mechanism attends to the text encoding representation, giving it a clear understanding of the content currently being decoded, and at the same time attends to the language embedding, giving it a clear judgment of which language the currently decoded content belongs to. Combining the two makes decoding more stable and fluent.
The speaker embedding generator 640 (Speaker Embedding) maps and nonlinearly transforms the speaker indices associated with the different audio databases to obtain speaker embedding information, which participates in every decoding instant. Because the role of the decoder 650 is to convert the text encoding representation into acoustic features, it plays a key role in the timbre of the final synthesized audio. Introducing the speaker embedding into every decoding instant effectively controls the attributes of the audio features output by the decoder 650, and thereby controls the timbre of the final synthesized audio to be that of the corresponding speaker.
After the acoustic features output by the decoder 650 are encoded into sound by the vocoder 660, audio mixing Chinese and English that corresponds to the mixed-language text is obtained. The system retains the benefits of end-to-end learning, and the refined design of the encoding and decoding sides of the model ensures that the synthesized Chinese-English mixed audio is natural and fluent with a consistent timbre.
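Tying the components of FIG. 6 together, the overall data flow could be sketched as follows; this is an illustration only, with tiny stand-in modules and made-up sizes, and it mirrors the figure's data flow rather than the exact production system (PyTorch assumed):

    import torch
    import torch.nn as nn

    # Minimal stand-ins for the components of FIG. 6 (dimensions are illustrative).
    dim, mel_dim = 256, 80
    zh_path = nn.Embedding(6000, dim)                 # stand-in for the Chinese residual path (610)
    en_path = nn.Embedding(100, dim)                  # stand-in for the English residual path (610)
    lang_table = nn.Embedding(2, dim)                 # language embedding generator (620)
    speaker_table = nn.Embedding(10, dim)             # speaker embedding generator (640)
    decoder_rnn = nn.GRU(2 * dim, mel_dim, batch_first=True)   # stand-in for attention + decoder (630/650)

    def synthesize(zh_ids, en_ids, language_ids, speaker_id):
        text_representation = zh_path(zh_ids) + en_path(en_ids)            # fused text encoding representation
        intermediate_semantic = text_representation + lang_table(language_ids)
        timbre = speaker_table(speaker_id).unsqueeze(1).expand(-1, zh_ids.size(1), -1)
        acoustic_features, _ = decoder_rnn(torch.cat([intermediate_semantic, timbre], dim=-1))
        return acoustic_features        # (batch, frames, mel_dim); a vocoder (660) would turn this into audio

    # Usage sketch.
    features = synthesize(torch.randint(0, 6000, (1, 5)),
                          torch.randint(0, 100, (1, 5)),
                          torch.tensor([[0, 0, 1, 1, 0]]),
                          torch.tensor([3]))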
It should be noted that, although the steps of the methods of this application are described in a specific order in the accompanying drawings, this does not require or imply that the steps must be performed in that specific order, or that all of the illustrated steps must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps, among other variations.
The following describes apparatus embodiments of this application, which can be used to perform the audio synthesis method in the above embodiments of this application. FIG. 7 shows a block diagram of the audio synthesis apparatus provided by an embodiment of this application. As shown in FIG. 7, the audio synthesis apparatus 700 may include:
an information obtaining module 710, configured to obtain mixed-language text information, the mixed-language text information including text characters corresponding to at least two language types;
an information encoding module 720, configured to perform text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information;
an information decoding module 730, configured to obtain a target timbre feature corresponding to a target timbre subject, and decode the intermediate semantic encoding feature based on the target timbre feature to obtain an acoustic feature;
an acoustic encoding module 740, configured to perform acoustic encoding processing on the acoustic feature to obtain audio information corresponding to the mixed-language text information.
In some embodiments of this application, based on the above embodiments, the information encoding module 720 includes:
a monolingual encoding unit, configured to perform text encoding processing on the mixed-language text information separately through the monolingual text encoders corresponding to the respective language types, to obtain at least two monolingual encoding features of the mixed-language text information;
an encoding feature fusion unit, configured to fuse the at least two monolingual encoding features to obtain a mixed-language encoding feature of the mixed-language text information;
an encoding feature determining unit, configured to determine the intermediate semantic encoding feature of the mixed-language text information according to the mixed-language encoding feature.
In some embodiments of this application, based on the above embodiments, the monolingual encoding unit includes:
a character embedding subunit, configured to perform mapping transformation processing on the mixed-language text information separately through the character embedding matrices corresponding to the respective language types, to obtain at least two embedded character features of the mixed-language text information;
an embedding encoding subunit, configured to perform text encoding processing on the embedded character features separately through the monolingual text encoders corresponding to the respective language types, to obtain at least two monolingual encoding features of the mixed-language text information.
In some embodiments of this application, based on the above embodiments, the embedding encoding subunit is specifically configured to:
perform residual encoding on the embedded character features separately through the monolingual text encoders corresponding to the respective language types, to obtain at least two residual encoding features of the mixed-language text information;
fuse the embedded character features with the respective residual encoding features to obtain at least two monolingual encoding features of the mixed-language text information.
In some embodiments of this application, based on the above embodiments, the monolingual encoding feature is a residual encoding feature obtained by residual-encoding the embedded character feature; the encoding feature fusion unit includes:
an encoding feature fusion subunit, configured to fuse the at least two monolingual encoding features and the embedded character features to obtain the mixed-language encoding feature of the mixed-language text information.
In some embodiments of this application, based on the above embodiments, the encoding feature determining unit includes:
a language embedding subunit, configured to perform mapping transformation processing on the mixed-language text information through a language embedding matrix based on the at least two language types, to obtain an embedded language feature of the mixed-language text information;
a language fusion subunit, configured to fuse the mixed-language encoding feature and the embedded language feature to obtain the intermediate semantic encoding feature of the mixed-language text information.
In some embodiments of this application, based on the above embodiments, the information encoding module 720 includes:
a character encoding unit, configured to perform text encoding processing on each text character in the mixed-language text information based on the at least two language types, to obtain a character encoding feature corresponding to each text character;
a weight obtaining unit, configured to obtain an attention allocation weight corresponding to each text character;
a feature weighting unit, configured to apply, according to the attention allocation weight corresponding to each text character, weighted mapping to the character encoding feature of each text character, to obtain the intermediate semantic encoding feature of the mixed-language text information.
In some embodiments of this application, based on the above embodiments, the weight obtaining unit includes:
a sequence position obtaining subunit, configured to obtain sequence position information of each text character in the mixed-language text information;
a first weight determining subunit, configured to determine a position attention allocation weight corresponding to each text character according to the sequence position information.
In some embodiments of this application, based on the above embodiments, the weight obtaining unit further includes:
a language type obtaining subunit, configured to obtain language type information of each text character;
a language weight determining subunit, configured to determine a language attention allocation weight corresponding to each text character according to the language type information;
a second weight determining subunit, configured to determine a multiple attention allocation weight corresponding to each text character according to the position attention allocation weight and the language attention allocation weight.
In some embodiments of this application, based on the above embodiments, the second weight determining subunit is specifically configured to:
obtain timbre identification information of the target timbre subject corresponding to each text character;
determine a timbre attention allocation weight corresponding to each text character according to the timbre identification information;
determine a multiple attention allocation weight corresponding to each text character according to the position attention allocation weight, the language attention allocation weight, and the timbre attention allocation weight.
In some embodiments of this application, based on the above embodiments, the information decoding module 730 includes:
a timbre identification obtaining unit, configured to obtain timbre identification information of the target timbre subject;
a timbre identification embedding unit, configured to perform mapping transformation processing on the timbre identification information through a timbre embedding matrix to obtain the target timbre feature of the target timbre subject.
In some embodiments of this application, based on the above embodiments, the audio synthesis apparatus 700 further includes:
a model obtaining module, configured to obtain a timbre conversion model trained with timbre data samples of the target timbre subject;
a timbre conversion module, configured to perform timbre conversion processing on the audio information through the timbre conversion model to obtain audio corresponding to the target timbre subject.
The specific details of the audio synthesis apparatus provided in the embodiments of this application have been described in detail in the corresponding method embodiments and are not repeated here.
FIG. 8 is a schematic structural diagram of a computer system of an electronic device suitable for implementing the embodiments of this application.
It should be noted that the computer system 800 of the electronic device shown in FIG. 8 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of this application.
As shown in FIG. 8, the computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage portion 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for system operation. The CPU 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 808 including a hard disk and the like; and a communication portion 809 including a network interface card such as a LAN (local area network) card or a modem. The communication portion 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage portion 808 as required.
In particular, according to the embodiments of this application, the processes described in the method flowcharts may be implemented as computer software programs. For example, an embodiment of this application includes a computer program product, which includes a computer program carried on a computer-readable medium; the computer program contains program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded from a network and installed through the communication portion 809, and/or installed from the removable medium 811. When the computer program is executed by the central processing unit (CPU) 801, the various functions defined in the system of this application are performed.
It should be noted that the computer-readable medium shown in the embodiments of this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this application, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device. In this application, the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium; the computer-readable medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to wireless or wired transmission, or any suitable combination of the above.

Claims (16)

  1. An audio synthesis method, performed by an electronic device, comprising:
    obtaining mixed-language text information, the mixed-language text information comprising text characters corresponding to at least two language types;
    performing text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information;
    obtaining a target timbre feature corresponding to a target timbre subject, and decoding the intermediate semantic encoding feature based on the target timbre feature to obtain an acoustic feature; and
    performing acoustic encoding processing on the acoustic feature to obtain audio corresponding to the mixed-language text information.
  2. The audio synthesis method according to claim 1, wherein the performing text encoding processing on the mixed-language text information based on the at least two language types to obtain the intermediate semantic encoding feature of the mixed-language text information comprises:
    performing text encoding processing on the mixed-language text information separately through monolingual text encoders corresponding to the respective language types, to obtain at least two monolingual encoding features of the mixed-language text information;
    fusing the at least two monolingual encoding features to obtain a mixed-language encoding feature of the mixed-language text information; and
    determining the intermediate semantic encoding feature according to the mixed-language encoding feature.
  3. The audio synthesis method according to claim 2, wherein the performing text encoding processing on the mixed-language text information separately through the monolingual text encoders corresponding to the respective language types, to obtain the at least two monolingual encoding features of the mixed-language text information, comprises:
    performing mapping transformation processing on the mixed-language text information separately through character embedding matrices corresponding to the respective language types, to obtain at least two embedded character features of the mixed-language text information; and
    performing text encoding processing on the embedded character features separately through the monolingual text encoders corresponding to the respective language types, to obtain the at least two monolingual encoding features of the mixed-language text information.
  4. The audio synthesis method according to claim 3, wherein the performing text encoding processing on the embedded character features separately through the monolingual text encoders corresponding to the respective language types, to obtain the at least two monolingual encoding features of the mixed-language text information, comprises:
    performing residual encoding on the embedded character features separately through the monolingual text encoders corresponding to the respective language types, to obtain at least two residual encoding features of the mixed-language text information; and
    fusing the embedded character features with the respective residual encoding features to obtain the at least two monolingual encoding features of the mixed-language text information.
  5. The audio synthesis method according to claim 3, wherein the monolingual encoding feature is a residual encoding feature obtained by performing residual encoding on the embedded character feature; and the fusing the at least two monolingual encoding features to obtain the mixed-language encoding feature of the mixed-language text information comprises:
    fusing the at least two monolingual encoding features and the embedded character features to obtain the mixed-language encoding feature of the mixed-language text information.
  6. The audio synthesis method according to claim 2, wherein the determining the intermediate semantic encoding feature according to the mixed-language encoding feature comprises:
    performing mapping transformation processing on the mixed-language text information through a language embedding matrix based on the at least two language types, to obtain an embedded language feature of the mixed-language text information; and
    fusing the mixed-language encoding feature and the embedded language feature to obtain the intermediate semantic encoding feature.
  7. The audio synthesis method according to claim 1, wherein the performing text encoding processing on the mixed-language text information based on the at least two language types to obtain the intermediate semantic encoding feature of the mixed-language text information comprises:
    performing text encoding processing on each text character in the mixed-language text information based on the at least two language types, to obtain a character encoding feature corresponding to each text character;
    obtaining an attention allocation weight corresponding to each text character; and
    applying, according to the attention allocation weight corresponding to each text character, weighted mapping to the character encoding feature corresponding to each text character, to obtain the intermediate semantic encoding feature.
  8. The audio synthesis method according to claim 7, wherein the obtaining the attention allocation weight corresponding to each text character comprises:
    obtaining sequence position information of each text character in the mixed-language text information; and
    determining a position attention allocation weight corresponding to each text character according to the sequence position information.
  9. The audio synthesis method according to claim 8, wherein the obtaining the attention allocation weight corresponding to each text character further comprises:
    obtaining language type information of each text character;
    determining a language attention allocation weight corresponding to each text character according to the language type information; and
    determining a multiple attention allocation weight corresponding to each text character according to the position attention allocation weight and the language attention allocation weight.
  10. The audio synthesis method according to claim 9, wherein the determining the multiple attention allocation weight corresponding to each text character according to the position attention allocation weight and the language attention allocation weight comprises:
    obtaining timbre identification information of the target timbre subject corresponding to each text character;
    determining a timbre attention allocation weight corresponding to each text character according to the timbre identification information; and
    determining the multiple attention allocation weight corresponding to each text character according to the position attention allocation weight, the language attention allocation weight, and the timbre attention allocation weight.
  11. The audio synthesis method according to claim 1, wherein the obtaining the target timbre feature corresponding to the target timbre subject comprises:
    obtaining timbre identification information of the target timbre subject; and
    performing mapping transformation processing on the timbre identification information through a timbre embedding matrix to obtain the target timbre feature.
  12. The audio synthesis method according to claim 1, wherein after the performing acoustic encoding processing on the acoustic feature to obtain the audio corresponding to the mixed-language text information, the method further comprises:
    obtaining a timbre conversion model trained with timbre data samples of the target timbre subject; and
    performing timbre conversion processing on the audio through the timbre conversion model to obtain audio corresponding to the target timbre subject.
  13. An audio synthesis apparatus, comprising:
    an information obtaining module, configured to obtain mixed-language text information, the mixed-language text information comprising text characters corresponding to at least two language types;
    an information encoding module, configured to perform text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information;
    an information decoding module, configured to obtain a target timbre feature corresponding to a target timbre subject, and decode the intermediate semantic encoding feature based on the target timbre feature to obtain an acoustic feature; and
    an acoustic encoding module, configured to perform acoustic encoding processing on the acoustic feature to obtain audio information corresponding to the mixed-language text information.
  14. A computer-readable medium storing a computer program, the computer program, when executed by a processor, implementing the audio synthesis method according to any one of claims 1 to 12.
  15. An electronic device, comprising:
    a processor; and
    a memory, configured to store executable instructions of the processor;
    wherein the processor is configured to perform, by executing the executable instructions, the audio synthesis method according to any one of claims 1 to 12.
  16. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the audio synthesis method according to any one of claims 1 to 12.
PCT/CN2021/085862 2020-05-13 2021-04-08 Audio synthesis method and apparatus, computer-readable medium, and electronic device WO2021227707A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/703,136 US12106746B2 (en) 2020-05-13 2022-03-24 Audio synthesis method and apparatus, computer readable medium, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010402599.7 2020-05-13
CN202010402599.7A CN112767910B (zh) 2020-05-13 2020-05-13 Audio information synthesis method and apparatus, computer-readable medium, and electronic device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/703,136 Continuation US12106746B2 (en) 2020-05-13 2022-03-24 Audio synthesis method and apparatus, computer readable medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2021227707A1 true WO2021227707A1 (zh) 2021-11-18

Family

ID=75693026

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/085862 WO2021227707A1 (zh) 2020-05-13 2021-04-08 音频合成方法、装置、计算机可读介质及电子设备

Country Status (3)

Country Link
US (1) US12106746B2 (zh)
CN (1) CN112767910B (zh)
WO (1) WO2021227707A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220101829A1 (en) * 2020-09-29 2022-03-31 Harman International Industries, Incorporated Neural network speech recognition system
CN113192498A * 2021-05-26 2021-07-30 北京捷通华声科技股份有限公司 Audio data processing method and apparatus, processor, and non-volatile storage medium
CN113421547B * 2021-06-03 2023-03-17 华为技术有限公司 Speech processing method and related device
CN113838450B * 2021-08-11 2022-11-25 北京百度网讯科技有限公司 Audio synthesis and corresponding model training method, apparatus, device, and storage medium
CN116072098B * 2023-02-07 2023-11-14 北京百度网讯科技有限公司 Audio signal generation method, model training method, apparatus, device, and medium
CN117174074A * 2023-11-01 2023-12-05 浙江同花顺智能科技有限公司 Speech synthesis method, apparatus, device, and storage medium
CN117594051B * 2024-01-17 2024-04-05 清华大学 Method and apparatus for controllable speaker audio representation for voice conversion

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09244679A (ja) * 1996-03-12 1997-09-19 Sony Corp Speech synthesis method and speech synthesis apparatus
CN107481713A * 2017-07-17 2017-12-15 清华大学 Mixed-language speech synthesis method and apparatus
CN109697974A * 2017-10-19 2019-04-30 百度(美国)有限责任公司 System and method for neural text-to-speech using convolutional sequence learning
CN109767755A * 2019-03-01 2019-05-17 广州多益网络股份有限公司 Speech synthesis method and system
CN111128114A * 2019-11-11 2020-05-08 北京大牛儿科技发展有限公司 Speech synthesis method and apparatus
CN111145720A * 2020-02-04 2020-05-12 清华珠三角研究院 Method, system, apparatus, and storage medium for converting text into speech
CN111247581A * 2019-12-23 2020-06-05 深圳市优必选科技股份有限公司 Multilingual text-to-speech synthesis method, apparatus, device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105845125B * 2016-05-18 2019-05-03 百度在线网络技术(北京)有限公司 Speech synthesis method and speech synthesis apparatus
WO2019139431A1 (ko) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Speech translation method and system using a multilingual text-to-speech synthesis model
JP7142333B2 (ja) * 2018-01-11 2022-09-27 ネオサピエンス株式会社 Multilingual text-to-speech synthesis method
CN110718208A * 2019-10-15 2020-01-21 四川长虹电器股份有限公司 Speech synthesis method and system based on a multi-task acoustic model


Also Published As

Publication number Publication date
CN112767910B (zh) 2024-06-18
US20220215827A1 (en) 2022-07-07
CN112767910A (zh) 2021-05-07
US12106746B2 (en) 2024-10-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21804858

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.04.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21804858

Country of ref document: EP

Kind code of ref document: A1