CN117153140A - Audio synthesis method, device, equipment and storage medium

Info

Publication number
CN117153140A
Authority
CN
China
Prior art keywords: training, target, accent, phonetic symbol, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210561989.8A
Other languages
Chinese (zh)
Inventor
杨丽兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL Technology Group Co Ltd
Original Assignee
TCL Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by TCL Technology Group Co Ltd
Priority to CN202210561989.8A
Publication of CN117153140A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The embodiments of the application disclose an audio synthesis method, apparatus, device and storage medium. The method includes: obtaining a standard phonetic symbol text of a preset text, the standard phonetic symbol text representing the standard pronunciation of the preset text; obtaining a target accent feature vector; obtaining a target identity; generating target sound characteristic parameters from the standard phonetic symbol text, the target accent feature vector and the target identity through an audio synthesis model; and converting the target sound characteristic parameters into target audio corresponding to the preset text, the target audio carrying the target accent corresponding to the target accent feature vector and the target timbre corresponding to the target identity. Because the synthesized audio is generated by the trained audio synthesis model, it can present different accents and timbres within the same language, which addresses the problem that some people cannot adapt to or understand Mandarin prompt or guidance voices.

Description

Audio synthesis method, device, equipment and storage medium
Technical Field
The present application relates to the field of computers, and in particular, to an audio synthesis method, apparatus, device, and storage medium.
Background
Artificial intelligence technology is a comprehensive discipline that involves a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech technology (Speech Technology) include automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the best human-computer interaction modes in the future.
However, current speech synthesis technology cannot synthesize, from text, audio with different accents or different timbres within the same language, which causes problems at the practical application level. For example, the operation prompts or guidance of a bank ATM are typically broadcast in Mandarin, and groups who do not use Mandarin (especially elderly people who have long lived in a dialect-speaking environment) may find it hard to adapt to, or may even be unable to understand, the specific meaning of the prompts or guidance.
Disclosure of Invention
The embodiments of the application provide an audio synthesis method, apparatus, device and storage medium, which can generate synthesized audio presenting different accents and timbres within the same language, and thereby solve the problem that some people cannot adapt to or understand Mandarin prompt or guidance voices.
The embodiment of the application provides an audio synthesis method, which comprises the following steps:
obtaining a standard phonetic symbol text of a preset text, wherein the standard phonetic symbol text is used for representing standard pronunciation of the preset text;
acquiring a target accent feature vector, wherein the target accent feature vector is used for representing a target accent;
acquiring a target identity, wherein the target identity is used for representing a target tone;
generating target sound characteristic parameters according to the standard phonetic symbol text, the target accent feature vector and the target identity through an audio synthesis model;
and converting the target sound characteristic parameters into target audio corresponding to the preset text, wherein the target audio carries target accents corresponding to the target accent characteristic vectors and the target tone corresponding to the target identity.
In the above embodiment, the obtaining the target accent feature vector includes:
Receiving user audio of a target user;
extracting user sound characteristic parameters corresponding to the user audio;
and generating the target accent feature vector according to the user sound feature parameters through a first accent feature extraction network in the audio synthesis model.
Optionally, the acquiring the target accent feature vector includes:
acquiring an accent phonetic symbol text, wherein the accent phonetic symbol text is used for representing the pronunciation of the target accent of the preset text;
and generating the target accent feature vector according to the accent phonetic symbol text through a second accent feature extraction network in the audio synthesis model.
Optionally, the audio synthesis model comprises an encoder, an embedded network, and a decoder;
generating, by the audio synthesis model, a target sound feature parameter according to the standard phonetic symbol text, the target accent feature vector, and the identity, including:
generating, by the encoder, an output vector from the standard phonetic symbol text;
generating an identity vector according to the target identity through the embedded network;
calculating the summation of the target accent feature vector, the output vector and the identity identification vector, and obtaining a summation result;
And generating the target sound characteristic parameters according to the summation result through the decoder.
Optionally, before the standard phonetic symbol text of the preset text is acquired, the method further includes:
acquiring a training accent phonetic symbol text set, wherein the training accent phonetic symbol text set comprises a first number of training accent phonetic symbol texts, the first number of training accent phonetic symbol texts are jointly generated by at least two training users, and each training accent phonetic symbol text is used for representing the accent pronunciation of the training user to which the training accent phonetic symbol text belongs;
acquiring a training standard phonetic symbol text set, wherein the training standard phonetic symbol text set comprises a first number of training standard phonetic symbol texts, the training standard phonetic symbol texts are in one-to-one correspondence with the training accent phonetic symbol texts, and each training standard phonetic symbol text is used for representing standard pronunciation corresponding to the accent pronunciation;
acquiring an identity corresponding to each training user in the at least two training users;
acquiring training audio corresponding to the first number of training accent phonetic texts, and acquiring training sound characteristic parameters based on the training audio;
and training the initial audio synthesis model according to the training accent phonetic symbol text set, the training standard phonetic symbol text set, the at least two identities and the training sound characteristic parameters to obtain an audio synthesis model.
In the above embodiment, the initial audio synthesis model includes an initial first accent feature extraction network, an initial second accent feature extraction network, an initial embedding network, an initial encoder, and an initial decoder;
training the initial audio synthesis model according to the training accent phonetic symbol text set, the training standard phonetic symbol text set, the at least two identities and the training sound characteristic parameters to obtain the audio synthesis model, including:
acquiring the training sound characteristic parameters corresponding to a first training accent phonetic symbol text;
generating a training first accent feature vector according to the training sound characteristic parameters through the initial first accent feature extraction network;
generating a training second accent feature vector according to a first training accent phonetic symbol text through the initial second accent feature extraction network, wherein the first training accent phonetic symbol text is any one of the first number of training accent phonetic symbol texts;
generating a training identity vector according to the identity corresponding to the first training accent phonetic symbol text through the initial embedded network;
generating a training output vector according to the training standard phonetic symbol text corresponding to the first training accent phonetic symbol text by the initial encoder;
Acquiring training output sound characteristic parameters based on the training second accent characteristic vector, the training identity vector and the training output vector;
calculating a first loss according to the training sound characteristic parameters and the training output sound characteristic parameters;
calculating a second loss according to the training first accent feature vector and the training second accent feature vector;
and adjusting parameters of the initial audio synthesis model according to the sum of the first loss and the second loss, and determining the initial audio synthesis model containing the adjusted parameters as an audio synthesis model when the sum of the first loss and the second loss reaches a preset condition.
In the foregoing embodiment, the obtaining the training output sound feature parameter based on the training second accent feature vector, the training identity vector, and the training output vector includes:
calculating a training vector sum of the training second accent feature vector, the training identity vector and the training output vector;
and generating training output sound characteristic parameters according to the training vector sum through the initial decoder.
The embodiment of the application also provides an audio synthesis device, which comprises:
The standard phonetic symbol text module is used for acquiring standard phonetic symbol text of a preset text, wherein the standard phonetic symbol text is used for representing standard pronunciation of the preset text;
the target accent feature vector module is used for acquiring a target accent feature vector, and the target accent feature vector is used for representing a target accent;
the target identity module is used for acquiring a target identity, and the target identity is used for representing a target tone;
the synthesis module is used for generating target sound characteristic parameters according to the standard phonetic symbol text, the target accent characteristic vector and the target identity through an audio synthesis model;
the conversion module is used for converting the target sound characteristic parameters into target audio corresponding to the preset text, wherein the target audio carries target accents corresponding to the target accent characteristic vectors and the target timbres corresponding to the target identity marks.
The embodiment of the application also provides equipment, which comprises a processor and a memory, wherein the memory stores a plurality of instructions; the processor loads instructions from the memory to perform steps in the audio synthesis method.
The embodiment of the application also provides a storage medium, which is characterized in that the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to execute the steps in the audio synthesis method.
According to the embodiments of the application, synthesized audio can be generated according to the user's requirements, and operation prompts or guidance can be broadcast to the user through the synthesized audio, so that the user can readily understand them. The embodiments of the application can adopt different pronunciation annotation schemes for different languages, so they are applicable not only to Chinese but have general applicability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1a is a schematic diagram of an application scenario of an audio synthesis method according to an embodiment of the present application;
fig. 1b is a schematic flow chart of an audio synthesis method according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of steps 141 to 144 in an audio synthesis method according to an embodiment of the present application;
fig. 3 is a schematic flow chart of steps T1 to T5 in an audio synthesis method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio synthesis device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The terms "first," "second," and the like in this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. Meanwhile, the term "includes" and any form of modification thereof are intended to cover non-exclusive inclusion.
The embodiment of the application provides an audio synthesis method, an audio synthesis device, audio synthesis equipment and a storage medium.
The audio synthesis device may be integrated in a device, which may be a terminal, a server, or the like. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer (Personal Computer, PC) or the like; the server may be a single server or a server cluster composed of a plurality of servers.
In some embodiments, the audio synthesis apparatus may also be integrated in a plurality of devices, for example, the audio synthesis apparatus may be integrated in a plurality of servers, and the audio synthesis method of the present application is implemented by the plurality of servers.
In some embodiments, the server may also be implemented in the form of a terminal.
Referring to fig. 1a, fig. 1a shows an application scenario of the audio synthesis method provided by a specific implementation of an embodiment of the present application. Assume that, while using an ATM, a user wishes to obtain an operation prompt voice or a guidance voice presented in the timbre of celebrity A; the following steps are then performed:
and obtaining a standard phonetic symbol text of the preset text, wherein the standard phonetic symbol text is used for representing standard pronunciation of the preset text. In this embodiment, the terminal stores a preset text in advance, where the preset text is a response word set by the user in the use process, and the preset text may also be an operation prompt word or a guide word of the user in the use process, for example, "you good", "thank you", "please input a password", and so on. And simultaneously, based on the standard pronunciation corresponding to the preset text, carrying out pronunciation marking on the preset text to obtain the standard phonetic symbol text.
And obtaining a target accent feature vector, wherein the target accent feature vector is used for representing the target accent. In this embodiment, the target accent feature vector represents the accent characteristics of the target accent; specifically, the terminal performs feature extraction, through the loaded software, on the user speech input into the terminal, so as to obtain the target accent feature vector representing the accent characteristics of the user.
And obtaining a target identity, wherein the target identity is used for representing the target timbre. In some embodiments of the present application, the target identity may be used to characterize the timbre of a young man, a young woman, an elderly man, an elderly woman, and so on; the target identity may also be used to characterize the timbre of celebrity A, celebrity B or celebrity C. For example, in one embodiment of the present application, the target identity selected by the user may be the identity of celebrity A, i.e. the target identity is used to characterize the timbre of celebrity A. Specifically, in fig. 1a, the target identity is speaker_id.
And generating target sound characteristic parameters according to the standard phonetic symbol text, the target accent feature vector and the target identity through an audio synthesis model. In this embodiment, the standard phonetic symbol text, the target accent feature vector and the target identity are input into the audio synthesis model carried in the terminal system, and the target sound characteristic parameters are obtained through model calculation.
And converting the target sound characteristic parameters into target audio corresponding to the preset text, wherein the target audio carries the target accent corresponding to the target accent feature vector and the target timbre corresponding to the target identity; that is, the target audio sounds as if the preset text were read aloud with the target accent and the target timbre. In this embodiment, a vocoder is used to convert the target sound characteristic parameters into the target audio. Through the software interface carried by the terminal, the user thus obtains several segments of audio that speak the preset text content while fusing the user's own accent with the timbre of celebrity A, which makes the audio feel familiar to the user and helps the user understand the content of the preset text with the aid of the celebrity's timbre.
The following will describe in detail. The numbers of the following examples are not intended to limit the preferred order of the examples.
In this embodiment, an audio synthesis method based on a speech synthesis technology related to artificial intelligence is provided, as shown in fig. 1b, and the specific flow of the audio synthesis method includes steps 110 to 150:
110. And obtaining a standard phonetic symbol text of the preset text, wherein the standard phonetic symbol text is used for representing standard pronunciation of the preset text.
In some embodiments of the present application, the preset text is a response word set by the user in the use process, and the preset text may also be an operation prompt word or a guidance word of the user in the use process.
In some embodiments of the present application, the standard phonetic symbol text is obtained by annotating the pronunciation of the preset text, either manually or by software carried by the system. The pronunciation annotation scheme is chosen according to the language of the preset text; the language of the preset text is not limited and may be Chinese, English, French, German, and so on. If the preset text is Chinese, the pronunciation annotation scheme is based on Mandarin pinyin; if the preset text is English, it is based on standard English phonetic symbols; for preset texts in other languages, the corresponding phonetic symbols may be used.
Specifically, taking a Chinese preset text as an example, its pronunciation is annotated in the following manner:
The pronunciation is annotated with Mandarin pinyin, with polyphonic characters and other irregular pronunciations annotated accurately, so as to obtain a pinyin sequence marked according to Mandarin, i.e. the standard phonetic symbol text. The standard phonetic symbol text includes initials, finals and tones, but does not include marks for any other accent features;
In addition, the following characteristics are labeled, by level, on the initial, the final, or the whole syllable of each pinyin:
Tone weight: labeled on the initial or the final, divided into 5 levels, where 1 means "ultra-light", 2 means "light", 3 means "normal", 4 means "heavy" and 5 means "ultra-heavy";
Pronunciation length: labeled on the initial or the final, divided into 5 levels, where 1 means "ultra-short", 2 means "short", 3 means "normal", 4 means "long" and 5 means "ultra-long";
Pause length: labeled on the whole pinyin syllable, indicating the length of the pause after the character, divided into 5 levels, where 1 means "ultra-short", 2 means "short", 3 means "normal", 4 means "long" and 5 means "ultra-long";
Tone shift: labeled on the whole pinyin syllable, representing a shift from the standard Mandarin tone to an accented tone, expressed as two numbers; for example (1, 2) means a shift from tone 1 to tone 2, and (0, 0) means no shift;
Nasal weight: labeled on the whole pinyin syllable, divided into 5 levels, where 1 means "ultra-light", 2 means "light", 3 means "normal", 4 means "heavy" and 5 means "ultra-heavy";
Erhua (retroflex ending): labeled on the final, where 1 means an erhua pronunciation and 0 means normal;
Flat versus retroflex tongue: labeled on the initial, where 1 and 2 respectively mark the two directions of confusion between flat-tongue and retroflex initials (a retroflex initial pronounced flat, or a flat-tongue initial pronounced retroflex), and 0 means normal;
Three accent vectors are then assembled from these level labels:
the initial accent vector, a three-dimensional vector: (tone weight, pronunciation length, flat/retroflex tongue), for example (4, 4, 0);
the final accent vector, a three-dimensional vector: (tone weight, pronunciation length, erhua), for example (2, 4, 1);
the whole-syllable accent vector, a four-dimensional vector: (pronunciation length, pause length, tone shift, nasal weight), for example (1, 4, 2, 4).
The resulting pinyin sequence therefore contains: the initial, the initial accent vector, the final plus tone, the final accent vector, and the whole-syllable accent vector (a minimal data-structure sketch is given after the examples below).
For example:
if the text is "Shanxi words", the corresponding pinyin sequence is:
sh(3,5,0)an3(4,4,0)(3,3,2,5)
b(2,2,0)ei3(2,2,0)(3,0,0,4)
h(3,3,0)ua4(2,2,0)(3,0,0,4);
if the text is "Jiangzhe words", the corresponding pinyin sequence is:
j(2,2,0)ian1(2,3,0)(3,0,0,3)
z(2,2,1)he4(2,3,0)(3,4,3,3)
h(2,3,0)ua4(2,4,0)(3,0,0,3).
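To make the format of these pinyin sequences concrete, the following is a minimal Python sketch of a data structure that holds one annotated syllable and renders it in the sequence format of the examples above; the class name, field names and the choice of language are illustrative assumptions and are not part of the application.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AnnotatedSyllable:
    """One pinyin syllable with the accent annotations described above (names are illustrative)."""
    initial: str                            # e.g. "sh"
    initial_vec: Tuple[int, int, int]       # (tone weight, pronunciation length, flat/retroflex tongue)
    final: str                              # e.g. "an"
    tone: int                               # Mandarin tone, e.g. 3
    final_vec: Tuple[int, int, int]         # (tone weight, pronunciation length, erhua)
    whole_vec: Tuple[int, int, int, int]    # (pronunciation length, pause length, tone shift, nasal weight)

    def render(self) -> str:
        """Render the syllable in the sequence format used in the examples above."""
        return (f"{self.initial}{self.initial_vec}{self.final}{self.tone}"
                f"{self.final_vec}{self.whole_vec}").replace(" ", "")

# First syllable of the first example sequence above
syllable = AnnotatedSyllable("sh", (3, 5, 0), "an", 3, (4, 4, 0), (3, 3, 2, 5))
print(syllable.render())   # -> sh(3,5,0)an3(4,4,0)(3,3,2,5)
```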
120. and obtaining a target accent feature vector, wherein the target accent feature vector is used for representing the target accent.
In some embodiments of the present application, the target accent feature vector is used to characterize the accent characteristics of the target accent, and the target accent feature vector is obtained by the terminal performing feature extraction processing on the target accent.
In some embodiments of the present application, the step 120 may include two specific implementations as follows:
(1) In this embodiment, the step 120 includes the process shown in steps 121a to 123a:
121a, receiving user audio of the target user.
In some embodiments of the present application, the user audio of the target user may be any audio recorded and input by the user who is using the method, any audio of another person recorded in advance on behalf of that user, or any audio containing speech obtained from the network.
The user audio may be stored in the form of an audio file, which in this embodiment is in wav format with a sampling frequency of 24kHz.
The wav format is an audio file format developed for the Windows platform that conforms to the RIFF (Resource Interchange File Format) standard; it is widely supported by audio software and can meet relatively high sound-quality requirements.
The sampling frequency is the number of sound samples taken per second. Sound is an energy wave with frequency (along the time axis) and amplitude (along the level axis). The waveform is continuous, but because storage space is limited it must be sampled during digital encoding; sampling extracts the value of the signal at discrete points in time. The more points extracted per second, i.e. the higher the sampling frequency, the more faithfully the sound is restored and the better the quality, but the more resources are consumed. Beyond a certain point a higher frequency brings no benefit, because the resolution of the human ear is limited. For example, 22050 Hz is a common sampling frequency, 44100 Hz is already CD quality, and sampling frequencies above 48000 Hz or 96000 Hz are meaningless to the human ear.
122a, extracting the user voice characteristic parameters corresponding to the user audio.
In some embodiments of the present application, the user sound characteristic parameter is a mel spectrum corresponding to the user audio.
The mel spectrum (mel-spectrogram) is a spectrum on the mel scale, obtained by passing the linear spectrum through a bank of mel filters.
In some embodiments of the present application, the user sound characteristic parameters may instead be the mel-frequency cepstral coefficients (MFCC, Mel Frequency Cepstral Coefficients) or the linear prediction cepstral coefficients (LPCC, Linear Predictive Cepstral Coefficients) corresponding to the user audio.
In some embodiments of the present application, the mel spectrum corresponding to the user audio may be obtained as follows (an illustrative sketch is given after these steps):
pre-emphasizing, framing and windowing the audio signal corresponding to the user audio;
performing a short-time Fourier transform (STFT) on each frame of the audio signal to obtain the short-time magnitude spectrum;
passing the short-time magnitude spectrum through a mel filter bank (Mel Filter Banks) to obtain the mel spectrum.
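As an illustration of these steps, here is a minimal sketch using the librosa library; the FFT size, hop length, number of mel filters and the placeholder file name are assumptions and would be chosen to match the actual system rather than values fixed by the application.

```python
import librosa
import numpy as np

def extract_mel(path: str,
                sr: int = 24000,        # 24 kHz, matching the wav files described above
                n_fft: int = 1024,      # assumed frame/FFT length
                hop_length: int = 256,  # assumed hop length
                n_mels: int = 80) -> np.ndarray:
    """Pre-emphasis -> framing/windowing + STFT -> magnitude spectrum -> mel filter bank."""
    y, _ = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y)                           # pre-emphasis
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)   # framing, windowing and STFT
    magnitude = np.abs(stft)                                     # short-time magnitude spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return mel_fb @ magnitude                                    # mel spectrum, shape (n_mels, frames)

mel = extract_mel("user_audio.wav")   # "user_audio.wav" is a placeholder file name
```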
123a, generating the target accent feature vector according to the user sound feature parameters through a first accent feature extraction network in the audio synthesis model.
In some embodiments of the present application, the first accent feature extraction network consists, in order, of a multi-layer 2D convolutional network, a one-layer RNN and a one-layer fully connected network, and the generated target accent feature vector is a 256-dimensional feature vector.
A CNN (convolutional neural network) processes each input independently; in video processing, for example, a CNN can be used to recognize each frame of an image, but it does not by itself take information in the time dimension into account.
Specifically, assuming the original image size is 14×14×3 (3 channels), convolving it with 32 convolution kernels of size 5×5×3 (the depth 3 matching the number of channels) yields a feature map of size 10×10×32.
An RNN (Recurrent Neural Network) is a class of recurrent neural networks that take sequence data as input and perform recurrence along the direction in which the sequence evolves, with all nodes (recurrent units) connected in a chain.
A fully connected network (Fully Connected Neural Network) is characterized by a connection between every pair of nodes; in essence it links all inputs to all outputs, with the advantages of high throughput, high reliability and low delay.
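A minimal sketch of such a first accent feature extraction network is given below, assuming PyTorch; the number of convolution layers, the channel widths and the use of a GRU as the recurrent layer are assumptions, and only the overall 2D-convolution, then RNN, then fully-connected, 256-dimensional-output structure follows the description above.

```python
import torch
import torch.nn as nn

class AccentFeatureExtractor(nn.Module):
    """Mel spectrogram (1 x n_mels x frames) -> 256-dimensional accent feature vector."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(                        # multi-layer 2D convolution
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        conv_freq = n_mels // 4                           # frequency bins left after two stride-2 convolutions
        self.rnn = nn.GRU(64 * conv_freq, hidden, batch_first=True)  # one recurrent layer
        self.fc = nn.Linear(hidden, 256)                  # one fully connected layer

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, frames)
        x = self.conv(mel)                                # (batch, 64, n_mels/4, frames/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)    # time-major sequence for the RNN
        _, h = self.rnn(x)                                # last hidden state summarises the utterance
        return self.fc(h[-1])                             # (batch, 256) accent feature vector

vec = AccentFeatureExtractor()(torch.randn(2, 1, 80, 200))   # -> torch.Size([2, 256])
```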
(2) In this embodiment, the step 120 includes the process shown in steps 121b to 122b:
121b, acquiring an accent phonetic symbol text, wherein the accent phonetic symbol text is used for representing the pronunciation of the target accent of the preset text.
In some embodiments of the present application, the accent phonetic symbol text is obtained by annotating the pronunciation of the preset text, either manually or by software carried by the terminal. Accent phonetic symbol texts corresponding to several different accents of the preset text can be obtained in this way, and these accent phonetic symbol texts are then displayed on the display screen as virtual buttons for the user to click.
For example, if the accent phonetic symbol texts cover several different accents, such as accent phonetic symbol text with a Sichuan accent, accent phonetic symbol text with a Guangdong accent and accent phonetic symbol text with a Shandong accent, then three virtual buttons can correspondingly be displayed on the display screen: Sichuan accent, Guangdong accent and Shandong accent. The user selects the desired accent phonetic symbol text by clicking one of the three virtual buttons.
122b, generating the target accent feature vector according to the accent phonetic symbol text through a second accent feature extraction network in the audio synthesis model.
In some embodiments of the present application, the second accent feature extraction network comprises a three-layer LSTM network, and the obtained target accent feature vector is a 256-dimensional feature vector, where the target accent feature vector is used to characterize the accent features desired by the user.
An LSTM (Long Short-Term Memory) network is a special kind of RNN. When training an ordinary RNN, gradient explosion or gradient vanishing easily occurs as the training sequences become longer and the number of network layers grows, so longer sequence data cannot be processed and information from distant positions in the sequence cannot be captured. In contrast to a plain RNN, an LSTM maintains a persistent cell state between its recurrent units, which decides which information to forget and which to carry forward. Information in a long time series can therefore be transferred and expressed effectively, without useful information from long ago being ignored (forgotten); at the same time, the LSTM network alleviates the gradient vanishing and gradient explosion problems that frequently occur in RNNs.
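Along the same lines, a sketch of the second accent feature extraction network is shown below, again assuming PyTorch; the vocabulary size and embedding dimension are assumptions, and only the three-layer LSTM and the 256-dimensional output follow the description above.

```python
import torch
import torch.nn as nn

class AccentTextEncoder(nn.Module):
    """Accent phonetic-symbol token ids -> 256-dimensional target accent feature vector."""
    def __init__(self, vocab_size: int = 512, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)      # phonetic symbols -> vectors
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=3, batch_first=True)  # three-layer LSTM
        self.proj = nn.Linear(hidden, 256)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, sequence_length) integer ids of the accent phonetic symbols
        x = self.embed(tokens)
        _, (h, _) = self.lstm(x)        # h: (num_layers, batch, hidden)
        return self.proj(h[-1])         # (batch, 256) accent feature vector

vec = AccentTextEncoder()(torch.randint(0, 512, (2, 40)))   # -> torch.Size([2, 256])
```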
130. And obtaining a target identity, wherein the target identity is used for representing the target tone.
In some embodiments of the present application, the target identity may be used to represent the timbre of a young man, a young woman, an elderly man or an elderly woman, according to the application scenario and the actual requirements of the user; the target identity may also be used to represent the timbre of celebrity A, celebrity B or celebrity C, or the timbre typical of different dialect regions within the same language.
For example, the target identities may include: young man, young woman, elderly man and elderly woman. Correspondingly, four virtual buttons can be displayed on the display screen: young man, young woman, elderly man and elderly woman, and the user may select the desired timbre by clicking any one of the four virtual buttons.
140. And generating target sound characteristic parameters according to the standard phonetic symbol text, the target accent characteristic vector and the target identity through an audio synthesis model.
In some embodiments of the application, the audio synthesis model includes an encoder, an embedded network, and a decoder.
Specifically, an encoder-decoder network comprises the encoder and the decoder. An encoder-decoder network is a special neural network used for feature extraction and data dimensionality reduction. The simplest auto-encoder consists of an input layer, a hidden layer and an output layer; the mapping from the input layer to the hidden layer is the encoder, and the mapping from the hidden layer to the output layer is the decoder.
An embedding network uses a DNN to embed different target identities into distinct points of a vector space, so as to obtain a vector representing the timbre characteristics corresponding to the target identity.
Embedding, better understood as "vector mapping", simply means representing an entity, which may be a person or an object, by a vector; the vectors representing different entities are generally different.
In this embodiment, in combination with the field of speech synthesis technology (TTS), the core of the embedded network is a Speaker Embedding network.
Speaker Embedding networks are typically used in end-to-end TTS networks for multi-person models, and in this embodiment, the characteristics of Speaker Embedding networks enable them to clearly distinguish the timbres corresponding to different target identities.
In the embodiment of the present application, the step 140 includes the processes shown in steps 141 to 144:
141. and generating an output vector according to the standard phonetic symbol text by the encoder.
In some embodiments of the present application, the standard phonetic symbol text corresponding to the preset text is input to the encoder, and the output vector is obtained through feature encoding processing of the encoder.
142. And generating an identity vector according to the target identity through the embedded network.
In some embodiments of the present application, the target id, that is, speaker_id, is input to the embedded network, that is, speaker Embedding network, and feature encoding is performed through Speaker Embedding network to obtain the id vector, where the id vector is used to characterize the target timbre corresponding to the target id.
143. And calculating the summation of the target accent feature vector, the output vector and the identity identification vector, and obtaining a summation result.
144. And generating the target sound characteristic parameters according to the summation result through the decoder.
In some embodiments of the present application, the target sound characteristic parameter is a mel spectrum corresponding to the preset text.
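Steps 141 to 144 can be sketched as follows, under the same PyTorch assumption; the encoder and decoder here are simple placeholders rather than the networks actually used, and only the speaker-embedding lookup, the summation of the three vectors and the decoding into mel-spectrum frames follow the description above.

```python
import torch
import torch.nn as nn

class AudioSynthesisModel(nn.Module):
    """Standard phonetic-symbol text + accent feature vector + target identity -> mel-spectrum frames."""
    def __init__(self, vocab_size=512, dim=256, n_speakers=50, n_mels=80):
        super().__init__()
        self.symbol_embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)         # placeholder encoder
        self.speaker_embedding = nn.Embedding(n_speakers, dim)     # the embedded (Speaker Embedding) network
        self.decoder = nn.LSTM(dim, dim, batch_first=True)         # placeholder decoder body
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, phonetic_ids, accent_vec, speaker_id):
        # phonetic_ids: (batch, T) standard phonetic-symbol ids
        # accent_vec:   (batch, 256) target accent feature vector
        # speaker_id:   (batch,)     target identity
        enc_out, _ = self.encoder(self.symbol_embed(phonetic_ids))       # step 141: output vector
        id_vec = self.speaker_embedding(speaker_id)                      # step 142: identity vector
        summed = enc_out + accent_vec.unsqueeze(1) + id_vec.unsqueeze(1) # step 143: summation
        dec_out, _ = self.decoder(summed)                                # step 144: decode
        return self.to_mel(dec_out)                                      # (batch, T, n_mels) mel spectrum

model = AudioSynthesisModel()
mel = model(torch.randint(0, 512, (2, 30)), torch.randn(2, 256), torch.tensor([3, 7]))
```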
150. And converting the target sound characteristic parameters into target audio corresponding to the preset text, wherein the target audio carries target accents corresponding to the target accent characteristic vectors and the target tone corresponding to the target identity.
In some embodiments of the present application, a vocoder is used to convert the target sound characteristic parameters into the target audio, and the accent and timbre presented by the target audio are the target accent and the target timbre; that is, through the software interface carried by the terminal, the user obtains at least one segment of audio that speaks the preset text content while fusing the target accent and the target timbre. Specifically, the target accent may come from the mel spectrum corresponding to audio input by the user, or from the accent phonetic symbol text: based on the pronunciation annotation scheme described above, the accent phonetic symbol text represents the preset text content as spoken with the target accent.
In some embodiments of the present application, the converting the target sound characteristic parameter into target audio using a vocoder, i.e., reconstructing the target audio through mel spectrum, includes the steps of:
the mel spectrum is converted into an amplitude spectrum;
reconstructing waveforms by a griffin-lim vocoder algorithm;
de-emphasis to obtain the target audio, namely the synthesized audio.
In some embodiments of the present application, the vocoders used include, but are not limited to, Griffin-Lim, WaveNet and MelGAN.
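For the Griffin-Lim path, a hedged sketch using librosa is shown below; the FFT size, hop length and the assumption that the mel spectrum was built from the magnitude spectrum are illustrative and must match the analysis settings actually used when the mel spectrum was produced.

```python
import librosa
import numpy as np

def mel_to_audio(mel: np.ndarray, sr: int = 24000,
                 n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Mel spectrum -> magnitude spectrum -> Griffin-Lim waveform -> de-emphasis."""
    magnitude = librosa.feature.inverse.mel_to_stft(
        mel, sr=sr, n_fft=n_fft, power=1.0)                      # mel spectrum to amplitude spectrum
    wav = librosa.griffinlim(magnitude, hop_length=hop_length)   # Griffin-Lim waveform reconstruction
    return librosa.effects.deemphasis(wav)                       # undo the pre-emphasis applied at analysis time
```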
Through the process of steps 110 to 150, the embodiments of the application can obtain synthesized audio that fuses the timbre desired by the user, the accent desired by the user and the preset text, so that the user can understand the content of the preset text. This solves the problem that some people cannot adapt to or understand Mandarin broadcasts, and allows the user to take the next action according to the content of the preset text.
In some embodiments of the present application, as shown in fig. 3, before the standard phonetic symbol text of the preset text is obtained, the method further includes the process shown in steps T1 to T5:
t1, acquiring a training accent phonetic symbol text set, wherein the training accent phonetic symbol text set comprises a first number of training accent phonetic symbol texts, the first number of training accent phonetic symbol texts are jointly generated by at least two training users, and each training accent phonetic symbol text is used for representing the accent pronunciation of the training user to which the training accent phonetic symbol text belongs.
In some embodiments of the present application, the number of training users, who come from different regions but speak the same language, may be set to 50.
Specifically, the training users come from different provincial regions of China. In order to enrich the training accent phonetic symbol text set and cover the different accent characteristics found in the Chinese context, the birthplaces of the training users may cover all provincial administrative units of China.
Meanwhile, the training users read different training texts aloud in Mandarin with their own accents; the training texts may contain local habitual expressions, but should not contain dialect-exclusive words or words that cannot be expressed in Mandarin.
The reading should be natural, coherent and clear; each training user keeps his or her own style and does not need to read in a deliberate manner. The aim of the reading is to use Mandarin while keeping each speaker's own accent, such as tone weight, pronunciation length, pause length, tone-shift habits, flat versus retroflex tongue, and so on.
In some embodiments of the present application, the training texts are 100 different Chinese texts; each Chinese text read by a training user is annotated as one training accent phonetic symbol text, so the 50 training users together generate a total of 100 × 50 = 5000 training accent phonetic symbol texts, i.e. the first number is 5000.
T2, acquiring a training standard phonetic symbol text set, wherein the training standard phonetic symbol text set comprises a first number of training standard phonetic symbol texts, the training standard phonetic symbol texts are in one-to-one correspondence with the training accent phonetic symbol texts, and each training standard phonetic symbol text is used for representing standard pronunciation corresponding to the accent pronunciation.
In the above embodiment of the present application, as can be seen from the foregoing, the number of training standard phonetic symbol texts is also 5000, and they are obtained by annotating the training texts according to Mandarin pinyin, either manually or by software carried by the terminal.
And T3, acquiring the identity identifier corresponding to each training user in the at least two training users.
Continuing the above example, there are 50 identities, corresponding one-to-one to the training users.
And T4, acquiring training audio corresponding to the first number of training accent phonetic texts, and acquiring training sound characteristic parameters based on the training audio.
In some embodiments of the present application, the training users read the training texts aloud in Mandarin with their own accents, and the reading is recorded as audio; specifically, each sentence of Chinese text read by a training user is recorded as one segment of audio.
Continuing the above example, the number of training audio segments is 5000. Based on the 5000 segments of training audio, the training sound characteristic parameters are obtained through the steps described above; specifically, the training sound characteristic parameters are training mel spectra.
And T5, training the initial audio synthesis model according to the training accent phonetic symbol text set, the training standard phonetic symbol text set, at least two identification marks and the training sound characteristic parameters to obtain an audio synthesis model.
In some embodiments of the application, the initial audio synthesis model includes an initial first accent feature extraction network, an initial second accent feature extraction network, an initial embedding network, an initial encoder, and an initial decoder.
Optionally, step T5 includes a process as described in steps T51 to T59:
and T51, acquiring the training sound characteristic parameters corresponding to the first training accent phonetic symbol text.
In some embodiments of the present application, the first training accent phonetic symbol text corresponds to a segment of the training audio, and the training sound characteristic parameter corresponding to the segment of the training audio is obtained through the foregoing related steps.
And T52, generating a training first accent feature vector according to the training sound characteristic parameters through the initial first accent feature extraction network.
In some embodiments of the present application, the training sound characteristic parameters are input into the initial first accent feature extraction network, and the training first accent feature vector is obtained through the feature extraction processing of the initial first accent feature extraction network.
Continuing the above example, the training sound characteristic parameters are the mel spectra corresponding to the 5000 segments of training audio.
And T53, generating a training second accent feature vector according to a first training accent phonetic symbol text through the initial second accent feature extraction network, wherein the first training accent phonetic symbol text is any one of the first number of training accent phonetic symbol texts.
In some embodiments of the present application, the first training accent phonetic symbol text, i.e. any one of the 5000 training accent phonetic symbol texts, is input into the initial second accent feature extraction network, and the training second accent feature vector is obtained through the feature extraction processing of the initial second accent feature extraction network.
And T54, generating a training identity identification vector according to the identity identification corresponding to the first training accent phonetic symbol text through the initial embedded network.
In some embodiments of the present application, the identity corresponding to the first training accent phonetic symbol text is input to the initial embedded network, and feature encoding processing is performed through the embedded network, so as to obtain the training identity vector.
And T55, generating a training output vector according to the training standard phonetic symbol text corresponding to the first training accent phonetic symbol text by the initial encoder.
In some embodiments of the present application, the training standard phonetic symbol text corresponding to the first training accent phonetic symbol text is input to the initial encoder, and feature encoding processing is performed by the encoder, so as to obtain the training output vector.
And T56, acquiring training output sound characteristic parameters based on the training second accent characteristic vector, the training identity vector and the training output vector.
Optionally, step T56 includes a process as described in steps T561 to T562.
And T561, calculating the training vector sum of the training second accent feature vector, the training identity vector and the training output vector.
And T562, generating training output sound characteristic parameters according to the training vector sum through the initial decoder.
In some embodiments of the present application, by calculating the training vector sum, the accent features corresponding to the training second accent feature vector and the timbre features corresponding to the training identity vector are superimposed, during training, onto the encoded training standard phonetic symbol text, and the training output sound characteristic parameters are then obtained through the decoding processing of the initial decoder.
And T57, calculating a first loss according to the training sound characteristic parameters and the training output sound characteristic parameters.
In some embodiments of the application, the first loss is a first MSE loss calculated from the training sound characteristic parameter and the training output sound characteristic parameter.
MSE, the mean square error (Mean Square Error), is the most commonly used regression loss; it is calculated by averaging the squared distances between the predicted value f(x) and the target value y, as follows:
MSE = (1/n) * Σ_{i=1..n} (f(x_i) - y_i)^2
In some embodiments of the application, n is the number of pairs of predicted values f(x) and target values y obtained during training.
Specifically, as can be seen from the above formula, the mean square error is smallest when the difference between the predicted value f(x) and the target value y is 0. Accordingly, after training for a preset number of iterations, when the MSE loss corresponds to the lowest point of the loss curve, the MSE loss no longer decreases, and training can be considered to have converged at this point.
And T58, calculating a second loss according to the training first accent feature vector and the training second accent feature vector.
In some embodiments of the application, the second loss is a second MSE loss calculated from the training first accent feature vector and the training second accent feature vector.
In some embodiments of the application, the calculation of the second MSE loss is also based on the above calculation formula, i.e. the sum of squares of the distances between the training first accent feature vector and the training second accent feature vector.
And T59, adjusting parameters of the initial audio synthesis model according to the sum of the first loss and the second loss, and determining the initial audio synthesis model containing the adjusted parameters as an audio synthesis model when the sum of the first loss and the second loss reaches a preset condition.
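Steps T51 to T59 could be organised into a training step roughly as sketched below (PyTorch assumed); the model interface (accent_from_audio, accent_from_text, synthesize) is a hypothetical wrapper around the networks sketched earlier, and the optimiser handling is illustrative, with only the two MSE losses and their sum following the description above.

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """One parameter update on a batch of (accent text, standard text, identity, mel) samples."""
    # T51/T52: training first accent feature vector, extracted from the recorded audio's mel spectrum
    vec_from_audio = model.accent_from_audio(batch["train_mel"])
    # T53: training second accent feature vector, predicted from the accent phonetic-symbol text
    vec_from_text = model.accent_from_text(batch["accent_ids"])
    # T54-T56: encode the standard phonetic-symbol text, add identity and accent vectors, decode to mel
    pred_mel = model.synthesize(batch["standard_ids"], vec_from_text, batch["speaker_id"])

    loss1 = F.mse_loss(pred_mel, batch["train_mel"])     # T57: first MSE loss (assumes time-aligned frames)
    loss2 = F.mse_loss(vec_from_text, vec_from_audio)    # T58: second MSE loss between the two accent vectors
    loss = loss1 + loss2                                 # T59: the sum of the two losses drives the update

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```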
In some embodiments of the present application, through the adjustment of these parameters, the audio synthesis model can synthesize the target audio more finely when it is applied.
According to the embodiment of the application, the synthesized audio is generated through the trained audio synthesis model, so that the synthesized audio can present different accents and timbres in the same language, and meanwhile, the different accents can be finely controlled.
In order to better implement the above method, the embodiment of the present application further provides an audio synthesis apparatus, where the audio synthesis apparatus may be specifically integrated in a device, and the device may be a terminal, a server, or other devices. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.
For example, as shown in fig. 4, the audio synthesis device may include:
a standard phonetic symbol text module 401, configured to obtain a standard phonetic symbol text of a preset text, where the standard phonetic symbol text is used to represent a standard pronunciation of the preset text;
a target accent feature vector module 402, configured to obtain a target accent feature vector, where the target accent feature vector is used to characterize a target accent;
a target identity module 403, configured to obtain a target identity, where the target identity is used to characterize a target tone;
the synthesizing module 404 is configured to generate, according to the standard phonetic symbol text, the target accent feature vector, and the target identity, a target sound feature parameter by using an audio synthesis model;
The conversion module 405 is configured to convert the target sound feature parameter into target audio corresponding to the preset text, where the target audio carries a target accent corresponding to the target accent feature vector and the target timbre corresponding to the target identity.
In the implementation, each module may be implemented as an independent entity, or may be combined arbitrarily, and implemented as the same entity or several entities, and the implementation of each module may be referred to the foregoing method embodiment, which is not described herein again.
The embodiments of the application provide an audio synthesis apparatus based on the above method embodiments. Compared with carrying out the method manually, the apparatus avoids the time and labour of collecting audio material and the errors introduced by manual operation: the functional modules operate systematically, which ensures the accuracy and refinement of the implementation process, saves the time of manual implementation, and improves the implementation efficiency of the technical solution protected by the application.
The embodiment of the application also provides equipment which can be a terminal, a server and other equipment. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and the like; the server may be a single server, a server cluster composed of a plurality of servers, or the like.
In some embodiments, the audio synthesis apparatus may also be integrated in a plurality of devices, for example, the audio synthesis apparatus may be integrated in a plurality of servers, and the audio synthesis method of the present application is implemented by the plurality of servers.
For example, as shown in fig. 5, a schematic structural diagram of a device according to an embodiment of the present application is shown, specifically:
The device may include a processor 501 with one or more processing cores, a memory 502 with one or more storage media, a power supply 503, an input module 504 and a communication module 505, among other components. It will be appreciated by those skilled in the art that the device structure shown in fig. 5 does not limit the device; the device may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
the processor 501 is a control center of the device, and uses various interfaces and lines to connect the various parts of the overall device, perform various functions of the device and process data by running or executing software programs and/or modules stored in the memory 502, and invoking data stored in the memory 502. In some embodiments, processor 501 may include one or more processing cores; in some embodiments, the processor 501 may integrate an application processor that primarily processes operating systems, user interfaces, applications, etc., with a modem processor that primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 501.
The memory 502 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by executing the software programs and modules stored in the memory 502. The memory 502 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the device, etc. In addition, memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide access to the memory 502 by the processor 501.
The device also includes a power supply 503 for powering the various components, and in some embodiments, the power supply 503 may be logically connected to the processor 501 via a power management system, such that functions such as charge, discharge, and power consumption management are performed by the power management system. The power supply 503 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The device may also include an input module 504, which input module 504 may be used to receive entered numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The device may also include a communication module 505, and in some embodiments the communication module 505 may include a wireless module through which the device may wirelessly transmit over short distances, thereby providing wireless broadband internet access to the user. For example, the communication module 505 may be used to assist a user in e-mail, browsing web pages, accessing streaming media, and the like.
Although not shown, the device may further include a display unit and the like, which are not described herein again. Specifically, in this embodiment, the processor 501 in the device loads executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502, so as to implement the following functions:
obtaining a standard phonetic symbol text of a preset text, wherein the standard phonetic symbol text is used for representing standard pronunciation of the preset text;
acquiring a target accent feature vector, wherein the target accent feature vector is used for representing a target accent;
acquiring a target identity, wherein the target identity is used for representing a target tone;
generating target sound characteristic parameters according to the standard phonetic symbol text, the target accent characteristic vector and the target identity through an audio synthesis model;
and converting the target sound characteristic parameters into target audio corresponding to the preset text, wherein the target audio carries target accents corresponding to the target accent characteristic vectors and the target tone corresponding to the target identity.
For the specific implementation of each of the above operations, reference may be made to the foregoing embodiments, and details are not described herein again.
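For illustration only, the following is a minimal sketch of how the five operations above could be wired together in code. It is not the implementation disclosed in this application: the framework (PyTorch), the class and layer choices, the feature dimensions, and the use of a mel-spectrogram as the sound characteristic parameters are all assumptions introduced for this example.

```python
# Hedged, illustrative sketch only: names, layer types, dimensions, and the choice of a
# mel-spectrogram as the "sound characteristic parameters" are assumptions.
import torch
import torch.nn as nn


class AudioSynthesisModel(nn.Module):
    def __init__(self, phoneme_vocab=100, num_speakers=8, dim=256, n_mels=80):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(phoneme_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)         # standard phonetic symbol text -> output vector
        self.speaker_embedding = nn.Embedding(num_speakers, dim)  # embedded network: target identity -> identity vector
        self.decoder = nn.GRU(dim, n_mels, batch_first=True)      # summed vector -> target sound characteristic parameters

    def forward(self, phoneme_ids, accent_vector, speaker_id):
        encoded, _ = self.encoder(self.phoneme_embedding(phoneme_ids))       # (B, T, dim)
        identity = self.speaker_embedding(speaker_id)                        # (B, dim)
        summed = encoded + accent_vector.unsqueeze(1) + identity.unsqueeze(1)
        mel, _ = self.decoder(summed)                                        # (B, T, n_mels)
        return mel


model = AudioSynthesisModel()
phoneme_ids = torch.randint(0, 100, (1, 12))   # standard phonetic symbol text of the preset text, as token ids
accent_vector = torch.randn(1, 256)            # target accent feature vector
speaker_id = torch.tensor([3])                 # target identity
mel = model(phoneme_ids, accent_vector, speaker_id)
# The mel-spectrogram would then be converted into the target audio by a vocoder
# (e.g. Griffin-Lim or a neural vocoder), carrying the target accent and target tone.
```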
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be completed by instructions, or by instructions controlling associated hardware; the instructions may be stored in a storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the audio synthesis methods provided by the embodiments of the present application. For example, the instructions may perform the steps of:
Obtaining a standard phonetic symbol text of a preset text, wherein the standard phonetic symbol text is used for representing standard pronunciation of the preset text;
acquiring a target accent feature vector, wherein the target accent feature vector is used for representing a target accent;
acquiring a target identity, wherein the target identity is used for representing a target tone;
generating target sound characteristic parameters according to the standard phonetic symbol text, the target accent characteristic vector and the target identity through an audio synthesis model;
and converting the target sound characteristic parameters into target audio corresponding to the preset text, wherein the target audio carries target accents corresponding to the target accent characteristic vectors and the target tone corresponding to the target identity.
For the specific implementation of each of the above operations, reference may be made to the foregoing embodiments, and details are not described herein again.
Wherein the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and the like.
The instructions stored in the storage medium can execute the steps of any audio synthesis method provided by the embodiments of the present application, and therefore can achieve the beneficial effects achievable by any audio synthesis method provided by the embodiments of the present application; for details, refer to the foregoing embodiments, which are not described herein again.
The audio synthesis method, apparatus, device, and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is merely intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and application scope based on the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method of audio synthesis, comprising:
obtaining a standard phonetic symbol text of a preset text, wherein the standard phonetic symbol text is used for representing standard pronunciation of the preset text;
acquiring a target accent feature vector, wherein the target accent feature vector is used for representing a target accent;
acquiring a target identity, wherein the target identity is used for representing a target tone;
generating target sound characteristic parameters according to the standard phonetic symbol text, the target accent characteristic vector and the target identity through an audio synthesis model;
and converting the target sound characteristic parameters into target audio corresponding to the preset text, wherein the target audio carries target accents corresponding to the target accent characteristic vectors and the target tone corresponding to the target identity.
2. The method of audio synthesis according to claim 1, wherein the obtaining the target accent feature vector comprises:
receiving user audio of a target user;
extracting user sound characteristic parameters corresponding to the user audio;
and generating the target accent feature vector according to the user sound feature parameters through a first accent feature extraction network in the audio synthesis model.
3. The method of audio synthesis according to claim 1, wherein the obtaining the target accent feature vector comprises:
acquiring an accent phonetic symbol text, wherein the accent phonetic symbol text is used for representing the pronunciation of the target accent of the preset text;
and generating the target accent feature vector according to the accent phonetic symbol text through a second accent feature extraction network in the audio synthesis model.
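For readability, the two ways of obtaining the target accent feature vector recited in claims 2 and 3 can be pictured with the hedged sketch below. The class names (AudioAccentExtractor, TextAccentExtractor), the recurrent layers, and the dimensions are assumptions introduced for illustration, not the networks actually disclosed in this application.

```python
# Illustrative-only sketch of the two accent-vector paths of claims 2 and 3.
# Network names, feature dimensions, and the mel-feature input are assumptions.
import torch
import torch.nn as nn


class AudioAccentExtractor(nn.Module):
    """First accent feature extraction network: user sound feature parameters -> accent vector."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mel):                      # mel: (B, T, n_mels) extracted from the user audio
        _, h = self.rnn(mel)
        return h[-1]                             # (B, dim) target accent feature vector


class TextAccentExtractor(nn.Module):
    """Second accent feature extraction network: accent phonetic symbol text -> accent vector."""
    def __init__(self, vocab=100, dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, accent_phoneme_ids):       # (B, T) ids of the accent phonetic symbol text
        _, h = self.rnn(self.embedding(accent_phoneme_ids))
        return h[-1]                             # (B, dim) target accent feature vector


# Path of claim 2: extract from received user audio (mel features assumed precomputed).
accent_from_audio = AudioAccentExtractor()(torch.randn(1, 120, 80))
# Path of claim 3: extract from the accent phonetic symbol text of the preset text.
accent_from_text = TextAccentExtractor()(torch.randint(0, 100, (1, 12)))
```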
4. The audio synthesis method according to claim 1, wherein the audio synthesis model comprises an encoder, an embedded network, and a decoder;
the generating target sound characteristic parameters according to the standard phonetic symbol text, the target accent feature vector, and the target identity through the audio synthesis model comprises:
generating an output vector according to the standard phonetic symbol text through the encoder;
generating an identity vector according to the target identity through the embedded network;
calculating a sum of the target accent feature vector, the output vector, and the identity vector to obtain a summation result;
and generating the target sound characteristic parameters according to the summation result through the decoder.
5. The audio synthesis method according to claim 1, wherein before the standard phonetic symbol text of the preset text is obtained, the method further comprises:
acquiring a training accent phonetic symbol text set, wherein the training accent phonetic symbol text set comprises a first number of training accent phonetic symbol texts, the first number of training accent phonetic symbol texts are jointly generated by at least two training users, and each training accent phonetic symbol text is used for representing the accent pronunciation of the training user to which the training accent phonetic symbol text belongs;
acquiring a training standard phonetic symbol text set, wherein the training standard phonetic symbol text set comprises a first number of training standard phonetic symbol texts, the training standard phonetic symbol texts are in one-to-one correspondence with the training accent phonetic symbol texts, and each training standard phonetic symbol text is used for representing standard pronunciation corresponding to the accent pronunciation;
acquiring an identity corresponding to each training user of the at least two training users;
acquiring training audio corresponding to the first number of training accent phonetic symbol texts, and acquiring training sound characteristic parameters based on the training audio;
and training an initial audio synthesis model according to the training accent phonetic symbol text set, the training standard phonetic symbol text set, the at least two identities, and the training sound characteristic parameters to obtain the audio synthesis model.
6. The method of audio synthesis according to claim 5, wherein the initial audio synthesis model comprises an initial first accent feature extraction network, an initial second accent feature extraction network, an initial embedded network, an initial encoder, and an initial decoder;
the training the initial audio synthesis model according to the training accent phonetic symbol text set, the training standard phonetic symbol text set, the at least two identities, and the training sound characteristic parameters to obtain the audio synthesis model comprises:
acquiring the training sound characteristic parameters corresponding to a first training accent phonetic symbol text, wherein the first training accent phonetic symbol text is any one of the first number of training accent phonetic symbol texts;
generating a training first accent feature vector according to the training sound characteristic parameters through the initial first accent feature extraction network;
generating a training second accent feature vector according to the first training accent phonetic symbol text through the initial second accent feature extraction network;
generating a training identity vector according to the identity corresponding to the first training accent phonetic symbol text through the initial embedded network;
generating a training output vector according to the training standard phonetic symbol text corresponding to the first training accent phonetic symbol text by the initial encoder;
acquiring training output sound characteristic parameters based on the training second accent characteristic vector, the training identity vector and the training output vector;
calculating a first loss according to the training sound characteristic parameters and the training output sound characteristic parameters;
calculating a second loss according to the training first accent feature vector and the training second accent feature vector;
and adjusting parameters of the initial audio synthesis model according to the sum of the first loss and the second loss, and determining the initial audio synthesis model with the adjusted parameters as the audio synthesis model when the sum of the first loss and the second loss satisfies a preset condition.
7. The method of audio synthesis according to claim 6, wherein the obtaining training output sound feature parameters based on the training second accent feature vector, the training identity vector, and the training output vector comprises:
calculating a training vector sum of the training second accent feature vector, the training identity vector and the training output vector;
and generating training output sound characteristic parameters according to the training vector sum through the initial decoder.
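Under the same assumptions as the earlier sketches, the training procedure of claims 5 to 7 (a reconstruction loss on the sound characteristic parameters plus a consistency loss between the two accent feature vectors) could look roughly as follows. The MSE losses, the Adam optimizer, and the stand-in data shapes are assumptions, not the disclosed training scheme.

```python
# Hedged sketch of one training step; reuses the illustrative classes sketched earlier
# (AudioSynthesisModel, AudioAccentExtractor, TextAccentExtractor). Loss and optimizer
# choices are assumptions.
import torch
import torch.nn.functional as F

model = AudioSynthesisModel()
audio_accent_net = AudioAccentExtractor()      # initial first accent feature extraction network
text_accent_net = TextAccentExtractor()        # initial second accent feature extraction network
params = (list(model.parameters()) + list(audio_accent_net.parameters())
          + list(text_accent_net.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)

def training_step(accent_phoneme_ids, standard_phoneme_ids, speaker_id, target_mel):
    # training first accent feature vector, from the training sound characteristic parameters
    first_accent = audio_accent_net(target_mel)
    # training second accent feature vector, from the training accent phonetic symbol text
    second_accent = text_accent_net(accent_phoneme_ids)
    # training output sound characteristic parameters, from the training standard phonetic
    # symbol text, the training identity vector, and the training second accent feature vector
    output_mel = model(standard_phoneme_ids, second_accent, speaker_id)
    first_loss = F.mse_loss(output_mel, target_mel)          # first loss: reconstruction
    second_loss = F.mse_loss(second_accent, first_accent)    # second loss: accent consistency
    loss = first_loss + second_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One illustrative step with random stand-in data; training would stop once the summed
# loss satisfies the preset condition (e.g. falls below a threshold).
training_step(torch.randint(0, 100, (1, 12)), torch.randint(0, 100, (1, 12)),
              torch.tensor([3]), torch.randn(1, 12, 80))
```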
8. An audio synthesis device, comprising:
the standard phonetic symbol text module is used for acquiring standard phonetic symbol text of a preset text, wherein the standard phonetic symbol text is used for representing standard pronunciation of the preset text;
the target accent feature vector module is used for acquiring a target accent feature vector, and the target accent feature vector is used for representing a target accent;
the target identity module is used for acquiring a target identity, and the target identity is used for representing a target tone;
the synthesis module is used for generating target sound characteristic parameters according to the standard phonetic symbol text, the target accent characteristic vector and the target identity through an audio synthesis model;
the conversion module is used for converting the target sound characteristic parameters into target audio corresponding to the preset text, wherein the target audio carries target accents corresponding to the target accent feature vectors and the target tone corresponding to the target identity.
9. An apparatus comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps of a method of audio synthesis as claimed in any one of claims 1 to 7.
10. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of a method of audio synthesis as claimed in any one of claims 1 to 7.
CN202210561989.8A 2022-05-23 2022-05-23 Audio synthesis method, device, equipment and storage medium Pending CN117153140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210561989.8A CN117153140A (en) 2022-05-23 2022-05-23 Audio synthesis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210561989.8A CN117153140A (en) 2022-05-23 2022-05-23 Audio synthesis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117153140A true CN117153140A (en) 2023-12-01

Family

ID=88899346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210561989.8A Pending CN117153140A (en) 2022-05-23 2022-05-23 Audio synthesis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117153140A (en)

Similar Documents

Publication Publication Date Title
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN108447486B (en) Voice translation method and device
CN111048062B (en) Speech synthesis method and apparatus
Kaur et al. Automatic speech recognition system for tonal languages: State-of-the-art survey
CN111899719A (en) Method, apparatus, device and medium for generating audio
CN111276120A (en) Speech synthesis method, apparatus and computer-readable storage medium
CN111161695B (en) Song generation method and device
CN113205817A (en) Speech semantic recognition method, system, device and medium
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN112802446A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN110930975A (en) Method and apparatus for outputting information
CN112185340A (en) Speech synthesis method, speech synthesis device, storage medium and electronic apparatus
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN113948062B (en) Data conversion method and computer storage medium
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN113539236B (en) Speech synthesis method and device
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
Labied et al. Moroccan dialect “Darija” automatic speech recognition: a survey
Lu et al. Implementation of embedded unspecific continuous English speech recognition based on HMM
CN114267325A (en) Method, system, electronic device and storage medium for training speech synthesis model
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication