CN116825085A - Speech synthesis method, device, computer equipment and medium based on artificial intelligence - Google Patents

Speech synthesis method, device, computer equipment and medium based on artificial intelligence

Info

Publication number
CN116825085A
Authority
CN
China
Prior art keywords
phoneme
text
style
trained
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310721752.6A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
程宁
唐浩彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310721752.6A priority Critical patent/CN116825085A/en
Publication of CN116825085A publication Critical patent/CN116825085A/en
Pending legal-status Critical Current


Abstract

The invention is applicable to the technical field of medical treatment, and particularly relates to a speech synthesis method, apparatus, computer device and medium. According to the invention, a phoneme encoder extracts the phoneme features of a target text, a style encoder extracts the style features of a reference audio, and a text predictor predicts a text from the style features. When the similarity between the predicted text and the real text is smaller than a similarity threshold, each phoneme in the phoneme features is assigned a corresponding phoneme attribute by combining the style features with preset phoneme attribute adaptation parameters, and an audio generator decodes these attributes to obtain target audio that is consistent with the style of the reference audio and takes the target text as speech content. This improves the accuracy of the speech synthesis system and, in the medical technical field, effectively assists medical staff in accurately handling a large amount of redundant and tedious repetitive work, thereby greatly improving their working efficiency and quality.

Description

Speech synthesis method, device, computer equipment and medium based on artificial intelligence
Technical Field
The invention is suitable for the technical field of medical treatment, and particularly relates to a speech synthesis method, device, computer equipment and medium based on artificial intelligence.
Background
A speech synthesis system converts text into speech, endowing machines with the ability to speak, and plays an important role in many everyday scenarios such as spoken dialogue systems, intelligent voice assistants, telephone information inquiry systems, car navigation systems, audio e-books, real-time information broadcasting systems, information access and communication for visually or speech-impaired people, and disease information dissemination. In the medical technical field, for example, the complexity of medical systems requires medical staff to perform a large amount of repetitive basic work. In recent years the application of intelligent speech technology in the medical field has become increasingly widespread; with the continued development of Internet big data and speech synthesis technology, speech synthesis serves as an assistant in a doctor's work and, being fast, accurate and low in error rate, can help doctors handle a large amount of tedious repetitive labor, greatly reducing the burden on medical staff.
Existing speech synthesis systems can accurately read out the content of an article or accurately answer a user's inquiry. However, because the synthesized text is kept consistent with the text of the reference audio during training, whereas in actual use the two often differ substantially, the synthesized speech sounds distinctly mechanical and lacks emotion; an expressive, well-modulated speech effect is difficult to achieve, which greatly degrades the user experience.
Therefore, in the medical technical field, how to improve the accuracy of the speech synthesis result is a problem to be solved.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a speech synthesis method, apparatus, computer device and medium based on artificial intelligence, so as to solve the problem of low accuracy of the existing speech synthesis result.
In a first aspect, an embodiment of the present invention provides an artificial intelligence-based speech synthesis method, where the speech synthesis method includes:
obtaining a target text and a reference audio, extracting phoneme features of the target text by using a trained phoneme encoder to obtain phoneme features, and extracting style features of the reference audio by using a trained style encoder to obtain style features;
performing text prediction on the style characteristics by using a trained text predictor to obtain a predicted text;
obtaining a real text corresponding to the reference audio, calculating the similarity between the predicted text and the real text, and detecting whether the similarity is smaller than a preset similarity threshold;
if the similarity is detected to be smaller than the similarity threshold, according to the style characteristics, a corresponding phoneme attribute is assigned to each phoneme in the phoneme characteristics by combining with a preset phoneme attribute adaptation parameter;
And decoding the phoneme attributes corresponding to all phonemes by using the trained audio generator to obtain target audio corresponding to the target text.
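For illustration only, the following sketch outlines how the five steps above could be wired together. Every component and dictionary key here (phoneme_encoder, style_encoder, text_predictor, attribute_adapter, audio_generator, real_text_lookup, similarity) is an assumed name, since the patent specifies the steps but not an implementation:

```python
def synthesize(target_text, reference_audio, models, similarity_threshold=0.5):
    """Hypothetical end-to-end flow of the five claimed steps; `models` is an
    assumed dictionary of already trained components."""
    phoneme_feats = models["phoneme_encoder"](target_text)      # step 1
    style_feats = models["style_encoder"](reference_audio)      # step 1
    predicted_text = models["text_predictor"](style_feats)      # step 2

    real_text = models["real_text_lookup"](reference_audio)     # step 3
    similarity = models["similarity"](predicted_text, real_text)
    if similarity >= similarity_threshold:
        # The style features still leak text content of the reference
        # audio, so they do not meet the synthesis target.
        raise ValueError("style features do not meet the synthesis target")

    # Step 4: give every phoneme its pitch/energy/duration attributes.
    phoneme_attributes = models["attribute_adapter"](phoneme_feats, style_feats)
    # Step 5: decode all phoneme attributes into the target audio.
    return models["audio_generator"](phoneme_attributes)
```

The similarity check acts as a gate: synthesis proceeds only when the style features are judged free of the reference audio's text content.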
In a second aspect, an embodiment of the present invention provides an artificial intelligence-based speech synthesis apparatus, the speech synthesis apparatus including:
the feature extraction module is used for acquiring a target text and a reference audio, extracting phoneme features of the target text by using a trained phoneme encoder to obtain phoneme features, and extracting style features of the reference audio by using a trained style encoder to obtain style features;
the text prediction module is used for predicting the text of the style characteristics by using a trained text predictor to obtain a predicted text;
the similarity calculation module is used for obtaining the real text corresponding to the reference audio, calculating the similarity between the predicted text and the real text, and detecting whether the similarity is smaller than a preset similarity threshold value or not;
the attribute determining module is used for assigning corresponding phoneme attributes to each phoneme in the phoneme features according to the style features and combining preset phoneme attribute adaptation parameters if the similarity is detected to be smaller than the similarity threshold;
And the audio synthesis module is used for decoding the phoneme attributes corresponding to all phonemes by using the trained audio generator to obtain target audio corresponding to the target text.
In a third aspect, an embodiment of the present invention provides a computer device, the computer device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, the processor implementing the speech synthesis method according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the speech synthesis method according to the first aspect.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects: a target text and a reference audio are obtained; a trained phoneme encoder extracts phoneme features from the target text, and a trained style encoder extracts style features from the reference audio; a trained text predictor performs text prediction on the style features to obtain a predicted text; the real text corresponding to the reference audio is obtained, the similarity between the predicted text and the real text is calculated, and whether the similarity is smaller than a preset similarity threshold is detected; if so, each phoneme in the phoneme features is assigned a corresponding phoneme attribute according to the style features in combination with preset phoneme attribute adaptation parameters, and a trained audio generator decodes the phoneme attributes of all phonemes to obtain the target audio corresponding to the target text. Because the trained style encoder extracts only the style of the reference audio and not its text information, the target audio is synthesized based on the style features and the target text alone, which improves the speech synthesis accuracy of the speech synthesis system on non-parallel data. In the medical technical field this effectively assists medical staff in accurately handling a large amount of redundant and tedious repetitive work, greatly improving their working efficiency and quality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an application environment of an artificial intelligence-based speech synthesis method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of an artificial intelligence based speech synthesis method according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech synthesis apparatus based on artificial intelligence according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining" or "in response to determining" or "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The embodiments of the invention can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
It should be understood that the sequence numbers of the steps in the following embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
The speech synthesis method based on artificial intelligence provided by the embodiments of the invention can be applied to an application environment as shown in fig. 1, in which a client communicates with a server. Clients include, but are not limited to, palmtop computers, desktop computers, notebook computers, ultra-mobile personal computers (UMPC), netbooks, cloud computing devices, personal digital assistants (PDA), and other computing devices. The server may be implemented by a stand-alone server or by a server cluster formed of a plurality of servers.
Referring to fig. 2, a flow chart of an artificial intelligence-based speech synthesis method according to an embodiment of the invention is shown, and the speech synthesis method may be applied to the client in fig. 1, and the speech synthesis method may include the following steps:
step S201, obtaining a target text and a reference audio, extracting phoneme features of the target text by using a trained phoneme encoder to obtain phoneme features, and extracting style features of the reference audio by using a trained style encoder to obtain style features.
In the speech synthesis task, the reference audio is used for providing style information of the synthesized audio, the style information can include information such as timbre, emotion and rhythm of a speaker, and the target text is used for providing content information of the synthesized audio, so that the style information of the reference audio and the content information of the target text are fused to generate the audio which is consistent with the style of the reference audio and takes the target text as speech content. The speech synthesis task plays an important role in a plurality of scenes such as a speech dialogue system, an intelligent speech assistant, a telephone information inquiry system, a vehicle navigation system, a sound electronic book, a real-time information broadcasting system, information acquisition and communication of vision or speech impaired people and the like.
For example, in the medical technical field, the target text may be a popular science text related to a disease, the reference audio may be speaking audio of a healthcare worker, and then content information of the synthesized audio may be provided through the target text, and style information of the synthesized audio is provided through the reference audio, so that the style information of the reference audio and the content information of the target text are fused, a target synthesized voice consistent with the style of the reference audio and using the target text as voice content is generated, the medical worker is efficiently assisted to perform popular science work of the disease, the burden of the medical worker is reduced, and the work efficiency and the work quality of the medical worker are improved.
Existing speech synthesis systems typically perform model training based on sample reference audio and sample training text, wherein the sample training text and the sample real text corresponding to the sample reference audio remain identical, i.e. the sample training text and the sample reference audio belong to parallel data. In the actual use process, the voice synthesis system usually synthesizes target audio with consistent style and inconsistent content based on the fixed reference audio and different target texts, so that the efficiency of voice synthesis is improved, and the cost of voice synthesis is reduced. Therefore, the real texts corresponding to the target text and the reference audio in the actual use process are inconsistent, namely the target text and the reference audio belong to non-parallel data, so that the accuracy of the synthesized voice based on the existing voice synthesis system is low, and the user requirement is difficult to meet.
Therefore, the embodiment obtains the reference audio and the target text inconsistent with the real text corresponding to the reference audio based on the actual application scene, and extracts the style information of the reference audio and the text content information of the target text to synthesize the target audio consistent with the style of the reference audio and taking the target text as the voice content.
Specifically, a trained phoneme encoder and a trained style encoder are obtained, and the trained phoneme encoder is used for extracting phoneme characteristics of a target text to obtain phoneme characteristics, wherein the phoneme characteristics are used for representing text content information of the target text; and extracting style characteristics of the reference audio by using the trained style encoder to obtain style characteristics, wherein the style characteristics are used for representing information such as timbre, emotion, rhythm and the like of a speaker in the reference audio.
In order to solve the problem that the data between the target text and the real text corresponding to the reference audio are not parallel to influence on the speech synthesis result, the style encoder used in the embodiment is only used for extracting the style characteristics of the reference audio, and the text information of the reference audio is not contained in the style characteristics through training, so that the speech content of the target audio is more from the text characteristics of the target text and is not influenced by the text information of the reference audio, and the speech synthesis accuracy of the speech synthesis system in the non-parallel data is improved.
Optionally, the trained phoneme encoder comprises a trained phoneme embedding layer and a trained phoneme feature extraction layer;
extracting phoneme features of the target text by using the trained phoneme encoder, wherein obtaining the phoneme features comprises:
performing phoneme embedding on the target text by using the trained phoneme embedding layer to obtain a phoneme sequence;
and extracting the characteristics of the phoneme sequence by using the trained phoneme characteristic extraction layer to obtain phoneme characteristics.
Phonemes are the smallest phonetic units, divided according to the natural attributes of speech and analyzed in terms of the articulatory actions within a syllable, with one action forming one phoneme. Phonemes are therefore the basis on which the target speech is synthesized and play an important role in speech synthesis.
In this embodiment, the trained phoneme encoder is configured to perform phoneme feature extraction on the target text to obtain phoneme features. To improve the accuracy of phoneme feature extraction, the trained phoneme encoder in this embodiment includes a trained phoneme embedding layer and a trained phoneme feature extraction layer.
Specifically, firstly, a target text is subjected to phoneme embedding through a trained phoneme embedding layer to obtain a phoneme sequence, and all phonemes in the target text can be represented by the phoneme sequence in sequence; and then, carrying out feature extraction on the phoneme sequence by using the trained phoneme feature extraction layer to obtain phoneme features, wherein the phoneme features can be used for representing text information of the target text based on phonemes of the target text. Therefore, the feature extraction accuracy of the target text can be improved by converting the target text into a corresponding phoneme sequence and then performing feature extraction on the phoneme sequence, thereby improving the accuracy of the extracted phoneme features.
According to the embodiment, the trained phoneme embedding layer is used for converting the target text into the corresponding phoneme sequence, and then the trained phoneme feature extraction layer is used for extracting the features of the phoneme sequence, so that the feature extraction accuracy of the target text is improved, and the accuracy of extracted phoneme features is improved.
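A minimal sketch of such a two-stage phoneme encoder follows, assuming a PyTorch implementation with a Transformer feature-extraction layer; the patent fixes the two-layer structure but not the layer types, and the sketch assumes the target text has already been converted to phoneme IDs by an upstream grapheme-to-phoneme step:

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Illustrative phoneme encoder: a phoneme embedding layer followed by a
    feature-extraction layer (here a Transformer encoder, an assumption)."""
    def __init__(self, n_phonemes=100, d_model=256):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, d_model)  # phoneme embedding layer
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.feature_extractor = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, phoneme_ids):            # (batch, seq) of phoneme IDs
        seq = self.embedding(phoneme_ids)      # embedded phoneme sequence
        return self.feature_extractor(seq)     # phoneme features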
Optionally, the training process of the style encoder includes:
acquiring a trained text predictor, sample reference audio and sample real texts corresponding to each sample reference audio;
carrying out style characteristic extraction on the sample reference audio by using a style encoder to obtain sample style characteristics;
performing text prediction on the sample style characteristics by using a trained text predictor to obtain a sample predicted text;
calculating model loss according to the sample predicted text and the sample real text, training the style encoder in a gradient inversion mode according to the model loss until the model loss meets preset training conditions, and obtaining the trained style encoder.
The trained style encoder used in the embodiment is only used for extracting style characteristics of the reference audio, and does not extract text information of the reference audio.
And in the training process of the style encoder, acquiring a trained text predictor, sample reference audios and sample real texts corresponding to each sample reference audio, extracting style characteristics of the sample reference audios by using the style encoder to obtain sample style characteristics, and further carrying out text prediction on the extracted sample style characteristics by using the trained text predictor to obtain sample predicted texts contained in the sample style characteristics.
And calculating model loss according to the sample real text and the sample predicted text, wherein the model loss is used for representing the similarity between the sample predicted text and the sample real text which are contained in the sample style characteristics, and calculating according to the similarity to obtain corresponding model loss, and correspondingly, the smaller the similarity between the sample predicted text and the sample real text is, the larger the corresponding model loss is.
Therefore, in this embodiment, in order to make the trained style encoder only use for extracting the style features of the reference audio, but not extracting the text information of the reference audio, the style encoder is trained by adopting a gradient inversion mode until the model loss is smaller than the preset model loss threshold value, so as to obtain the trained style encoder.
In the training process of the style encoder, text prediction is carried out on the extracted sample style characteristics through the trained text predictor, so that sample predicted text contained in the sample style characteristics is obtained, then model loss is calculated according to sample real text and sample predicted text, the style encoder is trained in a gradient inversion mode until the model loss is smaller than a preset model loss threshold value, the trained style encoder is obtained, the aim that the trained style encoder is only used for extracting the style characteristics of the reference audio and not extracting text information of the reference audio is achieved, and the style characteristic extraction accuracy of the style encoder in non-parallel data is improved.
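A minimal sketch of one such adversarial training step is given below, assuming PyTorch, frame-level text logits, and an optimizer that holds only the style encoder's parameters. The cross-entropy loss follows the usual adversarial convention (low when the frozen predictor recovers the real text), and the sign of its gradient is flipped before it reaches the style encoder:

```python
import torch.nn.functional as F

def style_encoder_train_step(style_encoder, text_predictor, optimizer,
                             sample_audio, sample_text_ids):
    """One illustrative adversarial step: the already trained text predictor
    is kept frozen; the reversed gradient drives the style encoder to remove
    text information from its output."""
    style_feats = style_encoder(sample_audio)        # (batch, frames, d_style)

    # Gradient reversal via a detach trick: the value equals style_feats,
    # but the gradient flowing back into the style encoder is sign-flipped.
    reversed_feats = 2 * style_feats.detach() - style_feats

    logits = text_predictor(reversed_feats)          # (batch, frames, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), sample_text_ids)

    optimizer.zero_grad()
    loss.backward()          # only the style encoder's parameters are updated
    optimizer.step()
    return loss.item()
```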
Optionally, training the style encoder in a gradient inversion manner includes:
obtaining model parameters of a style encoder, calculating according to model loss and the model parameters to obtain training gradients, and determining the optimization direction of the model parameters according to the training gradients;
inverting the optimized direction to obtain an optimized inversion direction, and iteratively adjusting the model parameters according to the optimized inversion direction.
In the training process of the style encoder, the model parameters of the style encoder need to be iteratively adjusted so as to improve the accuracy of style feature extraction of the style encoder.
Gradient descent is a first-order optimization algorithm. To find a local minimum of a function with gradient descent, the partial derivatives of the function are taken at the current position to obtain the corresponding gradient, and an iterative search is then performed in the direction opposite to the gradient with a preset step size.
Therefore, in this embodiment, after model loss is calculated according to the sample real text and the sample predicted text, model parameters of the style encoder are obtained, and then a training gradient is calculated according to a functional relationship between the model loss and the model parameters, where the training gradient is in a vector form, and then an opposite direction of the training gradient can be determined as an optimization direction of the model parameters.
In order that the trained style encoder extracts only the style features of the reference audio and not its text information, this embodiment inverts the optimization direction to obtain an optimization inversion direction, and iteratively adjusts the model parameters along this inversion direction until the model loss meets the preset training condition, thereby determining the optimal model parameters and obtaining the trained style encoder.
According to the method, the optimization direction of the model parameters is determined based on the gradient descent method, the optimization direction is reversed, the optimization reversal direction is obtained, the model parameters are subjected to iterative adjustment according to the optimization reversal direction, and the trained style encoder is obtained, so that the sample style characteristics output by the trained style encoder do not contain text information of sample reference audio, and the style characteristic extraction accuracy of the style encoder is improved.
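The sign flip in the earlier training-step sketch can be packaged as a reusable gradient reversal layer in the style of domain-adversarial training; this is an assumed realization of the "optimization inversion direction", not code from the patent:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Illustrative gradient reversal layer: identity in the forward pass,
    sign-flipped (optionally scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Invert the optimization direction for everything upstream,
        # i.e. the style encoder's parameters.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```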
In the steps of obtaining a target text and a reference audio, extracting phoneme features from the target text using the trained phoneme encoder, and extracting style features from the reference audio using the trained style encoder, a target text that differs from the real text of the reference audio is obtained; the extracted style features contain no text content information of the reference audio and represent information such as the timbre, emotion and rhythm of the speaker, while the extracted phoneme features represent the text content information of the target text. Target audio consistent with the style of the reference audio and taking the target text as speech content can then be synthesized from the style features and phoneme features, improving the speech synthesis accuracy of the speech synthesis system on non-parallel data.
And S202, carrying out text prediction on the style characteristics by using a trained text predictor to obtain a predicted text.
The trained style encoder in this embodiment aims to improve the accuracy of style feature extraction of the style encoder in non-parallel data by only extracting style features of the reference audio, but not extracting text information of the reference audio. Therefore, after the style characteristics are extracted by the trained style encoder, the text content information contained in the style characteristics is also required to be predicted and extracted, so as to judge whether the style characteristics extracted by the trained style encoder meet the target requirements.
Specifically, the embodiment obtains the trained text predictor, and the trained text predictor can accurately predict the text contained in the input content after model training. Therefore, the trained text predictor is used for carrying out text prediction on the style characteristics, so as to obtain a predicted text, and the predicted text is used as a judgment basis for judging whether the style characteristics meet the speech synthesis target.
In the step of performing text prediction on the style features using the trained text predictor to obtain a predicted text, the text contained in the style features is predicted and serves as the basis for judging whether the style features meet the speech synthesis target; as a precondition of speech synthesis, this improves the speech synthesis accuracy of the speech synthesis system on non-parallel data.
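As a sketch, a text predictor of this kind could map a sequence of style features to per-frame text logits; the recurrent architecture below is an assumption, since the patent specifies only the predictor's role:

```python
import torch.nn as nn

class TextPredictor(nn.Module):
    """Illustrative text predictor: maps frame-level style features to
    per-frame character logits (architecture assumed, not from the patent)."""
    def __init__(self, d_style=256, vocab_size=5000):
        super().__init__()
        self.rnn = nn.GRU(d_style, 256, batch_first=True)
        self.out = nn.Linear(256, vocab_size)

    def forward(self, style_feats):        # (batch, frames, d_style)
        h, _ = self.rnn(style_feats)
        return self.out(h)                 # (batch, frames, vocab_size)
```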
Step S203, obtaining a real text corresponding to the reference audio, calculating the similarity between the predicted text and the real text, and detecting whether the similarity is smaller than a preset similarity threshold.
After text prediction is performed on style characteristics by using a trained text predictor, in order to determine whether the style encoder satisfies the goal of extracting only style characteristics of the reference audio but not text information of the reference audio according to the predicted text, the embodiment acquires a real text corresponding to the reference audio and calculates the similarity between the predicted text and the real text. Correspondingly, the smaller the similarity, the smaller the extraction degree of text information representing the reference audio by the style encoder.
Therefore, a preset similarity threshold is obtained according to the actual application condition, and whether the style characteristics obtained by extraction of the trained style encoder meet the requirements of speech synthesis is judged by detecting whether the similarity is smaller than the preset similarity threshold.
Optionally, calculating the similarity between the predicted text and the real text includes:
converting the predicted text into a predicted text vector according to a word vector technology, and converting the real text into a real text vector according to the word vector technology;
And calculating the similarity between the predicted text vector and the real text vector, and determining the similarity between the predicted text and the real text.
In order to facilitate similarity calculation between the predicted text and the real text, the embodiment converts the predicted text and the real text into a predicted text vector and a real text vector respectively through word vector technology, calculates the similarity between the predicted text vector and the real text vector, and determines the similarity between the predicted text vector and the real text vector as the similarity between the predicted text and the real text.
According to the embodiment, the predicted text and the real text are respectively converted into the predicted text vector and the real text vector through the word vector technology, so that the similarity between the predicted text vector and the real text vector is determined to be the similarity between the predicted text and the real text, and the calculation efficiency of the similarity between the predicted text and the real text is improved.
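A minimal sketch of this similarity computation follows, assuming each text is embedded as the average of its pretrained word vectors and the vectors are compared by cosine similarity; the patent names the word vector technique but not a specific embedding model, so the interface here is an assumption:

```python
import numpy as np

def text_similarity(pred_tokens, real_tokens, word_vectors):
    """Cosine similarity between the mean word vectors of two token lists.
    `word_vectors` is any mapping from token to 1-D numpy array (an assumed
    interface for a pretrained embedding table)."""
    def embed(tokens):
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        if not vecs:                       # no token covered by the table
            return None
        return np.mean(vecs, axis=0)

    a, b = embed(pred_tokens), embed(real_tokens)
    if a is None or b is None:
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```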
In the steps of obtaining the real text corresponding to the reference audio, calculating the similarity between the predicted text and the real text, and detecting whether the similarity is smaller than a preset similarity threshold, the calculated similarity characterizes the degree to which the style encoder has extracted text information of the reference audio; detecting whether the similarity is smaller than the preset similarity threshold therefore judges whether the style features meet the speech synthesis target, which, as a precondition of speech synthesis, improves the speech synthesis accuracy of the speech synthesis system on non-parallel data.
Step S204, if the similarity is smaller than the similarity threshold, according to the style characteristics, a corresponding phoneme attribute is assigned to each phoneme in the phoneme characteristics in combination with a preset phoneme attribute adaptation parameter.
The smaller the similarity between the predicted text and the real text, the less text information of the reference audio the style encoder has extracted. Accordingly, when the similarity is detected to be greater than or equal to the similarity threshold, it is determined that the style features do not meet the speech synthesis target; if such style features were used for speech synthesis, the accuracy of the text content in the result could not be guaranteed, which would reduce the accuracy of the speech synthesis result.
Therefore, in this embodiment, when the similarity is detected to be smaller than the similarity threshold, after determining that the style characteristics meet the speech synthesis target, according to the style characteristics, a corresponding phoneme attribute is given to each phoneme in the phoneme characteristics in combination with a preset phoneme attribute adaptation parameter, so as to obtain all phoneme attribute information corresponding to each phoneme, and the information is used as a speech synthesis basis to improve the accuracy of the speech synthesis result.
From an acoustic perspective, the style of audio is related to the pitch, energy and duration of its phonemes; therefore, in this embodiment, the phoneme attributes corresponding to each phoneme are set to include pitch, energy and duration.
Specifically, if the similarity is detected to be smaller than the similarity threshold, according to style characteristics, a corresponding phoneme attribute is assigned to each phoneme in the phoneme characteristics in combination with a preset phoneme attribute adaptation parameter, so that style characteristics and phoneme characteristics are combined on the basis of the preset phoneme attribute adaptation parameter, all phoneme attributes corresponding to each phoneme are determined, all phoneme attributes corresponding to all phonemes simultaneously comprise style information of reference audio and text information of a target text, and the target speech can be synthesized as a basis of speech synthesis.
In an embodiment, the preset phoneme attribute adapting parameter is a model parameter in a preset trained phoneme attribute adapter, and the style feature and the phoneme feature are fused to obtain a fusion feature, and the fusion feature fuses the style information in the reference audio and the text information in the target text. And then inputting the fusion characteristics into a trained phoneme attribute adapter for decoding, and outputting the phoneme attribute corresponding to each phoneme in the phoneme characteristics as a speech synthesis basis.
Optionally, assigning a corresponding phoneme attribute to each phoneme in the phoneme features according to the style features in combination with preset phoneme attribute adaptation parameters includes:
Acquiring preset phoneme attribute adaptation parameters, wherein the preset phoneme attribute adaptation parameters comprise preset pitch adaptation parameters, preset energy adaptation parameters and preset duration adaptation parameters;
according to style characteristics, assigning a corresponding pitch to each phoneme in the phoneme characteristics in combination with preset pitch adaptation parameters, assigning a corresponding energy to each phoneme in the phoneme characteristics in combination with preset energy adaptation parameters, and assigning a corresponding duration to each phoneme in the phoneme characteristics in combination with preset duration adaptation parameters;
the pitch, energy and duration corresponding to each phoneme in the phoneme characteristic are obtained.
In this embodiment, since all the phoneme attributes corresponding to each phoneme include pitch, energy and duration, correspondingly, the preset phoneme attribute adaptation parameters include a preset pitch adaptation parameter, a preset energy adaptation parameter and a preset duration adaptation parameter, so as to obtain the pitch, energy and duration of each phoneme respectively.
Specifically, according to style characteristics, a corresponding pitch is assigned to each phoneme in the phoneme characteristics in combination with preset pitch adaptation parameters, a corresponding energy is assigned to each phoneme in the phoneme characteristics in combination with preset energy adaptation parameters, and a corresponding duration is assigned to each phoneme in the phoneme characteristics in combination with preset duration adaptation parameters, so that a corresponding pitch, energy and duration of each phoneme in the phoneme characteristics are obtained.
In an embodiment, the preset phoneme attribute adaptation parameters are model parameters in a preset trained phoneme attribute adapter, and the trained phoneme attribute adapter includes a trained phoneme pitch adapter, a trained phoneme energy adapter, and a trained phoneme duration adapter.
The style characteristics and the phoneme characteristics are fused to obtain fusion characteristics, then the fusion characteristics are input into a trained phoneme pitch adapter for decoding, and the pitch corresponding to each phoneme in the phoneme characteristics is output; the fusion feature is input to a trained phoneme energy adapter for decoding, energy corresponding to each phoneme in the phoneme feature is output, the fusion feature is input to a trained phoneme duration adapter for decoding, duration corresponding to each phoneme in the phoneme feature is output, and accordingly pitch, energy and duration corresponding to each phoneme in the phoneme feature are obtained.
According to the embodiment, the preset phoneme attribute adaptation parameters are correspondingly set, the preset pitch adaptation parameters, the preset energy adaptation parameters and the preset duration adaptation parameters are included, the pitch, the energy and the duration of each phoneme in the phoneme features are respectively obtained to represent the style information of each phoneme according to the combination of the style features, the phoneme features and the preset phoneme attribute adaptation parameters, the richness of the style information is improved, and the accuracy of the speech synthesis result is further improved.
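A minimal sketch of such a phoneme attribute adapter follows, assuming PyTorch and variance-adaptor-style convolutional heads in the spirit of FastSpeech 2; the fusion by broadcast addition and the layer choices are assumptions, as the patent fixes only the three pitch/energy/duration adapters and their inputs:

```python
import torch
import torch.nn as nn

class AttributePredictor(nn.Module):
    """One illustrative adapter head (pitch, energy, or duration): a small
    conv stack over the fused phoneme+style features."""
    def __init__(self, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(d_model, 1)

    def forward(self, fused):                    # (batch, phonemes, d_model)
        h = self.net(fused.transpose(1, 2)).transpose(1, 2)
        return self.proj(h).squeeze(-1)          # one value per phoneme

class PhonemeAttributeAdapter(nn.Module):
    """Fuses style and phoneme features, then predicts pitch, energy and
    duration for every phoneme (step S204)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.pitch = AttributePredictor(d_model)
        self.energy = AttributePredictor(d_model)
        self.duration = AttributePredictor(d_model)

    def forward(self, phoneme_feats, style_feats):
        # Broadcast a global (pooled) style vector over the phoneme sequence.
        fused = phoneme_feats + style_feats.unsqueeze(1)
        return (self.pitch(fused), self.energy(fused), self.duration(fused))
```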
In the step of assigning, if the similarity is detected to be smaller than the similarity threshold, a corresponding phoneme attribute to each phoneme in the phoneme features according to the style features in combination with preset phoneme attribute adaptation parameters, the phoneme attributes are obtained only when the similarity is smaller than the similarity threshold, that is, only when the style features meet the speech synthesis target. The style features, the phoneme features and the preset phoneme attribute adaptation parameters are then combined to obtain the phoneme attribute of each phoneme, so that the phoneme attributes of all phonemes contain both the style information of the reference audio and the text information of the target text, improving the accuracy of the speech synthesis result.
And step S205, decoding the phoneme attributes corresponding to all phonemes by using the trained audio generator to obtain target audio corresponding to the target text.
And decoding the phoneme attributes corresponding to all phonemes by using a trained audio generator to obtain target audio corresponding to the target text, wherein the target audio is the audio which is consistent with the style of the reference audio and takes the target text as the voice content.
Optionally, the trained audio generator comprises a trained decoder and a trained vocoder;
decoding the phoneme attributes corresponding to all phonemes by using the trained audio generator, wherein the obtaining the target audio corresponding to the target text comprises the following steps:
decoding the phoneme attributes corresponding to all phonemes by using a trained decoder to obtain a target Mel frequency spectrum;
and decoding the target Mel frequency spectrum by using the trained vocoder to obtain target audio corresponding to the target text.
The human ear is more sensitive to differences between low-frequency signals than to differences between high-frequency signals, so the perceived distance between two pairs of frequencies that are equally spaced in the frequency domain is not necessarily equal. Therefore, in text-to-speech tasks the frequency-domain scale is usually adjusted based on the mel spectrum, so that the ear's perceived distance between equally spaced frequency pairs is consistent on the new scale, better matching human perception of speech.
Therefore, the mel spectrum is widely applied to the acoustic field, and in this embodiment, the trained audio generator is set to include a trained decoder and a trained vocoder, where the trained decoder is used to decode the phoneme attributes corresponding to all phonemes to obtain a target mel spectrum, and the trained vocoder is used to decode the target mel spectrum to obtain the target audio corresponding to the target text.
The embodiment sets the trained audio generator to comprise a trained decoder and a trained vocoder, decodes the phoneme attributes corresponding to all phonemes by using the trained decoder to obtain a target mel frequency spectrum, decodes the target mel frequency spectrum by using the trained vocoder to obtain target audio corresponding to a target text, and improves the accuracy of the target audio.
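A minimal sketch of the two-stage audio generator, assuming PyTorch; the concrete decoder and vocoder networks (for example a HiFi-GAN-style vocoder) are assumptions, since the patent names only a "trained decoder" and a "trained vocoder":

```python
import torch
import torch.nn as nn

class AudioGenerator(nn.Module):
    """Illustrative two-stage audio generator: a decoder that turns the
    attribute-conditioned phoneme features into a mel spectrogram, and a
    vocoder that turns the mel spectrogram into a waveform."""
    def __init__(self, decoder: nn.Module, vocoder: nn.Module):
        super().__init__()
        self.decoder = decoder   # e.g. Transformer/conv decoder -> mel (assumed)
        self.vocoder = vocoder   # e.g. HiFi-GAN-style network -> audio (assumed)

    @torch.no_grad()
    def forward(self, attribute_feats):
        mel = self.decoder(attribute_feats)   # (batch, frames, n_mels)
        return self.vocoder(mel)              # (batch, samples)
```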
In the step of decoding the phoneme attributes corresponding to all phonemes using the trained audio generator to obtain the target audio corresponding to the target text, phoneme attributes that contain both the style information of the reference audio and the text information of the target text are decoded into target audio that is consistent with the style of the reference audio and takes the target text as speech content, improving the accuracy of the target audio.
According to the embodiment of the invention, a target text and a reference audio are obtained; the trained phoneme encoder extracts phoneme features from the target text, and the trained style encoder extracts style features from the reference audio; the trained text predictor performs text prediction on the style features to obtain a predicted text; the real text corresponding to the reference audio is obtained, the similarity between the predicted text and the real text is calculated, and whether the similarity is smaller than a preset similarity threshold is detected; if it is, each phoneme in the phoneme features is assigned a corresponding phoneme attribute according to the style features in combination with preset phoneme attribute adaptation parameters, and the trained audio generator decodes the phoneme attributes of all phonemes to obtain the target audio corresponding to the target text. Because the trained style encoder extracts only the style features of the reference audio and not its text information, target audio consistent with the style of the reference audio and taking the target text as speech content is synthesized from the style features and the target text alone, improving the speech synthesis accuracy of the speech synthesis system on non-parallel data. In the medical technical field, this effectively assists medical staff in accurately handling a large amount of redundant and tedious repetitive work, greatly improving their working efficiency and quality.
Corresponding to the artificial intelligence based speech synthesis method of the above embodiment, fig. 3 shows a block diagram of the artificial intelligence based speech synthesis apparatus according to the second embodiment of the present invention, and for convenience of explanation, only the parts related to the embodiment of the present invention are shown.
Referring to fig. 3, the voice synthesizing apparatus includes:
the feature extraction module 31 is configured to obtain a target text and a reference audio, perform phoneme feature extraction on the target text using a trained phoneme encoder to obtain phoneme features, and perform style feature extraction on the reference audio using a trained style encoder to obtain style features;
a text prediction module 32, configured to perform text prediction on style characteristics using a trained text predictor, so as to obtain a predicted text;
the similarity calculation module 33 is configured to obtain a real text corresponding to the reference audio, calculate a similarity between the predicted text and the real text, and detect whether the similarity is smaller than a preset similarity threshold;
the attribute determining module 34 is configured to assign a corresponding phoneme attribute to each phoneme in the phoneme feature according to the style feature in combination with a preset phoneme attribute adaptation parameter if the similarity is detected to be smaller than the similarity threshold;
The audio synthesis module 35 is configured to decode the phoneme attributes corresponding to all phonemes by using the trained audio generator, so as to obtain the target audio corresponding to the target text.
Optionally, the trained phoneme encoder comprises a trained phoneme embedding layer and a trained phoneme feature extraction layer, and the feature extraction module 31 comprises:
the phoneme embedding sub-module is used for embedding the phonemes into the target text by using the trained phoneme embedding layer to obtain a phoneme sequence;
and the phoneme feature extraction submodule is used for carrying out feature extraction on the phoneme sequence by using the trained phoneme feature extraction layer to obtain phoneme features.
Optionally, the similarity calculation module 33 includes:
the text vector conversion sub-module is used for converting the predicted text into a predicted text vector according to a word vector technology and converting the real text into a real text vector according to the word vector technology;
and the similarity calculation sub-module is used for calculating the similarity between the predicted text vector and the real text vector and determining the similarity between the predicted text and the real text.
Optionally, the attribute determining module 34 includes:
the parameter acquisition sub-module is used for acquiring preset phoneme attribute adaptation parameters, wherein the preset phoneme attribute adaptation parameters comprise preset pitch adaptation parameters, preset energy adaptation parameters and preset duration adaptation parameters;
The attribute determination submodule is used for assigning corresponding pitch to each phoneme in the phoneme features according to style features and combining preset pitch adaptation parameters, assigning corresponding energy to each phoneme in the phoneme features and assigning corresponding duration to each phoneme in the phoneme features by combining preset energy adaptation parameters and combining preset duration adaptation parameters;
and the attribute acquisition sub-module is used for obtaining the pitch, the energy and the duration corresponding to each phoneme in the phoneme characteristics.
Optionally, the trained audio generator includes a trained decoder and a trained vocoder, and the audio synthesis module 35 includes:
the attribute decoding submodule is used for decoding the phoneme attributes corresponding to all phonemes by using the trained decoder to obtain a target Mel frequency spectrum;
and the audio synthesis submodule is used for decoding the target Mel frequency spectrum by using the trained vocoder to obtain target audio corresponding to the target text.
Optionally, the feature extraction module 31 includes:
the sample acquisition sub-module is used for acquiring the trained text predictor, the sample reference audio and the sample real text corresponding to each sample reference audio;
The feature extraction submodule is used for extracting style features of the sample reference audio by using the style encoder to obtain sample style features;
the text prediction sub-module is used for carrying out text prediction on the sample style characteristics by using a trained text predictor to obtain a sample predicted text;
the model training sub-module is used for calculating model loss according to the sample prediction text and the sample real text, training the style encoder in a gradient inversion mode according to the model loss until the model loss meets preset training conditions, and obtaining the trained style encoder.
Optionally, the model training submodule includes:
the optimization direction determining unit is used for obtaining model parameters of the style encoder, calculating training gradients according to model loss and the model parameters, and determining the optimization direction of the model parameters according to the training gradients;
the model parameter adjusting unit is used for reversing the optimization direction to obtain an optimization reversing direction, and carrying out iterative adjustment on the model parameters according to the optimization reversing direction.
It should be noted that, because the content of information interaction and execution process between the modules and the embodiment of the method of the present invention are based on the same concept, specific functions and technical effects thereof may be referred to in the method embodiment section, and details thereof are not repeated herein.
Fig. 4 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. As shown in fig. 4, the computer device of this embodiment includes: at least one processor (only one shown in fig. 4), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor executing the computer program to perform the steps of any of the various speech synthesis method embodiments described above.
The computer device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that fig. 4 is merely an example of a computer device and is not intended to limit the computer device, and that a computer device may include more or fewer components than shown, or may combine certain components, or different components, such as may also include a network interface, a display screen, an input device, and the like.
The processor may be a CPU, but may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory includes a readable storage medium, an internal memory, etc., where the internal memory may be the memory of the computer device, the internal memory providing an environment for the execution of an operating system and computer-readable instructions in the readable storage medium. The readable storage medium may be a hard disk of a computer device, and in other embodiments may be an external storage device of the computer device, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. that are provided on the computer device. Further, the memory may also include both internal storage units and external storage devices of the computer device. The memory is used to store an operating system, application programs, boot loader (BootLoader), data, and other programs such as program codes of computer programs, and the like. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for distinguishing them from each other and are not intended to limit the protection scope of the present invention. For the specific working process of the units and modules in the above apparatus, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like.

The computer-readable medium may include at least: any entity or device capable of carrying computer program code, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunications signals.
The present invention may also be implemented as a computer program product that implements all or part of the steps of the method embodiments described above; when the computer program product runs on a computer device, it causes the computer device to execute the steps of the method embodiments described above.
Each of the foregoing embodiments emphasizes different aspects. For parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention and are intended to be included within the scope of the present invention.

Claims (10)

1. A speech synthesis method based on artificial intelligence, the speech synthesis method comprising:
obtaining a target text and a reference audio, performing phoneme feature extraction on the target text by using a trained phoneme encoder to obtain phoneme features, and performing style feature extraction on the reference audio by using a trained style encoder to obtain style features;
performing text prediction on the style features by using a trained text predictor to obtain a predicted text;
obtaining a real text corresponding to the reference audio, calculating the similarity between the predicted text and the real text, and detecting whether the similarity is smaller than a preset similarity threshold;
if the similarity is detected to be smaller than the similarity threshold, assigning a corresponding phoneme attribute to each phoneme in the phoneme features according to the style features in combination with preset phoneme attribute adaptation parameters; and
decoding the phoneme attributes corresponding to all phonemes by using a trained audio generator to obtain target audio corresponding to the target text.
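For readers outside patent practice, the control flow of the claim above can be summarized in a short sketch. The following Python fragment is purely illustrative: the callables `phoneme_encoder`, `style_encoder`, `text_predictor`, `attribute_adaptor`, `audio_generator`, and `text_similarity` (a helper like the one sketched after claim 3), as well as the threshold value, are hypothetical names standing in for the trained components the claim recites, not an implementation from the patent.

```python
def synthesize(target_text, reference_audio, real_text,
               phoneme_encoder, style_encoder, text_predictor,
               attribute_adaptor, audio_generator,
               similarity_threshold=0.5):
    """Illustrative sketch of the claimed synthesis flow."""
    # Extract phoneme features from the target text and
    # style features from the reference audio.
    phoneme_features = phoneme_encoder(target_text)
    style_features = style_encoder(reference_audio)

    # Predict the text content carried by the style features.
    predicted_text = text_predictor(style_features)

    # The style features should carry the reference audio's style but not
    # its text content, so synthesis proceeds only when the predicted text
    # is sufficiently dissimilar from the real text.
    if text_similarity(predicted_text, real_text) < similarity_threshold:
        # Assign pitch, energy, and duration to each phoneme, then
        # decode the attributed phonemes into the target audio.
        phoneme_attributes = attribute_adaptor(phoneme_features, style_features)
        return audio_generator(phoneme_attributes)
    return None  # style features still leak text content
```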
2. The speech synthesis method according to claim 1, wherein the trained phoneme encoder comprises a trained phoneme embedding layer and a trained phoneme feature extraction layer;
wherein performing phoneme feature extraction on the target text by using the trained phoneme encoder to obtain the phoneme features comprises:
performing phoneme embedding on the target text by using the trained phoneme embedding layer to obtain a phoneme sequence; and
performing feature extraction on the phoneme sequence by using the trained phoneme feature extraction layer to obtain the phoneme features.
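One way to realize the two-layer phoneme encoder of this claim is sketched below in PyTorch, assuming the phonemes have already been converted to integer IDs and choosing a Transformer encoder as the feature extraction layer; both assumptions are illustrative, as the claim fixes neither the tokenization nor the layer architecture.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Sketch of the claimed two-layer phoneme encoder."""
    def __init__(self, n_phonemes=100, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # Phoneme embedding layer: maps phoneme IDs to a phoneme sequence
        # of dense vectors.
        self.embedding = nn.Embedding(n_phonemes, d_model)
        # Phoneme feature extraction layer: contextualizes the sequence
        # (a Transformer encoder is an assumed choice, not the claim's).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.extractor = nn.TransformerEncoder(layer, n_layers)

    def forward(self, phoneme_ids):               # (batch, seq_len), int64
        phoneme_sequence = self.embedding(phoneme_ids)
        return self.extractor(phoneme_sequence)   # (batch, seq_len, d_model)

encoder = PhonemeEncoder()
features = encoder(torch.randint(0, 100, (1, 12)))  # -> shape (1, 12, 256)
```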
3. The speech synthesis method according to claim 1, wherein calculating the similarity between the predicted text and the real text comprises:
converting the predicted text into a predicted text vector and converting the real text into a real text vector using a word vector technique; and
calculating the similarity between the predicted text vector and the real text vector, and taking it as the similarity between the predicted text and the real text.
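The claim leaves both the word vector technique and the similarity measure open. A minimal self-contained sketch, assuming bag-of-words count vectors and cosine similarity purely for illustration:

```python
import math
from collections import Counter

def text_to_vector(text):
    # Toy "word vector": a bag-of-words count vector keyed by token.
    return Counter(text.lower().split())

def cosine_similarity(v1, v2):
    common = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in common)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def text_similarity(predicted_text, real_text):
    return cosine_similarity(text_to_vector(predicted_text),
                             text_to_vector(real_text))

print(text_similarity("take the medication daily",
                      "take this medication daily"))
```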
4. The speech synthesis method according to claim 1, wherein assigning a corresponding phoneme attribute to each phoneme in the phoneme features according to the style features in combination with the preset phoneme attribute adaptation parameters comprises:
acquiring the preset phoneme attribute adaptation parameters, which comprise a preset pitch adaptation parameter, a preset energy adaptation parameter, and a preset duration adaptation parameter; and
according to the style features, assigning a corresponding pitch to each phoneme in the phoneme features in combination with the preset pitch adaptation parameter, assigning a corresponding energy to each phoneme in combination with the preset energy adaptation parameter, and assigning a corresponding duration to each phoneme in combination with the preset duration adaptation parameter, to obtain the pitch, energy, and duration of each phoneme in the phoneme features.
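A hedged sketch of how the three adaptation parameters might be realized as learned per-phoneme predictors conditioned on the style features, in the spirit of variance adaptors from the text-to-speech literature; the module and dimension names are assumptions, not the patent's design.

```python
import torch
import torch.nn as nn

class PhonemeAttributeAdaptor(nn.Module):
    """Sketch: per-phoneme pitch/energy/duration predictors conditioned on style."""
    def __init__(self, d_model=256, d_style=128):
        super().__init__()
        def head():
            # One small predictor per phoneme attribute.
            return nn.Sequential(
                nn.Linear(d_model + d_style, d_model),
                nn.ReLU(),
                nn.Linear(d_model, 1),
            )
        self.pitch_head = head()
        self.energy_head = head()
        self.duration_head = head()

    def forward(self, phoneme_features, style_features):
        # Broadcast the utterance-level style vector onto every phoneme.
        style = style_features.unsqueeze(1).expand(-1, phoneme_features.size(1), -1)
        x = torch.cat([phoneme_features, style], dim=-1)
        pitch = self.pitch_head(x).squeeze(-1)       # (batch, seq_len)
        energy = self.energy_head(x).squeeze(-1)     # (batch, seq_len)
        duration = self.duration_head(x).squeeze(-1) # (batch, seq_len)
        return pitch, energy, duration
```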
5. The speech synthesis method according to claim 4, wherein the trained audio generator comprises a trained decoder and a trained vocoder;
wherein decoding the phoneme attributes corresponding to all phonemes by using the trained audio generator to obtain the target audio corresponding to the target text comprises:
decoding the phoneme attributes corresponding to all phonemes by using the trained decoder to obtain a target Mel spectrum; and
decoding the target Mel spectrum by using the trained vocoder to obtain the target audio corresponding to the target text.
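To illustrate the second stage only, the sketch below inverts a Mel spectrum to a waveform with librosa's Griffin-Lim-based `mel_to_audio`; this is a readily available stand-in for the trained neural vocoder the claim recites, not an equivalent of it, and the spectrum here is a random placeholder for the decoder's output.

```python
import numpy as np
import librosa

def vocoder_griffin_lim(mel_spectrum, sr=22050, n_fft=1024, hop_length=256):
    """Stand-in vocoder: inverts a power Mel spectrum to audio via Griffin-Lim."""
    return librosa.feature.inverse.mel_to_audio(
        mel_spectrum, sr=sr, n_fft=n_fft, hop_length=hop_length)

# Placeholder for a decoder output of shape (n_mels, n_frames).
mel = np.abs(np.random.randn(80, 200)).astype(np.float32)
audio = vocoder_griffin_lim(mel)  # 1-D waveform array
```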
6. The speech synthesis method according to claim 1, wherein the training process of the style encoder comprises:
acquiring a trained text predictor, sample reference audios, and a sample real text corresponding to each sample reference audio;
performing style feature extraction on the sample reference audios by using the style encoder to obtain sample style features;
performing text prediction on the sample style features by using the trained text predictor to obtain a sample predicted text; and
calculating a model loss according to the sample predicted text and the sample real text, and training the style encoder in a gradient inversion manner according to the model loss until the model loss meets a preset training condition, to obtain the trained style encoder.
7. The speech synthesis method according to claim 6, wherein training the style encoder in a gradient inversion manner comprises:
obtaining model parameters of the style encoder, calculating a training gradient according to the model loss and the model parameters, and determining an optimization direction of the model parameters according to the training gradient; and
inverting the optimization direction to obtain an inverted optimization direction, and iteratively adjusting the model parameters according to the inverted optimization direction.
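Claims 6 and 7 describe adversarial training of the style encoder: gradients from the text-prediction loss are inverted so that the style encoder learns to remove text content from the style features. One common way to realize this is a gradient reversal layer, sketched below in PyTorch; the gradient-reversal technique is an assumption drawn from domain-adversarial training, and the helper names are hypothetical.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Invert the optimization direction, as in claim 7.
        return -grad_output

def train_step(style_encoder, text_predictor, optimizer, loss_fn,
               sample_audio, sample_real_text_ids):
    # `optimizer` should cover only the style encoder's parameters;
    # the text predictor is already trained and stays frozen (claim 6).
    style_features = style_encoder(sample_audio)
    # Reverse gradients between the encoder and the text predictor.
    reversed_features = GradReverse.apply(style_features)
    predicted_logits = text_predictor(reversed_features)
    loss = loss_fn(predicted_logits, sample_real_text_ids)
    optimizer.zero_grad()
    loss.backward()  # encoder parameters move *against* text recovery
    optimizer.step()
    return loss.item()
```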
8. A speech synthesis apparatus based on artificial intelligence, the speech synthesis apparatus comprising:
a feature extraction module, configured to obtain a target text and a reference audio, perform phoneme feature extraction on the target text by using a trained phoneme encoder to obtain phoneme features, and perform style feature extraction on the reference audio by using a trained style encoder to obtain style features;
a text prediction module, configured to perform text prediction on the style features by using a trained text predictor to obtain a predicted text;
a similarity calculation module, configured to obtain a real text corresponding to the reference audio, calculate the similarity between the predicted text and the real text, and detect whether the similarity is smaller than a preset similarity threshold;
an attribute determination module, configured to assign a corresponding phoneme attribute to each phoneme in the phoneme features according to the style features in combination with preset phoneme attribute adaptation parameters if the similarity is detected to be smaller than the similarity threshold; and
an audio synthesis module, configured to decode the phoneme attributes corresponding to all phonemes by using a trained audio generator to obtain target audio corresponding to the target text.
9. A computer device, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech synthesis method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 7.
CN202310721752.6A 2023-06-16 2023-06-16 Speech synthesis method, device, computer equipment and medium based on artificial intelligence Pending CN116825085A (en)

Priority Applications (1)

Application Number: CN202310721752.6A
Priority Date: 2023-06-16
Filing Date: 2023-06-16
Title: Speech synthesis method, device, computer equipment and medium based on artificial intelligence


Publications (1)

Publication Number: CN116825085A
Publication Date: 2023-09-29

Family

ID: 88123490


Country Status (1)

Country: CN
Publication: CN116825085A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination