CN115910028A - Speech synthesis method and model generation method - Google Patents

Speech synthesis method and model generation method

Info

Publication number
CN115910028A
CN115910028A (application CN202211160928.7A)
Authority
CN
China
Prior art keywords
voice
speech
text
target
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211160928.7A
Other languages
Chinese (zh)
Inventor
祖新星
何挺
赵中州
周伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202211160928.7A
Publication of CN115910028A
Legal status: Pending

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

An embodiment of the invention provides a speech synthesis method, a model generation method, a voice broadcast method and apparatus, a computing device, and a computer storage medium. The speech synthesis method comprises the following steps: acquiring a text to be subjected to speech synthesis; performing speech feature prediction on the text to obtain speech feature information of a target speech style; adjusting basic speech feature information of a speech synthesis model by using the speech feature information to obtain target speech feature information; and inputting the text into the speech synthesis model so as to perform speech synthesis on the text by using the target speech feature information, obtaining target speech in the target speech style. The technical scheme provided by the embodiment of the invention converts the speech style output by the speech synthesis model from the basic speech style corresponding to the basic speech feature information into the target speech style corresponding to the target speech feature information.

Description

Speech synthesis method and model generation method
Technical Field
Embodiments of the present invention relate to the technical field of speech synthesis, and in particular to a speech synthesis method, a training method for a speech feature prediction model, a voice broadcast method, a computing device, and a computer storage medium.
Background
With the continuous expansion of audio content markets, voice broadcasting is widely applied in various scenes.
In the related art, voice broadcasting is usually implemented with a speech synthesis model: the model takes a text as input and performs speech synthesis on the text based on pre-configured basic speech feature information, such as speech rate, pitch, and volume, to obtain speech.
However, in the course of conceiving the present invention, the inventors found that the speech synthesis models of the related art can only synthesize speech in a single speech style and are not sufficiently accurate.
Disclosure of Invention
Embodiments of the present invention provide a speech synthesis method, a model generation method, a voice broadcast method, corresponding apparatuses, a computing device, and a computer storage medium.
In a first aspect, an embodiment of the present invention provides a speech synthesis method, including:
acquiring a text to be subjected to voice synthesis;
performing voice feature prediction on the text to obtain voice feature information of a target voice style;
adjusting basic voice characteristic information of a voice synthesis model by using the voice characteristic information to obtain target voice characteristic information;
and inputting the text into the voice synthesis model to perform voice synthesis on the text by using the target voice characteristic information to obtain the target voice of the target voice style.
In a second aspect, an embodiment of the present invention provides a model generation method, including:
acquiring a voice sample set, wherein the voice sample set comprises a plurality of voice samples belonging to the same target voice style;
performing text recognition on the plurality of voice samples respectively, and determining text samples respectively corresponding to the plurality of voice samples;
respectively extracting the characteristics of the voice samples, and determining the voice sample characteristic information respectively corresponding to the voice samples;
and training a voice characteristic prediction model by utilizing text samples and voice sample characteristic information which respectively correspond to the plurality of voice samples, wherein the voice characteristic prediction model is used for predicting voice characteristic information of a text to be subjected to voice synthesis, the voice characteristic information is used for adjusting basic voice characteristic information of the voice synthesis model to obtain target voice characteristic information, and the target voice characteristic information is used for performing voice synthesis on an input text to obtain target voice matched with the target voice style.
In a third aspect, an embodiment of the present invention provides a voice broadcast method, including:
acquiring a text to be broadcasted;
carrying out voice feature prediction on the text to be broadcasted to obtain voice feature information of a target voice style;
inputting the text to be broadcasted and the voice feature information into a speech synthesis model, adjusting basic speech feature information of the speech synthesis model based on the voice feature information to obtain target speech feature information, and performing speech synthesis on the text to be broadcasted by using the target speech feature information to obtain broadcast speech of the target speech style;
and playing the broadcast voice.
In a fourth aspect, an embodiment of the present invention provides a speech synthesis apparatus, including:
the first text acquisition module is used for acquiring a text to be subjected to voice synthesis;
the first characteristic prediction module is used for carrying out voice characteristic prediction on the text to obtain voice characteristic information of a target voice style;
the first characteristic information adjusting module is used for adjusting basic voice characteristic information of a voice synthesis model by utilizing the voice characteristic information to obtain target voice characteristic information;
and the first voice synthesis module is used for inputting the text into the voice synthesis model so as to perform voice synthesis on the text by using the target voice characteristic information to obtain the target voice of the target voice style.
In a fifth aspect, an embodiment of the present invention provides a model generation apparatus, including:
the sample acquisition module is used for acquiring a voice sample set, wherein the voice sample set comprises a plurality of voice samples belonging to the same target voice style;
the recognition module is used for respectively performing text recognition on the plurality of voice samples and determining text samples respectively corresponding to the plurality of voice samples;
the feature extraction module is used for respectively extracting features of the voice samples and determining voice sample feature information respectively corresponding to the voice samples;
the training module is used for training a voice feature prediction model by utilizing text samples and voice sample feature information which correspond to the voice samples respectively, the voice feature prediction model is used for predicting voice feature information of a text to be subjected to voice synthesis, the voice feature information is used for adjusting basic voice feature information of the voice synthesis model to obtain target voice feature information, and the target voice feature information is used for performing voice synthesis on an input text to obtain target voice matched with the target voice style.
In a sixth aspect, an embodiment of the present invention provides a voice broadcast apparatus, including:
the second text acquisition module is used for acquiring a text to be broadcasted;
the second characteristic prediction module is used for carrying out voice characteristic prediction on the text to be broadcasted to obtain voice characteristic information of a target voice style;
the second feature information adjusting module is used for inputting the text to be broadcasted and the voice feature information into a speech synthesis model, adjusting basic speech feature information of the speech synthesis model based on the voice feature information to obtain target speech feature information, and performing speech synthesis on the text to be broadcasted by using the target speech feature information to obtain broadcast speech of the target speech style;
and the voice playing module is used for playing the broadcast speech.
In a seventh aspect, an embodiment of the present invention provides a computing device, including a processing component and a storage component;
the storage component stores one or more computer instructions; the one or more computer instructions are used for being called and executed by the processing component to implement the voice synthesis method provided by the embodiment of the present invention, or implement the model generation method provided by the embodiment of the present invention, or implement the voice broadcast method provided by the embodiment of the present invention.
In an eighth aspect, an embodiment of the present invention provides a computer storage medium, which stores a computer program, where when the computer program is executed by a computer, the computer program implements a voice synthesis method provided in an embodiment of the present invention, or implements a model generation method provided in an embodiment of the present invention, or implements a voice broadcast method as provided in an embodiment of the present invention.
Before a text to be subjected to speech synthesis is input into a speech synthesis model, speech feature prediction is performed on the text to obtain speech feature information of a target speech style. The basic speech feature information of the speech synthesis model is then adjusted using the obtained feature information to generate target speech feature information. Finally, the text is input into the speech synthesis model, which synthesizes speech for the text using the target speech feature information and generates target speech in the target style. In this way, the speech style output by the speech synthesis model is converted from the basic speech style corresponding to the basic speech feature information into the target speech style corresponding to the target speech feature information.
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a diagram illustrating a system architecture to which embodiments of the present invention may be applied;
FIG. 2 is a flow diagram schematically illustrating a method of speech synthesis according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech synthesis method provided by an embodiment of the invention;
FIG. 4 schematically illustrates a flow chart of a method of model generation provided by an embodiment of the invention;
FIG. 5 schematically illustrates a waveform diagram in one embodiment of the invention;
fig. 6 schematically shows a flowchart of a voice broadcast method according to an embodiment of the present invention;
fig. 7 is a schematic system diagram illustrating a voice broadcast method according to an embodiment of the present invention;
fig. 8 schematically shows a block diagram of a speech synthesis apparatus provided in an embodiment of the present invention;
FIG. 9 schematically illustrates a block diagram of a model generation apparatus provided in one embodiment of the present invention;
fig. 10 is a block diagram schematically illustrating a voice broadcasting device provided in an embodiment of the present invention;
FIG. 11 schematically illustrates a block diagram of a computing device provided in one embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
In some of the flows described in the present specification and claims and in the above figures, a number of operations are included that occur in a particular order, but it should be clearly understood that these operations may be performed out of order or in parallel as they occur herein, with the order of the operations being indicated as 101, 102, etc. merely to distinguish between the various operations, and the order of the operations by themselves does not represent any order of performance. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
Text To Speech (TTS) converts text into speech. In the related art, speech synthesis typically involves first inviting a professional voice talent to record a high-quality speech library in a professional recording studio, obtaining the text corresponding to each recording in the library, and then training a neural network model that takes the text as input and outputs the speech, thereby obtaining a trained speech synthesis model.
A TTS speaker usually refers to the timbre and pronunciation style of the speech synthesized by a speech synthesis model. Speech synthesis models trained on speech libraries recorded by different voice talents have different basic speech feature information, such as speech rate, pitch, and volume, and correspondingly different timbres and pronunciation styles.
In the course of conceiving the present invention, the inventors found that a TTS speaker in the related art has one specific pronunciation style, and that style is often suitable only for specific scenes. For example, in the basic speech feature information of most existing speech synthesis models, the speech rate is slow, the pitch is gentle, and the volume is low; correspondingly, when such a TTS speaker broadcasts, the pronunciation is soft and flat and lacks cadence. This pronunciation style tends to suit only scenes with short broadcast durations; in long broadcasts, the lack of modulation easily causes auditory fatigue in the user.
To give the pronunciation of a TTS speaker some rhythm, each recorded text can be recorded with several different emotions when building the speech library, and the resulting multi-emotion speech library is then used to train the speech synthesis model. After training with this method, the expressiveness of the TTS speaker's speech is improved: it is no longer flat and has a certain cadence. However, although this optimization improves pronunciation expressiveness, the TTS speaker can still only broadcast in a single voice style.
With the continuous expansion of the audio content market, voice broadcasting is widely applied in various scenes; correspondingly, TTS speakers need to adopt different voice styles in different broadcast scenes, which a speech synthesis model alone cannot directly achieve.
In the related art, in order to enrich the pronunciation style of TTS speakers, the following method is generally adopted:
1. when the voice broadcasting with the specific voice style is required, the sound is recorded according to the characteristic voice style to generate a voice base with the specific style, and the voice base with the specific style is used for training a voice synthesis model with the specific style. The pronunciation effect of the method is the best, but the newly added voice style needs to retrain the voice synthesis model, the retraining of the voice synthesis model needs to retrain the acquisition of the training sample, the cost is higher, the time for training the voice synthesis model is longer, and the requirement of the newly added voice style cannot be met in time.
2. When voice broadcasting in a specific voice style is required, the prosody of the text is first adjusted manually using the Speech Synthesis Markup Language (SSML), and the adjusted text is then input into the speech synthesis model.
To solve the technical problem in the related art that adding a new voice style for a TTS speaker is difficult, an embodiment of the present invention provides a speech synthesis method. Before a text to be subjected to speech synthesis is input into a speech synthesis model, speech feature prediction is first performed on the text to obtain speech feature information of a target speech style. The basic speech feature information of the speech synthesis model is then adjusted using this feature information to generate target speech feature information. The text is then input into the speech synthesis model, which synthesizes speech for the text using the target speech feature information and generates target speech in the target style, so that the speech style output by the speech synthesis model is converted from the basic speech style corresponding to the basic speech feature information into the target speech style corresponding to the target speech feature information.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a system architecture diagram to which the technical solution of the embodiment of the present invention can be applied. The system architecture may include a user side 101 and a service side 102. (Alternatively, it may include a server and a plurality of clients, or a server, a first user terminal, a second user terminal, and so on.)
The user terminal 101 and the server terminal 102 establish a connection through a network. The network provides a medium for communication links between the user side 101 and the service side 102. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user terminal 101 may interact with the service terminal 102 via a network to receive or send messages, etc.
The user side 101 may be a browser, an APP (Application), a web application such as an HTML5 (HyperText Markup Language, 5th edition) application, a light application (also referred to as an applet), or a cloud application. The user side 101 may be deployed in an electronic device and run depending on the device or on certain APPs in the device. The electronic device may have a display screen and support information browsing, and may be, for example, a personal mobile terminal such as a mobile phone, a tablet computer, or a personal computer; for ease of understanding, the user side is mainly represented by a device image in Fig. 1. Various other types of applications may also be deployed in the electronic device, such as human-machine conversation applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, and social platform software.
The server 102 may include a server providing various services, such as a server for background training that provides support for a model used on the user side 101, and a server for processing interaction information sent by the user side.
It should be noted that the server 102 may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. The server may also be a server of a distributed system, or a server incorporating a blockchain. The server can also be a cloud server, or an intelligent cloud computing server or an intelligent cloud host with artificial intelligence technology.
It should be noted that the speech synthesis method, the model generation method, and the voice broadcast method provided in the embodiments of the present invention may be executed by the server 102, with the corresponding speech synthesis apparatus, model generation apparatus, and voice broadcast apparatus disposed in the server 102. However, in other embodiments of the present invention, the user terminal 101 may have functions similar to those of the server 102 and thus execute these methods itself. In still other embodiments, the methods may be executed jointly by the user terminal 101 and the server 102.
it should be understood that the number of clients and servers in fig. 1 is merely illustrative. There may be any number of clients and servers, as desired for implementation.
The details of implementation of the technical solution of the embodiment of the present invention are set forth below.
Fig. 2 schematically shows a flowchart of a speech synthesis method provided by an embodiment of the present invention, where the speech synthesis method may include the following steps:
and 201, acquiring a text to be subjected to voice synthesis.
And 202, performing voice characteristic prediction on the text to obtain voice characteristic information of a target voice style.
And 203, adjusting basic voice characteristic information of the voice synthesis model by using the voice characteristic information to obtain target voice characteristic information.
And 204, inputting the text into a speech synthesis model to perform speech synthesis on the text by using the target speech characteristic information to obtain target speech with the target speech style.
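For orientation, the following is a minimal, runnable Python sketch of steps 201 to 204. The FeaturePredictor and TTSModel classes, their methods, and the multiplicative adjustment are illustrative assumptions, not the patent's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical stand-ins: the predictor maps text to style feature values,
# the synthesizer holds the pre-configured basic speech feature information.

class FeaturePredictor:
    def predict(self, text: str) -> dict:
        # Placeholder: a real model would be a trained neural network (step 202).
        return {"pitch": 1.2, "rate": 0.9, "volume": 1.1}

@dataclass
class TTSModel:
    base: dict  # basic speech feature information (pitch, rate, volume, ...)

    def synthesize(self, text: str, target: dict) -> bytes:
        print(f"synthesizing {text!r} with features {target}")
        return b"<waveform>"  # placeholder for the synthesized audio

def speak(text: str, predictor: FeaturePredictor, tts: TTSModel) -> bytes:
    features = predictor.predict(text)                          # step 202
    target = {k: tts.base[k] * v for k, v in features.items()}  # step 203
    return tts.synthesize(text, target)                         # step 204

speak("very worth buying", FeaturePredictor(),
      TTSModel(base={"pitch": 100.0, "rate": 100.0, "volume": 100.0}))
```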
According to an embodiment of the present invention, the text to be subjected to speech synthesis may be any text in any language. For example, the language may be Chinese, English, Russian, German, and so on, and the text may include news text, entertainment text, event commentary text, and the like.
According to an embodiment of the present invention, the text to be subjected to speech synthesis can be acquired from user input; the input mode may include voice input, character recognition performed on physical text, peripheral device input, and the like.
According to the embodiment of the invention, different texts usually need different voice styles to be broadcasted, and the same text can also need to be broadcasted by adopting different voice styles in different language environments, so that the voice characteristics of the text can be predicted based on the text content and the expected target voice style to obtain the voice characteristic information of the target voice style.
According to the embodiment of the invention, the basic speech feature information may include feature parameters which are generated after a speech synthesis model is trained by using samples and influence the pronunciation style of a TTS speaker. The pronunciation style of the corresponding TTS speaker can be changed by changing the characteristic parameter value of the basic voice characteristic information.
According to the embodiment of the present invention, the implementation manner of adjusting the basic voice feature information by using the voice feature information may include, for example: and replacing the basic voice feature information with the voice feature information so as to take the voice feature information as target voice feature information, or adjusting the value of the basic voice feature by using the voice feature information on the basis of the basic voice feature information to obtain the target voice feature information.
According to an embodiment of the present invention, the speech feature information may be input into the speech synthesis model first; after the basic speech feature information of the model has been adjusted to obtain the target speech feature information, the text is input so that the model synthesizes speech for the text based on the adjusted target speech feature information and generates target speech in the target style. Alternatively, the speech feature information and the text may be input into the speech synthesis model together; after input, the basic speech feature information is first adjusted using the speech feature information to generate the target speech feature information, and the text is then synthesized using the target speech feature information to output the target speech.
According to an embodiment of the present invention, the target speech feature information may indicate a pronunciation mode of each unit to be pronounced in the text, and based on this, in an embodiment of the present invention, the speech synthesis of the text by the speech synthesis model based on the target speech feature information may specifically be implemented as:
searching a target pronunciation unit corresponding to each unit to be pronounced in the text from a plurality of pronunciation units stored in a voice library;
for any unit to be pronounced, determining a target pronunciation mode matched with the target voice characteristic information from a plurality of pronunciation modes of a target pronunciation unit corresponding to the unit to be pronounced based on the target voice characteristic information corresponding to the unit to be pronounced;
and splicing the target pronunciation units matched with the target pronunciation modes to obtain the target voice.
According to the embodiment of the invention, a plurality of pronunciation units can be stored in the voice library, the pronunciations of the pronunciation units can be obtained by segmenting natural voice, and the pronunciation units are labeled after segmentation (including pronunciation marks, unvoiced and voiced segmentation and the like). Therefore, after the target voice characteristic information corresponding to each pronunciation unit in the text is determined, the voice library can be searched, and the target voice is generated.
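A minimal sketch of this unit-selection path follows, under assumptions: the voice library is modeled as a dict mapping each pronunciation unit to candidate pronunciations, and "matching the target speech feature information" is reduced to a nearest-pitch lookup; the patent does not fix these details.

```python
# Hypothetical voice library: several stored pronunciations per unit,
# each tagged with a pitch value and a reference to an audio clip.
voice_library = {
    "very":  [{"pitch": 90, "clip": "very_a.wav"},
              {"pitch": 120, "clip": "very_b.wav"}],
    "worth": [{"pitch": 95, "clip": "worth_a.wav"}],
    "buy":   [{"pitch": 100, "clip": "buy_a.wav"},
              {"pitch": 130, "clip": "buy_b.wav"}],
}

def select_units(units, target_features):
    chosen = []
    for unit in units:
        target_pitch = target_features[unit]["pitch"]
        # pick the stored pronunciation closest to the target feature value
        best = min(voice_library[unit],
                   key=lambda c: abs(c["pitch"] - target_pitch))
        chosen.append(best["clip"])
    return chosen  # splicing these clips in order yields the target speech

print(select_units(["very", "worth", "buy"],
                   {"very": {"pitch": 118}, "worth": {"pitch": 95},
                    "buy": {"pitch": 128}}))
```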
In another embodiment of the present invention, the speech synthesis of the text by the speech synthesis model based on the target speech feature information may specifically be implemented as follows:
generating a Mel frequency spectrum corresponding to the text according to the target voice characteristic information;
converting the Mel frequency spectrum into waveform;
the waveform is taken as the target voice.
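The patent only states that a mel spectrum is generated and converted into a waveform; it does not name a vocoder. As a dependency-light illustration, the sketch below inverts a randomly generated stand-in mel spectrogram with librosa's Griffin-Lim based routine; in practice a neural vocoder would typically be used.

```python
import numpy as np
import librosa

sr = 22050
# Stand-in for the acoustic model's output: an 80-band mel spectrogram.
mel = np.abs(np.random.randn(80, 200)).astype(np.float32)

# Invert mel -> linear magnitude -> waveform via Griffin-Lim.
waveform = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_iter=32)
print(waveform.shape)  # 1-D float array: the target speech signal
```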
According to an embodiment of the present invention, performing speech feature prediction on the text to obtain speech feature information of the target speech style may be specifically implemented as follows:
performing voice feature prediction on the text by using a voice feature prediction model to obtain voice feature information of a target voice style; the voice feature prediction model is obtained based on a plurality of voice samples belonging to a target voice style.
According to an embodiment of the present invention, the speech feature prediction model may include a neural network model trained by using a plurality of speech samples, and the plurality of speech samples may belong to a target speech style.
According to an embodiment of the present invention, before training the speech feature prediction model, voice styles can be classified according to application scenes; for example, they may be divided into a livestream e-commerce voice style, a sports event commentary voice style, a news broadcast voice style, a human-computer interaction voice style, and the like.
According to an embodiment of the present invention, a plurality of voice samples belonging to the target voice style can be obtained by capturing recordings in the corresponding voice style. For example, voice samples of the news broadcast voice style may be obtained by capturing the audio or video of news programs. When video is captured, the audio in the video can be extracted and used as voice samples.
According to an embodiment of the present invention, because the training samples of the speech feature prediction model are obtained by capturing recordings of the target voice style from the Internet, no labor cost is needed for recording a speech library or labeling sample data, which reduces the training cost of the speech feature prediction model. In addition, the speech feature information predicted by the speech feature prediction model can be supplied to multiple speech synthesis models, so that the pronunciation style of a TTS speaker can be changed without changing the timbre of the TTS speaker.
According to the embodiment of the invention, when the voice style of a TTS speaker needs to be changed, a desired voice style can be determined firstly, and then audio corresponding to the desired voice style is captured from the Internet, so that a plurality of voice samples are obtained. After the voice sample is obtained, the voice sample can be used for training to generate a voice characteristic prediction model. The voice characteristic prediction model is obtained by training voice samples belonging to the same style, so that the voice characteristic prediction model has stronger characteristic extraction capability aiming at the voice style, and voice characteristic information matched with the style can be obtained by prediction from a text.
According to an embodiment of the present invention, before the text is synthesized by the speech synthesis model, the pronunciation style of the synthesized target speech is changed by predicting speech feature information for the text. No recording studio or professional voice talent is needed to record a speech library, and the speech synthesis model does not need to be retrained, which shortens the launch period of a newly added voice style.
According to the embodiment of the invention, the voice feature prediction is performed on the text by using the voice feature prediction model, and the acquisition of the voice feature information of the target voice style can be specifically realized as follows:
segmenting the text according to a preset segmentation granularity to obtain a plurality of segmentation units;
and inputting the plurality of segmentation units into the voice feature prediction model, and outputting voice feature information respectively corresponding to the plurality of segmentation units.
According to an embodiment of the present invention, the preset segmentation granularity may include one or more of a word granularity, a phoneme granularity, and a sentence granularity. The flexibility of prosody control of the target voice can be improved through the combination of different segmentation granularities, so that the pronunciation of the target voice is more natural.
According to an embodiment of the present invention, the finer the segmentation granularity (for example, segmenting the text at phoneme granularity), the larger the adjustable space of the target speech's prosody, but the harder the speech feature prediction becomes. Therefore, the preset segmentation granularity can be chosen by a person skilled in the art according to the actual application requirements; the embodiment of the present invention does not specifically limit it.
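As an illustration of the preset segmentation granularities, the sketch below segments a Chinese sentence at word, character, and sentence granularity; the use of the jieba segmenter is an assumption, since the patent leaves the segmentation method open.

```python
import jieba  # a common Chinese word segmenter, used here as an assumption

text = "非常值得买"  # "very worth buying", the running example

word_units = jieba.lcut(text)  # word granularity: ['非常', '值得', '买']
char_units = list(text)        # finer granularity: one unit per character
sent_units = [text]            # sentence granularity: the whole sentence

print(word_units, char_units, sent_units)
```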
The basic voice feature information of the voice synthesis model is adjusted by utilizing the voice feature information, and the target voice feature information can be specifically obtained by:
and adjusting the basic voice characteristic information of the voice synthesis model by utilizing the voice characteristic information respectively corresponding to the plurality of segmentation units to obtain the target voice characteristic information.
According to an embodiment of the present invention, the basic speech feature information may be a set of general speech features: for any input text, the same basic speech feature information is used for speech synthesis. This is one of the reasons why speech generated by the speech synthesis model based on the basic speech feature information is flat and lacks cadence.
According to the embodiment of the invention, the voice characteristics respectively corresponding to each segmentation unit can be output by respectively inputting the plurality of segmentation units into the voice characteristic prediction model, and the voice characteristics corresponding to each segmentation unit are different. Based on the method, the basic voice characteristics can be adjusted according to each segmentation unit based on the corresponding voice characteristics, sub-target voice characteristic information corresponding to each segmentation unit is generated, and then the sub-target voice characteristic information can be used as target voice characteristic information together.
According to an embodiment of the present invention, the voice feature information includes at least one voice feature value.
According to the embodiment of the present invention, the basic speech feature information of the speech synthesis model is adjusted by using the speech feature information corresponding to each of the plurality of segmentation units, and the obtaining of the target speech feature information can be specifically realized as follows:
denormalizing at least one speech feature value corresponding to each segmentation unit, and determining a plurality of target speech feature values;
carrying out average calculation on the plurality of target voice characteristic values to determine an average voice characteristic value;
determining a voice characteristic proportion value corresponding to each segmentation unit based on the average voice characteristic value and the target voice characteristic value corresponding to each segmentation unit;
and inputting the voice characteristic proportion value corresponding to each segmentation unit into the voice synthesis model, and adjusting basic voice characteristic information of the voice synthesis model by using the voice characteristic proportion value corresponding to each segmentation unit to obtain target voice characteristic information.
According to the embodiment of the present invention, based on the average speech feature value and the target speech feature value corresponding to each segmentation unit, determining the speech feature proportion value corresponding to each segmentation unit may specifically be implemented as follows:
and respectively dividing the target voice characteristic value corresponding to each segmentation unit with the average voice characteristic value to generate a voice characteristic proportion value corresponding to each segmentation unit.
According to the embodiment of the present invention, determining the speech feature proportion value corresponding to each segmentation unit can be expressed by the following formulas (1) and (2).
V_avg = (V_denorm,1 + V_denorm,2 + ... + V_denorm,N) / N    (1)

V_scale = V_denorm / V_avg    (2)

where V_avg represents the average speech feature value, V_scale represents a speech feature proportion value, N represents the number of segmentation units included in the text, and V_denorm represents a denormalized target speech feature value.
According to the embodiment of the invention, for the speech, there can be a plurality of dimensions of features that can influence the speech style, such as the tone, the speech speed, the volume, and the like, and each speech feature value can represent a dimension that can influence the speech style.
According to an embodiment of the present invention, the speech feature proportion value may be calculated in turn for each speech feature value included in the speech feature information. For example, if the speech feature information includes a pitch feature value and a volume feature value, the pitch feature value of each segmentation unit in the text may first be denormalized to generate a target pitch feature value; then the target pitch feature values of all segmentation units in the text are summed and divided by the number of segmentation units to obtain the average pitch feature value; finally, the pitch proportion value of each segmentation unit is determined from the average pitch feature value and that unit's target pitch feature value. The volume proportion value of each segmentation unit can be determined by the same procedure.
According to an embodiment of the present invention, the pitch proportion values and the volume proportion values may be computed in parallel or sequentially.
According to an embodiment of the present invention, the denormalization is the inverse operation of the normalization, and the normalization in question is the one applied to the training samples during preprocessing when the speech feature prediction model was trained.
According to an embodiment of the present invention, once the normalization formula is determined, its inverse can be derived, so that when denormalization is needed, the derived inverse of the normalization formula is used. For example, the normalization may be expressed by the following formula (3):

V = (V_sample − V_Mean) / V_STD    (3)

Accordingly, the denormalization may be expressed by the following formula (4):

V_denorm = V · V_STD + V_Mean    (4)

where V_sample represents a raw sample value, V represents the normalized value, V_Mean represents the mean of the samples, and V_STD represents the standard deviation of the samples.
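A small sketch of formulas (1), (2), and (4) together: each predicted feature value is denormalized, the denormalized values of all N segmentation units are averaged, and each unit's value is divided by the average to obtain its proportion value. The mean and standard deviation constants are hypothetical training-time normalization statistics.

```python
V_MEAN, V_STD = 100.0, 15.0  # assumed normalization statistics from training

def denormalize(v: float) -> float:
    return v * V_STD + V_MEAN  # formula (4)

predicted = [0.5, -0.2, 1.1]                # one feature, e.g. pitch, per unit
denorm = [denormalize(v) for v in predicted]
v_avg = sum(denorm) / len(denorm)           # formula (1)
scales = [v / v_avg for v in denorm]        # formula (2)
print(scales)  # each unit's pitch relative to the sentence average
```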
According to an embodiment of the present invention, the voice feature information includes at least one voice feature value.
According to the embodiment of the present invention, the basic speech feature information of the speech synthesis model is adjusted by using the speech feature information corresponding to each of the plurality of segmentation units, and the obtaining of the target speech feature information may specifically be implemented as follows:
and inputting at least one voice characteristic value corresponding to each of the plurality of segmentation units into the voice synthesis model, and adjusting basic voice characteristic information of the voice synthesis model by using the plurality of voice characteristic values corresponding to each of the plurality of segmentation units to obtain target voice characteristic information.
According to an embodiment of the present invention, the basic speech feature information includes at least one basic speech feature value corresponding to the speech feature information.
According to the embodiment of the invention, after at least one voice characteristic value corresponding to each of the plurality of segmentation units is input into the voice synthesis model, the corresponding basic voice characteristic value can be adjusted by using the voice characteristic value.
For example, the voice feature information includes a pitch feature value and a volume feature value, and correspondingly, the basic voice feature information may include a basic pitch feature value and a basic volume feature value, so that, after the pitch feature value is input into the voice synthesis model, the basic pitch feature value may be adjusted by using the pitch feature value to generate a target pitch feature value; after the volume characteristic value is input into the speech synthesis model, the basic volume characteristic value can be adjusted by using the volume characteristic value to generate a target volume characteristic value.
In an embodiment of the present invention, after the speech feature value corresponding to each segmentation unit of the text is extracted by using the speech feature prediction model, the speech feature value may be directly input into the speech synthesis model. In another embodiment of the present invention, the speech feature value may be first preprocessed to convert the speech feature value into a corresponding speech feature ratio value, and then the speech feature ratio value is input into the speech synthesis model.
According to the embodiment of the invention, the voice characteristic value is converted into the voice characteristic proportion value and then input into the voice synthesis model, so that the voice synthesis model can be adjusted in proportion on the basis of the basic voice characteristic value, and the generated target voice is more natural and harmonious.
According to the embodiment of the present invention, the basic speech feature information of the speech synthesis model is adjusted by using the speech feature ratio value corresponding to each segmentation unit, and the target speech feature information is obtained by:
multiplying the voice characteristic proportion value corresponding to each segmentation unit with basic voice characteristic information respectively to generate first voice characteristic information corresponding to each segmentation unit respectively;
and adding each piece of first voice characteristic information and the basic voice characteristic information respectively to determine target voice characteristic information.
According to the embodiment of the present invention, the step of multiplying the speech feature ratio value corresponding to each segmentation unit by the basic speech feature information to generate the first speech feature information corresponding to each segmentation unit may specifically be implemented as follows:
multiplying the voice characteristic proportion value with the corresponding basic voice characteristic value to generate a first voice characteristic value corresponding to each segmentation unit;
and taking the plurality of first voice characteristic values as first voice characteristic information.
According to an embodiment of the present invention, for the dimension of the speech rate, for example, the basic speech rate feature value is 100, and the speech rate feature ratio value is 0.3, so that the first speech rate feature value obtained by multiplying the basic speech rate feature value and the speech rate feature ratio value is 30, and the first speech rate feature value may represent the amplitude that needs to be adjusted based on the basic speech rate feature value. After the first speech rate feature value is obtained, the first speech rate feature value may be added to the basic speech rate feature value to determine the target speech rate feature value as 130.
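In code, the two steps above reduce to the arithmetic below, shown with the worked speech rate numbers from the text (basic value 100, proportion value 0.3):

```python
base_rate = 100.0   # basic speech rate feature value
rate_scale = 0.3    # speech rate proportion value for one segmentation unit

first = base_rate * rate_scale  # first speech rate feature value: 30.0
target = base_rate + first      # target speech rate feature value: 130.0
print(target)
```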
Fig. 3 schematically shows a schematic diagram of a speech synthesis method provided by an embodiment of the invention.
As shown in fig. 3, after the text to be subjected to speech synthesis is obtained, the text may be segmented according to a preset segmentation granularity, so as to generate a plurality of segmentation units. For example, the text may be "very worth buying", and by segmenting the text, three segmentation units of "very", "worth", "buying" may be generated.
After granularity segmentation, the three segmentation units may be respectively input into the speech feature prediction model, and the speech feature information corresponding to each of the three segmentation units is output. The speech feature information may include three speech feature values, namely a pitch feature value, a speech rate feature value, and a volume feature value, and may be represented as T = [T_1(t_1, t_2, t_3), T_2(t_1, t_2, t_3), T_3(t_1, t_2, t_3)], where T represents the text, T_1, T_2, and T_3 each represent a segmentation unit, and t_1, t_2, and t_3 each represent one speech feature value of that segmentation unit.
Further, prosodic amplitude adjustment may be performed: all three speech feature values of each of the three segmentation units are converted into speech feature proportion values, generating T = [T_1(s_1, s_2, s_3), T_2(s_1, s_2, s_3), T_3(s_1, s_2, s_3)], where s_1, s_2, and s_3 each represent a speech feature proportion value.
Furthermore, the speech feature proportion values corresponding to the three segmentation units may be input into the speech synthesis model, so that the corresponding basic speech feature values are adjusted using the proportion values to generate the target speech feature values.
Finally, the three segmentation units can be respectively input into the speech synthesis model, and the target speech is output.
Fig. 4 schematically shows a flowchart of a model generation method provided in an embodiment of the present invention, where the model generation method may include the following steps:
a speech sample set is obtained 401, the speech sample set comprising a plurality of speech samples belonging to the same target speech style.
And 402, respectively carrying out character recognition on the plurality of voice samples, and determining character samples respectively corresponding to the plurality of voice samples.
And 403, respectively extracting the characteristics of the plurality of voice samples, and determining voice sample characteristic information respectively corresponding to the plurality of voice samples.
And 404, training a voice feature prediction model by using text samples and voice sample feature information respectively corresponding to the plurality of voice samples, wherein the voice feature prediction model is used for predicting voice feature information of a text to be subjected to voice synthesis, the voice feature information is used for adjusting basic voice feature information of the voice synthesis model to obtain target voice feature information, and the target voice feature information is used for performing voice synthesis on an input text to obtain target voice matched with a target voice style.
According to an embodiment of the present invention, the speech feature prediction model may include a neural network model trained by using a plurality of speech samples, and the plurality of speech samples may belong to a target speech style.
According to an embodiment of the present invention, before training the speech feature prediction model, voice styles can be classified according to application scenes; for example, they may be divided into a livestream e-commerce voice style, a sports event commentary voice style, a news broadcast voice style, a human-computer interaction voice style, and the like.
According to embodiments of the present invention, the target speech style may correspond to the text type. For example, when the text is the pitch copy for a product sold via livestream e-commerce, the target voice style can be the livestream e-commerce voice style; when the text is a news broadcast script, the target voice style can be the news broadcast voice style.
According to an embodiment of the present invention, a plurality of voice samples belonging to the target voice style can be obtained by capturing recordings in the corresponding voice style. For example, voice samples of the news broadcast voice style may be obtained by capturing the audio or video of news programs. When video is captured, the audio in the video can be extracted and used as voice samples.
According to an embodiment of the present invention, because the training samples of the speech feature prediction model are obtained by capturing recordings of the target voice style from the Internet, no labor cost is needed for recording a speech library or labeling sample data, which reduces the training cost of the speech feature prediction model. In addition, the speech feature information predicted by the speech feature prediction model can be supplied to multiple speech synthesis models, so that the pronunciation style of a TTS speaker can be changed without changing the timbre of the TTS speaker.
According to the embodiment of the invention, when the voice style of a TTS speaker needs to be changed, a desired voice style can be determined firstly, and then audio corresponding to the desired voice style is captured from the Internet, so that a plurality of voice samples are obtained. After the voice sample is obtained, the voice sample can be used for training to generate a voice characteristic prediction model. The voice characteristic prediction model is obtained by training voice samples belonging to the same style, so that the voice characteristic prediction model has stronger characteristic extraction capability aiming at the voice style, and voice characteristic information matched with the style can be obtained by prediction from a text.
According to an embodiment of the present invention, before the text is synthesized by the speech synthesis model, the pronunciation style of the synthesized target speech is changed by predicting speech feature information for the text. No recording studio or professional voice talent is needed to record a speech library, and the speech synthesis model does not need to be retrained, which shortens the launch period of a newly added voice style.
According to the embodiment of the present invention, after performing character recognition on a plurality of voice samples in a voice sample set, and determining text samples corresponding to the plurality of voice samples, the model generation method further includes:
and for any text sample, segmenting the text sample according to a preset segmentation granularity to obtain a plurality of segmentation units.
According to an embodiment of the present invention, the preset segmentation granularity may include one or more of word granularity, phoneme granularity, and sentence granularity.
According to an embodiment of the present invention, the finer the segmentation granularity (for example, segmenting the text at phoneme granularity), the larger the adjustable space of the target speech's prosody, but the harder the speech feature prediction becomes. Therefore, the preset segmentation granularity can be chosen by a person skilled in the art according to the actual application requirements; the embodiment of the present invention does not specifically limit it.
Respectively extracting the characteristics of a plurality of voice samples in the voice sample set, and determining the voice sample characteristic information respectively corresponding to the plurality of voice samples specifically can be realized as follows:
and respectively extracting the characteristics of a plurality of segmentation units contained in the text sample aiming at any text sample, and determining the voice sample characteristic information respectively corresponding to the plurality of segmentation units.
The training of the speech feature prediction model by using the text samples and the speech feature information respectively corresponding to the plurality of speech samples can be specifically realized as follows:
and taking a plurality of segmentation units of the text sample corresponding to any voice sample as a model input, taking voice sample characteristic information corresponding to the voice sample as a model label, and training a voice characteristic prediction model.
According to the embodiment of the present invention, for any text sample, feature extraction is respectively performed on a plurality of segmentation units included in the text sample, and it is determined that the voice sample feature information respectively corresponding to the plurality of segmentation units specifically can be implemented as follows:
aiming at any segmentation unit in any text sample, respectively carrying out time axis alignment on the segmentation unit in a voice sample corresponding to the text sample so as to determine a time label of the segmentation unit;
acquiring the waveform diagram of the voice sample;
determining the waveforms respectively corresponding to the segmentation units from the waveform diagram of the voice sample based on the time tags of the segmentation units;
and determining the voice sample characteristic information corresponding to the segmentation unit according to the waveform.
According to an embodiment of the present invention, a segmentation unit is time-axially aligned on a speech sample, i.e. a start time and an end time of the pronunciation of the segmentation unit in the speech sample are determined, such that the start time and the end time together serve as a time tag corresponding to the segmentation unit.
Fig. 5 schematically shows a waveform diagram in an embodiment of the invention.
Fig. 5 may be the waveform diagram of a voice sample corresponding to the text sample "非常值得买" ("really worth buying"). In the waveform diagram, the abscissa represents time and the ordinate represents amplitude.
According to the preset segmentation granularity, this text sample can be divided into three segmentation units: "非常" ("very"), "值得" ("worth") and "买" ("buy"). Performing time-axis alignment on these segmentation units determines the start time of the "非常" unit as point a and its end time as point b, the start time of the "值得" unit as point c and its end time as point d, and the start time of the "买" unit as point e and its end time as point f.
Based on this, it can be determined that, in the waveform diagram, the waveform between points a and b corresponds to the "非常" unit, the waveform between points c and d corresponds to the "值得" unit, and the waveform between points e and f corresponds to the "买" unit.
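Purely as an illustrative sketch, cutting the per-unit waveforms out of the voice sample from such time labels could look as follows; the sampling rate, the time values, and all variable names are assumptions, and the labels themselves would come from a forced aligner:

import numpy as np

def slice_units(waveform: np.ndarray, sr: int,
                time_labels: dict[str, tuple[float, float]]) -> dict[str, np.ndarray]:
    """Cut out each segmentation unit's waveform using its (start, end) time label."""
    return {unit: waveform[int(start * sr):int(end * sr)]
            for unit, (start, end) in time_labels.items()}

# Hypothetical alignment for the sample of Fig. 5 at a 16 kHz sampling rate:
labels = {"非常": (0.10, 0.45), "值得": (0.52, 0.90), "买": (0.98, 1.30)}
waveform = np.zeros(int(1.5 * 16000))  # placeholder audio signal
segments = slice_units(waveform, 16000, labels)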
According to the embodiment of the present invention, determining the voice sample feature information corresponding to the segmentation unit according to the waveform specifically may be implemented as follows:
determining the speech rate characteristic value of the segmentation unit according to the starting time and the ending time of the waveform;
extracting fundamental frequency of the waveform to determine a tone characteristic value of a segmentation unit;
determining a volume characteristic value of the segmentation unit according to the amplitude of the waveform;
and generating voice sample characteristic information according to the speech speed characteristic value, the tone characteristic value and the volume characteristic value.
According to the embodiment of the present invention, the duration of the segmentation unit's pronunciation in the voice sample can be determined by subtracting the start time from the end time, and this duration can be used as the speech rate feature value.
According to the embodiment of the present invention, the fundamental frequency extraction of the waveform may refer to a time domain method or a frequency domain method in the prior art, which is not described herein again.
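A minimal numpy sketch of these three feature values follows, using a naive time-domain autocorrelation estimate of the fundamental frequency (one instance of the time-domain methods referred to above); the search range and all names are illustrative assumptions:

import numpy as np

def unit_features(segment: np.ndarray, sr: int,
                  start: float, end: float) -> dict[str, float]:
    """Speech rate, pitch and volume feature values for one segmentation unit."""
    # Speech rate feature value: pronunciation duration (end time minus start time).
    duration = end - start

    # Pitch feature value: naive F0 estimate via time-domain autocorrelation.
    ac = np.correlate(segment, segment, mode="full")[len(segment) - 1:]
    lo, hi = int(sr / 400), int(sr / 60)   # search lags for F0 in roughly 60-400 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))
    f0 = sr / lag

    # Volume feature value: root-mean-square amplitude of the waveform.
    rms = float(np.sqrt(np.mean(segment ** 2)))

    return {"speech_rate": duration, "pitch": f0, "volume": rms}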
According to the embodiment of the present invention, training a speech feature prediction model using text samples and speech sample feature information corresponding to a plurality of speech samples respectively may specifically be implemented as follows:
for any voice sample, inputting the text sample corresponding to the voice sample into the voice feature prediction model, and outputting a predicted voice feature value;
calculating the loss result of the predicted voice characteristic value and the voice characteristic value corresponding to the voice sample;
and adjusting the model parameters of the voice characteristic prediction model according to the loss result until the loss result meets the preset condition.
According to an embodiment of the invention, the calculation of the loss result may be implemented using a loss function, which may include, for example, a cross-entropy loss function, a square loss function, or the like.
According to an embodiment of the present invention, the preset condition may include, for example, the loss value being smaller than a preset threshold, or the number of training rounds being larger than a preset round threshold.
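Sketched below in PyTorch is what this training loop could look like; the model architecture, the Adam optimizer, and both stopping thresholds are illustrative assumptions, with the square loss standing in for the loss-function choice mentioned above:

import torch
import torch.nn as nn

def train(model: nn.Module, dataset, max_rounds: int = 100,
          loss_threshold: float = 1e-3) -> None:
    """Fit the voice feature prediction model: text units in, feature values out."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()  # the square-loss option mentioned above

    for _ in range(max_rounds):  # preset round threshold
        total = 0.0
        for unit_ids, feature_labels in dataset:  # (text sample, feature label) pairs
            predicted = model(unit_ids)           # predicted voice feature values
            loss = criterion(predicted, feature_labels)
            optimizer.zero_grad()
            loss.backward()        # adjust model parameters according to the loss
            optimizer.step()
            total += loss.item()
        if total / len(dataset) < loss_threshold:  # preset loss threshold reached
            break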
Fig. 6 schematically shows a flowchart of a voice broadcasting method according to an embodiment of the present invention, where the voice broadcasting method may include the following steps:
601, acquiring a text to be broadcasted;
602, performing voice feature prediction on a text to be broadcasted to obtain voice feature information of a target voice style;
603, inputting the text to be broadcasted and the voice feature information into a voice synthesis model, adjusting basic voice feature information of the voice synthesis model based on the voice feature information to obtain target voice feature information, and performing voice synthesis on the text to be broadcasted by using the target voice feature information to obtain broadcast voice of a target voice style;
and 604, playing the broadcast voice.
According to the embodiment of the invention, the text to be broadcasted can be obtained in different ways in different application scenes. For example, in a live-streaming sales scene, a user can write a product introduction script; the script uploaded by the user is then received and used as the text to be broadcasted. In a voice interaction scene, the terminal device may receive a question or an instruction issued by the user through voice and, in response, generate a response script, which can then be used as the text to be broadcasted.
Fig. 7 schematically shows a system diagram to which the voice broadcasting method provided by the embodiment of the present invention can be applied.
As shown in fig. 7, the system may include a terminal device 701 and a server 702.
The terminal device 701 may receive the text to be broadcasted 703 input by the user. After obtaining the text 703 to be broadcasted, the terminal device 701 may perform certain preprocessing on the text 703 to be broadcasted, where the preprocessing may include, for example, identifying the text 703 to be broadcasted, so as to determine a voice style corresponding to the text to be broadcasted. The identification of the text to be broadcasted 703 can be realized by performing keyword identification on the text to be broadcasted 703, or by performing semantic identification on the text to be broadcasted 703.
After the terminal device 701 identifies the text to be broadcasted 703, the text to be broadcasted 703 and the identification result may be sent to the server 702.
After receiving the text to be broadcasted 703 and the recognition result, the server 702 may determine, based on the recognition result, a target voice feature prediction model 705 from a plurality of pre-trained voice feature prediction models 704, and input the text to be broadcasted 703 into the target voice feature prediction model 705 for voice feature prediction, where each of the plurality of voice feature prediction models 704 corresponds to one voice style. For example, the plurality of voice feature prediction models 704 may include a news broadcast style model, a sports event commentary style model, and a live-streaming sales style model; if the recognition result indicates that the text to be broadcasted 703 is a news broadcast script, the news broadcast style voice feature prediction model may be determined as the target voice feature prediction model 705.
After the target speech feature prediction model outputs the speech feature information, the server 702 may input the speech feature information into the speech synthesis model 706, so as to adjust the basic speech feature information by using the speech feature information to generate the target speech feature information, and further, may input the text to be broadcasted into the speech synthesis model 706 to output the target speech 707.
After generating the target voice 707, the server may send the target voice 707 to the terminal device 701, so that the terminal device 701 broadcasts the target voice 707.
It should be noted that the speech feature prediction model and the speech synthesis model may also be deployed in the terminal device, so that after receiving the text to be broadcasted, the terminal device may locally predict speech feature information by using the speech feature prediction model, and synthesize the target speech by using the speech synthesis model.
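As a sketch of this on-device arrangement only, the two locally deployed models could be chained as follows; both model objects and their methods are assumed placeholders, not a real API:

def broadcast(text: str, style_model, synthesis_model):
    """Predict per-unit feature information locally, then synthesize the voice."""
    units = segment_text(text)                 # segmentation step sketched earlier
    feature_info = style_model.predict(units)  # voice feature information per unit
    # The synthesis model adjusts its basic voice feature information with the
    # predicted values and synthesizes the broadcast voice from the text.
    return synthesis_model.synthesize(units, feature_info)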
According to an embodiment of the present invention, specific implementation details of the voice broadcasting method shown in fig. 7 are the same as or similar to those of the voice synthesizing method shown in fig. 2, and specifically refer to the related description of the voice synthesizing method shown in fig. 2, which is not described herein again.
Fig. 8 schematically shows a block diagram of a speech synthesis apparatus according to an embodiment of the present invention, and as shown in fig. 8, the speech synthesis apparatus 800 includes a first text obtaining module 801, a first feature prediction module 802, a first feature information adjusting module 803, and a first speech synthesis module 804.
A first text acquisition module 801, configured to acquire a text to be subjected to speech synthesis;
a first feature prediction module 802, configured to perform speech feature prediction on the text to obtain speech feature information of a target speech style;
a first feature information adjusting module 803, configured to adjust basic speech feature information of a speech synthesis model by using the speech feature information, to obtain target speech feature information;
a first speech synthesis module 804, configured to input the text into the speech synthesis model, so as to perform speech synthesis on the text by using the target speech feature information and obtain the target speech in the target speech style.
The speech synthesis apparatus shown in fig. 8 may perform the speech synthesis method described in the embodiment shown in fig. 2; its implementation principle and technical effects are not repeated here. The specific manner in which each module and unit of the speech synthesis apparatus performs operations has been described in detail in the method embodiments above and will not be described again here.
According to an embodiment of the present invention, the first feature prediction module 802 is specifically configured to:
performing voice feature prediction on the text by using a voice feature prediction model to obtain voice feature information of a target voice style; wherein the speech feature prediction model is obtained based on a plurality of speech samples belonging to the target speech style.
According to an embodiment of the present invention, the first feature prediction module 802 is specifically configured to:
segmenting the text according to a preset segmentation granularity to obtain a plurality of segmentation units;
inputting the plurality of segmentation units into the voice feature prediction model, and outputting voice feature information respectively corresponding to the plurality of segmentation units;
the adjusting the basic voice feature information of the voice synthesis model by using the voice feature information to obtain the target voice feature information comprises:
and adjusting the basic voice characteristic information of the voice synthesis model by utilizing the voice characteristic information respectively corresponding to the plurality of segmentation units to obtain the target voice characteristic information.
According to an embodiment of the present invention, the voice feature information includes at least one voice feature value.
According to an embodiment of the present invention, the first characteristic information adjusting module 803 is specifically configured to:
performing de-normalization on the at least one voice characteristic value corresponding to each segmentation unit, and determining a plurality of target voice characteristic values;
carrying out average calculation on the target voice characteristic values to determine an average voice characteristic value;
determining a voice characteristic proportion value corresponding to each segmentation unit based on the average voice characteristic value and a target voice characteristic value corresponding to each segmentation unit;
and inputting the voice characteristic proportion value corresponding to each segmentation unit into the voice synthesis model, and adjusting basic voice characteristic information of the voice synthesis model by using the voice characteristic proportion value corresponding to each segmentation unit to obtain target voice characteristic information.
According to an embodiment of the present invention, the voice feature information includes at least one voice feature value.
According to an embodiment of the present invention, the first characteristic information adjusting module 803 is specifically configured to:
and inputting at least one voice characteristic value corresponding to each of the plurality of segmentation units into the voice synthesis model, and adjusting basic voice characteristic information of the voice synthesis model by using the plurality of voice characteristic values corresponding to each of the plurality of segmentation units to obtain target voice characteristic information.
According to an embodiment of the present invention, the first characteristic information adjusting module 803 is specifically configured to:
multiplying the voice characteristic proportion value corresponding to each segmentation unit with the basic voice characteristic information respectively to generate first voice characteristic information corresponding to each segmentation unit respectively;
and adding each piece of first voice characteristic information and the basic voice characteristic information respectively to determine the target voice characteristic information.
According to an embodiment of the present invention, the first characteristic information adjusting module 803 is specifically configured to:
and respectively dividing the target voice characteristic value corresponding to each segmentation unit with the average voice characteristic value to generate the voice characteristic proportion value corresponding to each segmentation unit.
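Numerically, the adjustment performed by this module could be sketched as follows; the array shapes, the example values, and all names are assumptions for illustration:

import numpy as np

def adjust_base_features(unit_values: np.ndarray, base: np.ndarray) -> np.ndarray:
    """De-normalized per-unit feature values -> target voice feature information.

    unit_values: target voice feature values, one row per segmentation unit.
    base:        the synthesis model's basic voice feature information.
    """
    average = unit_values.mean(axis=0)  # average voice feature value
    ratios = unit_values / average      # voice feature proportion value per unit
    first = ratios * base               # first voice feature information
    return first + base                 # target voice feature information

base = np.array([1.0, 200.0, 0.5])      # e.g. speech rate, pitch (Hz), volume
units = np.array([[1.2, 210.0, 0.6],
                  [0.9, 190.0, 0.4]])
target = adjust_base_features(units, base)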
Fig. 9 schematically shows a block diagram of a model generation apparatus provided in an embodiment of the present invention, and as shown in fig. 9, the model generation apparatus 900 includes a sample acquisition module 901, a recognition module 902, a feature extraction module 903, and a training module 904.
A sample obtaining module 901, configured to obtain a voice sample set, where the voice sample set includes a plurality of voice samples belonging to the same target voice style;
a recognition module 902, configured to perform text recognition on the multiple voice samples, and determine text samples corresponding to the multiple voice samples respectively;
a feature extraction module 903, configured to perform feature extraction on the multiple voice samples respectively, and determine voice sample feature information corresponding to the multiple voice samples respectively;
a training module 904, configured to train a speech feature prediction model using the text samples and the speech sample feature information corresponding to the multiple speech samples, where the speech feature prediction model is used to predict speech feature information of a text to be subjected to speech synthesis, the speech feature information is used to adjust basic speech feature information of the speech synthesis model to obtain target speech feature information, and the target speech feature information is used to perform speech synthesis on an input text to obtain target speech matching with the target speech style.
According to an embodiment of the present invention, the identifying module 902 is specifically configured to:
for any text sample, segmenting the text sample according to a preset segmentation granularity to obtain a plurality of segmentation units;
according to an embodiment of the present invention, the feature extraction module 903 is specifically configured to:
respectively extracting the characteristics of a plurality of segmentation units contained in any text sample, and determining the voice sample characteristic information corresponding to the segmentation units;
according to an embodiment of the present invention, the training module 904 is specifically configured to:
and taking a plurality of segmentation units of the text sample corresponding to any voice sample as model input, taking voice sample characteristic information corresponding to the voice sample as a model label, and training the voice characteristic prediction model.
According to an embodiment of the present invention, the feature extraction module 903 is specifically configured to:
aiming at any segmentation unit in any text sample, respectively carrying out time axis alignment on the segmentation unit in a voice sample corresponding to the text sample so as to determine a time label of the segmentation unit;
acquiring a waveform diagram of the voice sample;
determining waveforms respectively corresponding to the segmentation units from the oscillogram of the voice sample based on the time labels of the segmentation units;
and determining the voice sample characteristic information corresponding to the segmentation unit according to the waveform.
According to an embodiment of the present invention, the feature extraction module 903 is specifically configured to:
determining the speech rate characteristic value of the segmentation unit according to the starting time and the ending time of the waveform;
performing fundamental frequency extraction on the waveform to determine a pitch characteristic value of the segmentation unit;
determining a volume characteristic value of the segmentation unit according to the amplitude of the waveform;
and generating the voice sample characteristic information according to the speech speed characteristic value, the tone characteristic value and the volume characteristic value.
According to an embodiment of the present invention, the training module 904 is specifically configured to:
for any voice sample, inputting the text sample corresponding to the voice sample into the voice feature prediction model and outputting a predicted voice feature value;
calculating the loss result of the predicted voice characteristic value and the voice characteristic value corresponding to the voice sample;
and adjusting the model parameters of the voice characteristic prediction model according to the loss result until the loss result meets the preset condition.
The model generating apparatus shown in fig. 9 may execute the model generating method shown in the embodiment shown in fig. 4, and the implementation principle and the technical effect are not repeated. The specific manner in which each module and unit of the model generation apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
Fig. 10 schematically illustrates a block diagram of a voice broadcast apparatus according to an embodiment of the present invention, and as shown in fig. 10, the voice broadcast apparatus 1000 includes a second text obtaining module 1001, a second feature predicting module 1002, a second feature information adjusting module 1003, and a voice playing module 1004.
A second text acquisition module 1001, configured to acquire a text to be broadcasted;
the second feature prediction module 1002 is configured to perform voice feature prediction on the text to be broadcasted to obtain voice feature information of a target voice style;
a second feature information adjusting module 1003, configured to input the text to be broadcasted and the voice feature information into a voice synthesis model, adjust basic voice feature information of the voice synthesis model based on the voice feature information to obtain target voice feature information, and perform voice synthesis on the text to be broadcasted by using the target voice feature information to obtain broadcast voice of the target voice style;
and a voice playing module 1004, configured to play the broadcast voice.
The voice broadcast device shown in fig. 10 may execute the voice broadcast method shown in the embodiment shown in fig. 7, and the implementation principle and the technical effect are not described again. The specific manner in which each module and unit of the voice broadcasting device in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
In one possible design, the speech synthesis apparatus, the model generation apparatus, and the voice broadcast apparatus provided by the embodiments of the present invention may each be implemented as a computing device. As shown in fig. 11, the computing device may include a storage component 1101 and a processing component 1102.
The storage component 1101 stores one or more computer instructions, which are invoked and executed by the processing component 1102 to implement the voice synthesis method, the model generation method, and the voice broadcast method provided by the embodiments of the present invention.
Of course, the computing device may also include other components, such as input/output interfaces and communication components. The input/output interface provides an interface between the processing component and peripheral interface modules, which may be output devices, input devices, and the like. The communication component is configured to facilitate wired or wireless communication between the computing device and other devices.
The computing device may be a physical device or an elastic compute host provided by a cloud computing platform. When the computing device is a cloud server, the processing component, the storage component, and the like may be basic server resources rented or purchased from the cloud computing platform.
When the computing device is a physical device, the computing device may be implemented as a distributed cluster formed by a plurality of servers or terminal devices, or may be implemented as a single server or a single terminal device.
In practical applications, the computing device may specifically be deployed as a node in a message queue system and implemented as a producer, a consumer, a transit server, a naming server, or the like in the message queue system.
The embodiment of the invention also provides a computer readable storage medium, which stores a computer program, and the computer program can realize the voice synthesis method, the model generation method and the voice broadcast method provided by the embodiment of the invention when being executed by a computer.
The embodiment of the invention also provides a computer program product, which comprises a computer program, and the computer program can realize the voice synthesis method, the model generation method and the voice broadcast method provided by the embodiment of the invention when being executed by a computer.
The processing components in the above embodiments may include one or more processors executing computer instructions to perform all or part of the steps of the methods described above. Of course, the processing components may also be implemented as one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components configured to perform the above-described methods.
The storage component is configured to store various types of data to support operations in the device. The storage component may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the parts of the above technical solutions that in essence contribute to the prior art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (15)

1. A method of speech synthesis, comprising:
acquiring a text to be subjected to voice synthesis;
performing voice feature prediction on the text to obtain voice feature information of a target voice style;
adjusting basic voice feature information of a voice synthesis model by using the voice feature information to obtain target voice feature information;
and inputting the text into the voice synthesis model to perform voice synthesis on the text by using the target voice characteristic information to obtain the target voice of the target voice style.
2. The method of claim 1, wherein the performing speech feature prediction on the text to obtain speech feature information of a target speech style comprises:
performing voice feature prediction on the text by using a voice feature prediction model to obtain voice feature information of a target voice style; wherein the speech feature prediction model is obtained based on a plurality of speech samples belonging to the target speech style.
3. The method of claim 2, wherein the performing speech feature prediction on the text by using a speech feature prediction model to obtain speech feature information of a target speech style comprises:
segmenting the text according to a preset segmentation granularity to obtain a plurality of segmentation units;
inputting the plurality of segmentation units into the voice feature prediction model, and outputting voice feature information respectively corresponding to the plurality of segmentation units;
the adjusting the basic voice feature information of the voice synthesis model by using the voice feature information to obtain the target voice feature information comprises:
and adjusting the basic voice characteristic information of the voice synthesis model by utilizing the voice characteristic information respectively corresponding to the plurality of segmentation units to obtain the target voice characteristic information.
4. The method according to claim 3, wherein the speech feature information includes at least one speech feature value, and the adjusting the basic speech feature information of the speech synthesis model using the speech feature information corresponding to each of the plurality of segmentation units to obtain the target speech feature information includes:
performing de-normalization on the at least one voice characteristic value corresponding to each segmentation unit, and determining a plurality of target voice characteristic values;
carrying out average calculation on the target voice characteristic values to determine an average voice characteristic value;
determining a voice characteristic proportion value corresponding to each segmentation unit based on the average voice characteristic value and the target voice characteristic value corresponding to each segmentation unit;
and inputting the voice characteristic proportion value corresponding to each segmentation unit into the voice synthesis model, and adjusting basic voice characteristic information of the voice synthesis model by using the voice characteristic proportion value corresponding to each segmentation unit to obtain target voice characteristic information.
5. The method according to claim 3, wherein the speech feature information includes at least one speech feature value, and the adjusting the basic speech feature information of the speech synthesis model using the speech feature information corresponding to each of the plurality of segmentation units to obtain the target speech feature information includes:
and inputting at least one voice characteristic value corresponding to each of the plurality of segmentation units into the voice synthesis model, and adjusting basic voice characteristic information of the voice synthesis model by using the plurality of voice characteristic values corresponding to each of the plurality of segmentation units to obtain target voice characteristic information.
6. The method according to claim 4, wherein the adjusting the basic speech feature information of the speech synthesis model by using the speech feature ratio value corresponding to each segmentation unit to obtain the target speech feature information comprises:
multiplying the voice characteristic proportion value corresponding to each segmentation unit with the basic voice characteristic information respectively to generate first voice characteristic information corresponding to each segmentation unit respectively;
and adding each piece of first voice characteristic information and the basic voice characteristic information respectively to determine the target voice characteristic information.
7. The method of claim 4, wherein determining a speech feature proportion value corresponding to each segmentation unit based on the average speech feature value and the target speech feature value corresponding to each segmentation unit comprises:
and respectively dividing the target voice characteristic value corresponding to each segmentation unit with the average voice characteristic value to generate the voice characteristic proportion value corresponding to each segmentation unit.
8. A method of model generation, comprising:
acquiring a voice sample set, wherein the voice sample set comprises a plurality of voice samples belonging to the same target voice style;
respectively carrying out text recognition on the plurality of voice samples, and determining text samples respectively corresponding to the plurality of voice samples;
respectively extracting the characteristics of the voice samples, and determining the voice sample characteristic information respectively corresponding to the voice samples;
and training a voice characteristic prediction model by utilizing text samples and voice sample characteristic information which respectively correspond to the plurality of voice samples, wherein the voice characteristic prediction model is used for predicting voice characteristic information of a text to be subjected to voice synthesis, the voice characteristic information is used for adjusting basic voice characteristic information of the voice synthesis model to obtain target voice characteristic information, and the target voice characteristic information is used for performing voice synthesis on an input text to obtain target voice matched with the target voice style.
9. The method of claim 8, wherein after performing text recognition on each of the plurality of speech samples in the speech sample set and determining a text sample corresponding to each of the plurality of speech samples, the method further comprises:
for any text sample, segmenting the text sample according to a preset segmentation granularity to obtain a plurality of segmentation units;
the respectively extracting the features of the plurality of voice samples in the voice sample set, and determining the voice sample feature information respectively corresponding to the plurality of voice samples includes:
respectively extracting the characteristics of a plurality of segmentation units contained in any text sample, and determining the voice sample characteristic information corresponding to the segmentation units;
the training of the speech feature prediction model using the text samples and the speech feature information corresponding to the plurality of speech samples respectively includes:
and taking a plurality of segmentation units of the text sample corresponding to any voice sample as model input, taking voice sample characteristic information corresponding to the voice sample as a model label, and training the voice characteristic prediction model.
10. The method according to claim 9, wherein for any text sample, the performing feature extraction on a plurality of segmentation units included in the text sample respectively, and determining the feature information of the speech sample corresponding to each of the plurality of segmentation units comprises:
aiming at any segmentation unit in any text sample, respectively carrying out time axis alignment on the segmentation unit in a voice sample corresponding to the text sample so as to determine a time label of the segmentation unit;
acquiring a waveform diagram of the voice sample;
determining waveforms respectively corresponding to the segmentation units from the oscillogram of the voice sample based on the time labels of the segmentation units;
and determining the voice sample characteristic information corresponding to the segmentation unit according to the waveform.
11. The method of claim 10, wherein the determining the feature information of the speech sample corresponding to the segmentation unit according to the waveform comprises:
determining the speech rate characteristic value of the segmentation unit according to the starting time and the ending time of the waveform;
performing fundamental frequency extraction on the waveform to determine a pitch characteristic value of the segmentation unit;
determining a volume characteristic value of the segmentation unit according to the amplitude of the waveform;
and generating the voice sample characteristic information according to the speech speed characteristic value, the tone characteristic value and the volume characteristic value.
12. The method of claim 8, wherein training a speech feature prediction model using the text samples and the speech sample feature information corresponding to the plurality of speech samples comprises:
inputting a text sample corresponding to any voice sample into the voice feature prediction model and outputting a predicted voice feature value;
calculating a loss result of the predicted voice characteristic value and the voice characteristic value corresponding to the voice sample;
and adjusting the model parameters of the voice characteristic prediction model according to the loss result until the loss result meets the preset condition.
13. A voice broadcast method is characterized by comprising the following steps:
acquiring a text to be broadcasted;
carrying out voice feature prediction on the text to be broadcasted to obtain voice feature information of a target voice style;
inputting the text to be broadcasted and the voice characteristic information into a voice synthesis model, adjusting basic voice characteristic information of the voice synthesis model based on the voice characteristic information to obtain target voice characteristic information, and performing voice synthesis on the text to be broadcasted by using the target voice characteristic information to obtain broadcast voice of the target voice style;
and playing the broadcast voice.
14. A computing device comprising a processing component and a storage component;
the storage component stores one or more computer instructions; the one or more computer instructions are for execution by the processing component to invoke, implement a speech synthesis method according to any one of claims 1 to 7, or implement a model generation method according to any one of claims 8 to 12, or implement a voice announcement method according to claim 13.
15. A computer storage medium characterized by storing a computer program that, when executed by a computer, implements the speech synthesis method according to any one of claims 1 to 7, or implements the model generation method according to any one of claims 8 to 12, or implements the voice broadcast method according to claim 13.