CN117219052A - Prosody prediction method, apparatus, device, storage medium, and program product - Google Patents
- Publication number: CN117219052A (application number CN202310121183.1A)
- Authority
- CN
- China
- Prior art keywords
- prosody
- text
- prediction
- target
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The application provides a prosody prediction method, apparatus, device, storage medium, and program product, relating to artificial intelligence technology. The method comprises: performing feature extraction on a target text to obtain text features of the target text; sampling an initial prosodic feature for prosody prediction from a first target distribution, and sampling noise for prosody prediction from a second target distribution; and performing prosody prediction on the target text based on the text features, the initial prosodic feature, and the noise to obtain predicted prosodic features of the target text. The predicted prosodic features are combined with the text features for speech synthesis, yielding synthesized speech of the target text that carries the predicted prosodic features. The method and apparatus enrich the diversity of the predicted prosodic features of a text, so that the speech synthesized for the text based on those features is diversified.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a prosody prediction method, apparatus, device, storage medium, and program product.
Background
Artificial intelligence (AI) is a comprehensive branch of computer science that studies the design principles and implementation methods of intelligent machines so that machines can perceive, reason, and make decisions. It is a broad discipline spanning many fields, such as natural language processing and machine learning/deep learning; as the technology develops, it will be applied in ever more fields and take on increasing value.
Prosody prediction is an important application of artificial intelligence. In the related art, prosody prediction is performed on a text by a deterministic prosody prediction model to obtain the predicted prosodic features of the text. However, because a deterministic model predicts only a single set of prosodic features for a given text, the predicted prosody of the same text lacks variation, and the synthesized speech produced from those features is overly uniform.
Disclosure of Invention
The embodiments of the present application provide a prosody prediction method, apparatus, electronic device, computer-readable storage medium, and computer program product, which can enrich the diversity of the predicted prosodic features of a text, so that the synthesized speech obtained by performing speech synthesis on the text based on those predicted prosodic features is diversified.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a prosody prediction method, which comprises the following steps:
performing feature extraction on a target text to obtain text features of the target text;
sampling an initial prosodic feature for prosody prediction from a first target distribution, and sampling noise for prosody prediction from a second target distribution;
performing prosody prediction on the target text based on the text features, the initial prosodic feature, and the noise to obtain predicted prosodic features of the target text;
wherein the predicted prosodic features are combined with the text features for speech synthesis to obtain synthesized speech of the target text that has the predicted prosodic features.
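As an illustrative sketch (not the patented implementation), the following Python/NumPy code mirrors the claimed flow: sample an initial prosodic feature from one Gaussian and noise from another, then combine them with text features through a stand-in prediction function. All function names, shapes, and coefficients here are assumptions invented for illustration.

```python
import numpy as np

def extract_text_features(text, dim=8):
    # Stand-in for the claimed feature extraction: one vector per token.
    # A real system would use phoneme encodings; hashing is a placeholder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal((len(text.split()), dim))

def predict_prosody(text_features, initial_prosody, noise):
    # Stand-in predictor: denoise the initial prosodic feature conditioned
    # on the (pooled) text features, then re-inject the sampled noise.
    condition = text_features.mean(axis=0)
    noise_to_remove = 0.5 * (initial_prosody - condition)
    return initial_prosody - noise_to_remove + 0.1 * noise

feats = extract_text_features("hello world")
# First target distribution -> initial prosodic feature; second -> noise.
initial = np.random.standard_normal(feats.shape[1])
noise = np.random.standard_normal(feats.shape[1])
prosody = predict_prosody(feats, initial, noise)
print(prosody.shape)  # (8,)
```

Because both samples are random, repeated calls on the same text yield different prosody vectors, which is exactly the source of diversity the claims rely on.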
The embodiment of the application also provides a prosody prediction device, which comprises:
a feature extraction module, configured to perform feature extraction on a target text to obtain text features of the target text;
a sampling module, configured to sample an initial prosodic feature for prosody prediction from a first target distribution and sample noise for prosody prediction from a second target distribution;
a prosody prediction module, configured to perform prosody prediction on the target text based on the text features, the initial prosodic feature, and the noise to obtain predicted prosodic features of the target text;
wherein the predicted prosodic features are combined with the text features for speech synthesis to obtain synthesized speech of the target text that has the predicted prosodic features.
In the above scheme, the feature extraction module is further configured to perform word segmentation on the target text to obtain the word segments it contains; for each word segment, obtain its phoneme information and encode that phoneme information into phoneme features; and combine the phoneme features of the word segments into the text features of the target text.
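A minimal sketch of this segmentation-and-encoding step, using a toy phoneme lexicon and a lookup-table "encoder"; the lexicon entries and the 4-dimensional embeddings are invented for illustration, not taken from the patent.

```python
import numpy as np

# Toy grapheme-to-phoneme lexicon; a real system would use a full G2P model.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
PHONES = sorted({p for ps in LEXICON.values() for p in ps})
# Lookup-table "encoder": one fixed random embedding per phoneme.
rng = np.random.default_rng(0)
EMBED = {p: rng.standard_normal(4) for p in PHONES}

def text_features(text):
    segments = text.lower().split()          # word segmentation (whitespace toy)
    feats = []
    for seg in segments:
        for ph in LEXICON.get(seg, []):      # phoneme information per segment
            feats.append(EMBED[ph])          # encode phoneme -> phoneme feature
    return np.stack(feats)                   # combine into text features

print(text_features("hello world").shape)  # (8, 4): 8 phonemes, 4-dim each
```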
In the above scheme, the sampling module is further configured, before sampling the initial prosodic feature from the first target distribution and the noise from the second target distribution, to generate first random data conforming to a first data-distribution type and construct the first target distribution from it, and to generate second random data conforming to a second data-distribution type and construct the second target distribution from it.
In the above scheme, the prosody prediction module is further configured to obtain prosody control information before performing prosody prediction on the target text; to combine the text features with the prosody control information to obtain combined features; and to perform prosody prediction on the target text based on the combined features, the initial prosodic features, and the noise to obtain the predicted prosodic features of the target text.
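One plausible way to realize the "combined features" is to concatenate a control embedding onto every text-feature frame; the control vocabulary below (emotion-style tags) and the embedding values are assumptions, not specified by the patent.

```python
import numpy as np

# Hypothetical prosody-control embeddings (e.g. emotion/style tags).
CONTROL = {"neutral": np.array([0.0, 0.0]), "happy": np.array([1.0, 0.5])}

def combine(text_feats, control_tag):
    # Broadcast the control embedding and concatenate it to each frame.
    ctrl = np.tile(CONTROL[control_tag], (text_feats.shape[0], 1))
    return np.concatenate([text_feats, ctrl], axis=1)

tf = np.zeros((3, 4))                  # 3 frames of 4-dim text features
print(combine(tf, "happy").shape)      # (3, 6)
```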
In the above aspect, the prosody prediction comprises M rounds of prosody prediction. The sampling module is further configured to sample, for each of the M rounds, noise for that round from the second target distribution. The prosody prediction module is further configured to: for round 1 of the M rounds, perform prosody prediction on the target text based on the text features, the initial prosodic features, and the noise for round 1, obtaining the intermediate predicted prosodic features of round 1; for the m-th round of the M rounds, perform prosody prediction on the target text based on the text features, the intermediate predicted prosodic features of round (m-1), and the noise for round m, obtaining the intermediate predicted prosodic features of round m; and iterate over m until the intermediate predicted prosodic features of round M are obtained, which serve as the predicted prosodic features of the target text. Here m and M are integers greater than 0, and M is greater than or equal to m.
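This M-round scheme has the shape of the reverse loop of a diffusion model. The sketch below runs such a loop with a linear stand-in for the learned noise predictor; the step coefficients and the predictor itself are assumptions for illustration only.

```python
import numpy as np

def noise_predictor(x, condition, step):
    # Stand-in for the learned model that predicts the noise to remove;
    # a real model would be a neural network conditioned on text features.
    return 0.3 * (x - condition)

def predict_prosody_m_rounds(text_feats, M=10, dim=4, seed=0):
    rng = np.random.default_rng(seed)
    condition = text_feats.mean(axis=0)
    x = rng.standard_normal(dim)            # initial prosodic feature
    for m in range(1, M + 1):
        noise = rng.standard_normal(dim)    # fresh noise for round m
        eps = noise_predictor(x, condition, m)
        x = x - eps + 0.05 * noise          # round-m intermediate features
    return x                                # round-M result = predicted prosody

feats = np.ones((3, 4))
print(predict_prosody_m_rounds(feats).shape)  # (4,)
```

Note how each round consumes its own freshly sampled noise, so two runs with different random draws give different predicted prosody for the same text.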
In the above scheme, the prosody prediction module is further configured to predict, based on the text features and the initial prosodic features, the noise to be removed from the initial prosodic features; to subtract the noise to be removed from the initial prosodic features; and to determine the predicted prosodic features of the target text based on the subtraction result and the sampled noise.
In the above scheme, the first target distribution conforms to a first data-distribution type, and the sampling module is further configured to randomly sample, from the first target distribution, first sampled data conforming to that type and use it as the initial prosodic feature for prosody prediction; the second target distribution conforms to a second data-distribution type, and the sampling module is further configured to randomly sample, from the second target distribution, second sampled data conforming to that type and use it as the noise for prosody prediction.
In the above aspect, the prosody prediction module is further configured to obtain a prosody prediction model for prosody prediction; and calling the prosody prediction model to perform prosody prediction on the target text based on the text features, the initial prosody features and the noise to obtain predicted prosody features of the target text.
In the above scheme, the prosody prediction module is further configured to obtain an initial prosody prediction model, a sample text for training it, and the target prosodic features of the sample text; to perform feature extraction on the sample text to obtain its sample text features, and to sample noise from a third target distribution; to predict, through the initial prosody prediction model and based on the sample text features, the target prosodic features, and the sample noise, the noise to be removed, obtaining predicted to-be-removed noise that determines the model's prosody prediction result for the sample text; and to update the model parameters of the initial model based on the predicted to-be-removed noise, obtaining the prosody prediction model used for prosody prediction.
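This training recipe matches the standard denoising objective: corrupt the target prosodic features with sampled noise, have the model predict that noise, and regress against it. The sketch below uses a single linear layer and plain gradient descent in place of the (unspecified) model; every detail of the toy model is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4)) * 0.1      # toy "model": one linear layer

def train_step(text_feat, target_prosody, lr=0.1):
    global W
    eps = rng.standard_normal(4)           # sample noise (third distribution)
    noisy = target_prosody + eps           # corrupt the target prosody
    inp = np.concatenate([text_feat, noisy])
    eps_pred = inp @ W                     # predicted to-be-removed noise
    loss = np.mean((eps_pred - eps) ** 2)  # regress predicted vs. true noise
    grad = 2 * np.outer(inp, eps_pred - eps) / 4   # gradient of the MSE
    W -= lr * grad                         # update model parameters
    return loss

tf, tp = np.ones(4), np.zeros(4)
losses = [train_step(tf, tp) for _ in range(200)]
print(round(float(losses[0]), 3), round(float(losses[-1]), 3))
```

The claim's "determine a gradient ... update model parameters based on the gradient" corresponds to the last two lines of `train_step`.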
In the above aspect, the prosody prediction module is further configured to determine a gradient of the initial prosody prediction model based on the predicted to-be-removed noise, and to update the model parameters of the initial prosody prediction model based on that gradient.
In the above scheme, the prosody prediction module is further configured to perform speech synthesis on the target text based on the text feature and the predicted prosody feature, so as to obtain a target synthesized speech of the target text.
The embodiment of the application also provides electronic equipment, which comprises:
a memory for storing computer executable instructions;
and the processor is used for realizing the prosody prediction method provided by the embodiment of the application when executing the computer executable instructions stored in the memory.
The embodiment of the application also provides a computer readable storage medium which stores computer executable instructions or a computer program, and when the computer executable instructions or the computer program are executed by a processor, the prosody prediction method provided by the embodiment of the application is realized.
The embodiment of the application also provides a computer program product, which comprises computer executable instructions or a computer program, and the computer executable instructions or the computer program realize the prosody prediction method provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
according to the embodiment of the application, firstly, the characteristics of the target text are extracted to obtain the text characteristics of the target text, then the initial prosodic characteristics used for prosodic prediction are obtained by sampling from the first target distribution, and the noise used for prosodic prediction is obtained by sampling from the second target distribution, so that the target text is prosodic predicted based on the text characteristics, the initial prosodic characteristics and the noise to obtain the predicted prosodic characteristics of the target text; thus, the predicted prosodic features and the text features can be combined to perform speech synthesis to obtain the synthesized speech of the target text, which has the predicted prosodic features.
Here, because the initial prosodic features and the noise sampled during prosody prediction are random, they differ from one prediction run to the next, and so do the resulting predicted prosodic features. The method provided by the embodiments of the present application can therefore predict different prosodic features for the same text, enriching the diversity of a text's predicted prosody so that the speech synthesized for the text based on those features is diversified.
Drawings
Fig. 1 is a schematic diagram of a prosody prediction system 100 according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device 500 for implementing a prosody prediction method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a prosody prediction method according to an embodiment of the present application;
fig. 4 is a flowchart of a prosody prediction method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a prosody prediction method according to an embodiment of the present application;
FIG. 6 is a flowchart of a prosody prediction method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an application of a prosody prediction model provided by an embodiment of the present application;
FIG. 8 is a training schematic of a prosody prediction model provided by an embodiment of the present application;
fig. 9 is a schematic diagram of an application of a prosody prediction model according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third", and the like are used merely to distinguish similar objects and do not imply a particular ordering; where permitted, the specific order or sequence may be interchanged so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application will be described, and the terms and terminology involved in the embodiments of the present application will be used in the following explanation.
1) Client: an application program running in a terminal to provide various services, such as a client supporting prosody prediction.
2) In response to: indicates the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations performed may be executed in real time or with a set delay; unless otherwise specified, no limitation is placed on the order in which multiple such operations are executed.
3) Phoneme: the minimum speech unit divided according to the natural attributes of speech; phonemes serve as the basic modeling units in speech synthesis methods.
4) Prosodic features: in addition to the segmental (timbre) features formed by vowels and consonants arranged in time, speech carries features such as pitch (level), intensity, duration (length), and their interrelations; these properties are reflected in the pauses and rhythm of speech. Concretely: tone at the syllable level, stress at the syllable-combination level, length at the phoneme level, and intonation at the sentence level.
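As a concrete illustration of the pitch/intensity/duration triple (an assumed sketch, not part of the patent), the following code measures the duration and RMS intensity of a synthetic tone and estimates its pitch by autocorrelation; the sample rate and tone parameters are arbitrary.

```python
import numpy as np

SR = 8000
t = np.arange(int(0.25 * SR)) / SR           # 0.25 s of audio
wave = 0.6 * np.sin(2 * np.pi * 200 * t)     # 200 Hz tone (the "pitch")

duration = len(wave) / SR                    # length feature, in seconds
intensity = np.sqrt(np.mean(wave ** 2))      # RMS intensity

# Pitch via autocorrelation: the lag of the strongest repeat.
ac = np.correlate(wave, wave, mode="full")[len(wave) - 1:]
lo, hi = SR // 400, SR // 80                 # search the 80-400 Hz range
lag = lo + int(np.argmax(ac[lo:hi]))
pitch = SR / lag
print(round(duration, 2), round(pitch))      # 0.25 200
```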
5) Speech synthesis, also known as text-to-speech (TTS), is a technique that produces artificial speech by mechanical or electronic means. TTS technology converts text information generated by a computer itself or input externally into intelligible, fluent spoken Chinese output.
6) Encoding is the process of converting information from one form or format to another. Decoding is the inverse of encoding.
The embodiments of the present application provide a prosody prediction method, apparatus, electronic device, computer-readable storage medium, and computer program product, which can enrich the diversity of the predicted prosodic features of a text, so that the speech synthesized for the text based on those features is diversified. Each of these is described below.
It should be noted that when the embodiments of the present application are applied to specific products or technologies, user permissions or agreements need to be obtained, and the collection, use and processing of relevant data need to comply with relevant laws and regulations and standards of relevant countries and regions.
The following describes the prosody prediction system provided by an embodiment of the present application. Referring to fig. 1, fig. 1 is a schematic architecture diagram of a prosody prediction system 100 according to an embodiment of the present application. To support an exemplary application, a terminal (terminal 400-1 is shown as an example) is connected to a server 200 through a network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two, and uses wireless or wired links for data transmission.
A terminal (e.g., 400-1) for transmitting a prosody prediction request for the target text to the server 200 in response to the prosody prediction instruction for the target text;
a server 200 for receiving a prosody prediction request for a target text transmitted by a terminal; responding to the prosody prediction request, acquiring a target text, and extracting features of the target text to obtain text features of the target text; sampling from a first target distribution to obtain an initial prosody characteristic for prosody prediction, and sampling from a second target distribution to obtain noise for prosody prediction; performing prosody prediction on the target text based on the text features, the initial prosody features and the noise to obtain predicted prosody features of the target text;
in some embodiments, when the target text needs to be subjected to speech synthesis, the user may trigger a speech synthesis instruction for the target text at the terminal, where the terminal (for example, 400-1) is further configured to send a speech synthesis request for the target text to the server 200; the server 200 is further configured to receive a speech synthesis request for the target text; responding to the voice synthesis request, and performing voice synthesis based on the text characteristics and predicted prosody characteristics of the target text to obtain synthesized voice of the target text, wherein the synthesized voice has the predicted prosody characteristics; returning the synthesized voice of the target text to the terminal; a terminal (e.g., 400-1) for receiving the synthesized voice of the target text returned from the server 200; and playing the synthesized voice of the target text.
In some embodiments, the prosody prediction method provided by the embodiments of the present application may be implemented by various electronic devices, for example, may be implemented by a terminal alone, may be implemented by a server alone, or may be implemented by a terminal and a server in cooperation. The prosody prediction method provided by the embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent traffic, auxiliary driving, games, audios and videos, and the like.
In some embodiments, the electronic device implementing the prosody prediction method provided by the embodiments of the present application may be various types of terminals or servers. The server (e.g., server 200) may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers. The terminal (e.g., terminal 400-1) may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart voice interaction device (e.g., a smart speaker), a smart home appliance (e.g., a smart television), a smart watch, a vehicle-mounted terminal, a wearable device, a Virtual Reality (VR) device, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited by the embodiment of the present application.
In some embodiments, the prosody prediction method provided by the embodiments of the present application may be implemented by means of cloud technology. Cloud technology refers to hosting technology that unifies resources such as hardware, software, and networks in a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. It is a general term for network technology, information technology, integration technology, management-platform technology, application technology, and the like based on the cloud-computing business model; such resources can form a pool and be used flexibly on demand. Cloud-computing technology will become an important support, since the background services of technical network systems require large amounts of computing and storage resources. As an example, a server (e.g., server 200) may also be a cloud server providing basic cloud-computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communications, middleware services, domain-name services, security services, content delivery networks (CDN), big data, and artificial-intelligence platforms.
In some embodiments, multiple servers may be organized into a blockchain, and the servers may be nodes on the blockchain, where there may be information connections between each node in the blockchain, and where information may be transferred between the nodes via the information connections. The data related to the prosody prediction method provided by the embodiment of the present application (for example, the target text, the predicted prosody feature of the target text, the synthesized voice of the target text, etc.) may be stored on the blockchain.
In some embodiments, the terminal or the server may implement the prosody prediction method provided by the embodiments of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; a native application (APP), i.e., a program that must be installed in an operating system to run; an applet, i.e., a program that only needs to be downloaded into a browser environment to run; or an applet that can be embedded in any APP. In general, the computer program may be any form of application, module, or plug-in.
The electronic device for implementing the prosody prediction method provided by the embodiment of the application is described below. Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 for implementing a prosody prediction method according to an embodiment of the present application. The electronic device 500 provided in the embodiment of the present application may be a terminal or a server. The electronic device 500 provided in the embodiment of the application includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in electronic device 500 are coupled together by bus system 540. It is appreciated that the bus system 540 is used to enable connected communications between these components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to the data bus. The various buses are labeled as bus system 540 in fig. 2 for clarity of illustration.
The processor 510 may be an integrated circuit chip with signal-processing capabilities, such as a general-purpose processor (e.g., a microprocessor or any conventional processor), a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The user interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Memory 550 may include one or more storage devices physically located away from processor 510. Memory 550 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (Random Access Memory, RAM). The memory 550 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer;
a network communication module 552 for reaching other electronic devices via one or more (wired or wireless) network interfaces 520; exemplary network interfaces 520 include Bluetooth, Wireless Fidelity (Wi-Fi), Universal Serial Bus (Universal Serial Bus, USB), and the like;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating a peripheral device and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
the input processing module 554 is configured to detect one or more user inputs or interactions from one of the one or more input devices 532 and translate the detected inputs or interactions.
In some embodiments, the prosody prediction device provided in the embodiments of the present application may be implemented in software. Fig. 2 shows a prosody prediction device 555 stored in the memory 550, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: a feature extraction module 5551, a sampling module 5552, and a prosody prediction module 5553. These modules are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The function of each module is described below.
The following describes a prosody prediction method provided by the embodiment of the present application. In some embodiments, the prosody prediction method provided by the embodiments of the present application may be implemented by various electronic devices, for example, may be implemented by a terminal alone, may be implemented by a server alone, or may be implemented by a terminal and a server in cooperation. With reference to fig. 3, fig. 3 is a schematic flow chart of a prosody prediction method provided by an embodiment of the present application, where the prosody prediction method provided by the embodiment of the present application includes:
step 101: and the terminal performs feature extraction on the target text to obtain text features of the target text.
In step 101, the terminal may be provided with a client, such as a client supporting prosody prediction. The terminal runs the client, and a user can trigger a prosody prediction instruction for the target text through the client; in response to the prosody prediction instruction, the terminal acquires the target text to be subjected to prosody prediction. The target text may be set by the user as needed, may be generated according to a context scene (for example, generated by an intelligent voice assistant according to the user's speech), or may be an existing text (such as an audio reading material); for example, the target text may be "hello", "pleased when you are in distress", etc. After acquiring the target text, the terminal first performs feature extraction on the target text to obtain the text features of the target text.
In some embodiments, referring to fig. 4, fig. 4 is a flowchart of a prosody prediction method provided in an embodiment of the present application, where step 101 shown in fig. 3 may be implemented by steps 1011-1013 shown in fig. 4: step 1011, performing word segmentation processing on the target text to obtain a plurality of segmented words included in the target text; step 1012, for each word segment, obtaining the phoneme information of the word segment, and performing coding processing on the phoneme information to convert the phoneme information into phoneme features; in step 1013, the phoneme features of each word are combined to obtain the text features of the target text.
Here, when feature extraction is performed on the target text, phoneme-level features of the target text may be extracted. In actual implementation, this may be achieved by a pre-built text encoder, which may be constructed based on at least one layer of a Transformer network. The feature extraction process is as follows. In step 1011, word segmentation is performed on the target text to obtain the multiple word segments it contains; specifically, the target text may first be regularized into a standard text, on which word segmentation is then performed. In step 1012: 1) for each word segment, the phoneme information of the word segment is obtained, specifically the phoneme information of each word in the word segment; the phoneme information represents the pronunciation of the corresponding word segment and may be pinyin, phonetic symbols, or any other notation suitable for representing pronunciation. 2) For each word segment, the phoneme information is encoded, converting it into phoneme features. In step 1013, the phoneme features of the word segments included in the target text are combined to obtain the text features of the target text. Specifically, the phoneme features of the word segments may be spliced, and the splicing result used as the text features of the target text; the splicing may include addition, multiplication, or similar operations on the phoneme features.
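The steps above can be sketched as a minimal pipeline, assuming a toy phoneme table and a one-hot vector standing in for the learned Transformer text encoder; all names here (PHONEME_TABLE, encode_phoneme, etc.) are illustrative, not from the patent.

```python
PHONEME_TABLE = {"ni": ["n", "i3"], "hao": ["h", "ao3"]}   # word -> phonemes
PHONEME_IDS = {p: i for i, p in enumerate(["n", "i3", "h", "ao3"])}
EMBED_DIM = len(PHONEME_IDS)

def encode_phoneme(phoneme):
    """Toy 'encoding' (step 1012): a one-hot vector in place of a learned embedding."""
    vec = [0.0] * EMBED_DIM
    vec[PHONEME_IDS[phoneme]] = 1.0
    return vec

def text_features(word_segments):
    """Encode each word segment's phonemes and splice the per-phoneme
    features into the phoneme-level text features (step 1013)."""
    features = []
    for word in word_segments:              # output of step 1011 (segmentation)
        for ph in PHONEME_TABLE[word]:      # phoneme info of the word segment
            features.append(encode_phoneme(ph))
    return features

feats = text_features(["ni", "hao"])        # pinyin phonemes for a two-word text
```

A real system would obtain phoneme information from a grapheme-to-phoneme tool and learn the embedding jointly with the speech synthesis framework.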
It should be noted that the prosodic features are prosodic features of phonemes of the target text, each phoneme having a corresponding prosodic feature. The prosodic features may be represented by dimensions of fundamental frequency, energy, duration, etc. Text features in embodiments of the present application may be understood as text tokens and prosodic features may be understood as prosodic tokens.
Step 102: an initial prosody characteristic for prosody prediction is sampled from a first target distribution and noise for prosody prediction is sampled from a second target distribution.
In step 102, the terminal may first acquire a first target distribution and a second target distribution. Then randomly sampling from the first target distribution to obtain initial prosody characteristics for prosody prediction; and randomly sampling from the second target distribution to obtain noise for prosody prediction. In practical applications, the first target distribution may conform to a first data distribution type, i.e. the first target distribution comprises a plurality of random data conforming to the first data distribution type, and the second target distribution may conform to a second data distribution type, i.e. the second target distribution comprises a plurality of random data conforming to the second data distribution type. The first data distribution type and the second data distribution type may be the same or different. The data distribution type (first data distribution type or second data distribution type) may be a normal distribution, a standard normal distribution, or the like.
In some embodiments, the terminal may construct the first target distribution by: generating first random data conforming to the first data distribution type, and constructing first target distribution based on the first random data; accordingly, the terminal may construct the second target distribution by: second random data conforming to the second data distribution type is generated, and a second target distribution is constructed based on the second random data. Here, when generating random data, it may be generated by a random data generation algorithm.
In some embodiments, when a target distribution (the first or the second target distribution) conforms to a target data distribution type (the first or the second data distribution type), random sampling draws sampling data conforming to that type from the distribution and uses it as the sampling result. That is, since the first target distribution conforms to the first data distribution type, the terminal may randomly sample first sampling data conforming to the first data distribution type from the first target distribution and use it as the initial prosodic features for prosody prediction. Accordingly, since the second target distribution conforms to the second data distribution type, the terminal may randomly sample second sampling data conforming to the second data distribution type from the second target distribution and use it as the noise for prosody prediction.
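The sampling in step 102 can be sketched as follows, under the assumption that both target distributions are standard normal; the helper name and dimensionality are illustrative.

```python
import random

def sample_gaussian_vector(dim, rng):
    """Draw one vector of i.i.d. standard-normal samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

rng = random.Random(0)   # seeded only to make the sketch reproducible
DIM = 8                  # hypothetical prosody-feature dimensionality

# Initial prosodic features from the first target distribution,
# and noise from the second; both N(0, I) in this sketch.
initial_prosody = sample_gaussian_vector(DIM, rng)
noise = sample_gaussian_vector(DIM, rng)
```

In production the seed would not be fixed, so repeated runs draw different samples, which is exactly what makes repeated predictions for the same text differ.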
Here, since the initial prosodic features and the noise sampled during prosody prediction are random, they differ from one round of prosody prediction processing to another, and so do the resulting predicted prosodic features. Thus, by executing the prosody prediction method provided by the embodiment of the present application multiple times for the same text, different predicted prosodic features can be obtained, enriching the diversity of the predicted prosodic features of the text and thereby diversifying the synthesized speech obtained by performing speech synthesis on the text based on them.
Step 103: the terminal performs prosody prediction on the target text based on the text features, the initial prosodic features, and the noise to obtain the predicted prosodic features of the target text.
The predicted prosodic features are used for combining text features to perform speech synthesis to obtain synthesized speech of the target text, wherein the synthesized speech has the predicted prosodic features.
In step 103, after obtaining the text feature of the target text, the initial prosodic feature for prosodic prediction, and the noise, prosodic prediction is performed on the target text based on the text feature, the initial prosodic feature, and the noise, to obtain a predicted prosodic feature of the target text.
In some embodiments, before prosody prediction is performed on the target text, the terminal may further acquire prosody control information (or prosody control conditions) for prosody prediction, where the prosody control information may be preset or may be generated according to a context scene of the target text, content of the target text, or the like. The prosodic control information may include information on timbre, pitch, intensity, emotion (e.g., happy, sad), etc. Based on the above, the terminal may perform prosody prediction on the target text based on the text feature, the initial prosody feature, and the noise by: combining the text characteristics and prosody control information to obtain combined characteristics; and performing prosody prediction on the target text based on the combined characteristics, the initial prosody characteristics and the noise to obtain predicted prosody characteristics of the target text. Here, when the text feature and the prosody control information are combined, the text feature and the prosody control information may be subjected to a splicing process, and the result of the splicing process may be used as the combined feature. The concatenation process may include an addition process, a multiplication process, or the like of the text feature and prosody control information.
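The combination of text features with prosody control information described above can be sketched as element-wise addition (one of the splicing options mentioned); the vectors and dimensionality are illustrative assumptions.

```python
def combine(text_feat, control_feat):
    """Splice text features with prosody control information by addition;
    assumes both vectors share one dimensionality."""
    assert len(text_feat) == len(control_feat)
    return [a + b for a, b in zip(text_feat, control_feat)]

combined = combine([0.5, 1.0, -0.5], [0.25, 0.0, 0.25])
```

In practice the prosody control information (timbre, emotion, etc.) would first be embedded into the same vector space as the text features before splicing.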
In some embodiments, the prosody prediction process is a process of denoising the initial prosodic features; what is actually predicted during prosody prediction is therefore the noise to be removed from the initial prosodic features. Based on this, referring to fig. 5, fig. 5 is a flowchart of a prosody prediction method according to an embodiment of the present application, where step 103 shown in fig. 3 may be implemented by steps 1031-1032 shown in fig. 5: step 1031, based on the text features and the initial prosodic features, predict the noise to be removed from the initial prosodic features; step 1032, subtract the noise to be removed from the initial prosodic features, and determine the predicted prosodic features of the target text based on the subtraction result and the sampled noise.
In some embodiments, the prosody prediction comprises M rounds of prosody prediction. Accordingly, referring to fig. 6, fig. 6 is a flowchart of a prosody prediction method according to an embodiment of the present application, where step 102 shown in fig. 3 may be implemented by step 1021 shown in fig. 6: sample the initial prosodic features for prosody prediction from the first target distribution, and, for each round of the M rounds of prosody prediction, sample the noise for that round from the second target distribution. Based on this, step 103 shown in fig. 3 may be implemented by steps 1033-1035 shown in fig. 6: step 1033, for the 1st round of the M rounds of prosody prediction, perform prosody prediction on the target text based on the text features, the initial prosodic features, and the noise for the 1st round, obtaining the intermediate predicted prosodic features of the 1st round; step 1034, for the m-th round of the M rounds of prosody prediction, perform prosody prediction on the target text based on the text features, the intermediate predicted prosodic features of the (m-1)-th round, and the noise for the m-th round, obtaining the intermediate predicted prosodic features of the m-th round; step 1035, traverse m to obtain the intermediate predicted prosodic features of the M-th round, and take them as the predicted prosodic features of the target text; where m and M are integers greater than 0, and M is greater than or equal to m.
In practical applications, the prosody prediction includes M rounds of prosody prediction. In the embodiment of the present application, for each round of the M rounds, the noise for that round is obtained by sampling from the second target distribution; that is, each round has its own noise. Based on this, the processing of each round of the M rounds of prosody prediction is described below:
For the 1st round of the M rounds of prosody prediction: perform prosody prediction on the target text based on the text features, the initial prosodic features, and the noise for the 1st round, obtaining the intermediate predicted prosodic features of the 1st round. Specifically, the text features, the initial prosodic features, and the noise for the 1st round are first spliced to obtain spliced features, and prosody prediction is then performed on the target text based on the spliced features to obtain the intermediate predicted prosodic features of the 1st round.
For the m-th round of the M rounds of prosody prediction: perform prosody prediction on the target text based on the text features, the intermediate predicted prosodic features of the (m-1)-th round, and the noise for the m-th round, obtaining the intermediate predicted prosodic features of the m-th round. Similarly, the text features, the intermediate predicted prosodic features of the (m-1)-th round, and the noise for the m-th round may be spliced to obtain spliced features, and prosody prediction then performed on the target text based on the spliced features to obtain the intermediate predicted prosodic features of the m-th round. Traversing m yields the intermediate predicted prosodic features of the M-th round, which are taken as the predicted prosodic features of the target text. Here, the intermediate predicted prosodic features of each round are computed from those of the previous round, so that the M cascaded rounds of round-by-round prosody prediction improve the accuracy of prosody prediction and, in turn, the expressiveness of the predicted prosodic features and of the final synthesized speech.
In practical application, for the 1st round of the M rounds of prosody prediction: based on the text features and the initial prosodic features, predict the noise to be removed from the initial prosodic features, obtaining the 1st-round noise to be removed; subtract the 1st-round noise to be removed from the initial prosodic features, and determine the intermediate predicted prosodic features of the 1st round based on the subtraction result and the sampled noise. For the m-th round of the M rounds: based on the text features and the intermediate predicted prosodic features of the (m-1)-th round, predict the noise to be removed from those features, obtaining the m-th-round noise to be removed; subtract the m-th-round noise to be removed from the intermediate predicted prosodic features of the (m-1)-th round, and determine the intermediate predicted prosodic features of the m-th round based on the subtraction result and the sampled noise. Traversing m yields the intermediate predicted prosodic features of the M-th round, which are taken as the predicted prosodic features of the target text. In this way, the M cascaded rounds of prosody prediction realize a gradual denoising of the initial prosodic features, making the obtained predicted prosodic features more accurate and improving the expressiveness of the predicted prosodic features and of the final synthesized speech.
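The cascaded denoising rounds above can be sketched with the standard DDPM-style reverse step; the noise schedule, the dimensionality, and the stub predict_noise (standing in for the trained denoiser) are all illustrative assumptions, not the patent's actual model.

```python
import math
import random

T = 4                                   # total rounds M (= diffusion steps T)
BETAS = [0.1, 0.2, 0.3, 0.4]            # hypothetical noise schedule beta_1..beta_T
ALPHAS = [1.0 - b for b in BETAS]       # alpha_t := 1 - beta_t
ALPHA_BARS = []                         # alpha_bar_t := product of alpha_1..alpha_t
prod = 1.0
for a in ALPHAS:
    prod *= a
    ALPHA_BARS.append(prod)

def predict_noise(x, cond, t):
    """Stub for the noise-to-remove prediction; a real model is a neural net."""
    return [0.1 * v for v in x]

def denoise(cond, dim, rng):
    """Each round predicts the noise to remove, subtracts it (with the DDPM
    scaling), and adds freshly sampled noise except at the final step."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # initial prosodic features
    for t in range(T, 0, -1):                       # time steps T, T-1, ..., 1
        a_t, ab_t = ALPHAS[t - 1], ALPHA_BARS[t - 1]
        eps = predict_noise(x, cond, t)
        x = [(v - (1 - a_t) / math.sqrt(1 - ab_t) * e) / math.sqrt(a_t)
             for v, e in zip(x, eps)]
        if t > 1:                                   # fresh noise z except at t = 1
            sigma = math.sqrt(BETAS[t - 1])
            x = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
    return x

predicted_prosody = denoise(cond=None, dim=8, rng=random.Random(0))
```

Because the initial vector and the per-step noise are resampled on every call, repeated calls with an unseeded generator yield different predicted prosodic features for the same condition, which is the source of the diversity discussed above.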
In some embodiments, the terminal may perform prosody prediction on the target text based on the text features, the initial prosodic features, and the noise as follows: acquire a prosody prediction model for prosody prediction; and invoke the prosody prediction model, based on the text features, the initial prosodic features, and the noise, to perform prosody prediction on the target text and obtain its predicted prosodic features. Here, the prosody prediction model may be constructed based on a neural network, such as a deep neural network or a convolutional neural network. In the embodiment of the present application, the prosody prediction model is built with a diffusion probability model as its core; in this case, the number of rounds M equals the total number of diffusion time steps T of the diffusion probability model (T is a positive integer). Each value of m corresponds to a time step t, the time steps belonging to the sequence {T, T-1, …, 1}: as m increases, t decreases, down to 1. Based on this, each round of prosody prediction also needs to be performed in conjunction with the time step t. Specifically, the prosody prediction model may predict the prosody characterization by the following equation (1):
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, c, t) \right) + \sigma_t z    (1)

where c is the input condition obtained by combining the text characterization and the prosody control condition (i.e., the prosody control information); x_t is the predicted prosody characterization obtained at the previous time step (i.e., the intermediate predicted prosodic features of the (m-1)-th round) and x_{t-1} is the prediction at the current step; z is the noise sampled from the second target distribution (e.g., a standard normal distribution) at each time step t; \alpha_t, \bar{\alpha}_t, and \sigma_t are fixed parameters determined by the noise schedule; and \epsilon_\theta(x_t, c, t) is the predicted noise to be removed. Repeating the calculation of equation (1) T times yields the final predicted prosody characterization.
In addition, since the prosody prediction process of the prosody prediction model comprises M rounds of prosody prediction, N rounds (N is a positive integer smaller than M) can be sampled from the M rounds for training, which improves training efficiency and reduces wasted computing resources.
In actual implementation, the prosody prediction model is trained in advance, and a training process of the prosody prediction model is described next.
In some embodiments, the terminal may obtain the prosody prediction model by training as follows: acquire an initial prosody prediction model, a sample text for training it, and the target prosodic features of the sample text; perform feature extraction on the sample text to obtain its sample text features; based on the sample text features and the target prosodic features, predict the noise to be removed through the initial prosody prediction model, obtaining the predicted noise to be removed, which is used to determine the prosody prediction result of the initial prosody prediction model for the sample text; and update the model parameters of the initial prosody prediction model based on the predicted noise to be removed, obtaining the prosody prediction model for prosody prediction.
In some embodiments, the terminal may update the model parameters of the initial prosody prediction model based on the predicted noise to be removed as follows: determine the gradient of the initial prosody prediction model based on the predicted noise to be removed, and update the model parameters based on the gradient.
In the model training stage, the training target of the prosody prediction model is to maximize the variational lower bound of the log-likelihood of the prosody characterization distribution. To achieve this, the training data (comprising the sample text features and the target prosodic features) are processed by the prosody prediction model during training. Specifically, the noise to be removed is predicted by the prosody prediction model based on the training data, the gradient of the initial prosody prediction model is calculated from the predicted noise to be removed, and the model parameters are updated based on the gradient until convergence, completing the training.
In practical application, when the initial prosody prediction model is constructed based on the diffusion probability model, the gradient of the initial prosody prediction model may be determined from the predicted noise to be removed using the following equation (2):
\nabla_\theta \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; c,\; t \right) \right\|^2    (2)

where c is the input condition obtained by combining the sample text characterization and the prosody control condition (i.e., the prosody control information); t is the current time step, uniformly sampled from the positive integers 1 to T; \epsilon is noise sampled from a standard normal distribution; \epsilon_\theta(\cdot) is the predicted noise to be removed at the current time step; x_0 is the target prosody characterization; and the fixed parameters are \alpha_t := 1 - \beta_t and \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s, where \beta_t is the diffusion noise to be added at the current time step t.
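One training step for the gradient above can be sketched as follows: sample a time step and Gaussian noise, form the noised target prosody characterization, and score the denoiser's noise prediction with a squared error. The schedule and the stub predict_noise (in the patent, a non-causal WaveNet) are illustrative assumptions.

```python
import math
import random

T = 4
BETAS = [0.1, 0.2, 0.3, 0.4]            # hypothetical noise schedule
ALPHA_BARS = []                         # alpha_bar_t := product of (1 - beta_s)
prod = 1.0
for b in BETAS:
    prod *= 1.0 - b
    ALPHA_BARS.append(prod)

def predict_noise(x_t, cond, t):
    """Stub for the denoiser; a real model is a trained neural network."""
    return [0.0 for _ in x_t]

def training_loss(x0, cond, rng):
    """Sample t uniformly from {1..T} and eps ~ N(0, I), noise x_0 into x_t,
    and return the squared error between eps and the predicted noise (the
    quantity whose gradient w.r.t. theta drives the parameter update)."""
    t = rng.randint(1, T)
    ab_t = ALPHA_BARS[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab_t) * v + math.sqrt(1 - ab_t) * e
           for v, e in zip(x0, eps)]
    pred = predict_noise(x_t, cond, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

loss = training_loss([0.5] * 8, cond=None, rng=random.Random(1))
```

In a real framework the gradient of this loss with respect to the denoiser parameters would be computed by automatic differentiation and applied by an optimizer until convergence.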
In practical application, when speech synthesis is required, speech synthesis may be performed on the target text based on the text features and the predicted prosodic features to obtain the target synthesized speech of the target text. Here, the speech synthesis may be performed by a pre-trained speech synthesis model. Specifically, the target synthesized speech may be obtained by decoding the text features and the predicted prosodic features through an acoustic decoder of the speech synthesis model.
According to the embodiment of the present application, feature extraction is first performed on the target text to obtain its text features; the initial prosodic features for prosody prediction are then sampled from the first target distribution, and the noise for prosody prediction from the second target distribution; prosody prediction is then performed on the target text based on the text features, the initial prosodic features, and the noise to obtain the predicted prosodic features of the target text. The predicted prosodic features can thus be combined with the text features for speech synthesis, yielding synthesized speech of the target text that has the predicted prosodic features.
Here, since the initial prosodic features and the noise sampled during prosody prediction are random, they differ from one round of prosody prediction processing to another, and so do the predicted prosodic features. Thus, the prosody prediction method provided by the embodiment of the present application can predict different prosodic features for the same text, enriching the diversity of the predicted prosodic features of the text and thereby diversifying the synthesized speech obtained by performing speech synthesis on the text based on them.
An exemplary application of the embodiment of the present application in an actual application scenario is described below by taking a prosody prediction model based on a diffusion probability model construction as an example.
First, a prosody prediction method provided in the related art is explained. In the related art, the prosody characterization of real human speech is assumed in advance to follow a simple unimodal Laplace or Gaussian distribution, and the prosody prediction model is trained by optimizing the error of the prediction result. The prosody prediction model is a traditional deterministic prediction model: given the same prosody control conditions (i.e., the prosody control information described above), the output predicted prosody characterization is always the same for a single input text. Accordingly, the related art has the following disadvantages: 1) the prosody characterization distribution of human speech is oversimplified by the preset assumption, which deviates from the real situation, makes the predicted prosody characterization over-smoothed, and ultimately reduces the expressiveness of the synthesized speech; 2) directly optimizing the prediction-result error causes the prosody prediction model to overfit the sample texts in the training data, giving it poor generalization; 3) the deterministic prediction result leaves the predicted prosody characterization of a single input text without variation, so that the synthesized speech generated for that text based on the predicted prosody characterization is overly uniform.
Based on this, the embodiment of the present application provides a prosody prediction method to at least solve the above problems. The embodiment of the present application provides, in particular, a prosody prediction method based on a diffusion probability model and oriented to the speech synthesis task. Specifically, the target text and the prosody control conditions are used as inputs of a prosody prediction model, which outputs the corresponding predicted prosody characterization. The predicted prosody characterization can be used to control prosodic variation of the synthesized speech generated by the speech synthesis task, thereby improving its expressiveness and diversity. On this basis, the prosody prediction method provided by the embodiment of the present application addresses the above problems as follows: 1) the diffusion probability model does not presuppose the original distribution of the prosody characterization; model training maximizes the variational lower bound of the log-likelihood of the prosody characterization distribution, and prosody prediction proceeds by gradual denoising, so more complex prosody characterization distributions can be modeled, the over-smoothing phenomenon is avoided, and the expressiveness of the predicted prosody characterization and of the final synthesized speech is improved; 2) with the diffusion probability model as the generative model, the initial prosodic features and the noise are randomly sampled from a normal distribution during prediction, so the samples differ between rounds and so do the resulting predicted prosodic features; that is, executing the prosody prediction method provided by the embodiment of the present application multiple times for the same text yields different predicted prosodic features, enriching the diversity of the predicted prosodic features of the text and improving the diversity of the synthesized speech.
In practical application, the prosody prediction method provided by the embodiment of the application can be applied to content generation scenes depending on voice synthesis, such as intelligent voice assistants, automatic generation of electronic audio books and the like. Specifically, referring to fig. 7, fig. 7 is an application schematic diagram of a prosody prediction model provided in an embodiment of the present application. Here, the prosody prediction model provided by the embodiment of the present application may be combined with a speech synthesis framework including a text encoder and an acoustic decoder to form a speech synthesis service flow as shown in fig. 7, including:
(1) The service caller provides the input text and the prosody control conditions (such as timbre, emotion, and other information covered by the training data); (2) the text encoder obtains the text characterization from the input text; (3) the prosody prediction model performs prosody prediction according to the text characterization and the prosody control conditions to obtain the predicted prosody characterization; (4) the acoustic decoder decodes the synthesized speech of the input text according to the text characterization and the predicted prosody characterization; the content of the synthesized speech is the input text, and its prosody is regulated by the predicted prosody characterization.
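The service flow steps (1)-(4) can be sketched end to end with stub components; every function here is an illustrative stand-in, not the patent's actual encoder, predictor, or decoder.

```python
def text_encoder(text):
    """Stub for step (2): text -> a per-character 'characterization'."""
    return [float(ord(ch) % 7) for ch in text]

def prosody_predictor(text_repr, control):
    """Stub for step (3): prosody characterization from text + control."""
    return [v + control.get("pitch", 0.0) for v in text_repr]

def acoustic_decoder(text_repr, prosody_repr):
    """Stub for step (4): 'speech' whose content follows the text repr and
    whose prosody is modulated by the predicted prosody repr."""
    return [t * (1.0 + p) for t, p in zip(text_repr, prosody_repr)]

def synthesize(text, control):
    repr_ = text_encoder(text)
    prosody = prosody_predictor(repr_, control)
    return acoustic_decoder(repr_, prosody)

# Step (1): the service caller supplies input text and control conditions.
speech = synthesize("hello", {"pitch": 0.5})
```

The point of the sketch is the interface: the prosody predictor sits between the frozen text encoder and the acoustic decoder, consuming the text characterization plus control conditions and emitting a prosody characterization the decoder conditions on.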
The prosody prediction method provided by the embodiment of the present application mainly involves a text encoder and a prosody prediction model. 1) The text encoder may employ a multi-layer feedforward Transformer network, pre-trained together with the speech synthesis framework, and is responsible for encoding the input text into a phoneme-level representation (i.e., the text characterization); the parameters of the text encoder can be kept frozen during model training. 2) The prosody prediction model is a diffusion probability model that adopts a non-causal WaveNet as its denoiser; it is responsible for modeling the prosody characterization distribution under the given text characterization and prosody control conditions, and provides the function of predicting from the modeled prosody characterization distribution, so as to obtain the predicted prosody characterization. Specifically, the main parameters of the diffusion probability model include: a) preset fixed parameters: a1) the total number of time steps T of the diffusion process (i.e., the number M of prosody prediction rounds, a positive integer); a2) the length-T sequence of diffusion noise to be added during the diffusion process, {β_1, …, β_T}; b) trainable parameters: the neural network parameters θ of the non-causal-WaveNet-based denoiser.
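The preset fixed parameters can be sketched as follows. The linear shape of the noise schedule and the concrete values of T and the β endpoints are assumptions for illustration; the patent only states that T and {β_1, …, β_T} are preset fixed parameters:

```python
import numpy as np

T = 100                                      # total diffusion time steps (illustrative value)
betas = np.linspace(1e-4, 0.06, T)           # noise schedule {beta_1, ..., beta_T} (assumed linear)
alphas = 1.0 - betas                         # alpha_t := 1 - beta_t
alpha_bars = np.cumprod(alphas)              # alpha_bar_t := prod_{s<=t} alpha_s
```

Because every α_t lies in (0, 1), the cumulative products ᾱ_t decrease monotonically toward 0 as t grows, which is what makes the forward diffusion gradually destroy the prosody characterization.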
Next, the training process of the prosody prediction model provided by the embodiment of the present application will be described. Referring to fig. 8, fig. 8 is a training schematic diagram of a prosody prediction model provided by an embodiment of the present application. Here, in the model training phase, the training target of the prosody prediction model is to maximize the variational lower bound of the log-likelihood of the prosody characterization distribution. To achieve this effect in the actual training process, the training data (including the sample text, the prosody control conditions, and the target prosody characterization x_0 of the sample text) is processed as follows: the denoiser in the diffusion probability model predicts, based on the training data, the noise to be removed at the current time step t and outputs the predicted noise to be removed; gradients are then calculated based on the gradient formula (2)

∇_θ ‖ε − ε_θ(√(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε, c, t)‖²

and the trainable parameters θ of the prosody prediction model are updated with these gradients until convergence, at which point training is complete.
Wherein c is the input condition obtained by combining the prosody control conditions with the text characterization produced by encoding the sample text in the training dataset with the text encoder; t is the current time step, obtained by uniform sampling over the set of positive integers from 1 to T; ε is noise sampled from a third target distribution (e.g., a standard normal distribution); ε_θ(√(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε, c, t) is the noise to be removed at the current time step as predicted by the denoiser in the diffusion probability model, where x_0 is the preprocessed target prosody characterization of each phoneme in the training dataset, composed of prosody characterizations in three dimensions: fundamental frequency, energy, and duration; and the fixed parameters are α_t := 1 − β_t and ᾱ_t := ∏_{s=1}^{t} α_s.
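The training step just described (sample t and ε, corrupt x_0, and regress the denoiser's output against ε) can be sketched as follows. `denoiser` is a stand-in for the non-causal WaveNet, and the gradient computation and optimizer update on θ are omitted; only the per-step loss value is returned:

```python
import numpy as np

def training_step(x0, cond, T, alpha_bars, denoiser):
    """One diffusion training step: corrupt the target prosody characterization
    x0 at a random time step and score the denoiser's noise prediction."""
    t = np.random.randint(1, T + 1)                         # t ~ Uniform{1, ..., T}
    eps = np.random.randn(*x0.shape)                        # noise from the third target distribution
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps  # corrupted prosody characterization
    eps_hat = denoiser(x_t, cond, t)                        # predicted noise to be removed
    return np.mean((eps - eps_hat) ** 2)                    # squared error to differentiate w.r.t. theta
```

In an actual implementation this loss would be backpropagated through the denoiser to update θ, repeating until convergence.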
Next, the application process of the prosody prediction model provided by the embodiment of the present application will be described. Referring to fig. 9, fig. 9 is a schematic diagram illustrating an application of a prosody prediction model according to an embodiment of the present application. Here, in the model inference (application) phase, the prosody prediction model starts from an initial prosody characterization x_T sampled from a standard normal distribution and, for each time step t in the diffusion time step sequence {T, T−1, …, 1}, updates the prosody characterization in turn according to the following formula (1):

x_{t−1} = (1/√(α_t)) · (x_t − ((1 − α_t)/√(1 − ᾱ_t)) · ε_θ(x_t, c, t)) + σ_t · z
Wherein c is the input condition obtained by combining the prosody control conditions with the text characterization of the input text (i.e., the target text); x_{t−1} is the predicted prosody characterization obtained by the update at time step t; z is the noise sampled from the second target distribution (e.g., a standard normal distribution) at each time step t; ε_θ(x_t, c, t) is the noise to be removed; and the fixed parameters are α_t := 1 − β_t and ᾱ_t := ∏_{s=1}^{t} α_s, with σ_t determined by the noise schedule (e.g., σ_t = √(β_t)). Repeating the calculation of formula (1) T times yields the final predicted prosody characterization.
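The iterative update of formula (1) can be sketched as the following loop. `denoiser` stands in for ε_θ, and σ_t = √(β_t) is one common choice rather than a setting fixed by the patent:

```python
import numpy as np

def sample_prosody(cond, shape, T, betas, denoiser, rng=np.random):
    """Reverse-diffusion sampling: start from x_T ~ N(0, I) and repeatedly
    remove the predicted noise per formula (1)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.randn(*shape)                                  # initial prosody characterization x_T
    for t in range(T, 0, -1):
        z = rng.randn(*shape) if t > 1 else 0.0            # fresh noise z each step (none at t=1)
        eps_hat = denoiser(x, cond, t)                     # noise to be removed
        coef = (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t - 1]) + np.sqrt(betas[t - 1]) * z
    return x                                               # final predicted prosody characterization
```

Because x_T and each z are freshly sampled, calling `sample_prosody` repeatedly on the same condition c yields different prosody characterizations, which is exactly the diversity property emphasized above.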
In practice, 1) the multi-layer feedforward Transformer model in the text encoder may be replaced with a neural network of another structure; 2) the non-causal WaveNet network of the diffusion probability model in the prosody prediction model may likewise be replaced with a neural network of another structure; 3) the method of sampling the current time step t during model training may be replaced with weighted sampling, where the weights can be dynamically updated; 4) as for the representation of the current time step t during model training, its integer form can be replaced with a fractional form obtained by dividing t by T after uniformly sampling t from the set of positive integers between 1 and T; 5) as for the composition of the prosody characterization, other types of acoustic features, or combinations of implicit characterizations extracted by neural networks, may be substituted.
By applying the embodiment of the present application: 1) compared with the related art, the embodiment improves the overall fit of the predicted prosody characterization distribution to the real distribution; the Jensen-Shannon divergence between the distributions (with range (0, 1), where a smaller value indicates a better fit) in the three prosody characterization dimensions of fundamental frequency, energy, and duration drops from 0.199, 0.056, and 0.119 to 0.085, 0.055, and 0.056, respectively. 2) Compared with the related art, the embodiment improves the expressiveness of the predicted prosody characterization and of the synthesized speech it controls. 3) Compared with the related art, the embodiment enriches the diversity of the predicted prosody characterizations of a single text: different rounds of prosody prediction processing yield different predicted prosody characterizations, which remedies the problem in the related art that a single text can only yield the same predicted prosody characterization under the same conditions, and thereby improves the diversity of the synthesized speech.
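The Jensen-Shannon divergence used in the comparison above can be computed for discrete distributions as follows; base-2 logarithms give the (0, 1) range mentioned in the text:

```python
import numpy as np

def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.
    With base-2 logs the result lies in [0, 1]: 0 for identical
    distributions, 1 for distributions with disjoint support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()          # normalize to probability vectors
    m = 0.5 * (p + q)                        # mixture distribution

    def kl(a, b):                            # KL(a || b), skipping zero-probability bins
        mask = a > 0
        return np.sum(a[mask] * (np.log(a[mask] / b[mask]) / np.log(base)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Note that SciPy's `scipy.spatial.distance.jensenshannon` returns the JS distance (the square root of this divergence), so the two should not be compared directly.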
Continuing with the description below of an exemplary structure of the prosody prediction device 555, implemented as software modules, provided by an embodiment of the present application: in some embodiments, as shown in fig. 2, the software modules of the prosody prediction device 555 stored in the memory 550 may include: the feature extraction module 5551, configured to perform feature extraction on a target text to obtain text features of the target text; the sampling module 5552, configured to sample an initial prosody feature for prosody prediction from the first target distribution, and to sample noise for prosody prediction from the second target distribution; and the prosody prediction module 5553, configured to perform prosody prediction on the target text based on the text features, the initial prosody features, and the noise, to obtain predicted prosody features of the target text; the predicted prosody features are used, in combination with the text features, to perform speech synthesis to obtain synthesized speech of the target text having the predicted prosody features.
In some embodiments, the feature extraction module 5551 is further configured to perform word segmentation processing on the target text, so as to obtain a plurality of word segments included in the target text; for each word segment, obtaining the phoneme information of the word segment, and carrying out coding processing on the phoneme information so as to convert the phoneme information into phoneme characteristics; and combining the phoneme characteristics of each word segment to obtain the text characteristics of the target text.
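The word-segmentation and phoneme-encoding flow of this embodiment might be sketched as follows. The toy lexicon, whitespace segmentation, and one-hot coding are illustrative assumptions, not the encoder's actual phoneme inventory or coding process:

```python
# Hypothetical phoneme lexicon mapping word segments to phoneme information.
PHONE_LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
PHONE_IDS = {p: i for i, p in enumerate(
    sorted({p for ps in PHONE_LEXICON.values() for p in ps}))}

def text_features(text):
    """Segment the text, look up each segment's phonemes, encode each
    phoneme as a feature, and combine them into the text features."""
    segments = text.lower().split()                 # word segmentation (placeholder)
    features = []
    for seg in segments:
        for phone in PHONE_LEXICON.get(seg, []):    # phoneme information per segment
            one_hot = [0] * len(PHONE_IDS)
            one_hot[PHONE_IDS[phone]] = 1           # encode phoneme -> phoneme feature
            features.append(one_hot)
    return features                                 # combined phoneme-level text features
```

The real text encoder would produce learned phoneme-level embeddings rather than one-hot vectors, but the segment-then-encode-then-combine structure is the same.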
In some embodiments, the sampling module 5552 is further configured to generate first random data that conforms to a first data distribution type before sampling initial prosody features for prosody prediction from a first target distribution and sampling noise for prosody prediction from a second target distribution, and construct the first target distribution based on the first random data; generating second random data conforming to a second data distribution type, and constructing the second target distribution based on the second random data.
In some embodiments, the prosody prediction module 5553 is further configured to obtain prosody control information for prosody prediction before prosody prediction is performed on the target text based on the text feature, the initial prosody feature, and the noise, to obtain a predicted prosody feature of the target text; the prosody prediction module 5553 is further configured to combine the text feature and the prosody control information to obtain a combined feature; and performing prosody prediction on the target text based on the combined characteristic, the initial prosody characteristic and the noise to obtain predicted prosody characteristics of the target text.
In some embodiments, the prosody prediction comprises M rounds of prosody prediction; the sampling module 5552 is further configured to sample, for each round of the M rounds of prosody prediction, noise for that round from the second target distribution; the prosody prediction module 5553 is further configured to: for the 1st round of the M rounds of prosody prediction, perform prosody prediction on the target text based on the text features, the initial prosody features, and the noise for the 1st round, to obtain intermediate predicted prosody features of the 1st round; for the m-th round of the M rounds of prosody prediction, perform prosody prediction on the target text based on the text features, the intermediate predicted prosody features of the (m-1)-th round, and the noise for the m-th round, to obtain intermediate predicted prosody features of the m-th round; and traverse m to obtain the intermediate predicted prosody features of the M-th round, taking the intermediate predicted prosody features of the M-th round as the predicted prosody features of the target text; wherein m and M are integers greater than 0, and M is greater than or equal to m.
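The M-round structure described here (round 1 consumes the initial prosody features; round m consumes round (m-1)'s intermediate result) can be sketched as a simple fold. `predict_round` is a placeholder for one round of denoising prediction:

```python
def m_round_prediction(text_feat, initial, noises, predict_round):
    """Chain M rounds of prosody prediction: each round takes the previous
    round's intermediate predicted prosody features plus fresh noise."""
    intermediate = initial                           # round 1 starts from the initial features
    for m, noise in enumerate(noises, start=1):      # noises holds M per-round samples
        intermediate = predict_round(text_feat, intermediate, noise, m)
    return intermediate                              # round M's output is the final prediction
```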
In some embodiments, the prosody prediction module 5553 is further configured to predict, based on the text features and the initial prosody features, the noise to be removed from the initial prosody features, to obtain the noise to be removed; and to subtract the noise to be removed from the initial prosody features, determining the predicted prosody features of the target text based on the subtraction result and the noise.
In some embodiments, the first target distribution conforms to a first data distribution type, and the sampling module 5552 is further configured to randomly sample, from the first target distribution, first sampled data conforming to the first data distribution type, and use the first sampled data as the initial prosodic feature for prosody prediction; the second target distribution conforms to a second data distribution type, and the sampling module 5552 is further configured to randomly sample second sampling data conforming to the second data distribution type from the second target distribution, and use the second sampling data as the noise for prosody prediction.
In some embodiments, the prosody prediction module 5553 is further configured to obtain a prosody prediction model for prosody prediction; and calling the prosody prediction model to perform prosody prediction on the target text based on the text features, the initial prosody features and the noise to obtain predicted prosody features of the target text.
In some embodiments, the prosody prediction module 5553 is further configured to obtain an initial prosody prediction model, and to obtain a sample text for training the initial prosody prediction model together with target prosody features of the sample text; to perform feature extraction on the sample text to obtain sample text features of the sample text, and to sample from a third target distribution to obtain sample noise; to predict, through the initial prosody prediction model, the noise to be removed based on the sample text features, the target prosody features, and the sample noise, obtaining the predicted noise to be removed, where the predicted noise to be removed is used to determine the prosody prediction result of the initial prosody prediction model for the sample text; and to update model parameters of the initial prosody prediction model based on the predicted noise to be removed, obtaining the prosody prediction model used for prosody prediction.
In some embodiments, the prosody prediction module 5553 is further configured to determine a gradient of the initial prosody prediction model based on the predicted noise to be removed, and to update the model parameters of the initial prosody prediction model based on the gradient.
In some embodiments, the prosody prediction module 5553 is further configured to perform speech synthesis on the target text based on the text feature and the predicted prosody feature, so as to obtain a target synthesized speech of the target text.
According to the embodiment of the application, firstly, the characteristics of the target text are extracted to obtain the text characteristics of the target text, then the initial prosodic characteristics used for prosodic prediction are obtained by sampling from the first target distribution, and the noise used for prosodic prediction is obtained by sampling from the second target distribution, so that the target text is prosodic predicted based on the text characteristics, the initial prosodic characteristics and the noise to obtain the predicted prosodic characteristics of the target text; thus, the predicted prosodic features and the text features can be combined to perform speech synthesis to obtain the synthesized speech of the target text, which has the predicted prosodic features.
Here, since the initial prosodic features and noise sampled at the time of prosodic prediction processing have randomness, the sampled initial prosodic features and noise are different from one prosodic prediction processing to another, and thus the predicted prosodic features are also different. Thus, according to the prosody prediction method provided by the embodiment of the application, different predicted prosody characteristics can be predicted for the same text, and the diversity of the predicted prosody characteristics of the text is enriched, so that the synthesized voice obtained by synthesizing the voice for the text based on the predicted prosody characteristics is diversified.
Embodiments of the present application also provide a computer program product comprising computer-executable instructions or a computer program stored in a computer-readable storage medium. The processor of the electronic device reads the computer-executable instructions or the computer program from the computer-readable storage medium, and the processor executes the computer-executable instructions or the computer program, so that the electronic device executes the prosody prediction method provided by the embodiment of the present application.
The embodiment of the application also provides a computer readable storage medium, in which computer executable instructions or a computer program are stored, which when executed by a processor, cause the processor to perform the prosody prediction method provided by the embodiment of the application.
In some embodiments, the computer-readable storage medium may be a RAM, a ROM, a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM; or it may be any of various devices including one of, or any combination of, the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (Hyper Text Markup Language, HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.
Claims (15)
1. A prosody prediction method, the method comprising:
extracting characteristics of a target text to obtain text characteristics of the target text;
Sampling from a first target distribution to obtain an initial prosody characteristic for prosody prediction, and sampling from a second target distribution to obtain noise for prosody prediction;
performing prosody prediction on the target text based on the text features, the initial prosody features and the noise to obtain predicted prosody features of the target text;
the predicted prosodic features are used for combining the text features to perform speech synthesis to obtain the synthesized speech of the target text, which has the predicted prosodic features.
2. The method of claim 1, wherein the feature extraction of the target text to obtain text features of the target text comprises:
performing word segmentation processing on the target text to obtain a plurality of word segments included in the target text;
for each word segment, obtaining the phoneme information of the word segment, and carrying out coding processing on the phoneme information so as to convert the phoneme information into phoneme characteristics;
and combining the phoneme characteristics of each word segment to obtain the text characteristics of the target text.
3. The method of claim 1, wherein prior to sampling initial prosodic features for prosody prediction from the first target distribution and sampling noise for prosody prediction from the second target distribution, the method further comprises:
Generating first random data conforming to a first data distribution type, and constructing the first target distribution based on the first random data;
generating second random data conforming to a second data distribution type, and constructing the second target distribution based on the second random data.
4. The method of claim 1, wherein the prosody predicting the target text based on the text feature, the initial prosody feature, and the noise, prior to deriving the predicted prosody feature of the target text, the method further comprises:
acquiring prosody control information for prosody prediction;
performing prosody prediction on the target text based on the text feature, the initial prosody feature, and the noise to obtain a predicted prosody feature of the target text, including:
combining the text features and the prosody control information to obtain combined features;
and performing prosody prediction on the target text based on the combined characteristic, the initial prosody characteristic and the noise to obtain predicted prosody characteristics of the target text.
5. The method of claim 1, wherein the prosody prediction comprises M rounds of prosody prediction; and the sampling from the second target distribution to obtain noise for prosody prediction comprises:
sampling, for each round of the M rounds of prosody prediction, noise for that round of prosody prediction from the second target distribution;
the performing prosody prediction on the target text based on the text features, the initial prosody features, and the noise to obtain predicted prosody features of the target text comprises:
for the 1st round of the M rounds of prosody prediction, performing prosody prediction on the target text based on the text features, the initial prosody features, and the noise for the 1st round, to obtain intermediate predicted prosody features of the 1st round;
for the m-th round of the M rounds of prosody prediction, performing prosody prediction on the target text based on the text features, the intermediate predicted prosody features of the (m-1)-th round, and the noise for the m-th round, to obtain intermediate predicted prosody features of the m-th round;
traversing m to obtain the intermediate predicted prosody features of the M-th round, and taking the intermediate predicted prosody features of the M-th round as the predicted prosody features of the target text;
wherein m and M are integers greater than 0, and M is greater than or equal to m.
6. The method of claim 1, wherein prosody predicting the target text based on the text feature, the initial prosody feature, and the noise results in a predicted prosody feature of the target text, comprising:
predicting, based on the text features and the initial prosody features, the noise to be removed from the initial prosody features, to obtain the noise to be removed;
and subtracting the noise to be removed from the initial prosody features, and determining the predicted prosody features of the target text based on the subtraction result and the noise.
7. The method of claim 1, wherein,
the first target distribution accords with a first data distribution type, and the initial prosody characteristic for prosody prediction is obtained by sampling from the first target distribution, and the method comprises the following steps:
randomly sampling from the first target distribution to obtain first sampling data conforming to the first data distribution type, and taking the first sampling data as the initial prosody characteristic for prosody prediction;
The second target distribution accords with a second data distribution type, and noise for prosody prediction is obtained by sampling from the second target distribution, and the method comprises the following steps:
and randomly sampling second sampling data conforming to the second data distribution type from the second target distribution, and taking the second sampling data as the noise for prosody prediction.
8. The method of claim 1, wherein prosody predicting the target text based on the text feature, the initial prosody feature, and the noise results in a predicted prosody feature of the target text, comprising:
acquiring a prosody prediction model for prosody prediction;
and calling the prosody prediction model to perform prosody prediction on the target text based on the text features, the initial prosody features and the noise to obtain predicted prosody features of the target text.
9. The method of claim 8, wherein the obtaining a prosody prediction model for prosody prediction comprises:
acquiring an initial prosody prediction model, and acquiring a sample text for training the initial prosody prediction model and target prosody features of the sample text;
Extracting features of the sample text to obtain sample text features of the sample text, and sampling from third target distribution to obtain sample noise;
based on the sample text features, the target prosodic features and the sample noise, predicting noise to be removed through the initial prosodic prediction model to obtain predicted noise to be removed, wherein the predicted noise to be removed is used for determining a prosodic prediction result of the initial prosodic prediction model for the sample text;
and updating model parameters of the initial prosody prediction model based on the predicted noise to be removed, to obtain the prosody prediction model for prosody prediction.
10. The method of claim 9, wherein the updating model parameters of the initial prosody prediction model based on the predicted noise to be removed comprises:
determining a gradient of the initial prosody prediction model based on the predicted noise to be removed;
based on the gradient, model parameters of the initial prosody prediction model are updated.
11. The method of claim 1, wherein the method further comprises:
and performing voice synthesis on the target text based on the text characteristics and the predicted prosody characteristics to obtain target synthesized voice of the target text.
12. A prosody prediction device, the device comprising:
the feature extraction module is used for extracting features of the target text to obtain text features of the target text;
the sampling module is used for sampling initial prosody characteristics for prosody prediction from the first target distribution and sampling noise for prosody prediction from the second target distribution;
the prosody prediction module is used for performing prosody prediction on the target text based on the text characteristics, the initial prosody characteristics and the noise to obtain predicted prosody characteristics of the target text;
the predicted prosodic features are used for combining the text features to perform speech synthesis to obtain the synthesized speech of the target text, which has the predicted prosodic features.
13. An electronic device, the electronic device comprising:
a memory for storing computer executable instructions;
a processor for implementing the prosody prediction method of any one of claims 1 to 11 when executing computer-executable instructions stored in the memory.
14. A computer-readable storage medium storing computer-executable instructions or a computer program, which when executed by a processor, implements the prosody prediction method of any one of claims 1 to 11.
15. A computer program product comprising computer-executable instructions or a computer program, which, when executed by a processor, implements the prosody prediction method of any of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310121183.1A CN117219052A (en) | 2023-01-31 | 2023-01-31 | Prosody prediction method, apparatus, device, storage medium, and program product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117219052A true CN117219052A (en) | 2023-12-12 |
Family
ID=89043162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310121183.1A Pending CN117219052A (en) | 2023-01-31 | 2023-01-31 | Prosody prediction method, apparatus, device, storage medium, and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117219052A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117995209A (en) * | 2024-03-28 | 2024-05-07 | 荣耀终端有限公司 | Voice conversion method and related equipment |
CN118588085A (en) * | 2024-08-05 | 2024-09-03 | 南京硅基智能科技有限公司 | Voice interaction method, voice interaction system and storage medium |
CN118588057A (en) * | 2024-08-05 | 2024-09-03 | 南京硅基智能科技有限公司 | Speech synthesis method, speech synthesis apparatus, and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||