CN112382267A - Method, apparatus, device and storage medium for converting accents
- Publication number: CN112382267A
- Application number: CN202011270379.XA
- Authority
- CN
- China
- Prior art keywords
- target
- accent
- text
- features
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Abstract
The application discloses a method, apparatus, device, and storage medium for converting accents, relating to the fields of speech synthesis, natural language processing, computer technology, artificial intelligence, and deep learning. The specific implementation scheme is as follows: acquiring a target spoken text; determining, based on the target spoken text, a first acoustic feature indicating a target accent; determining linguistic features in the first acoustic feature; and determining, based on the linguistic features and the target spoken text, audio with the target accent corresponding to the target spoken text, and outputting the audio. By performing speech synthesis, recognition, and conversion on the acquired spoken text data of a user, this implementation can accurately and quickly convert any text input by the user into audio with the user's target accent.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to the fields of speech synthesis, natural language processing, artificial intelligence, and deep learning, and more particularly to a method, an apparatus, a device, and a storage medium for converting accents.
Background
In recent years, with the rapid development of online education and online learning, accent conversion techniques, which aim to convert a user's accent into a target accent, have been widely studied and have attracted attention. Accent conversion technology likewise has great application prospects in entertainment. Existing accent conversion techniques, however, are slow, and their conversion results are often inaccurate.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for converting accents.
According to an aspect of the present disclosure, there is provided a method for converting accents, including: acquiring a target spoken text; determining, based on the target spoken text, a first acoustic feature indicating a target accent; determining linguistic features in the first acoustic feature; and determining, based on the linguistic features and the target spoken text, audio with the target accent corresponding to the target spoken text, and outputting the audio.
According to another aspect of the present disclosure, there is provided an apparatus for converting accents, including: an acquisition unit configured to acquire a target spoken text; a first acoustic feature determination unit configured to determine, based on the target spoken text, a first acoustic feature indicating a target accent; a linguistic feature determination unit configured to determine linguistic features in the first acoustic feature; and a conversion unit configured to determine, based on the linguistic features and the target spoken text, audio with the target accent corresponding to the target spoken text, and to output the audio.
According to still another aspect of the present disclosure, there is provided an electronic device for converting an accent, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for converting accents as described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method for converting an accent as described above.
The technology of the present application solves the problem that accent conversion cannot be performed accurately and quickly: by carrying out speech synthesis, recognition, and conversion on the acquired spoken text data of a user, any text input by the user can be accurately and quickly converted into audio with the user's target accent.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for converting an accent, according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for converting accents according to the present application;
FIG. 4 is a flow diagram of another embodiment of a method for converting an accent in accordance with the present application;
FIG. 5 is a schematic diagram of an embodiment of an apparatus for converting accents according to the present application;
fig. 6 is a block diagram of an electronic device for implementing a method for converting accents according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the present method for converting an accent or apparatus for converting an accent may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a speech synthesis application, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, in-car computers, laptop computers, desktop computers, and the like. When they are software, they may be installed in the electronic devices listed above and implemented as multiple pieces of software or software modules, or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a server that provides various services, such as a background server that processes the target spoken text collected by the terminal devices 101, 102, 103. The background server may acquire the target spoken text; determine, based on it, a first acoustic feature indicating a target accent; determine linguistic features in the first acoustic feature; and determine, based on the linguistic features and the target spoken text, audio with the target accent corresponding to the target spoken text, and output the audio.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as a plurality of software or software modules, or as a single software or software module. And is not particularly limited herein.
It should be noted that the method for converting accents provided in the embodiments of the present application is generally performed by the server 105. Accordingly, the means for converting accents is typically located in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for converting accents in accordance with the present application is shown. The method for converting accents of the embodiment comprises the following steps:
Step 201, acquiring a target spoken text.
In this embodiment, the executing body of the method for converting accents (for example, the server 105 in fig. 1) may obtain, through a wired or wireless connection, a target spoken text collected by a terminal device through recording or scanning. Specifically, the target spoken text may be text converted from speech uttered by any person who wishes to convert an accent, and may consist of any words. For example, if the utterance is "I love my hometown", the target spoken text may be "I love my hometown" expressed in written form.
Step 202, determining, based on the target spoken text, a first acoustic feature indicating a target accent.
After acquiring the target spoken text, the executing body may determine, based on it, a first acoustic feature indicating the target accent. In particular, the first acoustic feature may be a speech feature parameter characterizing the target accent, for example a mel spectrum. In this embodiment, the executing body may perform pre-emphasis, framing, and windowing on the audio corresponding to the target spoken text, then apply a short-time Fourier transform (STFT) to the signal of each frame to obtain a short-time amplitude spectrum, and pass the short-time amplitude spectrum through a mel filter bank to obtain the mel spectrum. The present application does not specifically limit the manner in which the mel spectrum (i.e., the first acoustic feature) is obtained.
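The pipeline just described (pre-emphasis, framing, Hann windowing, short-time amplitude spectrum, mel filter bank) can be sketched in plain NumPy. All parameter values here — the 16 kHz sample rate, 512-sample frames, 160-sample hop, 40 mel bands, the 0.97 pre-emphasis coefficient, and the O'Shaughnessy mel formula — are common conventions assumed for illustration, not values specified by the patent.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel formula -- a common convention, assumed here
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters centred at points evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Pre-emphasis, framing and Hann windowing, as in the description
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - n_fft) // hop)
    window = np.hanning(n_fft)
    frames = np.stack([emphasized[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Short-time Fourier transform -> short-time amplitude spectrum
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Pass the amplitude spectrum through the mel filter bank
    return mag @ mel_filterbank(n_mels, n_fft, sr).T

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
mel = mel_spectrogram(tone)
print(mel.shape)  # (97, 40): frames x mel bands
```

A production system would more likely use an optimized library routine; the sketch only makes the described frame-by-frame flow concrete.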
Having determined the first acoustic feature, the executing body may determine the linguistic features it contains. Linguistic features may include prosodic features, syntax, discourse structure, information structure, and the like. Prosodic features, also called suprasegmental features, belong to the sound system of a language and fall into three main aspects: intonation, temporal distribution, and stress, all realized through suprasegmental characteristics such as pitch, intensity, and duration, carried by a phoneme or group of phonemes. Prosody is a typical feature of natural human language, and many of its features are shared across languages, for example pitch declination, stress, and pauses. Prosodic features are one of the important vehicles of linguistic and emotional expression. The executing body may compare known linguistic features with the features in the first acoustic feature, and determine as linguistic features those whose similarity to a known linguistic feature exceeds a threshold. Specifically, the executing body may convert each first acoustic feature into a first acoustic feature vector and each known linguistic feature into a linguistic feature vector.
Cosine similarity is then computed between each first acoustic feature vector and each linguistic feature vector. The closer the cosine similarity is to 1, the more similar the two vectors are; when the absolute difference between the cosine similarity and 1 is less than a threshold, the first acoustic feature corresponding to that vector is determined to be the linguistic feature corresponding to the matched linguistic feature vector. Traversing all first acoustic features and all known linguistic features in this way yields all the linguistic features contained in the first acoustic feature.
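The matching procedure above — compute cosine similarity between each first acoustic feature vector and each known linguistic feature vector, and accept a match when the similarity is within a threshold of 1 — can be sketched as follows. The vectors and the 0.05 tolerance are illustrative values, not taken from the patent.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_linguistic(acoustic_vecs, known_vecs, tol=0.05):
    # Keep the index of each acoustic vector whose best cosine similarity
    # to any known linguistic vector satisfies |similarity - 1| < tol
    matched = []
    for i, a in enumerate(acoustic_vecs):
        best = max(cosine_similarity(a, k) for k in known_vecs)
        if abs(best - 1.0) < tol:
            matched.append(i)
    return matched

known = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
acoustic = [np.array([0.99, 0.05, 0.0]),  # nearly parallel to the first
            np.array([0.0, 0.0, 1.0])]    # orthogonal to both
print(match_linguistic(acoustic, known))  # [0]
```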
Step 204, determining, based on the linguistic features and the target spoken text, the audio with the target accent corresponding to the target spoken text, and outputting the audio.
After determining the linguistic features, the executing body may determine, based on the linguistic features and the target spoken text, the audio with the target accent corresponding to the target spoken text, and output the audio. Specifically, the executing body may input the linguistic features and the target spoken text into a pre-trained speech model to generate the audio. The pre-trained speech model characterizes the correspondence between linguistic features and target spoken texts on the one hand and audio with the target accent on the other, and may be a pre-trained neural network model. To train the speech model, an initial neural network model is first obtained; a training sample set is then acquired, in which each training sample comprises linguistic features, the target spoken text corresponding to those features, and labelled audio with the target accent corresponding to both; the linguistic features and their corresponding target spoken texts are used as input to the initial neural network model, the corresponding labelled audio with the target accent is used as the expected output, and the model is trained; the trained model is then determined to be the speech model. For example, the target spoken text may be the Mandarin text "I love my hometown", and the corresponding audio with the target accent may be audio of "I love my hometown" in a northeastern accent; the audio may be in MP3 or MP4 form, and the present application does not specifically limit the form of the output audio.
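The training recipe just described — inputs of linguistic features plus the corresponding spoken text, expected output of labelled target-accent audio — can be illustrated on toy data. A single linear layer trained by stochastic gradient descent stands in for the initial neural network model, and random vectors stand in for the encoded text and audio features; none of this is the patent's actual model, only the supervision pattern it describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set with the structure the description gives: each sample
# pairs linguistic features and an encoded spoken text (the inputs) with
# labelled target-accent audio features (the expected output). A hidden
# linear map stands in for the real text-to-audio relationship.
W_true = rng.normal(size=(3, 8))
samples = []
for _ in range(32):
    linguistic = rng.normal(size=4)       # stand-in linguistic feature vector
    text_vec = rng.normal(size=4)         # stand-in encoded spoken text
    x = np.concatenate([linguistic, text_vec])
    samples.append((x, W_true @ x))       # (model input, expected output)

# Stand-in for training the "initial neural network model": stochastic
# gradient descent on the squared error against the labelled audio.
W = np.zeros((3, 8))
for epoch in range(500):
    for x, y in samples:
        err = W @ x - y
        W -= 0.01 * np.outer(err, x)      # gradient of 0.5 * ||err||^2

x0, y0 = samples[0]
residual = float(np.linalg.norm(W @ x0 - y0))
print(residual)  # near zero once training has converged
```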
With continued reference to fig. 3, a schematic illustration of one application scenario of the method for converting accents according to the present application is shown. In the application scenario of fig. 3, a server 303 obtains targeted spoken text 301 over a network 302. The server 303 determines 304 a first acoustic feature indicative of a target accent based on the target spoken text 301. The server 303 determines linguistic features 305 in the first acoustic features 304. The server 303 determines the audio 306 having the target accent corresponding to the target spoken text 301 based on the linguistic features 305 and the target spoken text 301, and outputs the audio 306. This application is actually the conversion of speech text data to audio with a particular accent. The text content is unchanged.
According to this embodiment, by performing speech synthesis, recognition, and conversion on the acquired spoken text data of a user, any text input by the user can be accurately and quickly converted into audio with the user's target accent.
With continued reference to FIG. 4, a flow 400 of another embodiment of a method for converting accents in accordance with the present application is shown. As shown in fig. 4, the method for converting accents of the present embodiment may include the following steps:
Step 401, acquiring a target spoken text.
Step 402, determining, based on the target spoken text, a first acoustic feature indicating a target accent.
The principle of step 401 to step 402 is similar to that of step 201 to step 202, and is not described herein again.
Specifically, step 402 may be implemented by step 4021:
In this embodiment, the pre-trained speech synthesis model characterizes the correspondence between spoken text and the first acoustic feature. After obtaining the target spoken text, the executing body may determine, from the target spoken text and the pre-trained speech synthesis model, the first acoustic feature indicating the target accent. In particular, the first acoustic feature may be a mel spectrum indicating the target accent, such as a northeastern accent. The first acoustic feature obtained from the speech synthesis model cannot yet be used to output the target accent: further feature recognition and feature conversion are needed to obtain features carrying more information about the target spoken text, after which further speech synthesis is performed on those features. Specifically, the executing body may input the target spoken text into the pre-trained speech synthesis model, which extracts from it the first acoustic feature indicating the target accent.
According to this embodiment, the mel spectrum of audio with the target accent can be obtained from the target spoken text and the pre-trained speech synthesis model, so that the spoken text of any user can be accurately converted into the target accent based on the obtained mel spectrum, yielding audio with the target accent.
In step 403, linguistic features in the first acoustic features are determined.
The principle of step 403 is similar to that of step 203, and is not described in detail here.
Specifically, step 403 may be implemented by step 4031:
The pre-trained recognition model characterizes the correspondence between the first acoustic feature and linguistic features. After obtaining the first acoustic feature, the executing body may extract linguistic features from it using the pre-trained recognition model. Specifically, the linguistic features in the first acoustic feature may include prosodic features such as intonation, temporal distribution, stress, pitch, and pauses. Linguistic features are one of the important vehicles of linguistic and emotional expression.
In some optional implementations of this embodiment, the executing body may further determine, from the first acoustic feature and a pre-trained recognition model, a class identifier corresponding to the first acoustic feature, where the recognition model characterizes the correspondence between the first acoustic feature and class identifiers. A class identifier characterizes the class of each phoneme in the first acoustic feature; for example, phonemes may belong to intonation, temporal distribution, stress, pitch, loudness, or pause classes, each represented by a distinct numeric identifier. The executing body may then determine, from the phonemes whose identifiers match preset identifiers and from the correspondence between phonemes and second acoustic features, a second acoustic feature for generating the audio with the target accent. The second acoustic feature may be the mel spectrum corresponding to each phoneme required to generate the target accent. The executing body may determine, based on the second acoustic feature, the audio with the target accent corresponding to the target spoken text, and output the audio. This implementation enriches the mel spectra required to generate audio with the target accent and improves the accuracy of that audio.
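The per-phoneme class identifiers described above can be illustrated with a simple lookup table. The category names and integer ids below are hypothetical, since the patent does not fix a concrete assignment.

```python
# Hypothetical numeric class identifiers for prosodic phoneme categories,
# following the description's scheme of one integer id per category.
PHONEME_CLASSES = {
    1: "intonation",
    2: "time-domain distribution",
    3: "stress",
    4: "pitch",
    5: "loudness",
    6: "pause",
}

def group_by_class(tagged_phonemes):
    """Group (phoneme, class_id) pairs by their prosodic category name."""
    groups = {}
    for phoneme, class_id in tagged_phonemes:
        groups.setdefault(PHONEME_CLASSES[class_id], []).append(phoneme)
    return groups

tagged = [("a1", 4), ("sil", 6), ("a1", 3)]
print(group_by_class(tagged))  # {'pitch': ['a1'], 'pause': ['sil'], 'stress': ['a1']}
```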
In this embodiment, extracting the linguistic features in the first acoustic feature with the pre-trained recognition model captures the features of linguistic and emotional expression in the target spoken text, completes the features used to generate the target-accent audio, and improves the accuracy of the generated audio.
Step 404, determining, based on the linguistic features and the target spoken text, the audio with the target accent corresponding to the target spoken text, and outputting the audio.
The principle of step 404 is similar to that of step 204, and is not described here again.
Specifically, step 404 can be implemented by steps 4041 to 4042:
In this embodiment, the conversion model characterizes the correspondence between the linguistic features and the target spoken text on the one hand and the second acoustic feature used to generate audio with the target accent on the other. The second acoustic feature may comprise the mel spectrum corresponding to the linguistic features above, i.e., the mel spectrum of each phoneme required to generate the target accent. The executing body may input the linguistic features and the target spoken text into the pre-trained conversion model and obtain as output the second acoustic feature corresponding to the target spoken text.
Having obtained the second acoustic feature, the executing body may determine the audio with the target accent based on it. Specifically, the executing body may input the second acoustic feature into a vocoder, which converts it into audio with the target accent. The vocoder encodes the second acoustic feature at its transmitting end to match the channel, transmits it over the channel to the receiving end, and analyzes the received feature in the frequency domain: it distinguishes unvoiced from voiced sounds, determines the fundamental frequency of voiced sounds, and selects the unvoiced/voiced decision, the fundamental frequency, and the spectral envelope as feature parameters for transmission. The analysis may also be performed in the time domain, periodically extracting some second acoustic features for linear prediction to generate the corresponding audio with the target accent. Vocoders include channel vocoders, formant vocoders, pattern vocoders, linear prediction vocoders, correlation vocoders, and orthogonal function vocoders; the present application does not specifically limit the type of vocoder.
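The frequency-of-analysis step described above — a voiced/unvoiced decision plus a fundamental-frequency estimate for each frame — can be sketched for one audio frame using normalized autocorrelation. The 60-400 Hz search range and the 0.3 voicing threshold are illustrative choices, not values from the patent.

```python
import numpy as np

def analyze_frame(frame, sr=16000, f0_range=(60.0, 400.0)):
    # Classical vocoder-style analysis of one frame: voiced/unvoiced
    # decision and fundamental-frequency estimate via autocorrelation.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                       # normalize so ac[0] == 1
    lo = int(sr / f0_range[1])            # shortest period to consider
    hi = int(sr / f0_range[0])            # longest period to consider
    lag = lo + int(np.argmax(ac[lo:hi]))
    voiced = bool(ac[lag] > 0.3)          # illustrative voicing threshold
    f0 = sr / lag if voiced else 0.0
    return voiced, f0

sr = 16000
t = np.arange(int(0.04 * sr)) / sr        # one 40 ms frame
voiced, f0 = analyze_frame(np.sin(2 * np.pi * 200 * t), sr)
print(voiced, f0)  # True 200.0
```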
According to this embodiment, the second acoustic feature corresponding to the target spoken text is determined from the linguistic features, the target spoken text, and the pre-trained conversion model, and the audio with the target accent is determined from the second acoustic feature, so that audio in the target accent can be obtained accurately for any spoken text of any user.
In some optional implementations of this embodiment, the method for converting accents further comprises the following model training steps, not shown in fig. 4: acquiring an initial neural network model; acquiring a training sample set, in which each training sample comprises linguistic features, the target spoken text corresponding to those features, and the labelled second acoustic feature corresponding to both; using the linguistic features and their corresponding target spoken texts as input to the initial neural network model and the corresponding labelled second acoustic features as the expected output, training the initial neural network model; and determining the trained model as the conversion model.
In this embodiment, the executing body may obtain the initial neural network model through a wired or wireless connection. The initial neural network model may be any of various artificial neural networks (ANNs) containing hidden layers. The executing body may also obtain a pre-stored initial model locally, or obtain it from a communicatively connected electronic device; this is not limited herein.
In this embodiment, the executing body may acquire the training sample set in various ways. The labelled second acoustic features in the training samples may be obtained locally or from a communicatively connected electronic device over a wired or wireless connection, may be labelled manually in real time, or may be labelled automatically and then manually supplemented and corrected; the present application does not specifically limit this. The linguistic features in the training samples may be extracted in real time or obtained locally or from a connected device. The target spoken texts corresponding to the linguistic features may be collected from different users in real time, or likewise obtained locally or from a connected device; this, too, is not specifically limited.
In this embodiment, the initial neural network model is trained on the acquired training sample set, yielding a conversion model that can accurately generate audio with the target accent from linguistic features and a spoken text. The characters of the acquired target spoken text can thereby be converted more accurately into audio with the target accent, improving the quality of the generated audio.
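The training scheme described above (input: linguistic features concatenated with a representation of the corresponding target speaking text; expected output: the labeled second acoustic feature) can be sketched with a toy model. Everything here is illustrative: the featurizations, dimensions, and the single linear layer standing in for the "initial neural network model" are assumptions, not the embodiment's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the training samples: linguistic features
# and a text embedding are concatenated as model input; the labeled
# "second acoustic feature" (e.g., one mel-spectrum frame) is the
# expected output. All dimensions are arbitrary.
n_samples, d_ling, d_text, d_mel = 200, 8, 4, 16
X_ling = rng.normal(size=(n_samples, d_ling))
X_text = rng.normal(size=(n_samples, d_text))
X = np.hstack([X_ling, X_text])

true_W = rng.normal(size=(d_ling + d_text, d_mel))
Y = X @ true_W  # labeled second acoustic features (synthetic)

# "Initial neural network model": here a single linear layer, trained
# with plain gradient descent on mean-squared error between predicted
# and expected second acoustic features.
W = np.zeros((d_ling + d_text, d_mel))
lr = 0.05
for _ in range(500):
    pred = X @ W                       # model output
    grad = X.T @ (pred - Y) / n_samples
    W -= lr * grad                     # gradient step toward expected output

mse = float(np.mean((X @ W - Y) ** 2))
print(f"final MSE: {mse:.6f}")
```

The trained weights `W` play the role of the "conversion model" determined at the end of the training steps; a real embodiment would use a deep network with hidden layers rather than this linear sketch.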
In some optional implementations of the embodiment, the audio with the target accent comprises at least one of: speech audio with a target accent, singing audio with a target accent.
In some optional implementations of the embodiment, the second acoustic feature includes a mel-frequency spectrum corresponding to the target accent; and determining audio having a target accent based on the second acoustic feature, comprising: and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
Specifically, after obtaining the mel spectrum corresponding to the target accent, the execution subject may automatically input it into a preset neural network vocoder to synthesize audio having the target accent based on the vocoder and the mel spectrum.
In this implementation, synthesizing audio from the mel spectrum corresponding to the target accent with a preset neural network vocoder makes the synthesized audio with the target accent more accurate.
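As an illustration of this final synthesis step, the following is a crude stand-in for the preset neural network vocoder: it maps a (fake) mel spectrum to a waveform by summing amplitude-modulated sinusoids, one per mel band. A real embodiment would use a trained neural vocoder (e.g., a WaveNet-style model); the band frequencies, hop size, and sample rate below are arbitrary assumptions.

```python
import numpy as np

def toy_vocoder(mel: np.ndarray, hop: int = 256, sr: int = 16000) -> np.ndarray:
    """Sinusoidal stand-in for a neural vocoder: each mel band drives
    one sinusoid whose amplitude follows that band's energy over time."""
    n_mels, n_frames = mel.shape
    n_samples = n_frames * hop
    t = np.arange(n_samples) / sr
    # hypothetical center frequencies on a log-like scale (not true mel)
    freqs = 80.0 * (2.0 ** (np.arange(n_mels) / 4.0))
    wave = np.zeros(n_samples)
    for band in range(n_mels):
        # hold each frame's band energy constant for `hop` samples
        amp = np.repeat(mel[band], hop)
        wave += amp * np.sin(2 * np.pi * freqs[band] * t)
    peak = np.max(np.abs(wave))
    return wave / peak if peak > 0 else wave

# fake mel spectrum: 8 bands x 20 frames of nonnegative energies
mel = np.abs(np.random.default_rng(1).normal(size=(8, 20)))
audio = toy_vocoder(mel)
print(audio.shape)
```

The point of the sketch is only the interface: mel spectrum in, time-domain waveform out, with the waveform length fixed by the frame count and hop size.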
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for converting accents, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 5, the apparatus 500 for converting accents of the present embodiment includes: an acquisition unit 501, a first acoustic feature determination unit 502, a linguistic feature determination unit 503, and a conversion unit 504.
An obtaining unit 501 configured to obtain a target utterance text.
A first acoustic feature determination unit 502 configured to determine a first acoustic feature indicating a target accent based on the target spoken text.
A linguistic feature determination unit 503 configured to determine linguistic features in the first acoustic features.
And the conversion unit 504 is configured to determine audio with a target accent corresponding to the target speaking text based on the linguistic features and the target speaking text, and output the audio.
In some optional implementations of the present embodiment, the first acoustic feature determination unit 502 is further configured to: and determining a first acoustic feature corresponding to the target speaking text and used for indicating the target accent according to the target speaking text and the pre-trained speech synthesis model, wherein the pre-trained speech synthesis model is used for representing the corresponding relation between the speaking text and the first acoustic feature.
In some optional implementations of this embodiment, the linguistic feature determination unit 503 is further configured to: and extracting linguistic features in the first acoustic features by using a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the first acoustic features and the linguistic features.
In some optional implementations of this embodiment, the conversion unit 504 is further configured to: determining a second acoustic feature corresponding to the target speaking text and used for generating the audio with the target accent according to the linguistic feature, the target speaking text and a pre-trained conversion model, wherein the conversion model is used for representing the corresponding relation between the linguistic feature and the target speaking text and the second acoustic feature used for generating the audio with the target accent; based on the second acoustic feature, audio having a target accent is determined.
In some optional implementations of this embodiment, the apparatus further comprises a training unit, not shown in fig. 5, configured to: acquiring an initial neural network model; acquiring a training sample set, wherein training samples in the training sample set comprise linguistic features, target speaking texts corresponding to the linguistic features and labeled second acoustic features corresponding to the linguistic features and the target speaking texts corresponding to the linguistic features; taking the linguistic features of the training samples in the training sample set and target speaking texts corresponding to the linguistic features as input of an initial neural network model, taking second acoustic features corresponding to the input linguistic features and the target speaking texts corresponding to the linguistic features as expected output, and training the initial neural network model; and determining the trained initial neural network model as a conversion model.
In some optional implementations of the embodiment, the audio with the target accent comprises at least one of: speech audio with a target accent, singing audio with a target accent.
In some optional implementations of the embodiment, the second acoustic feature includes a mel-frequency spectrum corresponding to the target accent; and the conversion unit 504 is further configured to: and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
It should be understood that the units 501 to 504 described in the apparatus 500 for converting accents correspond to the respective steps in the method described with reference to fig. 2, respectively. Thus, the operations and features described above for the method for converting accents apply equally to the apparatus 500 and the units comprised therein, and are not described in detail here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for converting accents according to an embodiment of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses 605 and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses 605 may be used, along with multiple memories, if desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer-readable storage medium as provided herein. The memory stores instructions executable by the at least one processor, so that the at least one processor performs the method for converting accents provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method for converting accents provided herein.
The memory 602, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and units, such as program instructions/units corresponding to the method for converting accents in the embodiment of the present application (for example, the acquisition unit 501, the first acoustic feature determination unit 502, the linguistic feature determination unit 503, and the conversion unit 504 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing by executing non-transitory software programs, instructions, and modules stored in the memory 602, that is, implements the method for converting accents in the above method embodiments.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic apparatus for converting accents, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, and these remote memories may be connected over a network to an electronic device for converting accents. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method for converting accents may further comprise: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus 605 or other means, and are exemplified by the bus 605 in fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic apparatus for converting accents, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input device. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the present application, by performing speech synthesis, recognition, and conversion on the acquired spoken text data of a user, any text input by the user can be accurately and quickly converted into audio with the target accent.
In accordance with one or more embodiments of the present disclosure, there is provided a method for converting accents, comprising: acquiring a target speaking text; determining a first acoustic feature indicative of a target accent based on the target speaking text; determining linguistic features in the first acoustic feature; and determining audio with the target accent corresponding to the target speaking text based on the linguistic features and the target speaking text, and outputting the audio.
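The four steps of the method compose into a simple pipeline. The sketch below wires hypothetical stubs together in the described order (speech synthesis model → recognition model → conversion model → vocoder); every function body is a placeholder invented for illustration, shown only to make the data flow between the steps concrete.

```python
import numpy as np

def synthesis_model(text: str) -> np.ndarray:
    """Step 2 stub: target speaking text -> first acoustic feature
    indicating the target accent (deterministic pseudo-features)."""
    codes = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return np.outer(codes % 13, np.ones(4)).astype(float)

def recognition_model(first_feat: np.ndarray) -> np.ndarray:
    """Step 3 stub: extract linguistic features from the first
    acoustic feature (here: per-row mean)."""
    return first_feat.mean(axis=1)

def conversion_model(ling: np.ndarray, text: str) -> np.ndarray:
    """Step 4a stub: linguistic features + text -> second acoustic
    feature (here: tiled into a fake mel spectrum)."""
    return np.tile(ling, (8, 1))

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Step 4b stub: second acoustic feature -> waveform samples."""
    return mel.flatten()

def convert_accent(text: str) -> np.ndarray:
    first = synthesis_model(text)       # first acoustic feature
    ling = recognition_model(first)     # linguistic features
    mel = conversion_model(ling, text)  # second acoustic feature
    return vocoder(mel)                 # audio with the target accent

audio = convert_accent("hello")
print(audio.shape)
```

In an actual embodiment each stub would be a trained model (the pre-trained speech synthesis, recognition, and conversion models, plus the neural network vocoder); only the order of composition is taken from the text.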
In accordance with one or more embodiments of the present disclosure, determining a first acoustic feature for indicating a target accent based on a target spoken text includes: and determining a first acoustic feature corresponding to the target speaking text and used for indicating the target accent according to the target speaking text and the pre-trained speech synthesis model, wherein the pre-trained speech synthesis model is used for representing the corresponding relation between the speaking text and the first acoustic feature.
According to one or more embodiments of the present disclosure, determining linguistic features in the first acoustic feature comprises: and extracting linguistic features in the first acoustic features by using a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the first acoustic features and the linguistic features.
According to one or more embodiments of the present disclosure, determining audio corresponding to a target spoken text and having a target accent based on linguistic features and the target spoken text includes: determining a second acoustic feature corresponding to the target speaking text and used for generating the audio with the target accent according to the linguistic feature, the target speaking text and a pre-trained conversion model, wherein the conversion model is used for representing the corresponding relation between the linguistic feature and the target speaking text and the second acoustic feature used for generating the audio with the target accent; based on the second acoustic feature, audio having a target accent is determined.
According to one or more embodiments of the present disclosure, the method for converting accents further comprises: acquiring an initial neural network model; acquiring a training sample set, wherein training samples in the training sample set comprise linguistic features, target speaking texts corresponding to the linguistic features and labeled second acoustic features corresponding to the linguistic features and the target speaking texts corresponding to the linguistic features; taking the linguistic features of the training samples in the training sample set and target speaking texts corresponding to the linguistic features as input of an initial neural network model, taking second acoustic features corresponding to the input linguistic features and the target speaking texts corresponding to the linguistic features as expected output, and training the initial neural network model; and determining the trained initial neural network model as a conversion model.
In accordance with one or more embodiments of the present disclosure, audio having a target accent includes at least one of: speech audio with a target accent, singing audio with a target accent.
In accordance with one or more embodiments of the present disclosure, the second acoustic feature includes a mel-frequency spectrum corresponding to the target accent; and determining audio having a target accent based on the second acoustic feature, comprising: and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
In accordance with one or more embodiments of the present disclosure, there is provided an apparatus for converting accents, comprising: an acquisition unit configured to acquire a target utterance text; a first acoustic feature determination unit configured to determine a first acoustic feature indicating a target accent based on the target spoken text; a linguistic feature determination unit configured to determine a linguistic feature in the first acoustic features; and the conversion unit is configured to determine the audio with the target accent corresponding to the target speaking text based on the linguistic characteristics and the target speaking text and output the audio.
According to one or more embodiments of the present disclosure, the first acoustic feature determination unit is further configured to: and determining a first acoustic feature corresponding to the target speaking text and used for indicating the target accent according to the target speaking text and the pre-trained speech synthesis model, wherein the pre-trained speech synthesis model is used for representing the corresponding relation between the speaking text and the first acoustic feature.
According to one or more embodiments of the present disclosure, the linguistic feature determination unit is further configured to: and extracting linguistic features in the first acoustic features by using a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the first acoustic features and the linguistic features.
According to one or more embodiments of the present disclosure, the conversion unit is further configured to: determining a second acoustic feature corresponding to the target speaking text and used for generating the audio with the target accent according to the linguistic feature, the target speaking text and a pre-trained conversion model, wherein the conversion model is used for representing the corresponding relation between the linguistic feature and the target speaking text and the second acoustic feature used for generating the audio with the target accent; based on the second acoustic feature, audio having a target accent is determined.
According to one or more embodiments of the present disclosure, the apparatus for converting accents further comprises a training unit, not shown in fig. 5, configured to: acquiring an initial neural network model; acquiring a training sample set, wherein training samples in the training sample set comprise linguistic features, target speaking texts corresponding to the linguistic features and labeled second acoustic features corresponding to the linguistic features and the target speaking texts corresponding to the linguistic features; taking the linguistic features of the training samples in the training sample set and target speaking texts corresponding to the linguistic features as input of an initial neural network model, taking second acoustic features corresponding to the input linguistic features and the target speaking texts corresponding to the linguistic features as expected output, and training the initial neural network model; and determining the trained initial neural network model as a conversion model.
In accordance with one or more embodiments of the present disclosure, audio having a target accent includes at least one of: speech audio with a target accent, singing audio with a target accent.
In accordance with one or more embodiments of the present disclosure, the second acoustic feature includes a mel-frequency spectrum corresponding to the target accent; and the conversion unit is further configured to: and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
It should be understood that the above embodiments are merely exemplary and are not limiting; other methods known in the art for converting accents may also be implemented. Steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A method for converting accents, comprising:
acquiring a target speaking text;
determining a first acoustic feature indicative of a target accent based on the target spoken text;
determining linguistic features in the first acoustic features;
and determining the audio with the target accent corresponding to the target speaking text based on the linguistic features and the target speaking text, and outputting the audio.
2. The method of claim 1, wherein determining, based on the targeted spoken text, a first acoustic feature indicative of a targeted accent comprises:
and determining a first acoustic feature which is corresponding to the target speaking text and is used for indicating a target accent according to the target speaking text and a pre-trained speech synthesis model, wherein the pre-trained speech synthesis model is used for representing the corresponding relation between the speaking text and the first acoustic feature.
3. The method of claim 1, wherein the determining linguistic features in the first acoustic feature comprises:
and extracting linguistic features in the first acoustic features by using a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the first acoustic features and the linguistic features.
4. The method of claim 1, wherein the determining audio corresponding to the target spoken text with a target accent based on the linguistic features and the target spoken text comprises:
determining a second acoustic feature corresponding to the target speaking text and used for generating the audio with the target accent according to the linguistic feature, the target speaking text and a pre-trained conversion model, wherein the conversion model is used for representing the corresponding relation between the linguistic feature and the target speaking text and the second acoustic feature used for generating the audio with the target accent;
based on the second acoustic feature, audio having a target accent is determined.
5. The method of claim 4, wherein the method further comprises:
acquiring an initial neural network model;
acquiring a training sample set, wherein training samples in the training sample set comprise linguistic features, target speaking texts corresponding to the linguistic features and labeled second acoustic features corresponding to the linguistic features and the target speaking texts corresponding to the linguistic features;
taking the linguistic features of the training samples in the training sample set and target speaking texts corresponding to the linguistic features as the input of the initial neural network model, taking second acoustic features corresponding to the input linguistic features and the target speaking texts corresponding to the linguistic features as expected output, and training the initial neural network model;
and determining the trained initial neural network model as the conversion model.
6. The method of any of claims 1-5, wherein the audio with the target accent comprises at least one of: speech audio with a target accent, singing audio with a target accent.
7. The method of claim 4, wherein the second acoustic feature comprises a Mel spectrum corresponding to the target accent; and
the determining audio having a target accent based on the second acoustic feature comprises:
and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
8. An apparatus for converting accents, comprising:
an acquisition unit configured to acquire a target utterance text;
a first acoustic feature determination unit configured to determine a first acoustic feature indicating a target accent based on the target spoken text;
a linguistic feature determination unit configured to determine linguistic features in the first acoustic features;
and the conversion unit is configured to determine the audio with the target accent corresponding to the target speaking text based on the linguistic features and the target speaking text, and output the audio.
9. An electronic device for converting accents, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011270379.XA CN112382267A (en) | 2020-11-13 | 2020-11-13 | Method, apparatus, device and storage medium for converting accents |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112382267A true CN112382267A (en) | 2021-02-19 |
Family
ID=74582321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011270379.XA Pending CN112382267A (en) | 2020-11-13 | 2020-11-13 | Method, apparatus, device and storage medium for converting accents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112382267A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113223542A (en) * | 2021-04-26 | 2021-08-06 | 北京搜狗科技发展有限公司 | Audio conversion method and device, storage medium and electronic equipment |
CN113539239A (en) * | 2021-07-12 | 2021-10-22 | 网易(杭州)网络有限公司 | Voice conversion method, device, storage medium and electronic equipment |
CN113808572A (en) * | 2021-08-18 | 2021-12-17 | 北京百度网讯科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN113838452A (en) * | 2021-08-17 | 2021-12-24 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, device and computer storage medium |
WO2022236111A1 (en) * | 2021-05-06 | 2022-11-10 | Sanas.ai Inc. | Real-time accent conversion model |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160005391A1 (en) * | 2014-07-03 | 2016-01-07 | Google Inc. | Devices and Methods for Use of Phase Information in Speech Processing Systems |
EP3151239A1 (en) * | 2015-09-29 | 2017-04-05 | Yandex Europe AG | Method and system for text-to-speech synthesis |
CN109285535A (en) * | 2018-10-11 | 2019-01-29 | 四川长虹电器股份有限公司 | Phoneme synthesizing method based on Front-end Design |
US20190088253A1 (en) * | 2017-09-20 | 2019-03-21 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for converting english speech information into text |
CN110197655A (en) * | 2019-06-28 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | Method and apparatus for synthesizing voice |
US20200082805A1 (en) * | 2017-05-16 | 2020-03-12 | Beijing Didi Infinity Technology And Development Co., Ltd. | System and method for speech synthesis |
CN111243571A (en) * | 2020-01-14 | 2020-06-05 | 北京字节跳动网络技术有限公司 | Text processing method, device and equipment and computer readable storage medium |
CN111292720A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment |
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111462727A (en) * | 2020-03-31 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer readable medium for generating speech |
CN111477210A (en) * | 2020-04-02 | 2020-07-31 | 北京字节跳动网络技术有限公司 | Speech synthesis method and device |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111785261A (en) * | 2020-05-18 | 2020-10-16 | 南京邮电大学 | Cross-language voice conversion method and system based on disentanglement and explanatory representation |
CN111899719A (en) * | 2020-07-30 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160005391A1 (en) * | 2014-07-03 | 2016-01-07 | Google Inc. | Devices and Methods for Use of Phase Information in Speech Processing Systems |
EP3151239A1 (en) * | 2015-09-29 | 2017-04-05 | Yandex Europe AG | Method and system for text-to-speech synthesis |
US20200082805A1 (en) * | 2017-05-16 | 2020-03-12 | Beijing Didi Infinity Technology And Development Co., Ltd. | System and method for speech synthesis |
US20190088253A1 (en) * | 2017-09-20 | 2019-03-21 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for converting english speech information into text |
CN109285535A (en) * | 2018-10-11 | 2019-01-29 | 四川长虹电器股份有限公司 | Phoneme synthesizing method based on Front-end Design |
CN110197655A (en) * | 2019-06-28 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | Method and apparatus for synthesizing voice |
CN111243571A (en) * | 2020-01-14 | 2020-06-05 | 北京字节跳动网络技术有限公司 | Text processing method, device and equipment and computer readable storage medium |
CN111292720A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment |
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111462727A (en) * | 2020-03-31 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer readable medium for generating speech |
CN111477210A (en) * | 2020-04-02 | 2020-07-31 | 北京字节跳动网络技术有限公司 | Speech synthesis method and device |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111785261A (en) * | 2020-05-18 | 2020-10-16 | 南京邮电大学 | Cross-language voice conversion method and system based on disentanglement and explanatory representation |
CN111899719A (en) * | 2020-07-30 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113223542A (en) * | 2021-04-26 | 2021-08-06 | 北京搜狗科技发展有限公司 | Audio conversion method and device, storage medium and electronic equipment |
CN113223542B (en) * | 2021-04-26 | 2024-04-12 | 北京搜狗科技发展有限公司 | Audio conversion method and device, storage medium and electronic equipment |
WO2022236111A1 (en) * | 2021-05-06 | 2022-11-10 | Sanas.ai Inc. | Real-time accent conversion model |
US11948550B2 (en) | 2021-05-06 | 2024-04-02 | Sanas.ai Inc. | Real-time accent conversion model |
CN113539239A (en) * | 2021-07-12 | 2021-10-22 | 网易(杭州)网络有限公司 | Voice conversion method, device, storage medium and electronic equipment |
CN113838452A (en) * | 2021-08-17 | 2021-12-24 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, device and computer storage medium |
CN113838452B (en) * | 2021-08-17 | 2022-08-23 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, device and computer storage medium |
CN113808572A (en) * | 2021-08-18 | 2021-12-17 | 北京百度网讯科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102581346B1 (en) | Multilingual speech synthesis and cross-language speech replication | |
CN112382267A (en) | Method, apparatus, device and storage medium for converting accents | |
US11881210B2 (en) | Speech synthesis prosody using a BERT model | |
US11605371B2 (en) | Method and system for parametric speech synthesis | |
CN112382270A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN111899719A (en) | Method, apparatus, device and medium for generating audio | |
CN114203147A (en) | System and method for text-to-speech cross-speaker style delivery and for training data generation | |
KR102594081B1 (en) | Predicting parametric vocoder parameters from prosodic features | |
Astrinaki et al. | Reactive and continuous control of HMM-based speech synthesis | |
KR20230056741A (en) | Synthetic Data Augmentation Using Voice Transformation and Speech Recognition Models | |
US11475874B2 (en) | Generating diverse and natural text-to-speech samples | |
CN112365879A (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | |
CN110600013A (en) | Training method and device for non-parallel corpus voice conversion data enhancement model | |
JP2022133392A (en) | Speech synthesis method and device, electronic apparatus, and storage medium | |
CN113963679A (en) | Voice style migration method and device, electronic equipment and storage medium | |
EP4205105A1 (en) | System and method for cross-speaker style transfer in text-to-speech and training data generation | |
CN112382269A (en) | Audio synthesis method, device, equipment and storage medium | |
CN112382274A (en) | Audio synthesis method, device, equipment and storage medium | |
Jayakumari et al. | An improved text to speech technique for tamil language using hidden Markov model | |
Ajayi et al. | Systematic review on speech recognition tools and techniques needed for speech application development | |
Louw | Neural speech synthesis for resource-scarce languages | |
US11335321B2 (en) | Building a text-to-speech system from a small amount of speech data | |
Astrinaki et al. | sHTS: A streaming architecture for statistical parametric speech synthesis | |
Docasal et al. | Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics | |
Mishra et al. | Emotion Detection From Speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |