CN111326138A - Voice generation method and device - Google Patents

Voice generation method and device

Info

Publication number
CN111326138A
CN111326138A
Authority
CN
China
Prior art keywords
acoustic
text information
target speaker
speech
acoustic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010113619.9A
Other languages
Chinese (zh)
Inventor
车浩
周芯永
王晓瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Reach Best Technology Co Ltd
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Reach Best Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Reach Best Technology Co Ltd filed Critical Reach Best Technology Co Ltd
Priority to CN202010113619.9A
Publication of CN111326138A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a speech generation method and apparatus. The speech generation method comprises the following steps: acquiring first text information; inputting the first text information into a preset prosody analysis model to obtain a first hidden acoustic feature, the first hidden acoustic feature being a feature of the speech corresponding to the first text information; inputting the first hidden acoustic feature into a preset self-adaptive acoustic model corresponding to a target speaker to obtain a first acoustic feature corresponding to the target speaker; and generating speech corresponding to the target speaker and the first text information according to the first acoustic feature. The speech corresponding to the target speaker and the target language can therefore be generated quickly and effectively without acquiring bilingual information of the target speaker or acquiring specific speech of the target speaker.

Description

Voice generation method and device
Technical Field
The present disclosure relates to the field of speech signal processing technologies, and in particular, to a speech generation method and apparatus.
Background
In recent years, the artificial intelligence industry has developed rapidly, and various intelligent products have gradually entered consumers' daily lives. As the most natural carrier of human information transfer, speech plays an important role in human-computer interaction. With the popularization of deep learning, speech signal processing technology has improved greatly; for example, the accuracy of speech recognition has increased substantially, and synthesized speech now approaches human voice. Speech cloning, an important research field within speech signal processing and a special kind of speech synthesis, aims to learn the timbre of a target speaker from a small amount of that speaker's speech data and to synthesize speech carrying the target speaker's timbre. Cross-lingual speech cloning refers to learning the timbre of the target speaker from speech data in a non-target language and synthesizing speech with the target speaker's timbre, where the synthesized content can be arbitrary text in the target language. Conventional speech cloning methods generally need to acquire bilingual information of the target speaker and learn from that bilingual information to generate speech of the target speaker in the target language, or they generate such speech by modeling phonemes shared between different languages.
However, the time and production costs of obtaining bilingual information for a target speaker are high, and generating speech through phoneme-sharing modeling across languages also requires bilingual information from multiple speakers and yields unsatisfactory results, so how to quickly and effectively generate speech corresponding to a target speaker and a target language under limited data resources is a problem to be solved.
Disclosure of Invention
The present disclosure provides a speech generation method and apparatus to at least solve the problem in the related art of how to rapidly and effectively generate speech corresponding to a target speaker and a target language. The technical solution of the present disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a speech generation method, including:
acquiring first text information;
inputting the first text information into a preset prosody analysis model to obtain a first hidden acoustic feature; the first hidden acoustic feature is a feature of a voice corresponding to the first text information;
inputting the first hidden acoustic feature into a preset self-adaptive acoustic model corresponding to a target speaker to obtain a first acoustic feature corresponding to the target speaker;
and generating voice corresponding to the target speaker and the first text information according to the first acoustic characteristic.
In one embodiment, before the inputting of the first text information into a preset prosody analysis model to obtain a first hidden acoustic feature, the method further includes:
acquiring second text information and a second voice sample corresponding to the second text information;
extracting initial hidden acoustic features corresponding to the second voice sample;
and inputting the second text information and the initial hidden acoustic features into a preset initial prosody analysis model, and training the initial prosody analysis model to obtain the prosody analysis model.
In one embodiment, before the inputting of the first hidden acoustic feature into a preset adaptive acoustic model corresponding to a target speaker to obtain a first acoustic feature corresponding to the target speaker, the method further includes:
acquiring a first voice sample corresponding to a target speaker;
and correcting a preset basic acoustic model by applying the first voice sample to obtain the self-adaptive acoustic model corresponding to the target speaker.
In one embodiment, the language of the first speech sample comprises a first target language or a second target language, and the language of the speech corresponding to the target speaker and the first text information comprises the first target language and/or the second target language; the first target language and the second target language are different languages.
In one embodiment, the applying the first speech sample to correct a preset basic acoustic model to obtain the adaptive acoustic model corresponding to the target speaker includes:
obtaining a plurality of speaker voice samples;
extracting a second implicit acoustic feature corresponding to each speaker voice sample, and extracting a second acoustic feature corresponding to each speaker voice sample;
and inputting the second implicit acoustic feature and the second acoustic feature into a preset initial basic acoustic model, and training the initial basic acoustic model to obtain the basic acoustic model.
In one embodiment, the applying the first speech sample to correct the base acoustic model to obtain the adaptive acoustic model corresponding to the target speaker includes:
extracting a third implicit acoustic feature corresponding to the first voice sample, and extracting a third acoustic feature corresponding to the first voice sample;
and inputting the third implicit acoustic features and the third acoustic features into a basic acoustic model, and training the basic acoustic model to obtain the self-adaptive acoustic model corresponding to the target speaker.
In one embodiment, the generating speech corresponding to the target speaker and the first text information according to the first acoustic feature comprises:
and inputting the first acoustic characteristic into a preset vocoder to generate voice corresponding to the target speaker and the first text information.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech generating apparatus including:
a text information acquisition unit configured to perform acquisition of first text information;
the hidden acoustic feature acquisition unit is configured to input the first text information into a preset prosody analysis model to obtain a first hidden acoustic feature; the first hidden acoustic feature is a feature of a voice corresponding to the first text information;
an acoustic feature acquisition unit configured to perform inputting the first hidden acoustic feature into a preset adaptive acoustic model corresponding to a target speaker to obtain a first acoustic feature corresponding to the target speaker;
a speech generating unit configured to perform generating speech corresponding to the target speaker and the first text information according to the first acoustic feature.
In one embodiment, the implicit acoustic feature obtaining unit is further configured to perform:
acquiring second text information and a second voice sample corresponding to the second text information; extracting initial hidden acoustic features corresponding to the second voice sample;
and inputting the second text information and the initial hidden acoustic features into a preset initial prosody analysis model, and training the initial prosody analysis model to obtain the prosody analysis model.
In one embodiment, the acoustic feature obtaining unit is further configured to perform:
acquiring a first voice sample corresponding to a target speaker;
and correcting a preset basic acoustic model by applying the first voice sample to obtain the self-adaptive acoustic model corresponding to the target speaker.
In one embodiment, the acoustic feature obtaining unit is further configured to perform:
the language of the first speech sample comprises a first target language or a second target language, and the language of the speech corresponding to the target speaker and the first text information comprises the first target language and/or the second target language; the first target language and the second target language are different languages.
In one embodiment, the acoustic feature obtaining unit is further configured to perform:
obtaining a plurality of speaker voice samples;
extracting a second implicit acoustic feature corresponding to each speaker voice sample, and extracting a second acoustic feature corresponding to each speaker voice sample;
and inputting the second implicit acoustic feature and the second acoustic feature into a preset initial basic acoustic model, and training the initial basic acoustic model to obtain the basic acoustic model.
In one embodiment, the acoustic feature obtaining unit is further configured to perform:
extracting a third implicit acoustic feature corresponding to the first voice sample, and extracting a third acoustic feature corresponding to the first voice sample;
and inputting the third implicit acoustic features and the third acoustic features into a basic acoustic model, and training the basic acoustic model to obtain the self-adaptive acoustic model corresponding to the target speaker.
In one embodiment, the speech generating unit is further configured to perform:
and inputting the first acoustic characteristic into a preset vocoder to generate voice corresponding to the target speaker and the first text information.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech generation method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium including: the instructions in the storage medium, when executed by a processor of the electronic device, enable the electronic device to perform the speech generation method of the first aspect described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the first text information is input into a prosody analysis model to obtain a first acoustic characteristic, wherein the first acoustic characteristic is the characteristic of the voice corresponding to the first text information, the first acoustic characteristic is input into a self-adaptive acoustic model corresponding to a target speaker to obtain the first acoustic characteristic corresponding to the target speaker, and the voice corresponding to the target speaker and the first text information is generated according to the first acoustic characteristic corresponding to the target speaker, so that bilingual information of the target speaker does not need to be obtained, specific voice of the target speaker does not need to be obtained, and the voice corresponding to the target speaker and the target language can be quickly and effectively generated.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a method of speech generation according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating an implementable method prior to step S200 according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating an implementable method prior to step S300 according to an example embodiment.
Fig. 4 is a flowchart illustrating an implementable method prior to step S310 according to an example embodiment.
FIG. 5 is a graph illustrating a comparison of speech simulation performance according to an exemplary embodiment.
FIG. 6 is a block diagram illustrating a speech generating apparatus according to an example embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 8 is a block diagram illustrating a speech generating device according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
FIG. 1 is a flow diagram illustrating a speech generation method according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps:
in step S100, first text information is acquired.
In step S200, inputting the first text information into a preset prosody analysis model to obtain a first hidden acoustic feature; the first hidden acoustic feature is a feature of the voice corresponding to the first text information.
In step S300, the first implicit acoustic feature is input into a preset adaptive acoustic model corresponding to the target speaker, so as to obtain a first acoustic feature corresponding to the target speaker.
In step S400, a speech corresponding to the target speaker and the first text information is generated according to the first acoustic feature.
The first text information includes the target content and the target language corresponding to the speech to be generated. The target content is the specific content of the speech to be generated for the first text information, and the target language is the language of that speech; one piece of first text information may correspond to one target language or to several target languages. For example, if the first text information contains the English phrase "Adaptive acoustic model" together with its Chinese counterpart, the corresponding target content is that text and the target languages are Chinese and English. The hidden acoustic features are features corresponding to audio or phonetic pronunciation. The prosody analysis model is a model for converting text information into hidden acoustic features; specifically, a piece of text content is input into the prosody analysis model to obtain the features of the speech corresponding to that text content. The adaptive acoustic model is a model that converts hidden acoustic features into corresponding acoustic features. An acoustic feature is a set of data characterizing the acoustic properties of speech, and a vocoder can convert acoustic features into the corresponding speech.
Specifically, after first text information including the target content and target language of the speech to be generated is acquired, the first text information is input into the prosody analysis model, and first hidden acoustic features (for example, bottleneck features) corresponding to the first text information are obtained through the prosody analysis model. The first hidden acoustic features are then input into the adaptive acoustic model corresponding to the target speaker to obtain the first acoustic features of the target speaker corresponding to the first text information, and these first acoustic features are input into a vocoder to generate the speech corresponding to the target speaker, the target content in the first text information, and the target language.
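The three stages just described can be read as a single pipeline: text units enter the prosody analysis model, the resulting hidden acoustic features enter the speaker-adapted acoustic model, and the resulting acoustic features enter the vocoder. The following sketch only illustrates that flow under the assumption that each stage is an ordinary neural module; the names generate_speech, prosody_model, adaptive_acoustic_model and vocoder are placeholders introduced here, not components defined in this disclosure.

    # Minimal sketch of the inference flow of steps S200-S400 (illustrative only).
    import torch
    import torch.nn as nn

    def generate_speech(text_ids: torch.Tensor,
                        prosody_model: nn.Module,
                        adaptive_acoustic_model: nn.Module,
                        vocoder: nn.Module) -> torch.Tensor:
        """text_ids: (1, T_text) tensor of linguistic-unit indices for the first text information."""
        with torch.no_grad():
            hidden = prosody_model(text_ids)            # S200: first hidden acoustic features, (1, T_frames, D_hidden)
            acoustic = adaptive_acoustic_model(hidden)  # S300: first acoustic features of the target speaker
            waveform = vocoder(acoustic)                # S400: speech of the target speaker, (1, T_samples)
        return waveform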
According to the above speech generation method, the first text information is acquired and input into the prosody analysis model to obtain the first hidden acoustic feature, which is a feature of the speech corresponding to the first text information; the first hidden acoustic feature is input into the adaptive acoustic model corresponding to the target speaker to obtain the first acoustic feature corresponding to the target speaker; and the speech corresponding to the target speaker and the first text information is generated according to that first acoustic feature. Consequently, neither bilingual information of the target speaker nor specific speech of the target speaker needs to be acquired, and the speech corresponding to the target speaker and the target language can be generated quickly and effectively under limited data resources.
Fig. 2 is a flowchart of an implementable method performed before step S200, that is, before the first text information is input into the preset prosody analysis model to obtain the first hidden acoustic feature. As shown in Fig. 2, the method includes the following steps:
in step S210, second text information is obtained, and a second speech sample corresponding to the second text information is obtained.
In step S220, an initial implicit acoustic feature corresponding to the second speech sample is extracted.
In step S230, the second text information and the initial hidden acoustic features are input into a preset initial prosody analysis model, and the initial prosody analysis model is trained to obtain the prosody analysis model.
The second text information is information containing the corresponding target content and target language. For example, when the second text information is the Chinese phrase meaning "adaptive acoustic model", the corresponding target content is that phrase and the target language is Chinese; when the second text information is the English phrase "Adaptive acoustic model", the target content is that phrase and the target language is English; and when the second text information contains both the Chinese and the English phrase, the target languages are Chinese and English. The second speech sample is the speech of a speaker recorded according to the second text information.
Specifically, the second text information is obtained, the speech uttered by a speaker according to the second text information is recorded, and the recorded speech is determined as the second speech sample. The second speech sample is input into a speech recognition model, and the initial hidden acoustic features of the second speech sample are extracted. The second text information is then taken as input and the initial hidden acoustic features as output, the preset initial prosody analysis model is trained, and the trained model is determined to be the prosody analysis model, which is used for converting text information into hidden acoustic features.
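As a rough illustration of this training step, assuming the initial hidden acoustic features have already been extracted frame by frame and aligned with the text input (any attention or duration mechanism inside the model is omitted), a regression-style training loop might look as follows; the function name, the L1 loss and the hyperparameters are assumptions of this sketch, not values stated in this embodiment.

    # Illustrative training loop for step S230: second text information as input,
    # ASR-extracted initial hidden acoustic features as regression targets.
    import torch
    import torch.nn as nn

    def train_prosody_model(model: nn.Module, batches, max_steps: int = 100_000,
                            lr: float = 1e-3) -> nn.Module:
        """batches: iterable yielding (text_ids, hidden_feats) tensor pairs."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.L1Loss()
        for step, (text_ids, hidden_feats) in enumerate(batches):
            pred = model(text_ids)                # predicted hidden acoustic features
            loss = criterion(pred, hidden_feats)  # match the ASR-derived targets
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step + 1 >= max_steps:
                break
        return model                              # the trained prosody analysis model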
In the above embodiment, the second text information and the second speech sample corresponding to it are obtained, the initial hidden acoustic features corresponding to the second speech sample are extracted, the second text information and the initial hidden acoustic features are input into a preset initial prosody analysis model, and the initial prosody analysis model is trained to obtain the prosody analysis model. This provides a basis for obtaining the first hidden acoustic feature from the first text information and thus for quickly and effectively generating the speech corresponding to the target speaker and the target language.
FIG. 3 is a flowchart, according to an exemplary embodiment, of an implementable method performed before step S300, that is, before the first hidden acoustic feature is input into the preset adaptive acoustic model corresponding to the target speaker to obtain the first acoustic feature corresponding to the target speaker. As shown in FIG. 3, the method includes the following steps:
in step S310, a first speech sample corresponding to a target speaker is obtained;
in step S320, the first speech sample is applied to correct the preset basic acoustic model, so as to obtain an adaptive acoustic model corresponding to the target speaker.
The basic acoustic model is a model for converting the hidden acoustic features into the acoustic features.
Specifically, after a basic acoustic model and a first voice sample corresponding to a target speaker are obtained, the implicit acoustic features of the first voice sample are used as input, the acoustic features of the first voice sample are used as output, the basic acoustic model is corrected, and an adaptive acoustic model corresponding to the target speaker is obtained.
Optionally, extracting a third implicit acoustic feature corresponding to the first voice sample, and extracting a third acoustic feature corresponding to the first voice sample; and inputting the third implicit acoustic feature and the third acoustic feature into a basic acoustic model, and training the basic acoustic model to obtain a self-adaptive acoustic model corresponding to the target speaker.
Specifically, a third implicit acoustic feature is used as an input, the third acoustic feature is used as an output, the basic acoustic model is trained, and a self-adaptive acoustic model corresponding to the target speaker is obtained.
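A minimal sketch of this correction step is given below, assuming the basic acoustic model is an ordinary neural regressor from hidden acoustic features to acoustic features; the 4,000-step count echoes the experiment described later in this disclosure, while the learning rate and the L1 loss are assumptions of the sketch.

    # Illustrative adaptation (step S320): fine-tune the pre-trained basic acoustic
    # model on the target speaker's (hidden feature, acoustic feature) pairs.
    import torch
    import torch.nn as nn

    def adapt_to_target_speaker(base_model: nn.Module, target_batches,
                                max_steps: int = 4_000, lr: float = 1e-4) -> nn.Module:
        """target_batches: iterable of (hidden_feats, acoustic_feats) pairs from the target speaker."""
        optimizer = torch.optim.Adam(base_model.parameters(), lr=lr)
        criterion = nn.L1Loss()
        for step, (hidden_feats, acoustic_feats) in enumerate(target_batches):
            loss = criterion(base_model(hidden_feats), acoustic_feats)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step + 1 >= max_steps:
                break
        return base_model  # now serves as the adaptive acoustic model for the target speaker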
Optionally, the language of the first speech sample includes a first target language or a second target language, and the language of the speech corresponding to the target speaker and the first text information includes the first target language and/or the second target language; the first target language and the second target language are different languages.
Specifically, the prosody analysis model converts text information into corresponding hidden acoustic features, and the adaptive acoustic model converts hidden acoustic features into corresponding acoustic features. Both models describe characteristics of speech and do not restrict the specific language, so neither the input nor the output of the prosody analysis model or of the adaptive acoustic model is limited to a particular language. After the preset basic acoustic model is corrected by applying the first speech sample corresponding to the target speaker, the resulting adaptive acoustic model captures the speech characteristics of the target speaker and can convert the input hidden acoustic features into acoustic features, from which the speech corresponding to the target speaker and the first text information is obtained. For example, if the language used for adaptation is Chinese, the language of the generated speech may be Chinese, English, or another language; conversely, the language used for adaptation may be a language other than Chinese while the generated speech is Chinese, thereby implementing a cross-lingual speech generation scheme.
In the above embodiment, the first speech sample corresponding to the target speaker is obtained and used to correct the preset basic acoustic model, yielding the adaptive acoustic model corresponding to the target speaker. The adaptive acoustic model can convert the first hidden acoustic feature into the first acoustic feature corresponding to the target speaker without a target speech sample of the target speaker (that is, speech of the target speaker recorded according to the corresponding text content), which provides a basis for quickly and effectively generating the speech corresponding to the target speaker and the target language. Moreover, because the preset basic acoustic model is corrected with the first speech sample, in this speaker-adaptive approach the resulting adaptive acoustic model can obtain the corresponding first acoustic feature from the first hidden acoustic feature, realizes cross-lingual voice cloning, and outperforms existing speaker adaptation methods in naturalness and similarity.
FIG. 4 is a flowchart, according to an exemplary embodiment, of an implementable method performed before step S310, in which the preset basic acoustic model that is later corrected with the first speech sample to obtain the adaptive acoustic model corresponding to the target speaker is trained. As shown in FIG. 4, the method includes the following steps:
in step S311, several speaker voice samples are obtained.
In step S312, a second implicit acoustic feature corresponding to each speaker voice sample is extracted, and a second acoustic feature corresponding to each speaker voice sample is extracted.
In step S313, the second hidden acoustic feature and the second acoustic feature are input into a preset initial basic acoustic model, and the initial basic acoustic model is trained to obtain a basic acoustic model.
The plurality of speaker speech samples are the speech information of several speakers. They are used to train the initial basic acoustic model, which converts hidden acoustic features into acoustic features, so as to obtain a basic acoustic model with stronger generalization that can accurately convert the corresponding hidden acoustic features into acoustic features.
Specifically, after the plurality of speaker speech samples are obtained, the second hidden acoustic feature and the second acoustic feature corresponding to each speaker speech sample are extracted; the second hidden acoustic features are used as input and the second acoustic features as output, the initial basic acoustic model is trained, and the basic acoustic model is obtained.
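Pre-training differs from the adaptation step mainly in its data: pairs of hidden and acoustic features pooled from every speaker sample in the corpus rather than from a single target speaker. A compressed sketch, with the pooling made explicit and all hyperparameters assumed, might be:

    # Illustrative pre-training of the basic acoustic model (steps S311-S313):
    # pool (hidden feature, acoustic feature) pairs from all speaker samples and
    # regress hidden -> acoustic.
    import itertools
    import torch
    import torch.nn as nn

    def pretrain_basic_acoustic_model(model: nn.Module, per_speaker_batches,
                                      max_steps: int = 200_000, lr: float = 1e-3) -> nn.Module:
        """per_speaker_batches: one iterable of (hidden_feats, acoustic_feats) per speaker."""
        pooled = itertools.chain.from_iterable(per_speaker_batches)  # pool all speakers' pairs
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.L1Loss()
        for step, (hidden_feats, acoustic_feats) in enumerate(pooled):
            loss = criterion(model(hidden_feats), acoustic_feats)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step + 1 >= max_steps:
                break
        return model  # the basic acoustic model, later corrected for the target speaker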
In this embodiment, the plurality of speaker speech samples are obtained, the second hidden acoustic feature and the second acoustic feature corresponding to each speaker speech sample are extracted, the second hidden acoustic features and the second acoustic features are input into the preset initial basic acoustic model, and the initial basic acoustic model is trained to obtain the basic acoustic model.
According to an exemplary embodiment, an implementable manner of step S400, in which speech corresponding to the target speaker and the first text information is generated according to the first acoustic feature, includes:
and inputting the first acoustic characteristic into a preset vocoder to generate voice corresponding to the target speaker and the first text information.
Wherein the vocoder is capable of converting acoustic features into speech information.
Specifically, the first acoustic feature is input to a preset vocoder, and the vocoder converts the first acoustic feature into voice corresponding to the target speaker and the first text information.
In the above embodiment, the preset vocoder converts the first acoustic feature into the speech corresponding to the target speaker and the first text information, so that the first acoustic feature is converted automatically, manual participation is not required, and the speech corresponding to the target speaker and the target language can be generated quickly and effectively.
In one specific embodiment, the target language and the non-target language are Mandarin Chinese and English respectively, and the goal is to synthesize Chinese speech of arbitrary content in the timbre of the target speaker, given only English speech sample data of that speaker. Specifically, for any text information, the prosody analysis model obtains the corresponding hidden acoustic features, the hidden acoustic features are then input into the acoustic model to obtain the corresponding acoustic features, and finally the acoustic features are converted into the expected speech by the vocoder.
Regarding the data sets used for training: the data used to train the prosody analysis model may be the open-source DB-1 database from Databaker (Biaobei), which contains 12 hours of Mandarin Chinese; the data used to train the acoustic model is the open-source THCHS-30 database from Tsinghua University, which includes 60 speakers with 250 sentences of speech per speaker. All audio is single-channel at a 16,000 Hz sampling rate, the window length and frame shift are 25 ms and 10 ms respectively, and the acoustic features consist of 30-dimensional BFCCs and 2-dimensional fundamental frequency parameters.
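A sketch of frame-level acoustic feature extraction with the parameters just quoted (16,000 Hz audio, a 25 ms window, a 10 ms frame shift, roughly 30 cepstral dimensions plus 2 pitch dimensions) is shown below; librosa's MFCC and pyin routines stand in for the BFCC and fundamental-frequency extractors, whose exact implementations are not specified in this disclosure.

    # Approximate front end: 25 ms window, 10 ms shift, 30 cepstral coefficients
    # plus a 2-dimensional pitch parameter (f0 and a voiced flag) per frame.
    # MFCCs are used here as a stand-in for the BFCCs mentioned above.
    import numpy as np
    import librosa

    def extract_acoustic_features(path: str) -> np.ndarray:
        sr = 16000
        win, hop = int(0.025 * sr), int(0.010 * sr)        # 400-sample window, 160-sample shift
        y, _ = librosa.load(path, sr=sr)
        cepstra = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=30,
                                       n_fft=win, hop_length=hop)             # (30, T)
        f0, voiced, _ = librosa.pyin(y, fmin=80, fmax=400, sr=sr,
                                     frame_length=1024, hop_length=hop)       # longer window for pitch
        f0 = np.nan_to_num(f0)                                                # unvoiced frames -> 0
        T = min(cepstra.shape[1], f0.shape[0])
        return np.concatenate([cepstra[:, :T].T,                              # (T, 30)
                               f0[:T, None],
                               voiced[:T, None].astype(np.float32)], axis=1)  # (T, 32)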
Specifically, the speech recognition model is based on a TDNN-LSTM structure implemented with the Kaldi toolkit, and the output of the last LSTM layer, which is adjacent to the softmax classification layer, is taken as the hidden acoustic feature. The learning rate of the prosody analysis model decays from 1e-3 to 1e-5, with a decay every 50,000 steps; an Adam optimizer may be used, and the batch size is set to 16. The acoustic model is pre-trained on the THCHS-30 data with a batch size of 32 and an L1 loss for 200,000 steps, then fine-tuned for 4,000 steps with the target speaker data, and finally the speech of the target speaker is obtained from the text information.
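The idea of taking the recognizer's last recurrent layer, the one just before the softmax classifier, as the hidden acoustic feature can be pictured with the toy module below; it is a PyTorch stand-in for the Kaldi TDNN-LSTM recognizer, and its dimensions are invented for illustration.

    # Toy illustration of bottleneck-style hidden acoustic features: keep the
    # output of the last LSTM layer, i.e. the layer feeding the softmax classifier.
    import torch
    import torch.nn as nn

    class ToyRecognizer(nn.Module):
        def __init__(self, feat_dim: int = 32, hidden_dim: int = 256, n_units: int = 200):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, n_units)  # softmax/classification layer

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            out, _ = self.encoder(frames)
            return self.classifier(out)                       # phone-class logits used for ASR training

        def hidden_acoustic_features(self, frames: torch.Tensor) -> torch.Tensor:
            """frames: (1, T, feat_dim) acoustic frames -> (1, T, hidden_dim) hidden acoustic features."""
            out, _ = self.encoder(frames)                     # last LSTM layer output, before the classifier
            return out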
Fig. 5 is a comparison graph of speech simulation performance according to an exemplary embodiment. As shown in Fig. 5, female voice simulation performance and male voice simulation performance are compared separately; in both comparisons the black bars show the performance achieved by the prior art and the white bars show the performance achieved by the present application, and it can be seen that the present application outperforms the prior art for both female and male voice simulation. This is because the hidden acoustic features are embedded in a speech space shared across different languages. Furthermore, the hidden acoustic features can be regarded as speech features with acoustics-related attributes (such as pitch) removed, so the prosody analysis model only needs to learn the duration mapping; correspondingly, the basic acoustic model does not need to learn duration-related information (such as prosody), and when the basic acoustic model is fine-tuned, it only needs to learn the timbre of the target speaker without considering speaking style. Therefore, with the technical solution of the present application, the characteristics of speech can be better extracted, bilingual information of the target speaker does not need to be acquired, specific speech of the target speaker does not need to be acquired, and the speech corresponding to the target speaker and the target language is generated quickly and effectively under limited data resources.
FIG. 6 is a block diagram illustrating a speech generating apparatus according to an example embodiment. Referring to fig. 6, the apparatus includes a text information acquisition unit 601, a hidden acoustic feature acquisition unit 602, an acoustic feature acquisition unit 603, and a speech generation unit 604.
A text information acquisition unit 601 configured to perform acquisition of first text information;
a hidden acoustic feature acquisition unit 602 configured to perform inputting the first text information into a preset prosody analysis model to obtain a first hidden acoustic feature, where the first hidden acoustic feature is a feature of the speech corresponding to the first text information;
an acoustic feature acquisition unit 603 configured to perform inputting the first hidden acoustic feature into a preset adaptive acoustic model corresponding to the target speaker to obtain a first acoustic feature corresponding to the target speaker;
a speech generating unit 604 configured to perform generating speech corresponding to the target speaker and the first text information according to the first acoustic feature.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
FIG. 7 is a block diagram illustrating an electronic device 700 for speech generation according to an example embodiment. For example, the device 700 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 7, device 700 may include one or more of the following components: a processing component 702, a memory 704, a power component 706, a multimedia component 708, an audio component 710, an input/output (I/O) interface 712, a sensor component 714, and a communication component 716.
The processing component 702 generally controls the overall operation of the device 700, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 702 may include one or more processors 720 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 702 may include one or more modules that facilitate interaction between the processing component 702 and other components. For example, the processing component 702 may include a multimedia module to facilitate interaction between the multimedia component 708 and the processing component 702.
The memory 704 is configured to store various types of data to support operation at the device 700. Examples of such data include instructions for any application or method operating on device 700, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 704 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 706 provides power to the various components of the device 700. The power components 706 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 700.
The multimedia component 708 includes a screen that provides an output interface between the device 700 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 708 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 700 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 710 is configured to output and/or input audio signals. For example, the audio component 710 includes a Microphone (MIC) configured to receive external audio signals when the device 700 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 704 or transmitted via the communication component 716. In some embodiments, audio component 710 also includes a speaker for outputting audio signals.
The I/O interface 712 provides an interface between the processing component 702 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 714 includes one or more sensors for providing status assessments of various aspects of the device 700. For example, the sensor assembly 714 may detect an open/closed state of the device 700 and the relative positioning of components, such as the display and keypad of the device 700; the sensor assembly 714 may also detect a change in position of the device 700 or a component of the device 700, the presence or absence of user contact with the device 700, the orientation or acceleration/deceleration of the device 700, and a change in temperature of the device 700. The sensor assembly 714 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 714 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 714 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 716 is configured to facilitate communications between the device 700 and other devices in a wired or wireless manner. The device 700 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 716 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 716 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 704 comprising instructions, executable by the processor 720 of the device 700 to perform the above-described method is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, for example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
FIG. 8 is a block diagram illustrating an apparatus 800 for speech generation according to an example embodiment. For example, the apparatus 800 may be provided as a server. Referring to FIG. 8, the apparatus 800 includes a processing component 822, which further includes one or more processors, and memory resources, represented by memory 832, for storing instructions, such as applications, that are executable by the processing component 822. The application programs stored in memory 832 may include one or more modules that each correspond to a set of instructions. Further, the processing component 822 is configured to execute instructions to perform the speech generation method described above.
The device 800 may also include a power component 826 configured to perform power management of the device 800, a wired or wireless network interface 850 configured to connect the device 800 to a network, and an input/output (I/O) interface 858. The apparatus 800 may operate based on an operating system stored in the memory 832, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of speech generation, comprising:
acquiring first text information;
inputting the first text information into a preset prosody analysis model to obtain a first hidden acoustic feature; the first hidden acoustic feature is a feature of a voice corresponding to the first text information;
inputting the first hidden acoustic feature into a preset self-adaptive acoustic model corresponding to a target speaker to obtain a first acoustic feature corresponding to the target speaker;
and generating voice corresponding to the target speaker and the first text information according to the first acoustic characteristic.
2. The method of claim 1, wherein the inputting the first text information into a preset prosody analysis model to obtain a first hidden acoustic feature comprises:
acquiring second text information and a second voice sample corresponding to the second text information;
extracting initial hidden acoustic features corresponding to the second voice sample;
and inputting the second text information and the initial hidden acoustic features into a preset initial prosody analysis model, and training the initial prosody analysis model to obtain the prosody analysis model.
3. The method of claim 1, wherein the inputting the first hidden acoustic feature into a preset adaptive acoustic model corresponding to a target speaker to obtain a first acoustic feature corresponding to the target speaker comprises:
acquiring a first voice sample corresponding to a target speaker;
and correcting a preset basic acoustic model by applying the first voice sample to obtain the self-adaptive acoustic model corresponding to the target speaker.
4. The speech generation method of claim 3, wherein the language of the first speech sample comprises a first target language or a second target language, and the language of the speech corresponding to the target speaker and the first text information comprises the first target language and/or the second target language; the first target language and the second target language are different languages.
5. The speech generation method according to claim 3, wherein the applying the first speech sample to correct a preset basic acoustic model to obtain the adaptive acoustic model corresponding to the target speaker comprises:
obtaining a plurality of speaker voice samples;
extracting a second implicit acoustic feature corresponding to each speaker voice sample, and extracting a second acoustic feature corresponding to each speaker voice sample;
and inputting the second implicit acoustic feature and the second acoustic feature into a preset initial basic acoustic model, and training the initial basic acoustic model to obtain the basic acoustic model.
6. The method of claim 3, wherein the applying the first speech sample to correct the base acoustic model to obtain the adaptive acoustic model corresponding to the target speaker comprises:
extracting a third implicit acoustic feature corresponding to the first voice sample, and extracting a third acoustic feature corresponding to the first voice sample;
and inputting the third implicit acoustic features and the third acoustic features into a basic acoustic model, and training the basic acoustic model to obtain the self-adaptive acoustic model corresponding to the target speaker.
7. The speech generation method of claim 1, wherein generating speech corresponding to the target speaker and the first textual information based on the first acoustic feature comprises:
and inputting the first acoustic characteristic into a preset vocoder to generate voice corresponding to the target speaker and the first text information.
8. A speech generating apparatus, comprising:
a text information acquisition unit configured to perform acquisition of first text information;
the hidden acoustic feature acquisition unit is configured to input the first text information into a preset prosody analysis model to obtain a first hidden acoustic feature; the first hidden acoustic feature is a feature of a voice corresponding to the first text information;
an acoustic feature acquisition unit configured to perform inputting the first hidden acoustic feature into a preset adaptive acoustic model corresponding to a target speaker to obtain a first acoustic feature corresponding to the target speaker;
a speech generating unit configured to perform generating speech corresponding to the target speaker and the first text information according to the first acoustic feature.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech generation method of any of claims 1 to 7.
10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the speech generation method of any of claims 1 to 7.
CN202010113619.9A 2020-02-24 2020-02-24 Voice generation method and device Pending CN111326138A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010113619.9A CN111326138A (en) 2020-02-24 2020-02-24 Voice generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010113619.9A CN111326138A (en) 2020-02-24 2020-02-24 Voice generation method and device

Publications (1)

Publication Number Publication Date
CN111326138A true CN111326138A (en) 2020-06-23

Family

ID=71168871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010113619.9A Pending CN111326138A (en) 2020-02-24 2020-02-24 Voice generation method and device

Country Status (1)

Country Link
CN (1) CN111326138A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366731A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Text to speech (TTS) method and system
CN104217713A (en) * 2014-07-15 2014-12-17 西北师范大学 Tibetan-Chinese speech synthesis method and device
CN105895076A (en) * 2015-01-26 2016-08-24 科大讯飞股份有限公司 Speech synthesis method and system
CN105261355A (en) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN107039034A (en) * 2016-02-04 2017-08-11 科大讯飞股份有限公司 A kind of prosody prediction method and system
CN108777140A (en) * 2018-04-27 2018-11-09 南京邮电大学 Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure
CN110767210A (en) * 2019-10-30 2020-02-07 四川长虹电器股份有限公司 Method and device for generating personalized voice

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Z. Wu et al.: "Deep neural networks employing Multi-Task Learning and stacked bottleneck features for speech synthesis", 2015 IEEE International Conference on Acoustics, Speech and Signal Processing *
Zhou Zhiping: "Research on small-scale unit concatenation speech synthesis methods based on deep learning", China Master's Theses Full-text Database (Information Science and Technology) *
Li Deyi: "Introduction to Artificial Intelligence", 31 August 2018 *
Yin Xiang: "Research on neural network acoustic modeling methods in speech synthesis", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112382267A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device and storage medium for converting accents
CN112382270A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, device and storage medium
CN113327575A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113327575B (en) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113312453A (en) * 2021-06-16 2021-08-27 哈尔滨工业大学 Model pre-training system for cross-language dialogue understanding
CN113312453B (en) * 2021-06-16 2022-09-23 哈尔滨工业大学 Model pre-training system for cross-language dialogue understanding
WO2023045954A1 (en) * 2021-09-22 2023-03-30 北京字跳网络技术有限公司 Speech synthesis method and apparatus, electronic device, and readable storage medium

Similar Documents

Publication Publication Date Title
CN111326138A (en) Voice generation method and device
CN107705783B (en) Voice synthesis method and device
CN111583944A (en) Sound changing method and device
CN111508511A (en) Real-time sound changing method and device
CN107871494B (en) Voice synthesis method and device and electronic equipment
CN112185389A (en) Voice generation method and device, storage medium and electronic equipment
CN113409764B (en) Speech synthesis method and device for speech synthesis
CN108648754B (en) Voice control method and device
CN112037756A (en) Voice processing method, apparatus and medium
CN110610720B (en) Data processing method and device and data processing device
CN112036174B (en) Punctuation marking method and device
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN107437412B (en) Acoustic model processing method, voice synthesis method, device and related equipment
CN110930977B (en) Data processing method and device and electronic equipment
CN113223542A (en) Audio conversion method and device, storage medium and electronic equipment
CN115039169A (en) Voice instruction recognition method, electronic device and non-transitory computer readable storage medium
CN112036195A (en) Machine translation method, device and storage medium
CN113923517B (en) Background music generation method and device and electronic equipment
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN113409765B (en) Speech synthesis method and device for speech synthesis
CN113115104B (en) Video processing method and device, electronic equipment and storage medium
CN114550691A (en) Multi-tone word disambiguation method and device, electronic equipment and readable storage medium
CN114462410A (en) Entity identification method, device, terminal and storage medium
CN112149432A (en) Method and device for translating chapters by machine and storage medium
CN108364631B (en) Speech synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200623