CN111383627B - Voice data processing method, device, equipment and medium


Info

Publication number
CN111383627B
Authority
CN
China
Prior art keywords
voice
data
target speaker
speech
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811628970.0A
Other languages
Chinese (zh)
Other versions
CN111383627A (en)
Inventor
杨鹏 (Yang Peng)
孙子涵 (Sun Zihan)
邱家洪 (Qiu Jiahong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN201811628970.0A
Publication of CN111383627A
Application granted
Publication of CN111383627B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

Embodiments of the invention disclose a voice data processing method, apparatus, device, and medium for reducing the amount of voice training data required from a target speaker, thereby reducing the workload, time, and cost of speech synthesis. The voice data processing method comprises the following steps: acquiring voice training data of a plurality of speakers and voice training data of a target speaker, wherein the amount of voice training data of the target speaker is far smaller than the total amount of voice training data of the plurality of speakers; training, based on the voice training data of the plurality of speakers and of the target speaker, a primary speech synthesis model for synthesizing the target speaker's speech; and obtaining, according to a pre-configured text corpus and using the primary speech synthesis model, corpus data of the target speaker for speech synthesis.

Description

Voice data processing method, device, equipment and medium
Technical Field
The present invention relates to the field of speech processing, and in particular, to a method, apparatus, device, and medium for processing speech data.
Background
With the release of various smart speaker products, users increasingly pay attention not only to factors such as appearance, sound quality, price, and content, but also to a smart speaker's artificial intelligence (AI) capability and degree of personalization.
To create personalized, differentiated smart speakers, speech synthesis technology is used to output voices with the distinct timbres of different people, and the synthesized speech must also be clear and natural. To synthesize speech that is clear, natural, and carries a personalized timbre, the common practice of existing speech synthesis systems is to select a target speaker, design a large text corpus, record a large amount of the target speaker's corpus data against the designed text corpus, and then synthesize the target speaker's speech from the recorded corpus data.
Although the timbre of speech synthesized in this way is fairly close to the speaker's real voice, acquiring the corpus data for synthesizing the target speaker's speech requires recording a large amount of the target speaker's pronunciation data, which entails a heavy workload, a long production cycle, and high cost.
Disclosure of Invention
Embodiments of the invention provide a voice data processing method, apparatus, device, and medium that reduce the amount of voice training data required from the target speaker, thereby reducing the workload, cycle time, and cost of speech synthesis.
In a first aspect, an embodiment of the present invention provides a method for processing voice data, including:
acquiring voice training data of a plurality of speakers and voice training data of a target speaker, wherein the amount of voice training data of the target speaker is far smaller than the total amount of voice training data of the plurality of speakers;
training, based on the voice training data of the plurality of speakers and the voice training data of the target speaker, a primary speech synthesis model for synthesizing the target speaker's speech;
obtaining, according to a pre-configured text corpus and using the primary speech synthesis model, corpus data of the target speaker for speech synthesis.
In a possible implementation of the method provided by the embodiment of the present invention, obtaining corpus data of the target speaker for speech synthesis using the primary speech synthesis model according to a pre-configured text corpus includes:
inputting the text corpus into the primary speech synthesis model to obtain speech synthesis data corresponding to the text corpus;
screening out, from the speech synthesis data, the speech synthesis data that meets preset requirements, and determining the screened data as the corpus data of the target speaker.
In a possible implementation of the method provided by the embodiment of the present invention, after the corpus data of the target speaker for speech synthesis is obtained using the primary speech synthesis model according to the pre-configured text corpus, the method further includes:
adjusting parameters of the primary speech synthesis model using the corpus data to obtain a target speech synthesis model for synthesizing the target speaker's speech.
In a possible implementation of the method provided by the embodiment of the present invention, after the corpus data of the target speaker for speech synthesis is obtained using the primary speech synthesis model according to the pre-configured text corpus, the method further includes:
extracting speech segments from the corpus data, and building, from the extracted speech segments, a speech library for concatenatively synthesizing the target speaker's speech.
In a possible implementation of the method provided by the embodiment of the present invention, training a primary speech synthesis model for synthesizing the target speaker's speech based on the voice training data of the plurality of speakers and of the target speaker includes:
training, based on the voice training data of the plurality of speakers, a base speech synthesis model for synthesizing speech;
adjusting parameters of the base speech synthesis model using the voice training data of the target speaker to obtain the primary speech synthesis model for synthesizing the target speaker's speech.
In a second aspect, an embodiment of the present invention provides a voice data processing apparatus, including:
an acquisition unit, configured to acquire voice training data of a plurality of speakers and voice training data of a target speaker, the amount of voice training data of the target speaker being far smaller than the total amount of voice training data of the plurality of speakers;
a training unit, configured to train, based on the voice training data of the plurality of speakers and of the target speaker, a primary speech synthesis model for synthesizing the target speaker's speech;
a processing unit, configured to obtain, according to a pre-configured text corpus and using the primary speech synthesis model, corpus data of the target speaker for speech synthesis.
In a possible implementation of the apparatus provided by the embodiment of the present invention, the processing unit is specifically configured to:
input the text corpus into the primary speech synthesis model to obtain speech synthesis data corresponding to the text corpus;
screen out, from the speech synthesis data, the speech synthesis data that meets preset requirements, and determine the screened data as the corpus data of the target speaker.
In a possible implementation of the apparatus provided by the embodiment of the present invention, the processing unit is further configured to:
adjust parameters of the primary speech synthesis model using the corpus data to obtain a target speech synthesis model for synthesizing the target speaker's speech.
In a possible implementation of the apparatus provided by the embodiment of the present invention, the processing unit is further configured to:
extract speech segments from the corpus data, and build, from the extracted speech segments, a speech library for concatenatively synthesizing the target speaker's speech.
In a possible implementation of the apparatus provided by the embodiment of the present invention, the training unit is specifically configured to:
train, based on the voice training data of the plurality of speakers, a base speech synthesis model for synthesizing speech;
adjust parameters of the base speech synthesis model using the voice training data of the target speaker to obtain the primary speech synthesis model for synthesizing the target speaker's speech.
In a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor, at least one memory, and computer program instructions stored in the memory that, when executed by the processor, implement the method provided by the first aspect of the embodiments of the invention.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, implement the method provided by the first aspect of the embodiments of the present invention.
The voice data processing method, apparatus, device, and medium provided by embodiments of the invention acquire voice training data of a plurality of speakers and voice training data of a target speaker, the amount of the target speaker's voice training data being far smaller than the total amount of the plurality of speakers' voice training data; train, based on both sets of voice training data, a primary speech synthesis model for synthesizing the target speaker's speech; and obtain, according to a pre-configured text corpus and using the primary speech synthesis model, corpus data of the target speaker for speech synthesis. Because only a small amount of the target speaker's voice training data is needed to generate the corpus data used for speech synthesis, compared with existing speech synthesis systems, which must record a large amount of the target speaker's voice data to obtain such corpus data, the required voice training data of the target speaker is reduced, and the workload, cycle time, and cost of speech synthesis are reduced accordingly.
Drawings
FIG. 1 is a schematic flow chart of a voice data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a voice data processing apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Specific embodiments of a method, an apparatus, a device, and a medium for processing voice data according to embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the voice data processing method provided by the embodiment of the present invention may include the following steps:
Step 101, obtaining voice training data of a plurality of speakers and voice training data of a target speaker, wherein the amount of voice training data of the target speaker is far smaller than the total amount of voice training data of the plurality of speakers.
Here, the target speaker is whichever person's voice is to be synthesized; the term does not refer to one fixed speaker. For example, to synthesize the voice of speaker A, speaker A is the target speaker.
In practice, large-scale voice data of ordinary speakers (for example, 30,000 sentences, or 30 hours or more) is relatively easy to obtain, whereas large-scale voice data of a target speaker (for example, a celebrity) is not; typically only a small amount, within a few hundred sentences or one hour, say 30 minutes of recordings, can be obtained for the target speaker. Therefore, in the embodiment of the present invention, the voice training data of the plurality of speakers may be voice data of ordinary people.
Specifically, the voice training data of the plurality of speakers and of the target speaker may be acquired in various ways, for example from audio data, video data, and recordings available on a network.
Step 102, training, based on the voice training data of the plurality of speakers and the voice training data of the target speaker, a primary speech synthesis model for synthesizing the target speaker's speech.
In a specific implementation, a base speech synthesis model for synthesizing speech is first trained on the voice training data of the plurality of speakers; the parameters of the base model are then adjusted using the target speaker's voice training data to obtain a primary speech synthesis model for synthesizing the target speaker's speech.
Specifically, when training the base speech synthesis model on the voice training data of the plurality of speakers, text features and acoustic features can be extracted from the training data; a deep neural network model is then trained, with the text features as input data and the acoustic features as output data, using a deep neural network learning algorithm, and the resulting model serves as the base speech synthesis model.
Text features may include, but are not limited to, phone sequences, part of speech, word length, and prosodic pauses; acoustic features may include, but are not limited to, spectral parameters, duration, and fundamental frequency.
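By way of a purely illustrative sketch (the embodiment does not prescribe any particular network architecture or framework), the mapping from text features to acoustic features described above could be realized as a small feed-forward network; the PyTorch code below is an assumption for illustration, and all dimensions and names are hypothetical:

```python
# Illustrative sketch only; the embodiment does not prescribe an architecture.
# All dimensions, names, and the feed-forward topology are assumptions.
import torch
import torch.nn as nn

class BaseAcousticModel(nn.Module):
    """Maps frame-level text features (phone identity, part of speech,
    word length, prosodic pauses, ...) to acoustic features (spectral
    parameters, duration, fundamental frequency, ...)."""
    def __init__(self, text_dim: int = 256, acoustic_dim: int = 64, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, text_features: torch.Tensor) -> torch.Tensor:
        return self.net(text_features)

def train_base_model(model: nn.Module, loader, epochs: int = 10) -> nn.Module:
    """Train on pooled multi-speaker data: text features in, acoustic features out."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for text_feats, acoustic_feats in loader:
            opt.zero_grad()
            loss_fn(model(text_feats), acoustic_feats).backward()
            opt.step()
    return model
```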
It should be noted that the base speech synthesis model is trained on the voice data of many speakers and therefore captures features common to them. The primary speech synthesis model is obtained by further training the base model on the target speaker's voice training data: the parameters of the base model are adjusted according to the small amount of target speaker data, so that the speech output by the parameter-adjusted model (i.e., the primary speech synthesis model) is closer to the target speaker's real voice characteristics.
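Continuing the hypothetical sketch above, the parameter adjustment could look like the following; the lower learning rate and the optional freezing of the first layer are common fine-tuning heuristics assumed for illustration, not requirements of the embodiment:

```python
# Hypothetical fine-tuning sketch; reuses BaseAcousticModel from the sketch above.
import copy
import torch
import torch.nn as nn

def finetune(base_model: nn.Module, target_loader, epochs: int = 5,
             freeze_first_layer: bool = True) -> nn.Module:
    """Adjust the base model's parameters with the target speaker's small
    data set, yielding the primary speech synthesis model."""
    model = copy.deepcopy(base_model)      # keep the base model intact
    if freeze_first_layer:
        # Heuristic: keep the lowest shared layer fixed so the few target
        # utterances mainly adapt the upper layers.
        for p in model.net[0].parameters():
            p.requires_grad = False
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                           lr=1e-4)        # smaller step than base training
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for text_feats, acoustic_feats in target_loader:
            opt.zero_grad()
            loss_fn(model(text_feats), acoustic_feats).backward()
            opt.step()
    return model  # the primary speech synthesis model
```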
Step 103, obtaining, according to the pre-configured text corpus and using the primary speech synthesis model, corpus data of the target speaker for speech synthesis.
The pre-configured text corpus can be set according to actual requirements, and the embodiment of the invention does not limit it. For example, if the synthesized speech is to be used for navigation, the pre-configured text corpus may be a navigation corpus; if it is to be used for customer service, it may be a customer-service corpus.
Because little recorded voice data of the target speaker is available, the features the model can learn are limited. When synthesizing with the primary speech synthesis model obtained in step 102, some sounds the model has never learned cannot be synthesized properly, which can produce synthesis output inconsistent with the target speaker's voice characteristics (timbre, intonation, tone of voice, and the like) and hence low synthesis accuracy; a large amount of the target speaker's training data, namely data generated from the text corpus, therefore still needs to be obtained.
In a specific implementation, when obtaining the corpus data of the target speaker for speech synthesis using the primary speech synthesis model according to the pre-configured text corpus, the text corpus can be input into the primary speech synthesis model to obtain speech synthesis data corresponding to the text corpus; the speech synthesis data meeting preset requirements is then screened out and determined as the corpus data of the target speaker. In this way a large amount of target speaker corpus data can be obtained quickly.
When screening the speech synthesis data that meets the preset requirements, the synthesis data that best matches the characteristics of the target speaker's voice can be selected according to those characteristics and determined as the corpus data of the target speaker.
For example, the speech synthesis data may be screened by measuring how well its timbre, intonation, and similar features match those of the target speaker, thereby eliminating inaccurate speech synthesis data.
It should be noted that in a specific implementation this screening may be done manually, by a preset algorithm or model, or by a combination of the two; the embodiment of the present invention does not limit this. For example, an algorithm or model can first screen out and remove speech synthesis data with wrong pronunciation, after which a professional listener screens the remaining speech synthesis data.
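As a sketch of one possible automatic criterion (an assumption, not a screening rule prescribed by the embodiment), synthesized utterances could be kept when a speaker embedding of the audio is close enough to the target speaker's reference voiceprint; `speaker_embedding` below is a hypothetical stand-in for any speaker-verification encoder, and the threshold is likewise an assumption:

```python
import numpy as np

def speaker_embedding(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a speaker-verification encoder that maps a
    waveform to a fixed-size voiceprint vector (not part of the patent)."""
    raise NotImplementedError

def screen(synthesized: list[tuple[str, np.ndarray]],
           reference_utterances: list[np.ndarray],
           threshold: float = 0.80) -> list[tuple[str, np.ndarray]]:
    """Keep only synthesized utterances whose voiceprint is close enough to
    the target speaker's mean voiceprint (cosine similarity, assumed rule)."""
    ref = np.mean([speaker_embedding(u) for u in reference_utterances], axis=0)
    ref = ref / np.linalg.norm(ref)
    kept = []
    for text, wav in synthesized:
        emb = speaker_embedding(wav)
        emb = emb / np.linalg.norm(emb)
        if float(np.dot(ref, emb)) >= threshold:
            kept.append((text, wav))  # accepted as target-speaker corpus data
    return kept
```

In practice such an automatic pass would be combined with the manual screening described above, for example rejecting mispronounced utterances automatically and letting a listener judge the remainder.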
Because the speech synthesis data produced by the primary speech synthesis model is screened, and only the data meeting the preset requirements is kept as the corpus data of the target speaker, the resulting corpus data fits the target speaker's voice characteristics more closely.
In one possible implementation, after the corpus data of the target speaker for speech synthesis is obtained, the embodiment of the present invention may further adjust the parameters of the primary speech synthesis model using that corpus data to obtain a target speech synthesis model for synthesizing the target speaker's speech.
In this implementation, after the target speech synthesis model is obtained, the target speaker's speech can be synthesized with it: a text corpus is input into the target speech synthesis model, and the target speaker's speech corresponding to that text is output.
Because the target speaker's corpus data is used to adjust the parameters of the primary speech synthesis model, the speech obtained from the target speech synthesis model is clearer and more natural and carries the target speaker's personalized timbre. With the speech synthesis scheme provided by the embodiment of the invention, a speech synthesis model whose timbre, intonation, tone of voice, and similar characteristics resemble the target speaker's can be obtained quickly and at low cost from a small amount of recorded target speaker data, so that varied speech of the target speaker can be synthesized rapidly. Because such a model can be replicated quickly, it meets different users' demands for assigning personas to their intelligent devices.
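Under the same assumptions as the earlier sketches, this second-stage adjustment is simply the fine-tuning routine applied once more, now with the screened corpus data; every name below comes from those hypothetical sketches and is an assumption:

```python
# Hypothetical continuation of the earlier sketches; all names are assumptions.
import torch

# corpus_loader would be built from the screened (text, waveform) pairs by
# extracting text features and acoustic features, exactly as for training data.
target_model = finetune(primary_model, corpus_loader)

# Synthesis: text features in, predicted acoustic parameters out. A vocoder
# (not shown) would then turn the acoustic parameters into a waveform.
target_model.eval()
with torch.no_grad():
    acoustic_params = target_model(input_text_features)
```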
In another possible implementation, after the corpus data of the target speaker for speech synthesis is obtained, the embodiment of the present invention may further extract speech segments from the corpus data and build, from the extracted segments, a speech library for synthesizing the target speaker's speech.
In this implementation, once the speech library for concatenatively synthesizing the target speaker's speech has been obtained, the acoustic characteristics of the target speaker can be determined from the corpus data, and the target speaker's speech can then be synthesized from those acoustic characteristics and the speech library.
For example, after a large amount of the target speaker's corpus data is obtained, processing it yields a speech library containing a large number of the target speaker's acoustic segments. At synthesis time, the input text is processed to obtain the acoustic parameters corresponding to it; matching acoustic segments are then retrieved from the speech library according to those parameters and concatenated into speech.
The acoustic parameters may be obtained with a deep neural network model or with a traditional HTS (HMM-based speech synthesis) method. For example: the input text is phonetically annotated to produce a phone sequence; the sequence is structurally analyzed to generate prosodic hierarchy information; an acoustic model converts that information into acoustic parameters such as fundamental frequency and spectrum; finally, the acoustic parameters are either synthesized directly into speech or used to retrieve the corresponding acoustic segments from the speech library, which are spliced into speech.
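A sketch of the concatenative route under simplified assumptions: the speech library is modeled as (acoustic-parameter vector, waveform segment) pairs, and unit selection is reduced to nearest-neighbor matching on the acoustic parameters (real unit-selection systems also weigh target and join costs, and smooth the joins):

```python
import numpy as np

# Assumed library format: list of (acoustic_params, waveform_segment) pairs
# extracted from the screened target-speaker corpus data.
SpeechLibrary = list[tuple[np.ndarray, np.ndarray]]

def concatenative_synthesis(target_params: list[np.ndarray],
                            library: SpeechLibrary) -> np.ndarray:
    """For each required acoustic-parameter vector, pick the nearest segment
    in the library and splice the chosen segments together."""
    pieces = []
    for params in target_params:
        best = min(library, key=lambda item: np.linalg.norm(item[0] - params))
        pieces.append(best[1])
    # Naive splice: plain concatenation; a real system would cross-fade or
    # otherwise smooth the segment boundaries.
    return np.concatenate(pieces)
```

Choosing segments by nearest acoustic parameters keeps the sketch self-contained; a production unit-selection system would additionally score how well adjacent segments join.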
In the embodiment of the invention, because only a small amount of the target speaker's voice training data needs to be obtained when generating the corpus data used for speech synthesis, the required voice training data of the target speaker is reduced compared with existing speech synthesis systems, which must record a large amount of the target speaker's voice data, and the workload, cycle time, and cost of speech synthesis are reduced accordingly. Moreover, by performing this data enhancement for the target speaker, a speech synthesis model whose voice characteristics closely resemble the target speaker's, or a speech library containing a large number of the target speaker's acoustic segments, can be obtained quickly and at low cost; when the target speaker's speech is synthesized with that model or library, the resulting speech is closer to the target speaker's voice characteristics.
Based on the same inventive concept, the embodiment of the invention also provides a voice data processing device.
As shown in FIG. 2, an embodiment of the present invention provides a voice data processing apparatus, including:
an acquisition unit 201, configured to acquire voice training data of a plurality of speakers and voice training data of a target speaker, where the amount of voice training data of the target speaker is far smaller than the total amount of voice training data of the plurality of speakers;
a training unit 202, configured to train, based on the voice training data of the plurality of speakers and of the target speaker, a primary speech synthesis model for synthesizing the target speaker's speech;
a processing unit 203, configured to obtain, according to the pre-configured text corpus and using the primary speech synthesis model, corpus data of the target speaker for speech synthesis.
In one possible implementation, the processing unit 203 is specifically configured to: input the text corpus into the primary speech synthesis model to obtain speech synthesis data corresponding to the text corpus; screen out the speech synthesis data meeting preset requirements and determine it as the corpus data of the target speaker.
In a possible implementation, the processing unit 203 is further configured to: adjust parameters of the primary speech synthesis model using the corpus data to obtain a target speech synthesis model for synthesizing the target speaker's speech.
In a possible implementation, the processing unit 203 is further configured to: extract speech segments from the corpus data and build, from the extracted segments, a speech library for concatenatively synthesizing the target speaker's speech.
In one possible implementation, the training unit 202 is specifically configured to: train, based on the voice training data of the plurality of speakers, a base speech synthesis model for synthesizing speech; adjust parameters of the base speech synthesis model using the target speaker's voice training data to obtain the primary speech synthesis model for synthesizing the target speaker's speech.
In addition, the voice data processing method and apparatus of the embodiments of the present invention described in connection with FIG. 1 and FIG. 2 may be implemented by an electronic device. The electronic device may be an intelligent device (such as a robot) or a controller of an intelligent device, or may be a server; the embodiment of the invention does not limit the specific form of the electronic device. FIG. 3 shows a schematic hardware structure of an electronic device according to an embodiment of the present invention.
The electronic device may comprise a processor 301 and a memory 302 storing computer program instructions.
In particular, the processor 301 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present invention.
Memory 302 may include mass storage for data or instructions. By way of example, and not limitation, memory 302 may comprise a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 302 may include removable or non-removable (or fixed) media, where appropriate. Memory 302 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, memory 302 is non-volatile solid-state memory. In particular embodiments, memory 302 includes read-only memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.
The processor 301 implements any of the voice data processing methods of the above embodiments by reading and executing computer program instructions stored in the memory 302.
In one example, the electronic device may also include a communication interface 303 and a bus 310. As shown in FIG. 3, the processor 301, the memory 302, and the communication interface 303 are connected to one another by the bus 310 and communicate with one another.
The communication interface 303 is mainly used to implement communication between each module, device, unit and/or apparatus in the embodiment of the present invention.
Bus 310 includes hardware, software, or both that couple the components of the electronic device to one another. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, another suitable bus, or a combination of two or more of these. Bus 310 may include one or more buses, where appropriate. Although embodiments of the invention have been described and illustrated with respect to particular buses, the invention contemplates any suitable bus or interconnect.
The electronic device can execute the voice data processing method in the embodiment of the invention based on the acquired voice training data of a plurality of speakers and the voice training data of the target speaker, thereby realizing the voice data processing method and the device described in connection with fig. 1-2.
In addition, in combination with the voice data processing method of the above embodiments, embodiments of the present invention may be implemented by providing a computer-readable storage medium. The computer-readable storage medium stores computer program instructions; when executed by a processor, the instructions implement any of the voice data processing methods of the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (8)

1. A method of processing voice data, comprising:
acquiring voice training data of a plurality of speakers and voice training data of a target speaker, wherein the amount of voice training data of the target speaker is far smaller than the total amount of voice training data of the plurality of speakers;
training, based on the voice training data of the plurality of speakers and the voice training data of the target speaker, a primary speech synthesis model for synthesizing the target speaker's speech;
obtaining, according to a pre-configured text corpus and using the primary speech synthesis model, corpus data of the target speaker for speech synthesis;
wherein the obtaining, according to a pre-configured text corpus and using the primary speech synthesis model, corpus data of the target speaker for speech synthesis comprises:
inputting the pre-configured text corpus into the primary speech synthesis model to obtain speech synthesis data corresponding to the text corpus;
screening out, from the speech synthesis data, speech synthesis data meeting preset requirements, and determining the screened data as corpus data of the target speaker;
and wherein the method further comprises:
extracting speech segments from the corpus data, and building, from the extracted speech segments, a speech library for concatenatively synthesizing the target speaker's speech, the speech library comprising acoustic segments of the target speaker;
when synthesizing the target speaker's speech based on an obtained input text, processing the obtained input text to obtain acoustic parameters corresponding to the input text;
selecting, according to the acoustic parameters, corresponding acoustic segments from the speech library, and concatenating them to obtain voice data of the target speaker corresponding to the input text.
2. The method of claim 1, further comprising, after the corpus data of the target speaker for speech synthesis is obtained using the primary speech synthesis model according to the pre-configured text corpus:
adjusting parameters of the primary speech synthesis model using the corpus data to obtain a target speech synthesis model for synthesizing the target speaker's speech.
3. The method according to claim 1 or 2, wherein training, based on the voice training data of the plurality of speakers and the voice training data of the target speaker, a primary speech synthesis model for synthesizing the target speaker's speech comprises:
training, based on the voice training data of the plurality of speakers, a base speech synthesis model for synthesizing speech;
adjusting parameters of the base speech synthesis model using the voice training data of the target speaker to obtain the primary speech synthesis model for synthesizing the target speaker's speech.
4. A voice data processing apparatus, comprising:
an acquisition unit, configured to acquire voice training data of a plurality of speakers and voice training data of a target speaker, wherein the amount of voice training data of the target speaker is far smaller than the total amount of voice training data of the plurality of speakers;
a training unit, configured to train, based on the voice training data of the plurality of speakers and the voice training data of the target speaker, a primary speech synthesis model for synthesizing the target speaker's speech;
a processing unit, configured to obtain, according to a pre-configured text corpus and using the primary speech synthesis model, corpus data of the target speaker for speech synthesis;
wherein the processing unit is specifically configured to:
input the pre-configured text corpus into the primary speech synthesis model to obtain speech synthesis data corresponding to the text corpus;
screen out, from the speech synthesis data, speech synthesis data meeting preset requirements, and determine the screened data as corpus data of the target speaker;
and wherein the processing unit is further configured to:
extract speech segments from the corpus data, and build, from the extracted speech segments, a speech library for concatenatively synthesizing the target speaker's speech, the speech library comprising acoustic segments of the target speaker;
when synthesizing the target speaker's speech based on an obtained input text, process the obtained input text to obtain acoustic parameters corresponding to the input text;
select, according to the acoustic parameters, corresponding acoustic segments from the speech library, and concatenate them to obtain voice data of the target speaker corresponding to the input text.
5. The apparatus of claim 4, wherein the processing unit is further configured to:
adjust parameters of the primary speech synthesis model using the corpus data to obtain a target speech synthesis model for synthesizing the target speaker's speech.
6. The apparatus according to claim 4 or 5, wherein the training unit is specifically configured to:
train, based on the voice training data of the plurality of speakers, a base speech synthesis model for synthesizing speech;
adjust parameters of the base speech synthesis model using the voice training data of the target speaker to obtain the primary speech synthesis model for synthesizing the target speaker's speech.
7. An electronic device, comprising: at least one processor, at least one memory, and computer program instructions stored in the memory which, when executed by the processor, implement the method of any one of claims 1-3.
8. A computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-3.
CN201811628970.0A 2018-12-28 2018-12-28 Voice data processing method, device, equipment and medium Active CN111383627B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811628970.0A CN111383627B (en) 2018-12-28 2018-12-28 Voice data processing method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811628970.0A CN111383627B (en) 2018-12-28 2018-12-28 Voice data processing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN111383627A CN111383627A (en) 2020-07-07
CN111383627B 2024-03-22

Family

ID=71218099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811628970.0A Active CN111383627B (en) 2018-12-28 2018-12-28 Voice data processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN111383627B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113035169B (en) * 2021-03-12 2021-12-07 北京帝派智能科技有限公司 Voice synthesis method and system capable of training personalized tone library on line
CN112992118B (en) * 2021-05-22 2021-07-23 成都启英泰伦科技有限公司 Speech model training and synthesizing method with few linguistic data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2632424C2 (en) * 2015-09-29 2017-10-04 Общество С Ограниченной Ответственностью "Яндекс" Method and server for speech synthesis in text

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN203386472U (en) * 2013-04-26 2014-01-08 天津科技大学 Character voice changer
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN107657947A (en) * 2017-09-20 2018-02-02 百度在线网络技术(北京)有限公司 Method of speech processing and its device based on artificial intelligence

Also Published As

Publication number Publication date
CN111383627A (en) 2020-07-07

Similar Documents

Publication Publication Date Title
US5940797A (en) Speech synthesis method utilizing auxiliary information, medium recorded thereon the method and apparatus utilizing the method
US20220013120A1 (en) Automatic speech recognition
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
CN111383627B (en) Voice data processing method, device, equipment and medium
US10014007B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN112509550A (en) Speech synthesis model training method, speech synthesis device and electronic equipment
CN111613224A (en) Personalized voice synthesis method and device
KR20180078252A (en) Method of forming excitation signal of parametric speech synthesis system based on gesture pulse model
CN108665901B (en) Phoneme/syllable extraction method and device
AU2014395554B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
US20210158834A1 (en) Diagnosing and treatment of speech pathologies using analysis by synthesis technology
JP2017520016A5 (en) Excitation signal formation method of glottal pulse model based on parametric speech synthesis system
KR102198598B1 (en) Method for generating synthesized speech signal, neural vocoder, and training method thereof
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
CN113421571B (en) Voice conversion method and device, electronic equipment and storage medium
CN111128119B (en) Voice synthesis method and device
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
US9570067B2 (en) Text-to-speech system, text-to-speech method, and computer program product for synthesis modification based upon peculiar expressions
KR102198597B1 (en) Neural vocoder and training method of neural vocoder for constructing speaker-adaptive model
JP5245962B2 (en) Speech synthesis apparatus, speech synthesis method, program, and recording medium
CN112863476A (en) Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing
King A reading list of recent advances in speech synthesis
CN112634866B (en) Speech synthesis model training and speech synthesis method, device, equipment and medium
Martens et al. Word Segmentation in the Spoken Dutch Corpus.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant