CN113539230A - Speech synthesis method and device

Speech synthesis method and device

Info

Publication number
CN113539230A
CN113539230A
Authority
CN
China
Prior art keywords
sentence
attribute
text
speaker
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010247289.2A
Other languages
Chinese (zh)
Inventor
刘崴
张海雷
胡一川
汪冠春
褚瑞
李玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Benying Network Technology Co Ltd
Original Assignee
Beijing Benying Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Benying Network Technology Co Ltd
Priority to CN202010247289.2A
Publication of CN113539230A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The application provides a speech synthesis method and apparatus, wherein the method includes: acquiring a text to be subjected to speech synthesis; for each sentence in the text, inputting the sentence into a preset attribute recognition model to obtain the attribute features of the sentence, the attribute features including: a speaker identifier and/or an emotion type; and generating speech having the attribute features according to the sentence and the attribute features of the sentence. The method can automatically recognize the attribute features of the sentences in a text and generate speech having those attribute features, and then perform speech synthesis, thereby improving the accuracy and efficiency of speech synthesis and reducing its cost.

Description

Speech synthesis method and device
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus.
Background
Speech synthesis is a technique for generating speech from text. Current speech synthesis technology is mainly emotion-based and speaker-based. Building a speech synthesis model based on emotion and speaker requires a large amount of text annotated with emotion and speaker labels, together with the speech corresponding to that text. Because the emotion and speaker annotation of the text is done manually, annotation is costly and inefficient, and the accuracy of the resulting speech synthesis model is poor.
Disclosure of Invention
The present application aims to solve, at least to some extent, one of the technical problems mentioned above.
Therefore, a first objective of the present application is to provide a speech synthesis method that can automatically recognize the attribute features of the sentences in a text and generate speech having those attribute features, and then perform speech synthesis, thereby improving the accuracy and efficiency of speech synthesis and reducing its cost.
A second object of the present application is to provide a speech synthesis apparatus.
A third object of the present application is to propose another speech synthesis apparatus.
A fourth object of the present application is to propose a non-transitory computer-readable storage medium.
To achieve the above objectives, an embodiment of the first aspect of the present application provides a speech synthesis method, including: acquiring a text to be subjected to speech synthesis; for each sentence in the text, inputting the sentence into a preset attribute recognition model to obtain the attribute features of the sentence, the attribute features including: a speaker identifier and/or an emotion type; generating speech having the attribute features according to the sentence and the attribute features of the sentence; and combining the speech corresponding to each sentence in the text to obtain synthesized speech.
According to the speech synthesis method of the embodiment of the present application, a text to be subjected to speech synthesis is acquired; for each sentence in the text, the sentence is input into a preset attribute recognition model to obtain the attribute features of the sentence, the attribute features including: a speaker identifier and/or an emotion type; speech having the attribute features is generated according to the sentence and the attribute features of the sentence; and the speech corresponding to each sentence in the text is combined to obtain synthesized speech. The method can automatically recognize the attribute features of the sentences in a text and generate speech having those attribute features, and then perform speech synthesis, thereby improving the accuracy and efficiency of speech synthesis and reducing its cost.
To achieve the above objectives, an embodiment of the second aspect of the present application provides a speech synthesis apparatus, including: an acquisition module, configured to acquire a text to be subjected to speech synthesis; an input module, configured to input each sentence in the text into a preset attribute recognition model to obtain the attribute features of the sentence, the attribute features including: a speaker identifier and/or an emotion type; a generating module, configured to generate speech having the attribute features according to the sentence and the attribute features of the sentence; and a processing module, configured to combine the speech corresponding to each sentence in the text to obtain synthesized speech.
The speech synthesis apparatus of the embodiment of the present application acquires a text to be subjected to speech synthesis; for each sentence in the text, inputs the sentence into a preset attribute recognition model to obtain the attribute features of the sentence, the attribute features including: a speaker identifier and/or an emotion type; generates speech having the attribute features according to the sentence and the attribute features of the sentence; and combines the speech corresponding to each sentence in the text to obtain synthesized speech. The apparatus can automatically recognize the attribute features of the sentences in a text and generate speech having those attribute features, and then perform speech synthesis, thereby improving the accuracy and efficiency of speech synthesis and reducing its cost.
To achieve the above objectives, an embodiment of the third aspect of the present application provides another speech synthesis apparatus, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech synthesis method described above when executing the program.
In order to achieve the above object, a fourth aspect of the present application provides a non-transitory computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the speech synthesis method as described above.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow diagram of a speech synthesis method according to one embodiment of the present application;
FIG. 2 is a flow diagram illustrating a speech synthesis method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a speech synthesis apparatus according to another embodiment of the present application;
FIG. 5 is a schematic structural diagram of another speech synthesis apparatus according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
A speech synthesis method and apparatus according to an embodiment of the present application will be described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a speech synthesis method according to an embodiment of the present application. As shown in fig. 1, the speech synthesis method includes the following steps:
Step 101, acquiring a text to be subjected to speech synthesis.
In the embodiment of the present application, the text to be subjected to speech synthesis may be any text that needs speech synthesis; such text may, for example, be excerpted from a novel or a script, or downloaded from the network.
Step 102, for each sentence in the text, inputting the sentence into a preset attribute recognition model to obtain the attribute features of the sentence; the attribute features include: a speaker identifier and/or an emotion type.
In the embodiment of the present application, for each sentence in the text, the sentence is input into a preset attribute recognition model to obtain the attribute features of the sentence, where the attribute features may include: a speaker identifier and/or an emotion type. The speaker identifier identifies which of the roles in the text is speaking and can be represented numerically, for example, speaker_id = 1; the emotion type may be an emotion conveyed by the text, such as joy, sadness, exclamation, anger, and so on. The attribute recognition model may include, but is not limited to, a speaker recognition submodel and an emotion recognition submodel.
As an example, for each sentence in the text, the sentence is input into the speaker recognition submodel to obtain the speaker identifier of the sentence, and into the emotion recognition submodel to obtain the emotion type of the sentence. There may be multiple speaker recognition submodels, where each speaker recognition submodel corresponds to one speaker identifier and is used to recognize whether the speaker identifier of a sentence is the speaker identifier corresponding to that submodel; likewise, there may be multiple emotion recognition submodels, where each emotion recognition submodel corresponds to one emotion type and is used to recognize whether the emotion type of a sentence is the emotion type corresponding to that submodel.
For example, when the sentence ' "Too good!", the yellow puppet said happily ' is input into the speaker recognition submodel and the emotion recognition submodel respectively, the speaker identifier of the sentence is recognized as "yellow puppet" and the emotion type as "happy".
It should be understood that, before each sentence in the text is input into the preset attribute recognition model to obtain its attribute features, the preset attribute recognition model may first be obtained. Optionally, first training data is obtained, where each training sample in the first training data includes: a training text and its corresponding attribute features; an initial attribute recognition model is then trained with the first training data to obtain the preset attribute recognition model.
As an example, a large amount of text data is labeled, and the labeling information may include: the emotion type and the speaker identifier of each sentence. For example, for the sentence ' "Too good!", the yellow puppet said happily ', the labeling result is: the speaker is labeled "yellow puppet", the emotion type is "happy", and the content is "too good". The labeled text data is then used as training data to train an initial attribute recognition model, yielding the preset attribute recognition model. For example, a neural network model is trained with the labeled text data, and the trained neural network model is used as the preset attribute recognition model.
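As a hedged illustration of this training step, the sketch below assembles a few labeled samples and fits one small neural classifier per attribute with scikit-learn. The sample sentences, labels, and the choice of TfidfVectorizer plus MLPClassifier are assumptions made for the example, since the disclosure only requires training "an initial attribute recognition model".

```python
# Illustrative sketch of assembling "first training data" (training texts plus
# labeled attribute features) and fitting one small neural classifier per
# attribute with scikit-learn. Sample data and model choice are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Each training sample: (training text, speaker identifier, emotion type).
first_training_data = [
    ('"Too good!", the yellow puppet said happily.', "yellow puppet", "happy"),
    ('"Go away!", the old wizard shouted angrily.',  "old wizard",    "angry"),
    ('"I miss home," the girl whispered sadly.',     "girl",          "sad"),
]

texts    = [t for t, _, _ in first_training_data]
speakers = [s for _, s, _ in first_training_data]
emotions = [e for _, _, e in first_training_data]

# One classifier per attribute; together they play the role of the preset
# attribute recognition model (speaker submodel + emotion submodel).
speaker_model = make_pipeline(TfidfVectorizer(), MLPClassifier(max_iter=500)).fit(texts, speakers)
emotion_model = make_pipeline(TfidfVectorizer(), MLPClassifier(max_iter=500)).fit(texts, emotions)

query = ['"Too good!", the yellow puppet laughed.']
print(speaker_model.predict(query), emotion_model.predict(query))
```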
Step 103, generating speech having the attribute features according to the sentence and the attribute features of the sentence.
Optionally, as shown in fig. 2, the sentence may be input into a speech synthesis model corresponding to the attribute features of the sentence to generate speech having those attribute features, which may be implemented as follows:
Step 201, acquiring a speech synthesis model corresponding to the attribute features of the sentence.
It should be understood that, before the sentence is input into the speech synthesis model corresponding to its attribute features to obtain speech having those features, the speech synthesis model corresponding to the attribute features may first be obtained. Optionally, for each attribute feature, second training data corresponding to the attribute feature is obtained, where each training sample in the second training data includes: the attribute features of a training text and the speech corresponding to the training text; an initial speech synthesis model is then trained with the second training data to obtain the speech synthesis model corresponding to the attribute feature.
For example, a large amount of text data is labeled in advance, and the labeling information may include: the emotion type and the speaker identifier of each sentence. The speaker identifier can be represented numerically, and there may be at least 2 speaker identifiers; the emotion types may be N (N ≥ 2) predefined types, for example: joy, sadness, exclamation, anger, and so on. The labeled text data and the speech corresponding to the text data are then used as second training data, and the second training data is input into an initial speech synthesis model for training to obtain the speech synthesis model corresponding to the attribute features. For example, the second training data is input into a neural network model for training, and the trained neural network model is used as the speech synthesis model corresponding to the attribute features.
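A minimal sketch of how such second training data might be organized and used is given below. The sample records, file paths, and the placeholder train_tts_model() function are assumptions for illustration; the disclosure does not prescribe a particular acoustic model, so any text-to-speech architecture could stand behind that placeholder.

```python
# Sketch of organizing the "second training data" and training one speech
# synthesis model per combination of attribute features. Sample records,
# file paths, and train_tts_model() are illustrative assumptions.
from collections import defaultdict

# Each sample: (training text, speaker_id, emotion_type, path to the recorded speech).
second_training_data = [
    ("Too good!", 1, "happy", "wav/speaker1_happy_0001.wav"),   # hypothetical paths
    ("Go away!",  2, "angry", "wav/speaker2_angry_0001.wav"),
    # ... many more labeled (text, attribute features, speech) samples
]

def train_tts_model(samples):
    """Placeholder: fit an initial speech synthesis model on (text, wav) pairs."""
    ...

# Group the corpus by attribute features, then train one model per group to
# obtain the speech synthesis model corresponding to each attribute feature.
corpus_by_attr = defaultdict(list)
for text, speaker_id, emotion, wav_path in second_training_data:
    corpus_by_attr[(speaker_id, emotion)].append((text, wav_path))

synthesis_models = {attr: train_tts_model(samples)
                    for attr, samples in corpus_by_attr.items()}
```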
Step 202, inputting the sentence into the speech synthesis model corresponding to the attribute features of the sentence to obtain speech having the attribute features.
That is, the sentence is input into the speech synthesis model corresponding to its attribute features, and speech having those attribute features can be obtained. The speech with the attribute features carries the speaker identifier and the emotion type corresponding to the speech.
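Continuing the illustrative sketches above, step 202 then reduces to a lookup of the model keyed by the recognized attribute features; the synthesize() method and the fallback key are assumptions, not an API defined by the disclosure.

```python
# Sketch of step 202: look up the synthesis model matching the sentence's
# recognized attribute features and use it to generate speech. The synthesize()
# method and the default key are illustrative assumptions.
def synthesize_sentence(sentence, attrs, synthesis_models, default_key=(1, "neutral")):
    key = (attrs.speaker_id, attrs.emotion_type)
    model = synthesis_models.get(key) or synthesis_models[default_key]
    return model.synthesize(sentence)  # waveform carrying the speaker identifier and emotion
```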
Step 104, combining the speech corresponding to each sentence in the text to obtain synthesized speech.
In the embodiment of the present application, the speech corresponding to each sentence in the text can be combined to obtain the synthesized speech. For example, the per-sentence speech can be assembled into dialogue form according to the story scene in the text, yielding the corresponding synthesized speech.
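As a simple illustration of this combining step, the sketch below concatenates the per-sentence waveforms in text order with a short pause between them; the sample rate and pause length are assumed values, and a real system might instead use scene-dependent pacing or mix in background audio.

```python
# Sketch of step 104: stitch the per-sentence audio into one synthesized track,
# in text order, with a short pause between sentences. The sample rate and
# pause length are assumed values, not parameters from the disclosure.
import numpy as np

def combine_speech(sentence_waveforms, sample_rate=22050, pause_seconds=0.3):
    pause = np.zeros(int(sample_rate * pause_seconds), dtype=np.float32)
    pieces = []
    for wav in sentence_waveforms:          # one waveform per sentence of the text
        pieces.extend([np.asarray(wav, dtype=np.float32), pause])
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=np.float32)
```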
According to the speech synthesis method of the embodiment of the present application, a text to be subjected to speech synthesis is acquired; for each sentence in the text, the sentence is input into a preset attribute recognition model to obtain the attribute features of the sentence, the attribute features including: a speaker identifier and/or an emotion type; speech having the attribute features is generated according to the sentence and the attribute features of the sentence; and the speech corresponding to each sentence in the text is combined to obtain the synthesized speech. The method can automatically recognize the attribute features of the sentences in a text and generate speech having those attribute features, and then perform speech synthesis, thereby improving the accuracy and efficiency of speech synthesis and reducing its cost.
Corresponding to the speech synthesis method provided in the foregoing embodiments, an embodiment of the present application further provides a speech synthesis apparatus. Since the speech synthesis apparatus provided in this embodiment corresponds to the speech synthesis method provided in the foregoing embodiments, the implementation of the foregoing speech synthesis method is also applicable to the speech synthesis apparatus of this embodiment and is not described in detail here. Fig. 3 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application. As shown in fig. 3, the speech synthesis apparatus includes: an acquisition module 310, an input module 320, a generation module 330, and a processing module 340.
The acquisition module is configured to acquire a text to be subjected to speech synthesis; the input module is configured to input each sentence in the text into a preset attribute recognition model to obtain the attribute features of the sentence, the attribute features including: a speaker identifier and/or an emotion type; the generation module is configured to generate speech having the attribute features according to the sentence and the attribute features of the sentence; and the processing module is configured to combine the speech corresponding to each sentence in the text to obtain the synthesized speech.
As a possible implementation manner of the embodiment of the present application, as shown in fig. 4, on the basis of fig. 3, the speech synthesis apparatus further includes: a training module 350.
The obtaining module 310 is further configured to obtain first training data, where each training sample in the first training data includes: training texts and corresponding attribute features; the training module 350 is configured to train the initial attribute recognition model by using the first training data to obtain a preset attribute recognition model.
As a possible implementation manner of the embodiment of the present application, the attribute identification model includes: a speaker recognition submodel and an emotion recognition submodel; the input module 320 is specifically configured to, for each sentence in the text, input the sentence into the speaker recognition submodel, and obtain a speaker identifier of the sentence; and/or inputting the sentence into the emotion recognition submodel for each sentence in the text, and acquiring the emotion type of the sentence.
As a possible implementation manner of the embodiment of the present application, the number of the speaker recognition submodels is plural, where each speaker recognition submodel corresponds to one speaker identifier and is used to recognize whether the speaker identifier of a sentence is the speaker identifier corresponding to that speaker recognition submodel; the number of the emotion recognition submodels is plural, where each emotion recognition submodel corresponds to one emotion type and is used to recognize whether the emotion type of a sentence is the emotion type corresponding to that emotion recognition submodel.
As a possible implementation manner of the embodiment of the present application, the generating module 330 is specifically configured to obtain a speech synthesis model corresponding to an attribute feature of a sentence; and inputting the sentence into a speech synthesis model corresponding to the attribute characteristics of the sentence to obtain speech with the attribute characteristics.
As a possible implementation manner of the embodiment of the present application, the generating module 330 is further specifically configured to, for each attribute feature, obtain second training data corresponding to the attribute feature, where each training sample in the second training data includes: the attribute features of a training text and the speech corresponding to the training text; and train an initial speech synthesis model with the second training data to obtain the speech synthesis model corresponding to the attribute feature.
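Putting the modules of figs. 3 and 4 together, the following is a rough Python sketch of how the apparatus could be wired; the class, the naive sentence splitter, and the reuse of the earlier illustrative helpers (attribute recognition, per-attribute synthesis models, combine_speech) are assumptions for exposition rather than the disclosed structure.

```python
# Rough, illustrative mapping of the modules of FIGS. 3-4 onto plain Python.
# The recognize()/synthesize() methods and the helpers reused from the earlier
# sketches are assumptions, not an API defined by the disclosure.
import re

def split_sentences(text: str):
    # Naive splitter; a production system would use proper text segmentation.
    return [s for s in re.split(r"(?<=[.!?。！？])\s*", text) if s]

class SpeechSynthesisApparatus:
    def __init__(self, attribute_model, synthesis_models):
        self.attribute_model = attribute_model    # preset attribute recognition model
        self.synthesis_models = synthesis_models  # one synthesis model per attribute feature

    def acquire(self, source) -> str:             # acquisition module 310
        return source.read_text()

    def recognize(self, sentence):                # input module 320 (assumed recognize() API)
        return self.attribute_model.recognize(sentence)

    def generate(self, sentence, attrs):          # generation module 330
        key = (attrs.speaker_id, attrs.emotion_type)
        return self.synthesis_models[key].synthesize(sentence)

    def process(self, text):                      # processing module 340
        waveforms = [self.generate(s, self.recognize(s)) for s in split_sentences(text)]
        return combine_speech(waveforms)          # helper from the earlier sketch
```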
The speech synthesis apparatus of the embodiment of the present application acquires a text to be subjected to speech synthesis; for each sentence in the text, inputs the sentence into a preset attribute recognition model to obtain the attribute features of the sentence, the attribute features including: a speaker identifier and/or an emotion type; generates speech having the attribute features according to the sentence and the attribute features of the sentence; and combines the speech corresponding to each sentence in the text to obtain synthesized speech. The apparatus can automatically recognize the attribute features of the sentences in a text and generate speech having those attribute features, and then perform speech synthesis, thereby improving the accuracy and efficiency of speech synthesis and reducing its cost.
In order to implement the foregoing embodiment, the present application further provides another speech synthesis apparatus, and fig. 5 is a schematic structural diagram of another speech synthesis apparatus provided in the embodiment of the present application. The speech synthesis apparatus includes:
a memory 1001, a processor 1002, and a computer program stored in the memory 1001 and executable on the processor 1002.
The processor 1002, when executing the program, implements the speech synthesis method provided in the above-described embodiments.
Further, the speech synthesis apparatus further includes:
a communication interface 1003 for communicating between the memory 1001 and the processor 1002.
A memory 1001 for storing computer programs that may be run on the processor 1002.
The memory 1001 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one magnetic disk memory.
The processor 1002 is configured to implement the speech synthesis method according to the foregoing embodiment when executing the program.
If the memory 1001, the processor 1002, and the communication interface 1003 are implemented independently, the communication interface 1003, the memory 1001, and the processor 1002 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
Optionally, in a specific implementation, if the memory 1001, the processor 1002, and the communication interface 1003 are integrated on one chip, the memory 1001, the processor 1002, and the communication interface 1003 may complete communication with each other through an internal interface.
The processor 1002 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the speech synthesis method as in the above embodiments.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic descriptions referring to the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples and features of different embodiments or examples described in this specification without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (14)

1. A method of speech synthesis, comprising:
acquiring a text to be subjected to speech synthesis;
for each sentence in the text, inputting the sentence into a preset attribute recognition model to obtain attribute features of the sentence; the attribute features include: a speaker identifier, and/or, an emotion type;
generating speech having the attribute features according to the sentence and the attribute features of the sentence; and
combining the speech corresponding to each sentence in the text to obtain synthesized speech.
2. The method according to claim 1, wherein for each sentence in the text, before inputting the sentence into a preset attribute recognition model and obtaining the attribute feature of the sentence, the method further comprises:
obtaining first training data, each training sample in the first training data comprising: training texts and corresponding attribute features;
and training an initial attribute recognition model by adopting the first training data to obtain the preset attribute recognition model.
3. The method of claim 1, wherein the attribute-recognition model comprises: a speaker recognition submodel and an emotion recognition submodel;
for each sentence in the text, inputting the sentence into a preset attribute recognition model, and acquiring the attribute characteristics of the sentence, including:
for each sentence in the text, inputting the sentence into the speaker recognition submodel to obtain a speaker identifier of the sentence;
and/or,
for each sentence in the text, inputting the sentence into the emotion recognition submodel to obtain an emotion type of the sentence.
4. The method according to claim 3, wherein the number of the speaker recognition submodels is plural, and each speaker recognition submodel corresponds to one speaker identifier and is used for recognizing whether the speaker identifier of a sentence is the speaker identifier corresponding to that speaker recognition submodel;
the number of the emotion recognition submodels is multiple, each emotion recognition submodel corresponds to one emotion type and is used for recognizing whether the emotion type of a sentence is the emotion type corresponding to the emotion recognition submodel or not.
5. The method according to claim 1, wherein the generating speech having the attribute features according to the sentence and the attribute features of the sentence comprises:
acquiring a speech synthesis model corresponding to the attribute features of the sentence; and
inputting the sentence into the speech synthesis model corresponding to the attribute features of the sentence to obtain speech having the attribute features.
6. The method according to claim 5, wherein before the inputting the sentence into the speech synthesis model corresponding to the attribute features of the sentence to obtain the speech having the attribute features, the method further comprises:
for the attribute features, obtaining second training data corresponding to the attribute features, wherein each training sample in the second training data comprises: attribute features of a training text and speech corresponding to the training text; and
training an initial speech synthesis model with the second training data to obtain a speech synthesis model corresponding to the attribute features.
7. A speech synthesis apparatus, comprising:
an acquisition module, configured to acquire a text to be subjected to speech synthesis;
an input module, configured to input each sentence in the text into a preset attribute recognition model to obtain attribute features of the sentence; the attribute features include: a speaker identifier, and/or, an emotion type;
a generating module, configured to generate speech having the attribute features according to the sentence and the attribute features of the sentence; and
a processing module, configured to combine the speech corresponding to each sentence in the text to obtain synthesized speech.
8. The apparatus of claim 7, further comprising: a training module;
the obtaining module is further configured to obtain first training data, where each training sample in the first training data includes: training texts and corresponding attribute features;
and the training module is used for training an initial attribute recognition model by adopting the first training data to obtain the preset attribute recognition model.
9. The apparatus of claim 7, wherein the attribute recognition model comprises: a speaker recognition submodel and an emotion recognition submodel;
the input module is specifically configured to,
input, for each sentence in the text, the sentence into the speaker recognition submodel to obtain a speaker identifier of the sentence;
and/or,
input, for each sentence in the text, the sentence into the emotion recognition submodel to obtain an emotion type of the sentence.
10. The apparatus of claim 9, wherein the number of the speaker recognition submodels is plural, and each speaker recognition submodel corresponds to one speaker identifier and is used for recognizing whether the speaker identifier of a sentence is the speaker identifier corresponding to that speaker recognition submodel;
the number of the emotion recognition submodels is multiple, each emotion recognition submodel corresponds to one emotion type and is used for recognizing whether the emotion type of a sentence is the emotion type corresponding to the emotion recognition submodel or not.
11. The apparatus of claim 7, wherein the generation module is specifically configured to,
acquire a speech synthesis model corresponding to the attribute features of the sentence; and
input the sentence into the speech synthesis model corresponding to the attribute features of the sentence to obtain speech having the attribute features.
12. The apparatus of claim 11, wherein the generation module is further specifically configured to,
obtain, for the attribute features, second training data corresponding to the attribute features, wherein each training sample in the second training data comprises: attribute features of a training text and speech corresponding to the training text; and
train an initial speech synthesis model with the second training data to obtain a speech synthesis model corresponding to the attribute features.
13. A speech synthesis apparatus, comprising:
a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech synthesis method according to any one of claims 1-6 when executing the program.
14. A non-transitory computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the speech synthesis method according to any one of claims 1-6.
CN202010247289.2A 2020-03-31 2020-03-31 Speech synthesis method and device Pending CN113539230A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010247289.2A CN113539230A (en) 2020-03-31 2020-03-31 Speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010247289.2A CN113539230A (en) 2020-03-31 2020-03-31 Speech synthesis method and device

Publications (1)

Publication Number Publication Date
CN113539230A 2021-10-22

Family

ID=78087719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010247289.2A Pending CN113539230A (en) 2020-03-31 2020-03-31 Speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN113539230A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170092258A1 (en) * 2015-09-29 2017-03-30 Yandex Europe Ag Method and system for text-to-speech synthesis
CN107464554A (en) * 2017-09-28 2017-12-12 百度在线网络技术(北京)有限公司 Phonetic synthesis model generating method and device
CN108091321A (en) * 2017-11-06 2018-05-29 芋头科技(杭州)有限公司 A kind of phoneme synthesizing method
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN108615524A (en) * 2018-05-14 2018-10-02 平安科技(深圳)有限公司 A kind of phoneme synthesizing method, system and terminal device
CN108962219A (en) * 2018-06-29 2018-12-07 百度在线网络技术(北京)有限公司 Method and apparatus for handling text
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109509470A (en) * 2018-12-11 2019-03-22 平安科技(深圳)有限公司 Voice interactive method, device, computer readable storage medium and terminal device
CN109410913A (en) * 2018-12-13 2019-03-01 百度在线网络技术(北京)有限公司 A kind of phoneme synthesizing method, device, equipment and storage medium
CN110010120A (en) * 2019-05-05 2019-07-12 标贝(深圳)科技有限公司 Model management and phoneme synthesizing method, device and system and storage medium
CN110197655A (en) * 2019-06-28 2019-09-03 百度在线网络技术(北京)有限公司 Method and apparatus for synthesizing voice

Similar Documents

Publication Publication Date Title
CN107731228B (en) Text conversion method and device for English voice information
CN107679033B (en) Text sentence break position identification method and device
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
JP6799574B2 (en) Method and device for determining satisfaction with voice dialogue
CN110188350B (en) Text consistency calculation method and device
CN109522564B (en) Voice translation method and device
CN109003624A (en) Emotion identification method, apparatus, computer equipment and storage medium
CN108091324B (en) Tone recognition method and device, electronic equipment and computer-readable storage medium
US20180090132A1 (en) Voice dialogue system and voice dialogue method
CN108922564A (en) Emotion identification method, apparatus, computer equipment and storage medium
CN107680591A (en) Voice interactive method, device and its equipment based on car-mounted terminal
CN109710087B (en) Input method model generation method and device
CN109410913B (en) Voice synthesis method, device, equipment and storage medium
CN111401071A (en) Model training method and device, computer equipment and readable storage medium
CN108447471A (en) Audio recognition method and speech recognition equipment
CN110673748B (en) Method and device for providing candidate long sentences in input method
CN110610698B (en) Voice labeling method and device
CN107122493B (en) Song playing method and device
CN110890088A (en) Voice information feedback method and device, computer equipment and storage medium
WO2020036195A1 (en) End-of-speech determination device, end-of-speech determination method, and program
CN112241629A (en) Pinyin annotation text generation method and device combining RPA and AI
CN110188327B (en) Method and device for removing spoken language of text
CN108829896B (en) Reply information feedback method and device
CN113539230A (en) Speech synthesis method and device
CN112069805A (en) Text labeling method, device, equipment and storage medium combining RPA and AI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination