CN111128119B - Voice synthesis method and device - Google Patents

Voice synthesis method and device

Info

Publication number
CN111128119B
CN111128119B
Authority
CN
China
Prior art keywords
voice
pieces
voice information
information
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911420316.5A
Other languages
Chinese (zh)
Other versions
CN111128119A (en)
Inventor
孙见青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN201911420316.5A priority Critical patent/CN111128119B/en
Publication of CN111128119A publication Critical patent/CN111128119A/en
Application granted granted Critical
Publication of CN111128119B publication Critical patent/CN111128119B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech synthesis method and apparatus, wherein the method comprises the following steps: sequentially recording N pieces of voice information from a user through a preset device; when the preset device finishes recording the first N/2 pieces of voice information, sending them to a server side; training a preset baseline model on the server side with the first N/2 pieces to obtain a first speech synthesis model; when the preset device finishes recording the last N/2 pieces of voice information, sending them to the server side; and training the first speech synthesis model with the last N/2 pieces to obtain a second speech synthesis model. With this technical scheme, a speech synthesis model carrying the user's own speaking style or emotion can be built according to the user's needs; the synthesized result has high naturalness, that is, high similarity to the user's speaking style, emotion, and timbre; the time spent on model construction is short; and the user experience is greatly improved.

Description

Voice synthesis method and device
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus.
Background
Speech synthesis, also known as text-to-speech (TTS) technology, converts arbitrary text information into standard, fluent speech in real time, in effect mounting an artificial mouth on a machine. It draws on several disciplines, including acoustics, linguistics, digital signal processing, and computer science, and is a leading-edge technology in the field of Chinese information processing; the main problem it solves is how to convert textual information into audible sound, that is, how to make a machine speak like a person. Personalized speech synthesis records some speech segments of a specific person through recording equipment so that a TTS system can synthesize speech with that person's voice, speaking style, and speaking emotion.
At present, personalized speech synthesis uses a single speech synthesis model for users of all ages and genders, and synthesis can only follow the specific speaking style and emotion embedded in that model. The similarity between the synthesis results and the user's speaking style, emotion, and timbre is therefore low, which greatly harms the user experience.
Disclosure of Invention
The invention provides a voice synthesis method and a voice synthesis device. The technical scheme is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a speech synthesis method, including:
sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
when the preset equipment finishes recording the first N/2 pieces of voice information, sending the first N/2 pieces of voice information to a server side;
training a preset baseline model of the server side through the first N/2 pieces of voice information to obtain a first voice synthesis model;
when the preset equipment finishes recording the last N/2 pieces of voice information, sending the last N/2 pieces of voice information to the server side;
and training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
In an embodiment, the training the preset baseline model of the server side through the first N/2 pieces of speech information to obtain a first speech synthesis model includes:
when the number of the N pieces of voice information is smaller than the preset number, determining that the first voice synthesis model reaches a convergence state;
acquiring a preset number of models generated in the process of training a preset baseline model of the server side through the first N/2 pieces of voice information;
and when the number of the N pieces of voice information is greater than or equal to the preset number, selecting a model meeting a preset standard from among the preset number of models as the first voice synthesis model.
In one embodiment, the speech synthesis method further comprises:
before the voice information is sent to the server side, noise reduction processing and screening processing are carried out on the voice information, and the voice information after the noise reduction processing and the screening processing are completed is sent to the server side.
In one embodiment, the screening processing of the voice information includes:
acquiring first voiceprint information prestored by the user;
extracting second voiceprint information in the voice information to judge whether the first voiceprint information is matched with the second voiceprint information;
and when the first voiceprint information is matched with the second voiceprint information, screening the voice information according to a preset standard.
In one embodiment, the second speech synthesis model has reached a converged state.
According to a second aspect of the embodiments of the present invention, there is provided a speech synthesis apparatus including:
the recording module is used for sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
the first sending module is used for sending the first N/2 pieces of voice information to a server side when the preset equipment finishes recording the first N/2 pieces of voice information;
the first training module is used for training a preset baseline model of the server end through the first N/2 pieces of voice information to obtain a first voice synthesis model;
the second sending module is used for sending the last N/2 pieces of voice information to the server side when the preset equipment finishes recording the last N/2 pieces of voice information;
and the second training module is used for training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
In one embodiment, the first training module comprises:
the determining submodule is used for determining that the first voice synthesis model reaches a convergence state when the number of the N pieces of voice information is smaller than the preset number;
the first obtaining submodule is used for obtaining a preset number of models generated in the process of training the preset baseline model of the server side through the first N/2 pieces of voice information;
and the selecting submodule is used for selecting a model which meets a preset standard from the preset number of models as the first voice synthesis model when the number of the N pieces of voice information is greater than or equal to the preset number.
In one embodiment, the speech synthesis apparatus further includes:
and the processing module is used for carrying out noise reduction processing and screening processing on the voice information before the voice information is sent to the server side, and sending the voice information after the noise reduction processing and the screening processing are finished to the server side.
In one embodiment, the processing module includes:
the second obtaining submodule is used for obtaining first voiceprint information prestored by the user;
the extraction submodule is used for extracting second voiceprint information in the voice information so as to judge whether the first voiceprint information is matched with the second voiceprint information;
and the screening submodule is used for screening the voice information according to a preset standard when the first voiceprint information is matched with the second voiceprint information.
In one embodiment, the second speech synthesis model has reached a converged state.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the method comprises the steps of recording N pieces of voice information for a user in sequence through preset equipment, sending the front N/2 pieces of voice information to a server end when the preset equipment records the front N/2 pieces of voice information, then training a preset baseline model of the server end through the front N/2 pieces of voice information to obtain a first voice synthesis model, sending the back N/2 pieces of voice information to the server end when the preset equipment records the back N/2 pieces of voice information, further training the first voice synthesis model through the back N/2 pieces of voice information to obtain a second voice synthesis model, and then performing personalized voice synthesis through the second voice synthesis model, wherein compared with the method that a single voice synthesis model is adopted for users of different ages and different sexes, in the technical scheme of the invention, the method can synthesize the voice synthesis model with the user speaking mode or emotion according to the requirements of the user, namely, when the user needs to perform personalized voice synthesis through the voice synthesis model, N pieces of voice information of the user can be recorded in sequence, a preset baseline model is trained through the front N \2 pieces of voice information to obtain a first voice synthesis model, when the user urgently needs to perform personalized voice synthesis, the first voice synthesis model can be directly synthesized through the first voice synthesis model, then the first voice model is trained through the rear N \2 pieces of voice information on the basis of the first voice synthesis model to obtain a second voice synthesis model meeting the requirements of the user, the result naturalness of the synthesis of the second voice synthesis model is high, namely, the result has high similarity with the user speaking mode, emotion and the tone in the user voice, because the number of the 
required voices is small during model training, the time spent on model construction is short, and the user experience is greatly improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speech synthesis method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of another speech synthesis apparatus according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present invention, and as shown in fig. 1, the method can be implemented as the following steps S11-S15:
in step S11, sequentially recording N pieces of voice information to a user through a preset device, where N is a positive integer; the value of N may be, but is not limited to, tens, for example thirty.
In step S12, when the preset device finishes recording the first N/2 pieces of voice information, they are sent to the server. When N is even, the first N/2 pieces are the first half of the N recordings and the last N/2 pieces are the second half; when N is odd, the extra recording is counted in the first N/2 pieces, so that, for example, with N = 31 the first 16 pieces belong to the first N/2 and the remaining 15 pieces belong to the last N/2;
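The even/odd split rule above can be written as a small helper; this is only an illustrative sketch, as the patent does not prescribe an implementation:

```python
def split_counts(n):
    """Return (first, last): how many of n recordings go into each batch.

    For even n the two batches are equal; for odd n the extra recording
    joins the first batch, matching the example of n = 31 -> (16, 15).
    """
    first = (n + 1) // 2  # ceiling of n/2
    return first, n - first
```

The ceiling split lets server-side training start on the larger first batch as early as possible.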
in step S13, a preset baseline model on the server side is trained with the first N/2 pieces of speech information to obtain a first speech synthesis model. The preset baseline model, obtained by training on a large amount of speech data, can be used for speech synthesis but not for personalized speech synthesis; the first speech synthesis model can be used for personalized synthesis, but its synthesis quality has not yet reached the optimal state.
In step S14, when the preset device finishes recording the last N/2 pieces of voice information, sending the last N/2 pieces of voice information to the server;
in step S15, the first speech synthesis model is trained by the last N/2 pieces of speech information to obtain a second speech synthesis model, where the second speech synthesis model is used for speech synthesis, that is, the second speech synthesis model can perform personalized speech synthesis, and the speech synthesis effect is optimal.
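The two-stage training of steps S13 and S15 can be sketched as follows. `ToyModel` and `train` are illustrative stand-ins, not from the patent (which specifies neither a model architecture nor a training algorithm); training is reduced here to tracking which utterances each model has seen:

```python
class ToyModel:
    """Stand-in for a speech synthesis model; only tracks training history."""
    def __init__(self, name, history=()):
        self.name = name
        self.history = list(history)

def train(model, utterances, name):
    """Placeholder for adaptation training: returns a new model that has
    seen its parent's data plus the new utterances."""
    return ToyModel(name, model.history + list(utterances))

def two_stage_pipeline(baseline, recordings):
    first = (len(recordings) + 1) // 2
    # Stage 1 (step S13): adapt the pretrained baseline on the first batch.
    model1 = train(baseline, recordings[:first], "first_synthesis_model")
    # model1 is already usable for personalized synthesis at this point.
    # Stage 2 (step S15): continue from model1 on the second batch.
    model2 = train(model1, recordings[first:], "second_synthesis_model")
    return model1, model2
```

The key property the sketch preserves is that the second model starts from the first model's state rather than from the baseline, so nothing learned from the first batch is discarded.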
As shown in fig. 2, in one embodiment, the above step S13 can be implemented as the following steps S131-S133:
in step S131, when the number of the N pieces of speech information is smaller than the preset number, it is determined that the first speech synthesis model has reached a convergence state; the convergence state here is the state in which the mean squared error is minimal.
In step S132, obtaining a preset number of models generated in a process of training a preset baseline model of the server side through the first N/2 pieces of voice information;
in step S133, when the number of the N pieces of speech information is greater than or equal to the preset number, a model meeting a preset standard is selected from the preset number of models as the first speech synthesis model. The preset standard uses a validation set as test samples, where the data in the validation set are voice recordings made by the user: each of the preset number of models is evaluated on the validation set by computing its predicted mean squared error, the errors are compared, and the fitted model with the smallest predicted mean squared error is selected as the first speech synthesis model.
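Selecting the best intermediate checkpoint by validation mean squared error might look like the sketch below; the checkpoints are represented abstractly as prediction functions, which is an assumption for illustration rather than the patent's actual model interface:

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def select_checkpoint(checkpoints, val_inputs, val_targets):
    """Pick the checkpoint with the smallest validation MSE.

    `checkpoints` maps a checkpoint name to a prediction function; the
    validation set holds utterances recorded by the same user.
    """
    scores = {
        name: mse([predict(x) for x in val_inputs], val_targets)
        for name, predict in checkpoints.items()
    }
    return min(scores, key=scores.get)
```

Keeping the last few checkpoints and scoring them on held-out user recordings is a standard way to realize "select a model meeting a preset standard".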
A preset number of models generated while training the server's preset baseline model with the first N/2 pieces of voice information are obtained, and when the number of the N pieces of voice information is greater than or equal to the preset number, the model meeting the preset standard is selected from among them as the first voice synthesis model. This ensures that the first speech synthesis model is the best-performing model produced during training and provides the best possible foundation for generating the second speech synthesis model.
In one embodiment, the speech synthesis method further comprises:
before the voice information is sent to the server side, noise reduction processing and screening processing are performed on it, and the processed voice information is sent to the server side. For example, the screening processing can remove overly long silent sections from the voice information.
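One common form of this screening step, removing overly long silent stretches, can be sketched with a simple frame-energy threshold. The frame size, energy floor, and allowed silence length below are illustrative assumptions, not values from the patent:

```python
def trim_long_silence(samples, frame=160, energy_floor=1e-4, max_silent_frames=5):
    """Drop consecutive silent frames beyond `max_silent_frames`, so short
    natural pauses survive but long stretches of dead air are removed."""
    out, run = [], 0
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        energy = sum(s * s for s in chunk) / len(chunk)  # mean frame energy
        if energy < energy_floor:
            run += 1
            if run > max_silent_frames:
                continue  # silence has gone on too long: skip this frame
        else:
            run = 0
        out.extend(chunk)
    return out
```

Production systems would typically use a proper voice activity detector, but the energy-threshold idea is the same.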
Performing noise reduction processing and screening processing on the voice information yields processed speech; training the model on this processed speech gives the finally obtained model a higher accuracy rate during speech synthesis.
In one embodiment, the screening processing of the voice information includes:
acquiring first voiceprint information prestored by the user;
extracting second voiceprint information in the voice information to judge whether the first voiceprint information is matched with the second voiceprint information;
and when the first voiceprint information is matched with the second voiceprint information, screening the voice information according to a preset standard.
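Voiceprint matching is typically done by comparing embedding vectors; a cosine-similarity check against the pre-stored enrollment voiceprint is one plausible realization. The embedding representation and the 0.75 threshold are assumptions for illustration, as the patent does not specify a matching method:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def voiceprints_match(enrolled, extracted, threshold=0.75):
    """True when the voiceprint extracted from a new recording is close
    enough to the user's pre-stored (first) voiceprint."""
    return cosine_similarity(enrolled, extracted) >= threshold
```

Recordings whose extracted voiceprint fails this check would be excluded before training.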
The second voiceprint information is extracted from the voice information recorded by the user and then matched against the pre-stored first voiceprint information, so that recordings that were not made by the user can be excluded before training.
In one embodiment, the second speech synthesis model has reached a converged state.
When the second speech synthesis model reaches the convergence state, the speech synthesis effect is optimal.
As to the above speech synthesis method provided by the embodiment of the present invention, an embodiment of the present invention further provides a speech synthesis apparatus, as shown in fig. 3, the apparatus includes:
the recording module 31 is configured to record N pieces of voice information from a user in sequence through a preset device, where N is a positive integer;
The first sending module 32 is configured to send the first N/2 pieces of voice information to a server side when the preset device finishes recording the first N/2 pieces of voice information;
the first training module 33 is configured to train a preset baseline model of the server through the first N/2 pieces of speech information to obtain a first speech synthesis model;
the second sending module 34 is configured to send the last N/2 pieces of voice information to the server when the preset device finishes recording them;
a second training module 35, configured to train the first speech synthesis model through the last N/2 pieces of speech information to obtain a second speech synthesis model, where the second speech synthesis model is used for speech synthesis.
As shown in fig. 4, in one embodiment, the first training module 33 includes:
the determining submodule 331 is configured to determine that the first speech synthesis model reaches a convergence state when the number of the N pieces of speech information is smaller than a preset number;
a first obtaining submodule 332, configured to obtain a preset number of models generated in a process of training a preset baseline model of the server through the first N/2 pieces of voice information;
the selecting submodule 333 is configured to select, when the number of the N pieces of speech information is greater than or equal to a preset number, a model meeting a preset standard from among the preset number of models as the first speech synthesis model.
In one embodiment, the speech synthesis apparatus further includes:
and the processing module is used for carrying out noise reduction processing and screening processing on the voice information before the voice information is sent to the server side, and sending the voice information after the noise reduction processing and the screening processing are finished to the server side.
In one embodiment, the processing module includes:
the second obtaining submodule is used for obtaining first voiceprint information prestored by the user;
the extraction submodule is used for extracting second voiceprint information in the voice information so as to judge whether the first voiceprint information is matched with the second voiceprint information;
and the screening submodule is used for screening the voice information according to a preset standard when the first voiceprint information is matched with the second voiceprint information.
In one embodiment, the second speech synthesis model has reached a converged state.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A method of speech synthesis, comprising:
sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
when the preset equipment finishes recording the first N/2 pieces of voice information, sending the first N/2 pieces of voice information to a server side;
training the preset baseline model of the server end through the first N/2 pieces of voice information to obtain a first voice synthesis model, which specifically comprises the following steps:
when the number of the N pieces of voice information is smaller than the preset number, determining that the first voice synthesis model reaches a convergence state;
acquiring a preset number of models generated in the process of training a preset baseline model of the server side through the first N/2 pieces of voice information;
when the number of the N pieces of voice information is larger than or equal to a preset number, selecting a model which meets a preset standard from the preset number of models as the first voice synthesis model;
when the preset equipment finishes recording the last N/2 pieces of voice information, sending the last N/2 pieces of voice information to the server side;
and training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
2. The method of claim 1, further comprising:
before the voice information is sent to the server side, noise reduction processing and screening processing are carried out on the voice information, and the voice information after the noise reduction processing and the screening processing are completed is sent to the server side.
3. The method of claim 2, wherein the screening processing of the voice information comprises:
acquiring first voiceprint information prestored by the user;
extracting second voiceprint information in the voice information to judge whether the first voiceprint information is matched with the second voiceprint information;
and when the first voiceprint information is matched with the second voiceprint information, screening the voice information according to a preset standard.
4. The method of claim 1, wherein the second speech synthesis model has reached a converged state.
5. A speech synthesis apparatus, comprising:
the recording module is used for sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
the first sending module is used for sending the first N/2 pieces of voice information to a server side when the preset equipment finishes recording the first N/2 pieces of voice information;
the first training module is used for training the preset baseline model of the server end through the first N/2 pieces of voice information to obtain a first voice synthesis model, and comprises:
the determining submodule is used for determining that the first voice synthesis model reaches a convergence state when the number of the N pieces of voice information is smaller than the preset number;
the first obtaining submodule is used for obtaining a preset number of models generated in the process of training the preset baseline model of the server side through the first N/2 pieces of voice information;
the selecting submodule is used for selecting, from the preset number of models, a model meeting a preset standard as the first voice synthesis model when the number of the N pieces of voice information is greater than or equal to the preset number;
the second sending module is used for sending the last N/2 pieces of voice information to the server side when the preset equipment finishes recording the last N/2 pieces of voice information;
and the second training module is used for training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
6. The apparatus of claim 5, further comprising:
and the processing module is used for carrying out noise reduction processing and screening processing on the voice information before the voice information is sent to the server side, and for sending the voice information on which the noise reduction processing and the screening processing have been completed to the server side.
7. The apparatus of claim 6, wherein the processing module comprises:
the second obtaining submodule is used for obtaining first voiceprint information prestored by the user;
the extraction submodule is used for extracting second voiceprint information from the voice information, so as to determine whether the first voiceprint information matches the second voiceprint information;
and the screening submodule is used for screening the voice information according to a preset standard when the first voiceprint information is matched with the second voiceprint information.
8. The apparatus of claim 5, wherein the second voice synthesis model has reached a converged state.
CN201911420316.5A 2019-12-31 2019-12-31 Voice synthesis method and device Active CN111128119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911420316.5A CN111128119B (en) 2019-12-31 2019-12-31 Voice synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911420316.5A CN111128119B (en) 2019-12-31 2019-12-31 Voice synthesis method and device

Publications (2)

Publication Number Publication Date
CN111128119A CN111128119A (en) 2020-05-08
CN111128119B (en) 2022-04-22

Family

ID=70506935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911420316.5A Active CN111128119B (en) 2019-12-31 2019-12-31 Voice synthesis method and device

Country Status (1)

Country Link
CN (1) CN111128119B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862933A (en) * 2020-07-20 2020-10-30 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating synthesized speech

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8175873B2 (en) * 2008-12-12 2012-05-08 At&T Intellectual Property I, L.P. System and method for referring to entities in a discourse domain
CN105185372B (en) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
US11238843B2 (en) * 2018-02-09 2022-02-01 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
CN110148398A (en) * 2019-05-16 2019-08-20 平安科技(深圳)有限公司 Training method, device, equipment and the storage medium of speech synthesis model

Also Published As

Publication number Publication date
CN111128119A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN108847215B (en) Method and device for voice synthesis based on user timbre
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
JP2013539558A (en) Parameter speech synthesis method and system
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
US20210327446A1 (en) Method and apparatus for reconstructing voice conversation
JP2021110943A (en) Cross-lingual voice conversion system and method
CN112735371B (en) Method and device for generating speaker video based on text information
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
WO2019119279A1 (en) Method and apparatus for emotion recognition from speech
CN108091323A (en) For identifying the method and apparatus of emotion from voice
CN111128119B (en) Voice synthesis method and device
CN114283783A (en) Speech synthesis method, model training method, device and storage medium
CN112580669B (en) Training method and device for voice information
CN117275498A (en) Voice conversion method, training method of voice conversion model, electronic device and storage medium
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
CN111383627B (en) Voice data processing method, device, equipment and medium
US20230252971A1 (en) System and method for speech processing
Yanagisawa et al. Noise robustness in HMM-TTS speaker adaptation
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
JP6000326B2 (en) Speech synthesis model learning device, speech synthesis device, speech synthesis model learning method, speech synthesis method, and program
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
Nthite et al. End-to-End Text-To-Speech synthesis for under resourced South African languages
CN111429878A (en) Self-adaptive speech synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant