CN113380220A - Speech synthesis coding method and device

Info

Publication number: CN113380220A
Application number: CN202110647984.2A
Authority: CN
Inventors: 皮碧虹; 杨德文; 龙丁奋
Original assignee: Shenzhen Tongxingzhe Technology Co ltd
Current assignee: Shenzhen Tongxingzhe Technology Co ltd
Priority date: 2021-06-10
Filing date: 2021-06-10
Publication date: 2021-09-10

Abstract

One or more embodiments of this specification provide a speech synthesis encoding method and device. After synthesis of text data into pcm stream data begins, a start buffer threshold Tstart required for starting playing is dynamically calculated according to the current system load condition; if the buffer duration of the buffer area is greater than Tstart, the pcm stream data in the buffer area is read for playing. A continuous buffer threshold Tblock required for continuous playing is likewise dynamically calculated according to the current system load condition. During playing, whether to continue or pause synthesizing the text data into pcm stream data is decided by comparing the buffer duration of the buffer area with Tblock, which keeps playing stable and smooth and keeps cpu and memory usage steady.

Description

Speech synthesis coding method and device

Technical Field

The present invention relates to the field of speech synthesis methods, and in particular, to a speech synthesis encoding method and apparatus.

Background

The current encoding and playing schemes for speech synthesis (text to speech) are:

1. One-time synthesis: the text is input into a speech synthesis engine, all of the encoded pcm data is obtained at once, and the pcm data is passed to a player in a single transfer. This approach needs a large amount of memory to hold the pcm data, and the wait is long because playing cannot start until the data has been completely synthesized.

2. Streaming synthesis with sleep: pcm data is synthesized and handed to a player for processing, synthesis sleeps for a certain time, and synthesis and playing then continue. The size of each synthesized data block is fixed and cpu usage fluctuates: if the sleep time is too short, cpu usage may be too high; if it is too long, the player may be interrupted or noise may occur.

Disclosure of Invention

In view of the above, one or more embodiments of the present disclosure are directed to a speech synthesis encoding method and apparatus, which can effectively solve the technical problems in the prior art.

In view of the above, one or more embodiments of the present specification provide a speech synthesis encoding method, including:

starting to synthesize the text data into pcm stream data, and storing the pcm stream data in a buffer area;

dynamically calculating a start buffer threshold Tstart required for starting playing according to the current system load condition;

if the buffer duration of the buffer area is greater than the start buffer threshold Tstart, reading the pcm stream data in the buffer area for playing;

dynamically calculating a continuous buffer threshold Tblock required by continuous playing according to the current system load condition;

if the buffer duration of the buffer area is greater than the continuous buffer threshold Tblock, pausing the synthesis of the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time; otherwise, continuing to synthesize the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time, until all the text data has been synthesized into pcm stream data.

As an optional implementation manner, the dynamically calculating a start buffer threshold Tstart required for starting playing according to the current system load condition includes:

if T2 - T1 < Tmin, Tstart = Tmin; otherwise Tstart = T2 - T1;

wherein T1 is the estimated synthesis duration;

T2 is the playing duration;

Tmin is the minimum buffer duration.

As an alternative embodiment, T1 = L × U/C and T2 = L × T;

wherein C is the maximum idle computing power of a single-core cpu; U is the computing power consumed to synthesize a single character; T is the estimated duration of a single character; and L is the length of the whole sentence.
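The start-threshold rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, durations are taken to be in seconds, and U and C are assumed to be expressed in compatible units (work per character and work per second) so that L × U/C is a duration.

```python
def estimate_tstart(sentence_len, cost_per_char, idle_capacity,
                    secs_per_char, t_min):
    """Return the start buffer threshold Tstart (assumed to be in seconds).

    sentence_len  -- L, length of the whole sentence in characters
    cost_per_char -- U, computing power consumed per character (assumed unit)
    idle_capacity -- C, maximum idle single-core capacity (assumed unit/second)
    secs_per_char -- T, estimated audio duration of one character
    t_min         -- Tmin, minimum buffer duration
    """
    t1 = sentence_len * cost_per_char / idle_capacity  # T1 = L * U / C
    t2 = sentence_len * secs_per_char                  # T2 = L * T
    diff = t2 - t1
    # Tstart is T2 - T1, but never below Tmin.
    return t_min if diff < t_min else diff


# Hypothetical numbers: 100 characters, 0.25 s of audio per character.
print(estimate_tstart(100, 2.0, 50.0, 0.25, 0.4))
```

Because both T1 and T2 scale with L, the threshold in this sketch grows with sentence length and shrinks as idle capacity C drops.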

As an optional implementation manner, the dynamically calculating a continuous buffer threshold Tblock required for continuous playing according to the current system load condition includes:

if T4 ≤ T3, Tblock = T3; otherwise Tblock = x(T2 - T1) + Tbuf;

if Tblock < Tmin, Tblock = Tmin;

wherein T4 is the estimated remaining playing duration, T3 is the estimated remaining synthesis duration, Tbuf is the remaining playing duration of the current buffer, and x is the buffer unit.

As an alternative embodiment, T3 = R × U/C, T4 = R × T + Tbuf, and Tmin = F × Tplayer;

wherein C is the maximum idle computing power of a single-core cpu; U is the computing power consumed to synthesize a single character; R is the remaining word length; F is the minimum play buffer coefficient; Tplayer is the minimum buffer duration of the player; and T is the estimated duration of a single character.

As an alternative embodiment, the buffer unit x is 1%, and the minimum play buffer factor F is 2.
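A minimal Python sketch of the continuous-threshold rule, using the stated values x = 1% and F = 2 as defaults. The names are hypothetical, durations are assumed to be in seconds, and the term x(T2 - T1) is read here as the product of x and (T2 - T1), which is an interpretation rather than something the text states explicitly.

```python
def estimate_tblock(remaining_chars, sentence_len, cost_per_char,
                    idle_capacity, secs_per_char, t_buf, t_player,
                    x=0.01, f=2.0):
    """Return the continuous buffer threshold Tblock (assumed seconds)."""
    t_min = f * t_player                                  # Tmin = F * Tplayer
    t1 = sentence_len * cost_per_char / idle_capacity     # T1 = L * U / C
    t2 = sentence_len * secs_per_char                     # T2 = L * T
    t3 = remaining_chars * cost_per_char / idle_capacity  # T3 = R * U / C
    t4 = remaining_chars * secs_per_char + t_buf          # T4 = R * T + Tbuf
    if t4 <= t3:
        # Remaining synthesis is estimated to take at least as long as
        # the remaining playback.
        t_block = t3
    else:
        # Assumed reading: x multiplies (T2 - T1).
        t_block = x * (t2 - t1) + t_buf
    # Tblock is never allowed to fall below Tmin.
    return max(t_block, t_min)


# Hypothetical numbers: 50 of 100 characters left, 1.0 s already buffered.
print(estimate_tblock(50, 100, 2.0, 50.0, 0.25, 1.0, 0.2))
```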

As an optional embodiment, the method further comprises the step of pausing the playing of the pcm stream data.

Corresponding to the speech synthesis encoding method, an embodiment of the present invention further provides a speech synthesis encoding apparatus, including:

the buffer module is used for starting to synthesize the text data into pcm stream data and storing the pcm stream data in a buffer area;

the first calculation module is used for dynamically calculating a start buffer threshold Tstart required for starting playing according to the current system load condition;

the playing module is used for reading the pcm stream data in the buffer area for playing when the buffer duration of the buffer area is greater than the start buffer threshold Tstart;

the second calculation module is used for dynamically calculating a continuous buffer threshold Tblock required for continuous playing according to the current system load condition;

the judging module is used for pausing the synthesis of the text data into pcm stream data if the buffer duration of the buffer area is greater than the continuous buffer threshold Tblock, and recalculating the continuous buffer threshold Tblock after waiting for a preset time; otherwise, continuing to synthesize the text data into pcm stream data, and recalculating the continuous buffer threshold Tblock after waiting for a preset time, until all the text data has been synthesized into pcm stream data.

As an alternative implementation, the first computing module is configured such that:

if T2 - T1 < Tmin, Tstart = Tmin; otherwise Tstart = T2 - T1;

wherein T1 is the estimated synthesis duration;

T2 is the playing duration;

Tmin is the minimum buffer duration.

As an alternative implementation, the second computing module is configured such that:

if T4 ≤ T3, Tblock = T3; otherwise Tblock = x(T2 - T1) + Tbuf;

if Tblock < Tmin, Tblock = Tmin.

As can be seen from the above, in the speech synthesis encoding method and apparatus provided in one or more embodiments of this specification, after synthesis of text data into pcm stream data begins, a start buffer threshold Tstart required for starting playing is dynamically calculated according to the current system load condition; if the buffer duration of the buffer area is greater than Tstart, the pcm stream data in the buffer area is read for playing. A continuous buffer threshold Tblock required for continuous playing is likewise dynamically calculated according to the current system load condition. During playing, whether to continue or pause synthesizing the text data into pcm stream data is decided by comparing the buffer duration of the buffer area with Tblock, which keeps playing stable and smooth and keeps cpu and memory usage steady.

Drawings

To illustrate one or more embodiments of this specification or the prior-art solutions more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show only one or more embodiments of this specification; those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a logic diagram of a speech synthesis encoding method according to an embodiment of the present invention;

FIG. 2 is a diagram of a speech synthesis encoding apparatus according to an embodiment of the present invention.

Detailed Description

For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.

To achieve the above object, an embodiment of the present invention provides a speech synthesis encoding method.

In the embodiment of the invention, after synthesis of text data into pcm stream data begins, a start buffer threshold Tstart required for starting playing is dynamically calculated according to the current system load condition; if the buffer duration of the buffer area is greater than Tstart, the pcm stream data in the buffer area is read for playing. A continuous buffer threshold Tblock required for continuous playing is likewise dynamically calculated according to the current system load condition. During playing, whether to continue or pause synthesizing the text data into pcm stream data is decided by comparing the buffer duration of the buffer area with Tblock, which keeps playing stable and smooth and keeps cpu and memory usage steady.

As shown in fig. 1, an embodiment of the present invention provides a speech synthesis encoding method, including:

S100: start synthesizing the text data into pcm stream data, and store the pcm stream data in a buffer area.

S200: dynamically calculate a start buffer threshold Tstart required for starting playing according to the current system load condition.

Optionally, the dynamically calculating a start buffer threshold Tstart required for starting playing according to the current system load condition includes:

if T2 - T1 < Tmin, Tstart = Tmin; otherwise Tstart = T2 - T1;

wherein T1 is the estimated synthesis duration, T1 = L × U/C; T2 is the playing duration, T2 = L × T; Tmin is the minimum buffer duration; C is the maximum idle computing power of a single-core cpu; U is the computing power consumed to synthesize a single character; T is the estimated duration of a single character; and L is the length of the whole sentence.

S300: if the buffer duration of the buffer area is greater than the start buffer threshold Tstart, read the pcm stream data in the buffer area for playing.

S400: dynamically calculate a continuous buffer threshold Tblock required for continuous playing according to the current system load condition.

Optionally, the dynamically calculating a continuous buffer threshold Tblock required for continuous playing according to the current system load condition includes:

if T4 ≤ T3, Tblock = T3; otherwise Tblock = x(T2 - T1) + Tbuf;

if Tblock < Tmin, Tblock = Tmin;

wherein T4 is the estimated remaining playing duration, T4 = R × T + Tbuf; T3 is the estimated remaining synthesis duration, T3 = R × U/C; Tbuf is the remaining playing duration of the current buffer; x is the buffer unit, usually 1%; Tmin is the minimum synthesis buffer duration, Tmin = F × Tplayer; C is the maximum idle computing power of a single-core cpu; U is the computing power consumed to synthesize a single character; R is the remaining word length; F is the minimum play buffer coefficient, usually 2; Tplayer is the minimum buffer duration of the player; and T is the estimated duration of a single character.

S500: if the buffer duration of the buffer area is greater than the continuous buffer threshold Tblock, pause the synthesis of the text data into pcm stream data, and return to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time; otherwise, continue synthesizing the text data into pcm stream data, and return to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time, until all the text data has been synthesized into pcm stream data.
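The control flow of steps S100 to S500 can be sketched as a small simulation in virtual time. This is a rough illustration, not the embodiment itself: the function name, the chunk-based synthesis model, the injected threshold callback `tblock_fn`, and the choice to start playing once the text runs out are all assumptions made for the example. One chunk is assumed to be synthesized instantly, and the player is assumed to drain `wait_secs` of audio during each preset wait.

```python
from typing import Callable, List


def run_stream(total_chunks: int, chunk_secs: float, t_start: float,
               tblock_fn: Callable[[int, float], float],
               wait_secs: float) -> List[str]:
    """Simulate S100-S500 and return the sequence of decisions taken."""
    buffered = 0.0   # seconds of pcm audio currently in the buffer
    done = 0         # chunks synthesized so far
    playing = False
    log: List[str] = []

    # S100-S300: synthesize until the buffer duration exceeds Tstart.
    while done < total_chunks and not playing:
        buffered += chunk_secs
        done += 1
        if buffered > t_start:
            playing = True
            log.append("play")
    if not playing:
        # Assumption: short texts start playing once fully synthesized.
        log.append("play")

    # S400-S500: recompute Tblock, then pause or continue synthesis.
    while done < total_chunks:
        t_block = tblock_fn(total_chunks - done, buffered)
        if buffered > t_block:
            log.append("pause")
        else:
            buffered += chunk_secs
            done += 1
            log.append("synth")
        # Preset wait: the player consumes audio in the meantime.
        buffered = max(0.0, buffered - wait_secs)
    return log


# Hypothetical run with a constant 2.0 s continuous threshold.
print(run_stream(5, 1.0, 1.5, lambda remaining, buf: 2.0, 0.5))
```

In a real player the threshold callback would recompute Tblock from the current load on each pass; a constant is used here only to make the alternation between pausing and synthesizing easy to follow.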

Corresponding to the speech synthesis encoding method, as shown in fig. 2, an embodiment of the present invention further provides a speech synthesis encoding apparatus, including:

the buffer module 10 is configured to start synthesizing text data into pcm stream data, and store the pcm stream data in a buffer area;

the first calculating module 20 is configured to dynamically calculate a start buffer threshold Tstart required for starting playing according to the current system load condition;

the playing module 30 is configured to read the pcm stream data in the buffer area for playing when the buffer duration of the buffer area is greater than the start buffer threshold Tstart;

the second calculating module 40 is configured to dynamically calculate a continuous buffer threshold Tblock required for continuous playing according to the current system load condition;

the judging module 50 is configured to pause the synthesis of the text data into pcm stream data if the buffer duration of the buffer area is greater than the continuous buffer threshold Tblock, and to recalculate the continuous buffer threshold Tblock after waiting for a preset time; otherwise, to continue synthesizing the text data into pcm stream data, and to recalculate the continuous buffer threshold Tblock after waiting for a preset time, until all the text data has been synthesized into pcm stream data.

Optionally, the first computing module 20 is configured such that:

if T2 - T1 < Tmin, Tstart = Tmin; otherwise Tstart = T2 - T1;

wherein T1 is the estimated synthesis duration;

T2 is the playing duration;

Tmin is the minimum buffer duration.

Optionally, the second computing module 40 is configured such that:

if T4 ≤ T3, Tblock = T3; otherwise Tblock = x(T2 - T1) + Tbuf;

if Tblock < Tmin, Tblock = Tmin.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A speech synthesis encoding method, comprising:

2. The speech synthesis encoding method according to claim 1, wherein the dynamically calculating a start buffer threshold Tstart required for starting playing according to the current system load condition comprises:

if T2 - T1 < Tmin, Tstart = Tmin; otherwise Tstart = T2 - T1;

wherein T1 is the estimated synthesis duration;

T2 is the playing duration;

Tmin is the minimum buffer duration.

3. The speech synthesis encoding method according to claim 2, wherein T1 = L × U/C and T2 = L × T;

4. The speech synthesis encoding method according to claim 1, wherein the dynamically calculating a continuous buffer threshold Tblock required for continuous playing according to the current system load condition comprises:

if T4 ≤ T3, Tblock = T3; otherwise Tblock = x(T2 - T1) + Tbuf;

if Tblock < Tmin, Tblock = Tmin;

5. The speech synthesis encoding method according to claim 4, wherein T3 = R × U/C, T4 = R × T + Tbuf, and Tmin = F × Tplayer;

6. The speech synthesis coding method according to claim 5, wherein the buffer unit x is 1%, and the minimum play buffer coefficient F is 2.

7. The speech synthesis encoding method according to claim 1, further comprising the step of pausing the playing of the pcm stream data.

8. A speech synthesis encoding apparatus, comprising:

9. The speech synthesis encoding apparatus of claim 8, wherein the first computing module is configured such that:

if T2 - T1 < Tmin, Tstart = Tmin; otherwise Tstart = T2 - T1;

wherein T1 is the estimated synthesis duration;

T2 is the playing duration;

Tmin is the minimum buffer duration.

10. The speech synthesis encoding apparatus of claim 8, wherein the second computing module is configured such that:

if T4 ≤ T3, Tblock = T3; otherwise Tblock = x(T2 - T1) + Tbuf;

if Tblock < Tmin, Tblock = Tmin.