CN113380220B - Speech synthesis coding method and device

Info

Publication number: CN113380220B
Application number: CN202110647984.2A
Authority: CN
Inventors: 皮碧虹; 杨德文; 龙丁奋
Original assignee: Shenzhen Tongxingzhe Technology Co ltd
Current assignee: Shenzhen Tongxingzhe Technology Co ltd
Priority date: 2021-06-10
Filing date: 2021-06-10
Publication date: 2024-05-14
Anticipated expiration: 2041-06-10
Also published as: CN113380220A

Abstract

One or more embodiments of the present disclosure provide a speech synthesis coding method and device. After synthesis of text data into pcm stream data has started, a starting buffer threshold Tstart required to begin playing is dynamically calculated according to the current system load condition; if the buffer duration is greater than Tstart, the pcm stream data in the buffer area is read and played. A continuous buffer threshold Tblock required for continuous playing is likewise dynamically calculated according to the current system load condition. During playing, whether to continue or pause synthesizing the text data into pcm stream data is decided by comparing the buffer duration of the buffer area with Tblock, which keeps playing stable and smooth and keeps CPU and memory usage steady.

Description

Speech synthesis coding method and device

Technical Field

The present invention relates to the field of speech synthesis methods, and in particular, to a speech synthesis coding method and apparatus.

Background

The current coding and playing schemes for speech synthesis (text-to-speech) are:

1. One-time synthesis: the text is input to a speech synthesis engine, the encoded pcm data is obtained all at once, and the pcm data is passed to a player to be played all at once. This mode needs a large amount of memory to store the pcm data, and the synthesis wait is long because playing starts only after all the data has been synthesized.

2. Streaming synthesis with sleep: pcm data is synthesized and handed to the player, the synthesizer sleeps for a certain time, and then synthesis and playing continue. In this mode the size of the data block synthesized each time is fixed and CPU usage fluctuates: if the sleep time is too short, the CPU may be heavily occupied; if it is too long, playing may be interrupted or noise may occur.

Disclosure of Invention

In view of the foregoing, one or more embodiments of the present disclosure are directed to a speech synthesis coding method and apparatus, which can effectively solve the technical problems in the prior art.

In view of the above object, one or more embodiments of the present specification provide a speech synthesis encoding method, including:

starting to synthesize text data into pcm stream data, and storing the pcm stream data in a buffer area;

dynamically calculating a starting buffer threshold Tstart required for starting playing according to the current system load condition;

if the buffer duration of the buffer area is greater than the starting buffer threshold Tstart, reading the pcm stream data of the buffer area for playing;

dynamically calculating a continuous buffer threshold Tblock required by continuous playing according to the current system load condition;

If the buffer time length of the buffer area is larger than the continuous buffer threshold Tblock, suspending the synthesis of the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time; otherwise, continuously synthesizing the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for the preset time until all the text data are synthesized into pcm stream data.

As an optional implementation manner, the dynamically calculating the starting buffer threshold Tstart required for starting playing according to the current system load condition includes:

Tstart = Tmin if T2 - T1 < Tmin; otherwise Tstart = T2 - T1;

Wherein T1 is the synthesis duration prediction;

T2 is the playing time length;

Tmin is the minimum buffer duration.

As an alternative embodiment, T1 = L × U / C, T2 = L × T;

Wherein C is the maximum idle computing power of a single CPU core; U is the computing power consumed to synthesize a single word; T is the estimated duration of a single word; and L is the word length of the whole sentence.
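The starting-threshold rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, U is assumed to be expressed in abstract work units per word and C in work units per second (so that U / C is a time in seconds), and durations are in seconds. The source does not specify units.

```python
def start_threshold(sentence_len, cost_per_word, idle_capacity,
                    secs_per_word, t_min):
    """Return Tstart: Tmin if T2 - T1 < Tmin, otherwise T2 - T1.

    sentence_len  -- L, word length of the whole sentence
    cost_per_word -- U, computing power consumed per word (assumed work units)
    idle_capacity -- C, maximum idle computing power of one CPU core
                     (assumed work units per second)
    secs_per_word -- T, estimated duration of a single word in seconds
    t_min         -- Tmin, minimum buffer duration in seconds
    """
    if idle_capacity <= 0:
        raise ValueError("idle_capacity (C) must be positive")
    t1 = sentence_len * cost_per_word / idle_capacity  # T1 = L * U / C
    t2 = sentence_len * secs_per_word                  # T2 = L * T
    # max() is equivalent to the two-branch rule stated in the text.
    return max(t_min, t2 - t1)
```

Because C reflects the load at the moment of the call, calling the function again under a different load yields a different threshold, which is the sense in which the threshold is dynamic.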

As an optional implementation manner, the dynamically calculating the persistent buffer threshold Tblock required for persistent playing according to the current system load condition includes:

If T4 ≤ T3, Tblock = T3; otherwise Tblock = x × (T2 - T1) + Tbuf;

If Tblock < Tmin, Tblock = Tmin;

Wherein T4 is the estimated remaining playing duration, T3 is the estimated remaining synthesis duration, Tbuf is the remaining playing duration of the current buffer area, and x is the buffer unit.

As an alternative embodiment, T3 = R × U / C, T4 = R × T + Tbuf, Tmin = F × Tplayer;

Wherein C is the maximum idle computing power of a single CPU core; U is the computing power consumed to synthesize a single word; R is the remaining word length; F is the minimum play buffer coefficient; Tplayer is the minimum buffer duration of the player; and T is the estimated duration of a single word.

As an alternative embodiment, the buffer unit x = 1% and the minimum play buffer coefficient F = 2.
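The continuous-threshold rule can be sketched the same way. The sketch below is an assumption-laden illustration, not the patented implementation: all names are hypothetical, the term "x (T2-T1)" is read as x multiplied by (T2 - T1), U / C is assumed to yield seconds, and the defaults x = 0.01 and F = 2 are the values given in the text.

```python
def block_threshold(remaining_len, sentence_len, cost_per_word, idle_capacity,
                    secs_per_word, t_buf, t_player, x=0.01, f=2.0):
    """Return Tblock for the current load and buffer state.

    remaining_len -- R, remaining word length
    sentence_len  -- L, word length of the whole sentence
    t_buf         -- Tbuf, remaining playing duration of the current buffer (s)
    t_player      -- Tplayer, minimum buffer duration of the player (s)
    """
    if idle_capacity <= 0:
        raise ValueError("idle_capacity (C) must be positive")
    t_min = f * t_player                                  # Tmin = F * Tplayer
    t1 = sentence_len * cost_per_word / idle_capacity     # T1 = L * U / C
    t2 = sentence_len * secs_per_word                     # T2 = L * T
    t3 = remaining_len * cost_per_word / idle_capacity    # T3 = R * U / C
    t4 = remaining_len * secs_per_word + t_buf            # T4 = R * T + Tbuf
    if t4 <= t3:
        t_block = t3
    else:
        t_block = x * (t2 - t1) + t_buf
    # Final clamp: if Tblock < Tmin, Tblock = Tmin.
    return max(t_block, t_min)
```

Passing the buffer state and remaining length as arguments keeps the function free of side effects, so it can be re-evaluated on every polling cycle.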

As an alternative embodiment, the method further comprises the step of suspending playing of the pcm stream data.

Corresponding to the speech synthesis coding method, the embodiment of the invention also provides a speech synthesis coding device, which comprises:

The buffer module is used for starting to synthesize the text data into pcm stream data and storing the pcm stream data in a buffer area;

The first calculation module is used for dynamically calculating a starting buffer threshold Tstart required for starting playing according to the current system load condition;

the playing module is used for reading the pcm stream data of the buffer area to play when the buffer time of the buffer area is longer than the initial buffer threshold Tstart;

the second calculation module is used for dynamically calculating a continuous buffer threshold Tblock required by continuous playing according to the current system load condition;

The judging module is used for suspending the synthesis of the text data into pcm stream data if the buffer time length of the buffer area is larger than the continuous buffer threshold Tblock, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for the preset time; otherwise, continuously synthesizing the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for the preset time until all the text data are synthesized into pcm stream data.

As an alternative embodiment, the first computing module is configured to

Tstart = Tmin if T2 - T1 < Tmin; otherwise Tstart = T2 - T1;

Wherein T1 is the synthesis duration prediction;

T2 is the playing time length;

Tmin is the minimum buffer duration.

As an alternative embodiment, the second computing module is configured to

If Tblock < Tmin, Tblock = Tmin;

As can be seen from the foregoing, in the speech synthesis coding method and apparatus provided in one or more embodiments of the present disclosure, after synthesis of text data into pcm stream data has started, a starting buffer threshold Tstart required to begin playing is dynamically calculated according to the current system load condition; if the buffer duration is greater than Tstart, the pcm stream data in the buffer area is read and played. A continuous buffer threshold Tblock required for continuous playing is likewise dynamically calculated according to the current system load condition. During playing, whether to continue or pause synthesizing the text data into pcm stream data is decided by comparing the buffer duration of the buffer area with Tblock, which keeps playing stable and smooth and keeps CPU and memory usage steady.

Drawings

To describe one or more embodiments of the present specification or the prior-art solutions more clearly, the drawings needed for that description are briefly introduced below. The drawings described below show only one or more embodiments of the present specification; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of a speech synthesis encoding method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a speech synthesis encoding apparatus according to an embodiment of the present invention.

Detailed Description

For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.

To achieve the above object, an embodiment of the present invention provides a speech synthesis coding method.

In the embodiment of the invention, after synthesis of the text data into pcm stream data has started, a starting buffer threshold Tstart required to begin playing is dynamically calculated according to the current system load condition; if the buffer duration of the buffer area is greater than Tstart, the pcm stream data in the buffer area is read and played. A continuous buffer threshold Tblock required for continuous playing is likewise dynamically calculated according to the current system load condition. During playing, whether to continue or pause synthesizing the text data into pcm stream data is decided by comparing the buffer duration of the buffer area with Tblock, which keeps playing stable and smooth and keeps CPU and memory usage steady.

As shown in fig. 1, an embodiment of the present invention provides a speech synthesis coding method, including:

S100, starting to synthesize the text data into pcm stream data, and storing the pcm stream data in a buffer area.

S200, dynamically calculating a starting buffer threshold Tstart required for starting playing according to the current system load condition.

Optionally, the dynamically calculating the starting buffer threshold Tstart required for starting playing according to the current system load condition includes:

Tstart = Tmin if T2 - T1 < Tmin; otherwise Tstart = T2 - T1;

Wherein T1 is the estimated synthesis duration, T1 = L × U / C; T2 is the playing duration, T2 = L × T; Tmin is the minimum buffer duration; C is the maximum idle computing power of a single CPU core; U is the computing power consumed to synthesize a single word; T is the estimated duration of a single word; and L is the word length of the whole sentence.
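A worked example may make the starting-threshold rule concrete. All numbers are hypothetical and chosen only for easy arithmetic: L = 20 words, U = 50 work units per word, C = 1000 work units per second, T = 0.25 s per word, and Tmin = 0.4 s. The units of U and C are an assumption; the source does not state them.

```latex
\begin{align*}
T_1 &= \frac{L \times U}{C} = \frac{20 \times 50}{1000} = 1.0\ \text{s} \\
T_2 &= L \times T = 20 \times 0.25 = 5.0\ \text{s} \\
T_2 - T_1 &= 4.0\ \text{s} \ge T_{min} = 0.4\ \text{s}
  \;\Rightarrow\; T_{start} = T_2 - T_1 = 4.0\ \text{s}
\end{align*}
```

For a one-word input (L = 1) with the same parameters, T1 = 0.05 s and T2 = 0.25 s, so T2 - T1 = 0.2 s is below Tmin and the rule gives Tstart = Tmin = 0.4 s.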

S300, if the buffer duration of the buffer area is greater than the starting buffer threshold Tstart, reading the pcm stream data of the buffer area for playing.

S400, dynamically calculating a continuous buffer threshold Tblock required by continuous playing according to the current system load condition.

Optionally, the dynamically calculating the persistent buffer threshold Tblock required for persistent playing according to the current system load condition includes:

If T4 ≤ T3, Tblock = T3; otherwise Tblock = x × (T2 - T1) + Tbuf;

If Tblock < Tmin, Tblock = Tmin;

Wherein T4 is the estimated remaining playing duration, T4 = R × T + Tbuf; T3 is the estimated remaining synthesis duration, T3 = R × U / C; Tbuf is the remaining playing duration of the current buffer area; x is the buffer unit, usually 1%; Tmin is the minimum synthesis buffer duration, Tmin = F × Tplayer; C is the maximum idle computing power of a single CPU core; U is the computing power consumed to synthesize a single word; R is the remaining word length; F is the minimum play buffer coefficient, usually F = 2; Tplayer is the minimum buffer duration of the player; and T is the estimated duration of a single word.
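The continuous-threshold rule (if T4 ≤ T3, Tblock = T3; otherwise Tblock = x × (T2 - T1) + Tbuf; then clamped to at least Tmin) can be illustrated with hypothetical numbers: a sentence with L = 20 words, U = 50 work units per word, C = 1000 work units per second, and T = 0.25 s per word, of which R = 10 words remain, with Tbuf = 0.5 s, Tplayer = 0.1 s, x = 1%, and F = 2. The units of U and C are assumed, and "x (T2 - T1)" is read as a product.

```latex
\begin{align*}
T_{min} &= F \times T_{player} = 2 \times 0.1 = 0.2\ \text{s} \\
T_1 &= \frac{L \times U}{C} = 1.0\ \text{s}, \qquad T_2 = L \times T = 5.0\ \text{s} \\
T_3 &= \frac{R \times U}{C} = \frac{10 \times 50}{1000} = 0.5\ \text{s} \\
T_4 &= R \times T + T_{buf} = 10 \times 0.25 + 0.5 = 3.0\ \text{s} \\
T_4 > T_3 &\;\Rightarrow\; T_{block} = x \times (T_2 - T_1) + T_{buf}
  = 0.01 \times 4.0 + 0.5 = 0.54\ \text{s} \\
T_{block} \ge T_{min} &\;\Rightarrow\; T_{block} = 0.54\ \text{s}
\end{align*}
```

If the idle computing power dropped to C = 100, T3 would rise to 5.0 s, so T4 ≤ T3 would hold and the rule would give Tblock = T3 = 5.0 s instead.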

S500, if the buffer time length of the buffer area is larger than the continuous buffer threshold Tblock, suspending the synthesis of the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time; otherwise, continuously synthesizing the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for the preset time until all the text data are synthesized into pcm stream data.
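The flow of steps S100 to S500 can be sketched as a small deterministic simulation. This is a toy model under explicit assumptions, not the patented implementation: the real synthesis engine and player are replaced by counters, the synthesizer is assumed to produce a fixed number of words per polling interval, the player is assumed to drain exactly one interval of audio per poll, and playing is assumed to start with whatever is buffered if the text runs out before Tstart is exceeded. All names (`simulate_playback`, `words_per_tick`, `t_block_fn`, `tick`) are hypothetical.

```python
def simulate_playback(total_words, secs_per_word, words_per_tick,
                      t_start, t_block_fn, tick):
    """Simulate threshold-driven pause/resume of synthesis.

    t_block_fn(remaining_words, buffered_secs) returns the current Tblock.
    Returns (time playing started, number of paused polls, time synthesis ended).
    """
    if words_per_tick < 1 or tick <= 0:
        raise ValueError("words_per_tick must be >= 1 and tick must be > 0")
    remaining = total_words   # words not yet synthesized
    buffered = 0.0            # seconds of audio waiting in the buffer
    now = 0.0                 # simulated clock in seconds
    paused_ticks = 0

    # S100-S300: synthesize until the buffer exceeds Tstart (or text runs out).
    while remaining > 0 and buffered <= t_start:
        n = min(words_per_tick, remaining)
        remaining -= n
        buffered += n * secs_per_word
        now += tick
    started_at = now

    # S400-S500: while playing, re-evaluate Tblock on every poll.
    while remaining > 0:
        t_block = t_block_fn(remaining, buffered)
        if buffered > t_block and buffered > 0.0:
            paused_ticks += 1                 # buffer is full enough: pause
        else:
            n = min(words_per_tick, remaining)
            remaining -= n                    # otherwise keep synthesizing
            buffered += n * secs_per_word
        buffered = max(0.0, buffered - tick)  # player drains one poll of audio
        now += tick                           # wait for the preset time
    return started_at, paused_ticks, now


if __name__ == "__main__":
    # Constant Tblock of 1.0 s stands in for the dynamic calculation.
    print(simulate_playback(40, 0.25, 1, 0.5, lambda r, b: 1.0, 0.125))
    # -> (0.375, 34, 9.25)
```

In this run the synthesizer is twice as fast as playing, so once the buffer passes the threshold it alternates between pausing and synthesizing, which is the load-smoothing behavior the method aims for. A dynamic threshold can be plugged in by passing a different `t_block_fn`.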

Corresponding to the speech synthesis coding method, as shown in fig. 2, an embodiment of the present invention further provides a speech synthesis coding apparatus, including:

a buffer module 10, configured to start synthesizing text data into pcm stream data, and store the pcm stream data in a buffer;

the first calculating module 20 is configured to dynamically calculate an initial buffer threshold Tstart required for playing according to a current system load condition;

the playing module 30 is configured to read the pcm stream data in the buffer area for playing when the buffer time of the buffer area is longer than the initial buffer threshold Tstart;

the second calculating module 40 is configured to dynamically calculate a continuous buffer threshold Tblock required for continuous playing according to a current system load condition;

the judging module 50 is configured to suspend the synthesizing of the text data into pcm stream data if the buffer time period of the buffer area is longer than the continuous buffer threshold Tblock, and return to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time period; otherwise, continuously synthesizing the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for the preset time until all the text data are synthesized into pcm stream data.

Optionally, the first computing module 20 is configured to

Tstart = Tmin if T2 - T1 < Tmin; otherwise Tstart = T2 - T1;

Wherein T1 is the synthesis duration prediction;

T2 is the playing time length;

Tmin is the minimum buffer duration.

Optionally, the second computing module 40 is configured to

If Tblock < Tmin, Tblock = Tmin;

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The present disclosure is intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the one or more embodiments of the disclosure, are therefore intended to be included within the scope of the disclosure.

Claims

1. A speech synthesis coding method, comprising:

If the buffer time length of the buffer area is larger than the continuous buffer threshold Tblock, suspending the synthesis of the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time; otherwise, continuously synthesizing the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time until all the text data are synthesized into pcm stream data; the dynamically calculating the starting buffer threshold Tstart required for starting playing according to the current system load condition includes:

Tstart = Tmin if T2 - T1 < Tmin; otherwise Tstart = T2 - T1;

Wherein T1 is the synthesis duration prediction;

T2 is the playing time length;

Tmin is the minimum buffer duration;

the dynamically calculating the continuous buffer threshold Tblock required by continuous playing according to the current system load condition comprises the following steps:

If Tblock < Tmin, Tblock = Tmin;

2. The speech synthesis coding method according to claim 1, wherein T1 = L × U / C, T2 = L × T;

3. The speech synthesis coding method according to claim 1, wherein T3 = R × U / C, T4 = R × T + Tbuf, Tmin = F × Tplayer;

4. A speech synthesis coding method according to claim 3, wherein the buffer unit x = 1% and the minimum play buffer factor F = 2.

5. The speech synthesis coding method according to claim 1, further comprising the step of pausing playing the pcm stream data.

6. A speech synthesis encoding apparatus comprising:

The judging module is used for suspending the synthesis of the text data into pcm stream data if the buffer time length of the buffer area is larger than the continuous buffer threshold Tblock, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for the preset time; otherwise, continuously synthesizing the text data into pcm stream data, and returning to the step of calculating the continuous buffer threshold Tblock after waiting for a preset time until all the text data are synthesized into pcm stream data;

Wherein the first computing module is used for

Tstart = Tmin if T2 - T1 < Tmin; otherwise Tstart = T2 - T1;

Wherein T1 is the synthesis duration prediction;

T2 is the playing time length;

Tmin is the minimum buffer duration;

The second computing module is used for

If Tblock < Tmin, Tblock = Tmin;