CN111105778A - Speech synthesis method, speech synthesis device, computing equipment and storage medium - Google Patents


Info

Publication number
CN111105778A
CN111105778A
Authority
CN
China
Prior art keywords
client
audio data
server
network connection
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811270533.6A
Other languages
Chinese (zh)
Inventor
郑志辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811270533.6A
Publication of CN111105778A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 — Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/002 — Dynamic bit allocation
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04L — TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 — Arrangements for monitoring or testing data switching networks
    • H04L43/08 — Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823 — Errors, e.g. transmission errors
    • H04L43/0829 — Packet loss
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04L — TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 — Traffic control in data switching networks
    • H04L47/10 — Flow control; Congestion control
    • H04L47/25 — Flow control; Congestion control with rate being modified by the source upon detecting a change of network conditions
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04L — TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 — Traffic control in data switching networks
    • H04L47/10 — Flow control; Congestion control
    • H04L47/28 — Flow control; Congestion control in relation to timing considerations
    • H04L47/283 — Flow control; Congestion control in relation to timing considerations in response to processing delays, e.g. caused by jitter or round trip time [RTT]
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04L — TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 — Traffic control in data switching networks
    • H04L47/50 — Queue scheduling
    • H04L47/54 — Loss aware scheduling

Abstract

The invention discloses a speech synthesis method, a speech synthesis device, a computing device and a storage medium. The speech synthesis method executed at a server comprises the following steps: performing text-to-speech conversion processing on text data in a client data packet from a client to obtain audio data corresponding to the text data; determining a code rate for compressing the audio data based on a network connection state between the client and the server; compressing the audio data based on the code rate to obtain a compressed audio data packet; and returning the audio data packet to the client. In this way, by adjusting the compression code rate to suit a complex and changeable network environment, the whole speech synthesis process is kept smooth, so that the user obtains smooth playback of the voice data.

Description

Speech synthesis method, speech synthesis device, computing equipment and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, computing device, and storage medium.
Background
Speech synthesis is a technique for generating artificial speech by mechanical or electronic means. TTS (text-to-speech) technology is a branch of speech synthesis that converts text information, generated by a computer or input from the outside, into intelligible and fluent audio output.
In many current speech synthesis services, the audio data synthesized by the server is essentially transmitted directly to the client, which buffers it appropriately and then plays it. This approach ignores the influence of the complex Internet environment and of factors such as playback buffering and decoding buffering, so the speech synthesis system as a whole can neither sense changes in network conditions nor adaptively provide audio data of different qualities.
Accordingly, there is still a need for an improved speech synthesis technique to provide a smooth audio data playback effect to the user.
Disclosure of Invention
The invention aims to provide a speech synthesis method and a speech synthesis device which adapt to different network environments and network conditions and provide the user with a smooth voice data playing effect.
According to an aspect of the present invention, there is provided a speech synthesis method performed at a server, including: performing text-to-speech conversion processing on text data in a client data packet from a client to obtain audio data corresponding to the text data; determining a code rate for compressing the audio data based on a network connection state between the client and the server; compressing the audio data based on the code rate to obtain an audio data packet subjected to compression processing; and returning the audio data packet to the client.
Optionally, the speech synthesis method may further include determining the network connection status based on a packet loss rate parameter from the client.
Optionally, the step of determining a code rate for compressing the audio data may include: and under the condition that the network connection state is good, determining a code rate for compressing the audio data based on the packet loss rate parameter.
Optionally, in the case that the network connection state is poor, the server does not perform text-to-speech conversion processing on the text data or does not perform compression processing on the audio data, and sends an instruction for generating the audio data offline to the client.
Optionally, the packet loss rate parameter is calculated by the client according to statistical information of previously received audio data packets.
According to another aspect of the present invention, there is also provided a server for speech synthesis, including: the first text-to-speech conversion unit is used for performing text-to-speech conversion processing on text data in a client data packet from a client to obtain audio data corresponding to the text data; a code rate determining unit, configured to determine a code rate for compressing the audio data based on a network connection state between the client and the server; the compression unit is used for compressing the audio data based on the code rate to obtain an audio data packet after compression; and the first transmission unit is used for returning the audio data packet to the client.
Optionally, the server may further include a first network status determining unit, configured to determine the network connection status based on a packet loss rate parameter from the client.
Optionally, the code rate determining unit determines the code rate for compressing the audio data based on the packet loss rate parameter when the network connection state is good.
Optionally, the server may further include an instruction control unit, configured to generate an instruction for generating the audio data offline in a case that the network connection status is poor, where in a case that the network connection status is poor, the first text-to-speech conversion unit does not perform text-to-speech conversion on the text data, or the compression unit does not perform compression on the audio data, and the first transmission unit sends the instruction for generating the audio data offline to the client.
According to another aspect of the present invention, there is also provided a speech synthesis method performed at a client, including: sending, to a server, a network connection state between the client and the server, or a parameter that can be used to determine the network connection state; sending text data input by a user to the server; and
and receiving an audio data packet from the server, wherein the audio data packet comprises audio data corresponding to the text data, and the compression code rate of the audio data packet is related to the network connection state.
Optionally, the speech synthesis method may further include determining the network connection status based on a packet loss rate parameter.
Optionally, the text data is sent to the server when the network connection state is good.
Optionally, the method may further include: under the condition that the network connection state is poor, text-to-speech conversion processing is performed on the text data locally at the client; or responding to an instruction of generating audio data offline from a server, and locally performing text-to-speech conversion processing on the text data at the client.
Optionally, the parameter that can be used to determine the network connection status is a packet loss rate parameter.
Optionally, the method may further include: and calculating and updating the packet loss rate parameter based on the statistical information of the received audio data packet.
Optionally, the method may further include: and dynamically adjusting the size of a local jitter buffer area according to the packet loss rate parameter.
There is also provided, according to another aspect of the present invention, a client for speech synthesis, including: a second transmission unit, configured to send, to the server, the network connection state between the client and the server, or a parameter that can be used to determine the network connection state; a third transmission unit, configured to send text data input by the user to the server; and a fourth transmission unit, configured to receive an audio data packet from the server, where the audio data packet includes audio data corresponding to the text data, and a compression code rate of the audio data packet is related to the network connection state.
Optionally, the client may further include: and the second network state determining unit is used for determining the network connection state based on the packet loss rate parameter.
Optionally, the third transmission unit sends the text data to the server when the network connection state is good.
Optionally, the client may further include: the second text-to-speech conversion unit is used for performing text-to-speech conversion processing on the text data locally at the client under the condition that the network connection state is poor; or in response to receiving an instruction for generating audio data offline from a server, the second text-to-speech conversion unit performs text-to-speech conversion processing on the text data locally at the client.
Optionally, the parameter that can be used to determine the network connection status is a packet loss rate parameter.
Optionally, the client may further include a calculating unit, configured to calculate and update the packet loss rate parameter based on the statistical information of the received audio data packets.
Optionally, the client may further include a jitter buffer adjusting unit, configured to dynamically adjust a size of a local jitter buffer according to the packet loss rate parameter.
According to another aspect of the present invention, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
According to another aspect of the present invention, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method as described above.
Therefore, by the invention, aiming at a complex and changeable network environment, a network condition monitoring mechanism, a network congestion control mechanism, a local buffer dynamic adjustment mechanism and other mechanisms are introduced to ensure the smoothness of the whole voice synthesis process, so that a user can obtain a smooth voice data playing effect.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows a schematic diagram of a speech synthesis system for implementing an embodiment of the invention.
FIG. 2 shows a schematic diagram of a speech synthesis system according to one embodiment of the invention.
Fig. 3 shows a flow diagram of a speech synthesis method according to an embodiment of the invention.
Fig. 4 shows a flow diagram of a speech synthesis method according to another embodiment of the invention.
Fig. 5 shows a schematic structural diagram of a server according to an embodiment of the present invention.
Fig. 6 shows a schematic structural diagram of a client according to an embodiment of the present invention.
FIG. 7 shows a schematic block diagram of a computing device in accordance with one embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As mentioned above, the current speech synthesis service does not consider the influence of the complicated Internet environment and of factors such as jitter buffering and decoding buffering, and the speech synthesis system as a whole can neither sense changes in network conditions nor adaptively provide audio data of different qualities.
At present, the online speech synthesis service directly transmits the audio data synthesized by the server to the client, and the client performs appropriate buffering and then playing, and such a scheme has the following disadvantages:
1) the internet environment is complex, and the playing buffer size of the local client cannot be dynamically changed along with the network condition due to different network jitter degrees. If the play buffer is too small, the audio play will be jammed; if the play buffer is too large, memory will be wasted and delay of audio play will be increased;
2) decoding buffering is lacking. If the decoding capability of the client is insufficient, the decoded audio data is not available during audio playing, and playing is blocked;
3) the network environment is complex and there is no QoS (Quality of Service) mechanism, so the transmission system as a whole cannot sense changes in network conditions or adjust the quality of the transmitted audio data accordingly (i.e., provide audio data at different code rates).
In view of this, the present invention provides a speech synthesis method and apparatus, which, aiming at a complex and variable network environment, ensure the smoothness of the whole speech synthesis process by introducing a series of mechanisms such as a network condition monitoring mechanism, a network congestion control mechanism, a local buffer dynamic adjustment mechanism, etc., so that a user can obtain smooth speech data playing.
The speech synthesis scheme of the present invention will be described in detail below with reference to the accompanying drawings and embodiments.
FIG. 1 shows a schematic diagram of a speech synthesis system for implementing an embodiment of the invention.
As shown in fig. 1, the speech synthesis system of the present invention may include at least one server 20 and a plurality of terminal devices 10. The terminal device 10 can transmit and receive information to and from the server 20 via the network 40. The server 20 can acquire contents required by the terminal device 10 by accessing the database 30. The mobile terminals (e.g., 10_1, 10_2, …, 10_N) may also communicate with each other via the network 40.
Network 40 may be a network for information transfer in a broad sense and may include one or more communication networks such as a wireless communication network, the internet, a private network, a local area network, a metropolitan area network, a wide area network, or a cellular data network, among others. In one embodiment, the network 40 may also include a satellite network, whereby the GPS signals of the terminal device 10 are transmitted to the server 20.
Terminal device 10 is any suitable portable electronic device that may be used for network access including, but not limited to, a smart phone, tablet, or other portable client. The terminal device 10 may have one or more application clients installed thereon, which may provide a text input interface and an audio playing function to a user to facilitate the user to input text and play audio data corresponding to the input text. Unless otherwise stated, the present invention is described below with reference to the client 10 as a terminal device or one or more application clients installed on a terminal device.
The server 20 is any server capable of providing information required for an interactive service through a network. The server side can receive a client side data packet containing the text data sent by the client side, can perform text-to-speech conversion processing on the text data to obtain audio data corresponding to the text data, and sends the audio data packet containing the audio data to the client side for audio playing.
In the following description, one or a part of the mobile terminals (for example, the terminal device 10-1) will be selected for description, but it should be understood by those skilled in the art that the 1 … N mobile terminals mentioned above are intended to represent the large number of mobile terminals existing in a real network, and the single server 20 and database 30 shown are intended to represent the server- and database-related operation of the technical solution of the present invention. Describing specific numbers of mobile terminals and a single server and database is merely for convenience of description and implies no limitation on the type or location of the mobile terminals and servers.
It should be noted that the underlying concepts of the exemplary embodiments of the present invention are not altered if additional modules are added or removed from the illustrated environments. In addition, although a bidirectional arrow from the database 30 to the server 20 is shown in the figure for convenience of explanation, it will be understood by those skilled in the art that the above-described data transmission and reception may be realized through the network 40.
FIG. 2 shows a schematic diagram of a speech synthesis system according to one embodiment of the invention.
As shown in fig. 2, the speech synthesis system of the present invention may include at least a client 10 and a server 20.
The client 10 may be a terminal device as shown in fig. 1, or may be an application client installed on the terminal device side.
The client 10 may include a component that receives text input that provides an external input interface through which a user may directly input text that requires speech synthesis.
The client 10 may include a network module, which is a component for network communication and can be used for network transmission, wherein the protocol format of the client and server communication may be in the form of data packets. In one embodiment, the network module can send a client data packet including text data input by a user to the server and can also receive an audio data packet returned by the server.
In a preferred embodiment, the client 10 may further include a Quality of Service (QoS) module, which is capable of calculating a network parameter indicating a network condition, such as a packet loss rate, according to the received packet statistics information, and adjusting a jitter buffer of the local client according to the current network condition to assist in adaptively playing the smooth audio data. The QoS module may preferably perform the above calculation and control by using a congestion control algorithm (GCC algorithm) (see the following description for a specific control scheme). Preferably, in order to calculate the packet loss rate, the data packet transmitted by the network may further include a sequence number of the data packet.
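The sequence-number-based loss statistic described above can be sketched as follows. This is a minimal illustration in Python under assumed semantics (sequence numbers starting at 0, loss measured as the fraction of expected packets never received); the class and method names are invented for this sketch and are not APIs from the patent.

```python
# Hypothetical sketch of the client-side QoS packet-loss estimate: count
# gaps in the sequence numbers of the audio data packets received so far.

class LossRateEstimator:
    """Tracks received packet sequence numbers and reports a loss rate."""

    def __init__(self):
        self.highest_seq = None   # highest sequence number seen so far
        self.received = 0         # count of packets actually received

    def on_packet(self, seq):
        self.received += 1
        if self.highest_seq is None or seq > self.highest_seq:
            self.highest_seq = seq

    def loss_rate(self):
        """Fraction of expected packets (by sequence number) never seen."""
        if self.highest_seq is None:
            return 0.0
        expected = self.highest_seq + 1   # sequence numbers start at 0
        return max(0.0, (expected - self.received) / expected)

est = LossRateEstimator()
for seq in [0, 1, 2, 4, 5, 7]:   # packets 3 and 6 were lost in transit
    est.on_packet(seq)
print(est.loss_rate())           # 2 of 8 expected packets missing -> 0.25
```

The client would periodically report this value back to the server, which is what makes the server-side rate control below possible.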
Further, the client 10 may further include an audio decoder, which may include a decoding buffer, and may be capable of decoding and placing the audio data returned by the server into the decoding buffer, so that the audio player may play the audio data obtained from the decoding buffer, thereby enabling the user to obtain smooth audio data playing.
In a preferred embodiment, the client 10 may further include a speech synthesis module (not shown in the figure) capable of synthesizing, locally and offline at the client, audio data corresponding to the text data input by the user. The speech synthesis module can synthesize the audio data locally on its own when the network condition is poor, or synthesize the audio data offline locally after receiving a control instruction, sent by the server, for generating the audio data offline.
The server 20 may also include corresponding components corresponding to the client.
For example, the server 20 may also include a network module for network transmission. The network module can receive text data from the client and can also transmit audio data synthesized at the server to the client 10.
The server 20 may have a QoS module corresponding to the client, which can determine the current network condition (such as network connection status) according to the obtained network parameters, and then perform control related to network congestion based on the current network condition, so as to obtain audio data with different quality at the server.
In a preferred embodiment, the above calculation and control can be preferably performed by using a congestion control algorithm (GCC algorithm), and the QoS module on the client side performs control of network congestion in cooperation with the QoS module on the server side to control the quality of the voice-synthesized audio data according to the network conditions.
The network parameters (e.g., packet loss rate) can reflect the network congestion condition, and based on the GCC algorithm, the compression code rate can be controlled at the transmitting end (e.g., the server end) based on the network parameters (e.g., packet loss rate).
In a preferred embodiment, a small or zero packet loss rate indicates that the network condition is good, and the compression code rate of the sending end can be increased as long as it does not exceed a preset maximum code rate; conversely, an increasing packet loss rate indicates that the network condition is deteriorating, and in this case the code rate of the sending end (e.g., the server end) should be decreased. In other cases, the compression code rate of the sending end can be kept unchanged.
The network parameters (e.g., packet loss rate) used by the GCC algorithm may be calculated according to the statistical information of the data packets received by the receiving end (e.g., client), and then the receiving end returns the network parameters to the transmitting end.
After receiving the packet loss rate, the sending end may calculate its compression code rate according to formula (1-1) below: when the packet loss rate is greater than 0.1, the network is congested and the compression code rate of the sending end is decreased; when the packet loss rate is less than 0.02, the network condition is good and the compression code rate of the sending end is increased; in other cases, the compression code rate of the sending end remains unchanged.
A_s(t_k) = A_s(t_{k-1}) · (1 − 0.5 · f_l(t_k)),  when f_l(t_k) > 0.1
A_s(t_k) = 1.05 · A_s(t_{k-1}),                  when f_l(t_k) < 0.02
A_s(t_k) = A_s(t_{k-1}),                         otherwise        (1-1)
where A_s(t_k) represents the compression code rate of the sending end, f_l(t_k) is the packet loss rate parameter, k and k-1 are the sequence numbers of transmitted data packets, and t_k and t_{k-1} respectively indicate the times at which the k-th and (k-1)-th data packets are transmitted.
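The loss-based rate update can be sketched directly in code. This is a minimal sketch assuming the standard GCC loss-based factors (a back-off proportional to half the loss rate, a 5% upward probe) together with an illustrative preset maximum code rate; the function and parameter names are placeholders, not from the patent.

```python
# Loss-based sender rate control in the style of formula (1-1):
# the new compression code rate is derived from the previous rate and
# the packet loss rate reported by the client.

def update_send_rate(prev_rate_bps, loss_rate, max_rate_bps=64000):
    """Return the new compression code rate A_s(t_k) in bits per second."""
    if loss_rate > 0.1:                        # congested: back off
        new_rate = prev_rate_bps * (1 - 0.5 * loss_rate)
    elif loss_rate < 0.02:                     # healthy: probe upward by 5%
        new_rate = prev_rate_bps * 1.05
    else:                                      # in between: hold steady
        new_rate = prev_rate_bps
    return round(min(new_rate, max_rate_bps))  # respect the preset maximum

print(update_send_rate(32000, 0.2))    # -> 28800 (decrease under congestion)
print(update_send_rate(32000, 0.0))    # -> 33600 (increase when healthy)
print(update_send_rate(32000, 0.05))   # -> 32000 (unchanged)
```

Clamping to a maximum rate mirrors the "preset maximum code rate" condition described above: even under a perfect network, the sender never probes past that ceiling.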
The server 20 may further include a speech synthesis module capable of synthesizing the received text data into audio data corresponding thereto.
Further, the server 20 may further include an audio encoder, which is capable of compressing the audio data obtained by the synthesis according to the compression code rate of the sending end calculated by the QoS module, so as to facilitate network transmission, that is, obtain audio data with different qualities.
In addition, although the text-to-speech processing capability of the server is stronger than that of the client, when the network state is poor the technical bottleneck of speech synthesis shifts from text-to-speech processing capability to network transmission, which gives the user a poor experience.
Therefore, in a preferred embodiment, the server 20 may also determine the current network condition according to the obtained network parameters, perform online speech synthesis at the server side when the network condition is good, and send a control instruction for performing speech synthesis to the client side when the network condition is poor, so as to perform offline speech synthesis at the client side, thereby avoiding delay, pause and the like caused by poor network, ensuring that the user obtains smooth audio data playback, and ensuring user experience.
Therefore, whether to switch to performing the conversion locally at the client when the network connection state is poor can be decided by weighing the text-to-speech conversion processing against the network transmission.
In other words, when the user experience under a poor network (e.g., fluency, transmission speed) would be worse than the effect of local conversion, the network connection state is considered poor and the text-to-speech conversion processing is performed locally at the client.
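The online/offline switch described above reduces to a threshold test on the reported network state. The sketch below uses the packet loss rate as that state; the threshold value is purely illustrative, since the patent does not specify one.

```python
# Hypothetical decision rule: above an assumed loss-rate threshold, the
# server skips online synthesis and instead instructs the client to
# generate the audio data offline (the patent names no threshold value).

POOR_NETWORK_LOSS_THRESHOLD = 0.3   # illustrative assumption

def choose_synthesis_mode(loss_rate):
    """Return 'server_online' or 'client_offline'."""
    if loss_rate >= POOR_NETWORK_LOSS_THRESHOLD:
        return 'client_offline'   # send the "generate audio offline" command
    return 'server_online'        # synthesize and stream from the server
```

In practice the decision could also factor in transmission speed or round-trip time, per the "fluency, transmission speed" trade-off mentioned above.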
It should be understood by those skilled in the art that the text data of the present invention may include, but is not limited to, text data input by a user, and the text data of the present invention is suitable for any application scenario.
Taking a novel as an example, in an application scene of online synthesis and playing of the novel, a novel reading module in the application can segment text data corresponding to the novel, call a client input interface of the speech synthesis system, transmit the text data to the speech synthesis system, and then after receiving the text data, a server side performs online speech synthesis and returns audio data to the client side for playing. Meanwhile, in the process, the service end can generate audio data with different qualities by combining the GCC algorithm, the QoS module of the client end can dynamically adjust jitter buffering according to the change of the network environment, and decoding buffering is introduced to ensure that the audio data is smoothly played.
Furthermore, the speech synthesis scheme of the present invention can also be implemented as a speech synthesis method that can be executed separately on the server side and on the client side.
Fig. 3 shows a flow diagram of a speech synthesis method according to an embodiment of the invention. The speech synthesis method may be performed on the server side as shown in fig. 1.
As shown in fig. 3, in step S310, text-to-speech conversion processing is performed on text data in a client data packet from a client, so as to obtain audio data corresponding to the text data.
The client data packet may include, but is not limited to, text data input by the user and/or a network parameter (e.g., packet loss rate) calculated by the client; the network parameter (e.g., the packet loss rate parameter) may be calculated by the client according to statistical information of previously received audio data packets. For the text-to-speech conversion processing itself, reference may be made to existing text-to-speech conversion techniques, which are not described herein again.
In step S320, a bitrate for compressing the audio data is determined based on a network connection state between the client and the server.
The network connection state is a description of the network environment or network conditions. Here, the network connection state may be determined based on a packet loss rate parameter from a client. Further, a (transmitting end) code rate for compressing the synthesized audio data is determined based on the current network connection state. Wherein the code rate can be calculated by the above-described calculation formula (1-1).
Wherein the determining of the code rate for compressing the audio data may comprise: and under the condition that the network connection state is good, determining a code rate for compressing the audio data based on the packet loss rate parameter.
It should be understood that, the present invention does not limit the sequence of the step of synthesizing audio data and the step of determining the code rate, and the two steps may be performed simultaneously, or the network connection state and the code rate may be determined first, and then the above text-to-speech conversion processing, i.e. the online speech synthesis, is performed at the server side under the condition that the current network status is good.
Then, in step S330, the audio data is compressed based on the code rate, so as to obtain an audio data packet after compression.
Finally, in step S340, the audio data packet is returned to the client.
In addition, in the opposite case, for example when the network condition is poor, the server may follow a flow that differs from the steps above.
For example, the server may first determine the network connection state between the client and the server based on the packet loss rate parameter from the client. If the network connection state is poor, the server neither performs text-to-speech conversion on the text data nor compresses audio data (i.e., steps S310 and S330 are skipped), and instead sends the client an instruction to generate the audio data offline in step S340.
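The server-side branch just described (steps S310 through S340 when the network is good, an offline-generation instruction otherwise) could be sketched as follows. Here `synthesize`, `compress` and `choose_code_rate` are placeholder stand-ins for the real TTS engine, audio codec and formula (1-1), and `LOSS_THRESHOLD` is an assumed cutoff, none of which are specified in this excerpt.

```python
LOSS_THRESHOLD = 0.2  # assumed cutoff between "good" and "poor" network

def synthesize(text: str) -> bytes:
    # Placeholder for the existing text-to-speech engine (step S310).
    return text.encode("utf-8")

def compress(audio: bytes, rate_kbps: int) -> bytes:
    # Placeholder for the real audio codec (step S330): here it just
    # truncates proportionally to the chosen rate for illustration.
    return audio[: max(1, len(audio) * rate_kbps // 64)]

def choose_code_rate(loss_rate: float) -> int:
    # Hypothetical stand-in for formula (1-1) of step S320.
    return round(64 - 48 * min(max(loss_rate, 0.0), 1.0))

def handle_client_packet(text: str, loss_rate: float) -> dict:
    """Server-side flow of Fig. 3 for one client data packet."""
    if loss_rate >= LOSS_THRESHOLD:
        # Poor connection: skip S310/S330, send offline instruction (S340).
        return {"type": "instruction", "action": "generate_audio_offline"}
    audio = synthesize(text)                                   # S310
    packet = compress(audio, choose_code_rate(loss_rate))      # S320 + S330
    return {"type": "audio", "payload": packet}                # S340
```

In practice the reply would be sent over the network connection rather than returned as a dict.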
Fig. 4 shows a flow diagram of a speech synthesis method according to an embodiment of the invention. The speech synthesis method may be performed on the client side as shown in fig. 1.
As shown in fig. 4, in step S410, a network connection state between the client and the server or a parameter that can be used to determine the network connection state is sent to the server.
The parameter that can be used to determine the network connection state may be a packet loss rate parameter, which the client calculates and updates locally from statistics of previously received audio data packets.
Before step S410, the client may determine the network connection state from the packet loss rate parameter and then send that state to the server in step S410. Alternatively, in step S410 the client may send the packet loss rate parameter itself, so that the server determines the network connection state from it.
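One way a client might derive such a packet loss rate is from gaps in the sequence numbers of received audio packets. The excerpt does not specify the statistics used, so this sketch assumes packets carry monotonically increasing sequence numbers:

```python
def packet_loss_rate(received_seqs: list[int]) -> float:
    """Estimate loss as the fraction of sequence numbers missing from
    the observed range. Assumes monotonically increasing sequence
    numbers (an assumption, not stated in the specification)."""
    if not received_seqs:
        return 0.0
    expected = max(received_seqs) - min(received_seqs) + 1
    return 1.0 - len(set(received_seqs)) / expected
```

A production implementation would also handle sequence-number wraparound and reordering windows, as RTP receivers do.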
In step S420, the text data input by the user is sent to the server.
The client may send the text data together with the network connection state, or together with the parameter usable to determine that state. In a preferred embodiment, the client packages the text data input by the user and the network connection state (or the related parameter) into a client data packet and sends that packet to the server, where the speech synthesis is then performed.
In step S430, the client receives the audio data packet from the server and plays speech based on it. The audio data can be played directly at the client, or it can be stored or sent to other equipment for playback.
The audio data packet may include audio data corresponding to the text data, and its compression code rate is related to the network connection state or to the parameter usable to determine that state; that is, the code rate may be determined from either of the two.
In addition, to ensure smooth playback of the audio provided to the user, the client can adaptively and dynamically adjust the size of its local jitter buffer according to the packet loss rate parameter.
Because packet loss rate, delay and jitter buffering are interrelated, the client can dynamically size its local jitter buffer by combining the packet loss rate with a preset delay threshold. This variable-size, adaptive jitter buffer preserves both the continuity of audio playback and low playback latency at the client.
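A minimal sketch of this adaptive sizing, assuming a linear growth factor and treating the preset delay threshold as a hard cap (both coefficients are illustrative choices, not values from the specification):

```python
def jitter_buffer_ms(loss_rate: float,
                     base_ms: int = 60,
                     max_delay_ms: int = 400) -> int:
    """Grow the jitter buffer as packet loss rises, but never let the
    buffered audio exceed the preset delay threshold."""
    loss_rate = min(max(loss_rate, 0.0), 1.0)  # clamp to [0, 1]
    target = base_ms * (1 + 4 * loss_rate)     # assumed linear growth
    return int(min(target, max_delay_ms))
```

Real jitter buffers (e.g., in VoIP stacks) additionally track inter-arrival jitter, not only loss; this sketch covers just the loss-driven adjustment the text describes.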
To further ensure continuity of audio playback, the client also introduces a decoding buffer: decoded audio data is placed into the buffer, and playback then fetches audio data from it. The decoding buffer prevents playback stalls caused by slow decoding on machines with poor performance.
It likewise avoids the drop in decoding efficiency, and the resulting decoding jitter (inconsistent delay between decoded frames), that occurs when other processes occupy the CPU.
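The decoding buffer can be sketched as a bounded producer/consumer queue: a decoder thread fills it while playback drains it, so a momentarily slow decoder does not immediately stall playback. `decode` and `play` below are placeholders for the real codec and audio output, which this excerpt does not name.

```python
import queue
import threading

def decode(pkt: bytes) -> bytes:
    # Placeholder for the real audio decoder.
    return pkt

def play(frame: bytes, out: list) -> None:
    # Placeholder for the audio output device; records frames here.
    out.append(frame)

def run_pipeline(packets: list[bytes]) -> list[bytes]:
    """Decode on a worker thread into a bounded buffer; play from it."""
    buf: queue.Queue = queue.Queue(maxsize=32)  # the decoding buffer

    def decoder() -> None:
        for pkt in packets:
            buf.put(decode(pkt))  # blocks if the buffer is full
        buf.put(None)             # end-of-stream marker

    threading.Thread(target=decoder, daemon=True).start()
    played: list[bytes] = []
    while (frame := buf.get()) is not None:
        play(frame, played)
    return played
```

The bounded `maxsize` keeps memory in check, while the queue's blocking `get` lets playback ride out short decoding hiccups.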
In addition, the client can implement an offline speech synthesis function locally and prefer offline synthesis when network conditions are poor. This avoids the network delay, jitter and packet loss that a poor connection would cause, so that the audio corresponding to the user's text input can still be played smoothly at the client.
Thus, after determining the network connection state between the client and the server, the client may execute steps S420 and S430 only when the state is good. When the network connection state is poor, the client can perform the text-to-speech conversion locally instead of sending the text data to the server.
In addition, the client can receive a control instruction from the server: in response to the server's instruction to generate audio data offline, the client performs the text-to-speech conversion on the text data locally.
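The client-side fallback logic (offline synthesis when the network is poor, or when the server instructs it) might look like the sketch below; `offline_tts` and the loss-rate threshold are assumptions for illustration, and `server` stands for whatever call sends the text and receives the reply.

```python
LOSS_THRESHOLD = 0.2  # assumed "poor network" cutoff

def offline_tts(text: str) -> bytes:
    # Placeholder for the client's local (offline) TTS engine.
    return b"offline:" + text.encode("utf-8")

def client_synthesize(text: str, loss_rate: float, server) -> bytes:
    """Client-side flow of Fig. 4: use the server when the network is
    good; fall back to offline synthesis when it is poor or when the
    server replies with an offline-generation instruction."""
    if loss_rate >= LOSS_THRESHOLD:
        return offline_tts(text)            # skip S420/S430 entirely
    reply = server(text, loss_rate)         # S420 + S430
    if reply.get("action") == "generate_audio_offline":
        return offline_tts(text)            # server-directed fallback
    return reply["payload"]
```

Either path yields playable audio, which is the point of the fallback: the user hears speech regardless of network quality.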
The speech synthesis methods executed at the server and the client have now been described in detail with reference to fig. 3 and fig. 4, respectively. These methods adapt to different network conditions and ensure that the user obtains smooth playback of the speech data.
Fig. 5 shows a schematic structural diagram of a server according to an embodiment of the present invention. The server may be used to implement the speech synthesis method as shown in fig. 3.
As shown in fig. 5, the server 500 for speech synthesis of the present invention may include a first text-to-speech conversion unit 510, a code rate determination unit 520, a compression unit 530, and a first transmission unit 540.
The first text-to-speech conversion unit 510 may perform text-to-speech conversion processing on text data in a client data packet from the client, so as to obtain audio data corresponding to the text data.
The code rate determining unit 520 may determine a code rate for compressing the audio data based on a network connection state between the client and the server.
The compressing unit 530 may perform compression processing on the audio data based on the code rate to obtain a compressed audio data packet.
The first transmission unit 540 may return the audio data packet to the client.
In a preferred embodiment, the server 500 may further include a first network status determining unit (not shown).
The first network state determination unit may determine the network connection state based on a packet loss rate parameter from the client. The code rate determining unit 520 may determine the code rate for compressing the audio data based on the packet loss rate parameter when the network connection status is good.
In a preferred embodiment, the server 500 may further include an instruction control unit (not shown). The instruction control unit is capable of generating an instruction to generate audio data offline in a case where the network connection state is poor. In the case of a poor network connection state, the first text-to-speech conversion unit 510 does not perform text-to-speech conversion processing on the text data, or the compression unit 530 does not perform compression processing on the audio data, and the first transmission unit 540 sends an instruction for generating the audio data offline to the client.
Fig. 6 shows a schematic structural diagram of a client according to an embodiment of the present invention. The client is used to implement the speech synthesis method as shown in fig. 4.
As shown in fig. 6, the client 600 for speech synthesis of the present invention may include a second transmission unit 610, a third transmission unit 620, and a fourth transmission unit 630.
The second transmission unit 610 can transmit the network connection state between the client and the server or parameters that can be used to determine the network connection state to the server. In one embodiment, the relevant parameter that can be used to determine the network connection status may be a packet loss rate parameter.
The third transmission unit 620 can transmit the text data input by the user to the server.
The fourth transmission unit 630 is capable of receiving an audio data packet from the server, where the audio data packet includes audio data corresponding to the text data, and a compression code rate of the audio data packet is related to the network connection status.
In a preferred embodiment, the second transmission unit 610, the third transmission unit 620 and the fourth transmission unit 630 may be implemented by a single multiplexed module, for example the network module on the client side shown in fig. 2.
In a preferred embodiment, the client 600 may further include a computing unit (not shown). The calculating unit may calculate and update the packet loss rate parameter according to the statistical information of the received audio data packet.
In a preferred embodiment, the client 600 may further include a second network status determining unit (not shown in the figure). The second network state determination unit may be capable of determining the network connection state based on a packet loss rate parameter. When the network connection state is good, the third transmission unit 620 sends the text data to the server.
In a preferred embodiment, the client 600 may further include a second text-to-speech conversion unit (not shown). The second text-to-speech conversion unit can perform text-to-speech conversion processing on the text data locally at the client under the condition of poor network connection state. Or, in response to receiving an instruction for generating audio data offline from the server, the second text-to-speech conversion unit can perform text-to-speech conversion processing on the text data locally at the client.
In a preferred embodiment, the client 600 may further include a jitter buffer adjustment unit (not shown). The jitter buffer adjustment unit can dynamically adjust the size of the local jitter buffer according to the packet loss rate parameter.
The server and the client for performing speech synthesis according to the present invention have now been described in detail with reference to figs. 5 and 6. For complex and changeable network environments, the invention introduces a series of mechanisms, such as network condition monitoring, network congestion control and dynamic adjustment of local buffers, to keep the whole speech synthesis process smooth. This avoids the delay and stalling caused by network changes, ensures the timeliness and continuity of audio playback at the client, and improves the user experience.
Fig. 7 is a schematic structural diagram of a computing device that can be used to implement the above-described speech synthesis method according to an embodiment of the present invention.
Referring to fig. 7, computing device 700 includes memory 710 and processor 720.
Processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, processor 720 may include a general-purpose host processor and one or more special-purpose coprocessors, such as a Graphics Processing Unit (GPU) or a Digital Signal Processor (DSP). In some embodiments, processor 720 may be implemented using custom circuits, such as an Application-Specific Integrated Circuit (ASIC) or a Field-Programmable Gate Array (FPGA).
The memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by processor 720 or other modules of the computer. The permanent storage may be a read-write storage device, and may be non-volatile, so that stored instructions and data are not lost even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) serves as the permanent storage; in other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. In addition, the memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 710 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., an SD card, a Mini SD card, a Micro SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 710 stores executable code that, when executed by the processor 720, causes the processor 720 to perform the speech synthesis methods described above.
The speech synthesis scheme according to the invention has been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (25)

1. A speech synthesis method performed at a server, comprising:
performing text-to-speech conversion processing on text data in a client data packet from a client to obtain audio data corresponding to the text data;
determining a code rate for compressing the audio data based on a network connection state between the client and the server;
compressing the audio data based on the code rate to obtain a compressed audio data packet; and
returning the audio data packet to the client.
2. The method of claim 1, further comprising:
determining the network connection state based on a packet loss rate parameter from the client.
3. The method of claim 2, wherein the determining a code rate for compressing the audio data comprises:
determining, in a case where the network connection state is good, the code rate for compressing the audio data based on the packet loss rate parameter.
4. The method according to claim 2, wherein, in a case where the network connection state is poor, the server performs neither text-to-speech conversion processing on the text data nor compression processing on the audio data, and sends an instruction to generate audio data offline to the client.
5. The method of claim 2, wherein,
the packet loss rate parameter is calculated by the client from statistics of previously received audio data packets.
6. A server for speech synthesis, comprising:
a first text-to-speech conversion unit, configured to perform text-to-speech conversion processing on text data in a client data packet from a client to obtain audio data corresponding to the text data;
a code rate determining unit, configured to determine a code rate for compressing the audio data based on a network connection state between the client and the server;
a compression unit, configured to compress the audio data based on the code rate to obtain a compressed audio data packet; and
a first transmission unit, configured to return the audio data packet to the client.
7. The server of claim 6, further comprising:
a first network state determining unit, configured to determine the network connection state based on a packet loss rate parameter from the client.
8. The server according to claim 7, wherein
the code rate determining unit determines the code rate for compressing the audio data based on the packet loss rate parameter in a case where the network connection state is good.
9. The server of claim 7, further comprising:
an instruction control unit, configured to generate an instruction to generate audio data offline in a case where the network connection state is poor,
wherein, in the case where the network connection state is poor, the first text-to-speech conversion unit does not perform text-to-speech conversion processing on the text data, or the compression unit does not perform compression processing on the audio data, and the first transmission unit sends the instruction to generate the audio data offline to the client.
10. A speech synthesis method performed at a client, comprising:
sending, to a server, a network connection state between a client and the server or a parameter usable to determine the network connection state;
sending text data input by a user to the server; and
receiving an audio data packet from the server, wherein the audio data packet comprises audio data corresponding to the text data, and a compression code rate of the audio data packet is related to the network connection state.
11. The method of claim 10, further comprising:
determining the network connection state based on a packet loss rate parameter.
12. The method of claim 11, wherein,
the text data is sent to the server in a case where the network connection state is good.
13. The method of claim 12, further comprising:
performing, in a case where the network connection state is poor, text-to-speech conversion processing on the text data locally at the client; or
performing, in response to an instruction from the server to generate audio data offline, text-to-speech conversion processing on the text data locally at the client.
14. The method of claim 11, wherein,
the parameter that can be used to determine the network connection status is a packet loss rate parameter.
15. The method according to any of claims 11-14, further comprising:
calculating and updating the packet loss rate parameter based on statistics of received audio data packets.
16. The method of claim 15, further comprising:
dynamically adjusting the size of a local jitter buffer according to the packet loss rate parameter.
17. A client for speech synthesis, comprising:
a second transmission unit, configured to send, to a server, a network connection state between the client and the server or a parameter usable to determine the network connection state;
a third transmission unit, configured to send text data input by a user to the server; and
a fourth transmission unit, configured to receive an audio data packet from the server, wherein the audio data packet comprises audio data corresponding to the text data, and a compression code rate of the audio data packet is related to the network connection state.
18. The client of claim 17, further comprising:
a second network state determining unit, configured to determine the network connection state based on a packet loss rate parameter.
19. The client according to claim 18, wherein,
the third transmission unit sends the text data to the server in a case where the network connection state is good.
20. The client of claim 18, further comprising:
a second text-to-speech conversion unit, configured to perform text-to-speech conversion processing on the text data locally at the client in a case where the network connection state is poor, or
to perform text-to-speech conversion processing on the text data locally at the client in response to receiving an instruction from the server to generate audio data offline.
21. The client according to claim 18, wherein,
the parameter that can be used to determine the network connection status is a packet loss rate parameter.
22. The client according to any of claims 18-21, further comprising:
a calculating unit, configured to calculate and update the packet loss rate parameter based on statistics of received audio data packets.
23. The client of claim 22, further comprising:
a jitter buffer adjusting unit, configured to dynamically adjust the size of the local jitter buffer according to the packet loss rate parameter.
24. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-5, 10-16.
25. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-5, 10-16.
CN201811270533.6A 2018-10-29 2018-10-29 Speech synthesis method, speech synthesis device, computing equipment and storage medium Pending CN111105778A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811270533.6A CN111105778A (en) 2018-10-29 2018-10-29 Speech synthesis method, speech synthesis device, computing equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111105778A true CN111105778A (en) 2020-05-05

Family

ID=70419671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811270533.6A Pending CN111105778A (en) 2018-10-29 2018-10-29 Speech synthesis method, speech synthesis device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111105778A (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1599352A (en) * 2003-09-17 2005-03-23 上海贝尔阿尔卡特股份有限公司 Regulating method of adaptive oscillation buffer zone of packet switching network
CN101022369A (en) * 2007-03-23 2007-08-22 中山大学 End-to-end queuing time delay measuring method
CN101119338A (en) * 2007-09-20 2008-02-06 腾讯科技(深圳)有限公司 Network voice communication method, system, device and instant communication terminal
CN104702579A (en) * 2013-12-09 2015-06-10 华为技术有限公司 Method and device used for determining cache state of user equipment
CN105530449A (en) * 2014-09-30 2016-04-27 阿里巴巴集团控股有限公司 Coding parameter adjusting method and device
CN105610635A (en) * 2016-02-29 2016-05-25 腾讯科技(深圳)有限公司 Voice code transmitting method and apparatus
CN106210925A (en) * 2015-05-05 2016-12-07 阿里巴巴集团控股有限公司 The decoding method of a kind of real-time media stream and device
CN106412032A (en) * 2016-09-14 2017-02-15 安徽声讯信息技术有限公司 Remote audio character transmission method and system
CN106452663A (en) * 2015-08-11 2017-02-22 阿里巴巴集团控股有限公司 Network communication data transmission method based on RTP protocol, and communication equipment
CN107274884A (en) * 2017-02-15 2017-10-20 赵思聪 A kind of information acquisition method based on text resolution and phonetic synthesis
CN107979482A (en) * 2016-10-25 2018-05-01 腾讯科技(深圳)有限公司 A kind of information processing method, device, transmitting terminal, debounce moved end, receiving terminal


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Junbin; Jin Xinyu; Zhang Yu: "Research on QoS of an NS-2-based adaptive multi-rate VoIP system with low packet loss rate", no. 04 *
He Xin, Li Bin (eds.): "Handover Technology in Heterogeneous Wireless Networks", Beijing: Beijing University of Posts and Telecommunications Press, pages 249-251 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4071752A4 (en) * 2019-12-30 2023-01-18 Huawei Technologies Co., Ltd. Text-to-voice processing method, terminal and server
CN112202803A (en) * 2020-10-10 2021-01-08 北京字节跳动网络技术有限公司 Audio processing method, device, terminal and storage medium
CN112562638A (en) * 2020-11-26 2021-03-26 北京达佳互联信息技术有限公司 Voice preview method and device and electronic equipment
CN113035205A (en) * 2020-12-28 2021-06-25 阿里巴巴(中国)有限公司 Audio packet loss compensation processing method and device and electronic equipment
CN113035205B (en) * 2020-12-28 2022-06-07 阿里巴巴(中国)有限公司 Audio packet loss compensation processing method and device and electronic equipment
CN114785772A (en) * 2022-04-27 2022-07-22 广州宸祺出行科技有限公司 Method and device for downloading network car booking audio with corresponding code rate based on download rate


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200505