CN112581934A - Voice synthesis method, device and system

Info

Publication number: CN112581934A
Application number: CN201910944037.2A
Authority: CN
Inventors: 陈孝良; 张国超; 邢越峰; 苏少炜
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2021-03-30

Abstract

The invention provides a speech synthesis method, device and system. A server receives request information that is sent by a client and carries text information, and obtains a text to be synthesized according to the text information; segments the text to be synthesized into at least one sub-text according to a preset processing rule; performs TTS speech synthesis on the sub-texts in their order to obtain a synthesis result; and sends response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner. The client and the server establish a single TCP connection to complete one TTS speech synthesis. The server segments the text to be synthesized and asynchronously synthesizes and sends the resulting sub-texts, without waiting for the whole text to be synthesized before sending, which improves the response efficiency of the TTS speech synthesis service.

Description

Voice synthesis method, device and system

Technical Field

The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, apparatus, and system.

Background

TTS (Text-To-Speech) converts text into speech and provides a speech synthesis service to users; the response efficiency of that service is a major concern.

In the prior art, TTS speech synthesis is completed through interaction between a mobile terminal and a cloud application platform. A TCP connection is first established between the mobile terminal and the cloud application platform, and the mobile terminal sends an HTTP POST request to the cloud application platform. When the TTS speech synthesis result is subsequently transmitted, a TCP connection must be established between the two again, and the result is transmitted over RTSP (Real Time Streaming Protocol). One TTS speech synthesis therefore requires two TCP connections, and the response efficiency of the TTS speech synthesis service is low.

Disclosure of Invention

In view of this, the invention provides a speech synthesis method, device and system, which improve the response efficiency of TTS speech synthesis service.

In order to achieve the above purpose, the invention provides the following specific technical scheme:

a speech synthesis method is applied to a server side, and comprises the following steps:

receiving request information carrying text information sent by a client, and acquiring a text to be synthesized according to the text information;

segmenting the text to be synthesized into at least one sub-text according to a preset processing rule;

performing TTS voice synthesis on the sub-texts according to the sequence of the sub-texts to obtain a synthesis result;

and sending response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.

Optionally, before receiving the request information carrying the text information sent by the client, the method further includes:

and establishing TCP connection with the client.

Optionally, the text information is the text to be synthesized, the acquisition address of the text to be synthesized, or the identifier of the text to be synthesized.

Optionally, the segmenting the text to be synthesized into at least one sub-text according to a preset processing rule includes:

and segmenting the text to be synthesized into at least one sub-text with the length within a preset range according to the sentence logic of the text to be synthesized.

Optionally, sending the response information carrying the synthesis result to the client using chunked transfer encoding includes:

setting the transfer mode to chunked transfer encoding in a response header of the response information;

writing the synthesis result and the length of the synthesis result into a response body of the response information;

and sending the response information to the client.

Optionally, when the sub-text is the last sub-text in the text to be synthesized, the method further includes:

and adding an end mark in a response body of the response information.

A speech synthesis method is applied to a client, and comprises the following steps:

segmenting a text to be synthesized into at least one sub-text according to a preset processing rule;

generating request information carrying the sub-text information corresponding to the sub-texts;

sending the request information to the server using chunked transfer encoding, in the order of the sub-texts;

and receiving response information which is sent by the server and carries the synthesis result, and outputting the synthesis result in a streaming mode.

Optionally, before segmenting the text to be synthesized into at least one sub-text according to the preset processing rule, the method further includes:

and establishing TCP connection with the server.

Optionally, the method further includes:

and when the received response information carries an end mark, disconnecting the TCP connection with the server.

A speech synthesis device is arranged at a server side, and the device comprises:

a to-be-synthesized text acquisition unit, configured to receive request information that is sent by a client and carries text information, and to acquire a text to be synthesized according to the text information;

a first to-be-synthesized text segmentation unit, configured to segment the text to be synthesized into at least one sub-text according to a preset processing rule;

a TTS speech synthesis unit, configured to perform TTS speech synthesis on the sub-texts in the order of the sub-texts to obtain a synthesis result;

and a synthesis result sending unit, configured to send response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.

Optionally, the apparatus further comprises:

and the first connection establishing unit is used for establishing TCP connection with the client.

Optionally, the first to-be-synthesized text segmentation unit is specifically configured to segment the to-be-synthesized text into at least one sub-text with a length within a preset range according to the sentence logic of the to-be-synthesized text.

Optionally, the synthesis result sending unit is specifically configured to set the transfer mode to chunked transfer encoding in a response header of the response information, write the synthesis result and the length of the synthesis result into a response body of the response information, and send the response information to the client.

Optionally, the synthesis result sending unit is further configured to add an end mark in a response body of the response information when the sub text is the last sub text in the text to be synthesized.

A speech synthesis apparatus provided at a client, the apparatus comprising:

a second to-be-synthesized text segmentation unit, configured to segment a text to be synthesized into at least one sub-text according to a preset processing rule;

a request information generating unit, configured to generate request information that corresponds to the sub-texts and carries the sub-text information;

a request information sending unit, configured to send the request information to the server using chunked transfer encoding, in the order of the sub-texts;

and the synthesis result output unit is used for receiving the response information which is sent by the server and carries the synthesis result and outputting the synthesis result in a streaming mode.

Optionally, the apparatus further comprises:

and the second connection establishing unit is used for establishing TCP connection with the server.

Optionally, the apparatus further comprises:

and the TCP connection disconnection unit is used for disconnecting the TCP connection with the server side when the received response information carries the ending mark.

A speech synthesis system comprises a client and a server;

the server is configured to execute the above speech synthesis method applied to a server;

the client is configured to execute the above speech synthesis method applied to a client.

Compared with the prior art, the invention has the following beneficial effects:

The invention discloses a speech synthesis method. Once a TCP connection has been established between the client and the server, the server can perform TTS speech synthesis directly after receiving the request information sent by the client, without establishing another TCP connection with the client; that is, one TTS speech synthesis is completed over a single TCP connection between the client and the server. In addition, the server segments the text to be synthesized and asynchronously synthesizes and sends the resulting sub-texts, without waiting for the whole text to be synthesized before sending, thereby improving the response efficiency of the TTS speech synthesis service.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a timing diagram of a speech synthesis method in the prior art;

FIG. 2 is a schematic flowchart of a speech synthesis method applied to a server according to an embodiment of the present invention;

FIG. 3 is a schematic flowchart of a speech synthesis method applied to a client according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a speech synthesis apparatus disposed at a server according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a speech synthesis apparatus disposed at a client according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a speech synthesis system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the prior art, TTS speech synthesis is completed through interaction between a mobile terminal and a cloud application platform, as shown in FIG. 1. First, the mobile terminal establishes a TCP connection with the cloud application platform and sends an HTTP POST request to it. The cloud application platform sends 'OK' to the mobile terminal, indicating that it has received the POST request. The cloud application platform requests the text that needs TTS synthesis from the resource server according to the text identifier in the POST request and, after obtaining the text, performs TTS synthesis on it. The mobile terminal receives the 'OK' message from the cloud application platform, learns that the request is allowed, and then initiates an RTSP connection request to the cloud application platform. The cloud application platform completes the connection with the mobile terminal and returns 'OK'. After completing TTS synthesis, the cloud application platform sends the synthesized audio to the mobile terminal through the RTSP channel. The mobile terminal receives the audio, and one TTS synthesis process is finished.

The method for realizing speech synthesis in the prior art has the following defects:

1. When the mobile terminal first contacts the cloud application platform, one TCP connection is established to carry the HTTP POST request; when the TTS synthesis result is subsequently transmitted, another TCP connection is established and the result is transmitted over the upper-layer RTSP protocol. One TTS synthesis request therefore requires two TCP connections, which is time-consuming.

2. After obtaining the text, the cloud application platform completes the entire TTS synthesis before sending the synthesized audio. If the text is long, the whole request is blocked in the TTS synthesis stage, and the TTS speech synthesis service responds with a delay.

3. Transmitting the TTS speech synthesis result over the RTSP channel adds further delay. Establishing the RTSP channel depends on an HTTP request, so each request for the TTS speech synthesis service first establishes an HTTP connection and then an RTSP connection, and the two connections consume more resources.

Therefore, the speech synthesis method in the prior art suffers from high latency and high resource consumption.

In order to solve the above technical problem, this embodiment discloses a speech synthesis method applied to a server, where the server may be a server, a cloud application platform, or the like that implements speech synthesis. Referring to FIG. 2, the speech synthesis method disclosed in this embodiment specifically includes the following steps:

S101: Receiving request information carrying text information sent by a client, and acquiring a text to be synthesized according to the text information;

The server needs to establish a TCP connection with the client before receiving the request information sent by the client.

The request information is an HTTP POST request.

The text information is a text to be synthesized, an acquisition address of the text to be synthesized or an identification of the text to be synthesized.

When the text information is the text to be synthesized, the text to be synthesized can be directly obtained according to the text information.

When the text information is the acquisition address of the text to be synthesized, the text to be synthesized can be acquired from the resource server according to the acquisition address of the text to be synthesized.

When the text information is the identifier of the text to be synthesized, the text to be synthesized corresponding to the identifier of the text to be synthesized can be acquired from the resource server.

When the text to be synthesized is an encrypted text, the text information may further include a password for extracting the text to be synthesized.

S102: segmenting a text to be synthesized into at least one sub-text according to a preset processing rule;

Specifically, the text to be synthesized is segmented, according to its sentence logic, into at least one sub-text whose length is within a preset range.

It should be noted that the preset processing rule includes segmenting according to the sub-text length range and segmenting according to sentence logic; segmentation of the text to be synthesized into at least one sub-text must satisfy both conditions at the same time.

The sentence logic of the text to be synthesized may be its sentence-break logic; whether a sentence break occurs can be determined from periods (full stops).

The sub-text length range may be 40960 bytes, for example, and can be preset according to the specific application scenario.

It is understood that when the length of the text to be synthesized is within the length range of the sub-text, the sub-text is the text to be synthesized.
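The two conditions of S102 (break only at sentence boundaries, keep each sub-text within the length range) can be sketched as a greedy packing of sentences. This is a minimal sketch, not the rule used by the invention: the function name `segment_text`, the set of sentence-ending punctuation marks, and the choice to measure length in UTF-8 bytes are assumptions made for illustration.

```python
import re
from typing import List

# Zero-width split point right after sentence-ending punctuation (assumed set).
SENTENCE_END = re.compile(r"(?<=[.!?。！？])")


def segment_text(text: str, max_bytes: int = 40960) -> List[str]:
    """Split text at sentence ends into sub-texts of at most max_bytes each."""
    sentences = [s for s in SENTENCE_END.split(text) if s]
    sub_texts: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = current + sentence
        if current and len(candidate.encode("utf-8")) > max_bytes:
            # Adding this sentence would exceed the range: close the sub-text.
            sub_texts.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        sub_texts.append(current)
    # Note: a single sentence longer than max_bytes is kept whole in this sketch.
    return sub_texts


print(segment_text("A. B. C.", max_bytes=5))  # ['A. B.', ' C.']
```

Concatenating the returned sub-texts reproduces the input, so the order of the sub-texts in the text to be synthesized is preserved.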

S103: performing TTS voice synthesis on the sub-texts according to the sequence of the sub-texts to obtain a synthesis result;

The order of the sub-texts is their order in the text to be synthesized.

If the text to be synthesized is segmented into sub-text A, sub-text B and sub-text C, and their order in the text to be synthesized is A-B-C, TTS speech synthesis is performed first on sub-text A, then on sub-text B, and finally on sub-text C.

S104: Sending response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.

Chunked transfer encoding is a data transfer mechanism in the Hypertext Transfer Protocol (HTTP) that allows the data transmitted over HTTP to be divided into multiple parts.

In the above example, once TTS speech synthesis of sub-text A is complete, response information carrying the synthesis result of sub-text A is sent to the client using chunked transfer encoding; then, once sub-text B has been synthesized, response information carrying its synthesis result is sent in the same way; finally, once sub-text C has been synthesized, response information carrying its synthesis result is sent in the same way.
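The A-B-C example can be sketched with a generator that hands over each synthesis result as soon as it is ready, so sub-text A can be sent before sub-text B is synthesized. This is a simplified, single-threaded sketch: `synthesize_in_order` and `fake_tts` are hypothetical names, and `fake_tts` merely encodes the text instead of producing audio.

```python
from typing import Callable, Iterable, Iterator, List


def synthesize_in_order(sub_texts: Iterable[str],
                        tts: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield one synthesis result per sub-text, in sub-text order."""
    for sub_text in sub_texts:
        # The result is handed to the caller before the next sub-text starts.
        yield tts(sub_text)


calls: List[str] = []


def fake_tts(sub_text: str) -> bytes:
    """Stand-in for a real TTS engine; records which sub-texts were processed."""
    calls.append(sub_text)
    return sub_text.encode("utf-8")


results = synthesize_in_order(["A", "B", "C"], fake_tts)
first = next(results)
print(first, calls)  # b'A' ['A']
```

At the point where the result for A is available, B and C have not yet been synthesized, which is the property the example relies on.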

Specifically, the response information corresponds to the request information sent by the client and represents a response to it; its structure includes a response header and a response body. For each sub-text, the transfer mode is set to chunked transfer encoding in the response header of the response information, the synthesis result of the sub-text and the length of the synthesis result are written into the response body of the response information, and the response information is sent to the client.

The TTS synthesis result is streaming audio (in an audio format such as PCM or MPEG), and the client can play each piece of audio immediately or perform other business processing.

It should be noted that, when the sub-text is the last sub-text in the text to be synthesized, the server adds an end mark in the response body of the response message, so that the client disconnects the TCP connection with the server after receiving the response message.
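The response header, the length written alongside each synthesis result, and the end mark can be illustrated with the framing of standard HTTP/1.1 chunked transfer encoding. One plausible reading, which the description does not state explicitly, is that the length corresponds to the hexadecimal chunk-size line and the end mark to the zero-length last chunk. The function names and the `Content-Type` value below are assumptions.

```python
from typing import Iterable, Iterator


def encode_chunk(payload: bytes) -> bytes:
    """Frame one payload as a chunk: hex size, CRLF, data, CRLF."""
    return f"{len(payload):X}".encode("ascii") + b"\r\n" + payload + b"\r\n"


# Zero-length chunk that terminates a chunked body.
LAST_CHUNK = b"0\r\n\r\n"


def chunked_response(audio_parts: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the response header, one chunk per synthesis result, then the end."""
    yield (b"HTTP/1.1 200 OK\r\n"
           b"Content-Type: audio/mpeg\r\n"        # assumed audio format
           b"Transfer-Encoding: chunked\r\n\r\n")
    for part in audio_parts:
        if part:  # an empty payload would be read as the terminating chunk
            yield encode_chunk(part)
    yield LAST_CHUNK


print(encode_chunk(b"hello"))  # b'5\r\nhello\r\n'
```

Because each chunk carries its own size, the total length of the audio does not need to be known when the header is sent.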

Therefore, in the speech synthesis method disclosed in this embodiment, once a TCP connection has been established between the client and the server, the server can perform TTS speech synthesis directly after receiving the request information sent by the client, without establishing another TCP connection with the client. That is, one TTS speech synthesis is completed over a single TCP connection between the client and the server, which improves the response efficiency of the TTS speech synthesis service and reduces its energy consumption.

Meanwhile, the server side divides the text to be synthesized, asynchronously synthesizes and sends the sub-texts obtained after division, and does not need to wait for the text to be synthesized to be completely synthesized and then send the sub-texts, so that the response efficiency of the TTS speech synthesis service is improved.
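The effect on response time can be illustrated with a small calculation. The synthesis times below are hypothetical values chosen for the example, not measurements, and network transfer time is ignored; the function name is likewise an assumption.

```python
from typing import Sequence


def time_to_first_audio(synth_seconds: Sequence[float], streamed: bool) -> float:
    """Seconds until the first audio can be sent (synthesis time only)."""
    if streamed:
        # Each sub-text result is sent as soon as it has been synthesized.
        return synth_seconds[0]
    # The whole text is synthesized before anything is sent.
    return sum(synth_seconds)


per_sub_text = [0.5, 0.5, 0.5]  # hypothetical synthesis time per sub-text
print(time_to_first_audio(per_sub_text, streamed=False))  # 1.5
print(time_to_first_audio(per_sub_text, streamed=True))   # 0.5
```

Under these assumptions the wait for the first audio depends only on the first sub-text, not on the length of the whole text.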

The speech synthesis method disclosed in the above embodiment improves the speech synthesis service on the server side. This embodiment discloses a speech synthesis method applied to a client, which improves the speech synthesis service on the client side. The client may be a mobile terminal, such as a smart phone, an intercom, a smart television, a PAD (portable android device), a PDA (personal digital assistant), and the like. Referring to FIG. 3, the speech synthesis method disclosed in this embodiment specifically includes the following steps:

S201: Segmenting a text to be synthesized into at least one sub-text according to a preset processing rule;

It should be noted that before the text to be synthesized is segmented into at least one sub-text, the client establishes a TCP connection with the server.

The preset processing rule includes segmenting according to the sub-text length range and segmenting according to sentence logic; both conditions must be satisfied at the same time when the text to be synthesized is segmented into at least one sub-text.

S202: generating request information carrying the sub-text information corresponding to the sub-texts;

The request information is an HTTP POST request that includes a request header and a request body; the transfer mode is set to chunked transfer encoding in the request header, and the text information of the text to be synthesized is written into the request body.

S203: Sending the request information to the server using chunked transfer encoding, in the order of the sub-texts;

If the text to be synthesized is divided into sub-text A, sub-text B and sub-text C, and their order in the text to be synthesized is A-B-C, the request information for sub-text A is sent first, then that for sub-text B, and finally that for sub-text C.
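Steps S202 and S203 can be sketched as one HTTP POST request whose body carries the sub-texts as successive chunks. This is one possible reading of the description, shown as raw bytes for clarity; the host name, the request path, the `Content-Type` value and the function name are hypothetical.

```python
from typing import Iterable


def build_chunked_post(host: str, path: str, sub_texts: Iterable[str]) -> bytes:
    """Build a POST request that sends each sub-text as one chunk, in order."""
    head = (f"POST {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Content-Type: text/plain; charset=utf-8\r\n"
            "Transfer-Encoding: chunked\r\n\r\n").encode("ascii")
    body = b""
    for sub_text in sub_texts:
        data = sub_text.encode("utf-8")
        if not data:
            continue  # an empty chunk would end the request body early
        # Chunk framing: hex size, CRLF, data, CRLF.
        body += f"{len(data):X}\r\n".encode("ascii") + data + b"\r\n"
    # The zero-length chunk marks the end of the request body.
    return head + body + b"0\r\n\r\n"


request = build_chunked_post("tts.example", "/synthesize", ["A.", "B."])
print(request.decode("utf-8"))
```

In a real client the chunks would be written to the connection one at a time rather than concatenated first; the sketch builds the whole request only so that the byte layout is easy to inspect.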

S204: and receiving response information which is sent by the server and carries the synthesis result, and outputting the synthesis result in a streaming mode.

Specifically, the received response information is parsed to obtain the synthesis result in its response body. When the response body carries an end mark, it is determined that all speech synthesis results for the text to be synthesized have been received, and the TCP connection with the server is disconnected.
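The parsing in S204 can be sketched as a reader that yields each chunk payload until it meets the terminating chunk. The sketch assumes that the end mark is the zero-length chunk of standard HTTP/1.1 chunked transfer encoding, that the response header has already been consumed, and that `stream` is a buffered binary stream; `iter_chunks` is a hypothetical name, and trailers after the last chunk are not handled.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each chunk payload from a chunked body until the zero-length chunk."""
    while True:
        size_line = stream.readline()
        if not size_line:
            raise EOFError("stream ended before the last chunk")
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size = int(size_line.split(b";")[0].strip(), 16)
        if size == 0:
            return  # end mark reached: all synthesis results were received
        data = stream.read(size)
        stream.read(2)  # consume the CRLF that follows the chunk data
        yield data


body = io.BytesIO(b"3\r\nabc\r\n2\r\nde\r\n0\r\n\r\n")
for audio in iter_chunks(body):
    print(audio)  # each piece could be played as soon as it arrives
```

After the generator returns, the client has everything it needs and can disconnect the TCP connection with the server.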

Meanwhile, the client takes advantage of the fact that chunked transfer encoding allows data to be sent in multiple parts: when the text to be processed is long, it is divided into several sub-texts, and request information carrying the sub-text information is sent multiple times. This optimizes the sending process for the text to be processed, improves its sending efficiency, and improves the response efficiency of the TTS speech synthesis service.

Based on the speech synthesis method applied to the server disclosed in the above embodiment, this embodiment correspondingly discloses a speech synthesis apparatus disposed at the server. Referring to FIG. 4, the speech synthesis apparatus includes:

a to-be-synthesized text obtaining unit 401, configured to receive request information carrying text information sent by a client, and obtain a to-be-synthesized text according to the text information;

a first text to be synthesized segmentation unit 402, configured to segment the text to be synthesized into at least one sub-text according to a preset processing rule;

a TTS speech synthesis unit 403, configured to perform TTS speech synthesis on the sub-texts according to the order of the sub-texts to obtain a synthesis result;

a synthesis result sending unit 404, configured to send response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.

Optionally, the apparatus further comprises:

a first connection establishing unit, configured to establish a TCP connection with the client.

Optionally, the first to-be-synthesized text segmenting unit 402 is specifically configured to segment the to-be-synthesized text into at least one sub-text with a length within a preset range according to the sentence logic of the to-be-synthesized text.

Optionally, the synthesis result sending unit 404 is specifically configured to set the transfer mode to chunked transfer encoding in a response header of the response information, write the synthesis result and the length of the synthesis result into a response body of the response information, and send the response information to the client.

Optionally, the synthesis result sending unit 404 is further configured to add an end flag in a response body of the response information when the sub text is the last sub text in the text to be synthesized.

Based on the speech synthesis method applied to the client disclosed in the above embodiments, this embodiment correspondingly discloses a speech synthesis apparatus disposed at the client. Referring to FIG. 5, the speech synthesis apparatus includes:

a second text to be synthesized segmentation unit 501, configured to segment a text to be synthesized into at least one sub-text according to a preset processing rule;

a request information generating unit 502, configured to generate request information carrying the sub-text information corresponding to the sub-texts;

a request information sending unit 503, configured to send the request information to the server using chunked transfer encoding, in the order of the sub-texts;

a synthesized result output unit 504, configured to receive response information carrying a synthesized result sent by the server, and output the synthesized result in a streaming manner.

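Optionally, the apparatus further comprises:

a second connection establishing unit, configured to establish a TCP connection with the server;

a TCP connection disconnection unit, configured to disconnect the TCP connection with the server when the received response information carries an end mark.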

Referring to fig. 6, the present embodiment discloses a speech synthesis system, which includes a client 601 and a server 602.

The client 601 is configured to execute the following speech synthesis method:

Further, before the segmenting the text to be synthesized into at least one sub-text according to the preset processing rule, the method further includes:

and establishing TCP connection with the server.

Further, the method further comprises:

The server 602 is configured to perform the following speech synthesis method:

Further, before receiving the request information carrying the text information sent by the client, the method further includes:

and establishing TCP connection with the client.

Further, the text information is the text to be synthesized, the acquisition address of the text to be synthesized, or the identifier of the text to be synthesized.

Further, the segmenting the text to be synthesized into at least one sub-text according to a preset processing rule includes:

Further, sending the response information carrying the synthesis result to the client using chunked transfer encoding includes:

and sending the response information to the client.

Further, when the sub-text is the last sub-text in the text to be synthesized, the method further includes:

and adding an end mark in a response body of the response information.

In the speech synthesis system disclosed in this embodiment, once a TCP connection has been established between the client and the server, the server can perform TTS speech synthesis directly after receiving the request information sent by the client, without establishing another TCP connection with the client. That is, one TTS speech synthesis is completed over a single TCP connection between the client and the server, which improves the response efficiency of the TTS speech synthesis service and reduces its energy consumption.

The client takes advantage of the fact that chunked transfer encoding allows data to be sent in multiple parts: when the text to be processed is long, it is divided into several sub-texts, and request information carrying the sub-text information is sent multiple times. This optimizes the sending process for the text to be processed, improves its sending efficiency, and improves the response efficiency of the TTS speech synthesis service.

The server segments the text to be synthesized and asynchronously synthesizes and sends the resulting sub-texts, without waiting for the whole text to be synthesized before sending, which improves the response efficiency of the TTS speech synthesis service.

The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. The apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, so it is described briefly; for relevant details, refer to the description of the method.

It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A speech synthesis method, applied to a server, the method comprising:

2. The method according to claim 1, wherein before receiving the request information carrying the text information sent by the client, the method further comprises:

and establishing TCP connection with the client.

3. The method according to claim 1, wherein the text information is the text to be synthesized, an acquisition address of the text to be synthesized, or an identifier of the text to be synthesized.

4. The method according to claim 1, wherein the segmenting the text to be synthesized into at least one sub-text according to a preset processing rule comprises:

5. The method of claim 1, wherein sending the response information carrying the synthesis result to the client using chunked transfer encoding comprises:

and sending the response information to the client.

6. The method according to claim 4, wherein when the sub-text is the last sub-text in the text to be synthesized, the method further comprises:

and adding an end mark in a response body of the response information.

7. A speech synthesis method applied to a client, the method comprising:

8. The method according to claim 7, wherein before the segmenting the text to be synthesized into at least one sub-text according to the preset processing rule, the method further comprises:

and establishing TCP connection with the server.

9. The method of claim 7, further comprising:

10. A speech synthesis apparatus, provided at a server, the apparatus comprising:

11. A speech synthesis apparatus provided at a client, the apparatus comprising:

12. A speech synthesis system is characterized by comprising a client and a server;

the server is used for executing the voice synthesis method according to any one of claims 1-6;

the client is used for executing the voice synthesis method according to any one of claims 7 to 9.