CN112581934A - Voice synthesis method, device and system - Google Patents

Publication number
CN112581934A
Authority
CN
China
Prior art keywords
text
synthesized
sub
client
synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910944037.2A
Other languages
Chinese (zh)
Inventor
陈孝良
张国超
邢越峰
苏少炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN201910944037.2A
Publication of CN112581934A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/14 Session management
    • H04L67/141 Setup of application sessions

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a speech synthesis method, device and system. A server receives request information carrying text information sent by a client and obtains a text to be synthesized according to the text information; segments the text to be synthesized into at least one sub-text according to a preset processing rule; performs TTS speech synthesis on the sub-texts in their order to obtain a synthesis result; and sends response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner. The client and the server need to establish only one TCP connection to complete one TTS speech synthesis. The server segments the text to be synthesized and asynchronously synthesizes and sends the resulting sub-texts, without waiting for the entire text to be synthesized before sending, thereby improving the response efficiency of the TTS speech synthesis service.

Description

Voice synthesis method, device and system
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, apparatus, and system.
Background
TTS (Text-To-Speech) converts text into speech and provides a speech synthesis service to users. The response efficiency of the speech synthesis service is therefore of great concern.
In the prior art, TTS speech synthesis is completed through interaction between a mobile terminal and a cloud application platform. A TCP connection is first established between the mobile terminal and the cloud application platform, over which the mobile terminal sends an HTTP POST request. When the TTS speech synthesis result is subsequently transmitted, a second TCP connection must be established between the mobile terminal and the cloud application platform, and the result is transmitted over RTSP (Real Time Streaming Protocol). That is, one TTS speech synthesis requires two TCP connections to be established, so the response efficiency of the TTS speech synthesis service is low.
Disclosure of Invention
In view of this, the invention provides a speech synthesis method, device and system, which improve the response efficiency of TTS speech synthesis service.
In order to achieve the above purpose, the invention provides the following specific technical scheme:
A speech synthesis method, applied to a server, comprises:
receiving request information carrying text information sent by a client, and obtaining a text to be synthesized according to the text information;
segmenting the text to be synthesized into at least one sub-text according to a preset processing rule;
performing TTS speech synthesis on the sub-texts in the order of the sub-texts to obtain a synthesis result;
and sending response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.
Optionally, before receiving the request information carrying the text information sent by the client, the method further includes:
establishing a TCP connection with the client.
Optionally, the text information is the text to be synthesized, an acquisition address of the text to be synthesized, or an identifier of the text to be synthesized.
Optionally, segmenting the text to be synthesized into at least one sub-text according to a preset processing rule includes:
segmenting the text to be synthesized, according to its sentence logic, into at least one sub-text whose length is within a preset range.
Optionally, sending the response information carrying the synthesis result to the client using chunked transfer encoding includes:
setting the transfer mode to chunked transfer encoding in a response header of the response information;
writing the synthesis result and the length of the synthesis result into a response body of the response information;
and sending the response information to the client.
Optionally, when the sub-text is the last sub-text in the text to be synthesized, the method further includes:
adding an end mark to the response body of the response information.
A speech synthesis method, applied to a client, comprises:
segmenting a text to be synthesized into at least one sub-text according to a preset processing rule;
generating request information that corresponds to the sub-texts and carries sub-text information;
sending the request information to a server using chunked transfer encoding, in the order of the sub-texts;
and receiving response information that is sent by the server and carries a synthesis result, and outputting the synthesis result in a streaming manner.
Optionally, before segmenting the text to be synthesized into at least one sub-text according to the preset processing rule, the method further includes:
establishing a TCP connection with the server.
Optionally, the method further includes:
disconnecting the TCP connection with the server when the received response information carries an end mark.
A speech synthesis apparatus, provided at a server, comprises:
a to-be-synthesized text acquisition unit, configured to receive request information carrying text information sent by a client, and to obtain a text to be synthesized according to the text information;
a first to-be-synthesized text segmentation unit, configured to segment the text to be synthesized into at least one sub-text according to a preset processing rule;
a TTS speech synthesis unit, configured to perform TTS speech synthesis on the sub-texts in the order of the sub-texts to obtain a synthesis result;
and a synthesis result sending unit, configured to send response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.
Optionally, the apparatus further comprises:
a first connection establishing unit, configured to establish a TCP connection with the client.
Optionally, the text information is the text to be synthesized, an acquisition address of the text to be synthesized, or an identifier of the text to be synthesized.
Optionally, the first to-be-synthesized text segmentation unit is specifically configured to segment the text to be synthesized, according to its sentence logic, into at least one sub-text whose length is within a preset range.
Optionally, the synthesis result sending unit is specifically configured to set the transfer mode to chunked transfer encoding in a response header of the response information, write the synthesis result and the length of the synthesis result into a response body of the response information, and send the response information to the client.
Optionally, the synthesis result sending unit is further configured to add an end mark to the response body of the response information when the sub-text is the last sub-text in the text to be synthesized.
A speech synthesis apparatus, provided at a client, comprises:
a second to-be-synthesized text segmentation unit, configured to segment a text to be synthesized into at least one sub-text according to a preset processing rule;
a request information generating unit, configured to generate request information that corresponds to the sub-texts and carries sub-text information;
a request information sending unit, configured to send the request information to a server using chunked transfer encoding, in the order of the sub-texts;
and a synthesis result output unit, configured to receive response information that is sent by the server and carries a synthesis result, and to output the synthesis result in a streaming manner.
Optionally, the apparatus further comprises:
a second connection establishing unit, configured to establish a TCP connection with the server.
Optionally, the apparatus further comprises:
a TCP connection disconnection unit, configured to disconnect the TCP connection with the server when the received response information carries an end mark.
A speech synthesis system comprises a client and a server;
the server is configured to execute the above speech synthesis method applied to a server;
the client is configured to execute the above speech synthesis method applied to a client.
Compared with the prior art, the invention has the following beneficial effects:
The invention discloses a speech synthesis method in which, once a TCP connection has been established between the client and the server, the server can perform TTS speech synthesis directly after receiving the request information sent by the client, without establishing another TCP connection with the client. That is, one TCP connection between the client and the server suffices to complete one TTS speech synthesis. In addition, the server segments the text to be synthesized and asynchronously synthesizes and sends the resulting sub-texts, without waiting for the entire text to be synthesized before sending, thereby improving the response efficiency of the TTS speech synthesis service.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a timing diagram of a speech synthesis method in the prior art;
FIG. 2 is a schematic flowchart of a speech synthesis method applied to a server according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a speech synthesis method applied to a client according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech synthesis apparatus disposed at a server according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a speech synthesis apparatus disposed at a client according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a speech synthesis system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, TTS speech synthesis is completed through interaction between a mobile terminal and a cloud application platform, as shown in FIG. 1. First, the mobile terminal establishes a TCP connection with the cloud application platform and sends an HTTP POST request to it. The cloud application platform sends 'OK' to the client, indicating that it has received the POST request. The cloud application platform requests the text requiring TTS synthesis from the resource server according to the text identifier in the POST request and, after obtaining the text, performs TTS synthesis on it. The mobile terminal receives the 'OK' message from the cloud application platform, learns that the request is allowed, and then initiates an RTSP connection request to the server. The cloud application platform completes the connection with the mobile terminal and returns 'OK'. After completing the TTS synthesis, the cloud application platform sends the synthesized audio to the mobile terminal through the RTSP channel. The mobile terminal receives the audio, and one TTS synthesis process is finished.
The prior-art method of speech synthesis has the following defects:
1. When the mobile terminal first contacts the cloud application platform, one TCP connection is established to make the HTTP POST request; when the TTS synthesis result is subsequently transmitted, another TCP connection is established and the result is transmitted over the upper-layer RTSP protocol. One TTS synthesis request thus requires two TCP connections to be established, which is time-consuming.
2. After obtaining the text, the cloud application platform completes the entire TTS synthesis before sending the synthesized audio. If the text is long, the whole request is blocked in the TTS synthesis stage, and the TTS speech synthesis service responds with a delay.
3. The TTS speech synthesis result is transmitted over the network through the RTSP channel, which incurs a longer delay. Establishing the RTSP channel relies on an HTTP request, so each request for the TTS speech synthesis service must establish HTTP first and then RTSP, and the two connections consume more resources.
Therefore, the speech synthesis method in the prior art suffers from high delay and high resource consumption.
To solve the above technical problem, this embodiment discloses a speech synthesis method applied to a server, where the server may be a server, a cloud application platform, or the like that implements speech synthesis. Referring to FIG. 2, the speech synthesis method disclosed in this embodiment specifically includes the following steps:
S101: receiving request information carrying text information sent by a client, and obtaining a text to be synthesized according to the text information;
The server needs to establish a TCP connection with the client before receiving the request information sent by the client.
The request information is an HTTP POST request.
The text information is a text to be synthesized, an acquisition address of the text to be synthesized or an identification of the text to be synthesized.
When the text information is the text to be synthesized, the text to be synthesized can be directly obtained according to the text information.
When the text information is the acquisition address of the text to be synthesized, the text to be synthesized can be acquired from the resource server according to the acquisition address of the text to be synthesized.
When the text information is the identifier of the text to be synthesized, the text to be synthesized corresponding to the identifier of the text to be synthesized can be acquired from the resource server.
When the text to be synthesized is an encrypted text, the text information may further include a password for extracting the text to be synthesized.
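The three forms of text information described above can be illustrated with a minimal Python sketch. The dictionary keys `text`, `address` and `id`, the function name `resolve_text`, and the two fetch callbacks standing in for the resource server are all hypothetical names chosen for illustration; the source does not specify a request format, and the optional password for encrypted text is not modelled here.

```python
from typing import Callable


def resolve_text(
    text_info: dict,
    fetch_by_address: Callable[[str], str],
    fetch_by_id: Callable[[str], str],
) -> str:
    """Return the text to be synthesized for one of the three text-information forms."""
    # NOTE: the keys "text", "address" and "id" are hypothetical field names.
    if "text" in text_info:
        # The text information is the text itself: use it directly.
        return text_info["text"]
    if "address" in text_info:
        # The text information is an acquisition address on the resource server.
        return fetch_by_address(text_info["address"])
    if "id" in text_info:
        # The text information is an identifier known to the resource server.
        return fetch_by_id(text_info["id"])
    raise ValueError("text information carries no text, address, or identifier")


# Stand-ins for the resource server: plain dictionaries.
by_address = {"/texts/1": "Fetched by address."}
by_id = {"doc-1": "Fetched by identifier."}

print(resolve_text({"id": "doc-1"}, by_address.__getitem__, by_id.__getitem__))
```

In a real deployment the two callbacks would issue requests to the resource server; they are injected here so the sketch stays self-contained.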
S102: segmenting the text to be synthesized into at least one sub-text according to a preset processing rule;
Specifically, the text to be synthesized is segmented, according to its sentence logic, into at least one sub-text whose length is within a preset range.
It should be noted that the preset processing rule includes segmenting according to the sub-text length range and segmenting according to the sentence logic; the segmentation of the text to be synthesized into at least one sub-text must satisfy both conditions at the same time.
The sentence logic of the text to be synthesized may be its sentence-break logic, and sentence breaks can be identified from full stops.
The length range of the sub-text may be, for example, 40960 bytes, and can be preset according to the specific application scenario.
It can be understood that when the length of the text to be synthesized is within the sub-text length range, the sub-text is the text to be synthesized itself.
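The segmentation rule of S102 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `split_into_subtexts`, the set of sentence-ending punctuation marks, and the treatment of 40960 bytes as an upper bound measured in UTF-8 are choices made here, not details given in the source.

```python
import re


def split_into_subtexts(text: str, max_bytes: int = 40960) -> list[str]:
    """Pack whole sentences into sub-texts of at most max_bytes UTF-8 bytes."""
    # Split after sentence-ending punctuation (Chinese and Western marks assumed).
    sentences = [s for s in re.split(r"(?<=[。！？.!?])", text) if s]
    subtexts: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = current + sentence
        if current and len(candidate.encode("utf-8")) > max_bytes:
            # Adding this sentence would exceed the limit: close the sub-text.
            subtexts.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        subtexts.append(current)
    # Limitation of this sketch: a single sentence longer than max_bytes is
    # kept whole rather than being cut in the middle.
    return subtexts


print(split_into_subtexts("One. Two. Three.", max_bytes=10))
```

Because sentences are only ever concatenated, joining the returned sub-texts reproduces the original text, and a text shorter than the limit comes back as a single sub-text.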
S103: performing TTS speech synthesis on the sub-texts in the order of the sub-texts to obtain a synthesis result;
The order of the sub-texts is the order in which they appear in the text to be synthesized.
For example, if the text to be synthesized is segmented into sub-text A, sub-text B and sub-text C, and their order in the text to be synthesized is A-B-C, TTS speech synthesis is performed first on sub-text A, then on sub-text B, and finally on sub-text C.
S104: sending response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.
Chunked transfer encoding is a data transfer mechanism in the Hypertext Transfer Protocol (HTTP) that allows data transmitted over HTTP to be divided into a plurality of parts.
In the above example, after TTS speech synthesis of sub-text A is completed, the response information carrying the synthesis result of sub-text A is sent to the client using chunked transfer encoding; then, after TTS speech synthesis of sub-text B is completed, the response information carrying the synthesis result of sub-text B is sent in the same way; and finally, after TTS speech synthesis of sub-text C is completed, the response information carrying the synthesis result of sub-text C is sent in the same way.
Specifically, the response information corresponds to the request information sent by the client and represents a response to that request; its structure includes a response header and a response body. For each sub-text, the transfer mode is set to chunked transfer encoding in the response header of the response information, the synthesis result of the sub-text and the length of the synthesis result are written into the response body of the response information, and the response information is sent to the client.
The TTS synthesis result is streaming audio (in an audio format such as PCM or MPEG), and the client can play each piece of audio immediately upon receipt or perform other business processing.
It should be noted that, when the sub-text is the last sub-text in the text to be synthesized, the server adds an end mark to the response body of the response information, so that the client disconnects the TCP connection with the server after receiving the response information.
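As a rough illustration of how a response body could be laid out, the following Python sketch frames each synthesis result as an HTTP/1.1 chunk: the chunk size in hexadecimal, CRLF, the data, CRLF, with a zero-length chunk closing the body. Mapping the "length of the synthesis result" to the chunk size and the "end mark" to the zero-length last chunk is an interpretation made here; the source does not spell out the byte layout. The names `encode_chunk`, `encode_body` and `LAST_CHUNK` are hypothetical.

```python
from typing import Iterable, Iterator


def encode_chunk(payload: bytes) -> bytes:
    """Frame one payload as an HTTP/1.1 chunk: hex size, CRLF, data, CRLF."""
    return f"{len(payload):X}".encode("ascii") + b"\r\n" + payload + b"\r\n"


# A zero-length chunk followed by an empty trailer section ends the body.
LAST_CHUNK = b"0\r\n\r\n"


def encode_body(audio_segments: Iterable[bytes]) -> Iterator[bytes]:
    """Yield one chunk per synthesis result, then the terminating chunk."""
    for segment in audio_segments:
        if segment:  # an empty chunk would be read as the end of the body
            yield encode_chunk(segment)
    yield LAST_CHUNK


# Placeholder bytes stand in for the audio of sub-texts A and B.
print(b"".join(encode_body([b"abcd", b"xy"])))
```

The response header accompanying such a body would carry `Transfer-Encoding: chunked` instead of a `Content-Length`, which is what lets the server start sending before the total audio length is known.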
Therefore, in the speech synthesis method disclosed in this embodiment, once a TCP connection has been established between the client and the server, the server can perform TTS speech synthesis directly after receiving the request information sent by the client, without establishing another TCP connection with the client. That is, one TCP connection between the client and the server suffices to complete one TTS speech synthesis, which improves the response efficiency of the TTS speech synthesis service and reduces its energy consumption.
Meanwhile, the server segments the text to be synthesized and asynchronously synthesizes and sends the resulting sub-texts, without waiting for the entire text to be synthesized before sending, which further improves the response efficiency of the TTS speech synthesis service.
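The benefit of sending each sub-text's result as soon as it is ready can be sketched with a Python generator. This is a simplified, single-threaded illustration: `stream_synthesis` is a hypothetical name, and `fake_tts` is a placeholder for a real TTS engine, which the source does not name.

```python
from typing import Callable, Iterable, Iterator


def stream_synthesis(
    subtexts: Iterable[str],
    synthesize: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Synthesize sub-texts in order, yielding each result as soon as it is ready."""
    for subtext in subtexts:
        # The caller can send this result before the next sub-text is synthesized.
        yield synthesize(subtext)


def fake_tts(subtext: str) -> bytes:
    """Placeholder for a TTS engine: returns tagged bytes instead of audio."""
    return b"AUDIO:" + subtext.encode("utf-8")


for audio in stream_synthesis(["A", "B", "C"], fake_tts):
    print(audio)  # in a server, this is where the chunk would be written out
```

Because the generator is lazy, the result for sub-text A is available to the sender before synthesis of sub-text B has started, which is the property the embodiment relies on to reduce the time to first audio.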
The speech synthesis method disclosed in the above embodiment improves the speech synthesis service on the server side. This embodiment discloses a speech synthesis method applied to a client, which improves the speech synthesis service on the client side. The client may be a mobile terminal such as a smartphone, an intercom, a smart TV, a tablet (PAD), or a PDA (personal digital assistant). Referring to FIG. 3, the speech synthesis method disclosed in this embodiment specifically includes the following steps:
S201: segmenting a text to be synthesized into at least one sub-text according to a preset processing rule;
It should be noted that before the text to be synthesized is segmented into at least one sub-text, the client establishes a TCP connection with the server.
The preset processing rule includes segmenting according to the sub-text length range and segmenting according to the sentence logic; the segmentation of the text to be synthesized into at least one sub-text must satisfy both conditions at the same time.
The sentence logic of the text to be synthesized may be its sentence-break logic, and sentence breaks can be identified from full stops.
The length range of the sub-text may be, for example, 40960 bytes, and can be preset according to the specific application scenario.
It can be understood that when the length of the text to be synthesized is within the sub-text length range, the sub-text is the text to be synthesized itself.
S202: generating request information that corresponds to the sub-texts and carries sub-text information;
The text information is the text to be synthesized, an acquisition address of the text to be synthesized, or an identifier of the text to be synthesized.
When the text to be synthesized is an encrypted text, the text information may further include a password for extracting the text to be synthesized.
The request information is an HTTP POST request comprising a request header and a request body. The transfer mode is set to chunked transfer encoding in the request header, and the text information of the text to be synthesized is written into the request body.
S203: sending the request information to the server using chunked transfer encoding, in the order of the sub-texts;
For example, if the text to be synthesized is segmented into sub-text A, sub-text B and sub-text C, and their order in the text to be synthesized is A-B-C, the request information of sub-text A is sent first, then that of sub-text B, and finally that of sub-text C.
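One possible reading of S202 and S203 is a single HTTP POST whose body carries the sub-texts as successive chunks, in order. The Python sketch below builds such a request as raw bytes. The function name `build_chunked_post`, the host `tts.example`, the path `/tts` and the `Content-Type` value are assumptions for illustration only; the source does not define them, nor does it state whether the sub-texts travel in one request or several.

```python
def build_chunked_post(host: str, path: str, subtexts: list[str]) -> bytes:
    """Build an HTTP/1.1 POST whose body carries each sub-text as one chunk."""
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        "Transfer-Encoding: chunked\r\n"  # declared in the request header
        "\r\n"
    ).encode("ascii")
    body = b""
    for subtext in subtexts:  # preserve the order of the sub-texts
        data = subtext.encode("utf-8")
        if data:
            # Chunk framing: size in hexadecimal, CRLF, data, CRLF.
            body += f"{len(data):X}\r\n".encode("ascii") + data + b"\r\n"
    # A zero-length chunk marks the end of the request body.
    return head + body + b"0\r\n\r\n"


print(build_chunked_post("tts.example", "/tts", ["A.", "B."]).decode("utf-8"))
```

In practice the chunks would be written to the socket one at a time rather than concatenated first; the bytes are joined here only so that the full wire format is visible in one place.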
S204: receiving response information that is sent by the server and carries the synthesis result, and outputting the synthesis result in a streaming manner.
Specifically, the received response information is parsed to obtain the synthesis result in its response body. When the response body carries an end mark, it is determined that all speech synthesis results for the text to be synthesized have been received, and the TCP connection with the server is disconnected.
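The client-side parsing in S204 can be sketched as a small decoder for a chunked body, assuming the end mark corresponds to the zero-length terminating chunk of HTTP/1.1 chunked transfer encoding (an interpretation, not something the source states). The name `iter_chunks` is hypothetical, and an in-memory `io.BytesIO` stands in for the socket; most HTTP client libraries perform this decoding internally.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO) -> Iterator[bytes]:
    """Yield the payload of each chunk until the zero-length end chunk."""
    while True:
        size_line = stream.readline()
        if not size_line:
            raise EOFError("stream ended before the terminating chunk")
        # The size is hexadecimal; any chunk extension after ';' is ignored.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            # End mark reached: skip optional trailer fields up to the blank line.
            while stream.readline() not in (b"\r\n", b"\n", b""):
                pass
            return
        data = stream.read(size)
        if len(data) != size:
            raise EOFError("stream ended in the middle of a chunk")
        stream.read(2)  # consume the CRLF that follows the chunk data
        yield data


body = io.BytesIO(b"4\r\nabcd\r\n2\r\nxy\r\n0\r\n\r\n")
for audio in iter_chunks(body):
    print(audio)  # each piece could be played as soon as it arrives
```

When the generator finishes, the client knows the whole synthesis result has arrived and can close the TCP connection, matching the disconnect step described above.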
Therefore, in the speech synthesis method disclosed in this embodiment, once a TCP connection has been established between the client and the server, the server can perform TTS speech synthesis directly after receiving the request information sent by the client, without establishing another TCP connection with the client. That is, one TCP connection between the client and the server suffices to complete one TTS speech synthesis, which improves the response efficiency of the TTS speech synthesis service and reduces its energy consumption.
Meanwhile, the client makes use of the fact that chunked transfer encoding allows data to be sent in multiple parts: when the text to be processed is long, it is segmented into a plurality of sub-texts, and request information carrying the sub-text information is sent multiple times. This optimizes the sending of the text to be processed, improves its sending efficiency, and improves the response efficiency of the TTS speech synthesis service.
Based on the speech synthesis method applied to the server disclosed in the above embodiment, this embodiment correspondingly discloses a speech synthesis apparatus, which is disposed at the server, and please refer to fig. 4, where the speech synthesis apparatus includes:
a to-be-synthesized text obtaining unit 401, configured to receive request information carrying text information sent by a client, and obtain a to-be-synthesized text according to the text information;
a first text to be synthesized segmentation unit 402, configured to segment the text to be synthesized into at least one sub-text according to a preset processing rule;
a TTS speech synthesis unit 403, configured to perform TTS speech synthesis on the sub-texts according to the order of the sub-texts to obtain a synthesis result;
a synthesis result sending unit 404, configured to send response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.
Optionally, the apparatus further comprises:
and the first connection establishing unit is used for establishing TCP connection with the client.
Optionally, the text information is the text to be synthesized, the obtaining address of the text to be synthesized, or the identifier of the text to be synthesized.
Optionally, the first to-be-synthesized text segmenting unit 402 is specifically configured to segment the to-be-synthesized text into at least one sub-text with a length within a preset range according to the sentence logic of the to-be-synthesized text.
Optionally, the synthesis result sending unit 404 is specifically configured to set the transfer mode to chunked transfer encoding in a response header of the response information, write the synthesis result and the length of the synthesis result into a response body of the response information, and send the response information to the client.
Optionally, the synthesis result sending unit 404 is further configured to add an end mark to the response body of the response information when the sub-text is the last sub-text in the text to be synthesized.
Based on the speech synthesis method applied to the client disclosed in the above embodiments, this embodiment correspondingly discloses a speech synthesis apparatus, which is disposed at the client, please refer to fig. 5, and the speech synthesis apparatus includes:
a second text to be synthesized segmentation unit 501, configured to segment a text to be synthesized into at least one sub-text according to a preset processing rule;
a request information generating unit 502, configured to generate request information that corresponds to the sub-texts and carries sub-text information;
a request information sending unit 503, configured to send the request information to the server using chunked transfer encoding, in the order of the sub-texts;
a synthesis result output unit 504, configured to receive response information that is sent by the server and carries a synthesis result, and to output the synthesis result in a streaming manner.
Optionally, the apparatus further comprises:
and the second connection establishing unit is used for establishing TCP connection with the server.
Optionally, the apparatus further comprises:
and the TCP connection disconnection unit is used for disconnecting the TCP connection with the server side when the received response information carries the ending mark.
Referring to fig. 6, the present embodiment discloses a speech synthesis system, which includes a client 601 and a server 602.
The client 601 is configured to execute the following speech synthesis method:
segmenting a text to be synthesized into at least one sub-text according to a preset processing rule;
generating request information that corresponds to the sub-texts and carries sub-text information;
sending the request information to the server using chunked transfer encoding, in the order of the sub-texts;
and receiving response information that is sent by the server and carries the synthesis result, and outputting the synthesis result in a streaming manner.
Further, before the segmenting the text to be synthesized into at least one sub-text according to the preset processing rule, the method further includes:
and establishing TCP connection with the server.
Further, the method further comprises:
and when the received response information carries an end mark, disconnecting the TCP connection with the server.
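The client-side receiving steps can be illustrated with a minimal Python sketch. It assumes that the response body uses standard HTTP/1.1 chunked framing and that the end mark is the zero-length final chunk; the patent text does not fix the concrete form of the end mark, so this is only one possible reading. The names `iter_chunked_body` and `play_stream` are hypothetical.

```python
import io
from typing import BinaryIO, Callable, Iterator


def iter_chunked_body(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each chunk payload until the zero-length end chunk is read."""
    while True:
        size_line = stream.readline()
        if not size_line:
            raise EOFError("stream ended before the end mark was received")
        # The chunk size is hexadecimal; ignore optional extensions after ';'.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            stream.readline()  # consume the blank line after the last chunk
            return             # assumed end mark: the caller may now close TCP
        payload = stream.read(size)
        stream.read(2)         # consume the CRLF that terminates the chunk
        yield payload


def play_stream(stream: BinaryIO, output: Callable[[bytes], None]) -> None:
    """Hand every synthesized fragment to `output` as soon as it arrives."""
    for audio in iter_chunked_body(stream):
        output(audio)


# Usage with an in-memory stand-in for the socket file object.
received = []
play_stream(io.BytesIO(b"3\r\nabc\r\n5\r\nhello\r\n0\r\n\r\n"), received.append)
```

Because the generator yields each payload before reading the next chunk, playback of the first fragment can begin while later fragments are still being synthesized on the server.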
The server 602 is configured to perform the following speech synthesis method:
receiving request information carrying text information sent by a client, and acquiring a text to be synthesized according to the text information;
segmenting the text to be synthesized into at least one sub-text according to a preset processing rule;
performing TTS voice synthesis on the sub-texts according to the sequence of the sub-texts to obtain a synthesis result;
and sending response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.
Further, before receiving the request information that is sent by the client and carries the text information, the method further includes:
establishing a TCP connection with the client.
Further, the text information is the text to be synthesized, the acquisition address of the text to be synthesized, or the identifier of the text to be synthesized.
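A minimal sketch of how a server might obtain the text to be synthesized from these three forms of text information follows. The `TextInfo` fields, the `fetch` callable and the `store` mapping are all hypothetical names introduced for illustration; the patent does not describe how an address is fetched or how an identifier is looked up.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class TextInfo:
    """Exactly one field is expected to be set (hypothetical structure)."""
    text: Optional[str] = None      # the text to be synthesized itself
    address: Optional[str] = None   # where the text can be acquired
    text_id: Optional[str] = None   # identifier of a text known to the server


def resolve_text(
    info: TextInfo,
    fetch: Callable[[str], str],
    store: Mapping[str, str],
) -> str:
    """Return the text to be synthesized for the given text information."""
    if info.text is not None:
        return info.text
    if info.address is not None:
        return fetch(info.address)   # e.g. download from the given address
    if info.text_id is not None:
        return store[info.text_id]   # e.g. look up a text stored in advance
    raise ValueError("text information carries no text, address or identifier")
```

Passing `fetch` and `store` in as parameters keeps the sketch independent of any particular download mechanism or storage backend.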
Further, segmenting the text to be synthesized into at least one sub-text according to a preset processing rule includes:
segmenting the text to be synthesized, according to the sentence logic of the text to be synthesized, into at least one sub-text whose length is within a preset range.
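One possible reading of this segmentation rule is sketched below in Python. It assumes that "sentence logic" means splitting at sentence-ending punctuation and that the preset range is a single upper bound `max_len`; both are assumptions, as are the punctuation set and the function name `segment_text`. A sentence longer than `max_len` is simply cut at the limit.

```python
import re
from typing import List

# Assumed sentence-ending punctuation (Chinese and Western).
_PUNCT = "。！？；!?;."
# A sentence is a run of text closed by punctuation, or a trailing run without it.
_SENTENCE = re.compile(f"[^{_PUNCT}]*[{_PUNCT}]+|[^{_PUNCT}]+")


def segment_text(text: str, max_len: int = 40) -> List[str]:
    """Pack whole sentences, in order, into sub-texts of at most max_len characters."""
    if max_len < 1:
        raise ValueError("max_len must be positive")
    sub_texts: List[str] = []
    current = ""
    for sentence in _SENTENCE.findall(text):
        # Close the current sub-text if the next sentence would not fit.
        if current and len(current) + len(sentence) > max_len:
            sub_texts.append(current)
            current = ""
        # A single sentence longer than max_len is cut at the length limit.
        while len(sentence) > max_len:
            sub_texts.append(sentence[:max_len])
            sentence = sentence[max_len:]
        current += sentence
    if current:
        sub_texts.append(current)
    return sub_texts


parts = segment_text("Hello world. How are you? I am fine.", max_len=26)
```

The sketch preserves every character, so concatenating the sub-texts in order reproduces the original text. Splitting on "." also cuts decimal numbers and abbreviations, which a production rule would need to handle.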
Further, sending the response information carrying the synthesis result to the client using chunked transfer encoding includes:
setting the transfer mode to chunked transfer encoding in a response header of the response information;
writing the synthesis result and the length of the synthesis result into a response body of the response information;
and sending the response information to the client.
Further, when the sub-text is the last sub-text in the text to be synthesized, the method further includes:
adding an end mark to the response body of the response information.
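The response framing described above can be sketched with standard HTTP/1.1 chunked transfer encoding, in which each chunk is written as its hexadecimal length, CRLF, the data, and CRLF. The sketch assumes that the end mark corresponds to the zero-length final chunk and that the audio is served as `audio/wav`; neither detail is stated in the patent, and `encode_chunk` and `response_stream` are hypothetical names.

```python
from typing import Iterable, Iterator

# Response header announcing chunked transfer encoding.
RESPONSE_HEAD = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: audio/wav\r\n"       # assumed audio format
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
)
END_MARK = b"0\r\n\r\n"                  # assumed end mark: zero-length chunk


def encode_chunk(audio: bytes) -> bytes:
    """Write the length of one synthesis result followed by the result itself."""
    if not audio:
        raise ValueError("an empty chunk would be read as the end mark")
    return b"%X\r\n" % len(audio) + audio + b"\r\n"


def response_stream(results: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the header, one chunk per synthesis result, then the end mark."""
    yield RESPONSE_HEAD
    for audio in results:
        yield encode_chunk(audio)
    yield END_MARK


wire = b"".join(response_stream([b"abc", b"x" * 16]))
```

Because `response_stream` is a generator, each chunk can be written to the socket as soon as the corresponding sub-text has been synthesized.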
In the speech synthesis system disclosed in this embodiment, once a TCP connection has been established between the client and the server, the server can perform TTS speech synthesis directly after receiving the request information sent by the client, without establishing another TCP connection with the client. In other words, one TCP connection between the client and the server is enough to complete one TTS speech synthesis, which improves the response efficiency of the TTS speech synthesis service and reduces its energy consumption.
At the client, the mechanism of chunked transfer encoding that allows data to be sent in multiple parts is used: when the text to be synthesized is long, it is segmented into a plurality of sub-texts, and request information carrying the sub-text information is sent multiple times. This optimizes the sending process of the text to be synthesized, improves its sending efficiency, and thereby improves the response efficiency of the TTS speech synthesis service.
At the server, the text to be synthesized is segmented, and the resulting sub-texts are synthesized and sent asynchronously, so the server does not have to finish synthesizing the whole text before sending anything. This improves the response efficiency of the TTS speech synthesis service.
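The asynchronous synthesize-and-send behaviour can be sketched as a producer and a consumer joined by a bounded queue. This is a minimal illustration using Python's `asyncio`, not the implementation described in the patent: `synthesize` is a stand-in that merely encodes the text, and `send` is any callable that would write one fragment to the client.

```python
import asyncio
from typing import Callable, Iterable, Optional


async def synthesize(sub_text: str) -> bytes:
    """Stand-in for a TTS engine (assumption): returns placeholder audio bytes."""
    await asyncio.sleep(0)  # yield control, as a real engine call would
    return sub_text.encode("utf-8")


async def produce(sub_texts: Iterable[str], queue: "asyncio.Queue[Optional[bytes]]") -> None:
    """Synthesize the sub-texts in order and queue each result immediately."""
    for sub_text in sub_texts:
        await queue.put(await synthesize(sub_text))
    await queue.put(None)  # sentinel: no more results


async def consume(queue: "asyncio.Queue[Optional[bytes]]", send: Callable[[bytes], None]) -> None:
    """Send each result as soon as it is available, without waiting for the rest."""
    while True:
        audio = await queue.get()
        if audio is None:
            break
        send(audio)


async def serve(sub_texts: Iterable[str], send: Callable[[bytes], None]) -> None:
    queue: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue(maxsize=2)
    await asyncio.gather(produce(sub_texts, queue), consume(queue, send))


sent = []
asyncio.run(serve(["first.", "second.", "third."], sent.append))
```

The bounded queue lets synthesis of the next sub-text overlap with sending of the previous result while preserving the order of the sub-texts.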
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.
It is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A speech synthesis method, applied to a server, the method comprising:
receiving request information carrying text information sent by a client, and acquiring a text to be synthesized according to the text information;
segmenting the text to be synthesized into at least one sub-text according to a preset processing rule;
performing TTS voice synthesis on the sub-texts according to the sequence of the sub-texts to obtain a synthesis result;
and sending response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.
2. The method according to claim 1, wherein before receiving the request information that is sent by the client and carries the text information, the method further comprises:
establishing a TCP connection with the client.
3. The method according to claim 1, wherein the text information is the text to be synthesized, an acquisition address of the text to be synthesized, or an identifier of the text to be synthesized.
4. The method according to claim 1, wherein the segmenting the text to be synthesized into at least one sub-text according to a preset processing rule comprises:
segmenting the text to be synthesized, according to the sentence logic of the text to be synthesized, into at least one sub-text whose length is within a preset range.
5. The method according to claim 1, wherein the sending the response information carrying the synthesis result to the client using chunked transfer encoding comprises:
setting the transfer mode to chunked transfer encoding in a response header of the response information;
writing the synthesis result and the length of the synthesis result into a response body of the response information;
and sending the response information to the client.
6. The method according to claim 4, wherein when the sub-text is the last sub-text in the text to be synthesized, the method further comprises:
adding an end mark to a response body of the response information.
7. A speech synthesis method applied to a client, the method comprising:
segmenting a text to be synthesized into at least one sub-text according to a preset processing rule;
generating request information carrying the sub-text information corresponding to the sub-texts;
sending the request information to a server using chunked transfer encoding, in the order of the sub-texts;
and receiving response information that is sent by the server and carries the synthesis result, and outputting the synthesis result in a streaming manner.
8. The method according to claim 7, wherein before the segmenting the text to be synthesized into at least one sub-text according to the preset processing rule, the method further comprises:
establishing a TCP connection with the server.
9. The method of claim 7, further comprising:
disconnecting the TCP connection with the server when the received response information carries an end mark.
10. A speech synthesis apparatus, provided at a server, the apparatus comprising:
a to-be-synthesized text acquisition unit, configured to receive request information that is sent by a client and carries text information, and to acquire a text to be synthesized according to the text information;
a first to-be-synthesized text segmentation unit, configured to segment the text to be synthesized into at least one sub-text according to a preset processing rule;
a TTS speech synthesis unit, configured to perform TTS speech synthesis on the sub-texts in the order of the sub-texts to obtain a synthesis result;
and a synthesis result sending unit, configured to send response information carrying the synthesis result to the client using chunked transfer encoding, so that the client outputs the synthesis result in a streaming manner.
11. A speech synthesis apparatus provided at a client, the apparatus comprising:
a second to-be-synthesized text segmentation unit, configured to segment a text to be synthesized into at least one sub-text according to a preset processing rule;
a request information generating unit, configured to generate request information that corresponds to the sub-texts and carries sub-text information;
a request information sending unit, configured to send the request information to a server using chunked transfer encoding, in the order of the sub-texts;
and a synthesis result output unit, configured to receive response information that is sent by the server and carries the synthesis result, and to output the synthesis result in a streaming manner.
12. A speech synthesis system, comprising a client and a server, wherein:
the server is configured to execute the speech synthesis method according to any one of claims 1 to 6;
and the client is configured to execute the speech synthesis method according to any one of claims 7 to 9.
CN201910944037.2A 2019-09-30 2019-09-30 Voice synthesis method, device and system Pending CN112581934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910944037.2A CN112581934A (en) 2019-09-30 2019-09-30 Voice synthesis method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910944037.2A CN112581934A (en) 2019-09-30 2019-09-30 Voice synthesis method, device and system

Publications (1)

Publication Number Publication Date
CN112581934A true CN112581934A (en) 2021-03-30

Family

ID=75117263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910944037.2A Pending CN112581934A (en) 2019-09-30 2019-09-30 Voice synthesis method, device and system

Country Status (1)

Country Link
CN (1) CN112581934A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101098507A (en) * 2007-06-29 2008-01-02 中兴通讯股份有限公司 System and method for providing speech synthesis application united development platform
CN102232298A (en) * 2011-04-07 2011-11-02 华为技术有限公司 Method, device and system for transmitting and processing media content
CN102387206A (en) * 2011-10-20 2012-03-21 镇江睿泰信息科技有限公司 Synthesis method and system of concurrent request of Web service
CN102629936A (en) * 2012-03-12 2012-08-08 华为终端有限公司 Method for mobile terminal to process text, related device and system
CN102694864A (en) * 2012-05-30 2012-09-26 安科智慧城市技术(中国)有限公司 Method for achieving streaming media function by utilizing HTTP, streaming media server and system
CN106034157A (en) * 2015-03-18 2016-10-19 国家计算机网络与信息安全管理中心 HTTP transmission method in data exchange, server and storage device
CN106098056A (en) * 2016-06-14 2016-11-09 腾讯科技(深圳)有限公司 Processing method, NEWS SERVER and the system of a kind of voice news
CN106504742A (en) * 2016-11-14 2017-03-15 海信集团有限公司 The transmission method of synthesis voice, cloud server and terminal device
CN107274882A (en) * 2017-08-08 2017-10-20 腾讯科技(深圳)有限公司 Data transmission method and device
CN107294913A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Safety communicating method, service end and client based on HTTP
CN108173742A (en) * 2017-12-08 2018-06-15 腾讯科技(深圳)有限公司 A kind of image processing method, device
CN108881485A (en) * 2018-07-30 2018-11-23 中国石油化工股份有限公司 The method for ensureing the high concurrent system response time under big data packet
CN108877804A (en) * 2018-06-26 2018-11-23 苏州思必驰信息科技有限公司 Voice service method, system, electronic equipment and storage medium
CN110197655A (en) * 2019-06-28 2019-09-03 百度在线网络技术(北京)有限公司 Method and apparatus for synthesizing voice

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096637A (en) * 2021-06-09 2021-07-09 北京世纪好未来教育科技有限公司 Speech synthesis method, apparatus and computer readable storage medium
CN113096637B (en) * 2021-06-09 2021-11-02 北京世纪好未来教育科技有限公司 Speech synthesis method, apparatus and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination