WO2022110943A1 - Speech preview method and apparatus - Google Patents

Speech preview method and apparatus

Info

Publication number
WO2022110943A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
text input
speech synthesis
server
data
Prior art date
Application number
PCT/CN2021/115113
Other languages
French (fr)
Chinese (zh)
Inventor
陈翔宇
张晨
Original Assignee
北京达佳互联信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2022110943A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • the present disclosure relates to the field of signal processing, and in particular, to a method and device for previewing speech.
  • In the related art, the scenario in which a terminal device (such as a mobile phone) uses a speech synthesis (TTS) service is as follows: text is input, a voice file is generated by calling a network-based or offline software development kit (SDK), the voice file is then returned to the terminal device over the network or as a file, and the terminal device thereafter plays the voice file by calling it.
  • In a video editing scenario, the user uses the terminal device to edit the captured video and edit text, then uses the edited text to generate audio files with different timbres, which are synthesized into the video to complete the dubbing process.
  • the present disclosure provides a voice preview method and device, and the technical solutions of the present disclosure are as follows:
  • According to a first aspect of the embodiments of the present disclosure, a method for previewing speech is provided, including: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  • the method further includes: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input through a speech synthesis service.
  • In some embodiments, based on the speech settings related to the speech synthesis service being changed, the server is notified to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, the server is notified to re-synthesize the text input using the speech synthesis service with the changed speech settings, and the speech data received from the server that is synthesized from the text input by the speech synthesis service according to the changed speech settings is re-cached.
  • the method further includes: performing speech synthesis on the text input through a speech synthesis service to obtain speech data.
  • In some embodiments, based on the speech settings related to the speech synthesis service being changed, the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings is stopped, the text input is re-synthesized using the speech synthesis service with the changed speech settings, and the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings is re-cached.
  • According to a second aspect of the embodiments of the present disclosure, an apparatus for speech preview is provided, comprising: a receiving unit configured to receive text input; a buffer unit configured to buffer speech data synthesized from the text input through a speech synthesis service; a decoding unit configured to decode the buffered speech data when the synthesized speech data has been buffered to a playable length; and a playing unit configured to play the decoded audio data.
  • the apparatus for voice preview further includes: a sending unit configured to send the text input to the server, wherein the receiving unit is further configured to receive from the server voice data synthesized from the text input through a speech synthesis service.
  • In some embodiments, based on the speech settings related to the speech synthesis service being changed, the sending unit is configured to notify the server to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings and to notify the server to re-synthesize the text input using the speech synthesis service with the changed speech settings; and the buffer unit is configured to re-cache the speech data, received by the receiving unit from the server, that is synthesized from the text input by the speech synthesis service according to the changed speech settings.
  • the apparatus for speech previewing further includes: a speech synthesis unit configured to perform speech synthesis on the text input through a speech synthesis service to obtain speech data.
  • In some embodiments, based on the speech settings related to the speech synthesis service being changed, the speech synthesis unit is configured to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings and to re-synthesize the text input according to the changed speech settings; and the buffer unit is configured to re-cache the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings.
  • According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including a processor and a memory for storing executable instructions, wherein the processor is configured to execute the executable instructions to implement the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  • In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input by a speech synthesis service.
  • In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: based on the speech settings related to the speech synthesis service being changed, notifying the server to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, notifying the server to re-synthesize the text input using the speech synthesis service with the changed speech settings, and re-caching the speech data received from the server that is synthesized from the text input by the speech synthesis service according to the changed speech settings.
  • the processor is configured to execute executable instructions to perform the steps of: performing speech synthesis on textual input through a speech synthesis service to obtain speech data.
  • In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: based on the speech settings related to the speech synthesis service being changed, stopping the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, re-synthesizing the text input using the speech synthesis service with the changed speech settings, and re-caching the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings.
  • According to a fourth aspect of the embodiments of the present disclosure, a speech processing system is provided, comprising: a terminal device configured to receive text input, send the text input to a server, buffer in real time the speech data received from the server that is synthesized from the text input through a TTS service, and, when the received speech data has been buffered to a playable length, decode and play the buffered speech data; and a server configured to perform speech synthesis on the text input received from the terminal device through the TTS service to obtain speech data, and to transmit the obtained speech data to the terminal device in real time.
  • According to a fifth aspect of the embodiments of the present disclosure, a computer-readable storage medium storing instructions is provided; when the instructions are executed by at least one processor, the at least one processor is caused to perform the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  • According to a sixth aspect of the embodiments of the present disclosure, a computer program product is provided; instructions in the computer program product are executed by at least one processor in an electronic device to perform the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  • In the embodiments of the present disclosure, the delay is greatly reduced through real-time transmission, and a real-time preview is started once very little speech data has been buffered, with almost no waiting.
  • At the same time, when the timbre is switched, the local terminal device itself stops performing TTS on, or informs the server not to perform TTS on, the remaining text that has not yet been synthesized, which reduces the cost of the TTS service, thereby improving the speed of TTS preview for users in video editing and optimizing the user experience.
  • FIG. 1 is an exemplary system architecture diagram to which exemplary embodiments of the present disclosure may be applied;
  • FIG. 2 is a flowchart of a method for voice preview according to an exemplary embodiment of the present disclosure
  • FIG. 3 is a detailed flowchart of a method for voice preview when a TTS service is executed on the server side according to an exemplary embodiment of the present disclosure
  • FIG. 4 is a detailed flowchart of a method for voice preview when a TTS service is locally executed on a terminal device according to an exemplary embodiment of the present disclosure
  • FIG. 5 is a block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure.
  • FIG. 6 is a detailed block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure.
  • Fig. 7 is a detailed block diagram of a voice preview apparatus according to another exemplary embodiment of the present disclosure.
  • FIG. 8 is a block diagram of a system for voice preview according to an exemplary embodiment of the present disclosure.
  • FIG. 9 is a block diagram of an electronic device according to an embodiment of the present disclosure.
  • In view of this, the present disclosure proposes to cache, in real time, the audio data synthesized through the TTS service after the text input is received, and to start real-time preview of the audio file once very little data has been cached, almost without waiting; at the same time, when the timbre is switched, speech synthesis is no longer performed on the text that has not yet been synthesized, which reduces the cost of the TTS service.
  • FIG. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages (eg, TTS service request, audio and video data upload request, audio and video data acquisition request) and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as video recording applications, audio playback applications, video and audio editing applications, instant messaging tools, email clients, social platform software, and the like.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • When the terminal devices 101, 102 and 103 are hardware, they can be various electronic devices with a display screen that are capable of playing, recording and editing audio and video, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.
  • When the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above, and can be implemented as multiple pieces of software or software modules (for example, used to provide distributed services), or as a single piece of software or a single software module. There is no specific limitation here.
  • the terminal devices 101, 102, 103 may be installed with image capture devices (eg, cameras) to capture video data.
  • the terminal devices 101, 102, 103 may also be installed with components for converting electrical signals into sounds (such as speakers) to play sounds, and may also be installed with devices for converting analog audio signals into digital audio signals (for example, microphone) to capture sound.
  • the terminal devices 101 , 102 , and 103 can use the image collection device installed on them to collect video data, and use the audio collection device installed on them to collect audio data. Moreover, the terminal devices 101, 102, 103 can perform TTS service on the received text input to synthesize audio data from the text input, and can play the audio data by using an audio processing component installed on it that supports audio playback.
  • the server 105 may be a server that provides various services, such as a background server that provides support for audio and video recording applications, audio and video editing applications, etc. installed on the terminal devices 101 , 102 , and 103 .
  • The background server can perform analysis, the TTS service, storage and other processing on the uploaded text input and other data, and can also receive TTS service requests sent by the terminal devices 101, 102, and 103 and feed the synthesized audio data back to the terminal devices 101, 102, and 103.
  • the server may be hardware or software.
  • the server can be implemented as a distributed server cluster composed of multiple servers, or can be implemented as a single server.
  • When the server is software, it can be implemented as multiple pieces of software or software modules (for example, for providing distributed services), or as a single piece of software or a single software module. There is no specific limitation here.
  • the voice preview methods provided in the embodiments of the present application are generally performed by the terminal devices 101 , 102 , and 103 , and correspondingly, the voice preview devices are generally set in the terminal devices 101 , 102 and 103 .
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. According to implementation requirements, there may be any number of terminal devices, networks and servers, which are not limited in the present disclosure.
  • Fig. 2 is a flow chart of a method for voice preview according to an exemplary embodiment of the present disclosure.
  • In step S201, a terminal device receives text input.
  • the text input may be text input or edited by the user in any way on the terminal device.
  • For example, the user can directly enter text into the audio and video editing software on the terminal device, or directly load text files received from other devices or downloaded from the server into the audio and video editing software.
  • the present disclosure does not specifically limit the manner of receiving text input, and any manner that can perform text input is included within the scope of the present disclosure.
  • In step S202, the terminal device buffers the speech data synthesized from the text input through the speech synthesis service (TTS service).
  • For example, the buffering can be performed in real time.
  • In step S203, when the synthesized speech data has been buffered to a playable length, the terminal device decodes and plays the buffered speech data.
  • the TTS service may be a local TTS service invoked by the terminal device through audio and video editing software, or may be a server-side TTS service invoked by the audio and video editing software.
  • In other words, the TTS service may be performed locally on the terminal device to synthesize the text input into speech data, or the terminal device may request the server side to perform the TTS service to synthesize the uploaded text input into speech data.
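  • To make the flow of steps S201 to S203 concrete, the following Python sketch models the terminal-device side: a producer thread stands in for the TTS service (local or remote), encoded audio chunks are buffered as they arrive, and decoding and playback start as soon as an assumed playable-length threshold is reached rather than after the whole synthesis finishes. The names `synthesize_stream`, `decode` and `play`, and the 8 KB threshold, are illustrative placeholders and are not taken from the patent.

```python
import queue
import threading

PLAYABLE_BYTES = 8 * 1024  # assumed "playable length" threshold; the patent does not fix a value


def preview(text, synthesize_stream, decode, play):
    """Sketch of steps S201-S203: buffer streamed TTS audio and start playback early."""
    chunks = queue.Queue()

    def producer():
        # S202: the TTS service (local SDK or server stream) yields encoded audio chunks,
        # assumed here to be aligned on independently decodable frame boundaries.
        for chunk in synthesize_stream(text):
            chunks.put(chunk)
        chunks.put(None)  # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()

    buffered = bytearray()
    started = False
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        buffered.extend(chunk)
        # S203: once a playable length is buffered, decode and play what is available,
        # while the producer keeps filling the buffer in the background.
        if started or len(buffered) >= PLAYABLE_BYTES:
            started = True
            play(decode(bytes(buffered)))
            buffered.clear()
    if buffered:  # flush whatever remains when the stream ends
        play(decode(bytes(buffered)))
```

  • In a real terminal device, `decode` and `play` would be the platform audio decoder and audio output; they are left as injected callables here so that the buffering logic stands on its own.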
  • FIG. 3 is a detailed flowchart of a method for voice preview when a TTS service is executed on the server side according to an exemplary embodiment of the present disclosure.
  • In step S301, the terminal device receives text input. Since this step is the same as the operation of step S201, it is not described repeatedly here.
  • the terminal device sends the text input to the server.
  • For example, the terminal device can invoke a background TTS service located on the server side; that is, the terminal device can invoke the TTS service on the server side and upload the text input to the server, and the server then synthesizes the text input received from the terminal device into speech data through the TTS service.
  • the terminal device receives the voice data synthesized from the text input through the TTS service from the server. For example, reception may be in real time. In other words, the terminal device can receive the voice data synthesized from the text input through the TTS service from the server in real time through streaming.
  • Based on receiving the text input uploaded from the terminal device, the server performs speech synthesis on the text input through the TTS service according to the request of the terminal device, thereby generating audio encoded data in a specific format, for example, audio encoded data in the Advanced Audio Coding (AAC) format.
  • The request of the terminal device may include various voice settings related to the TTS service, for example, the timbre, pitch, speech rate, tone, background music, etc. that the user expects of the synthesized speech.
  • The present disclosure does not specifically limit the format of the generated audio encoded data; any audio format whose data can subsequently be streamed, whose audio frames can be decoded independently, and whose audio frame duration is relatively small is included within the scope of this disclosure.
  • the generated audio encoded data in a specific format is packaged by the server and transmitted to the terminal device in real time by streaming.
  • For example, the generated audio encoded data can be packaged by the server in the Audio Data Transport Stream (ADTS) format and transmitted back to the terminal device in real time in a streaming manner.
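  • The patent does not spell out the packaging step, but ADTS is the standard self-synchronizing framing for AAC: each frame is preceded by a 7-byte header carrying the profile, sample rate, channel count and frame length, so every frame can be located and decoded independently, which is exactly what streaming preview needs. A minimal sketch of building that header for an AAC-LC frame (assuming no CRC) is shown below; it is an illustration of the standard format, not the server's actual implementation.

```python
ADTS_SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                     24000, 22050, 16000, 12000, 11025, 8000]


def adts_packet(aac_frame: bytes, sample_rate: int = 44100, channels: int = 2) -> bytes:
    """Prepend a 7-byte ADTS header (AAC-LC, no CRC) to one raw AAC frame."""
    sr_index = ADTS_SAMPLE_RATES.index(sample_rate)
    profile = 1                       # AAC-LC (audio object type 2, stored as type - 1)
    full_len = len(aac_frame) + 7     # the frame-length field counts the header itself
    hdr = bytearray(7)
    hdr[0] = 0xFF                                              # syncword, high 8 bits
    hdr[1] = 0xF1                                              # syncword low bits, MPEG-4, layer 0, no CRC
    hdr[2] = (profile << 6) | (sr_index << 2) | ((channels >> 2) & 0x01)
    hdr[3] = ((channels & 0x03) << 6) | ((full_len >> 11) & 0x03)
    hdr[4] = (full_len >> 3) & 0xFF
    hdr[5] = ((full_len & 0x07) << 5) | 0x1F                   # buffer fullness = 0x7FF (VBR)
    hdr[6] = 0xFC                                              # fullness low bits, one raw data block
    return bytes(hdr) + aac_frame
```

  • The server can then send each packaged frame over the network as soon as it is synthesized, and the terminal can start decoding at any frame boundary.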
  • In step S304, the terminal device buffers, in real time, the speech data synthesized from the text input through the TTS service.
  • the terminal device buffers the received voice data in a buffer in real time.
  • For example, the terminal device may buffer, in the buffer in real time, the ADTS-format data packets received from the server.
  • In step S305, the terminal device decodes and plays the buffered speech data when the synthesized speech data has been buffered to a playable length.
  • Based on the buffered speech data reaching a playable length (that is, a certain size), the terminal device can decode the currently buffered speech data and play the decoded PCM data.
  • In addition, while performing the above decoding, the terminal device keeps buffering, in real time, the newly synthesized speech data received from the server in the streaming manner, so as to ensure the continuity of decoding and playback.
  • The above-described process of receiving the synthesized speech data from the server in real time through streaming transmission, buffering the received speech data in real time, and starting decoding and playing after a certain amount of data has been buffered can realize real-time preview and greatly reduce the preview delay.
  • the user may be dissatisfied with the voice synthesized according to the current voice settings related to the TTS service, and then change the voice settings related to the TTS service during the voice preview.
  • For example, the user may change the timbre, pitch, speech rate, tone or background music in the voice settings. In the related art, the speech data corresponding to the entire text input is often synthesized through the TTS service according to the user-set voice settings related to the TTS service and received from the server, and the speech preview cannot be performed until all of the speech data has been synthesized; therefore, if the user does not like the speech synthesized according to the current voice settings, resources are wasted and the user experience is affected.
  • In this case, based on the voice settings related to the TTS service being changed, the terminal device may notify the server to stop the speech synthesis operation on the text input that uses the TTS service with the pre-change voice settings, notify the server to re-synthesize the text input using the TTS service with the changed voice settings, and locally re-cache the speech data that is synthesized from the text input through the TTS service according to the changed voice settings and received from the server in real time in a streaming manner.
  • Specifically, the terminal device can delete the previously cached speech data synthesized according to the previous voice settings, and, when the re-cached speech data synthesized according to the changed voice settings reaches the playable length, decode and preview the buffered speech data. In this way, when the voice settings related to the TTS service are changed, the server does not perform speech synthesis on the remaining text input whose speech synthesis has not yet been completed, thereby reducing the cost of the TTS service.
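  • A minimal client-side sketch of this cancellation path is given below. The `server` object and its `stop_synthesis`/`start_synthesis` calls are hypothetical stand-ins for whatever RPC or HTTP interface the TTS backend exposes; the point is the order of operations: stop the old stream, drop the old cache, then restart with the new settings and wait again for a playable length.

```python
import threading


class PreviewSession:
    """Illustrative client-side preview session; names are not taken from the patent."""

    def __init__(self, server, text, settings):
        self.server = server              # assumed stub exposing stop_synthesis/start_synthesis
        self.text = text
        self.settings = settings
        self.buffer = bytearray()         # cached ADTS data for the current settings
        self.lock = threading.Lock()

    def on_chunk(self, chunk, settings):
        # Ignore late packets that were synthesized with settings no longer in use.
        with self.lock:
            if settings == self.settings:
                self.buffer.extend(chunk)

    def change_voice_settings(self, new_settings):
        with self.lock:
            self.server.stop_synthesis(self.text, self.settings)     # hypothetical call
            self.buffer.clear()                                      # drop data for the old timbre
            self.settings = new_settings
            self.server.start_synthesis(self.text, new_settings)     # hypothetical call
```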
  • Furthermore, the terminal device can convert the buffered ADTS data corresponding to the current voice settings into audio data in another format such as M4A, generate audio metadata at the same time, including, for example, the duration, sample rate, number of channels and other information, and finally provide them to the audio and video editing SDK for use.
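  • The patent only says that metadata such as duration, sample rate and channel count is generated alongside the format conversion; one way to obtain those values, sketched below under the assumption that the cache holds well-formed ADTS data, is to walk the buffered frames, since every ADTS header carries the sample rate and channel configuration and each AAC-LC frame represents 1024 samples.

```python
ADTS_SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                     24000, 22050, 16000, 12000, 11025, 8000]


def adts_metadata(data: bytes) -> dict:
    """Derive duration, sample rate and channel count from buffered ADTS data."""
    offset, frames, sample_rate, channels = 0, 0, None, None
    while offset + 7 <= len(data):
        if data[offset] != 0xFF or (data[offset + 1] & 0xF0) != 0xF0:
            break                                        # lost sync; stop at the last complete frame
        sample_rate = ADTS_SAMPLE_RATES[(data[offset + 2] >> 2) & 0x0F]
        channels = ((data[offset + 2] & 0x01) << 2) | (data[offset + 3] >> 6)
        frame_len = (((data[offset + 3] & 0x03) << 11)
                     | (data[offset + 4] << 3)
                     | (data[offset + 5] >> 5))
        offset += frame_len
        frames += 1                                      # each AAC-LC frame holds 1024 samples
    duration = frames * 1024 / sample_rate if sample_rate else 0.0
    return {"duration_s": duration, "sample_rate": sample_rate, "channels": channels}
```

  • The conversion itself can be a pure remux, since M4A is simply AAC in an MP4 container; for example, `ffmpeg -i cached.aac -c:a copy cached.m4a` repackages the frames without re-encoding. Which tool the patent's implementation uses is not specified.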
  • the above describes the detailed process of the method in which the terminal device invokes the TTS service on the server side to implement the speech preview of speech synthesis.
  • the following describes the detailed process of the method for the terminal device to invoke the local TTS service to implement the speech preview of speech synthesis.
  • FIG. 4 is a detailed flowchart of a method for voice preview when the TTS service is executed locally on a terminal device according to an exemplary embodiment of the present disclosure.
  • In step S401, text input is received. Since this step is the same as the operation of step S201, it is not described repeatedly here.
  • the terminal device performs speech synthesis on the text input locally through the TTS service to obtain speech data.
  • For example, the terminal device can invoke a background TTS service that is located locally on the terminal device; that is, the terminal device can invoke the local TTS service, for example, through the API of the audio and video editing software, transmit the text input to the local TTS service, and then use the TTS service to synthesize the text input into speech data according to the user's voice settings related to the TTS service.
  • the terminal device may synthesize the text input into audio encoded data in a specific format through a local TTS service, for example, generate audio encoded data in AAC format.
  • the voice settings related to the TTS service may include, for example, the user's desired timbre of the synthesized voice, background music, pitch, speech rate, tone, and the like.
  • the terminal device buffers the speech data synthesized from the text input through the TTS service.
  • For example, the buffering can be performed in real time.
  • the terminal device can buffer the voice data synthesized by calling the local TTS service in real time.
  • For example, the terminal device can buffer, in the buffer, the AAC-format audio encoded data synthesized through the local TTS service.
  • In step S404, the terminal device decodes and plays the buffered speech data when the synthesized speech data has been buffered to a playable length.
  • Based on the buffered speech data reaching a playable length (that is, a certain size), the terminal device can decode the currently buffered speech data and play the decoded PCM data.
  • In addition, while performing the above decoding, the terminal device keeps caching, in real time, the speech data newly synthesized through the local TTS service, so as to ensure the continuity of audio decoding and playback.
  • The above-described process, in which the terminal device directly implements speech synthesis by calling the local TTS service, buffers the synthesized speech data, and starts decoding and playing after a certain amount of data has been buffered, realizes real-time preview and greatly reduces the preview delay.
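  • The patent does not say how a local TTS engine is made to deliver data incrementally; one common approach, sketched below purely as an assumption, is to split the text into sentences and synthesize them one by one, so that the buffer starts filling after the first sentence instead of after the whole passage. `local_tts` stands for whatever SDK call returns encoded audio for a piece of text.

```python
import re


def synthesize_incrementally(text, local_tts, voice_settings):
    """Yield encoded audio chunk by chunk so preview can start after the first sentence."""
    # Naive split on Chinese/Western end punctuation; a real tokenizer would do better.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])\s*", text) if s]
    for sentence in sentences:
        yield local_tts(sentence, voice_settings)   # assumed SDK call returning bytes
```

  • A generator of this form plugs directly into the `synthesize_stream` parameter of the earlier preview sketch.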
  • In this case, based on the voice settings related to the TTS service being changed, the terminal device may stop the speech synthesis operation on the text input that uses the local TTS service with the pre-change voice settings, re-synthesize the text input using the TTS service with the changed voice settings, and re-cache the speech data synthesized from the text input through the TTS service according to the changed voice settings.
  • Specifically, the terminal device can delete the previously cached speech data synthesized according to the previous voice settings, and, when the re-cached speech data synthesized according to the changed voice settings reaches the playable length, decode and preview the buffered speech data.
  • In this way, when the voice settings related to the TTS service are changed, the terminal device does not perform speech synthesis on the remaining text input whose speech synthesis has not yet been completed, thereby reducing the cost of the TTS service.
  • Furthermore, the terminal device can convert the buffered speech data corresponding to the current voice settings into audio data in another format such as M4A, generate audio metadata such as the duration, sample rate, number of channels and other information, and finally provide them to the audio and video editing SDK for use.
  • FIG. 5 is a block diagram of a voice preview apparatus 500 according to an exemplary embodiment of the present disclosure.
  • the apparatus 500 for audio preview may include a receiving unit 510 , a buffering unit 520 , a decoding unit 530 and a playing unit 540 .
  • the receiving unit 510 can receive text input.
  • The detailed description is not repeated here.
  • the buffering unit 520 may buffer the speech data synthesized based on the text input through the speech synthesis service (TTS service).
  • The TTS service may be a local TTS service invoked by the apparatus 500, or may be an invoked server-side TTS service; in other words, the TTS service may be performed locally in the apparatus 500 to synthesize the text input into speech data, or the apparatus 500 may request the server side to perform the TTS service to synthesize the uploaded text input into speech data.
  • FIG. 6 is a detailed block diagram of a voice preview apparatus 600 according to an exemplary embodiment of the present disclosure.
  • the apparatus 600 for audio preview may include a receiving unit 610 , a buffering unit 620 , a decoding unit 630 , a playing unit 640 and a sending unit 650 .
  • The receiving unit 610 can receive text input; that is, the receiving unit 610 can perform the operation corresponding to step S201 described above with reference to FIG. 2, and therefore a detailed description is omitted here.
  • the sending unit 650 may send the text input to the server.
  • For example, the apparatus 600 can invoke a background TTS service located on the server side; that is, the apparatus 600 can invoke the TTS service on the server side and upload the text input to the server, and the server then synthesizes the text input received from the apparatus 600 into speech data through the TTS service.
  • the receiving unit 610 may receive voice data synthesized from the text input through the TTS service from the server. For example, reception may be in real time. In other words, the device 600 may receive voice data synthesized from the text input through the TTS service from the server in real time through streaming.
  • Based on receiving the text input uploaded from the apparatus 600, the server performs speech synthesis on the text input through the TTS service according to the request of the apparatus 600, thereby generating audio encoded data in a specific format, such as audio encoded data in the AAC format. Thereafter, the generated audio encoded data in the specific format is packaged by the server and transmitted to the apparatus 600 in real time in a streaming manner.
  • The buffering unit 620 may buffer, in real time, the speech data synthesized from the text input through the TTS service. In some embodiments, based on the apparatus 600 receiving the speech data synthesized by the TTS service from the server, the buffering unit 620 buffers the received speech data. For example, the buffering unit 620 may buffer, in real time, the ADTS-format data packets received from the server.
  • the decoding unit 630 may decode the buffered speech data when the synthesized speech data is buffered to a playable length. In some embodiments, based on the voice data buffered by the buffering unit 620 reaching a playable length (ie a certain size), the decoding unit 630 can decode the currently buffered voice data, and the playing unit 640 can play the decoded PCM data. In addition, when performing the above decoding, the buffering unit 620 always buffers the voice data received from the server in real time according to the streaming transmission mode, so as to ensure the continuity of the decoding operation of the decoding unit 630 and the playing operation of the playing unit 640 .
  • Based on the voice settings related to the TTS service being changed, the sending unit 650 may notify the server to stop the speech synthesis operation on the text input that uses the TTS service with the pre-change voice settings and notify the server to re-synthesize the text input using the TTS service with the changed voice settings, and the buffering unit 620 may re-cache the speech data, received by the receiving unit 610 from the server, that is synthesized from the text input through the TTS service according to the changed voice settings.
  • Specifically, the buffering unit 620 may delete the previously cached speech data synthesized according to the previous voice settings, and, based on the re-cached speech data synthesized according to the changed voice settings reaching the playable length, the decoding unit 630 decodes the buffered speech data and the playing unit 640 previews and plays the decoded speech data.
  • Furthermore, the apparatus 600 may convert the buffered ADTS data corresponding to the current voice settings into audio data in another format such as M4A, generate audio metadata at the same time, including, for example, the duration, sample rate, number of channels and other information, and finally provide them to the audio and video editing SDK for use.
  • For any relevant details of the operations performed by the units in FIG. 6, reference can be made to the corresponding descriptions given with respect to FIG. 3, which are not repeated here.
  • FIG. 7 is a detailed block diagram of a voice preview apparatus 700 according to another exemplary embodiment of the present disclosure.
  • the apparatus 700 for previewing a voice may include a receiving unit 710 , a buffering unit 720 , a decoding unit 730 , a playing unit 740 and a voice synthesis unit 750 .
  • the receiving unit 710 may receive text input, that is, the receiving unit 710 may perform operations corresponding to the step S210 described above with reference to FIG. 2 , and thus will not be repeated here.
  • the speech synthesis unit 750 may perform speech synthesis on the text input locally through the TTS service to obtain speech data.
  • For example, the speech synthesis unit 750 may invoke a background TTS service that is located locally on the apparatus 700; that is, the speech synthesis unit 750 can call the local TTS service.
  • For example, the speech synthesis unit 750 can call the local TTS service through the API of the audio and video editing software, transmit the text input to the local TTS service, and then use the TTS service to synthesize the text input into speech data according to the user's voice settings related to the TTS service.
  • the buffering unit 720 may buffer the speech data synthesized from the text input through the TTS service in real time. In some embodiments, the buffering unit 720 may buffer the speech data synthesized through the invoked local TTS service in real time.
  • the decoding unit 730 may decode the buffered speech data when the synthesized speech data is buffered by the buffering unit 720 to a playable length, and the playback unit 740 performs preview playback.
  • In some embodiments, based on the speech data buffered by the buffering unit 720 reaching a playable length (that is, a certain size), the decoding unit 730 can decode the currently buffered speech data, and the playing unit 740 can play the decoded PCM data.
  • the buffering unit 720 always buffers the voice data just synthesized through the local TTS service in real time, so as to ensure the continuity of the decoding operation of the decoding unit 730 and the playing operation of the playing unit 740 .
  • Based on the voice settings related to the TTS service being changed, the speech synthesis unit 750 may stop the speech synthesis operation on the text input that uses the TTS service with the pre-change voice settings and re-synthesize the text input according to the changed voice settings. Specifically, the buffering unit 720 may delete the previously buffered speech data synthesized according to the previous voice settings, and, when the re-cached speech data synthesized according to the changed voice settings in the buffering unit 720 reaches the playable length, the decoding unit 730 decodes the buffered speech data and the playing unit 740 performs preview playback.
  • Furthermore, the apparatus 700 may convert the buffered speech data corresponding to the current voice settings into audio data in another format such as M4A, generate audio metadata such as the duration, sample rate, number of channels and other information, and finally provide them to the audio and video editing SDK for use.
  • FIG. 8 is a block diagram of a system 800 for voice preview according to an exemplary embodiment of the present disclosure.
  • the system 800 includes a terminal device 810 and a server 820 .
  • the terminal device 810 may be any one of 101 , 102 and 103 shown in FIG. 1
  • the server 820 may be the server 105 shown in FIG. 1 .
  • The terminal device 810 can receive the text input, send the text input to the server 820, buffer in real time the speech data received from the server 820 that is synthesized from the text input through the TTS service, and, when the received speech data has been buffered to a playable length, decode and play the buffered speech data.
  • The server 820 may perform speech synthesis, through the TTS service, on the text input received from the terminal device 810 to obtain speech data, and transmit the obtained speech data to the terminal device 810 in real time. Since the operations performed by the terminal device 810 and the server 820 are the same as the corresponding operations described above with reference to FIG. 6, they are not repeated here.
  • Referring to FIG. 9, the electronic device 900 may include at least one memory 910 and at least one processor 920; the at least one memory 910 stores a set of computer-executable instructions which, when executed by the at least one processor 920, perform the method of speech preview according to an embodiment of the present disclosure.
  • the electronic device may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or other devices capable of executing the above set of instructions.
  • the electronic device does not have to be a single electronic device, but can also be any set of devices or circuits that can individually or jointly execute the above-mentioned instructions (or instruction sets).
  • the electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (eg, via wireless transmission).
  • a processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor.
  • processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
  • the processor may execute instructions or code stored in memory, which may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
  • the memory may be integrated with the processor, eg, RAM or flash memory arranged within an integrated circuit microprocessor or the like. Additionally, the memory may comprise a separate device such as an external disk drive, a storage array, or any other storage device that may be used by a database system.
  • the memory and the processor may be operatively coupled, or may communicate with each other, eg, through I/O ports, network connections, etc., to enable the processor to read files stored in the memory.
  • the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device can be connected to each other via a bus and/or a network.
  • According to an embodiment of the present disclosure, a computer-readable storage medium storing instructions may also be provided, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the method of voice preview according to an exemplary embodiment of the present disclosure.
  • Examples of the computer-readable storage medium herein include: Read Only Memory (ROM), Programmable Read Only Memory (PROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, Hard Disk Drive (HDD), Solid State Drive (SSD), and card memory (such as a multimedia card or a Secure Digital (SD) card), etc.
  • the computer programs in the above-mentioned computer-readable storage media can run in an environment deployed in computer equipment such as clients, hosts, proxy devices, servers, etc., and, in one example, the computer programs and any associated data, data files and data structures are distributed over networked computer systems so that the computer programs and any associated data, data files and data structures are stored, accessed and executed in a distributed fashion by one or more processors or computers.
  • a computer program product can also be provided, wherein instructions in the computer program product can be executed by at least one processor in an electronic device to implement the method for voice preview according to an exemplary embodiment of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech preview method, comprising: receiving a text input (S201); performing real-time buffering on speech data synthesized, by means of a speech synthesis service, from the text input (S202); and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data (S203). Also provided are a speech preview apparatus (600), comprising a receiving unit (610), a buffer unit (620), a decoding unit (630), a playing unit (640) and a sending unit (650), as well as an electronic device, a speech processing system, a computer-readable storage medium and a computer program product. The delay is greatly reduced by means of real-time transmission, and real-time preview is started with hardly any waiting time once very little speech data has been buffered. When timbre switching is performed, the local terminal device itself no longer performs TTS on the remaining text that has not been subjected to TTS, or notifies the server to stop doing so, such that the cost of TTS services is reduced, thereby improving the speed of TTS preview for a user in video editing and optimizing the user experience.

Description

Method and device for voice preview
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure claims priority to Chinese Patent Application No. 202011355823.8, filed in China on November 26, 2020, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of signal processing, and in particular, to a method and device for previewing speech.
BACKGROUND
In the related art, the scenario in which a terminal device (such as a mobile phone) uses a speech synthesis (TTS) service is as follows: text is input, a voice file is generated by calling a network-based or offline software development kit (SDK), the voice file is then returned to the terminal device over the network or as a file, and the terminal device thereafter plays the voice file by calling it. In a video editing scenario, the user uses the terminal device to edit the captured video and edit text, then uses the edited text to generate audio files with different timbres, which are synthesized into the video to complete the dubbing process.
SUMMARY
The present disclosure provides a voice preview method and device. The technical solutions of the present disclosure are as follows:
According to a first aspect of the embodiments of the present disclosure, a method for previewing speech is provided, including: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
In some embodiments, the method further includes: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input through a speech synthesis service.
In some embodiments, based on the speech settings related to the speech synthesis service being changed, the server is notified to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, the server is notified to re-synthesize the text input using the speech synthesis service with the changed speech settings, and the speech data received from the server that is synthesized from the text input by the speech synthesis service according to the changed speech settings is re-cached.
In some embodiments, the method further includes: performing speech synthesis on the text input through a speech synthesis service to obtain speech data.
In some embodiments, based on the speech settings related to the speech synthesis service being changed, the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings is stopped, the text input is re-synthesized using the speech synthesis service with the changed speech settings, and the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings is re-cached.
According to a second aspect of the embodiments of the present disclosure, an apparatus for speech preview is provided, comprising: a receiving unit configured to receive text input; a buffer unit configured to buffer speech data synthesized from the text input through a speech synthesis service; a decoding unit configured to decode the buffered speech data when the synthesized speech data has been buffered to a playable length; and a playing unit configured to play the decoded audio data.
In some embodiments, the apparatus for speech preview further includes: a sending unit configured to send the text input to the server, wherein the receiving unit is further configured to receive, from the server, speech data synthesized from the text input through a speech synthesis service.
In some embodiments, based on the speech settings related to the speech synthesis service being changed, the sending unit is configured to notify the server to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings and to notify the server to re-synthesize the text input using the speech synthesis service with the changed speech settings, and the buffer unit is configured to re-cache the speech data, received by the receiving unit from the server, that is synthesized from the text input by the speech synthesis service according to the changed speech settings.
In some embodiments, the apparatus for speech preview further includes: a speech synthesis unit configured to perform speech synthesis on the text input through a speech synthesis service to obtain speech data.
In some embodiments, based on the speech settings related to the speech synthesis service being changed, the speech synthesis unit is configured to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings and to re-synthesize the text input according to the changed speech settings, and the buffer unit is configured to re-cache the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including a processor and a memory for storing executable instructions, wherein the processor is configured to execute the executable instructions to implement the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input by a speech synthesis service.
In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: based on the speech settings related to the speech synthesis service being changed, notifying the server to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, notifying the server to re-synthesize the text input using the speech synthesis service with the changed speech settings, and re-caching the speech data received from the server that is synthesized from the text input by the speech synthesis service according to the changed speech settings.
In some embodiments, the processor is configured to execute the executable instructions to implement the following step: performing speech synthesis on the text input through a speech synthesis service to obtain speech data.
In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: based on the speech settings related to the speech synthesis service being changed, stopping the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, re-synthesizing the text input using the speech synthesis service with the changed speech settings, and re-caching the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings.
According to a fourth aspect of the embodiments of the present disclosure, a speech processing system is provided, comprising: a terminal device configured to receive text input, send the text input to a server, buffer in real time the speech data received from the server that is synthesized from the text input through a TTS service, and, when the received speech data has been buffered to a playable length, decode and play the buffered speech data; and a server configured to perform speech synthesis on the text input received from the terminal device through the TTS service to obtain speech data, and to transmit the obtained speech data to the terminal device in real time.
According to a fifth aspect of the embodiments of the present disclosure, a computer-readable storage medium storing instructions is provided; when the instructions are executed by at least one processor, the at least one processor is caused to perform the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
According to a sixth aspect of the embodiments of the present disclosure, a computer program product is provided; instructions in the computer program product are executed by at least one processor in an electronic device to perform the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
In the embodiments of the present disclosure, the delay is greatly reduced through real-time transmission, and a real-time preview is started once very little speech data has been buffered, with almost no waiting. At the same time, when the timbre is switched, the local terminal device itself stops performing TTS on, or informs the server not to perform TTS on, the remaining text that has not yet been synthesized, which reduces the cost of the TTS service, thereby improving the speed of TTS preview for users in video editing and optimizing the user experience.
It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure; they do not unduly limit the present disclosure.
FIG. 1 is an exemplary system architecture diagram to which exemplary embodiments of the present disclosure may be applied;
FIG. 2 is a flowchart of a method for voice preview according to an exemplary embodiment of the present disclosure;
FIG. 3 is a detailed flowchart of a method for voice preview when the TTS service is executed on the server side according to an exemplary embodiment of the present disclosure;
FIG. 4 is a detailed flowchart of a method for voice preview when the TTS service is executed locally on a terminal device according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure;
FIG. 6 is a detailed block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure;
FIG. 7 is a detailed block diagram of an apparatus for voice preview according to another exemplary embodiment of the present disclosure;
FIG. 8 is a block diagram of a system for voice preview according to an exemplary embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are clearly and completely described below with reference to the accompanying drawings.
It should be understood that the data so used may be interchanged under appropriate circumstances so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following examples do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.
It should be noted here that, in the present disclosure, "at least one of several items" covers three parallel cases: "any one of the several items", "a combination of any multiple of the several items", and "all of the several items". For example, "including at least one of A and B" includes the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" means the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
As mentioned in the background of the present disclosure, in the related art, when text input is used to generate an audio file and the audio file is previewed in a video editing scenario, the text file input by the user needs to go through text input and/or uploading, speech synthesis, speech encoding, and downloading. That is to say, the traditional file usage flow is executed serially, and preview of the downloaded audio file cannot start until all of the above steps have been completed. In view of this, the present disclosure proposes to cache, in real time, the audio data synthesized through the TTS service after the text input is received, and to start real-time preview of the audio file once very little data has been cached, almost without waiting; at the same time, when the timbre is switched, speech synthesis is no longer performed on the text that has not yet been synthesized, which reduces the cost of the TTS service.
图1示出了本公开的示例性实施例可以应用于其中的示例性系统架构100。FIG. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is a medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables. A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages (for example, TTS service requests, audio/video data upload requests, and audio/video data acquisition requests). Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as video recording applications, audio playback applications, video and audio editing applications, instant messaging tools, email clients, and social platform software. The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices that have a display screen and are capable of playing, recording, and editing audio and video, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The terminal devices 101, 102, and 103 may be equipped with an image capture device (for example, a camera) to capture video data. In addition, the terminal devices 101, 102, and 103 may be equipped with a component for converting electrical signals into sound (for example, a speaker) to play sound, and may also be equipped with a device for converting analog audio signals into digital audio signals (for example, a microphone) to capture sound.
The terminal devices 101, 102, and 103 may use the image capture device installed on them to capture video data, and use the audio capture device installed on them to capture audio data. Moreover, the terminal devices 101, 102, and 103 may perform the TTS service on received text input to synthesize audio data from the text input, and may play the audio data using an audio processing component installed on them that supports audio playback.
The server 105 may be a server that provides various services, for example, a background server that supports the audio/video recording applications, audio/video editing applications, and the like installed on the terminal devices 101, 102, and 103. The background server may parse, perform the TTS service on, and store uploaded data such as text input, and may also receive TTS service requests sent by the terminal devices 101, 102, and 103 and feed the synthesized audio data back to the terminal devices 101, 102, and 103.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the speech preview method provided in the embodiments of the present application is generally performed by the terminal devices 101, 102, and 103, and correspondingly, the speech preview apparatus is generally provided in the terminal devices 101, 102, and 103.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation requirements, and the present disclosure imposes no limitation in this regard.
FIG. 2 is a flowchart of a speech preview method according to an exemplary embodiment of the present disclosure.
In step S201, a terminal device (for example, the terminal device 101) receives text input. Here, the text input may be text entered or edited by the user on the terminal device in any manner. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the terminal device, the user may enter text directly in the audio/video editing software, or a text file received from another device or downloaded from a server may be loaded directly into the audio/video editing software. The above are only examples; the present disclosure does not specifically limit the manner of receiving text input, and any manner in which text can be input falls within the scope of the present disclosure.
In step S202, the terminal device buffers the speech data synthesized based on the text input through the speech synthesis service (TTS service). For example, the buffering may be performed in real time. Then, in step S203, when the synthesized speech data has been buffered to a playable length, the terminal device decodes and plays the buffered speech data. In some embodiments, the TTS service may be a local TTS service invoked by the terminal device through the audio/video editing software, or a server-side TTS service invoked through the audio/video editing software. In other words, the TTS service may be performed locally on the terminal device to synthesize the text input into speech data, or the terminal device may request the server side to perform the TTS service to synthesize the uploaded text input into speech data. The detailed processes of the speech preview method when the TTS service is executed on the server side and when it is executed locally on the terminal device are described below with reference to FIG. 3 and FIG. 4, respectively.
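As an illustration of the buffer-then-play behavior of steps S202 and S203, the following is a minimal sketch in Python. The names synthesize_stream and decode_and_play are placeholders introduced only for this example, and the byte threshold is an assumption; none of these are defined by the disclosure.

```python
import queue
import threading

PLAYABLE_BYTES = 32_000  # assumed threshold, roughly one second of encoded audio

def preview(synthesize_stream, decode_and_play):
    """Buffer synthesized chunks and start playback once enough data is cached."""
    buffer = queue.Queue()
    buffered = 0
    ready = threading.Event()

    def receiver():
        nonlocal buffered
        for chunk in synthesize_stream():        # chunks arrive as they are synthesized
            buffer.put(chunk)
            buffered += len(chunk)
            if buffered >= PLAYABLE_BYTES:
                ready.set()                      # playable length reached
        buffer.put(None)                         # end-of-stream marker
        ready.set()                              # also start if the whole text was short

    threading.Thread(target=receiver, daemon=True).start()
    ready.wait()                                 # wait only until the first playable slice is cached
    while (chunk := buffer.get()) is not None:
        decode_and_play(chunk)                   # receiver keeps filling the buffer meanwhile
```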
FIG. 3 is a detailed flowchart of the speech preview method when the TTS service is executed on the server side, according to an exemplary embodiment of the present disclosure.
In step S301, the terminal device receives text input. Since this step is the same as the operation of step S201, it is not described again here.
In step S302, the terminal device sends the text input to the server. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the terminal device, the terminal device may invoke a background TTS service located on the server side; that is, the terminal device may invoke the server-side TTS service and upload the text input to the server, and the server then synthesizes the text input received from the terminal device into speech data through the TTS service.
In step S303, the terminal device receives, from the server, the speech data synthesized from the text input through the TTS service. For example, the reception may be in real time. In other words, the terminal device may receive, by streaming, the speech data synthesized from the text input through the TTS service from the server in real time. In some embodiments, upon receiving the text input uploaded from the terminal device, the server performs speech synthesis on the text input through the TTS service according to the terminal device's request, thereby generating encoded audio data in a specific format, for example, encoded audio data in the Advanced Audio Coding (AAC) format. The terminal device's request may include various speech settings related to the TTS service, for example, the timbre, pitch, speech rate, intonation, background music, and so on that the user expects for the synthesized speech. In addition, the present disclosure does not specifically limit the format of the generated encoded audio data: any audio format in which the generated data can subsequently be streamed, each audio frame can be decoded independently, and the audio frame duration is small falls within the scope of the present disclosure. Thereafter, the generated encoded audio data in the specific format is packaged by the server and transmitted to the terminal device in real time by streaming; for example, the generated encoded audio data may be packaged by the server in the AAC stream transport format (ADTS) and returned to the terminal device in real time by streaming.
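A server-side counterpart to step S303 could emit each encoded packet as soon as it is produced instead of waiting for the whole file. The sketch below is illustrative only: tts_encode_aac and wrap_adts are hypothetical placeholders for the synthesis and packaging steps, and the per-sentence splitting is an assumption made for the example, not a step stated by the disclosure.

```python
import re

def split_into_sentences(text):
    # naive split on common sentence terminators; an assumption, not part of the claimed method
    return [s for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]

def stream_synthesized_audio(text, voice_settings, tts_encode_aac, wrap_adts):
    """Yield ADTS-wrapped AAC packets sentence by sentence so the client can start playback early."""
    for sentence in split_into_sentences(text):
        for aac_frame in tts_encode_aac(sentence, voice_settings):
            yield wrap_adts(aac_frame)   # each packet can be decoded independently
```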
In step S304, the terminal device buffers, in real time, the speech data synthesized from the text input through the TTS service. In some embodiments, as the terminal device receives the speech data synthesized through the TTS service from the server by streaming, the terminal device buffers the received speech data in a buffer in real time; for example, the terminal device may buffer the ADTS-format data packets received from the server in the buffer in real time.
In step S305, when the synthesized speech data has been buffered to a playable length, the terminal device decodes and plays the buffered speech data. In some embodiments, once the speech data buffered by the terminal device reaches a playable length (that is, a certain size), for example, once the terminal device has buffered speech data corresponding to one second of playback, the terminal device may decode the currently buffered speech data and play the decoded PCM data. In addition, during the above decoding, the terminal device continues to receive the just-synthesized speech data from the server by streaming and to place it in the buffer, to ensure the continuity of decoding and playback. The process described above, in which the synthesized speech data is received from the server in real time by streaming, the received speech data is buffered in real time, and decoding and playback start after a certain amount of data has been buffered, enables real-time preview and greatly reduces the preview delay.
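To make the playable-length threshold concrete: an AAC-LC frame typically carries 1024 PCM samples per channel, so at a 44.1 kHz sampling rate roughly 44 buffered frames cover about one second of audio. A small helper illustrating this arithmetic; the sampling rate and target duration are example values, not values fixed by the disclosure.

```python
import math

def frames_for_duration(target_seconds=1.0, sample_rate=44_100, samples_per_frame=1024):
    """How many AAC frames must be buffered before decoding and playback start."""
    return math.ceil(target_seconds * sample_rate / samples_per_frame)

# frames_for_duration() == 44: about one second of audio at 44.1 kHz with 1024 samples per frame
```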
In addition, in practical applications, the user may be dissatisfied with the speech synthesized according to the current speech settings related to the TTS service and may change those settings during the speech preview; for example, the user may change the timbre, pitch, speech rate, intonation, or background music of the speech. In the related art, the speech preview can start only after the TTS service has finished synthesizing the speech data corresponding to the entire text input according to the user's speech settings and all of the synthesized speech data has been received from the server. Therefore, if the user does not like the speech synthesized according to the current speech settings, resources are inevitably wasted and the user experience suffers.
In the present disclosure, however, when the speech settings related to the TTS service are changed, the terminal device may notify the server to stop the speech synthesis operation that the TTS service is performing on the text input according to the speech settings before the change, notify the server to re-perform speech synthesis on the text input using the TTS service according to the changed speech settings, and locally re-buffer the speech data synthesized from the text input through the TTS service according to the changed speech settings and received from the server in real time by streaming. When the speech settings related to the TTS service are changed, the terminal device may delete the previously buffered speech data synthesized according to the previous speech settings, and when the re-buffered speech data synthesized according to the changed speech settings reaches a playable length, decode and preview-play the buffered speech data. In this way, when the speech settings related to the TTS service are changed, the server no longer performs speech synthesis on the remaining text input whose synthesis has not yet been completed, which reduces the cost of the TTS service.
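One way to realize the stop-and-restart behavior described above is a generation counter: when the speech settings change, the client abandons the old stream, clears the cache, and issues a new request. The following is a single-threaded sketch under that assumption; request_stream is a hypothetical client call, not an interface defined by the disclosure.

```python
class PreviewSession:
    """Keep only the speech data that matches the most recent voice settings."""

    def __init__(self, request_stream):
        self.request_stream = request_stream   # hypothetical call: (text, settings) -> packet iterator
        self.generation = 0                    # bumped every time the voice settings change
        self.cache = []

    def change_settings(self, text, new_settings):
        self.generation += 1                   # older streams see a stale generation and stop
        self.cache.clear()                     # previously cached data is discarded
        self._receive(text, new_settings, self.generation)

    def _receive(self, text, settings, generation):
        for packet in self.request_stream(text, settings):
            if generation != self.generation:
                break                          # settings changed again: abandon this stream
            self.cache.append(packet)
```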
In addition, once the user has finished editing according to the finally determined speech settings, the terminal device may convert the buffered ADTS data corresponding to the current speech settings into audio data in another format, such as M4A, and at the same time generate audio metadata, for example including information such as duration, sampling rate, and channel count, which is ultimately provided to the audio/video editing SDK for use.
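The metadata step can be illustrated with a small helper that derives the duration from the decoded sample count. The field names and default values below are assumptions made for the example; the actual interface expected by the editing SDK is not specified here.

```python
def build_audio_metadata(total_samples, sample_rate=44_100, channels=2):
    """Describe the finalized audio for the editing SDK: duration, sample rate, channel count."""
    return {
        "duration_seconds": total_samples / sample_rate,
        "sample_rate": sample_rate,
        "channels": channels,
    }

# build_audio_metadata(441_000) -> {'duration_seconds': 10.0, 'sample_rate': 44100, 'channels': 2}
```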
The detailed process of the speech preview method in which the terminal device invokes the server-side TTS service to implement speech synthesis has been described above; the detailed process of the speech preview method in which the terminal device invokes a local TTS service to implement speech synthesis is described below.
FIG. 4 is a detailed flowchart of the speech preview method when the TTS service is executed locally on the terminal device, according to an exemplary embodiment of the present disclosure.
As shown in FIG. 4, in step S401, text input is received. Since this step is the same as the operation of step S201, it is not described again here.
In step S402, the terminal device performs speech synthesis on the text input locally through the TTS service to obtain speech data. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the terminal device, the terminal device may invoke a background TTS service located locally on the terminal device; that is, the terminal device may invoke the local TTS service, for example through the API of the audio/video editing software, pass the text input to the local TTS service, and then synthesize the text input into speech data through the TTS service according to the user's speech settings related to the TTS service. The terminal device may synthesize the text input into encoded audio data in a specific format through the local TTS service, for example, encoded audio data in the AAC format. The speech settings related to the TTS service may include, for example, the timbre, background music, pitch, speech rate, intonation, and so on that the user expects for the synthesized speech.
In step S403, the terminal device buffers the speech data synthesized from the text input through the TTS service. For example, the buffering may be performed in real time. In some embodiments, the terminal device may buffer, in real time, the speech data synthesized through the invoked local TTS service; for example, the terminal device may buffer the AAC-format encoded audio data synthesized through the local TTS service in the buffer.
In step S404, when the synthesized speech data has been buffered to a playable length, the terminal device decodes and plays the buffered speech data. In some embodiments, once the speech data buffered by the terminal device reaches a playable length (that is, a certain size), for example, once the terminal device has buffered speech data corresponding to one second of playback, the terminal device may decode the currently buffered speech data and play the decoded PCM data. In addition, during the above decoding, the terminal device continues to buffer, in real time, the speech data just synthesized through the local TTS service, to ensure the continuity of audio decoding and playback. As described above, the terminal device implements speech synthesis directly by invoking the local TTS service, buffers the synthesized speech data, and starts decoding and playback after a certain amount of data has been buffered, thereby enabling real-time preview and greatly reducing the preview delay.
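If the buffer-then-play helper sketched after the description of steps S202 and S203 is reused, the local path amounts to plugging the on-device engine in as the producer. A short usage illustration, again with hypothetical names (engine, text, voice_settings, decode_and_play) that are not defined by the disclosure:

```python
# `preview`, `decode_and_play` and the engine object are the hypothetical names from the earlier sketch
preview(
    synthesize_stream=lambda: engine.synthesize(text, voice_settings),  # on-device TTS as the producer
    decode_and_play=decode_and_play,                                    # same decode/playback path as before
)
```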
In addition, similarly to the case in which the TTS service is executed on the server side as described with reference to FIG. 3, in the present disclosure, when the speech settings related to the TTS service are changed, the terminal device may stop the speech synthesis operation that the local TTS service is performing on the text input according to the speech settings before the change, re-perform speech synthesis on the text input using the TTS service according to the changed speech settings, and re-buffer the speech data synthesized from the text input through the TTS service according to the changed speech settings. When the speech settings related to the TTS service are changed, the terminal device may delete the previously buffered speech data synthesized according to the previous speech settings, and when the re-buffered speech data synthesized according to the changed speech settings reaches a playable length, decode and preview-play the buffered speech data. In this way, when the speech settings related to the TTS service are changed, the terminal device no longer performs speech synthesis on the remaining text input whose synthesis has not yet been completed, which reduces the cost of the TTS service.
In addition, once the user has finished editing according to the finally determined speech settings, the terminal device may convert the buffered speech data corresponding to the current speech settings into audio data in another format, such as M4A, and at the same time generate audio metadata, for example including information such as duration, sampling rate, and channel count, which is ultimately provided to the audio/video editing SDK for use.
FIG. 5 is a block diagram of a speech preview apparatus 500 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 5, the speech preview apparatus 500 may include a receiving unit 510, a buffering unit 520, a decoding unit 530, and a playing unit 540. The receiving unit 510 may receive text input. Since the content related to text input has been described in detail above with reference to FIG. 2, it is not repeated here.
The buffering unit 520 may buffer the speech data synthesized based on the text input through the speech synthesis service (TTS service). For example, the buffering may be performed in real time. Then, the decoding unit 530 may decode the buffered speech data when the synthesized speech data has been buffered to a playable length, and the playing unit 540 may play the decoded audio data. In some embodiments, the TTS service may be a local TTS service invoked by the apparatus 500, or an invoked server-side TTS service; in other words, the TTS service may be performed locally in the apparatus 500 to synthesize the text input into speech data, or the apparatus 500 may request the server side to perform the TTS service to synthesize the uploaded text input into speech data. The detailed processes when the TTS service is executed on the server side and when it is executed locally in the apparatus 500 are described below with reference to FIG. 6 and FIG. 7, respectively.
FIG. 6 is a detailed block diagram of a speech preview apparatus 600 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 6, the speech preview apparatus 600 may include a receiving unit 610, a buffering unit 620, a decoding unit 630, a playing unit 640, and a sending unit 650.
The receiving unit 610 may receive text input; that is, the receiving unit 610 may perform the operation corresponding to step S201 described above with reference to FIG. 2, so it is not described again here.
The sending unit 650 may send the text input to the server. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the apparatus 600, the apparatus 600 may invoke a background TTS service located on the server side; that is, the apparatus 600 may invoke the server-side TTS service and upload the text input to the server, and the server then synthesizes the text input received from the apparatus 600 into speech data through the TTS service.
In addition, the receiving unit 610 may receive, from the server, the speech data synthesized from the text input through the TTS service. For example, the reception may be in real time; in other words, the apparatus 600 may receive, by streaming, the speech data synthesized from the text input through the TTS service from the server in real time. In some embodiments, upon receiving the text input uploaded from the apparatus 600, the server performs speech synthesis on the text input through the TTS service according to the request of the apparatus 600, thereby generating encoded audio data in a specific format, for example, encoded audio data in the AAC format. Thereafter, the generated encoded audio data in the specific format is packaged by the server and transmitted to the apparatus 600 in real time by streaming.
The buffering unit 620 may buffer, in real time, the speech data synthesized from the text input through the TTS service. In some embodiments, as the apparatus 600 receives the speech data synthesized through the TTS service from the server, the buffering unit 620 buffers the received speech data; for example, the buffering unit 620 may buffer the ADTS-format data packets received from the server in real time.
The decoding unit 630 may decode the buffered speech data when the synthesized speech data has been buffered to a playable length. In some embodiments, once the speech data buffered by the buffering unit 620 reaches a playable length (that is, a certain size), the decoding unit 630 may decode the currently buffered speech data, and the playing unit 640 plays the decoded PCM data. In addition, during the above decoding, the buffering unit 620 continues to buffer, in real time, the speech data received from the server by streaming, to ensure the continuity of the decoding operation of the decoding unit 630 and the playback operation of the playing unit 640.
In addition, when the speech settings related to the TTS service are changed, the sending unit 650 may notify the server to stop the speech synthesis operation that the TTS service is performing on the text input according to the speech settings before the change and notify the server to re-perform speech synthesis on the text input using the TTS service according to the changed speech settings, and the buffering unit 620 may re-buffer the speech data, received by the receiving unit 610 from the server, that is synthesized based on the text input through the TTS service according to the changed speech settings. When the speech settings related to the TTS service are changed, the buffering unit 620 may delete the previously buffered speech data synthesized according to the previous speech settings, and when the speech data re-buffered by the buffering unit 620 and synthesized according to the changed speech settings reaches a playable length, the decoding unit 630 decodes the buffered speech data and the playing unit 640 preview-plays the decoded speech data.
In addition, once the user has finished editing according to the finally determined speech settings, the apparatus 600 may convert the buffered ADTS data corresponding to the current speech settings into audio data in another format, such as M4A, and at the same time generate audio metadata, for example including information such as duration, sampling rate, and channel count, which is ultimately provided to the audio/video editing SDK for use.
In addition, since the speech preview method shown in FIG. 3 can be performed by the speech preview apparatus 600 shown in FIG. 6, any relevant details of the operations performed by the units in FIG. 6 can be found in the corresponding description of FIG. 3 and are not repeated here.
FIG. 7 is a detailed block diagram of a speech preview apparatus 700 according to another exemplary embodiment of the present disclosure.
Referring to FIG. 7, the speech preview apparatus 700 may include a receiving unit 710, a buffering unit 720, a decoding unit 730, a playing unit 740, and a speech synthesis unit 750.
The receiving unit 710 may receive text input; that is, the receiving unit 710 may perform the operation corresponding to step S201 described above with reference to FIG. 2, so it is not described again here.
The speech synthesis unit 750 may perform speech synthesis on the text input locally through the TTS service to obtain speech data. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the apparatus 700, the speech synthesis unit 750 may invoke a background TTS service located locally on the apparatus 700; that is, the speech synthesis unit 750 may invoke the local TTS service, for example through the API of the audio/video editing software, pass the text input to the local TTS service, and then synthesize the text input into speech data through the TTS service according to the user's speech settings related to the TTS service.
The buffering unit 720 may buffer, in real time, the speech data synthesized from the text input through the TTS service. In some embodiments, the buffering unit 720 may buffer, in real time, the speech data synthesized through the invoked local TTS service.
The decoding unit 730 may decode the buffered speech data when the synthesized speech data buffered by the buffering unit 720 reaches a playable length, and the playing unit 740 performs preview playback. In some embodiments, once the speech data buffered by the buffering unit 720 reaches a playable length (that is, a certain size), the decoding unit 730 may decode the currently buffered speech data, and the playing unit 740 may play the decoded PCM data. In addition, during the above decoding, the buffering unit 720 continues to buffer, in real time, the speech data just synthesized through the local TTS service, to ensure the continuity of the decoding operation of the decoding unit 730 and the playback operation of the playing unit 740.
In addition, similarly to the case in which the TTS service is executed on the server side as described with reference to FIG. 6, in the present disclosure, when the speech settings related to the TTS service are changed, the speech synthesis unit 750 may stop the speech synthesis operation that the TTS service is performing on the text input according to the speech settings before the change and re-perform speech synthesis on the text input according to the changed speech settings, and the buffering unit 720 may re-buffer the speech data synthesized based on the text input through the TTS service according to the changed speech settings. When the speech settings related to the TTS service are changed, the buffering unit 720 may delete the previously buffered speech data synthesized according to the previous speech settings, and when the speech data re-buffered by the buffering unit 720 and synthesized according to the changed speech settings reaches a playable length, the decoding unit 730 decodes the buffered speech data and the playing unit 740 performs preview playback.
In addition, once the user has finished editing according to the finally determined speech settings, the apparatus 700 may convert the buffered speech data corresponding to the current speech settings into audio data in another format, such as M4A, and at the same time generate audio metadata, for example including information such as duration, sampling rate, and channel count, which is ultimately provided to the audio/video editing SDK for use.
In addition, since the speech preview method shown in FIG. 4 can be performed by the speech preview apparatus 700 shown in FIG. 7, any relevant details of the operations performed by the units in FIG. 7 can be found in the corresponding description of FIG. 4 and are not repeated here.
FIG. 8 is a block diagram of a speech preview system 800 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 8, the system 800 includes a terminal device 810 and a server 820. The terminal device 810 may be any one of the terminal devices 101, 102, and 103 shown in FIG. 1, and the server 820 may be the server 105 shown in FIG. 1.
The terminal device 810 may receive text input, send the text input to the server 820, buffer in real time the speech data synthesized from the text input through the TTS service and received from the server 820, and, when the received speech data has been buffered to a playable length, decode and play the buffered speech data.
Since the operations performed by the terminal device 810 are the same as the operations of the apparatus 600 described above with reference to FIG. 6, any relevant details of the operations performed by the terminal device 810 in FIG. 8 can be found in the corresponding description of the apparatus 600 in FIG. 6 and are not repeated here.
The server 820 may perform speech synthesis on the text input received from the terminal device 810 through the TTS service to obtain speech data, and transmit the obtained speech data to the terminal device 810 in real time. Since the operations performed by the server 820 are the same as the operations of the server described above with reference to FIG. 6, they are not repeated here.
FIG. 9 is a block diagram of an electronic device 900 according to an embodiment of the present disclosure. The electronic device 900 may include at least one memory 910 and at least one processor 920. The at least one memory 910 stores a set of computer-executable instructions which, when executed by the at least one processor, cause the speech preview method according to the embodiments of the present disclosure to be performed.
By way of example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. Here, the electronic device need not be a single electronic device; it may be any assembly of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (for example, via wireless transmission).
In the electronic device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor may execute instructions or code stored in the memory, and the memory may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
The memory may be integrated with the processor, for example, with RAM or flash memory arranged within an integrated circuit microprocessor or the like. In addition, the memory may include a standalone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled, or may communicate with each other, for example, through I/O ports or network connections, so that the processor can read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, or a touch input device). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, a computer-readable storage medium storing instructions may also be provided, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the speech preview method according to the exemplary embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card storage (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store, in a non-transitory manner, a computer program and any associated data, data files, and data structures and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium may run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed over networked computer systems, so that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner by one or more processors or computers.
According to an embodiment of the present disclosure, a computer program product may also be provided, and the instructions in the computer program product may be executed by at least one processor in an electronic device to implement the speech preview method according to the exemplary embodiments of the present disclosure.
All the embodiments of the present disclosure may be implemented independently or in combination with other embodiments, and all are regarded as falling within the scope of protection claimed by the present disclosure.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field not disclosed by the present disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (18)

  1. A speech preview method, comprising:
    receiving text input;
    buffering speech data synthesized based on the text input through a speech synthesis service;
    when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  2. The method according to claim 1, further comprising:
    sending the text input to a server;
    receiving, from the server, speech data synthesized from the text input through the speech synthesis service.
  3. The method according to claim 2, wherein, based on speech settings related to the speech synthesis service being changed, the server is notified to stop the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change, the server is notified to re-perform speech synthesis on the text input using the speech synthesis service according to the changed speech settings, and the speech data received from the server and synthesized based on the text input by the speech synthesis service according to the changed speech settings is re-buffered.
  4. The method according to claim 1, further comprising:
    performing speech synthesis on the text input through the speech synthesis service to obtain speech data.
  5. The method according to claim 4, wherein, based on speech settings related to the speech synthesis service being changed, the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change is stopped, speech synthesis is re-performed on the text input using the speech synthesis service according to the changed speech settings, and the speech data synthesized based on the text input by the speech synthesis service according to the changed speech settings is re-buffered.
  6. A speech preview apparatus, comprising:
    a receiving unit configured to receive text input;
    a buffering unit configured to buffer speech data synthesized based on the text input through a speech synthesis service;
    a decoding unit configured to decode the buffered speech data when the synthesized speech data has been buffered to a playable length;
    a playing unit configured to play the decoded audio data.
  7. The apparatus according to claim 6, further comprising:
    a sending unit configured to send the text input to a server,
    wherein the receiving unit is further configured to receive, from the server, speech data synthesized from the text input through the speech synthesis service.
  8. The apparatus according to claim 7, wherein, based on speech settings related to the speech synthesis service being changed, the sending unit is configured to notify the server to stop the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change and to notify the server to re-perform speech synthesis on the text input using the speech synthesis service according to the changed speech settings, and the buffering unit is configured to re-buffer the speech data, received by the receiving unit from the server, that is synthesized based on the text input by the speech synthesis service according to the changed speech settings.
  9. The apparatus according to claim 6, further comprising:
    a speech synthesis unit configured to perform speech synthesis on the text input through the speech synthesis service to obtain speech data.
  10. The apparatus according to claim 9, wherein, based on speech settings related to the speech synthesis service being changed, the speech synthesis unit is configured to stop the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change and to re-perform speech synthesis on the text input according to the changed speech settings, and the buffering unit is configured to re-buffer the speech data synthesized based on the text input by the speech synthesis service according to the changed speech settings.
  11. An electronic device, comprising:
    a processor; and a memory for storing executable instructions, wherein the processor is configured to execute the executable instructions to implement the following steps:
    receiving text input;
    buffering speech data synthesized based on the text input through a speech synthesis service;
    when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  12. The electronic device according to claim 11, wherein the processor is configured to execute the executable instructions to implement the following steps:
    sending the text input to a server;
    receiving, from the server, speech data synthesized from the text input through the speech synthesis service.
  13. The electronic device according to claim 12, wherein the processor is configured to execute the executable instructions to implement the following steps:
    based on speech settings related to the speech synthesis service being changed, notifying the server to stop the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change, notifying the server to re-perform speech synthesis on the text input using the speech synthesis service according to the changed speech settings, and re-buffering the speech data received from the server and synthesized based on the text input by the speech synthesis service according to the changed speech settings.
  14. The electronic device according to claim 11, wherein the processor is configured to execute the executable instructions to implement the following steps:
    performing speech synthesis on the text input through the speech synthesis service to obtain speech data.
  15. The electronic device according to claim 14, wherein the processor is configured to execute the executable instructions to implement the following steps:
    based on speech settings related to the speech synthesis service being changed, stopping the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change, re-performing speech synthesis on the text input using the speech synthesis service according to the changed speech settings, and re-buffering the speech data synthesized based on the text input by the speech synthesis service according to the changed speech settings.
  16. 一种语音处理系统,包括:A speech processing system, comprising:
    终端设备,被配置为接收文字输入,将所述文字输入发送到服务器,对从服务器接收的通过语音合成服务从所述文字输入合成的语音数据进行实时缓存,并在接收的语音数据被缓存达到可播放长度的情况下,对缓存的语音数据进行解码并播放;以及The terminal device is configured to receive text input, send the text input to the server, and perform real-time buffering on the voice data synthesized from the text input through the speech synthesis service received from the server, and when the received voice data is cached to reach In the case of playable length, decode and play the buffered voice data; and
    服务器,被配置为通过语音合成服务对从终端设备接收的所述文字输入进行语音合成以获得语音数据,并将获得的语音数据实时传输给终端设备。The server is configured to perform speech synthesis on the text input received from the terminal device through a speech synthesis service to obtain voice data, and transmit the obtained voice data to the terminal device in real time.
  17. 一种存储指令的计算机可读存储介质,基于所述指令被至少一个处理器运行,促使所述至少一个处理器,执行以下步骤:A computer-readable storage medium storing instructions that, based on the instructions being executed by at least one processor, cause the at least one processor to perform the following steps:
    接收文字输入;receive text input;
    对通过语音合成服务,基于所述文字输入合成的语音数据进行缓存;Cache the speech data synthesized based on the text input through the speech synthesis service;
    在合成的语音数据被缓存达到可播放长度的情况下,对缓存的语音数据进行解码并播放。When the synthesized voice data is buffered to a playable length, the buffered voice data is decoded and played.
  18. 一种计算机程序产品,所述计算机程序产品中的指令被电子设备中的至少一个处理器运行,执行以下步骤:A computer program product, wherein instructions in the computer program product are executed by at least one processor in an electronic device to perform the following steps:
    接收文字输入;receive text input;
    对通过语音合成服务,基于所述文字输入合成的语音数据进行缓存;Cache the speech data synthesized based on the text input through the speech synthesis service;
    在合成的语音数据被缓存达到可播放长度的情况下,对缓存的语音数据进 行解码并播放。When the synthesized voice data is buffered to a playable length, the buffered voice data is decoded and played.
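The client-side behaviour recited in claims 11 to 13 and 16 — send the text input to a server, cache the streamed synthesis result, begin playback once a playable length has been buffered, and restart synthesis when a speech setting such as the timbre is changed — can be sketched as follows. This is a minimal illustration only, assuming a chunked transport and a byte-count playability threshold; the names PreviewSession, request_synthesis, decode_and_play, and PLAYABLE_BYTES are hypothetical and do not correspond to any API in this disclosure or in a real TTS SDK.

# Minimal sketch of the buffered speech-preview flow (claims 11-13, 16).
# All identifiers below are illustrative assumptions, not APIs from the patent.

from dataclasses import dataclass, field
from typing import Iterator

PLAYABLE_BYTES = 32_000  # assumed threshold for "playable length"


@dataclass
class PreviewSession:
    text: str
    voice: str                      # a speech setting, e.g. a timbre name
    buffer: bytearray = field(default_factory=bytearray)
    playing: bool = False

    def on_chunk(self, chunk: bytes) -> None:
        """Cache each synthesized chunk as it arrives from the server."""
        self.buffer.extend(chunk)
        if not self.playing and len(self.buffer) >= PLAYABLE_BYTES:
            self.playing = True
            # Start playback while the remaining chunks are still streaming in.
            decode_and_play(bytes(self.buffer))

    def change_voice(self, new_voice: str) -> Iterator[bytes]:
        """Claim 13/15 behaviour: drop the old synthesis result, clear the
        cache, and re-request synthesis with the changed setting."""
        self.voice = new_voice
        self.buffer.clear()
        self.playing = False
        return request_synthesis(self.text, self.voice)


def request_synthesis(text: str, voice: str) -> Iterator[bytes]:
    """Stand-in for the network call to the TTS server; it simply yields
    placeholder audio chunks so the control flow can be exercised."""
    for _ in range(5):
        yield bytes(10_000)  # placeholder PCM data


def decode_and_play(data: bytes) -> None:
    """Stand-in for the platform decoder and audio player."""
    print(f"playing {len(data)} buffered bytes")


if __name__ == "__main__":
    session = PreviewSession(text="hello world", voice="narrator")
    for chunk in request_synthesis(session.text, session.voice):
        session.on_chunk(chunk)
    # The user picks a different timbre before export: restart and re-buffer.
    for chunk in session.change_voice("child"):
        session.on_chunk(chunk)

The playability threshold is the main tuning point in such a sketch: a larger buffer delays the first audible preview but reduces the risk of mid-playback stalls while the remaining chunks are still arriving from the server.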
PCT/CN2021/115113 2020-11-26 2021-08-27 Speech preview method and apparatus WO2022110943A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011355823.8A CN112562638A (en) 2020-11-26 2020-11-26 Voice preview method and device and electronic equipment
CN202011355823.8 2020-11-26

Publications (1)

Publication Number Publication Date
WO2022110943A1 (en)

Family

ID=75046232

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115113 WO2022110943A1 (en) 2020-11-26 2021-08-27 Speech preview method and apparatus

Country Status (2)

Country Link
CN (1) CN112562638A (en)
WO (1) WO2022110943A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562638A (en) * 2020-11-26 2021-03-26 北京达佳互联信息技术有限公司 Voice preview method and device and electronic equipment
CN113066474A (en) * 2021-03-31 2021-07-02 北京猎户星空科技有限公司 Voice broadcasting method, device, equipment and medium
CN116110410B (en) * 2023-04-14 2023-06-30 北京算能科技有限公司 Audio data processing method, device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050187773A1 (en) * 2004-02-02 2005-08-25 France Telecom Voice synthesis system
US20060200355A1 (en) * 2005-03-01 2006-09-07 Gil Sideman System and method for a real time client server text to speech interface
CN102169689A (en) * 2011-03-25 2011-08-31 深圳Tcl新技术有限公司 Realization method of speech synthesis plug-in
CN106531167A (en) * 2016-11-18 2017-03-22 北京云知声信息技术有限公司 Speech information processing method and device
CN108810608A (en) * 2018-05-24 2018-11-13 烽火通信科技股份有限公司 Live streaming based on IPTV and time-shift playing state switching system and method
CN112562638A (en) * 2020-11-26 2021-03-26 北京达佳互联信息技术有限公司 Voice preview method and device and electronic equipment

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101014996A (en) * 2003-09-17 2007-08-08 摩托罗拉公司 Speech synthesis
KR20060075320A (en) * 2004-12-28 2006-07-04 주식회사 팬택앤큐리텔 A mobile communication terminal and it's control method for proffering text information through voice composition
KR100798408B1 (en) * 2006-04-21 2008-01-28 주식회사 엘지텔레콤 Communication device and method for supplying text to speech function
KR100856786B1 (en) * 2006-07-27 2008-09-05 주식회사 와이즌와이드 System for multimedia naration using 3D virtual agent and method thereof
US8121842B2 (en) * 2008-12-12 2012-02-21 Microsoft Corporation Audio output of a document from mobile device
US10453479B2 (en) * 2011-09-23 2019-10-22 Lessac Technologies, Inc. Methods for aligning expressive speech utterances with text and systems therefor
CN103916716B (en) * 2013-01-08 2017-06-20 北京信威通信技术股份有限公司 The code rate smoothing method of realtime video transmission under a kind of wireless network
CN104810015A (en) * 2015-03-24 2015-07-29 深圳市创世达实业有限公司 Voice converting device, voice synthesis method and sound box using voice converting device and supporting text storage
CN107993646A (en) * 2016-10-25 2018-05-04 北京分音塔科技有限公司 A kind of method for realizing real-time voice intertranslation
CN107370814B (en) * 2017-07-21 2018-09-04 掌阅科技股份有限公司 E-book reads aloud processing method, terminal device and computer storage media
CN108337528B (en) * 2018-01-17 2021-04-16 浙江大华技术股份有限公司 Method and equipment for previewing video
CN108847215B (en) * 2018-08-29 2020-07-17 北京云知声信息技术有限公司 Method and device for voice synthesis based on user timbre
CN111105778A (en) * 2018-10-29 2020-05-05 阿里巴巴集团控股有限公司 Speech synthesis method, speech synthesis device, computing equipment and storage medium
CN110600004A (en) * 2019-09-09 2019-12-20 腾讯科技(深圳)有限公司 Voice synthesis playing method and device and storage medium
CN110769167A (en) * 2019-10-30 2020-02-07 合肥名阳信息技术有限公司 Method for video dubbing based on text-to-speech technology
CN111105779B (en) * 2020-01-02 2022-07-08 标贝(北京)科技有限公司 Text playing method and device for mobile client
CN111179973B (en) * 2020-01-06 2022-04-05 思必驰科技股份有限公司 Speech synthesis quality evaluation method and system

Also Published As

Publication number Publication date
CN112562638A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
WO2022110943A1 (en) Speech preview method and apparatus
US11019119B2 (en) Web-based live broadcast
US11336953B2 (en) Video processing method, electronic device, and computer-readable medium
US7818355B2 (en) System and method for managing content
WO2020155964A1 (en) Audio/video switching method and apparatus, and computer device and readable storage medium
CN101582926A (en) Method for realizing redirection of playing remote media and system
WO2018157743A1 (en) Media data processing method, device, system and storage medium
WO2019062667A1 (en) Method and device for transmitting conference content
WO2018192183A1 (en) Method and apparatus for processing video file during wireless screen delivery
JP2019050554A (en) Method and apparatus for providing voice service
WO2021136161A1 (en) Playback mode determining method and apparatus
US9819429B2 (en) Efficient load sharing and accelerating of audio post-processing
CN109842590B (en) Processing method and device for survey task and computer readable storage medium
GB2508138A (en) Delivering video content to a device by storing multiple formats
WO2022227625A1 (en) Signal processing method and apparatus
CN113192526B (en) Audio processing method and audio processing device
US9762704B2 (en) Service based media player
US7403605B1 (en) System and method for local replacement of music-on-hold
JP7282981B2 (en) METHOD AND SYSTEM FOR PLAYING STREAMING CONTENT USING LOCAL STREAMING SERVER
KR100991264B1 (en) Method and system for playing and sharing music sources on an electric device
KR101428472B1 (en) An apparatus for presenting cloud streaming service and a method thereof
JP7333731B2 (en) Method and apparatus for providing call quality information
EP4164198B1 (en) Method and system for generating media content
CN115631758B (en) Audio signal processing method, apparatus, device and storage medium
CN113364672B (en) Method, device, equipment and computer readable medium for determining media gateway information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896447

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/09/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21896447

Country of ref document: EP

Kind code of ref document: A1