CN112562638A - Voice preview method and device and electronic equipment - Google Patents

Voice preview method and device and electronic equipment

Info

Publication number
CN112562638A
CN112562638A (application CN202011355823.8A / CN202011355823A)
Authority
CN
China
Prior art keywords
voice
text input
data
server
synthesized
Prior art date
Legal status
Pending
Application number
CN202011355823.8A
Other languages
Chinese (zh)
Inventor
陈翔宇 (Chen Xiangyu)
张晨 (Zhang Chen)
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011355823.8A priority Critical patent/CN112562638A/en
Publication of CN112562638A publication Critical patent/CN112562638A/en
Priority to PCT/CN2021/115113 priority patent/WO2022110943A1/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; text-to-speech systems
    • G10L 13/02 — Methods for producing synthetic speech; speech synthesisers
    • G10L 13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to a voice preview method, a voice preview apparatus, and an electronic device. The method includes: receiving a text input; caching, in real time, voice data synthesized from the text input by a speech synthesis service; and, once the cached synthesized voice data reaches a playable length, decoding and playing the cached voice data.

Description

Voice preview method and device and electronic equipment
Technical Field
The present disclosure relates to the field of signal processing, and in particular, to a method and an apparatus for voice preview, and an electronic device.
Background
In the related art, a terminal device (e.g., a mobile phone) typically uses a text-to-speech (TTS) service as follows: text is entered, a voice file is generated by calling a Software Development Kit (SDK) either over the network or offline, the voice file is returned to the terminal device over the network or as a file, and the terminal device then plays it. In a video editing scenario, a user edits a recorded video on the terminal device, enters text, generates audio files in different timbres from that text, and mixes them into the video, thereby completing the dubbing process. In such scenarios, however, file-based transmission introduces considerable delay. For example, when the user wants to switch timbres and preview the result in real time, a conventional TTS SDK must first generate a complete file and return it over the network. If the input text is long and the resulting file is large, the waiting time becomes excessive, real-time preview is impossible, and the user experience suffers badly.
Disclosure of Invention
The present disclosure provides a voice preview method, a voice preview apparatus, and an electronic device, so as to address at least the problem of poor real-time preview performance in the related art. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a voice preview method, including: receiving a text input; caching, in real time, voice data synthesized from the text input through a TTS service; and, once the cached synthesized voice data reaches a playable length, decoding and playing the cached voice data.
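As a non-authoritative sketch (not part of the claimed solution), the buffer-until-playable flow of the first aspect can be illustrated as follows. The chunk source and the `play` callback are hypothetical stand-ins for the TTS stream and the decode-and-play step:

```python
# Minimal sketch of the claimed preview flow, under the assumption that
# synthesized voice data arrives as an iterable of byte chunks and that
# play() stands in for "decode and play". Names are illustrative.
from typing import Callable, Iterable

PLAYABLE_BYTES = 4096  # assumed threshold corresponding to the "playable length"

def preview(chunks: Iterable[bytes], play: Callable[[bytes], None]) -> int:
    """Cache streamed TTS chunks; start playback once enough is cached."""
    buffered = bytearray()
    started = False
    played = 0
    for chunk in chunks:
        buffered.extend(chunk)           # cache synthesized data in real time
        if not started and len(buffered) >= PLAYABLE_BYTES:
            started = True               # playable length reached
        if started and buffered:
            play(bytes(buffered))        # decode-and-play stand-in
            played += len(buffered)
            buffered.clear()
    if buffered:                         # flush the tail after the stream ends
        play(bytes(buffered))
        played += len(buffered)
    return played
```

Note that playback begins as soon as the threshold is crossed, long before the whole input has been synthesized — this is what removes the whole-file wait described in the background section.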
Optionally, the method further comprises: sending the text input to a server; receiving, in real time, speech data synthesized from the text input through a TTS service from a server.
Optionally, when a timbre switching operation is received, the method notifies the server to stop the speech synthesis operation that the TTS service is performing on the text input with the pre-switch timbre, notifies the server to re-synthesize the text input through the TTS service with the post-switch timbre, and locally re-caches the voice data, received in real time from the server, that the TTS service synthesizes from the text input with the post-switch timbre.
Optionally, the method further comprises: and carrying out voice synthesis on the text input locally through a TTS service to obtain voice data.
Optionally, when a timbre switching operation is received, the speech synthesis operation on the text input with the pre-switch timbre is stopped, speech synthesis is performed on the text input again through the TTS service with the post-switch timbre, and the voice data synthesized from the text input with the post-switch timbre is re-cached.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for voice preview, the apparatus including: a receiving unit configured to receive a text input; a caching unit configured to cache speech data synthesized from the text input through a TTS service in real time; a decoding unit configured to decode the buffered voice data in a case where the synthesized voice data is buffered up to a playable length; a playing unit configured to play the decoded audio data.
Optionally, the apparatus further comprises: a transmitting unit configured to transmit the text input to a server, wherein the receiving unit is further configured to receive, in real time, voice data synthesized from the text input through a TTS service from the server.
Optionally, when the receiving unit receives a timbre switching operation, the transmitting unit is configured to notify the server to stop the speech synthesis operation on the text input with the pre-switch timbre and to re-synthesize the text input through the TTS service with the post-switch timbre, and the caching unit is configured to re-cache the voice data, received in real time from the server by the receiving unit, that the TTS service synthesizes from the text input with the post-switch timbre.
Optionally, the apparatus further comprises: a speech synthesis unit configured to perform speech synthesis on the text input through a TTS service to obtain speech data.
Optionally, when the receiving unit receives a timbre switching operation, the speech synthesis unit is configured to stop the speech synthesis operation on the text input with the pre-switch timbre and to re-synthesize the text input with the post-switch timbre through the TTS service, and the caching unit is configured to re-cache the voice data synthesized from the text input with the post-switch timbre.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method of voice preview as described above.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a speech processing system, including: a terminal device configured to receive text input, transmit the text input to a server, buffer voice data synthesized from the text input by a TTS service received from the server in real time, and decode and play the buffered voice data when the received voice data is buffered to a playable length; and the server is configured to perform voice synthesis on the text input received from the terminal equipment through a TTS service to obtain voice data, and transmit the obtained voice data to the terminal equipment in real time.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the method of voice preview as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer program product, wherein instructions in the computer program product, when executed by at least one processor of an electronic device, cause the method of voice preview described above to be performed.
The technical solution provided by the embodiments of the present disclosure brings at least the following beneficial effects: real-time transmission greatly reduces delay, so that preview can start after only a small amount of voice data has been cached, with almost no waiting. Moreover, when the timbre is switched, the terminal device (locally, or by notifying the server) stops performing TTS on the remaining text that has not yet been synthesized, which reduces the cost of the TTS service, speeds up TTS preview during video editing, and improves the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is an exemplary system architecture diagram in which exemplary embodiments of the present disclosure may be applied;
FIG. 2 is a flow chart illustrating a method of voice previewing according to an exemplary embodiment of the present disclosure;
fig. 3 is a detailed flowchart illustrating a method of voice preview when a TTS service is executed at a server side according to an exemplary embodiment of the present disclosure;
fig. 4 is a detailed flowchart illustrating a method of voice preview when a TTS service is locally executed at a terminal device according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating an apparatus for voice preview according to an exemplary embodiment of the present disclosure;
FIG. 6 is a detailed block diagram of an apparatus for voice preview shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 7 is a detailed block diagram of an apparatus for voice preview shown in accordance with another exemplary embodiment of the present disclosure;
FIG. 8 is a block diagram illustrating a system for voice previewing according to an exemplary embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It is to be understood that terms so used are interchangeable under appropriate circumstances, so that the embodiments of the disclosure described herein can be practiced in orders other than those illustrated or described herein. The embodiments described below do not represent all embodiments consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "including at least one of A and B" covers three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "performing at least one of step one and step two" covers: (1) performing step one; (2) performing step two; (3) performing step one and step two.
As mentioned in the background section, in the related art, when an audio file is generated from text input in a video editing scenario and then previewed, the text entered by the user must pass through text input and/or upload, speech synthesis, speech encoding, and download before preview can begin; that is, the conventional file-based approach runs these stages serially, and the result can be previewed only after the whole file is complete. In view of this, the present disclosure provides a method and system that start playing the audio data synthesized by the TTS service in real time after text input is received: preview begins once very little data has been cached, requiring almost no waiting, and when the timbre is switched, the remaining text that has not yet been synthesized is no longer processed, which reduces the cost of the TTS service.
Fig. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may use various connection types, such as wired links, wireless links, or fiber-optic cables. A user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104, for example to send or receive messages (e.g., TTS service requests, audio/video upload requests, audio/video acquisition requests). Various client applications may be installed on the terminal devices 101, 102, 103, such as video recording applications, audio playing applications, audio/video editing applications, instant messaging tools, mail clients, and social platform software. The terminal devices 101, 102, 103 may be hardware or software. As hardware, they may be any electronic device that has a display screen and can play, record, and edit audio and video, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. As software, they may be installed on the electronic devices listed above and implemented either as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.
The terminal devices 101, 102, 103 may be equipped with an image capture device (e.g., a camera) to capture video data. Further, the terminal apparatuses 101, 102, 103 may also be mounted with a component (e.g., a speaker) for converting an electric signal into sound to play the sound, and may also be mounted with a device (e.g., a microphone) for converting an analog audio signal into a digital audio signal to pick up the sound.
The terminal apparatuses 101, 102, 103 may perform acquisition of video data using an image acquisition device mounted thereon, and acquisition of audio data using an audio acquisition device mounted thereon. Also, the terminal apparatuses 101, 102, 103 may perform a TTS service on the received text input to synthesize audio data from the text input, and may play the audio data using an audio processing component supporting audio play installed thereon.
The server 105 may be a server providing various services, for example a background server supporting the audio/video recording and editing applications installed on the terminal devices 101, 102, 103. The background server may analyze, synthesize (via the TTS service), store, and otherwise process uploaded data such as text input, and may also receive TTS service requests sent by the terminal devices 101, 102, 103 and feed the synthesized audio data back to them.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the voice preview method provided by the embodiments of the present disclosure is generally executed by the terminal devices 101, 102, 103, and accordingly the voice preview apparatus is generally disposed in the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation, and the disclosure is not limited thereto.
Fig. 2 is a flowchart illustrating a method of voice preview according to an exemplary embodiment of the present disclosure.
In step S201, a text input is received. The text input may be entered or edited by the user on the terminal device in any manner. Specifically, after the user starts the edit preview function of the audio/video editing software on the terminal device, the user may type text directly into the software, or load a text file received from another device or downloaded from a server into it. These are merely examples: the present disclosure does not restrict how text input is received, and any manner of receiving text input falls within its scope.
In step S202, voice data synthesized from the text input by the TTS service is cached in real time. Then, in step S203, in the case where the synthesized voice data is buffered to a playable length, the buffered voice data is decoded and played. Specifically, the TTS service may be a local TTS service called by the terminal device through the audio/video editing software, or may be a server-side TTS service called by the audio/video editing software, in other words, the TTS service may be performed locally on the terminal device to synthesize the text input into voice data, or the terminal device may request the server-side TTS service to synthesize the uploaded text input into voice data. Detailed procedures of the method for speech processing when the TTS service is executed on the server side and when the TTS service is executed locally at the terminal device will be described below with reference to fig. 3 and 4, respectively.
Fig. 3 is a detailed flowchart illustrating a method of voice preview when a TTS service is executed on a server side according to an exemplary embodiment of the present disclosure.
In step S301, a text input is received. Since this step is the same as the operation of step S201, a description thereof will not be repeated here.
In step S302, the text input is sent to a server. Specifically, after a user starts an editing preview function in audio/video editing software on a terminal device, the terminal device may invoke a background TTS service, where the background TTS service is located on a server side, that is, the terminal device may invoke a TTS service on the server side and upload the text input to the server, and then the server synthesizes the text input received from the terminal device into speech data through the TTS service.
In step S303, voice data synthesized from the text input through the TTS service is received in real time from the server; in other words, the terminal device receives the synthesized voice data from the server in a streaming manner. Specifically, when the server receives the text input uploaded from the terminal device, it performs speech synthesis on the text input through the TTS service according to the terminal device's request, generating audio coded data in a specific format, for example Advanced Audio Coding (AAC). The request of the terminal device may include various speech settings related to the TTS service, such as the timbre, pitch, speech rate, intonation, and background music of the speech the user wants to synthesize. The present disclosure does not restrict the format of the generated audio coded data: any audio format that can be streamed, whose audio frames can be decoded independently, and whose frame duration is small falls within the scope of the disclosure. The server then packs the generated audio coded data and transmits it to the terminal device in real time in a streaming manner; for example, the server may pack the data in the Audio Data Transport Stream (ADTS) format and stream it back to the terminal device in real time.
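The ADTS packing step mentioned above can be sketched as follows. Each AAC frame is prefixed with a 7-byte ADTS header (MPEG-4, no CRC) carrying the frame length, which is what lets the client locate and decode frames independently mid-stream. This is an illustrative sketch of the standard ADTS layout, not code from the patent; the frame payloads here are arbitrary bytes, not real AAC data.

```python
# Hedged sketch of ADTS framing per ISO/IEC 14496-3: a 7-byte header
# (syncword 0xFFF, profile, sampling-frequency index, channel config,
# 13-bit frame_length including the header) precedes each AAC frame.

FREQ_INDEX = {96000: 0, 88200: 1, 64000: 2, 48000: 3, 44100: 4, 32000: 5,
              24000: 6, 22050: 7, 16000: 8, 12000: 9, 11025: 10, 8000: 11}

def adts_header(payload_len: int, sample_rate: int = 44100,
                channels: int = 2, profile: int = 2) -> bytes:
    """Build a 7-byte ADTS header (MPEG-4, no CRC) for one AAC frame."""
    frame_len = payload_len + 7                # frame_length includes header
    fi = FREQ_INDEX[sample_rate]
    return bytes([
        0xFF,
        0xF1,                                  # syncword end, MPEG-4, no CRC
        ((profile - 1) << 6) | (fi << 2) | (channels >> 2),
        ((channels & 3) << 6) | (frame_len >> 11),
        (frame_len >> 3) & 0xFF,
        ((frame_len & 7) << 5) | 0x1F,
        0xFC,
    ])

def split_adts(stream: bytes) -> list:
    """Split a concatenated ADTS byte stream back into frame payloads."""
    frames, pos = [], 0
    while pos + 7 <= len(stream):
        assert stream[pos] == 0xFF and (stream[pos + 1] & 0xF0) == 0xF0
        frame_len = ((stream[pos + 3] & 0x03) << 11) | \
                    (stream[pos + 4] << 3) | (stream[pos + 5] >> 5)
        frames.append(stream[pos + 7:pos + frame_len])
        pos += frame_len
    return frames
```

Because every frame is self-describing, the receiving side can begin decoding from any cached frame boundary without waiting for the rest of the stream — the property the paragraph above relies on.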
In step S304, voice data synthesized from the text input by the TTS service is cached in real time. Specifically, when the terminal device receives voice data synthesized by the TTS service from the server by streaming, the terminal device buffers the received voice data in a buffer in real time, for example, the terminal device may buffer data packets in an ADTS format received from the server in real time.
In step S305, in the case where the synthesized voice data is buffered to a playable length, the buffered voice data is decoded and played. Specifically, when the voice data buffered in the buffer by the terminal device reaches a playable length (i.e. a certain size), for example, when the terminal device buffers the voice data with a playable length of 1 second in the buffer, the terminal device may perform a decoding operation on the currently buffered voice data and play the decoded PCM data. The above-described process of receiving synthesized voice data from the server in real time by a streaming transmission manner, caching the received voice data in real time, and starting decoding and playing after a certain amount of data is cached can realize real-time preview, thereby greatly reducing preview delay.
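The "playable length" threshold in step S305 can be made concrete with a small back-of-envelope helper. Assuming (as is standard for AAC, though the patent does not state it) that each AAC frame carries 1024 PCM samples, the roughly one-second threshold mentioned above corresponds to the following number of cached frames:

```python
# Illustrative assumption: AAC-LC packs 1024 PCM samples per frame, so a
# duration threshold maps to ceil(seconds * sample_rate / 1024) frames.
import math

AAC_SAMPLES_PER_FRAME = 1024

def frames_for_duration(seconds: float, sample_rate: int) -> int:
    """Number of AAC frames needed to cover a playback duration."""
    return math.ceil(seconds * sample_rate / AAC_SAMPLES_PER_FRAME)
```

At 44.1 kHz, for instance, one second of playable audio is reached after about 44 cached frames, which is why playback can begin almost immediately once streaming starts.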
In addition, in practical applications, a user may be dissatisfied with a voice synthesized according to a current voice setting related to a TTS service, and further, the voice setting related to the TTS service may be changed at the time of voice preview, for example, the user may change the tone, pitch, speed, tone, background music, or the like of the voice.
In the present disclosure, when a voice setting related to the TTS service is changed, the terminal device may notify the server to stop the speech synthesis operation that the TTS service is performing on the text input with the pre-change settings, notify the server to re-synthesize the text input through the TTS service with the changed settings, and locally re-cache the voice data, received in real time from the server in a streaming manner, that the TTS service synthesizes with the changed settings. When the voice setting is changed, the terminal device may delete the previously cached voice data synthesized with the previous settings, and decode and preview the re-cached voice data once it reaches a playable length. In this way, when a voice setting related to the TTS service changes, the server no longer performs speech synthesis on the remaining text input that has not yet been synthesized, which reduces the cost of the TTS service.
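The cache-reset behaviour on a timbre switch can be sketched with a small client-side object. The generation counter is an implementation assumption (not from the patent) used to discard late-arriving chunks that were synthesized under the old settings; all names are illustrative:

```python
# Sketch of the timbre-switch handling described above: on a settings
# change the client drops the stale cache and bumps a generation counter,
# so chunks synthesized under the old timbre are rejected on arrival.

class PreviewCache:
    def __init__(self) -> None:
        self.generation = 0          # bumped on every voice-setting change
        self.buffer = bytearray()

    def switch_timbre(self) -> int:
        """Discard stale data; return the new generation to tag requests."""
        self.generation += 1
        self.buffer.clear()          # delete previously cached voice data
        return self.generation

    def on_chunk(self, generation: int, chunk: bytes) -> bool:
        """Cache a chunk only if it belongs to the current timbre."""
        if generation != self.generation:
            return False             # stale chunk from before the switch
        self.buffer.extend(chunk)
        return True
```

On the server side, the matching behaviour is simply to abandon the old synthesis job when the notification arrives, so no TTS work is wasted on text the user will never hear in the old timbre.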
In addition, when the user completes editing according to the finally determined voice setting, the terminal device may convert the buffered ADTS data corresponding to the current voice setting into audio data in other formats such as M4A, and generate audio metadata such as information including duration, sampling rate, channel number, and the like, and finally provide the audio metadata to the audio-video editing SDK for use.
The above describes the detailed procedure of the method for implementing voice preview of voice synthesis by calling the TTS service on the server side by the terminal device, and the following describes the detailed procedure of the method for implementing voice preview of voice synthesis by calling the local TTS service by the terminal device.
Fig. 4 is a detailed flowchart illustrating a method of voice preview when the TTS service is executed locally at the terminal device, according to an exemplary embodiment of the present disclosure.
As shown in fig. 4, in step S401, a text input is received. Since this step is the same as the operation of step S201, a description thereof will not be repeated here.
In step S402, the text input is locally speech-synthesized by a TTS service to obtain speech data. Specifically, after a user starts an editing preview function in audio/video editing software on a terminal device, the terminal device may invoke a background TTS service, where the background TTS service is local to the terminal device, that is, the terminal device may invoke a local TTS service, for example, the terminal device invokes the local TTS service through an API of the audio/video editing software and transmits the text input to the local TTS service, and then synthesizes the text input into voice data according to a voice setting of the user related to the TTS service through the TTS service. The terminal device can synthesize the text input into audio coded data in a specific format through a local TTS service, for example, generate audio coded data in an AAC format. The voice settings related to the TTS service may include, for example, the timbre, background music, pitch, speech rate, pitch, etc. of the voice that the user desires to synthesize.
In step S403, voice data synthesized from the text input by the TTS service is cached in real time. Specifically, the terminal device may cache in real time voice data synthesized by the invoked native TTS service, for example, the terminal device may cache audio coded data in AAC format synthesized by the native TTS service in a buffer.
In step S404, in the case where the synthesized voice data is buffered to a playable length, the buffered voice data is decoded and played. Specifically, when the voice data buffered in the buffer by the terminal device reaches a playable length (i.e. a certain size), for example, when the terminal device buffers the voice data with a playable length of 1 second in the buffer, the terminal device may perform a decoding operation on the currently buffered voice data and play the decoded PCM data, and in addition, when performing the above decoding, the terminal device always buffers the voice data just synthesized by the local TTS service in real time to ensure the continuity of audio decoding and playing. The terminal device described above realizes speech synthesis directly by calling local TTS service, caches the synthesized speech data, and starts decoding and playing after a certain amount of data is cached, so that real-time preview can be realized, and preview delay is greatly reduced.
Further, similar to the server-side case described with reference to fig. 3, in the present disclosure, when a voice setting related to the TTS service is changed, the terminal device may stop the speech synthesis operation that the local TTS service is performing on the text input with the pre-change settings, re-synthesize the text input through the TTS service with the changed settings, and re-cache the voice data synthesized with the changed settings. When the voice setting is changed, the terminal device may delete the previously cached voice data synthesized with the previous settings, and decode and preview the re-cached voice data once it reaches a playable length. In this way, when a voice setting related to the TTS service changes, the terminal device no longer performs speech synthesis on the remaining text input that has not yet been synthesized, which reduces the cost of the TTS service.
In addition, when the user finishes editing according to the finally determined voice setting, the terminal device can convert the cached voice data corresponding to the current voice setting into audio data in other formats such as M4A, and simultaneously generate audio metadata such as information including duration, sampling rate, channel number and the like, and finally provide the audio metadata for the audio/video editing SDK.
Fig. 5 is a block diagram illustrating an apparatus 500 for voice preview according to an exemplary embodiment of the present disclosure.
Referring to fig. 5, the apparatus 500 for voice preview may include a receiving unit 510, a buffering unit 520, a decoding unit 530, and a playing unit 540. The receiving unit 510 may receive a text input. Here, since the contents related to the text input have been described in detail above with reference to fig. 2, a repeated description thereof will not be made.
The buffering unit 520 may buffer voice data synthesized from the text input through a TTS service in real time. Then, the decoding unit 530 may decode the buffered voice data in a case where the synthesized voice data is buffered up to a playable length, and the playing unit 540 may play the decoded audio data. Specifically, the TTS service may be a local TTS service invoked by the device 500, or may be an invoked server-side TTS service, in other words, the TTS service may be performed locally on the device 500 to synthesize the text input into speech data, or the device 500 may request the server-side TTS service to synthesize the uploaded text input into speech data. Detailed procedures of the method for speech processing when the TTS service is executed on the server side and when the TTS service is executed locally on the device 500 will be described below with reference to fig. 6 and 7, respectively.
Fig. 6 is a detailed block diagram of an apparatus 600 for voice preview according to an exemplary embodiment of the present disclosure.
Referring to fig. 6, the apparatus 600 for voice preview may include a receiving unit 610, a buffering unit 620, a decoding unit 630, a playing unit 640, and a transmitting unit 650.
The receiving unit 610 may receive a text input, that is, the receiving unit 610 may perform an operation corresponding to step S210 described above with reference to fig. 2, and thus, a detailed description thereof is omitted.
The sending unit 650 may send the text input to a server. Specifically, after the user starts the edit preview function in the audio/video editing software on the device 600, the device 600 may invoke a background TTS service located on the server side. That is, the device 600 may invoke a TTS service on the server side and upload the text input to the server, and the server then synthesizes the text input received from the device 600 into voice data through the TTS service.
Further, the receiving unit 610 may receive, in real time from the server, voice data synthesized from the text input through the TTS service. In other words, the device 600 may receive the voice data synthesized from the text input by the TTS service from the server in real time by streaming. Specifically, when the server receives the text input uploaded from the device 600, the server performs speech synthesis on the text input through the TTS service according to the request of the device 600, thereby generating audio-encoded data in a specific format, for example, the AAC format. Thereafter, the generated audio-encoded data of the specific format is packetized by the server and transmitted to the device 600 in real time in a streaming manner.
The buffering unit 620 may buffer voice data synthesized from the text input through the TTS service in real time. Specifically, when the device 600 receives voice data synthesized by the TTS service from the server, the buffering unit 620 buffers the received voice data, for example, the buffering unit 620 may buffer data packets in an ADTS format received from the server in real time.
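Regarding the ADTS packets mentioned above: each ADTS frame begins with a 12-bit 0xFFF syncword and carries a 13-bit frame length (header included) spread across bytes 3 to 5 of its header, so a streamed byte buffer can be split into decodable frames roughly as follows. This framing sketch is an illustration, not code from the disclosure.

```python
def split_adts_frames(data: bytes):
    """Split a buffer of ADTS-wrapped AAC into individual frames.

    Returns (frames, remainder), where remainder holds the trailing bytes
    of an incomplete frame that should be kept for the next network read.
    """
    frames = []
    i = 0
    while i + 7 <= len(data):  # 7 bytes = minimal ADTS header
        # 12-bit syncword: 0xFFF
        if data[i] != 0xFF or (data[i + 1] & 0xF0) != 0xF0:
            i += 1  # resynchronize on corrupt input
            continue
        # 13-bit frame length spans bytes 3..5 (low 2 bits, full byte, high 3 bits)
        frame_len = (((data[i + 3] & 0x03) << 11)
                     | (data[i + 4] << 3)
                     | (data[i + 5] >> 5))
        if i + frame_len > len(data):
            break  # incomplete frame; wait for more streamed data
        frames.append(data[i:i + frame_len])
        i += frame_len
    return frames, data[i:]
```

The buffering unit would call this on each network read, buffer the complete frames, and carry the remainder over to the next read.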
The decoding unit 630 may decode the buffered voice data in a case where the synthesized voice data has been buffered up to a playable length. Specifically, when the voice data buffered by the buffering unit 620 reaches a playable length (i.e., a certain size), the decoding unit 630 may decode the currently buffered voice data, and the playing unit 640 may play the decoded PCM data. In addition, while the above decoding is performed, the buffering unit 620 continues to buffer the voice data received from the server in real time in a streaming manner, to ensure continuity of the decoding operation of the decoding unit 630 and the playing operation of the playing unit 640.
In addition, when a voice setting related to the TTS service is changed, the transmitting unit 650 may notify the server to stop the voice synthesis operation for the text input according to the voice setting before the change using the TTS service, and notify the server to re-perform voice synthesis for the text input according to the changed voice setting using the TTS service; the buffering unit 620 may then re-buffer the voice data synthesized from the text input by the TTS service according to the changed voice setting, which is received in real time from the server by the receiving unit 610. When the voice setting related to the TTS service is changed, the buffering unit 620 may delete the previously buffered voice data synthesized according to the previous voice setting, and, when the voice data re-buffered by the buffering unit 620 and synthesized according to the changed voice setting reaches a playable length, the decoding unit 630 decodes the buffered voice data and the playing unit 640 performs a preview playing of the decoded voice data.
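One common way to realize the "stop and re-synthesize on setting change" behavior above is a generation counter that tags each synthesis request, so late-arriving chunks of a cancelled request are dropped instead of buffered. The class and method names below are assumptions for illustration, not an API defined by the disclosure.

```python
import itertools

class PreviewSession:
    """Tracks the active synthesis generation; chunks belonging to a
    superseded voice setting are dropped rather than buffered."""

    def __init__(self):
        self._gen = itertools.count()
        self.generation = next(self._gen)
        self.buffer = bytearray()

    def change_voice_setting(self) -> int:
        """Discard audio from the old setting and start a new generation."""
        self.generation = next(self._gen)
        self.buffer.clear()          # delete previously buffered voice data
        return self.generation       # token tagging the new synthesis request

    def on_chunk(self, generation: int, chunk: bytes) -> bool:
        """Buffer a chunk only if it belongs to the current generation."""
        if generation != self.generation:
            return False             # stale data from a cancelled request
        self.buffer.extend(chunk)
        return True
```

The terminal would pass the token returned by `change_voice_setting()` along with its new request to the server, and tag every received chunk with the token of the request that produced it.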
In addition, when the user has completed editing according to the finally determined voice setting, the apparatus 600 may convert the buffered ADTS data corresponding to the current voice setting into audio data in another format, such as M4A, generate audio metadata including information such as duration, sampling rate, and channel number, and finally provide the audio data and metadata to the audio/video editing SDK for use.
In addition, since the method for voice previewing shown in fig. 3 can be performed by the apparatus 600 for voice previewing shown in fig. 6, any relevant details involved in the operations performed with respect to the units in fig. 6 can be referred to the corresponding description with respect to fig. 3, and are not described herein again.
Fig. 7 is a detailed block diagram of an apparatus 700 for voice preview according to another exemplary embodiment of the present disclosure.
Referring to fig. 7, the apparatus 700 for voice preview may include a receiving unit 710, a buffering unit 720, a decoding unit 730, a playing unit 740, and a voice synthesizing unit 750.
The receiving unit 710 may receive a text input, that is, the receiving unit 710 may perform an operation corresponding to step S210 described above with reference to fig. 2, and thus, a detailed description thereof is omitted.
The speech synthesis unit 750 may perform speech synthesis on the text input locally through a TTS service to obtain voice data. Specifically, after the user starts the edit preview function in the audio/video editing software on the device 700, the speech synthesis unit 750 may invoke a background TTS service that is local to the device 700. That is, the speech synthesis unit 750 may invoke the local TTS service, for example, through an API of the audio/video editing software, and transmit the text input to the local TTS service, which then synthesizes the text input into voice data according to the user's voice setting related to the TTS service.
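The local invocation described above might be structured as follows; the engine interface, the sentence-level chunking, and all names are assumed for illustration, since the disclosure does not fix a specific API.

```python
from typing import Callable, Dict, Iterator

# Assumed shape of a local TTS engine: a callable that yields encoded
# audio chunks for a piece of text under a given voice setting.
TtsEngine = Callable[[str, Dict], Iterator[bytes]]

def synthesize_locally(text: str, voice_setting: Dict,
                       engine: TtsEngine) -> Iterator[bytes]:
    """Stream voice data for `text` from a local TTS engine, sentence by
    sentence, so buffering and playback can begin before synthesis of the
    whole input has finished."""
    sentences = (s.strip() for s in text.split("."))
    for sentence in filter(None, sentences):   # skip empty fragments
        yield from engine(sentence, voice_setting)
```

A buffering unit would consume this iterator in real time, appending each yielded chunk exactly as it would append chunks streamed from a server.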
The buffering unit 720 may buffer voice data synthesized from the text input through the TTS service in real time. Specifically, the buffering unit 720 may buffer voice data synthesized by the invoked local TTS service in real time.
The decoding unit 730 may decode the buffered voice data in a case where the synthesized voice data has been buffered by the buffering unit 720 up to a playable length, and preview playing is performed by the playing unit 740. Specifically, when the voice data buffered by the buffering unit 720 reaches a playable length (i.e., a certain size), the decoding unit 730 may decode the currently buffered voice data, and the playing unit 740 may play the decoded PCM data. In addition, while the above decoding is performed, the buffering unit 720 continues to buffer, in real time, the voice data just synthesized by the local TTS service, to ensure continuity of the decoding operation of the decoding unit 730 and the playing operation of the playing unit 740.
Further, similar to the case described with reference to fig. 6 in which the TTS service is executed on the server side, in the present disclosure, when the voice setting related to the TTS service is changed, the speech synthesis unit 750 may stop the voice synthesis operation being performed on the text input according to the voice setting before the change using the TTS service and re-perform voice synthesis on the text input according to the changed voice setting, and the buffering unit 720 may re-buffer the voice data synthesized from the text input by the TTS service according to the changed voice setting. When the voice setting related to the TTS service is changed, the buffering unit 720 may delete the previously buffered voice data synthesized according to the previous voice setting, and, in a case where the voice data re-buffered by the buffering unit 720 and synthesized according to the changed voice setting reaches a playable length, the decoding unit 730 decodes the buffered voice data and the playing unit 740 performs a preview playing.
In addition, when the user has completed editing according to the finally determined voice setting, the apparatus 700 may convert the buffered voice data corresponding to the current voice setting into audio data in another format, such as M4A, generate audio metadata including information such as duration, sampling rate, and channel number, and finally provide the audio data and metadata to the audio/video editing SDK for use.
In addition, since the method for voice previewing shown in fig. 4 can be performed by the apparatus for voice previewing 700 shown in fig. 7, any relevant details involved in the operations performed with respect to the units in fig. 7 can be referred to the corresponding description with respect to fig. 4, and are not described herein again.
Fig. 8 is a block diagram illustrating a system 800 for voice preview according to an exemplary embodiment of the present disclosure.
Referring to fig. 8, the system 800 includes a terminal device 810 and a server 820. The terminal device 810 may be any one of the terminal devices 101, 102, and 103 shown in fig. 1, and the server 820 may be the server 105 shown in fig. 1.
Terminal device 810 may receive a text input, send the text input to server 820, cache speech data received from server 820 synthesized from the text input by a TTS service in real time, and decode and play the cached speech data when the received speech data is cached to a playable length.
Since the operations performed by the terminal device 810 are respectively the same as the operations of the apparatus 600 described above with reference to fig. 6, any relevant details related to the operations performed by the terminal device 810 in fig. 8 can be referred to the corresponding description of the apparatus 600 in fig. 6, and are not described again here.
The server 820 may perform voice synthesis on the text input received from the terminal device 810 through the TTS service to obtain voice data, and transmit the obtained voice data to the terminal device 810 in real time. Since the operations performed by the server 820 are the same as those of the server described above with reference to fig. 6, no further description is given here.
Fig. 9 is a block diagram of an electronic device 900 according to an embodiment of the disclosure. The electronic device 900 may include at least one memory 910 and at least one processor 920, the at least one memory 910 storing a set of computer-executable instructions that, when executed by the at least one processor 920, perform a method of voice preview according to an embodiment of the disclosure.
By way of example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The electronic device need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or instruction sets) either individually or jointly. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In an electronic device, a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor may execute instructions or code stored in the memory, which may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory may be integral to the processor, e.g., RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the memory may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a method of voice preview according to an exemplary embodiment of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc memory, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can be run in an environment deployed in computer apparatus such as a client, a host, a proxy device, a server, and the like; further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product in which instructions are executable by at least one processor in an electronic device to implement a method of voice preview according to an exemplary embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of voice previewing, comprising:
receiving a text input;
caching voice data synthesized from the text input through a voice synthesis service in real time;
and decoding and playing the cached voice data in a case where the synthesized voice data has been cached up to a playable length.
2. The method of claim 1, further comprising:
sending the text input to a server;
receiving, in real time, speech data synthesized from the text input by a speech synthesis service from a server.
3. The method of claim 2, wherein, when a voice setting related to the voice synthesis service is changed, the server is notified to stop the voice synthesis operation for the text input according to the voice setting before the change using the voice synthesis service, the server is notified to re-perform voice synthesis for the text input according to the changed voice setting using the voice synthesis service, and voice data synthesized from the text input by the voice synthesis service according to the changed voice setting, received in real time from the server, is re-buffered locally.
4. The method of claim 1, further comprising:
the text input is locally speech synthesized by a speech synthesis service to obtain speech data.
5. The method of claim 4, wherein, when a voice setting related to the voice synthesis service is changed, the voice synthesis operation for the text input according to the voice setting before the change is stopped using the voice synthesis service, voice synthesis is re-performed on the text input according to the changed voice setting using the voice synthesis service, and voice data synthesized from the text input by the voice synthesis service according to the changed voice setting is re-buffered.
6. An apparatus for voice previewing, comprising:
a receiving unit configured to receive a text input;
a buffer unit configured to buffer voice data synthesized from the text input by a voice synthesis service in real time;
a decoding unit configured to decode the buffered voice data in a case where the synthesized voice data is buffered up to a playable length;
a playing unit configured to play the decoded audio data.
7. The apparatus of claim 6, further comprising:
a transmitting unit configured to transmit the text input to a server,
wherein the receiving unit is further configured to receive, in real time, speech data synthesized from the text input by a speech synthesis service from a server.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 5.
9. A speech processing system, comprising:
a terminal device configured to receive text input, transmit the text input to a server, buffer voice data received from the server and synthesized from the text input through a voice synthesis service in real time, and decode and play the buffered voice data when the received voice data is buffered to a playable length; and
and the server is configured to perform voice synthesis on the text input received from the terminal equipment through a voice synthesis service to obtain voice data, and transmit the obtained voice data to the terminal equipment in real time.
10. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 5.
CN202011355823.8A 2020-11-26 2020-11-26 Voice preview method and device and electronic equipment Pending CN112562638A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011355823.8A CN112562638A (en) 2020-11-26 2020-11-26 Voice preview method and device and electronic equipment
PCT/CN2021/115113 WO2022110943A1 (en) 2020-11-26 2021-08-27 Speech preview method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011355823.8A CN112562638A (en) 2020-11-26 2020-11-26 Voice preview method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112562638A true CN112562638A (en) 2021-03-26

Family

ID=75046232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011355823.8A Pending CN112562638A (en) 2020-11-26 2020-11-26 Voice preview method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN112562638A (en)
WO (1) WO2022110943A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066474A (en) * 2021-03-31 2021-07-02 北京猎户星空科技有限公司 Voice broadcasting method, device, equipment and medium
WO2022110943A1 (en) * 2020-11-26 2022-06-02 北京达佳互联信息技术有限公司 Speech preview method and apparatus
CN116110410A (en) * 2023-04-14 2023-05-12 北京算能科技有限公司 Audio data processing method, device, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20060075320A (en) * 2004-12-28 2006-07-04 주식회사 팬택앤큐리텔 A mobile communication terminal and it's control method for proffering text information through voice composition
CN101014996A (en) * 2003-09-17 2007-08-08 摩托罗拉公司 Speech synthesis
US20100153114A1 (en) * 2008-12-12 2010-06-17 Microsoft Corporation Audio output of a document from mobile device
CN102169689A (en) * 2011-03-25 2011-08-31 深圳Tcl新技术有限公司 Realization method of speech synthesis plug-in
CN103916716A (en) * 2013-01-08 2014-07-09 北京信威通信技术股份有限公司 Code rate smoothing method of video live transmission through wireless network
CN104810015A (en) * 2015-03-24 2015-07-29 深圳市创世达实业有限公司 Voice converting device, voice synthesis method and sound box using voice converting device and supporting text storage
CN111105778A (en) * 2018-10-29 2020-05-05 阿里巴巴集团控股有限公司 Speech synthesis method, speech synthesis device, computing equipment and storage medium
CN111105779A (en) * 2020-01-02 2020-05-05 标贝(北京)科技有限公司 Text playing method and device for mobile client
CN111179973A (en) * 2020-01-06 2020-05-19 苏州思必驰信息科技有限公司 Speech synthesis quality evaluation method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2865846A1 (en) * 2004-02-02 2005-08-05 France Telecom VOICE SYNTHESIS SYSTEM
WO2006093912A2 (en) * 2005-03-01 2006-09-08 Oddcast, Inc. System and method for a real time client server text to speech interface
CN106531167B (en) * 2016-11-18 2019-12-10 北京云知声信息技术有限公司 Voice information processing method and device
CN108810608A (en) * 2018-05-24 2018-11-13 烽火通信科技股份有限公司 Live streaming based on IPTV and time-shift playing state switching system and method
CN112562638A (en) * 2020-11-26 2021-03-26 北京达佳互联信息技术有限公司 Voice preview method and device and electronic equipment


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022110943A1 (en) * 2020-11-26 2022-06-02 北京达佳互联信息技术有限公司 Speech preview method and apparatus
CN113066474A (en) * 2021-03-31 2021-07-02 北京猎户星空科技有限公司 Voice broadcasting method, device, equipment and medium
CN116110410A (en) * 2023-04-14 2023-05-12 北京算能科技有限公司 Audio data processing method, device, electronic equipment and storage medium
CN116110410B (en) * 2023-04-14 2023-06-30 北京算能科技有限公司 Audio data processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022110943A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
WO2022110943A1 (en) Speech preview method and apparatus
WO2021073315A1 (en) Video file generation method and device, terminal and storage medium
TWI485619B (en) Automatic audio configuration based on an audio output device
KR101320756B1 (en) Playback apparatus, playback method, and program
CN112261416A (en) Cloud-based video processing method and device, storage medium and electronic equipment
US10019222B2 (en) Method for obtaining music data, earphone and music player
WO2020155964A1 (en) Audio/video switching method and apparatus, and computer device and readable storage medium
WO2009064067A1 (en) System and method for producing importance rate-based rich media, and server applied to the same
JP5737357B2 (en) Music playback apparatus and music playback program
WO2008001961A1 (en) Mobile animation message service method and system and terminal
KR102506604B1 (en) Method for providing speech video and computing device for executing the method
CN104038772B (en) Generate the method and device of ring signal file
US20130178963A1 (en) Audio system with adaptable equalization
TWI223231B (en) Digital audio with parameters for real-time time scaling
JP4595828B2 (en) Audio playback device
CN113192526B (en) Audio processing method and audio processing device
KR102376348B1 (en) Method, system, and computer readable record medium to implement seamless switching mode between channels in multiple live transmission environment
KR20220034022A (en) Method, system, and computer readable record medium to implement fast switching mode between channels in multiple live transmission environment
CN113225574A (en) Signal processing method and device
KR100991264B1 (en) Method and system for playing and sharing music sources on an electric device
CN112927666A (en) Audio processing method and device, electronic equipment and storage medium
WO2005104125A1 (en) Record reproducing device, simultaneous record reproduction control method and simultaneous record reproduction control program
CN104079948B (en) Generate the method and device of ring signal file
CN115631758B (en) Audio signal processing method, apparatus, device and storage medium
JP7333731B2 (en) Method and apparatus for providing call quality information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination