WO2022110943A1 - Speech preview method and apparatus - Google Patents

Speech preview method and apparatus

Info

Publication number
WO2022110943A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
text input
speech synthesis
server
data
Prior art date
Application number
PCT/CN2021/115113
Other languages
French (fr)
Chinese (zh)
Inventor
陈翔宇
张晨
Original Assignee
北京达佳互联信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2022110943A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • the present disclosure relates to the field of signal processing, and in particular, to a method and device for previewing speech.
  • In the related art, the scenario in which a terminal device (such as a mobile phone) uses a speech synthesis (TTS) service is as follows: text is input, a voice file is generated by calling a network-based or offline software development kit (SDK), the voice file is then returned to the terminal device over the network or as a file, and the terminal device thereafter plays the voice file by calling it.
  • In a video editing scenario, the user uses the terminal device to edit the captured video and edit text, then uses the edited text to generate audio files with different timbres, which are synthesized into the video to complete the dubbing process.
  • the present disclosure provides a voice preview method and device, and the technical solutions of the present disclosure are as follows:
  • According to a first aspect of the embodiments of the present disclosure, a method for previewing speech is provided, including: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  • the method further includes: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input through a speech synthesis service.
  • In some embodiments, based on the speech settings related to the speech synthesis service being changed, the server is notified to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, the server is notified to re-synthesize the text input using the speech synthesis service with the changed speech settings, and the speech data received from the server that is synthesized from the text input by the speech synthesis service according to the changed speech settings is re-cached.
  • the method further includes: performing speech synthesis on the text input through a speech synthesis service to obtain speech data.
  • In some embodiments, based on the speech settings related to the speech synthesis service being changed, the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings is stopped, the text input is re-synthesized using the speech synthesis service with the changed speech settings, and the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings is re-cached.
  • According to a second aspect of the embodiments of the present disclosure, an apparatus for speech preview is provided, comprising: a receiving unit configured to receive text input; a buffer unit configured to buffer speech data synthesized from the text input through a speech synthesis service; a decoding unit configured to decode the buffered speech data when the synthesized speech data has been buffered to a playable length; and a playing unit configured to play the decoded audio data.
  • the apparatus for voice preview further includes: a sending unit configured to send the text input to the server, wherein the receiving unit is further configured to receive from the server voice data synthesized from the text input through a speech synthesis service.
  • In some embodiments, based on the speech settings related to the speech synthesis service being changed, the sending unit is configured to notify the server to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings and to notify the server to re-synthesize the text input using the speech synthesis service with the changed speech settings; and the buffer unit is configured to re-cache the speech data, received by the receiving unit from the server, that is synthesized from the text input by the speech synthesis service according to the changed speech settings.
  • the apparatus for speech previewing further includes: a speech synthesis unit configured to perform speech synthesis on the text input through a speech synthesis service to obtain speech data.
  • In some embodiments, based on the speech settings related to the speech synthesis service being changed, the speech synthesis unit is configured to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings and to re-synthesize the text input according to the changed speech settings; and the buffer unit is configured to re-cache the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings.
  • According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including a processor and a memory for storing executable instructions, wherein the processor is configured to execute the executable instructions to implement the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  • In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input by a speech synthesis service.
  • In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: based on the speech settings related to the speech synthesis service being changed, notifying the server to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, notifying the server to re-synthesize the text input using the speech synthesis service with the changed speech settings, and re-caching the speech data received from the server that is synthesized from the text input by the speech synthesis service according to the changed speech settings.
  • the processor is configured to execute executable instructions to perform the steps of: performing speech synthesis on textual input through a speech synthesis service to obtain speech data.
  • In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: based on the speech settings related to the speech synthesis service being changed, stopping the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, re-synthesizing the text input using the speech synthesis service with the changed speech settings, and re-caching the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings.
  • According to a fourth aspect of the embodiments of the present disclosure, a speech processing system is provided, comprising: a terminal device configured to receive text input, send the text input to a server, buffer in real time the speech data received from the server that is synthesized from the text input through a TTS service, and, when the received speech data has been buffered to a playable length, decode and play the buffered speech data; and a server configured to perform speech synthesis on the text input received from the terminal device through the TTS service to obtain speech data, and to transmit the obtained speech data to the terminal device in real time.
  • According to a fifth aspect of the embodiments of the present disclosure, a computer-readable storage medium storing instructions is provided; when the instructions are executed by at least one processor, the at least one processor is caused to perform the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  • According to a sixth aspect of the embodiments of the present disclosure, a computer program product is provided; instructions in the computer program product are executed by at least one processor in an electronic device to perform the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  • In the embodiments of the present disclosure, the delay is greatly reduced through real-time transmission, and a real-time preview is started once very little speech data has been buffered, with almost no waiting.
  • At the same time, when the timbre is switched, the local terminal device itself stops performing TTS on, or informs the server not to perform TTS on, the remaining text that has not yet been synthesized, which reduces the cost of the TTS service, thereby improving the speed of TTS preview for users in video editing and optimizing the user experience.
  • FIG. 1 is an exemplary system architecture diagram to which exemplary embodiments of the present disclosure may be applied;
  • FIG. 2 is a flowchart of a method for voice preview according to an exemplary embodiment of the present disclosure
  • FIG. 3 is a detailed flowchart of a method for voice preview when a TTS service is executed on the server side according to an exemplary embodiment of the present disclosure
  • FIG. 4 is a detailed flowchart of a method for voice preview when a TTS service is locally executed on a terminal device according to an exemplary embodiment of the present disclosure
  • FIG. 5 is a block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure.
  • FIG. 6 is a detailed block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure.
  • Fig. 7 is a detailed block diagram of a voice preview apparatus according to another exemplary embodiment of the present disclosure.
  • FIG. 8 is a block diagram of a system for voice preview according to an exemplary embodiment of the present disclosure.
  • FIG. 9 is a block diagram of an electronic device according to an embodiment of the present disclosure.
  • In view of this, the present disclosure proposes to cache, in real time, the audio data synthesized through the TTS service after the text input is received, and to start real-time preview of the audio file once very little data has been cached, almost without waiting; at the same time, when the timbre is switched, speech synthesis is no longer performed on the text that has not yet been synthesized, which reduces the cost of the TTS service.
  • FIG. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages (eg, TTS service request, audio and video data upload request, audio and video data acquisition request) and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as video recording applications, audio playback applications, video and audio editing applications, instant messaging tools, email clients, social platform software, and the like.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • When the terminal devices 101, 102 and 103 are hardware, they can be various electronic devices with a display screen that are capable of playing, recording and editing audio and video, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.
  • When the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above, and can be implemented as multiple pieces of software or software modules (for example, used to provide distributed services), or as a single piece of software or a single software module. There is no specific limitation here.
  • the terminal devices 101, 102, 103 may be installed with image capture devices (eg, cameras) to capture video data.
  • the terminal devices 101, 102, 103 may also be installed with components for converting electrical signals into sounds (such as speakers) to play sounds, and may also be installed with devices for converting analog audio signals into digital audio signals (for example, microphone) to capture sound.
  • the terminal devices 101 , 102 , and 103 can use the image collection device installed on them to collect video data, and use the audio collection device installed on them to collect audio data. Moreover, the terminal devices 101, 102, 103 can perform TTS service on the received text input to synthesize audio data from the text input, and can play the audio data by using an audio processing component installed on it that supports audio playback.
  • the server 105 may be a server that provides various services, such as a background server that provides support for audio and video recording applications, audio and video editing applications, etc. installed on the terminal devices 101 , 102 , and 103 .
  • The background server can perform analysis, the TTS service, storage and other processing on the uploaded text input and other data, and can also receive TTS service requests sent by the terminal devices 101, 102, and 103 and feed the synthesized audio data back to the terminal devices 101, 102, and 103.
  • the server may be hardware or software.
  • the server can be implemented as a distributed server cluster composed of multiple servers, or can be implemented as a single server.
  • When the server is software, it can be implemented as multiple pieces of software or software modules (for example, for providing distributed services), or as a single piece of software or a single software module. There is no specific limitation here.
  • the voice preview methods provided in the embodiments of the present application are generally performed by the terminal devices 101 , 102 , and 103 , and correspondingly, the voice preview devices are generally set in the terminal devices 101 , 102 and 103 .
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. According to implementation requirements, there may be any number of terminal devices, networks and servers, which are not limited in the present disclosure.
  • Fig. 2 is a flow chart of a method for voice preview according to an exemplary embodiment of the present disclosure.
  • In step S201, a terminal device receives text input.
  • the text input may be text input or edited by the user in any way on the terminal device.
  • For example, the user can directly enter text into the audio and video editing software on the terminal device, or directly load text files received from other devices or downloaded from the server into the audio and video editing software.
  • the present disclosure does not specifically limit the manner of receiving text input, and any manner that can perform text input is included within the scope of the present disclosure.
  • In step S202, the terminal device buffers the speech data synthesized from the text input through the speech synthesis service (TTS service).
  • For example, the buffering can be performed in real time.
  • In step S203, when the synthesized speech data has been buffered to a playable length, the terminal device decodes and plays the buffered speech data.
  • the TTS service may be a local TTS service invoked by the terminal device through audio and video editing software, or may be a server-side TTS service invoked by the audio and video editing software.
  • In other words, the TTS service may be performed locally on the terminal device to synthesize the text input into speech data, or the terminal device may request the server side to perform the TTS service to synthesize the uploaded text input into speech data.
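  • To make the flow of steps S201 to S203 concrete, the following Python sketch models the terminal-device side: a producer thread stands in for the TTS service (local or remote), encoded audio chunks are buffered as they arrive, and decoding and playback start as soon as an assumed playable-length threshold is reached rather than after the whole synthesis finishes. The names `synthesize_stream`, `decode` and `play`, and the 8 KB threshold, are illustrative placeholders and are not taken from the patent.

```python
import queue
import threading

PLAYABLE_BYTES = 8 * 1024  # assumed "playable length" threshold; the patent does not fix a value


def preview(text, synthesize_stream, decode, play):
    """Sketch of steps S201-S203: buffer streamed TTS audio and start playback early."""
    chunks = queue.Queue()

    def producer():
        # S202: the TTS service (local SDK or server stream) yields encoded audio chunks,
        # assumed here to be aligned on independently decodable frame boundaries.
        for chunk in synthesize_stream(text):
            chunks.put(chunk)
        chunks.put(None)  # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()

    buffered = bytearray()
    started = False
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        buffered.extend(chunk)
        # S203: once a playable length is buffered, decode and play what is available,
        # while the producer keeps filling the buffer in the background.
        if started or len(buffered) >= PLAYABLE_BYTES:
            started = True
            play(decode(bytes(buffered)))
            buffered.clear()
    if buffered:  # flush whatever remains when the stream ends
        play(decode(bytes(buffered)))
```

  • In a real terminal device, `decode` and `play` would be the platform audio decoder and audio output; they are left as injected callables here so that the buffering logic stands on its own.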
  • FIG. 3 is a detailed flowchart of a method for voice preview when a TTS service is executed on the server side according to an exemplary embodiment of the present disclosure.
  • In step S301, the terminal device receives text input. Since this step is the same as the operation of step S201, it is not described repeatedly here.
  • the terminal device sends the text input to the server.
  • For example, the terminal device can invoke a background TTS service located on the server side; that is, the terminal device can invoke the TTS service on the server side and upload the text input to the server, and the server then synthesizes the text input received from the terminal device into speech data through the TTS service.
  • the terminal device receives the voice data synthesized from the text input through the TTS service from the server. For example, reception may be in real time. In other words, the terminal device can receive the voice data synthesized from the text input through the TTS service from the server in real time through streaming.
  • Based on receiving the text input uploaded from the terminal device, the server performs speech synthesis on the text input through the TTS service according to the request of the terminal device, thereby generating audio encoded data in a specific format, for example, audio encoded data in the Advanced Audio Coding (AAC) format.
  • The request of the terminal device may include various voice settings related to the TTS service, for example, the timbre, pitch, speech rate, tone, background music, etc. that the user expects of the synthesized speech.
  • The present disclosure does not specifically limit the format of the generated audio encoded data; any audio format whose data can subsequently be streamed, whose audio frames can be decoded independently, and whose audio frame duration is relatively small is included within the scope of this disclosure.
  • the generated audio encoded data in a specific format is packaged by the server and transmitted to the terminal device in real time by streaming.
  • For example, the generated audio encoded data can be packaged by the server in the Audio Data Transport Stream (ADTS) format and transmitted back to the terminal device in real time in a streaming manner.
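  • The patent does not spell out the packaging step, but ADTS is the standard self-synchronizing framing for AAC: each frame is preceded by a 7-byte header carrying the profile, sample rate, channel count and frame length, so every frame can be located and decoded independently, which is exactly what streaming preview needs. A minimal sketch of building that header for an AAC-LC frame (assuming no CRC) is shown below; it is an illustration of the standard format, not the server's actual implementation.

```python
ADTS_SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                     24000, 22050, 16000, 12000, 11025, 8000]


def adts_packet(aac_frame: bytes, sample_rate: int = 44100, channels: int = 2) -> bytes:
    """Prepend a 7-byte ADTS header (AAC-LC, no CRC) to one raw AAC frame."""
    sr_index = ADTS_SAMPLE_RATES.index(sample_rate)
    profile = 1                       # AAC-LC (audio object type 2, stored as type - 1)
    full_len = len(aac_frame) + 7     # the frame-length field counts the header itself
    hdr = bytearray(7)
    hdr[0] = 0xFF                                              # syncword, high 8 bits
    hdr[1] = 0xF1                                              # syncword low bits, MPEG-4, layer 0, no CRC
    hdr[2] = (profile << 6) | (sr_index << 2) | ((channels >> 2) & 0x01)
    hdr[3] = ((channels & 0x03) << 6) | ((full_len >> 11) & 0x03)
    hdr[4] = (full_len >> 3) & 0xFF
    hdr[5] = ((full_len & 0x07) << 5) | 0x1F                   # buffer fullness = 0x7FF (VBR)
    hdr[6] = 0xFC                                              # fullness low bits, one raw data block
    return bytes(hdr) + aac_frame
```

  • The server can then send each packaged frame over the network as soon as it is synthesized, and the terminal can start decoding at any frame boundary.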
  • In step S304, the terminal device buffers, in real time, the speech data synthesized from the text input through the TTS service.
  • the terminal device buffers the received voice data in a buffer in real time.
  • For example, the terminal device may buffer, in the buffer in real time, the ADTS-format data packets received from the server.
  • In step S305, the terminal device decodes and plays the buffered speech data when the synthesized speech data has been buffered to a playable length.
  • Based on the buffered speech data reaching a playable length (that is, a certain size), the terminal device can decode the currently buffered speech data and play the decoded PCM data.
  • In addition, while performing the above decoding, the terminal device keeps buffering, in real time, the newly synthesized speech data received from the server in the streaming manner, so as to ensure the continuity of decoding and playback.
  • The above-described process of receiving the synthesized speech data from the server in real time through streaming transmission, buffering the received speech data in real time, and starting decoding and playing after a certain amount of data has been buffered can realize real-time preview and greatly reduce the preview delay.
  • the user may be dissatisfied with the voice synthesized according to the current voice settings related to the TTS service, and then change the voice settings related to the TTS service during the voice preview.
  • For example, the user may change the timbre, pitch, speech rate, tone or background music in the voice settings. In the related art, the speech data corresponding to the entire text input is often synthesized through the TTS service according to the user-set voice settings related to the TTS service and received from the server, and the speech preview cannot be performed until all of the speech data has been synthesized; therefore, if the user does not like the speech synthesized according to the current voice settings, resources are wasted and the user experience is affected.
  • In this case, based on the voice settings related to the TTS service being changed, the terminal device may notify the server to stop the speech synthesis operation on the text input that uses the TTS service with the pre-change voice settings, notify the server to re-synthesize the text input using the TTS service with the changed voice settings, and locally re-cache the speech data that is synthesized from the text input through the TTS service according to the changed voice settings and received from the server in real time in a streaming manner.
  • Specifically, the terminal device can delete the previously cached speech data synthesized according to the previous voice settings, and, when the re-cached speech data synthesized according to the changed voice settings reaches the playable length, decode and preview the buffered speech data. In this way, when the voice settings related to the TTS service are changed, the server does not perform speech synthesis on the remaining text input whose speech synthesis has not yet been completed, thereby reducing the cost of the TTS service.
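  • A minimal client-side sketch of this cancellation path is given below. The `server` object and its `stop_synthesis`/`start_synthesis` calls are hypothetical stand-ins for whatever RPC or HTTP interface the TTS backend exposes; the point is the order of operations: stop the old stream, drop the old cache, then restart with the new settings and wait again for a playable length.

```python
import threading


class PreviewSession:
    """Illustrative client-side preview session; names are not taken from the patent."""

    def __init__(self, server, text, settings):
        self.server = server              # assumed stub exposing stop_synthesis/start_synthesis
        self.text = text
        self.settings = settings
        self.buffer = bytearray()         # cached ADTS data for the current settings
        self.lock = threading.Lock()

    def on_chunk(self, chunk, settings):
        # Ignore late packets that were synthesized with settings no longer in use.
        with self.lock:
            if settings == self.settings:
                self.buffer.extend(chunk)

    def change_voice_settings(self, new_settings):
        with self.lock:
            self.server.stop_synthesis(self.text, self.settings)     # hypothetical call
            self.buffer.clear()                                      # drop data for the old timbre
            self.settings = new_settings
            self.server.start_synthesis(self.text, new_settings)     # hypothetical call
```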
  • Furthermore, the terminal device can convert the buffered ADTS data corresponding to the current voice settings into audio data in another format such as M4A, generate audio metadata at the same time, including, for example, the duration, sample rate, number of channels and other information, and finally provide them to the audio and video editing SDK for use.
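  • The patent only says that metadata such as duration, sample rate and channel count is generated alongside the format conversion; one way to obtain those values, sketched below under the assumption that the cache holds well-formed ADTS data, is to walk the buffered frames, since every ADTS header carries the sample rate and channel configuration and each AAC-LC frame represents 1024 samples.

```python
ADTS_SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                     24000, 22050, 16000, 12000, 11025, 8000]


def adts_metadata(data: bytes) -> dict:
    """Derive duration, sample rate and channel count from buffered ADTS data."""
    offset, frames, sample_rate, channels = 0, 0, None, None
    while offset + 7 <= len(data):
        if data[offset] != 0xFF or (data[offset + 1] & 0xF0) != 0xF0:
            break                                        # lost sync; stop at the last complete frame
        sample_rate = ADTS_SAMPLE_RATES[(data[offset + 2] >> 2) & 0x0F]
        channels = ((data[offset + 2] & 0x01) << 2) | (data[offset + 3] >> 6)
        frame_len = (((data[offset + 3] & 0x03) << 11)
                     | (data[offset + 4] << 3)
                     | (data[offset + 5] >> 5))
        offset += frame_len
        frames += 1                                      # each AAC-LC frame holds 1024 samples
    duration = frames * 1024 / sample_rate if sample_rate else 0.0
    return {"duration_s": duration, "sample_rate": sample_rate, "channels": channels}
```

  • The conversion itself can be a pure remux, since M4A is simply AAC in an MP4 container; for example, `ffmpeg -i cached.aac -c:a copy cached.m4a` repackages the frames without re-encoding. Which tool the patent's implementation uses is not specified.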
  • the above describes the detailed process of the method in which the terminal device invokes the TTS service on the server side to implement the speech preview of speech synthesis.
  • the following describes the detailed process of the method for the terminal device to invoke the local TTS service to implement the speech preview of speech synthesis.
  • FIG. 4 is a detailed flowchart of a method for voice preview when the TTS service is executed locally on a terminal device according to an exemplary embodiment of the present disclosure.
  • In step S401, text input is received. Since this step is the same as the operation of step S201, it is not described repeatedly here.
  • the terminal device performs speech synthesis on the text input locally through the TTS service to obtain speech data.
  • For example, the terminal device can invoke a background TTS service that is located locally on the terminal device; that is, the terminal device can invoke the local TTS service, for example, through the API of the audio and video editing software, transmit the text input to the local TTS service, and then use the TTS service to synthesize the text input into speech data according to the user's voice settings related to the TTS service.
  • the terminal device may synthesize the text input into audio encoded data in a specific format through a local TTS service, for example, generate audio encoded data in AAC format.
  • the voice settings related to the TTS service may include, for example, the user's desired timbre of the synthesized voice, background music, pitch, speech rate, tone, and the like.
  • the terminal device buffers the speech data synthesized from the text input through the TTS service.
  • For example, the buffering can be performed in real time.
  • the terminal device can buffer the voice data synthesized by calling the local TTS service in real time.
  • For example, the terminal device can buffer, in the buffer, the AAC-format audio encoded data synthesized through the local TTS service.
  • In step S404, the terminal device decodes and plays the buffered speech data when the synthesized speech data has been buffered to a playable length.
  • Based on the buffered speech data reaching a playable length (that is, a certain size), the terminal device can decode the currently buffered speech data and play the decoded PCM data.
  • In addition, while performing the above decoding, the terminal device keeps caching, in real time, the speech data newly synthesized through the local TTS service, so as to ensure the continuity of audio decoding and playback.
  • The above-described process, in which the terminal device directly implements speech synthesis by calling the local TTS service, buffers the synthesized speech data, and starts decoding and playing after a certain amount of data has been buffered, realizes real-time preview and greatly reduces the preview delay.
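  • The patent does not say how a local TTS engine is made to deliver data incrementally; one common approach, sketched below purely as an assumption, is to split the text into sentences and synthesize them one by one, so that the buffer starts filling after the first sentence instead of after the whole passage. `local_tts` stands for whatever SDK call returns encoded audio for a piece of text.

```python
import re


def synthesize_incrementally(text, local_tts, voice_settings):
    """Yield encoded audio chunk by chunk so preview can start after the first sentence."""
    # Naive split on Chinese/Western end punctuation; a real tokenizer would do better.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])\s*", text) if s]
    for sentence in sentences:
        yield local_tts(sentence, voice_settings)   # assumed SDK call returning bytes
```

  • A generator of this form plugs directly into the `synthesize_stream` parameter of the earlier preview sketch.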
  • In this case, based on the voice settings related to the TTS service being changed, the terminal device may stop the speech synthesis operation on the text input that uses the local TTS service with the pre-change voice settings, re-synthesize the text input using the TTS service with the changed voice settings, and re-cache the speech data synthesized from the text input through the TTS service according to the changed voice settings.
  • Specifically, the terminal device can delete the previously cached speech data synthesized according to the previous voice settings, and, when the re-cached speech data synthesized according to the changed voice settings reaches the playable length, decode and preview the buffered speech data.
  • In this way, when the voice settings related to the TTS service are changed, the terminal device does not perform speech synthesis on the remaining text input whose speech synthesis has not yet been completed, thereby reducing the cost of the TTS service.
  • Furthermore, the terminal device can convert the buffered speech data corresponding to the current voice settings into audio data in another format such as M4A, generate audio metadata such as the duration, sample rate, number of channels and other information, and finally provide them to the audio and video editing SDK for use.
  • FIG. 5 is a block diagram of a voice preview apparatus 500 according to an exemplary embodiment of the present disclosure.
  • the apparatus 500 for audio preview may include a receiving unit 510 , a buffering unit 520 , a decoding unit 530 and a playing unit 540 .
  • the receiving unit 510 can receive text input.
  • The detailed description is not repeated here.
  • the buffering unit 520 may buffer the speech data synthesized based on the text input through the speech synthesis service (TTS service).
  • The TTS service may be a local TTS service invoked by the apparatus 500, or may be an invoked server-side TTS service; in other words, the TTS service may be performed locally in the apparatus 500 to synthesize the text input into speech data, or the apparatus 500 may request the server side to perform the TTS service to synthesize the uploaded text input into speech data.
  • FIG. 6 is a detailed block diagram of a voice preview apparatus 600 according to an exemplary embodiment of the present disclosure.
  • the apparatus 600 for audio preview may include a receiving unit 610 , a buffering unit 620 , a decoding unit 630 , a playing unit 640 and a sending unit 650 .
  • The receiving unit 610 can receive text input; that is, the receiving unit 610 can perform the operation corresponding to step S201 described above with reference to FIG. 2, and therefore a detailed description is omitted here.
  • the sending unit 650 may send the text input to the server.
  • For example, the apparatus 600 can invoke a background TTS service located on the server side; that is, the apparatus 600 can invoke the TTS service on the server side and upload the text input to the server, and the server then synthesizes the text input received from the apparatus 600 into speech data through the TTS service.
  • the receiving unit 610 may receive voice data synthesized from the text input through the TTS service from the server. For example, reception may be in real time. In other words, the device 600 may receive voice data synthesized from the text input through the TTS service from the server in real time through streaming.
  • Based on receiving the text input uploaded from the apparatus 600, the server performs speech synthesis on the text input through the TTS service according to the request of the apparatus 600, thereby generating audio encoded data in a specific format, such as audio encoded data in the AAC format. Thereafter, the generated audio encoded data in the specific format is packaged by the server and transmitted to the apparatus 600 in real time in a streaming manner.
  • The buffering unit 620 may buffer, in real time, the speech data synthesized from the text input through the TTS service. In some embodiments, based on the apparatus 600 receiving the speech data synthesized by the TTS service from the server, the buffering unit 620 buffers the received speech data. For example, the buffering unit 620 may buffer, in real time, the ADTS-format data packets received from the server.
  • the decoding unit 630 may decode the buffered speech data when the synthesized speech data is buffered to a playable length. In some embodiments, based on the voice data buffered by the buffering unit 620 reaching a playable length (ie a certain size), the decoding unit 630 can decode the currently buffered voice data, and the playing unit 640 can play the decoded PCM data. In addition, when performing the above decoding, the buffering unit 620 always buffers the voice data received from the server in real time according to the streaming transmission mode, so as to ensure the continuity of the decoding operation of the decoding unit 630 and the playing operation of the playing unit 640 .
  • Based on the voice settings related to the TTS service being changed, the sending unit 650 may notify the server to stop the speech synthesis operation on the text input that uses the TTS service with the pre-change voice settings and notify the server to re-synthesize the text input using the TTS service with the changed voice settings, and the buffering unit 620 may re-cache the speech data, received by the receiving unit 610 from the server, that is synthesized from the text input through the TTS service according to the changed voice settings.
  • Specifically, the buffering unit 620 may delete the previously cached speech data synthesized according to the previous voice settings, and, based on the re-cached speech data synthesized according to the changed voice settings reaching the playable length, the decoding unit 630 decodes the buffered speech data and the playing unit 640 previews and plays the decoded speech data.
  • Furthermore, the apparatus 600 may convert the buffered ADTS data corresponding to the current voice settings into audio data in another format such as M4A, generate audio metadata at the same time, including, for example, the duration, sample rate, number of channels and other information, and finally provide them to the audio and video editing SDK for use.
  • For any relevant details of the operations performed by the units in FIG. 6, reference can be made to the corresponding descriptions given with respect to FIG. 3, which are not repeated here.
  • FIG. 7 is a detailed block diagram of a voice preview apparatus 700 according to another exemplary embodiment of the present disclosure.
  • the apparatus 700 for previewing a voice may include a receiving unit 710 , a buffering unit 720 , a decoding unit 730 , a playing unit 740 and a voice synthesis unit 750 .
  • the receiving unit 710 may receive text input, that is, the receiving unit 710 may perform operations corresponding to the step S210 described above with reference to FIG. 2 , and thus will not be repeated here.
  • the speech synthesis unit 750 may perform speech synthesis on the text input locally through the TTS service to obtain speech data.
  • For example, the speech synthesis unit 750 may invoke a background TTS service that is located locally on the apparatus 700; that is, the speech synthesis unit 750 can call the local TTS service.
  • For example, the speech synthesis unit 750 can call the local TTS service through the API of the audio and video editing software, transmit the text input to the local TTS service, and then use the TTS service to synthesize the text input into speech data according to the user's voice settings related to the TTS service.
  • the buffering unit 720 may buffer the speech data synthesized from the text input through the TTS service in real time. In some embodiments, the buffering unit 720 may buffer the speech data synthesized through the invoked local TTS service in real time.
  • the decoding unit 730 may decode the buffered speech data when the synthesized speech data is buffered by the buffering unit 720 to a playable length, and the playback unit 740 performs preview playback.
  • In some embodiments, based on the speech data buffered by the buffering unit 720 reaching a playable length (that is, a certain size), the decoding unit 730 can decode the currently buffered speech data, and the playing unit 740 can play the decoded PCM data.
  • the buffering unit 720 always buffers the voice data just synthesized through the local TTS service in real time, so as to ensure the continuity of the decoding operation of the decoding unit 730 and the playing operation of the playing unit 740 .
  • Based on the voice settings related to the TTS service being changed, the speech synthesis unit 750 may stop the speech synthesis operation on the text input that uses the TTS service with the pre-change voice settings and re-synthesize the text input according to the changed voice settings. Specifically, the buffering unit 720 may delete the previously buffered speech data synthesized according to the previous voice settings, and, when the re-cached speech data synthesized according to the changed voice settings in the buffering unit 720 reaches the playable length, the decoding unit 730 decodes the buffered speech data and the playing unit 740 performs preview playback.
  • Furthermore, the apparatus 700 may convert the buffered speech data corresponding to the current voice settings into audio data in another format such as M4A, generate audio metadata such as the duration, sample rate, number of channels and other information, and finally provide them to the audio and video editing SDK for use.
  • FIG. 8 is a block diagram of a system 800 for voice preview according to an exemplary embodiment of the present disclosure.
  • the system 800 includes a terminal device 810 and a server 820 .
  • the terminal device 810 may be any one of 101 , 102 and 103 shown in FIG. 1
  • the server 820 may be the server 105 shown in FIG. 1 .
  • The terminal device 810 can receive the text input, send the text input to the server 820, buffer in real time the speech data received from the server 820 that is synthesized from the text input through the TTS service, and, when the received speech data has been buffered to a playable length, decode and play the buffered speech data.
  • The server 820 may perform speech synthesis, through the TTS service, on the text input received from the terminal device 810 to obtain speech data, and transmit the obtained speech data to the terminal device 810 in real time. Since the operations performed by the terminal device 810 and the server 820 are the same as the corresponding operations described above with reference to FIG. 6, they are not repeated here.
  • Referring to FIG. 9, the electronic device 900 may include at least one memory 910 and at least one processor 920; the at least one memory 910 stores a set of computer-executable instructions which, when executed by the at least one processor 920, perform the method of speech preview according to an embodiment of the present disclosure.
  • the electronic device may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or other devices capable of executing the above set of instructions.
  • the electronic device does not have to be a single electronic device, but can also be any set of devices or circuits that can individually or jointly execute the above-mentioned instructions (or instruction sets).
  • the electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (eg, via wireless transmission).
  • a processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor.
  • processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
  • the processor may execute instructions or code stored in memory, which may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
  • the memory may be integrated with the processor, eg, RAM or flash memory arranged within an integrated circuit microprocessor or the like. Additionally, the memory may comprise a separate device such as an external disk drive, a storage array, or any other storage device that may be used by a database system.
  • the memory and the processor may be operatively coupled, or may communicate with each other, eg, through I/O ports, network connections, etc., to enable the processor to read files stored in the memory.
  • the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device can be connected to each other via a bus and/or a network.
  • According to an embodiment of the present disclosure, a computer-readable storage medium storing instructions may also be provided, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the method of voice preview according to an exemplary embodiment of the present disclosure.
  • Examples of the computer-readable storage medium herein include: Read Only Memory (ROM), Programmable Read Only Memory (PROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, Hard Disk Drive (HDD), Solid State Drive (SSD), and card memory (such as a multimedia card or a Secure Digital (SD) card), etc.
  • the computer programs in the above-mentioned computer-readable storage media can run in an environment deployed in computer equipment such as clients, hosts, proxy devices, servers, etc., and, in one example, the computer programs and any associated data, data files and data structures are distributed over networked computer systems so that the computer programs and any associated data, data files and data structures are stored, accessed and executed in a distributed fashion by one or more processors or computers.
  • a computer program product can also be provided, wherein instructions in the computer program product can be executed by at least one processor in an electronic device to implement the method for voice preview according to an exemplary embodiment of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech preview method, comprising: receiving a text input (S201); performing real-time buffering on speech data synthesized, by means of a speech synthesis service, from the text input (S202); and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data (S203). Also provided are a speech preview apparatus (600), comprising a receiving unit (610), a buffer unit (620), a decoding unit (630), a playing unit (640) and a sending unit (650), as well as an electronic device, a speech processing system, a computer-readable storage medium and a computer program product. The delay is greatly reduced by means of real-time transmission, and real-time preview is started with hardly any waiting time once very little speech data has been buffered. When timbre switching is performed, the local terminal device itself no longer performs TTS on the remaining text that has not been subjected to TTS, or notifies the server to stop doing so, such that the cost of TTS services is reduced, thereby improving the speed of TTS preview for a user in video editing and optimizing the user experience.

Description

Method and device for voice preview
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure claims priority to Chinese Patent Application No. 202011355823.8, filed in China on November 26, 2020, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of signal processing, and in particular, to a method and device for previewing speech.
BACKGROUND
In the related art, the scenario in which a terminal device (such as a mobile phone) uses a speech synthesis (TTS) service is as follows: text is input, a voice file is generated by calling a network-based or offline software development kit (SDK), the voice file is then returned to the terminal device over the network or as a file, and the terminal device thereafter plays the voice file by calling it. In a video editing scenario, the user uses the terminal device to edit the captured video and edit text, then uses the edited text to generate audio files with different timbres, which are synthesized into the video to complete the dubbing process.
SUMMARY
The present disclosure provides a voice preview method and device. The technical solutions of the present disclosure are as follows:
According to a first aspect of the embodiments of the present disclosure, a method for previewing speech is provided, including: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
In some embodiments, the method further includes: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input through a speech synthesis service.
In some embodiments, based on the speech settings related to the speech synthesis service being changed, the server is notified to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, the server is notified to re-synthesize the text input using the speech synthesis service with the changed speech settings, and the speech data received from the server that is synthesized from the text input by the speech synthesis service according to the changed speech settings is re-cached.
In some embodiments, the method further includes: performing speech synthesis on the text input through a speech synthesis service to obtain speech data.
In some embodiments, based on the speech settings related to the speech synthesis service being changed, the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings is stopped, the text input is re-synthesized using the speech synthesis service with the changed speech settings, and the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings is re-cached.
According to a second aspect of the embodiments of the present disclosure, an apparatus for speech preview is provided, comprising: a receiving unit configured to receive text input; a buffer unit configured to buffer speech data synthesized from the text input through a speech synthesis service; a decoding unit configured to decode the buffered speech data when the synthesized speech data has been buffered to a playable length; and a playing unit configured to play the decoded audio data.
In some embodiments, the apparatus for speech preview further includes: a sending unit configured to send the text input to the server, wherein the receiving unit is further configured to receive, from the server, speech data synthesized from the text input through a speech synthesis service.
In some embodiments, based on the speech settings related to the speech synthesis service being changed, the sending unit is configured to notify the server to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings and to notify the server to re-synthesize the text input using the speech synthesis service with the changed speech settings, and the buffer unit is configured to re-cache the speech data, received by the receiving unit from the server, that is synthesized from the text input by the speech synthesis service according to the changed speech settings.
In some embodiments, the apparatus for speech preview further includes: a speech synthesis unit configured to perform speech synthesis on the text input through a speech synthesis service to obtain speech data.
In some embodiments, based on the speech settings related to the speech synthesis service being changed, the speech synthesis unit is configured to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings and to re-synthesize the text input according to the changed speech settings, and the buffer unit is configured to re-cache the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including a processor and a memory for storing executable instructions, wherein the processor is configured to execute the executable instructions to implement the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input by a speech synthesis service.
In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: based on the speech settings related to the speech synthesis service being changed, notifying the server to stop the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, notifying the server to re-synthesize the text input using the speech synthesis service with the changed speech settings, and re-caching the speech data received from the server that is synthesized from the text input by the speech synthesis service according to the changed speech settings.
In some embodiments, the processor is configured to execute the executable instructions to implement the following step: performing speech synthesis on the text input through a speech synthesis service to obtain speech data.
In some embodiments, the processor is configured to execute the executable instructions to implement the following steps: based on the speech settings related to the speech synthesis service being changed, stopping the speech synthesis operation on the text input that uses the speech synthesis service with the pre-change speech settings, re-synthesizing the text input using the speech synthesis service with the changed speech settings, and re-caching the speech data synthesized from the text input by the speech synthesis service according to the changed speech settings.
According to a fourth aspect of the embodiments of the present disclosure, a speech processing system is provided, comprising: a terminal device configured to receive text input, send the text input to a server, buffer in real time the speech data received from the server that is synthesized from the text input through a TTS service, and, when the received speech data has been buffered to a playable length, decode and play the buffered speech data; and a server configured to perform speech synthesis on the text input received from the terminal device through the TTS service to obtain speech data, and to transmit the obtained speech data to the terminal device in real time.
According to a fifth aspect of the embodiments of the present disclosure, a computer-readable storage medium storing instructions is provided; when the instructions are executed by at least one processor, the at least one processor is caused to perform the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
According to a sixth aspect of the embodiments of the present disclosure, a computer program product is provided; instructions in the computer program product are executed by at least one processor in an electronic device to perform the following steps: receiving text input; buffering speech data synthesized from the text input through a speech synthesis service; and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
In the embodiments of the present disclosure, the delay is greatly reduced through real-time transmission, and a real-time preview is started once very little speech data has been buffered, with almost no waiting. At the same time, when the timbre is switched, the local terminal device itself stops performing TTS on, or informs the server not to perform TTS on, the remaining text that has not yet been synthesized, which reduces the cost of the TTS service, thereby improving the speed of TTS preview for users in video editing and optimizing the user experience.
It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure; they do not unduly limit the present disclosure.
FIG. 1 is an exemplary system architecture diagram to which exemplary embodiments of the present disclosure may be applied;
FIG. 2 is a flowchart of a method for voice preview according to an exemplary embodiment of the present disclosure;
FIG. 3 is a detailed flowchart of a method for voice preview when the TTS service is executed on the server side according to an exemplary embodiment of the present disclosure;
FIG. 4 is a detailed flowchart of a method for voice preview when the TTS service is executed locally on a terminal device according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure;
FIG. 6 is a detailed block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure;
FIG. 7 is a detailed block diagram of an apparatus for voice preview according to another exemplary embodiment of the present disclosure;
FIG. 8 is a block diagram of a system for voice preview according to an exemplary embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are clearly and completely described below with reference to the accompanying drawings.
It should be understood that the data so used may be interchanged under appropriate circumstances so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following examples do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.
It should be noted here that, in the present disclosure, "at least one of several items" covers three parallel cases: "any one of the several items", "a combination of any multiple of the several items", and "all of the several items". For example, "including at least one of A and B" includes the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" means the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
As mentioned in the background of the present disclosure, in the related art, when text input is used to generate an audio file and the audio file is previewed in a video editing scenario, the text file input by the user needs to go through text input and/or uploading, speech synthesis, speech encoding, and downloading. That is to say, the traditional file usage flow is executed serially, and preview of the downloaded audio file cannot start until all of the above steps have been completed. In view of this, the present disclosure proposes to cache, in real time, the audio data synthesized through the TTS service after the text input is received, and to start real-time preview of the audio file once very little data has been cached, almost without waiting; at the same time, when the timbre is switched, speech synthesis is no longer performed on the text that has not yet been synthesized, which reduces the cost of the TTS service.
图1示出了本公开的示例性实施例可以应用于其中的示例性系统架构100。FIG. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is a medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables. A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages (for example, TTS service requests, audio/video data upload requests, and audio/video data acquisition requests). Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as video recording applications, audio playback applications, video and audio editing applications, instant messaging tools, email clients, and social platform software. The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices that have a display screen and are capable of playing, recording, and editing audio and video, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The terminal devices 101, 102, and 103 may be equipped with an image capture device (for example, a camera) to capture video data. In addition, the terminal devices 101, 102, and 103 may be equipped with a component for converting electrical signals into sound (for example, a speaker) to play sound, and may also be equipped with a device for converting analog audio signals into digital audio signals (for example, a microphone) to capture sound.
The terminal devices 101, 102, and 103 may use the image capture device installed on them to capture video data, and use the audio capture device installed on them to capture audio data. Moreover, the terminal devices 101, 102, and 103 may perform the TTS service on received text input to synthesize audio data from the text input, and may play the audio data using an audio processing component installed on them that supports audio playback.
The server 105 may be a server that provides various services, for example, a background server that supports the audio/video recording applications, audio/video editing applications, and the like installed on the terminal devices 101, 102, and 103. The background server may parse, perform the TTS service on, and store uploaded data such as text input, and may also receive TTS service requests sent by the terminal devices 101, 102, and 103 and feed the synthesized audio data back to the terminal devices 101, 102, and 103.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the speech preview method provided in the embodiments of the present application is generally performed by the terminal devices 101, 102, and 103, and correspondingly, the speech preview apparatus is generally provided in the terminal devices 101, 102, and 103.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation requirements, and the present disclosure imposes no limitation in this regard.
FIG. 2 is a flowchart of a speech preview method according to an exemplary embodiment of the present disclosure.
In step S201, a terminal device (for example, the terminal device 101) receives text input. Here, the text input may be text entered or edited by the user on the terminal device in any manner. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the terminal device, the user may enter text directly in the audio/video editing software, or a text file received from another device or downloaded from a server may be loaded directly into the audio/video editing software. The above are only examples; the present disclosure does not specifically limit the manner of receiving text input, and any manner in which text can be input falls within the scope of the present disclosure.
In step S202, the terminal device buffers the speech data synthesized based on the text input through the speech synthesis service (TTS service). For example, the buffering may be performed in real time. Then, in step S203, when the synthesized speech data has been buffered to a playable length, the terminal device decodes and plays the buffered speech data. In some embodiments, the TTS service may be a local TTS service invoked by the terminal device through the audio/video editing software, or a server-side TTS service invoked through the audio/video editing software. In other words, the TTS service may be performed locally on the terminal device to synthesize the text input into speech data, or the terminal device may request the server side to perform the TTS service to synthesize the uploaded text input into speech data. The detailed processes of the speech preview method when the TTS service is executed on the server side and when it is executed locally on the terminal device are described below with reference to FIG. 3 and FIG. 4, respectively.
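As an illustration of the buffer-then-play behavior of steps S202 and S203, the following is a minimal sketch in Python. The names synthesize_stream and decode_and_play are placeholders introduced only for this example, and the byte threshold is an assumption; none of these are defined by the disclosure.

```python
import queue
import threading

PLAYABLE_BYTES = 32_000  # assumed threshold, roughly one second of encoded audio

def preview(synthesize_stream, decode_and_play):
    """Buffer synthesized chunks and start playback once enough data is cached."""
    buffer = queue.Queue()
    buffered = 0
    ready = threading.Event()

    def receiver():
        nonlocal buffered
        for chunk in synthesize_stream():        # chunks arrive as they are synthesized
            buffer.put(chunk)
            buffered += len(chunk)
            if buffered >= PLAYABLE_BYTES:
                ready.set()                      # playable length reached
        buffer.put(None)                         # end-of-stream marker
        ready.set()                              # also start if the whole text was short

    threading.Thread(target=receiver, daemon=True).start()
    ready.wait()                                 # wait only until the first playable slice is cached
    while (chunk := buffer.get()) is not None:
        decode_and_play(chunk)                   # receiver keeps filling the buffer meanwhile
```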
FIG. 3 is a detailed flowchart of the speech preview method when the TTS service is executed on the server side, according to an exemplary embodiment of the present disclosure.
In step S301, the terminal device receives text input. Since this step is the same as the operation of step S201, it is not described again here.
In step S302, the terminal device sends the text input to the server. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the terminal device, the terminal device may invoke a background TTS service located on the server side; that is, the terminal device may invoke the server-side TTS service and upload the text input to the server, and the server then synthesizes the text input received from the terminal device into speech data through the TTS service.
In step S303, the terminal device receives, from the server, the speech data synthesized from the text input through the TTS service. For example, the reception may be in real time. In other words, the terminal device may receive, by streaming, the speech data synthesized from the text input through the TTS service from the server in real time. In some embodiments, upon receiving the text input uploaded from the terminal device, the server performs speech synthesis on the text input through the TTS service according to the terminal device's request, thereby generating encoded audio data in a specific format, for example, encoded audio data in the Advanced Audio Coding (AAC) format. The terminal device's request may include various speech settings related to the TTS service, for example, the timbre, pitch, speech rate, intonation, background music, and so on that the user expects for the synthesized speech. In addition, the present disclosure does not specifically limit the format of the generated encoded audio data: any audio format in which the generated data can subsequently be streamed, each audio frame can be decoded independently, and the audio frame duration is small falls within the scope of the present disclosure. Thereafter, the generated encoded audio data in the specific format is packaged by the server and transmitted to the terminal device in real time by streaming; for example, the generated encoded audio data may be packaged by the server in the AAC stream transport format (ADTS) and returned to the terminal device in real time by streaming.
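A server-side counterpart to step S303 could emit each encoded packet as soon as it is produced instead of waiting for the whole file. The sketch below is illustrative only: tts_encode_aac and wrap_adts are hypothetical placeholders for the synthesis and packaging steps, and the per-sentence splitting is an assumption made for the example, not a step stated by the disclosure.

```python
import re

def split_into_sentences(text):
    # naive split on common sentence terminators; an assumption, not part of the claimed method
    return [s for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]

def stream_synthesized_audio(text, voice_settings, tts_encode_aac, wrap_adts):
    """Yield ADTS-wrapped AAC packets sentence by sentence so the client can start playback early."""
    for sentence in split_into_sentences(text):
        for aac_frame in tts_encode_aac(sentence, voice_settings):
            yield wrap_adts(aac_frame)   # each packet can be decoded independently
```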
In step S304, the terminal device buffers, in real time, the speech data synthesized from the text input through the TTS service. In some embodiments, as the terminal device receives the speech data synthesized through the TTS service from the server by streaming, the terminal device buffers the received speech data in a buffer in real time; for example, the terminal device may buffer the ADTS-format data packets received from the server in the buffer in real time.
In step S305, when the synthesized speech data has been buffered to a playable length, the terminal device decodes and plays the buffered speech data. In some embodiments, once the speech data buffered by the terminal device reaches a playable length (that is, a certain size), for example, once the terminal device has buffered speech data corresponding to one second of playback, the terminal device may decode the currently buffered speech data and play the decoded PCM data. In addition, during the above decoding, the terminal device continues to receive the just-synthesized speech data from the server by streaming and to place it in the buffer, to ensure the continuity of decoding and playback. The process described above, in which the synthesized speech data is received from the server in real time by streaming, the received speech data is buffered in real time, and decoding and playback start after a certain amount of data has been buffered, enables real-time preview and greatly reduces the preview delay.
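To make the playable-length threshold concrete: an AAC-LC frame typically carries 1024 PCM samples per channel, so at a 44.1 kHz sampling rate roughly 44 buffered frames cover about one second of audio. A small helper illustrating this arithmetic; the sampling rate and target duration are example values, not values fixed by the disclosure.

```python
import math

def frames_for_duration(target_seconds=1.0, sample_rate=44_100, samples_per_frame=1024):
    """How many AAC frames must be buffered before decoding and playback start."""
    return math.ceil(target_seconds * sample_rate / samples_per_frame)

# frames_for_duration() == 44: about one second of audio at 44.1 kHz with 1024 samples per frame
```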
In addition, in practical applications, the user may be dissatisfied with the speech synthesized according to the current speech settings related to the TTS service and may change those settings during the speech preview; for example, the user may change the timbre, pitch, speech rate, intonation, or background music of the speech. In the related art, the speech preview can start only after the TTS service has finished synthesizing the speech data corresponding to the entire text input according to the user's speech settings and all of the synthesized speech data has been received from the server. Therefore, if the user does not like the speech synthesized according to the current speech settings, resources are inevitably wasted and the user experience suffers.
In the present disclosure, however, when the speech settings related to the TTS service are changed, the terminal device may notify the server to stop the speech synthesis operation that the TTS service is performing on the text input according to the speech settings before the change, notify the server to re-perform speech synthesis on the text input using the TTS service according to the changed speech settings, and locally re-buffer the speech data synthesized from the text input through the TTS service according to the changed speech settings and received from the server in real time by streaming. When the speech settings related to the TTS service are changed, the terminal device may delete the previously buffered speech data synthesized according to the previous speech settings, and when the re-buffered speech data synthesized according to the changed speech settings reaches a playable length, decode and preview-play the buffered speech data. In this way, when the speech settings related to the TTS service are changed, the server no longer performs speech synthesis on the remaining text input whose synthesis has not yet been completed, which reduces the cost of the TTS service.
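One way to realize the stop-and-restart behavior described above is a generation counter: when the speech settings change, the client abandons the old stream, clears the cache, and issues a new request. The following is a single-threaded sketch under that assumption; request_stream is a hypothetical client call, not an interface defined by the disclosure.

```python
class PreviewSession:
    """Keep only the speech data that matches the most recent voice settings."""

    def __init__(self, request_stream):
        self.request_stream = request_stream   # hypothetical call: (text, settings) -> packet iterator
        self.generation = 0                    # bumped every time the voice settings change
        self.cache = []

    def change_settings(self, text, new_settings):
        self.generation += 1                   # older streams see a stale generation and stop
        self.cache.clear()                     # previously cached data is discarded
        self._receive(text, new_settings, self.generation)

    def _receive(self, text, settings, generation):
        for packet in self.request_stream(text, settings):
            if generation != self.generation:
                break                          # settings changed again: abandon this stream
            self.cache.append(packet)
```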
In addition, once the user has finished editing according to the finally determined speech settings, the terminal device may convert the buffered ADTS data corresponding to the current speech settings into audio data in another format, such as M4A, and at the same time generate audio metadata, for example including information such as duration, sampling rate, and channel count, which is ultimately provided to the audio/video editing SDK for use.
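The metadata step can be illustrated with a small helper that derives the duration from the decoded sample count. The field names and default values below are assumptions made for the example; the actual interface expected by the editing SDK is not specified here.

```python
def build_audio_metadata(total_samples, sample_rate=44_100, channels=2):
    """Describe the finalized audio for the editing SDK: duration, sample rate, channel count."""
    return {
        "duration_seconds": total_samples / sample_rate,
        "sample_rate": sample_rate,
        "channels": channels,
    }

# build_audio_metadata(441_000) -> {'duration_seconds': 10.0, 'sample_rate': 44100, 'channels': 2}
```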
The detailed process of the speech preview method in which the terminal device invokes the server-side TTS service to implement speech synthesis has been described above; the detailed process of the speech preview method in which the terminal device invokes a local TTS service to implement speech synthesis is described below.
FIG. 4 is a detailed flowchart of the speech preview method when the TTS service is executed locally on the terminal device, according to an exemplary embodiment of the present disclosure.
As shown in FIG. 4, in step S401, text input is received. Since this step is the same as the operation of step S201, it is not described again here.
In step S402, the terminal device performs speech synthesis on the text input locally through the TTS service to obtain speech data. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the terminal device, the terminal device may invoke a background TTS service located locally on the terminal device; that is, the terminal device may invoke the local TTS service, for example through the API of the audio/video editing software, pass the text input to the local TTS service, and then synthesize the text input into speech data through the TTS service according to the user's speech settings related to the TTS service. The terminal device may synthesize the text input into encoded audio data in a specific format through the local TTS service, for example, encoded audio data in the AAC format. The speech settings related to the TTS service may include, for example, the timbre, background music, pitch, speech rate, intonation, and so on that the user expects for the synthesized speech.
In step S403, the terminal device buffers the speech data synthesized from the text input through the TTS service. For example, the buffering may be performed in real time. In some embodiments, the terminal device may buffer, in real time, the speech data synthesized through the invoked local TTS service; for example, the terminal device may buffer the AAC-format encoded audio data synthesized through the local TTS service in the buffer.
In step S404, when the synthesized speech data has been buffered to a playable length, the terminal device decodes and plays the buffered speech data. In some embodiments, once the speech data buffered by the terminal device reaches a playable length (that is, a certain size), for example, once the terminal device has buffered speech data corresponding to one second of playback, the terminal device may decode the currently buffered speech data and play the decoded PCM data. In addition, during the above decoding, the terminal device continues to buffer, in real time, the speech data just synthesized through the local TTS service, to ensure the continuity of audio decoding and playback. As described above, the terminal device implements speech synthesis directly by invoking the local TTS service, buffers the synthesized speech data, and starts decoding and playback after a certain amount of data has been buffered, thereby enabling real-time preview and greatly reducing the preview delay.
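If the buffer-then-play helper sketched after the description of steps S202 and S203 is reused, the local path amounts to plugging the on-device engine in as the producer. A short usage illustration, again with hypothetical names (engine, text, voice_settings, decode_and_play) that are not defined by the disclosure:

```python
# `preview`, `decode_and_play` and the engine object are the hypothetical names from the earlier sketch
preview(
    synthesize_stream=lambda: engine.synthesize(text, voice_settings),  # on-device TTS as the producer
    decode_and_play=decode_and_play,                                    # same decode/playback path as before
)
```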
In addition, similarly to the case in which the TTS service is executed on the server side as described with reference to FIG. 3, in the present disclosure, when the speech settings related to the TTS service are changed, the terminal device may stop the speech synthesis operation that the local TTS service is performing on the text input according to the speech settings before the change, re-perform speech synthesis on the text input using the TTS service according to the changed speech settings, and re-buffer the speech data synthesized from the text input through the TTS service according to the changed speech settings. When the speech settings related to the TTS service are changed, the terminal device may delete the previously buffered speech data synthesized according to the previous speech settings, and when the re-buffered speech data synthesized according to the changed speech settings reaches a playable length, decode and preview-play the buffered speech data. In this way, when the speech settings related to the TTS service are changed, the terminal device no longer performs speech synthesis on the remaining text input whose synthesis has not yet been completed, which reduces the cost of the TTS service.
In addition, once the user has finished editing according to the finally determined speech settings, the terminal device may convert the buffered speech data corresponding to the current speech settings into audio data in another format, such as M4A, and at the same time generate audio metadata, for example including information such as duration, sampling rate, and channel count, which is ultimately provided to the audio/video editing SDK for use.
FIG. 5 is a block diagram of a speech preview apparatus 500 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 5, the speech preview apparatus 500 may include a receiving unit 510, a buffering unit 520, a decoding unit 530, and a playing unit 540. The receiving unit 510 may receive text input. Since the content related to text input has been described in detail above with reference to FIG. 2, it is not repeated here.
The buffering unit 520 may buffer the speech data synthesized based on the text input through the speech synthesis service (TTS service). For example, the buffering may be performed in real time. Then, the decoding unit 530 may decode the buffered speech data when the synthesized speech data has been buffered to a playable length, and the playing unit 540 may play the decoded audio data. In some embodiments, the TTS service may be a local TTS service invoked by the apparatus 500, or an invoked server-side TTS service; in other words, the TTS service may be performed locally in the apparatus 500 to synthesize the text input into speech data, or the apparatus 500 may request the server side to perform the TTS service to synthesize the uploaded text input into speech data. The detailed processes when the TTS service is executed on the server side and when it is executed locally in the apparatus 500 are described below with reference to FIG. 6 and FIG. 7, respectively.
FIG. 6 is a detailed block diagram of a speech preview apparatus 600 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 6, the speech preview apparatus 600 may include a receiving unit 610, a buffering unit 620, a decoding unit 630, a playing unit 640, and a sending unit 650.
The receiving unit 610 may receive text input; that is, the receiving unit 610 may perform the operation corresponding to step S201 described above with reference to FIG. 2, so it is not described again here.
The sending unit 650 may send the text input to the server. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the apparatus 600, the apparatus 600 may invoke a background TTS service located on the server side; that is, the apparatus 600 may invoke the server-side TTS service and upload the text input to the server, and the server then synthesizes the text input received from the apparatus 600 into speech data through the TTS service.
In addition, the receiving unit 610 may receive, from the server, the speech data synthesized from the text input through the TTS service. For example, the reception may be in real time; in other words, the apparatus 600 may receive, by streaming, the speech data synthesized from the text input through the TTS service from the server in real time. In some embodiments, upon receiving the text input uploaded from the apparatus 600, the server performs speech synthesis on the text input through the TTS service according to the request of the apparatus 600, thereby generating encoded audio data in a specific format, for example, encoded audio data in the AAC format. Thereafter, the generated encoded audio data in the specific format is packaged by the server and transmitted to the apparatus 600 in real time by streaming.
The buffering unit 620 may buffer, in real time, the speech data synthesized from the text input through the TTS service. In some embodiments, as the apparatus 600 receives the speech data synthesized through the TTS service from the server, the buffering unit 620 buffers the received speech data; for example, the buffering unit 620 may buffer the ADTS-format data packets received from the server in real time.
The decoding unit 630 may decode the buffered speech data when the synthesized speech data has been buffered to a playable length. In some embodiments, once the speech data buffered by the buffering unit 620 reaches a playable length (that is, a certain size), the decoding unit 630 may decode the currently buffered speech data, and the playing unit 640 plays the decoded PCM data. In addition, during the above decoding, the buffering unit 620 continues to buffer, in real time, the speech data received from the server by streaming, to ensure the continuity of the decoding operation of the decoding unit 630 and the playback operation of the playing unit 640.
In addition, when the speech settings related to the TTS service are changed, the sending unit 650 may notify the server to stop the speech synthesis operation that the TTS service is performing on the text input according to the speech settings before the change and notify the server to re-perform speech synthesis on the text input using the TTS service according to the changed speech settings, and the buffering unit 620 may re-buffer the speech data, received by the receiving unit 610 from the server, that is synthesized based on the text input through the TTS service according to the changed speech settings. When the speech settings related to the TTS service are changed, the buffering unit 620 may delete the previously buffered speech data synthesized according to the previous speech settings, and when the speech data re-buffered by the buffering unit 620 and synthesized according to the changed speech settings reaches a playable length, the decoding unit 630 decodes the buffered speech data and the playing unit 640 preview-plays the decoded speech data.
In addition, once the user has finished editing according to the finally determined speech settings, the apparatus 600 may convert the buffered ADTS data corresponding to the current speech settings into audio data in another format, such as M4A, and at the same time generate audio metadata, for example including information such as duration, sampling rate, and channel count, which is ultimately provided to the audio/video editing SDK for use.
In addition, since the speech preview method shown in FIG. 3 can be performed by the speech preview apparatus 600 shown in FIG. 6, any relevant details of the operations performed by the units in FIG. 6 can be found in the corresponding description of FIG. 3 and are not repeated here.
FIG. 7 is a detailed block diagram of a speech preview apparatus 700 according to another exemplary embodiment of the present disclosure.
Referring to FIG. 7, the speech preview apparatus 700 may include a receiving unit 710, a buffering unit 720, a decoding unit 730, a playing unit 740, and a speech synthesis unit 750.
The receiving unit 710 may receive text input; that is, the receiving unit 710 may perform the operation corresponding to step S201 described above with reference to FIG. 2, so it is not described again here.
The speech synthesis unit 750 may perform speech synthesis on the text input locally through the TTS service to obtain speech data. In some embodiments, after the user starts the edit-preview function in the audio/video editing software on the apparatus 700, the speech synthesis unit 750 may invoke a background TTS service located locally on the apparatus 700; that is, the speech synthesis unit 750 may invoke the local TTS service, for example through the API of the audio/video editing software, pass the text input to the local TTS service, and then synthesize the text input into speech data through the TTS service according to the user's speech settings related to the TTS service.
The buffering unit 720 may buffer, in real time, the speech data synthesized from the text input through the TTS service. In some embodiments, the buffering unit 720 may buffer, in real time, the speech data synthesized through the invoked local TTS service.
The decoding unit 730 may decode the buffered speech data when the synthesized speech data buffered by the buffering unit 720 reaches a playable length, and the playing unit 740 performs preview playback. In some embodiments, once the speech data buffered by the buffering unit 720 reaches a playable length (that is, a certain size), the decoding unit 730 may decode the currently buffered speech data, and the playing unit 740 may play the decoded PCM data. In addition, during the above decoding, the buffering unit 720 continues to buffer, in real time, the speech data just synthesized through the local TTS service, to ensure the continuity of the decoding operation of the decoding unit 730 and the playback operation of the playing unit 740.
In addition, similarly to the case in which the TTS service is executed on the server side as described with reference to FIG. 6, in the present disclosure, when the speech settings related to the TTS service are changed, the speech synthesis unit 750 may stop the speech synthesis operation that the TTS service is performing on the text input according to the speech settings before the change and re-perform speech synthesis on the text input according to the changed speech settings, and the buffering unit 720 may re-buffer the speech data synthesized based on the text input through the TTS service according to the changed speech settings. When the speech settings related to the TTS service are changed, the buffering unit 720 may delete the previously buffered speech data synthesized according to the previous speech settings, and when the speech data re-buffered by the buffering unit 720 and synthesized according to the changed speech settings reaches a playable length, the decoding unit 730 decodes the buffered speech data and the playing unit 740 performs preview playback.
In addition, once the user has finished editing according to the finally determined speech settings, the apparatus 700 may convert the buffered speech data corresponding to the current speech settings into audio data in another format, such as M4A, and at the same time generate audio metadata, for example including information such as duration, sampling rate, and channel count, which is ultimately provided to the audio/video editing SDK for use.
In addition, since the speech preview method shown in FIG. 4 can be performed by the speech preview apparatus 700 shown in FIG. 7, any relevant details of the operations performed by the units in FIG. 7 can be found in the corresponding description of FIG. 4 and are not repeated here.
FIG. 8 is a block diagram of a speech preview system 800 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 8, the system 800 includes a terminal device 810 and a server 820. The terminal device 810 may be any one of the terminal devices 101, 102, and 103 shown in FIG. 1, and the server 820 may be the server 105 shown in FIG. 1.
The terminal device 810 may receive text input, send the text input to the server 820, buffer in real time the speech data synthesized from the text input through the TTS service and received from the server 820, and, when the received speech data has been buffered to a playable length, decode and play the buffered speech data.
Since the operations performed by the terminal device 810 are the same as the operations of the apparatus 600 described above with reference to FIG. 6, any relevant details of the operations performed by the terminal device 810 in FIG. 8 can be found in the corresponding description of the apparatus 600 in FIG. 6 and are not repeated here.
The server 820 may perform speech synthesis on the text input received from the terminal device 810 through the TTS service to obtain speech data, and transmit the obtained speech data to the terminal device 810 in real time. Since the operations performed by the server 820 are the same as the operations of the server described above with reference to FIG. 6, they are not repeated here.
FIG. 9 is a block diagram of an electronic device 900 according to an embodiment of the present disclosure. The electronic device 900 may include at least one memory 910 and at least one processor 920. The at least one memory 910 stores a set of computer-executable instructions which, when executed by the at least one processor, cause the speech preview method according to the embodiments of the present disclosure to be performed.
By way of example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. Here, the electronic device need not be a single electronic device; it may be any assembly of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (for example, via wireless transmission).
In the electronic device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor may execute instructions or code stored in the memory, and the memory may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
The memory may be integrated with the processor, for example, with RAM or flash memory arranged within an integrated circuit microprocessor or the like. In addition, the memory may include a standalone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled, or may communicate with each other, for example, through I/O ports or network connections, so that the processor can read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, or a touch input device). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, a computer-readable storage medium storing instructions may also be provided, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the speech preview method according to the exemplary embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card storage (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store, in a non-transitory manner, a computer program and any associated data, data files, and data structures and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium may run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed over networked computer systems, so that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner by one or more processors or computers.
According to an embodiment of the present disclosure, a computer program product may also be provided, and the instructions in the computer program product may be executed by at least one processor in an electronic device to implement the speech preview method according to the exemplary embodiments of the present disclosure.
All the embodiments of the present disclosure may be implemented independently or in combination with other embodiments, and all are regarded as falling within the scope of protection claimed by the present disclosure.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field not disclosed by the present disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (18)

  1. A speech preview method, comprising:
    receiving text input;
    buffering speech data synthesized based on the text input through a speech synthesis service;
    when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  2. The method according to claim 1, further comprising:
    sending the text input to a server;
    receiving, from the server, speech data synthesized from the text input through the speech synthesis service.
  3. The method according to claim 2, wherein, based on speech settings related to the speech synthesis service being changed, the server is notified to stop the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change, the server is notified to re-perform speech synthesis on the text input using the speech synthesis service according to the changed speech settings, and the speech data received from the server and synthesized based on the text input by the speech synthesis service according to the changed speech settings is re-buffered.
  4. The method according to claim 1, further comprising:
    performing speech synthesis on the text input through the speech synthesis service to obtain speech data.
  5. The method according to claim 4, wherein, based on speech settings related to the speech synthesis service being changed, the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change is stopped, speech synthesis is re-performed on the text input using the speech synthesis service according to the changed speech settings, and the speech data synthesized based on the text input by the speech synthesis service according to the changed speech settings is re-buffered.
  6. A speech preview apparatus, comprising:
    a receiving unit configured to receive text input;
    a buffering unit configured to buffer speech data synthesized based on the text input through a speech synthesis service;
    a decoding unit configured to decode the buffered speech data when the synthesized speech data has been buffered to a playable length;
    a playing unit configured to play the decoded audio data.
  7. The apparatus according to claim 6, further comprising:
    a sending unit configured to send the text input to a server,
    wherein the receiving unit is further configured to receive, from the server, speech data synthesized from the text input through the speech synthesis service.
  8. The apparatus according to claim 7, wherein, based on speech settings related to the speech synthesis service being changed, the sending unit is configured to notify the server to stop the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change and to notify the server to re-perform speech synthesis on the text input using the speech synthesis service according to the changed speech settings, and the buffering unit is configured to re-buffer the speech data, received by the receiving unit from the server, that is synthesized based on the text input by the speech synthesis service according to the changed speech settings.
  9. The apparatus according to claim 6, further comprising:
    a speech synthesis unit configured to perform speech synthesis on the text input through the speech synthesis service to obtain speech data.
  10. The apparatus according to claim 9, wherein, based on speech settings related to the speech synthesis service being changed, the speech synthesis unit is configured to stop the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change and to re-perform speech synthesis on the text input according to the changed speech settings, and the buffering unit is configured to re-buffer the speech data synthesized based on the text input by the speech synthesis service according to the changed speech settings.
  11. An electronic device, comprising:
    a processor; and a memory for storing executable instructions, wherein the processor is configured to execute the executable instructions to implement the following steps:
    receiving text input;
    buffering speech data synthesized based on the text input through a speech synthesis service;
    when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data.
  12. The electronic device according to claim 11, wherein the processor is configured to execute the executable instructions to implement the following steps:
    sending the text input to a server;
    receiving, from the server, speech data synthesized from the text input through the speech synthesis service.
  13. The electronic device according to claim 12, wherein the processor is configured to execute the executable instructions to implement the following steps:
    based on speech settings related to the speech synthesis service being changed, notifying the server to stop the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change, notifying the server to re-perform speech synthesis on the text input using the speech synthesis service according to the changed speech settings, and re-buffering the speech data received from the server and synthesized based on the text input by the speech synthesis service according to the changed speech settings.
  14. The electronic device according to claim 11, wherein the processor is configured to execute the executable instructions to implement the following steps:
    performing speech synthesis on the text input through the speech synthesis service to obtain speech data.
  15. The electronic device according to claim 14, wherein the processor is configured to execute the executable instructions to implement the following steps:
    based on speech settings related to the speech synthesis service being changed, stopping the speech synthesis operation performed on the text input by the speech synthesis service according to the speech settings before the change, re-performing speech synthesis on the text input using the speech synthesis service according to the changed speech settings, and re-buffering the speech data synthesized based on the text input by the speech synthesis service according to the changed speech settings.
  16. 一种语音处理系统,包括:A speech processing system, comprising:
    终端设备,被配置为接收文字输入,将所述文字输入发送到服务器,对从服务器接收的通过语音合成服务从所述文字输入合成的语音数据进行实时缓存,并在接收的语音数据被缓存达到可播放长度的情况下,对缓存的语音数据进行解码并播放;以及The terminal device is configured to receive text input, send the text input to the server, and perform real-time buffering on the voice data synthesized from the text input through the speech synthesis service received from the server, and when the received voice data is cached to reach In the case of playable length, decode and play the buffered voice data; and
    服务器,被配置为通过语音合成服务对从终端设备接收的所述文字输入进行语音合成以获得语音数据,并将获得的语音数据实时传输给终端设备。The server is configured to perform speech synthesis on the text input received from the terminal device through a speech synthesis service to obtain voice data, and transmit the obtained voice data to the terminal device in real time.
  17. 一种存储指令的计算机可读存储介质,基于所述指令被至少一个处理器运行,促使所述至少一个处理器,执行以下步骤:A computer-readable storage medium storing instructions that, based on the instructions being executed by at least one processor, cause the at least one processor to perform the following steps:
    接收文字输入;receive text input;
    对通过语音合成服务,基于所述文字输入合成的语音数据进行缓存;Cache the speech data synthesized based on the text input through the speech synthesis service;
    在合成的语音数据被缓存达到可播放长度的情况下,对缓存的语音数据进行解码并播放。When the synthesized voice data is buffered to a playable length, the buffered voice data is decoded and played.
  18. 一种计算机程序产品,所述计算机程序产品中的指令被电子设备中的至少一个处理器运行,执行以下步骤:A computer program product, wherein instructions in the computer program product are executed by at least one processor in an electronic device to perform the following steps:
    接收文字输入;receive text input;
    对通过语音合成服务,基于所述文字输入合成的语音数据进行缓存;Cache the speech data synthesized based on the text input through the speech synthesis service;
    在合成的语音数据被缓存达到可播放长度的情况下,对缓存的语音数据进 行解码并播放。When the synthesized voice data is buffered to a playable length, the buffered voice data is decoded and played.
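The client-side behaviour recited in claims 11 to 13 and 16 — send the text input to a server, cache the streamed synthesis result, begin playback once a playable length has been buffered, and restart synthesis when a speech setting such as the timbre is changed — can be sketched as follows. This is a minimal illustration only, assuming a chunked transport and a byte-count playability threshold; the names PreviewSession, request_synthesis, decode_and_play, and PLAYABLE_BYTES are hypothetical and do not correspond to any API in this disclosure or in a real TTS SDK.

# Minimal sketch of the buffered speech-preview flow (claims 11-13, 16).
# All identifiers below are illustrative assumptions, not APIs from the patent.

from dataclasses import dataclass, field
from typing import Iterator

PLAYABLE_BYTES = 32_000  # assumed threshold for "playable length"


@dataclass
class PreviewSession:
    text: str
    voice: str                      # a speech setting, e.g. a timbre name
    buffer: bytearray = field(default_factory=bytearray)
    playing: bool = False

    def on_chunk(self, chunk: bytes) -> None:
        """Cache each synthesized chunk as it arrives from the server."""
        self.buffer.extend(chunk)
        if not self.playing and len(self.buffer) >= PLAYABLE_BYTES:
            self.playing = True
            # Start playback while the remaining chunks are still streaming in.
            decode_and_play(bytes(self.buffer))

    def change_voice(self, new_voice: str) -> Iterator[bytes]:
        """Claim 13/15 behaviour: drop the old synthesis result, clear the
        cache, and re-request synthesis with the changed setting."""
        self.voice = new_voice
        self.buffer.clear()
        self.playing = False
        return request_synthesis(self.text, self.voice)


def request_synthesis(text: str, voice: str) -> Iterator[bytes]:
    """Stand-in for the network call to the TTS server; it simply yields
    placeholder audio chunks so the control flow can be exercised."""
    for _ in range(5):
        yield bytes(10_000)  # placeholder PCM data


def decode_and_play(data: bytes) -> None:
    """Stand-in for the platform decoder and audio player."""
    print(f"playing {len(data)} buffered bytes")


if __name__ == "__main__":
    session = PreviewSession(text="hello world", voice="narrator")
    for chunk in request_synthesis(session.text, session.voice):
        session.on_chunk(chunk)
    # The user picks a different timbre before export: restart and re-buffer.
    for chunk in session.change_voice("child"):
        session.on_chunk(chunk)

The playability threshold is the main tuning point in such a sketch: a larger buffer delays the first audible preview but reduces the risk of mid-playback stalls while the remaining chunks are still arriving from the server.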
PCT/CN2021/115113 2020-11-26 2021-08-27 Speech preview method and apparatus WO2022110943A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011355823.8A CN112562638A (en) 2020-11-26 2020-11-26 Voice preview method and device and electronic equipment
CN202011355823.8 2020-11-26

Publications (1)

Publication Number Publication Date
WO2022110943A1 (en)

Family

ID=75046232

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115113 WO2022110943A1 (en) 2020-11-26 2021-08-27 Speech preview method and apparatus

Country Status (2)

Country Link
CN (1) CN112562638A (en)
WO (1) WO2022110943A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562638A (en) * 2020-11-26 2021-03-26 北京达佳互联信息技术有限公司 Voice preview method and device and electronic equipment
CN113066474A (en) * 2021-03-31 2021-07-02 北京猎户星空科技有限公司 Voice broadcasting method, device, equipment and medium
CN116110410B (en) * 2023-04-14 2023-06-30 北京算能科技有限公司 Audio data processing method, device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050187773A1 (en) * 2004-02-02 2005-08-25 France Telecom Voice synthesis system
US20060200355A1 (en) * 2005-03-01 2006-09-07 Gil Sideman System and method for a real time client server text to speech interface
CN102169689A (en) * 2011-03-25 2011-08-31 深圳Tcl新技术有限公司 Realization method of speech synthesis plug-in
CN106531167A (en) * 2016-11-18 2017-03-22 北京云知声信息技术有限公司 Speech information processing method and device
CN108810608A (en) * 2018-05-24 2018-11-13 烽火通信科技股份有限公司 Live streaming based on IPTV and time-shift playing state switching system and method
CN112562638A (en) * 2020-11-26 2021-03-26 北京达佳互联信息技术有限公司 Voice preview method and device and electronic equipment

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101014996A (en) * 2003-09-17 2007-08-08 摩托罗拉公司 Speech synthesis
KR20060075320A (en) * 2004-12-28 2006-07-04 주식회사 팬택앤큐리텔 A mobile communication terminal and it's control method for proffering text information through voice composition
KR100798408B1 (en) * 2006-04-21 2008-01-28 주식회사 엘지텔레콤 Communication device and method for supplying text to speech function
KR100856786B1 (en) * 2006-07-27 2008-09-05 주식회사 와이즌와이드 System for multimedia naration using 3D virtual agent and method thereof
US8121842B2 (en) * 2008-12-12 2012-02-21 Microsoft Corporation Audio output of a document from mobile device
US10453479B2 (en) * 2011-09-23 2019-10-22 Lessac Technologies, Inc. Methods for aligning expressive speech utterances with text and systems therefor
CN103916716B (en) * 2013-01-08 2017-06-20 北京信威通信技术股份有限公司 The code rate smoothing method of realtime video transmission under a kind of wireless network
CN104810015A (en) * 2015-03-24 2015-07-29 深圳市创世达实业有限公司 Voice converting device, voice synthesis method and sound box using voice converting device and supporting text storage
CN107993646A (en) * 2016-10-25 2018-05-04 北京分音塔科技有限公司 A kind of method for realizing real-time voice intertranslation
CN107370814B (en) * 2017-07-21 2018-09-04 掌阅科技股份有限公司 E-book reads aloud processing method, terminal device and computer storage media
CN108337528B (en) * 2018-01-17 2021-04-16 浙江大华技术股份有限公司 Method and equipment for previewing video
CN108847215B (en) * 2018-08-29 2020-07-17 北京云知声信息技术有限公司 Method and device for voice synthesis based on user timbre
CN111105778A (en) * 2018-10-29 2020-05-05 阿里巴巴集团控股有限公司 Speech synthesis method, speech synthesis device, computing equipment and storage medium
CN110600004A (en) * 2019-09-09 2019-12-20 腾讯科技(深圳)有限公司 Voice synthesis playing method and device and storage medium
CN110769167A (en) * 2019-10-30 2020-02-07 合肥名阳信息技术有限公司 Method for video dubbing based on text-to-speech technology
CN111105779B (en) * 2020-01-02 2022-07-08 标贝(北京)科技有限公司 Text playing method and device for mobile client
CN111179973B (en) * 2020-01-06 2022-04-05 思必驰科技股份有限公司 Speech synthesis quality evaluation method and system

Also Published As

Publication number Publication date
CN112562638A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
WO2022110943A1 (en) Speech preview method and apparatus
US11019119B2 (en) Web-based live broadcast
US11336953B2 (en) Video processing method, electronic device, and computer-readable medium
US7818355B2 (en) System and method for managing content
WO2020155964A1 (en) Audio/video switching method and apparatus, and computer device and readable storage medium
CN101582926A (en) Method for realizing redirection of playing remote media and system
WO2018157743A1 (en) Media data processing method, device, system and storage medium
WO2019062667A1 (en) Method and device for transmitting conference content
WO2018192183A1 (en) Method and apparatus for processing video file during wireless screen delivery
JP2019050554A (en) Method and apparatus for providing voice service
WO2021136161A1 (en) Playback mode determining method and apparatus
US9819429B2 (en) Efficient load sharing and accelerating of audio post-processing
CN109842590B (en) Processing method and device for survey task and computer readable storage medium
GB2508138A (en) Delivering video content to a device by storing multiple formats
WO2022227625A1 (en) Signal processing method and apparatus
CN113192526B (en) Audio processing method and audio processing device
US9762704B2 (en) Service based media player
US7403605B1 (en) System and method for local replacement of music-on-hold
JP7282981B2 (en) METHOD AND SYSTEM FOR PLAYING STREAMING CONTENT USING LOCAL STREAMING SERVER
KR100991264B1 (en) Method and system for playing and sharing music sources on an electric device
KR101428472B1 (en) An apparatus for presenting cloud streaming service and a method thereof
JP7333731B2 (en) Method and apparatus for providing call quality information
EP4164198B1 (en) Method and system for generating media content
CN115631758B (en) Audio signal processing method, apparatus, device and storage medium
CN113364672B (en) Method, device, equipment and computer readable medium for determining media gateway information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896447

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/09/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21896447

Country of ref document: EP

Kind code of ref document: A1