WO2022110943A1 - Procédé et appareil de prévisualisation de la parole - Google Patents

Procédé et appareil de prévisualisation de la parole Download PDF

Info

Publication number
WO2022110943A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
text input
speech synthesis
server
data
Prior art date
Application number
PCT/CN2021/115113
Other languages
English (en)
Chinese (zh)
Inventor
陈翔宇
张晨
Original Assignee
北京达佳互联信息技术有限公司 (Beijing Dajia Internet Information Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2022110943A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • the present disclosure relates to the field of signal processing, and in particular, to a method and device for previewing speech.
  • the scenario in which a terminal device uses a speech synthesis (TTS) service is: text is input, a voice file is generated by calling a network service or an offline software development kit (SDK), the voice file is returned to the terminal device through the network or as a file, and thereafter the terminal device plays it by calling the voice file.
  • the user uses the terminal device to edit the captured video, edit the text, and then use the edited text to generate audio files with different timbres, which are then synthesized into the video to complete the dubbing process.
  • the present disclosure provides a voice preview method and device, and the technical solutions of the present disclosure are as follows:
  • a method for previewing a voice, including: receiving text input; buffering voice data synthesized based on the text input through a speech synthesis service; and, in the case where the synthesized voice data is buffered to a playable length, decoding and playing the buffered voice data.
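The buffer-then-play gate claimed above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `PreviewBuffer` name and the 4096-byte "playable length" threshold are assumptions.

```python
class PreviewBuffer:
    """Accumulates synthesized speech and reports when playback may begin."""

    def __init__(self, playable_bytes: int = 4096):
        self.playable_bytes = playable_bytes  # assumed "playable length"
        self.data = bytearray()
        self.playing = False

    def feed(self, chunk: bytes) -> bool:
        """Append one synthesized chunk; return True once enough data is
        buffered for decoding and playback to start (and stay True after)."""
        self.data.extend(chunk)
        if not self.playing and len(self.data) >= self.playable_bytes:
            self.playing = True
        return self.playing
```

Once `feed` returns True, a client would hand the buffered data to a decoder while continuing to feed newly received chunks.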
  • the method further includes: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input through a speech synthesis service.
  • based on the speech settings related to the speech synthesis service being changed, the server is notified to stop performing speech synthesis on the text input using the speech synthesis service according to the pre-change speech settings, the server is notified to re-synthesize the text input using the speech synthesis service according to the changed speech settings, and the speech data received from the server, synthesized by the speech synthesis service based on the text input according to the changed speech settings, is re-cached.
  • the method further includes: performing speech synthesis on the text input through a speech synthesis service to obtain speech data.
  • the speech synthesis operation on the text input using the speech synthesis service according to the pre-change speech settings is stopped, the text input is re-synthesized using the speech synthesis service according to the changed speech settings, and the speech data synthesized by the speech synthesis service based on the text input according to the changed speech settings is re-cached.
  • an apparatus for voice preview, comprising: a receiving unit configured to receive text input; a buffer unit configured to buffer voice data synthesized based on the text input through a speech synthesis service; a decoding unit configured to decode the buffered voice data when the synthesized voice data is buffered to a playable length; and a playing unit configured to play the decoded audio data.
  • the apparatus for voice preview further includes: a sending unit configured to send the text input to the server, wherein the receiving unit is further configured to receive from the server voice data synthesized from the text input through a speech synthesis service.
  • the sending unit is configured to notify the server to stop performing speech synthesis on the text input using the speech synthesis service according to the pre-change speech settings, and to notify the server to re-synthesize the text input using the speech synthesis service according to the changed speech settings;
  • the buffer unit is configured to re-cache the speech data, synthesized by the speech synthesis service based on the text input according to the changed speech settings, that the receiving unit receives from the server.
  • the apparatus for speech previewing further includes: a speech synthesis unit configured to perform speech synthesis on the text input through a speech synthesis service to obtain speech data.
  • the speech synthesis unit is configured to stop performing speech synthesis on the text input using the speech synthesis service according to the pre-change speech settings, and to re-synthesize the text input according to the changed speech settings;
  • the buffer unit is configured to re-cache the speech data synthesized based on the text input by the speech synthesis service according to the changed speech settings.
  • an electronic device including a processor and a memory for storing executable instructions, wherein the processor is configured to execute the executable instructions to implement the following steps: receiving text input; buffering speech data synthesized based on the text input through a speech synthesis service; and, when the synthesized speech data is buffered to a playable length, decoding and playing the buffered speech data.
  • the processor is configured to execute the executable instructions to implement the steps of: sending the text input to the server; and receiving, from the server, speech data synthesized from the text input through a speech synthesis service.
  • the processor is configured to execute the executable instructions to implement the steps of: based on the speech settings related to the speech synthesis service being changed, notifying the server to stop performing speech synthesis on the text input using the speech synthesis service according to the pre-change speech settings, notifying the server to re-synthesize the text input using the speech synthesis service according to the changed speech settings, and re-caching the speech data received from the server that was synthesized by the speech synthesis service based on the text input according to the changed speech settings.
  • the processor is configured to execute executable instructions to perform the steps of: performing speech synthesis on textual input through a speech synthesis service to obtain speech data.
  • the processor is configured to execute the executable instructions to implement the steps of: based on the speech settings related to the speech synthesis service being changed, stopping the speech synthesis operation on the text input using the speech synthesis service according to the pre-change speech settings, re-synthesizing the text input using the speech synthesis service according to the changed speech settings, and re-caching the speech data synthesized by the speech synthesis service based on the text input according to the changed speech settings.
  • a speech processing system comprising: a terminal device configured to receive text input, send the text input to a server, buffer in real time the voice data synthesized from the text input and received from the server, and, when the received voice data is buffered to a playable length, decode and play the buffered voice data; and a server configured to perform speech synthesis on the text input received from the terminal device to obtain voice data, and to transmit the obtained voice data to the terminal device in real time.
  • a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the following steps: receiving text input; buffering speech data synthesized based on the text input through a speech synthesis service; and, when the synthesized speech data is buffered to a playable length, decoding and playing the buffered speech data.
  • a computer program product, wherein instructions in the computer program product, when executed by at least one processor in an electronic device, perform the following steps: receiving text input; buffering speech data synthesized based on the text input through a speech synthesis service; and, when the synthesized speech data is buffered to a playable length, decoding and playing the buffered speech data.
  • through real-time transmission, the delay is greatly reduced, and real-time preview starts when only a small amount of voice data has been buffered, with almost no waiting.
  • in addition, when the voice settings are changed, the local terminal device itself stops, or informs the server not to perform, TTS on the remaining text that has not yet been synthesized, which reduces the cost of the TTS service, improves the speed of TTS preview for users in video editing, and optimizes the user experience.
  • FIG. 1 is an exemplary system architecture diagram to which exemplary embodiments of the present disclosure may be applied.
  • FIG. 2 is a flowchart of a method for voice preview according to an exemplary embodiment of the present disclosure.
  • FIG. 3 is a detailed flowchart of a method for voice preview when a TTS service is executed on the server side, according to an exemplary embodiment of the present disclosure.
  • FIG. 4 is a detailed flowchart of a method for voice preview when a TTS service is executed locally on a terminal device, according to an exemplary embodiment of the present disclosure.
  • FIG. 5 is a block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure.
  • FIG. 6 is a detailed block diagram of an apparatus for voice preview according to an exemplary embodiment of the present disclosure.
  • FIG. 7 is a detailed block diagram of an apparatus for voice preview according to another exemplary embodiment of the present disclosure.
  • FIG. 8 is a block diagram of a system for voice preview according to an exemplary embodiment of the present disclosure.
  • FIG. 9 is a block diagram of an electronic device according to an embodiment of the present disclosure.
  • the present disclosure proposes to cache, in real time, the audio data synthesized through the TTS service after receiving the text input, and to start real-time preview of the audio file when only a small amount of data has been cached, almost without waiting; at the same time, when the timbre is switched, speech synthesis is no longer performed on the text that has not yet been synthesized, which reduces the cost of the TTS service.
  • FIG. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired links, wireless communication links, or fiber-optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages (eg, TTS service request, audio and video data upload request, audio and video data acquisition request) and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as video recording applications, audio playback applications, video and audio editing applications, instant messaging tools, email clients, social platform software, and the like.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • if the terminal devices 101, 102 and 103 are hardware, they can be various electronic devices that have a display screen and are capable of playing, recording and editing audio and video, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like.
  • if the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above and can be implemented as multiple software programs or software modules (for example, used to provide distributed services) or as a single software program or software module. There is no specific limitation here.
  • the terminal devices 101, 102, 103 may be installed with image capture devices (eg, cameras) to capture video data.
  • the terminal devices 101, 102, 103 may also be installed with components for converting electrical signals into sounds (such as speakers) to play sounds, and may also be installed with devices for converting sounds into electrical signals (for example, microphones) to capture sound.
  • the terminal devices 101, 102, and 103 can use the image collection devices installed on them to collect video data, and use the audio collection devices installed on them to collect audio data. Moreover, the terminal devices 101, 102, 103 can perform the TTS service on the received text input to synthesize audio data from the text input, and can play the audio data by using an audio processing component installed on them that supports audio playback.
  • the server 105 may be a server that provides various services, such as a background server that provides support for audio and video recording applications, audio and video editing applications, etc. installed on the terminal devices 101 , 102 , and 103 .
  • the background server can perform analysis, the TTS service, storage, and other processing on uploaded text input and other data, and can also receive TTS service requests sent by the terminal devices 101, 102, and 103 and feed the speech-synthesized audio data back to the terminal devices 101, 102, and 103.
  • the server may be hardware or software.
  • if the server is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
  • if the server is software, it can be implemented as multiple software programs or software modules (for example, for providing distributed services), or as a single software program or software module. There is no specific limitation here.
  • the voice preview methods provided in the embodiments of the present application are generally performed by the terminal devices 101 , 102 , and 103 , and correspondingly, the voice preview devices are generally set in the terminal devices 101 , 102 and 103 .
  • the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative; according to implementation requirements, there may be any number of terminal devices, networks and servers, which is not limited in the present disclosure.
  • Fig. 2 is a flow chart of a method for voice preview according to an exemplary embodiment of the present disclosure.
  • a terminal device receives text input.
  • the text input may be text input or edited by the user in any way on the terminal device.
  • for example, the user can enter text directly into the audio and video editing software on the terminal device, or directly load text files received from other devices or downloaded from a server into the audio and video editing software.
  • the present disclosure does not specifically limit the manner of receiving text input, and any manner that can perform text input is included within the scope of the present disclosure.
  • the terminal device buffers the speech data synthesized based on the text input through the speech synthesis service (TTS service).
  • TTS service the speech synthesis service
  • the buffering can be performed in real time.
  • when the synthesized voice data is buffered to a playable length, the terminal device decodes and plays the buffered voice data.
  • the TTS service may be a local TTS service invoked by the terminal device through audio and video editing software, or may be a server-side TTS service invoked by the audio and video editing software.
  • in other words, the terminal device may locally perform the TTS service to synthesize the text input into voice data, or the terminal device may request the server side to perform the TTS service to synthesize the uploaded text input into voice data.
  • FIG. 3 is a detailed flowchart of a method for voice preview when a TTS service is executed on the server side according to an exemplary embodiment of the present disclosure.
  • in step S301, the terminal device receives text input. Since this step is the same as the operation of step S201, it is not described repeatedly here.
  • in step S302, the terminal device sends the text input to the server.
  • the terminal device can invoke a background TTS service located on the server side; that is, the terminal device can invoke the TTS service on the server side and upload the text input to the server, and the server then synthesizes the text input received from the terminal device into voice data through the TTS service.
  • in step S303, the terminal device receives, from the server, the voice data synthesized from the text input through the TTS service. For example, the reception may be performed in real time; in other words, the terminal device can receive the synthesized voice data from the server in real time through streaming.
  • based on receiving the text input uploaded from the terminal device, the server performs speech synthesis on the text input through the TTS service according to the request of the terminal device, thereby generating audio encoded data in a specific format, for example, audio encoded data in the Advanced Audio Coding (AAC) format.
  • the request of the terminal device may include various voice settings related to the TTS service, for example, the timbre, pitch, speech rate, tone, background music, etc. that the user expects of the synthesized voice.
  • the present disclosure does not specifically limit the format of the generated audio encoded data; any audio format in which the generated audio encoded data can be subsequently streamed, each audio frame can be independently decoded, and the audio frame duration is small is included within the scope of this disclosure.
  • the generated audio encoded data in a specific format is packaged by the server and transmitted to the terminal device in real time by streaming.
  • for example, the generated audio encoded data can be packaged by the server in the Audio Data Transport Stream (ADTS) format and transmitted back to the terminal device in real time by streaming.
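The ADTS packaging mentioned above can be illustrated as follows. This is a hedged sketch, not the patent's implementation: `pack_adts_header` and `adts_frame_length` are illustrative helper names, and the AAC-LC profile and 44.1 kHz defaults are assumptions. The bit layout follows the standard 7-byte ADTS header, whose 13-bit length field lets a receiver split a stream into independently decodable frames.

```python
def pack_adts_header(frame_payload_len: int,
                     profile: int = 1,   # assumed: AAC-LC (profile index 1)
                     sfi: int = 4,       # assumed: 44.1 kHz sampling-freq index
                     channels: int = 2) -> bytes:
    """Build a 7-byte ADTS header (protection_absent = 1, so no CRC)."""
    frame_len = frame_payload_len + 7  # the length field covers the header too
    return bytes([
        0xFF,                                        # syncword 0xFFF, high 8 bits
        0xF1,                                        # sync low 4 bits, MPEG-4, no CRC
        (profile << 6) | (sfi << 2) | (channels >> 2),
        ((channels & 0x3) << 6) | ((frame_len >> 11) & 0x3),
        (frame_len >> 3) & 0xFF,
        ((frame_len & 0x7) << 5) | 0x1F,             # buffer fullness = 0x7FF (VBR)
        0xFC,                                        # fullness low bits, 1 raw block
    ])

def adts_frame_length(header: bytes) -> int:
    """Extract the total frame length so a byte stream can be cut into
    independently decodable AAC frames."""
    assert header[0] == 0xFF and (header[1] & 0xF0) == 0xF0, "bad ADTS syncword"
    return ((header[3] & 0x03) << 11) | (header[4] << 3) | (header[5] >> 5)
```

A streaming client can scan for the syncword, read the length field, and hand each complete frame to the decoder as soon as it arrives.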
  • in step S304, the terminal device buffers, in real time, the speech data synthesized from the text input through the TTS service.
  • the terminal device buffers the received voice data in a buffer in real time.
  • for example, the terminal device may buffer, in real time, the ADTS-format data packets received from the server.
  • in step S305, when the synthesized voice data is buffered to a playable length, the terminal device decodes and plays the buffered voice data.
  • based on the buffered voice data reaching a playable length (that is, a certain size), the terminal device can decode the currently buffered voice data and play the decoded pulse-code modulation (PCM) data.
  • while decoding, the terminal device continuously receives the just-synthesized voice data from the server by streaming and buffers it, to ensure the continuity of decoding and playback.
  • the above-described process of receiving the synthesized voice data from the server in real time through streaming transmission, buffering the received voice data in real time, and starting decoding and playback after a certain amount of data has been buffered realizes real-time preview and greatly reduces the preview delay.
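The receive-while-playing pipeline described above can be sketched with a producer and a consumer thread. This is illustrative only: `decode` is a stand-in for a real AAC decoder, and the two-chunk start threshold is an assumption.

```python
import queue
import threading

def decode(frame: bytes) -> bytes:
    # Stand-in for a real AAC decoder that would produce PCM samples.
    return frame

def stream_preview(chunks, playable_chunks: int = 2):
    """Buffer streamed audio frames and play them concurrently: decoding
    starts once `playable_chunks` frames have arrived, while the receiver
    keeps buffering, so playback never waits for the whole synthesis."""
    buf = queue.Queue()
    ready = threading.Event()
    played = []

    def receiver():
        for c in chunks:
            buf.put(c)
            if buf.qsize() >= playable_chunks:
                ready.set()          # enough data buffered: playback may start
        buf.put(None)                # end-of-stream marker
        ready.set()                  # short inputs must still start playback

    def player():
        ready.wait()
        while True:
            frame = buf.get()
            if frame is None:
                break
            played.append(decode(frame))

    t_recv = threading.Thread(target=receiver)
    t_play = threading.Thread(target=player)
    t_recv.start(); t_play.start()
    t_recv.join(); t_play.join()
    return played
```

The queue between the two threads is what keeps decoding and playback continuous while new data is still arriving.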
  • during the voice preview, the user may be dissatisfied with the voice synthesized according to the current voice settings related to the TTS service, and may then change the voice settings related to the TTS service.
  • for example, the user may change the timbre, pitch, speech rate, tone or background music of the voice settings. In the related art, the voice data corresponding to the entire text input is often fully synthesized through the TTS service, according to the user-set voice settings, before any voice preview can be performed; therefore, if the user does not like the voice synthesized according to the current voice settings, resources are wasted and the user experience suffers.
  • based on the voice settings related to the TTS service being changed, the terminal device may notify the server to stop the speech synthesis operation on the text input using the TTS service according to the pre-change voice settings, notify the server to re-synthesize the text input using the TTS service according to the changed voice settings, and locally re-cache the voice data, synthesized through the TTS service according to the changed voice settings, that is received from the server in real time in a streaming manner.
  • in addition, the terminal device can delete the previously cached voice data synthesized according to the previous voice settings and, when the re-cached voice data synthesized according to the changed voice settings reaches a playable length, decode and preview the buffered voice data. In this way, when the voice settings related to the TTS service are changed, the server does not perform speech synthesis on the remaining text input for which speech synthesis has not yet been completed, thereby reducing the cost of the TTS service.
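One illustrative way (not specified by the patent) to discard in-flight data after a settings change is to tag each synthesis request with a generation counter and drop packets from older generations; the `TtsSession` class and its member names are hypothetical.

```python
class TtsSession:
    """Hypothetical client-side session that drops stale synthesis packets
    after the user changes TTS voice settings (timbre, pitch, rate, ...)."""

    def __init__(self):
        self.generation = 0        # bumped on every settings change
        self.cache = bytearray()   # voice data for the current settings

    def change_settings(self) -> int:
        """Invalidate the cache and bump the generation; a real client would
        also notify the server (or local SDK) to stop the old synthesis."""
        self.generation += 1
        self.cache.clear()
        return self.generation

    def on_packet(self, generation: int, data: bytes) -> bool:
        """Buffer a received packet only if it belongs to the current
        generation; in-flight packets from pre-change settings are dropped."""
        if generation != self.generation:
            return False
        self.cache.extend(data)
        return True
```

This mirrors the behavior described above: the old cache is deleted, stale packets are ignored, and only data synthesized under the new settings is re-cached.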
  • finally, the terminal device can convert the buffered ADTS data corresponding to the current voice settings into audio data in another format such as M4A, and at the same time generate audio metadata including, for example, duration, sample rate, and number of channels, which is finally provided to the audio and video editing SDK for use.
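The duration entry of the audio metadata mentioned above can be derived directly from the buffered AAC frames, since each AAC frame decodes to 1024 PCM samples per channel. This is a sketch; the function and field names are illustrative, not from the patent.

```python
AAC_SAMPLES_PER_FRAME = 1024  # each AAC frame decodes to 1024 PCM samples

def audio_metadata(num_frames: int, sample_rate: int, channels: int) -> dict:
    """Compute the metadata handed to the editing SDK after remuxing
    (illustrative helper; the dictionary keys are assumptions)."""
    return {
        "duration_s": num_frames * AAC_SAMPLES_PER_FRAME / sample_rate,
        "sample_rate": sample_rate,
        "channels": channels,
    }
```

For example, 100 buffered frames at 44.1 kHz correspond to roughly 2.32 seconds of audio.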
  • the above describes the detailed process of the method in which the terminal device invokes the TTS service on the server side to implement the speech preview of speech synthesis.
  • the following describes the detailed process of the method for the terminal device to invoke the local TTS service to implement the speech preview of speech synthesis.
  • FIG. 4 is a detailed flowchart of a method for voice preview when a TTS service is executed locally on a terminal device, according to an exemplary embodiment of the present disclosure.
  • in step S401, text input is received. Since this step is the same as the operation of step S201, it is not described repeatedly here.
  • in step S402, the terminal device performs speech synthesis on the text input locally through the TTS service to obtain speech data.
  • the terminal device can invoke a background TTS service located locally on the terminal device; that is, the terminal device can invoke the local TTS service. For example, the terminal device calls the local TTS service through the API of the audio and video editing software and passes the text input to it, and the text input is then synthesized into speech data through the TTS service according to the user's voice settings related to the TTS service.
  • for example, the terminal device may synthesize the text input into audio encoded data in a specific format through the local TTS service, for example, generating audio encoded data in the AAC format.
  • the voice settings related to the TTS service may include, for example, the user's desired timbre of the synthesized voice, background music, pitch, speech rate, tone, and the like.
  • in step S403, the terminal device buffers the speech data synthesized from the text input through the TTS service.
  • the buffering can be performed in real time.
  • the terminal device can buffer the voice data synthesized by calling the local TTS service in real time.
  • for example, the terminal device can buffer, in the buffer, the AAC-format audio encoded data synthesized through the local TTS service.
  • in step S404, when the synthesized voice data is buffered to a playable length, the terminal device decodes and plays the buffered voice data.
  • based on the buffered voice data reaching a playable length (that is, a certain size), the terminal device can decode the currently buffered voice data and play the decoded PCM data.
  • while decoding, the terminal device continuously caches, in real time, the voice data just synthesized through the local TTS service, to ensure the continuity of audio decoding and playback.
  • as described above, the terminal device directly implements speech synthesis by calling the local TTS service, buffers the synthesized speech data, and starts decoding and playback after a certain amount of data has been buffered, thereby realizing real-time preview and greatly reducing the preview delay.
  • based on the voice settings related to the TTS service being changed, the terminal device may stop the speech synthesis operation on the text input using the local TTS service according to the pre-change voice settings, re-synthesize the text input using the TTS service according to the changed voice settings, and re-cache the speech data synthesized from the text input through the TTS service according to the changed voice settings.
  • in addition, the terminal device can delete the previously cached voice data synthesized according to the previous voice settings and, when the re-cached voice data synthesized according to the changed voice settings reaches a playable length, decode and preview it.
  • in this way, the terminal device does not perform speech synthesis on the remaining text input for which speech synthesis has not yet been completed, thereby reducing the cost of the TTS service.
  • finally, the terminal device can convert the buffered voice data corresponding to the current voice settings into audio data in another format such as M4A, and generate audio metadata including, for example, duration, sample rate, and number of channels, which is finally provided to the audio and video editing SDK for use.
  • FIG. 5 is a block diagram of a voice preview apparatus 500 according to an exemplary embodiment of the present disclosure.
  • the apparatus 500 for voice preview may include a receiving unit 510, a buffering unit 520, a decoding unit 530 and a playing unit 540.
  • the receiving unit 510 can receive text input.
  • the repeated description is omitted here.
  • the buffering unit 520 may buffer the speech data synthesized based on the text input through the speech synthesis service (TTS service).
  • the TTS service may be a local TTS service invoked by the apparatus 500 or an invoked server-side TTS service; in other words, the TTS service may be performed locally in the apparatus 500 to synthesize the text input into voice data, or the apparatus 500 may request the server side to perform the TTS service to synthesize the uploaded text input into voice data.
  • FIG. 6 is a detailed block diagram of a voice preview apparatus 600 according to an exemplary embodiment of the present disclosure.
  • the apparatus 600 for voice preview may include a receiving unit 610, a buffering unit 620, a decoding unit 630, a playing unit 640 and a sending unit 650.
  • the receiving unit 610 can receive text input; that is, the receiving unit 610 can perform the operation corresponding to step S201 described above with reference to FIG. 2, and therefore a detailed description is omitted here.
  • the sending unit 650 may send the text input to the server.
  • the apparatus 600 can invoke a background TTS service located on the server side; that is, the apparatus 600 can invoke the TTS service on the server side and upload the text input to the server, and the server then synthesizes the text input received from the apparatus 600 into voice data through the TTS service.
  • the receiving unit 610 may receive, from the server, voice data synthesized from the text input through the TTS service. For example, the reception may be performed in real time; in other words, the apparatus 600 may receive the synthesized voice data from the server in real time through streaming.
  • based on receiving the text input uploaded from the apparatus 600, the server performs speech synthesis on the text input through the TTS service according to the request of the apparatus 600, thereby generating audio encoded data in a specific format, such as audio encoded data in the AAC format. Thereafter, the generated audio encoded data in the specific format is packaged by the server and transmitted to the apparatus 600 in real time in a streaming manner.
  • the buffering unit 620 may buffer, in real time, the speech data synthesized from the text input through the TTS service. In some embodiments, based on the voice data synthesized by the TTS service being received by the apparatus 600 from the server, the buffering unit 620 buffers the received voice data. For example, the buffering unit 620 may buffer, in real time, the ADTS-format data packets received from the server.
  • the decoding unit 630 may decode the buffered speech data when the synthesized speech data is buffered to a playable length. In some embodiments, based on the voice data buffered by the buffering unit 620 reaching a playable length (that is, a certain size), the decoding unit 630 can decode the currently buffered voice data, and the playing unit 640 can play the decoded PCM data. In addition, while the above decoding is performed, the buffering unit 620 continuously buffers, in real time, the voice data received from the server in the streaming transmission mode, so as to ensure the continuity of the decoding operation of the decoding unit 630 and the playing operation of the playing unit 640.
  • the sending unit 650 may notify the server to stop the speech synthesis operation on the text input using the TTS service according to the pre-change voice settings and notify the server to re-synthesize the text input using the TTS service according to the changed voice settings, and the buffering unit 620 may re-cache the voice data, synthesized based on the text input through the TTS service according to the changed voice settings, that the receiving unit 610 receives from the server.
  • in addition, the buffering unit 620 may delete the previously cached voice data synthesized according to the previous voice settings; based on the voice data re-cached by the buffering unit 620 according to the changed voice settings reaching a playable length, the decoding unit 630 decodes the buffered voice data and the playing unit 640 preview-plays the decoded voice data.
  • finally, the apparatus 600 may convert the buffered ADTS data corresponding to the current voice settings into audio data in another format such as M4A, and at the same time generate audio metadata including, for example, duration, sample rate, and number of channels, which is finally provided to the audio and video editing SDK for use.
  • for any relevant details of the operations performed by the units in FIG. 6, reference may be made to the corresponding descriptions given with respect to FIG. 3, which are not repeated here.
  • FIG. 7 is a detailed block diagram of a voice preview apparatus 700 according to another exemplary embodiment of the present disclosure.
  • the apparatus 700 for previewing a voice may include a receiving unit 710 , a buffering unit 720 , a decoding unit 730 , a playing unit 740 and a voice synthesis unit 750 .
  • the receiving unit 710 may receive text input, that is, the receiving unit 710 may perform operations corresponding to the step S210 described above with reference to FIG. 2 , and thus will not be repeated here.
  • the speech synthesis unit 750 may perform speech synthesis on the text input locally through the TTS service to obtain speech data.
  • the speech synthesis unit 750 may invoke a background TTS service that is located locally on the device 700, that is, the speech synthesis unit 750 may call the local TTS service.
  • the speech synthesis unit 750 may call the local TTS service through the API of the audio and video editing software and pass the text input to the local TTS service, and the TTS service then synthesizes the text input into voice data according to the user's TTS-service-related voice settings.
  • the buffering unit 720 may buffer the speech data synthesized from the text input through the TTS service in real time. In some embodiments, the buffering unit 720 may buffer the speech data synthesized through the invoked local TTS service in real time.
  • the decoding unit 730 may decode the buffered speech data when the synthesized speech data is buffered by the buffering unit 720 to a playable length, and the playback unit 740 performs preview playback.
  • the decoding unit 730 may decode the currently buffered voice data, and the playing unit 740 may play the decoded PCM data.
  • while the above decoding is performed, the buffering unit 720 continues to buffer, in real time, the voice data newly synthesized through the local TTS service, so as to ensure the continuity of the decoding operation of the decoding unit 730 and the playing operation of the playing unit 740.
  • the speech synthesis unit 750 may stop using the TTS service to synthesize the text input according to the voice settings before the change, and re-synthesize the text input through the TTS service according to the changed voice settings.
  • the buffering unit 720 may delete the previously buffered voice data synthesized according to the previous voice settings, and when the re-buffered voice data synthesized according to the changed voice settings in the buffering unit 720 reaches a playable length, the decoding unit 730 decodes the buffered voice data and the playing unit 740 performs preview playback.
  • the apparatus 700 may convert the buffered voice data corresponding to the current voice settings into audio data in another format such as M4A, and generate audio metadata, such as duration, sample rate, number of channels and other information, which is finally provided to the audio and video editing SDK for use.
  • FIG. 8 is a block diagram of a system 800 for voice preview according to an exemplary embodiment of the present disclosure.
  • the system 800 includes a terminal device 810 and a server 820 .
  • the terminal device 810 may be any one of the devices 101, 102 and 103 shown in FIG. 1.
  • the server 820 may be the server 105 shown in FIG. 1 .
  • the terminal device 810 can receive the text input, send the text input to the server 820, buffer in real time the voice data received from the server 820 that is synthesized from the text input through the TTS service, and, when the received voice data is buffered to a playable length, decode and play the buffered voice data.
  • the server 820 may perform speech synthesis on the text input received from the terminal device 810 through the TTS service to obtain voice data, and transmit the obtained voice data to the terminal device 810 in real time. Since the operations performed by the terminal device 810 and the server 820 are the same as the corresponding operations described above with reference to FIG. 6, they are not repeated here.
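The division of labour between the terminal device 810 and the server 820 can be simulated end to end; both functions below are illustrative stand-ins, with the server's actual synthesis replaced by a placeholder payload:

```python
from typing import Iterator


def tts_server(text: str, chunk_size: int = 2048) -> Iterator[bytes]:
    """Stands in for server 820: streams 'synthesized' speech chunk by chunk.
    (Real synthesis is replaced with a repeated placeholder payload.)"""
    payload = text.encode("utf-8") * 2000  # placeholder for synthesized audio
    for i in range(0, len(payload), chunk_size):
        yield payload[i:i + chunk_size]


def terminal_preview(stream: Iterator[bytes], playable_len: int = 4096) -> list:
    """Stands in for terminal 810: buffers the stream and emits playable segments."""
    segments, buffered = [], bytearray()
    for chunk in stream:
        buffered.extend(chunk)
        if len(buffered) >= playable_len:
            segments.append(bytes(buffered))  # would be decoded and played here
            buffered.clear()
    if buffered:
        segments.append(bytes(buffered))  # flush the tail once the stream ends
    return segments
```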
  • the electronic device 900 may include at least one memory 910 and at least one processor 920, the at least one memory 910 storing a set of computer-executable instructions which, when executed by the at least one processor 920, cause the at least one processor 920 to perform the method of voice preview according to an embodiment of the present disclosure.
  • the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions.
  • the electronic device does not have to be a single electronic device, but can also be any set of devices or circuits that can individually or jointly execute the above-mentioned instructions (or instruction sets).
  • the electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
  • a processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor.
  • processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
  • the processor may execute instructions or code stored in memory, which may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
  • the memory may be integrated with the processor, e.g., RAM or flash memory arranged within an integrated circuit microprocessor or the like. Additionally, the memory may comprise a separate device such as an external disk drive, a storage array, or any other storage device that may be used by a database system.
  • the memory and the processor may be operatively coupled, or may communicate with each other, e.g., through I/O ports, network connections, etc., to enable the processor to read files stored in the memory.
  • the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device can be connected to each other via a bus and/or a network.
  • a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the method of voice preview according to an exemplary embodiment of the present disclosure.
  • Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), and card memory (such as a multimedia card or a Secure Digital (SD) card).
  • the computer program in the above-mentioned computer-readable storage medium may run in an environment deployed on computing devices such as clients, hosts, proxy devices and servers; in one example, the computer program and any associated data, data files and data structures are distributed over networked computer systems, so that the computer program and any associated data, data files and data structures are stored, accessed and executed in a distributed fashion by one or more processors or computers.
  • a computer program product can also be provided, wherein instructions in the computer program product can be executed by at least one processor in an electronic device to implement the method for voice preview according to an exemplary embodiment of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method for previewing speech comprises: receiving a text input (S201); buffering, in real time, speech data synthesized from the text input by means of a speech synthesis service (S202); and, when the synthesized speech data has been buffered to a playable length, decoding and playing the buffered speech data (S203). Also provided is an apparatus for previewing speech (600), comprising a receiving unit (610), a buffering unit (620), a decoding unit (630), a playing unit (640) and a sending unit (650). Further provided are an electronic device, a speech processing system, a computer-readable storage medium and a computer program product. Latency is greatly reduced by means of real-time transmission, and real-time preview starts with almost no waiting time once only a small amount of speech data has been buffered. When a timbre change is made, the local terminal device itself no longer performs TTS on the remaining text that has not yet undergone TTS, or notifies an associated server accordingly, so that the cost of TTS services is reduced, which improves the speed of TTS preview for a user during video editing and optimizes the user experience.
PCT/CN2021/115113 2020-11-26 2021-08-27 Method and apparatus for previewing speech WO2022110943A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011355823.8 2020-11-26
CN202011355823.8A CN112562638A (zh) Voice preview method and apparatus, and electronic device

Publications (1)

Publication Number Publication Date
WO2022110943A1 true WO2022110943A1 (fr) 2022-06-02

Family

ID=75046232

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115113 WO2022110943A1 (fr) Method and apparatus for previewing speech

Country Status (2)

Country Link
CN (1) CN112562638A (fr)
WO (1) WO2022110943A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562638A (zh) * 2020-11-26 2021-03-26 Beijing Dajia Internet Information Technology Co., Ltd. Voice preview method and apparatus, and electronic device
CN113066474A (zh) * 2021-03-31 2021-07-02 Beijing Orion Star Technology Co., Ltd. Voice broadcasting method, apparatus, device and medium
CN116110410B (zh) * 2023-04-14 2023-06-30 Beijing Sophgo Technology Co., Ltd. Audio data processing method and apparatus, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050187773A1 (en) * 2004-02-02 2005-08-25 France Telecom Voice synthesis system
US20060200355A1 (en) * 2005-03-01 2006-09-07 Gil Sideman System and method for a real time client server text to speech interface
CN102169689A (zh) * 2011-03-25 2011-08-31 Shenzhen TCL New Technology Co., Ltd. Method for implementing a speech synthesis plug-in
CN106531167A (zh) * 2016-11-18 2017-03-22 Beijing Unisound Information Technology Co., Ltd. Voice information processing method and apparatus
CN108810608A (zh) * 2018-05-24 2018-11-13 Fiberhome Telecommunication Technologies Co., Ltd. IPTV-based system and method for switching between live and time-shifted playback states
CN112562638A (zh) * 2020-11-26 2021-03-26 Beijing Dajia Internet Information Technology Co., Ltd. Voice preview method and apparatus, and electronic device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101014996A (zh) * 2003-09-17 2007-08-08 Motorola, Inc. Speech synthesis
KR20060075320A (ko) * 2004-12-28 2006-07-04 Pantech & Curitel Communications, Inc. Mobile communication terminal providing text information through speech synthesis, and control method thereof
US8121842B2 * 2008-12-12 2012-02-21 Microsoft Corporation Audio output of a document from mobile device
CN103916716B (zh) * 2013-01-08 2017-06-20 Beijing Xinwei Telecom Technology Co., Ltd. Bit rate smoothing method for real-time video transmission over a wireless network
CN104810015A (zh) * 2015-03-24 2015-07-29 Shenzhen Chuangshida Industrial Co., Ltd. Voice conversion apparatus and method, and text-storage-supporting speaker using the apparatus
CN111105778A (zh) * 2018-10-29 2020-05-05 Alibaba Group Holding Limited Speech synthesis method and apparatus, computing device and storage medium
CN111105779B (zh) * 2020-01-02 2022-07-08 Databaker (Beijing) Technology Co., Ltd. Text playing method and apparatus for mobile client
CN111179973B (zh) * 2020-01-06 2022-04-05 AISpeech Co., Ltd. Speech synthesis quality evaluation method and system

Also Published As

Publication number Publication date
CN112562638A (zh) 2021-03-26

Similar Documents

Publication Publication Date Title
WO2022110943A1 (fr) Method and apparatus for previewing speech
US11019119B2 (en) Web-based live broadcast
US11336953B2 (en) Video processing method, electronic device, and computer-readable medium
US7818355B2 (en) System and method for managing content
WO2021159770A1 (fr) Procédé de lecture de vidéo, dispositif, appareil et support de stockage
JP2015029317A (ja) デジタルコンテンツをパーソナルコンピュータから携帯用ハンドセットへ転送するための方法と装置
WO2020155964A1 (fr) Procédé et appareil de commutation audio/vidéo, et dispositif informatique et support d'informations lisible
CN101582926A (zh) 实现远程媒体播放重定向的方法和系统
WO2018157743A1 (fr) Procédé de traitement de données multimédias, dispositif, système, et support de stockage
WO2018192183A1 (fr) Procédé et appareil permettant le traitement d'un fichier vidéo pendant une distribution d'écran sans fil
JP2019050554A (ja) 音声サービスを提供するための方法および装置
WO2021136161A1 (fr) Procédé et appareil de détermination de mode de lecture
WO2019062667A1 (fr) Procédé et dispositif de transmission de contenu de conférence
US9819429B2 (en) Efficient load sharing and accelerating of audio post-processing
GB2508138A (en) Delivering video content to a device by storing multiple formats
WO2022227625A1 (fr) Procédé et appareil de traitement de signaux
CN109842590B (zh) 一种查勘任务的处理方法、装置及计算机可读存储介质
CN113192526B (zh) 音频处理方法和音频处理装置
US9762704B2 (en) Service based media player
US7403605B1 (en) System and method for local replacement of music-on-hold
JP7282981B2 (ja) ローカルストリーミングサーバを利用したストリーミングコンテンツの再生方法およびシステム
KR100991264B1 (ko) 전자 단말기의 음원 재생 배포 방법 및 그 시스템
JP7333731B2 (ja) 通話品質情報を提供する方法および装置
CN115631758B (zh) 音频信号处理方法、装置、设备和存储介质
US20230114327A1 (en) Method and system for generating media content

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896447

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/09/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21896447

Country of ref document: EP

Kind code of ref document: A1