WO2023173269A1 - Data processing method and apparatus - Google Patents

Data processing method and apparatus

Info

Publication number: WO2023173269A1
Authority: WIPO (PCT)
Prior art keywords: data, resource, processed, fourier transform, time fourier
Application number: PCT/CN2022/080823
Other languages: English (en), French (fr)
Inventors: 陈亮, 聂为然
Original assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority: CN202280008495.5A (published as CN117157705A); PCT/CN2022/080823 (published as WO2023173269A1)

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; Text to speech systems
    • G10L 13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 — Concept to speech synthesisers; Generation of natural phrases from machine-based concepts

Definitions

  • the present application relates to the field of artificial intelligence, and more specifically, to a data processing method and device.
  • Speech synthesis technology is also known as text-to-speech (TTS) technology. As TTS algorithms advance, their complexity keeps increasing, and a TTS algorithm of higher complexity is not suitable for implementation on the terminal side, where computing resources are limited.
  • In the traditional solution, the TTS algorithm is deployed on a server (for example, a cloud server); speech synthesis is completed on the server, and the synthesized speech is then delivered to the user via the network. However, this method suffers from poor business continuity and low processing efficiency.
  • Embodiments of the present application provide a data processing method and device, which can improve business continuity and processing efficiency.
  • In a first aspect, a data processing method is provided. The method is applied to a server, and the server communicates with a terminal device through a network. The method includes: receiving data to be processed from the terminal device; obtaining characteristic data of the data to be processed according to the data to be processed, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed and the characteristic data is used to synthesize the target data; and sending the characteristic data to the terminal device.
  • The terminal device can be a smart terminal such as a mobile phone, a personal computer, a vehicle, or an information processing center; the server can be a server with data processing functions, such as a cloud server, a network server, an application server, or a management server.
  • data processing in the embodiments of this application may refer to streaming data processing.
  • Streaming data processing refers to synthesizing the target data in segments, where the segment synthesized first is played first. While the first synthesized segment is being played, subsequent segments continue to be synthesized, so there is no need to wait until the entire target data is synthesized before playback starts. This reduces the waiting time for data synthesis.
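As an illustrative sketch (not taken from the application itself), streaming synthesis can be modeled as a generator that emits each synthesized segment as soon as it is ready, so playback of segment 1 can overlap with synthesis of segment 2. The `synthesize_segment` function below is a hypothetical placeholder for a real TTS back end.

```python
# Sketch of streaming synthesis: each segment is yielded (and can be
# played) while later segments are still being synthesized. The
# synthesize_segment function is a hypothetical stand-in for a real
# TTS back end.
def synthesize_segment(text: str) -> bytes:
    # Placeholder: pretend each character becomes one audio byte.
    return text.encode("utf-8")

def stream_tts(text: str, segment_len: int = 8):
    """Yield synthesized audio segment by segment."""
    for start in range(0, len(text), segment_len):
        yield synthesize_segment(text[start:start + segment_len])

played = []
for chunk in stream_tts("hello streaming speech synthesis"):
    played.append(chunk)  # in a real system: hand to the audio player

# Playback could begin after the first chunk, not after all of them.
first_chunk = played[0]
```

The key property is that the consumer receives the first chunk before the later chunks exist, which is what shortens the perceived waiting time.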
  • the data processing methods in the embodiments of the present application can be applied to technical fields such as voice processing, image processing, or video processing.
  • For example, they can be applied to streaming speech synthesis in the technical field of voice processing; the following description takes speech synthesis technology as an example.
  • the terminal device can synthesize target data based on the characteristic data and play the target data.
  • In the prior art, the server is usually used to directly synthesize the target data; the synthesized target data is then sent to the terminal device and finally played by the terminal device. In the method of this application, by contrast, the server is mainly used to obtain the characteristic data and send the characteristic data to the terminal device. Since the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, compared with directly sending the target data, this can effectively reduce the demand for network resources during data transmission and improve the smoothness of target data playback, thereby ensuring the playback quality of the final target data while improving business continuity and processing efficiency.
  • Optionally, sending the characteristic data to the terminal device includes: when the network resource meets a first condition, sending the characteristic data to the terminal device, where the first condition includes that the network resource is less than or equal to a first resource, and the first resource is the minimum resource required for transmitting the target data.
  • The first resource is the minimum resource required when transmitting the target data; it can also be described as the critical resource required when transmitting the target data. It should be understood that, taking speech synthesis technology as an example, the critical resource makes the transmission duration of speech of a target duration equal to that target duration. When the network resource is greater than the critical resource, the transmission duration of the target-duration speech is less than the target duration and playback is smooth; when the network resource is less than the critical resource, the transmission duration is longer than the target duration and playback stalls.
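As a concrete illustration of the critical resource (the audio format here is an assumption, not from the application): for 16 kHz, 16-bit mono PCM speech, the critical bandwidth is 16 000 × 16 = 256 kbit/s, and T seconds of speech take exactly T seconds to transmit at that bandwidth.

```python
# Illustration of the "critical resource": the bandwidth at which
# transmitting T seconds of speech takes exactly T seconds.
# Assumed format (not from the application): 16 kHz, 16-bit mono PCM.
SAMPLE_RATE_HZ = 16_000
BITS_PER_SAMPLE = 16
critical_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # 256_000 bit/s

def transmit_seconds(speech_seconds: float, bandwidth_bps: float) -> float:
    """Time needed to transmit speech of the given duration."""
    return speech_seconds * critical_bps / bandwidth_bps

at_critical = transmit_seconds(10, critical_bps)          # 10 s: just keeps up
above_critical = transmit_seconds(10, 2 * critical_bps)   # 5 s: smooth
below_critical = transmit_seconds(10, critical_bps / 2)   # 20 s: playback stalls
```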
  • the network resource may be network bandwidth, and the following description takes network bandwidth as an example.
  • the network resources can be determined in real time by the terminal device, or can be implemented by other network monitoring devices, which is not limited in the embodiments of the present application.
  • In this way, when the network resources are less than or equal to the minimum resource required to transmit the target data, the characteristic data can be sent to the terminal device. Because the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, and the characteristic data can be used to synthesize the target data, the playback quality of the final target data is ensured even when network resources are limited, while business continuity and processing efficiency are improved.
  • Optionally, when the first condition is that the network resource is greater than or equal to a second resource and less than or equal to the first resource, the characteristic data includes first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to a third resource and less than or equal to the second resource, the characteristic data includes second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
  • Optionally, in the former case the characteristic data may also include the second characteristic data; alternatively, the first characteristic data can be selected.
  • Optionally, the data to be processed includes speech data to be processed, the target data includes target speech, the feature data includes acoustic features, and the network resource includes network bandwidth.
  • the data to be processed may include text sequences or phoneme sequences (that is, text sequences or phoneme sequences to be synthesized into speech), etc. This is not limited in the embodiments of the present application.
  • the acoustic features may include features such as mel spectrum or downsampled short-time Fourier transform spectrogram, which are not limited in this embodiment of the present application.
  • Optionally, the first feature data includes mel spectrum features; the second feature data includes a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling an original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram is obtained based on the speech data to be processed.
  • For example, the data amount of the mel spectrum features can be recorded as 1/R of the data amount of the target data, and the data amount of the downsampled short-time Fourier transform spectrogram can be recorded as 1/M of the data amount of the target data. Correspondingly, the second resource is 1/R of the first resource, and the third resource is 1/M of the first resource, where R and M are both positive numbers greater than 1. It should be understood that the values of R and M can be determined according to the actual situation; for details, refer to the related descriptions in Mode 2 and Mode 3 below.
  • Specifically, the process of obtaining the original short-time Fourier transform spectrogram based on the speech data to be processed may include: obtaining mel spectrum features based on the speech data to be processed, synthesizing the target speech based on the mel spectrum features, and finally performing a short-time Fourier transform on the synthesized target speech to obtain the original short-time Fourier transform spectrogram. The above-mentioned mel spectrum features can also be other acoustic features, without limitation. Optionally, the original short-time Fourier transform spectrogram can also be obtained directly from the speech data to be processed, which is not limited in the embodiments of the present application.
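The server-side feature extraction can be sketched as follows. The window length, the downsampling factor M, and the use of scipy are illustrative assumptions; a rectangular window with zero overlap corresponds to the "adjacent sliding windows do not overlap" option, and taking the magnitude corresponds to the "amplitude part only" option.

```python
# Sketch of producing the original STFT spectrogram and its downsampled
# version. A sine wave stands in for the synthesized target speech.
import numpy as np
from scipy.signal import stft

fs = 16_000
t = np.arange(fs) / fs                      # 1 s stand-in for target speech
speech = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# Original STFT: non-overlapping rectangular windows, magnitude only.
_, _, Zxx = stft(speech, fs=fs, window="boxcar", nperseg=64, noverlap=0)
magnitude = np.abs(Zxx).astype(np.float32)

M = 4
downsampled = magnitude[:, ::M]             # keep every M-th frame

audio_bytes = speech.size * 2               # as 16-bit PCM
feature_bytes = downsampled.size * downsampled.itemsize
ratio = feature_bytes / audio_bytes         # roughly 1/M of the audio
```

With these choices the magnitude-only, non-overlapping STFT is about the same size as the raw audio, so downsampling frames by M brings the payload to roughly 1/M of the target data, matching the 1/M bookkeeping above.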
  • Optionally, the second characteristic data also includes residual data, where the residual data is the difference between the original short-time Fourier transform spectrogram and a second short-time Fourier transform spectrogram, and the second short-time Fourier transform spectrogram is obtained by upsampling the first short-time Fourier transform spectrogram. Since downsampling loses information, the server can calculate and send the residual data to the terminal device, so that the terminal device can take this error into account when restoring the original data from the downsampled data, thereby improving the synthesis quality of the target data and ensuring the playback quality of the final target data.
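The server-side residual computation can be sketched in a few lines. Frame repetition (nearest-neighbor upsampling) is used here as an assumed upsampling scheme; the application does not fix a particular method.

```python
# Sketch of the server-side residual computation: residual = original
# spectrogram minus the upsampled version of the downsampled one.
import numpy as np

rng = np.random.default_rng(0)
original = rng.standard_normal((33, 100))      # stand-in STFT spectrogram
M = 4

first = original[:, ::M]                       # downsampled (sent to device)
second = np.repeat(first, M, axis=1)[:, :original.shape[1]]  # upsampled
residual = original - second                   # also sent to the device

# The device can recover the original spectrogram exactly:
recovered = second + residual
```

Whatever the upsampling scheme, as long as the server and the terminal use the same one, adding the residual to the upsampled spectrogram reproduces the original spectrogram exactly.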
  • Optionally, adjacent sliding windows of the original short-time Fourier transform spectrogram do not overlap; and/or, the original short-time Fourier transform spectrogram only includes the amplitude part of the spectrum (that is, it does not include the phase part of the spectrum). Either measure can reduce the data amount of the short-time Fourier transform spectrogram, which is not limited in the embodiments of the present application.
  • Optionally, when the network resource meets a second condition, where the second condition includes that the network resource is greater than the first resource, the method further includes: receiving another data to be processed from the terminal device; obtaining the target data of the another data to be processed according to the another data to be processed; and sending the target data of the another data to be processed to the terminal device.
  • The above-mentioned data to be processed and the another data to be processed may be the same. Optionally, in this case the characteristic data may also be sent to the terminal device; however, since the computing power of the server is stronger, the quality of the target data synthesized directly by the server is usually better.
  • In a second aspect, a data processing method is provided. The method is applied to a terminal device, and the terminal device communicates with a server through a network. The method includes: sending data to be processed to the server; receiving characteristic data of the data to be processed from the server, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed; and generating the target data based on the characteristic data.
  • the method further includes: controlling playing of the target data.
  • In the prior art, the server is usually used to directly synthesize the target data, and the terminal device then receives the target data from the server and plays it. However, this method suffers from poor business continuity and low processing efficiency. Specifically, when network resources are abundant, the target data is transmitted quickly and playback is smooth; when network resources are limited, transmission is slow, playback freezes and fluency is poor, which in turn leads to poor business continuity and low processing efficiency. In the method of this application, the terminal device is mainly used to receive the characteristic data of the data to be processed from the server and to generate the target data based on the characteristic data. Since the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, compared with directly receiving the target data, this can effectively reduce the demand for network resources during data transmission and improve the smoothness of target data playback, thereby ensuring the playback quality of the final target data while improving business continuity and processing efficiency.
  • Optionally, receiving the characteristic data of the data to be processed from the server includes: when the network resource meets a first condition, receiving the characteristic data of the data to be processed from the server, where the first condition includes that the network resource is less than or equal to a first resource, and the first resource is the minimum resource required for transmitting the target data.
  • In this way, when network resources are limited, the characteristic data of the data to be processed can be received from the server. Because the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, the playback quality of the final target data is ensured when network resources are limited, while business continuity and processing efficiency are improved.
  • Optionally, when the first condition is that the network resource is greater than or equal to a second resource and less than or equal to the first resource, the characteristic data includes first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to a third resource and less than or equal to the second resource, the characteristic data includes second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
  • Optionally, the data to be processed includes speech data to be processed, the target data includes target speech, the feature data includes acoustic features, and the network resource includes network bandwidth.
  • Optionally, the first feature data includes mel spectrum features; the second feature data includes a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling an original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram is obtained based on the speech data to be processed.
  • Optionally, when the characteristic data includes the second characteristic data and the second characteristic data includes the first short-time Fourier transform spectrogram, generating the target data based on the characteristic data includes: upsampling the first short-time Fourier transform spectrogram to obtain a second short-time Fourier transform spectrogram; and performing an inverse short-time Fourier transform on the second short-time Fourier transform spectrogram to obtain the target speech.
  • Optionally, the second feature data also includes residual data, where the residual data is the difference between the original short-time Fourier transform spectrogram and the second short-time Fourier transform spectrogram.
  • Optionally, performing the inverse short-time Fourier transform on the second short-time Fourier transform spectrogram to obtain the target speech includes: performing the inverse short-time Fourier transform on the sum of the second short-time Fourier transform spectrogram and the residual data to obtain the target speech. Since downsampling loses information, the server can calculate and send the residual data to the terminal device, so that the terminal device can take this error into account when restoring the original data from the downsampled data, thereby improving the synthesis quality of the target data and ensuring the playback quality of the final target data.
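The full server-to-terminal round trip can be sketched end to end. The parameters (rectangular window, nperseg=64, M=4) and the sine-wave stand-in for the target speech are illustrative assumptions; the complex spectrogram is kept here so that the inverse transform is exact, whereas the amplitude-only option described above would additionally need phase reconstruction (e.g. a Griffin-Lim-style method) on the terminal side.

```python
# End-to-end sketch of the terminal-side reconstruction: upsample the
# received downsampled STFT, add the residual, then inverse-STFT.
import numpy as np
from scipy.signal import stft, istft

fs, nperseg, M = 16_000, 64, 4
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220 * t)            # stand-in for target speech

# --- server side ---
_, _, original = stft(speech, fs=fs, window="boxcar",
                      nperseg=nperseg, noverlap=0)
n_frames = original.shape[1]                    # sent as metadata
first = original[:, ::M]                                   # downsampled
second_srv = np.repeat(first, M, axis=1)[:, :n_frames]     # upsampled
residual = original - second_srv

# --- terminal side (receives `first`, `residual`, `n_frames`) ---
second = np.repeat(first, M, axis=1)[:, :n_frames]
_, target = istft(second + residual, fs=fs, window="boxcar",
                  nperseg=nperseg, noverlap=0)

max_err = np.max(np.abs(target[:speech.size] - speech))
```

Because the residual exactly cancels the upsampling error and the rectangular non-overlapping window satisfies the constant-overlap-add condition, the inverse STFT recovers the original speech to numerical precision.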
  • Optionally, adjacent sliding windows of the original short-time Fourier transform spectrogram do not overlap; and/or, the original short-time Fourier transform spectrogram only includes the amplitude part of the spectrum.
  • Optionally, when the network resource meets a second condition, where the second condition includes that the network resource is greater than the first resource, the method further includes: sending another data to be processed to the server; and receiving target data of the another data to be processed from the server.
  • In this case, the characteristic data of the data to be processed may also be received from the server; however, since the computing power of the server is stronger, the quality of the target data synthesized directly by the server is usually better. Therefore, when network resources are abundant, the target data of the data to be processed can be directly received from the server to improve the quality of the target data.
  • Optionally, when the network resource meets a third condition, where the third condition includes that the network resource is smaller than the third resource, the method further includes: determining yet another data to be processed; and obtaining the target data of the yet another data to be processed according to the yet another data to be processed.
  • The above-mentioned data to be processed, the another data to be processed, and the yet another data to be processed may be the same.
  • In this way, when network resources are insufficient even to transmit the characteristic data, the target data can be synthesized directly on the terminal device to improve business continuity and processing efficiency.
  • In a third aspect, a data processing device is provided, which can communicate with a terminal device through a network. The device may be a server, or a chip, processor or module in the server. The device includes: a transceiver module, configured to receive data to be processed from the terminal device; and a processing module, configured to obtain characteristic data of the data to be processed based on the data to be processed, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed and the characteristic data is used to synthesize the target data. The transceiver module is further configured to send the characteristic data to the terminal device.
  • the transceiver module has the capability of sending and/or receiving data.
  • Optionally, the transceiver module is further configured to send the characteristic data to the terminal device when the network resource meets a first condition, where the first condition includes that the network resource is less than or equal to a first resource, and the first resource is the minimum resource required for transmitting the target data.
  • Optionally, when the first condition is that the network resource is greater than or equal to a second resource and less than or equal to the first resource, the characteristic data includes first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to a third resource and less than or equal to the second resource, the characteristic data includes second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
  • Optionally, the data to be processed includes speech data to be processed, the target data includes target speech, the feature data includes acoustic features, and the network resource includes network bandwidth.
  • Optionally, the first characteristic data includes mel spectrum features; the second characteristic data includes a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling an original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram is obtained based on the speech data to be processed.
  • Optionally, the second characteristic data also includes residual data, where the residual data is the difference between the original short-time Fourier transform spectrogram and a second short-time Fourier transform spectrogram, and the second short-time Fourier transform spectrogram is obtained by upsampling the first short-time Fourier transform spectrogram.
  • Optionally, adjacent sliding windows of the original short-time Fourier transform spectrogram do not overlap; and/or, the original short-time Fourier transform spectrogram only includes the amplitude part of the spectrum.
  • Optionally, when the network resource meets a second condition, where the second condition includes that the network resource is greater than the first resource, the transceiver module is further configured to receive another data to be processed from the terminal device; the processing module is further configured to obtain the target data of the another data to be processed according to the another data to be processed; and the transceiver module is further configured to send the target data of the another data to be processed to the terminal device.
  • In a fourth aspect, a data processing device is provided, which can communicate with a server through a network. The device may be a terminal device, or a chip, processor or module in the terminal device. The device includes: a transceiver module, configured to send data to be processed to the server and to receive characteristic data of the data to be processed from the server, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed; and a processing module, configured to generate the target data based on the characteristic data.
  • the transceiver module has the capability of sending and/or receiving data.
  • Optionally, the transceiver module is further configured to receive the characteristic data of the data to be processed from the server when the network resource meets a first condition, where the first condition includes that the network resource is less than or equal to a first resource, and the first resource is the minimum resource required for transmitting the target data.
  • Optionally, when the first condition is that the network resource is greater than or equal to a second resource and less than or equal to the first resource, the characteristic data includes first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to a third resource and less than or equal to the second resource, the characteristic data includes second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
  • Optionally, the data to be processed includes speech data to be processed, the target data includes target speech, the feature data includes acoustic features, and the network resource includes network bandwidth.
  • Optionally, the first feature data includes mel spectrum features; the second feature data includes a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling an original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram is obtained based on the speech data to be processed.
  • Optionally, when the characteristic data includes the second characteristic data and the second characteristic data includes the first short-time Fourier transform spectrogram, the processing module is further configured to upsample the first short-time Fourier transform spectrogram to obtain a second short-time Fourier transform spectrogram, and to perform an inverse short-time Fourier transform on the second short-time Fourier transform spectrogram to obtain the target speech.
  • Optionally, the second feature data also includes residual data, where the residual data is the difference between the original short-time Fourier transform spectrogram and the second short-time Fourier transform spectrogram.
  • Optionally, the processing module is further configured to perform the inverse short-time Fourier transform on the sum of the second short-time Fourier transform spectrogram and the residual data to obtain the target speech.
  • Optionally, adjacent sliding windows of the original short-time Fourier transform spectrogram do not overlap; and/or, the original short-time Fourier transform spectrogram only includes the amplitude part of the spectrum.
  • Optionally, when the network resource meets a second condition, where the second condition includes that the network resource is greater than the first resource, the transceiver module is further configured to send another data to be processed to the server and to receive target data of the another data to be processed from the server.
  • Optionally, when the network resource meets a third condition, where the third condition includes that the network resource is smaller than the third resource, the processing module is further configured to determine yet another data to be processed and to obtain the target data of the yet another data to be processed according to the yet another data to be processed.
  • In a fifth aspect, a data processing system is provided, including the data processing device in the third aspect or any possible implementation of the third aspect and the data processing device in the fourth aspect or any possible implementation of the fourth aspect.
  • In a sixth aspect, a data processing device is provided, including at least one processor and an interface circuit. The at least one processor is configured to obtain data to be processed through the interface circuit and to perform the data processing method in the first aspect or any possible implementation of the first aspect.
  • In a seventh aspect, a data processing device is provided, including at least one processor and a communication interface. The at least one processor is configured to communicate with the server through the communication interface and to perform the data processing method in the second aspect or any possible implementation of the second aspect.
  • In an eighth aspect, a vehicle is provided, including a sensor and a data processing device. The sensor is configured to obtain user data in the cabin, the user data in the cabin is used to generate data to be processed, and the data processing device is configured to perform the data processing method in the second aspect or any possible implementation of the second aspect.
  • In a ninth aspect, a computer-readable storage medium is provided, including instructions; the instructions are used to implement the data processing method in the first aspect or any possible implementation of the first aspect, and/or to implement the data processing method in the second aspect or any possible implementation of the second aspect.
  • In a tenth aspect, a computer program product is provided, including a computer program that, when run, causes a computer to execute the data processing method in the first aspect or any possible implementation of the first aspect, and/or the data processing method in the second aspect or any possible implementation of the second aspect.
  • In an eleventh aspect, a computing device is provided, including at least one processor and a memory, where the at least one processor is coupled to the memory and configured to read and execute instructions in the memory to perform the data processing method in the first aspect or any possible implementation of the first aspect, and/or the data processing method in the second aspect or any possible implementation of the second aspect.
  • In a twelfth aspect, a chip is provided, including a processor and a data interface. The processor reads instructions stored in a memory through the data interface and executes the data processing method in the first aspect or any possible implementation of the first aspect, and/or the data processing method in the second aspect or any possible implementation of the second aspect.
  • Optionally, the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory. When the instructions are executed, the processor is configured to execute the data processing method in the first aspect or any possible implementation of the first aspect, and/or the data processing method in the second aspect or any possible implementation of the second aspect.
  • In a thirteenth aspect, a chip system is provided, including at least one processor configured to support implementation of the functions involved in the first aspect or certain implementations of the first aspect, for example, receiving or processing the data and/or information involved in the above-mentioned method.
  • the chip system further includes a memory, the memory is used to store program instructions and data, and the memory is located within the processor or outside the processor.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • Figure 1 is an example diagram of a flow of voice interaction provided by an embodiment of the present application.
  • Figure 2 is an example diagram of server streaming speech synthesis and delivery delay provided by an embodiment of the present application.
  • Figure 3 is an example diagram of a data processing method provided by an embodiment of the present application.
  • Figure 4 is an example diagram of a speech synthesis system architecture provided by an embodiment of the present application.
  • Figure 5 is an example diagram of a device-cloud combined streaming speech synthesis process provided by an embodiment of the present application.
  • Figure 6 is an example diagram of another device-cloud combined streaming speech synthesis process provided by an embodiment of the present application.
  • FIG. 7 is an example diagram of another device-cloud combined streaming speech synthesis process provided by an embodiment of the present application.
  • FIG. 8 is an example diagram of a vehicle-mounted UI interface provided by an embodiment of the present application.
  • Figure 9 is an example diagram of another vehicle-mounted UI interface provided by an embodiment of the present application.
  • Figure 10 is a data processing device 1000 provided by an embodiment of the present application.
  • Figure 11 is a data processing device 1100 provided by an embodiment of the present application.
  • Figure 12 is a data processing system 1200 provided by an embodiment of the present application.
  • Figure 13 is an exemplary block diagram of the hardware structure of the data processing device provided by the embodiment of the present application.
  • Speech synthesis technology, also known as text-to-speech (TTS) technology, is an important direction in the field of speech processing, aiming to allow machines to generate natural and beautiful human speech.
  • TTS technology can be applied alone to scenarios such as voice broadcasting (such as consultation broadcasting, order broadcasting, news broadcasting, etc.) and reading and listening to books (such as reading novels, reading stories, etc.); it can also be applied in various voice interaction scenarios, such as human-computer interaction scenarios of electronic devices.
  • electronic devices may include, for example, desktop computers, notebook computers, smartphones, tablets, personal digital assistants (personal digital assistants, PDAs), wearable devices, smart speakers, televisions, drones, vehicles, and vehicle-mounted devices (such as car machine, vehicle-mounted computer, vehicle-mounted chip, etc.) or robot, etc.
  • The following is an exemplary introduction to the application of TTS technology.
  • FIG. 1 is an example diagram of a flow of voice interaction provided by an embodiment of the present application. It should be understood that in this example, TTS technology is mainly embedded in the overall solution of voice interaction as a tail link (ie, the exit of voice interaction). As shown in Figure 1, the voice interaction process 100 includes:
  • Step 1: The smart terminal receives the voice command issued by the user and then sends the voice command to the automatic speech recognition (ASR) module.
  • smart terminals can collect voice commands issued by users through sound sensors such as microphones.
  • the voice command can also be subjected to appropriate pre-processing operations such as noise reduction and echo cancellation to reduce interference caused by noise.
  • Step 2: The ASR module recognizes the received voice instructions and outputs the recognized text sequence to the natural language understanding (NLU) module.
  • Step 3: The NLU module extracts semantic information, such as intent and slot, from the received text sequence and outputs the extracted semantic information to the dialogue management (DM) module.
  • the DM module includes dialogue state tracking (DST) and dialogue policy learning (DPL).
  • Step 4: The DST in the DM module updates the current system status based on the input semantic information, while the DPL decides what action to take next based on the current system status. The DM module then outputs the determined decision action to the natural language generation (NLG) module.
  • Step 5: The NLG module generates a corresponding text sequence based on the received decision action as feedback for human-computer voice interaction, and then outputs the text sequence (i.e., the data to be processed below) to the TTS module.
  • Step 6: The TTS module synthesizes speech according to the text sequence output by the NLG module and sends the synthesized speech to the smart terminal. Finally, the synthesized speech is played to the user through the playback device of the smart terminal, realizing human-computer speech interaction.
  • Non-streaming synthesis refers to synthesizing speech from the incoming text in one pass and returning and playing the synthesized speech in one pass; streaming synthesis means that after the text is passed into the TTS module, the TTS module synthesizes the speech in segments, the segment synthesized first is played first, and subsequent speech is synthesized while earlier speech is broadcast. There is no need to wait for the entire speech to be synthesized before broadcasting, which reduces the waiting time of speech synthesis.
  • The complexity of TTS algorithms is constantly increasing, and TTS algorithms of higher complexity are not suitable for implementation on the terminal side with limited computing resources (for example, the above electronic devices).
  • the traditional solution is to deploy the TTS algorithm on the server, synthesize the speech through the server, and then deliver it to the user through the network.
  • When network resources are limited, users may experience lag when receiving the voice delivered by the server, which seriously affects the quality of the synthesized voice.
  • Taking network bandwidth as an example, the following describes the server streaming speech synthesis and delivery process under different network bandwidths with reference to Figure 2.
  • Figure 2 is an example diagram of server streaming speech synthesis and delivery delay provided by an embodiment of the present application.
  • In Figure 2, a speech stream with a total length of n seconds (s) is segmented and synthesized at intervals of 1 s, that is, it is synthesized as 1 s segments 0s-1s, 1s-2s, 2s-3s, 3s-4s, ..., (n-1)s-ns, and each synthesized segment is delivered and played first, where n is a positive number greater than 0.
  • A voice stream with a length of 1 s means that the playback time of the voice stream is 1 s; in Figure 2, boxes of different lengths represent the delivery time or playback time of a 1 s voice stream under different network bandwidths, where the delivery time refers to the real time required to deliver the voice stream.
  • When the network bandwidth is insufficient, after one 1 s voice stream finishes playing, the terminal side must wait an additional period of time until the next 1 s voice stream has been delivered before playback can continue. It is precisely because of this waiting time that voice playback is intermittent and lags; the smaller the network bandwidth, the longer the waiting time on the terminal side and the worse the user experience. As a result, in scenarios where the network bandwidth is unstable, the voice received by users is sometimes of high quality and sometimes lags, resulting in poor business continuity and low processing efficiency.
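As an illustration of the lag just described (a sketch, not part of the patent; the function name and the concrete critical-bandwidth value are assumptions), the extra wait per 1 s segment can be estimated from the ratio of the critical bandwidth to the actual bandwidth:

```python
# Illustrative sketch: extra wait before each 1 s voice segment can play when
# the actual bandwidth is below the critical bandwidth B (names are assumed).
def stall_per_segment(bandwidth_bps, critical_bps, segment_s=1.0):
    """Return the extra wait (seconds) per segment; 0 when bandwidth >= B."""
    delivery_s = segment_s * critical_bps / bandwidth_bps  # time to deliver one segment
    return max(0.0, delivery_s - segment_s)

# At half the critical bandwidth, each 1 s segment takes 2 s to deliver,
# so playback stalls for about 1 s between segments.
print(stall_per_segment(bandwidth_bps=240_000, critical_bps=480_000))  # 1.0
```

When the bandwidth is at or above the critical value the wait is zero, matching the lag-free case described later.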
  • embodiments of the present application provide a data processing method, which can be executed by a server and/or a terminal device, or by a chip, module, or processor provided in the server and/or terminal device.
  • In this method, the server obtains the characteristic data and sends it to the terminal device, and the terminal device then generates the target data based on the characteristic data. Since the data volume of the characteristic data is smaller than that of the target data of the data to be processed, compared with sending the target data directly, this can effectively reduce the demand for network resources during data transmission and improve the smoothness of target data playback, thereby ensuring the playback quality of the final target data while improving business continuity and processing efficiency.
  • the terminal device may be, for example, any of the above electronic devices;
  • the server may be a server with data processing functions such as a cloud server, a network server, an application server or a management server.
  • the data processing in the embodiments of this application can be applied to streaming data processing.
  • Streaming data processing refers to synthesizing the target data in segments, where the data synthesized first is played first and subsequent data is synthesized while earlier data is played. There is no need to wait until the entire target data is synthesized before broadcasting, which reduces the waiting time of data synthesis.
  • the data processing methods in the embodiments of the present application can be applied to technical fields such as voice processing, image processing, or video processing.
  • The following takes streaming speech synthesis in the technical field of voice processing as an example.
  • Figure 3 is an example diagram of a data processing method provided by an embodiment of the present application. It should be understood that the method 300 can be applied to a system composed of a server and a terminal device, and the terminal device and the server communicate through a network. As shown in FIG. 3 , the method 300 may include S310 to S340. Each step in the method 300 is described in detail below.
  • S310 The terminal device sends data to be processed to the server. Accordingly, the server receives data to be processed from the terminal device.
  • S320 The server obtains the characteristic data of the data to be processed based on the data to be processed.
  • the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, and the characteristic data is used to synthesize the target data.
  • S330 The server sends characteristic data to the terminal device.
  • the terminal device receives the characteristic data of the data to be processed from the server.
  • S340 The terminal device generates target data according to the characteristic data.
  • the method 300 may further include: controlling the playback of the target data.
  • the server is mainly used to obtain characteristic data and send the characteristic data to the terminal device, and the terminal device generates target data based on the characteristic data. Since the data amount of the characteristic data is smaller than the target data of the data to be processed, Therefore, compared with directly sending the target data, it can effectively reduce the demand for network resources during the data transmission process, improve the smoothness of the target data playback, thereby ensuring the playback quality of the final target data, and at the same time improving the business continuity and processing efficiency.
  • The characteristic data in the embodiments of this application refers to intermediate data produced when converting the data to be processed into the target data. It carries the characteristic information of the data to be processed, so that given the data to be processed as input, the data to be processed can first be transformed into this characteristic data, which can then be converted into the final target data.
  • the text sequence can be first transformed into acoustic feature data, and then the acoustic feature data can be transformed into the final target speech.
  • For example, when the conversion is the decoding operation of a vocoder, the feature data can be Mel spectrum features serving as the input of the vocoder; when the conversion is an inverse short-time Fourier transform, the feature data can be a short-time Fourier spectrum; and when the conversion is model inference of a deep neural network, the feature data can be the acoustic features output by a hidden layer of the deep neural network.
  • the above step S330 may include: when the network resource meets the first condition, the server sends characteristic data to the terminal device.
  • the terminal device receives the characteristic data from the server.
  • the first condition includes that the network resource is less than or equal to the first resource, and the first resource is the minimum resource required for transmitting the target data.
  • The first resource is the minimum resource required when transmitting the target data; it can also be described as the critical resource required when transmitting the target data. It should be understood that, taking speech synthesis technology as an example, at the critical resource the transmission duration of speech of a target duration equals the target duration; when the network resources are greater than the critical resource, the transmission duration of speech of the target duration is less than the target duration and speech playback is smooth; when the network resources are less than the critical resource, the transmission duration of speech of the target duration is longer than the target duration and speech playback lags. It should also be understood that in actual operation, the above judgment conditions (that is, network resources greater than or less than the critical resource) may also include the case of being equal to the critical resource, which is not limited.
  • the network resource may be, for example, network bandwidth.
  • network bandwidth is used as an example in the following embodiments.
  • The situation for other network resources (or network conditions) is similar, for example indicators reflecting network effectiveness, such as time-domain, frequency-domain, or time-frequency resource sizes, and indicators reflecting network reliability, such as network channel quality. The evaluation of network channel quality can include bandwidth, delay, signal-to-noise ratio, bit error rate, jitter, and so on.
  • the critical bandwidth can be recorded as B.
  • The B value can be calculated by the following formula (1):
  • B = F_s × bitwidth × cost  (1)
  • where F_s is the sampling rate of the target data, whose value can for example be 24000; bitwidth is the bit width of each target data sampling point, whose value can for example be 16 bit; and cost is the overhead in the signal transmission process, whose value is for example 1.25 when 8/10 bit encoding is used.
  • It should be understood that the value of the critical bandwidth depends on the actual values of the quantities in the formula, and the above critical bandwidth is only an example.
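Formula (1) can be sketched with the example values given above (24 kHz sampling rate, 16-bit samples, an overhead factor of 1.25); the variable names mirror the formula:

```python
# Sketch of formula (1): critical bandwidth B = F_s * bitwidth * cost, in bit/s.
F_s = 24_000    # sampling rate of the target data (Hz)
bitwidth = 16   # bit width of each sampling point
cost = 1.25     # transmission overhead, e.g. 1.25 for 8/10 bit encoding

B = F_s * bitwidth * cost
print(B)  # 480000.0, i.e. a critical bandwidth of 480 kbit/s
```

With these example values, lag-free delivery of the raw speech waveform thus needs roughly 480 kbit/s.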
  • the network resources can be determined in real time by the terminal device or the network device, or by other network monitoring devices, which is not limited in the embodiments of the present application.
  • In this way, when the network resources are less than or equal to the minimum resources required for transmitting the target data, the server can send the characteristic data to the terminal device, and the terminal device generates the target data based on the characteristic data. Since the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, the playback quality of the final target data can be guaranteed when network resources are limited, while business continuity and processing efficiency are improved.
  • Optionally, when the first condition is that the network resource is greater than or equal to the second resource and less than or equal to the first resource, the characteristic data may include the first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to the third resource, the characteristic data may include the second characteristic data; where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
  • the characteristic data may also include the second characteristic data.
  • It should be understood that the target data synthesized from the first feature data is of better quality than the target data synthesized from the second feature data. Therefore, when the network resource is greater than or equal to the second resource and less than or equal to the first resource, selecting the first characteristic data can improve voice quality, while selecting the second characteristic data can reduce network resource consumption.
  • the data to be processed may include speech data to be processed
  • the target data may include target speech
  • the feature data may include acoustic features
  • the network resources may include network bandwidth.
  • the speech data to be processed may include text sequences or phoneme sequences, which are not limited in the embodiments of the present application.
  • the acoustic features may include features such as mel spectrum or downsampled short-time Fourier transform spectrogram, which are not limited in this embodiment of the present application.
  • the first feature data may include Mel spectrum features; the second feature data may include a first short-time Fourier transform spectrogram.
  • the first short-time Fourier transform spectrogram is a spectrum obtained by downsampling the original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram can be obtained based on the speech data to be processed.
  • The specific process of obtaining the original short-time Fourier transform (STFT) spectrogram based on the speech data to be processed may include: obtaining Mel spectrum features based on the speech data to be processed, synthesizing the target speech based on the Mel spectrum features, and finally performing a short-time Fourier transform on the synthesized target speech to obtain the original short-time Fourier transform spectrogram.
  • the above-mentioned Mel spectrum characteristics can also be other acoustic characteristics without limitation.
  • the original short-time Fourier transform spectrogram can also be obtained directly from the speech data to be processed, which is not limited in the embodiment of the present application.
  • In the embodiments of this application, the data amount of the Mel spectrum features can be recorded as 1/R of the data amount of the target data, and the data amount of the downsampled short-time Fourier transform spectrogram can be recorded as 1/M of the data amount of the target data.
  • Correspondingly, the second resource is 1/R of the first resource, and the third resource is 1/M of the first resource, where R and M are both positive numbers greater than 1. It should be understood that the values of R and M can be determined according to the actual situation; for details, refer to the relevant descriptions in modes b to d below.
  • Optionally, the above step S340 may include: the terminal device upsamples the first short-time Fourier transform spectrogram to obtain a second short-time Fourier transform spectrogram, and then performs an inverse short-time Fourier transform (ISTFT) on the second short-time Fourier transform spectrogram to obtain the target speech.
  • the second feature data may also include residual data, which is the difference between the original short-time Fourier transform spectrum and the second short-time Fourier transform spectrum, where the second short-time Fourier transform spectrum is The Fourier transform spectrum is a spectrum obtained by upsampling the first short-time Fourier transform spectrum.
  • performing ISTFT on the second short-time Fourier transform spectrogram to obtain the target speech may include: performing ISTFT on the sum of the second short-time Fourier transform spectrogram and residual data to obtain the target speech.
  • In this way, the server can be used to calculate the residual data and send it to the terminal device, so that the terminal device can take the error into account when restoring the original data from the downsampled data, thereby improving the synthesis quality of the target data and ensuring the playback quality of the final target data.
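A minimal numpy sketch of this terminal-side step follows (the helper names and shapes are assumptions for illustration; non-overlapping rectangular windows are used so that the ISTFT reduces to an inverse rFFT per frame, consistent with the non-overlapping windows described below):

```python
import numpy as np

# Sketch: terminal-side reconstruction from an upsampled STFT spectrogram plus
# residual data. With non-overlapping rectangular windows, the ISTFT is an
# inverse rFFT per frame followed by concatenation.
def istft_nonoverlap(spec):
    """spec: complex array [n_frames, n_fft // 2 + 1] -> 1-D waveform."""
    return np.concatenate([np.fft.irfft(frame) for frame in spec])

def synthesize(upsampled_spec, residual):
    # Residual compensation: add the server-provided residual, then ISTFT.
    return istft_nonoverlap(upsampled_spec + residual)

# Round-trip check: STFT of a toy signal with a zero residual returns the signal.
n_fft, n_frames = 8, 4
x = np.random.default_rng(0).standard_normal(n_fft * n_frames)
spec = np.array([np.fft.rfft(x[i * n_fft:(i + 1) * n_fft]) for i in range(n_frames)])
y = synthesize(spec, residual=np.zeros_like(spec))
print(np.allclose(x, y))  # True
```

In the method of the patent the residual would be the server-computed difference rather than zero; the zero residual here only verifies the transform round trip.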
  • adjacent sliding windows of the original short-time Fourier transform spectrum may not overlap, thereby reducing the data size of the short-time Fourier transform spectrum.
  • the above-mentioned original short-time Fourier transform spectrum may only include the amplitude part of the spectrum, thereby reducing the data size of the short-time Fourier transform spectrum.
  • the original short-time Fourier transform spectrum only includes the amplitude part of the spectrum, which can also be described as the original short-time Fourier transform spectrum does not include the phase part of the spectrum.
  • Optionally, the method 300 may also include: the terminal device sends the data to be processed to the server, and accordingly, the server receives the data to be processed from the terminal device;
  • the server obtains the target data according to the data to be processed; the server sends the target data to the terminal device, and accordingly, the terminal device receives the target data from the server.
  • the characteristic data may also be sent to the terminal device.
  • the computing power of the server is stronger, the quality of the target data synthesized directly by the server is usually better, and the voice playback quality is better in this case.
  • In this way, when the network resource is larger than the first resource, the server can be chosen to synthesize the target data and send it to the terminal device, improving the quality of the target data.
  • Optionally, the method 300 may also include: the terminal device determines the data to be processed and obtains the target data based on the data to be processed. For details, refer to the relevant description in method e below.
  • target data when network resources are insufficient to transmit characteristic data, target data can be synthesized directly on the terminal device to improve business continuity and processing efficiency.
  • FIG 4 is an example diagram of a speech synthesis system architecture provided by an embodiment of the present application.
  • a speech synthesis system is deployed on the cloud side and the terminal side respectively.
  • The speech synthesis system deployed on the terminal side includes modules such as an acoustic model, a vocoder, ISTFT, a restoration algorithm, and residual compensation; considering the limited computing power and storage space on the terminal side, the acoustic model and vocoder include lightweight algorithms. The speech synthesis system deployed on the cloud side includes modules such as an acoustic model, a vocoder, STFT, downsampling, a restoration algorithm, and residual calculation; here, the acoustic model and vocoder can include heavyweight algorithms to improve the quality of speech synthesis. It should be understood that the system architecture 400 can implement different speech synthesis methods under different network bandwidth conditions, which are introduced below with reference to Figure 4.
  • Network bandwidth situation 1: network bandwidth > B, that is, the network bandwidth is sufficient.
  • B represents the critical bandwidth. It should be understood that under the critical bandwidth, the time taken to deliver speech of the target duration is equal to the target duration.
  • speech synthesis can be completed directly on the cloud server and then delivered to the terminal device.
  • Specifically, the acoustic model deployed in the cloud server first generates acoustic features based on the data to be processed; the vocoder then synthesizes the speech waveform (i.e., the synthesized speech) based on the acoustic features, and the synthesized speech is sent to the terminal device through the network.
  • the delivery time of the speech of the target duration will be less than the target duration.
  • In this way, speech synthesis is performed directly on the cloud side and is delivered to the terminal device in real time for playback, so that lag-free voice playback can be achieved; and because the speech synthesis is completed on the cloud side, the sound quality is also relatively good.
  • Network bandwidth situation 2: B/2 ≤ network bandwidth ≤ B.
  • In this case, as shown by method b in Figure 4 and the device-cloud combined streaming speech synthesis process shown in Figure 5, speech synthesis can be completed in a way that combines the cloud and the terminal.
  • the acoustic model deployed in the cloud server can first generate acoustic features based on the data to be processed.
  • the acoustic feature may be a Mel spectrum feature.
  • In method b, the cloud server directly delivers the Mel spectrum to the terminal device, and the vocoder deployed in the terminal device generates a voice stream in real time based on the received Mel spectrum. After the voice stream is synthesized, it is sent to the playback device for playback, completing the terminal device's reply to the user.
  • In method b, the size of the Mel spectrum obtained by the acoustic model is: (T / t_shift) × Dim × 4; the corresponding size of the voice stream of the same duration is: T × F_S × 2. The ratio of the voice stream data of the same duration to the corresponding Mel spectrum data is therefore given by formula (2):
  • R = (T × F_S × 2) / ((T / t_shift) × Dim × 4) = (F_S × t_shift) / (2 × Dim)  (2)
  • where T is the duration of the speech stream, F_S is the time-domain sampling rate of the synthesized speech (with 2 bytes per sampling point), t_shift is the frame shift used when calculating the Mel spectrum, Dim is the Mel spectrum dimension (with 4 bytes per value), and R is the ratio between the speech stream data of the same duration and the corresponding Mel spectrum data.
  • The data volume of a voice stream is usually greater than the data volume of the corresponding Mel spectrum of the same duration, so the data volume of the Mel spectrum can be recorded as 1/R of the data volume of the voice stream of the same duration. This means that if the Mel spectrum data is delivered instead of the voice stream data, the amount of data transmitted is reduced to 1/R of the original, and the dependence on bandwidth is likewise reduced to 1/R of the original; thus, when the network bandwidth is ≥ B/R, lag-free voice playback can be achieved.
  • The actual size of R depends on the values of F_S, t_shift and Dim. For example, when F_S = 24 kHz, Dim = 80 and t_shift = 12.5 ms, R ≈ 1.9, that is, the data amount of a speech stream of a given duration is 1.9 times the data amount of its corresponding Mel spectrum. In this case, the dependence on bandwidth can be approximately reduced to 1/1.9 of the original, that is, when the network bandwidth is ≥ B/1.9, lag-free voice playback can be achieved.
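The example value R ≈ 1.9 can be checked with a short computation (a sketch; it assumes 16-bit speech samples, i.e. 2 bytes each, and 32-bit Mel values, i.e. 4 bytes each):

```python
# Sketch: ratio R of voice-stream bytes to Mel-spectrum bytes per second of
# speech, assuming 2-byte speech samples and 4-byte Mel values.
F_s = 24_000      # time-domain sampling rate (Hz)
t_shift = 0.0125  # frame shift when computing the Mel spectrum (12.5 ms)
Dim = 80          # Mel spectrum dimension

speech_bytes_per_s = F_s * 2
mel_bytes_per_s = (1 / t_shift) * Dim * 4
R = speech_bytes_per_s / mel_bytes_per_s
print(R)  # 1.875, i.e. roughly the 1.9 quoted in the text
```

So delivering the Mel spectrum instead of the waveform cuts the bandwidth requirement to slightly more than half under these parameters.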
  • Network bandwidth situation 3: B/4 ≤ network bandwidth < B/2.
  • In this case, speech synthesis can be completed in a way that combines the cloud and the terminal.
  • the acoustic model deployed in the cloud server can first generate acoustic features based on the data to be processed.
  • the acoustic signature may be a Mel spectrum.
  • the vocoder deployed on the cloud server synthesizes a speech waveform based on the acoustic characteristics, and the STFT module performs STFT on the speech waveform to obtain the original STFT spectrogram.
  • the downsampling module downsamples the obtained original STFT spectrum, and sends the downsampled STFT spectrum to the terminal device.
  • The restoration algorithm module deployed in the terminal device restores (i.e., upsamples) the received downsampled STFT spectrogram in real time, and the ISTFT module performs ISTFT on the restored STFT spectrogram to generate a voice stream. After the voice stream is synthesized, it is sent to the playback device for playback, completing the terminal device's reply to the user.
  • It should be noted that because the cloud server downsamples the original STFT spectrogram, when the terminal device upsamples the received spectrogram to restore the STFT spectrogram, there are errors (i.e., residual data) between the restored STFT spectrogram and the original STFT spectrogram, resulting in lower synthesis quality of the target speech.
  • the embodiment of this application also proposes another implementation manner.
  • speech synthesis can also be completed by combining the cloud and the terminal.
  • In method d, after obtaining the downsampled STFT spectrogram, the cloud server also upsamples it to restore the STFT spectrogram and determines the residual data based on the difference between the original STFT spectrogram and the restored STFT spectrogram; it then sends the residual data to the terminal device together with the downsampled STFT spectrogram. When the terminal device receives the downsampled STFT spectrogram, it first restores it, then performs residual compensation on the restored STFT spectrogram, and the ISTFT module performs ISTFT on the residual-compensated STFT spectrogram to generate a voice stream. After the voice stream is synthesized, it is sent to the playback device for playback, completing the terminal device's reply to the user.
  • the adjacent sliding windows of the above-mentioned original STFT spectrum may not overlap, and may only include the amplitude part of the spectrum.
  • In this case, the data size of the original STFT spectrogram can be expressed according to the following formula (3):
  • S_STFT = (T × F_S / N_hop) × (N_fft / 2 + 1) × 4  (3)
  • where T is the duration of the speech stream, F_S is the time-domain sampling rate of the synthesized speech, N_hop is the number of frame-shift points (when the sliding windows do not overlap, N_hop and N_fft are equal), N_fft represents the number of points used when performing the STFT, and each amplitude value occupies 4 bytes.
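Under these assumptions (amplitude-only spectrogram, non-overlapping windows, and an assumed 4 bytes per value), formula (3) can be sketched and compared against the 16-bit waveform size T × F_S × 2:

```python
# Sketch of formula (3): bytes in an amplitude-only STFT spectrogram with
# non-overlapping windows (N_hop == N_fft), assuming 4 bytes per value.
def stft_size_bytes(T, F_s, N_fft, bytes_per_value=4):
    n_frames = T * F_s / N_fft          # non-overlapping windows: N_hop = N_fft
    n_bins = N_fft // 2 + 1             # amplitude bins per frame
    return n_frames * n_bins * bytes_per_value

# For large N_fft this is close to the 16-bit waveform size T * F_s * 2,
# so sending a 1/M-downsampled spectrogram cuts the data to about 1/M.
T, F_s, N_fft = 1.0, 24_000, 1024
ratio = stft_size_bytes(T, F_s, N_fft) / (T * F_s * 2)
print(ratio)  # about 1.002
```

This is consistent with the later statement that the downsampled STFT spectrogram carries roughly 1/M of the voice-stream data volume.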
  • the network bandwidth occupied by the residual data during transmission is small, for example, it may be about B/10, so its impact will be ignored in the embodiment of this application.
  • Optionally, the residual data can be calculated through the following formula (4):
  • Res = imgstft − fun_recover(fun_downsample(imgstft))  (4)
  • where Res represents the residual data, imgstft represents the original STFT spectrogram, fun_downsample(imgstft) represents the STFT spectrogram obtained by downsampling the original STFT spectrogram, and fun_recover(fun_downsample(imgstft)) represents the spectrogram obtained by restoring the downsampled STFT spectrogram.
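Formula (4) can be sketched in numpy; the concrete downsampling and recovery operations below (keep every M-th frame, nearest-neighbour upsampling) are assumptions chosen only to illustrate the residual relation:

```python
import numpy as np

# Sketch of formula (4): Res = imgstft - fun_recover(fun_downsample(imgstft)).
# The downsample/recover pair below is an assumed, simple instantiation.
def fun_downsample(spec, M=4):
    return spec[::M]                                   # keep one frame in every M

def fun_recover(spec_ds, M=4, n_frames=None):
    return np.repeat(spec_ds, M, axis=0)[:n_frames]    # nearest-neighbour upsample

imgstft = np.random.default_rng(1).random((16, 9))     # toy amplitude spectrogram
res = imgstft - fun_recover(fun_downsample(imgstft), n_frames=len(imgstft))

# With the residual, the terminal side can restore the original exactly.
restored = fun_recover(fun_downsample(imgstft), n_frames=len(imgstft)) + res
print(np.allclose(restored, imgstft))  # True
```

This shows why sending the small residual alongside the downsampled spectrogram lets the terminal compensate the upsampling error.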
  • In the embodiments of this application, the data amount of the downsampled STFT spectrogram can be recorded as 1/M of the data amount of the original STFT spectrogram. If the downsampled STFT spectrogram is sent instead of the voice stream data, the amount of data transmitted is reduced to 1/M of the original, and the dependence on bandwidth can likewise be reduced to 1/M of the original; thus, when the network bandwidth is ≥ B/M, lag-free voice playback can be achieved.
  • M represents the degree of downsampling of the STFT spectrogram: the larger the value of M, the greater the degree of downsampling and the smaller the data amount of the downsampled STFT spectrogram. For example, the value of M may be 4.
  • Method b can be used when B/2 ≤ network bandwidth ≤ B; when B/4 ≤ network bandwidth < B/2, method c or d can be adopted to further improve the speech synthesis quality.
  • Network bandwidth situation 4 Network bandwidth ⁇ B/4.
  • speech synthesis and playback can be completed directly on the terminal device.
  • the acoustic model deployed on the terminal device can first generate acoustic features based on the data to be processed, then the vocoder synthesizes speech based on the acoustic features, and the speech is played in real time by the playback device.
  • when the network bandwidth is < B/4, the limited network bandwidth will cause lagging during speech playback. Therefore, in this case, the data to be processed does not need to be uploaded to the cloud, and the speech synthesis process is performed entirely on the terminal side.
  • since the acoustic model and vocoder deployed on the terminal side are usually lightweight TTS algorithms, the quality of speech synthesized on the terminal side is not as good as that of speech synthesized on the cloud side.
  • the terminal device can directly determine the speech synthesis method according to the network conditions of the environment. For example, when the network bandwidth is sufficient, the terminal device can decide to synthesize speech in the cloud; when the network bandwidth is limited but can support the transmission of the above acoustic characteristic data, the terminal device can decide to synthesize speech using a combination of terminal and cloud; when the network bandwidth is severely insufficient or the network is disconnected, the terminal device can decide to synthesize speech directly on the terminal device.
  • the user can also choose the speech synthesis method independently according to the actual situation. For example, when users have high requirements for voice broadcast speed, they can choose to synthesize speech on the terminal device to reduce the start-up delay; when users have high requirements for sound quality and the network environment is sufficient, they can choose cloud synthesis; when users have high requirements for sound quality but the network environment is not very sufficient, the device-cloud combination method can be used.
  • the terminal device can also first determine the speech synthesis method based on the network conditions of the environment and then recommend it to the user, allowing the user to choose whether to synthesize speech according to the recommendation of the terminal device.
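The bandwidth-driven selection described above can be condensed into a small decision function. This is a sketch under stated assumptions: the function and mode names are ours, and we assume (as the surrounding text suggests) that "method b" corresponds to sending Mel-spectrum features and methods c/d to sending the downsampled STFT spectrum:

```python
def choose_tts_mode(bandwidth, B):
    """Pick a speech synthesis mode from the measured network bandwidth.

    B is the critical bandwidth: the minimum needed to stream the
    synthesized speech itself. Thresholds follow the document's four
    network-bandwidth situations.
    """
    if bandwidth > B:
        return "cloud"             # situation 1: server synthesizes and streams speech
    if bandwidth > B / 2:
        return "mel"               # situation 2: server sends Mel-spectrum features
    if bandwidth > B / 4:
        return "stft_downsampled"  # situation 3: downsampled STFT (+ residual)
    return "on_device"             # situation 4: terminal synthesizes locally

print(choose_tts_mode(bandwidth=300, B=1000))  # stft_downsampled
```

In practice the terminal could run such a function on each bandwidth measurement, then either apply the result directly or surface it as a recommendation in the UI, as Figures 8 and 9 illustrate.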
  • FIG. 8 is an example diagram of a vehicle-mounted UI interface provided by an embodiment of the present application.
  • the user can issue an instruction to the vehicle, for example, "Interaction Assistant, I want to choose the broadcast mode!", and the vehicle receives this instruction.
  • the vehicle can display different voice broadcast modes such as fast listening, smooth listening, and vivid listening on the central control display for users to choose.
  • fast listening refers to completing speech synthesis on the terminal side to achieve fast voice broadcast
  • smooth listening refers to completing speech synthesis through device-cloud integration to achieve high-quality and smooth voice broadcast
  • vivid listening refers to completing speech synthesis entirely in the cloud to achieve voice broadcast with better sound quality. This allows users to independently select the voice broadcast mode according to actual needs and the network environment. For example, when users have high requirements for voice broadcast speed, they can select "Quick Listen" on the central control display; when users have high requirements for sound quality and the network environment is sufficient, they can select "Vivid Listen"; when the sound quality requirements are high but the network environment is not very sufficient, they can choose "Smooth Listen".
  • FIG 9 is an example diagram of another vehicle-mounted UI interface provided by an embodiment of the present application.
  • if the user wants the vehicle to recommend a broadcast mode, the user can issue an instruction to the vehicle, for example, "Interactive assistant, please recommend a broadcast mode!". After receiving this instruction, the vehicle can recommend an appropriate voice broadcast mode to the user based on the network conditions of the environment. For example, when the vehicle detects that the network is sufficient, it can recommend Vivid Listening to the user. If the user agrees with the recommendation, he or she can select "Yes"; if not, he or she can select "No" and choose a voice broadcast mode according to his or her own needs.
  • FIG. 10 is a data processing device 1000 provided by an embodiment of the present application.
  • the device 1000 can communicate with a terminal device through a network.
  • the device 1000 may be a server, or may be a chip, processor or module in the server, without limitation.
  • the device 1000 includes: a transceiver module 1010 and a processing module 1020. It should be understood that the transceiver module 1010 has the capability of transmitting and/or receiving data, and the transceiver module 1010 may be an interface circuit in specific implementation.
  • the transceiver module 1010 is used to receive the data to be processed from the terminal device; the processing module 1020 is used to obtain the characteristic data of the data to be processed according to the data to be processed, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, and the characteristic data is used to synthesize the target data; the transceiver module 1010 is also used to send the characteristic data to the terminal device.
  • the transceiver module 1010 may also be configured to send characteristic data to the terminal device when the network resource meets a first condition, where the first condition includes that the network resource is less than or equal to the first resource, and the first resource is the transmission target data. the minimum resources required.
  • when the first condition is that the network resource is greater than or equal to the second resource and less than or equal to the first resource, the characteristic data may include the first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to the third resource and less than or equal to the second resource, the characteristic data may include the second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
  • the data to be processed may include voice data to be processed
  • the target data may include target voice
  • the feature data may include acoustic features
  • the network resources may include network bandwidth.
  • the first feature data may include Mel spectrum features; the second feature data may include a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling the original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram can be obtained based on the speech data to be processed.
  • the second feature data may also include residual data.
  • the residual data is the difference between the original short-time Fourier transform spectrum and the second short-time Fourier transform spectrum, where the second short-time Fourier transform spectrum is obtained by upsampling the first short-time Fourier transform spectrum.
  • adjacent sliding windows of the original short-time Fourier transform spectrum may not overlap; and/or the original short-time Fourier transform spectrum may only include the amplitude part of the spectrum.
  • the transceiver module 1010 can also be used to receive another piece of data to be processed from the terminal device; the processing module 1020 can also be used to obtain the target data of the other data to be processed according to it; the transceiver module 1010 can also be used to send the target data of the other data to be processed to the terminal device.
  • the above-mentioned data to be processed and another data to be processed may be the same.
  • FIG. 11 is a data processing device 1100 provided by an embodiment of the present application.
  • the device 1100 can communicate with a server through a network.
  • the device 1100 may be a terminal device, or may be a chip, processor or module in the terminal device, without limitation.
  • the device 1100 includes: a transceiver module 1110 and a processing module 1120. It should be understood that the transceiver module 1110 has the capability of transmitting and/or receiving data, and the transceiver module 1110 may be an interface circuit in specific implementation.
  • the transceiver module 1110 is used to send the data to be processed to the server and to receive the characteristic data of the data to be processed from the server, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed; the processing module 1120 is used to generate the target data according to the characteristic data.
  • the transceiver module 1110 may also be configured to receive the characteristic data of the data to be processed from the server when the network resource meets a first condition, where the first condition includes that the network resource is less than or equal to the first resource, and the first resource is The minimum resources required to transmit the target data.
  • when the first condition is that the network resource is greater than or equal to the second resource and less than or equal to the first resource, the characteristic data may include the first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to the third resource and less than or equal to the second resource, the characteristic data may include the second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
  • the data to be processed may include voice data to be processed
  • the target data may include target voice
  • the feature data may include acoustic features
  • the network resources may include network bandwidth.
  • the first feature data may include Mel spectrum features; the second feature data may include a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling the original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram can be obtained based on the speech data to be processed.
  • the processing module 1120 may also be configured to upsample the first short-time Fourier transform spectrogram to obtain the second short-time Fourier transform spectrogram, and to perform an ISTFT on the second short-time Fourier transform spectrogram to obtain the target speech.
  • the second feature data may also include residual data, and the residual data is the difference between the original short-time Fourier transform spectrum and the second short-time Fourier transform spectrum.
  • the processing module 1120 may also be configured to perform ISTFT on the sum of the second short-time Fourier transform spectrogram and the residual data to obtain the target speech.
  • adjacent sliding windows of the original short-time Fourier transform spectrum may not overlap; and/or the original short-time Fourier transform spectrum may only include the amplitude part of the spectrum.
  • the transceiver module 1110 can also be used to send another piece of data to be processed to the server, and to receive the target data of the other data to be processed from the server.
  • the processing module 1120 can also be used to determine yet another piece of data to be processed, and to obtain the target data of that data to be processed according to it.
  • the above-mentioned data to be processed, another data to be processed, and yet another data to be processed may be the same.
  • Figure 12 is a data processing system 1200 provided by an embodiment of the present application.
  • the system 1200 includes the device 1000 and the device 1100. The device 1000 is applied to a server and can be used to perform the related operations corresponding to the server in the method embodiments of the present application; the device 1100 is applied to a terminal device and can be used to perform the related operations corresponding to the terminal device in the method embodiments of the present application.
  • Figure 13 is an exemplary block diagram of the hardware structure of the data processing device provided by the embodiment of the present application.
  • the device 1300 may specifically be a computer device.
  • the device 1300 includes a memory 1310, a processor 1320, a communication interface 1330, and a bus 1340.
  • the memory 1310, the processor 1320, and the communication interface 1330 implement communication connections between each other through the bus 1340.
  • the memory 1310 may be a read-only memory (ROM), a static storage device, a dynamic storage device or a random access memory (RAM).
  • the memory 1310 can store programs. When the program stored in the memory 1310 is executed by the processor 1320, the processor 1320 and the communication interface 1330 are used to perform related operations corresponding to the data processing device 1000 in the embodiment of the present application; and/or, use To perform related operations corresponding to the data processing device 1100 in the embodiment of the present application.
  • the processor 1320 may be a general central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, configured to execute relevant programs to realize the functions required to be executed by the processing module 1020 in the data processing device 1000 in the embodiments of the present application, or to realize the functions required to be executed by the processing module 1120 in the data processing device 1100 in the embodiments of the present application.
  • the processor 1320 may also be an integrated circuit chip with signal processing capabilities.
  • the related operations corresponding to the server and/or the terminal device in the method embodiments of the present application can be completed by integrated logic circuits of hardware in the processor 1320 or by instructions in the form of software.
  • the above-mentioned processor 1320 can also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • Each method, step and logical block diagram disclosed in the embodiment of this application can be implemented or executed.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory 1310.
  • the processor 1320 reads the information in the memory 1310 and, in combination with its hardware, completes the functions required to be performed by the modules included in the data processing device of the embodiments of the present application, or performs the related operations corresponding to the server in the method embodiments of the present application, and/or the related operations corresponding to the terminal device in the method embodiments of the present application. For example, the processor 1320 may perform the above steps S320 and S340.
  • the communication interface 1330 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 1300 and other devices or communication networks.
  • the communication interface 1330 can be used to implement the functions required by the transceiver module 1010 in the data processing device 1000 shown in Figure 10; or, the communication interface 1330 can be used to implement the functions required by the transceiver module 1110 in the data processing device 1100 shown in Figure 11.
  • the communication interface 1330 may perform the above steps S310 and S330.
  • Bus 1340 may include a path that carries information between various components of device 1300 (eg, memory 1310, processor 1320, communication interface 1330).
  • Embodiments of the present application also provide a vehicle, including a sensor and a data processing device.
  • the sensor is used to obtain user data in the cabin
  • the user data in the cabin is used to generate data to be processed
  • the data processing device is used to execute the method embodiments of the present application.
  • Embodiments of the present application also provide a computer-readable storage medium, which includes instructions; the instructions are used to implement the related operations corresponding to the server in the method embodiments of the present application, and/or to implement the related operations corresponding to the terminal device in the method embodiments of the present application.
  • Embodiments of the present application also provide a computer program product, which includes a computer program that, when run, causes a computer to perform the related operations corresponding to the server in the method embodiments of the present application, and/or to perform the related operations corresponding to the terminal device in the method embodiments of the present application.
  • Embodiments of the present application also provide a computing device, including at least one processor and a memory, where the at least one processor is coupled to the memory and is used to read and execute instructions in the memory to perform the related operations corresponding to the server and/or the terminal device in the method embodiments of the present application.
  • Embodiments of the present application also provide a chip.
  • the chip includes a processor and a data interface.
  • the processor reads instructions stored in the memory through the data interface and performs the related operations corresponding to the server in the method embodiments of the present application, and/or performs the related operations corresponding to the terminal device in the method embodiments of the present application.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, or the part that contributes to the traditional solution, or a part of the technical solution, can essentially be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
  • the aforementioned storage media include media that can store program codes, such as a USB flash drive, a mobile hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
  • "multiple" refers to two or more.
  • "and/or" describes an association relationship between associated objects, indicating that three relationships can exist. For example, A and/or B can mean: A exists alone, B exists alone, or both A and B exist.


Abstract

Embodiments of the present application provide a data processing method and device, which can be applied to fields such as autonomous driving and artificial intelligence. The method includes: a terminal device sends data to be processed to a server; the server obtains characteristic data of the data to be processed according to the data to be processed, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, and the characteristic data is used to synthesize the target data; the server sends the characteristic data to the terminal device; and the terminal device generates the target data according to the characteristic data. The solution of the present application can improve service continuity and processing efficiency.

Description

Data processing method and device. Technical field
The present application relates to the field of artificial intelligence, and more specifically, to a data processing method and device.
Background
Speech synthesis technology, also known as text-to-speech (TTS) technology, is an important direction in the field of speech processing technology, aiming to enable machines to generate natural, pleasant human speech. With the continuous progress of speech synthesis technology, the complexity of TTS algorithms is also constantly increasing, and TTS algorithms of higher complexity are not suitable for implementation on the terminal side, where computing resources are limited.
To solve the above problem, the traditional solution deploys the TTS algorithm on a server (for example, a cloud server), completes speech synthesis through the server, and then delivers the speech to the user via the network. However, this approach suffers from poor service continuity and low processing efficiency.
Summary of the invention
Embodiments of the present application provide a data processing method and device, which can improve service continuity and processing efficiency.
In a first aspect, a data processing method is provided. The method is applied to a server that communicates with a terminal device through a network. The method includes: receiving data to be processed from the terminal device; obtaining characteristic data of the data to be processed according to the data to be processed, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, and the characteristic data is used to synthesize the target data; and sending the characteristic data to the terminal device.
Optionally, the terminal device may be an intelligent terminal such as a mobile phone, a personal computer, a vehicle, or an information processing center; the server may be a server with data processing functions, such as a cloud server, a network server, an application server, or a management server.
Optionally, the data processing in the embodiments of the present application may refer to streaming data processing. It should be understood that streaming data processing means synthesizing the target data in segments: the segments synthesized first are played first, and subsequent segments are synthesized while earlier segments are being played, without waiting for the entire target data to be synthesized before playback. This reduces the waiting time of data synthesis.
Optionally, the data processing method in the embodiments of the present application can be applied to technical fields such as speech processing, image processing, or video processing. For ease of description, the following embodiments take streaming speech synthesis in the field of speech processing as an example.
It should be understood that after the server sends the characteristic data to the terminal device, the terminal device can synthesize the target data according to the characteristic data and play the target data.
In existing solutions, the server usually synthesizes the target data directly, sends the synthesized target data to the terminal device, and the terminal device finally plays it. However, in scenarios where network resources are unstable, this approach suffers from poor service continuity and low processing efficiency. Specifically, when network resources are abundant, the target data is transmitted quickly and played smoothly; when network resources are limited, transmission is slow, causing stuttering during playback and poor fluency, which in turn leads to poor service continuity and low processing efficiency.
In the embodiments of the present application, the server is mainly used to obtain the characteristic data and send it to the terminal device. Since the data amount of the characteristic data is smaller than that of the target data of the data to be processed, compared with sending the target data directly, this can effectively reduce the demand on network resources during data transmission, improve the fluency of target data playback, guarantee the playback quality of the final target data, and improve service continuity and processing efficiency.
With reference to the first aspect, in some implementations of the first aspect, sending the characteristic data to the terminal device includes: when a network resource meets a first condition, sending the characteristic data to the terminal device, where the first condition includes that the network resource is less than or equal to a first resource, and the first resource is the minimum resource required to transmit the target data.
It should be understood that the first resource being the minimum resource required to transmit the target data can also be described as the first resource being the critical resource for transmitting the target data. Taking speech synthesis as an example, the critical resource makes the transmission duration of speech of a target duration equal to that target duration; when the network resource is greater than the critical resource, the transmission duration is less than the target duration and playback is smooth; when the network resource is less than the critical resource, the transmission duration exceeds the target duration and playback stutters.
Optionally, in the embodiments of the present application, the network resource may be network bandwidth, and network bandwidth is used as an example in the description below.
Optionally, the network resource may be determined in real time by the terminal device, or may be determined by other network monitoring devices, which is not limited in the embodiments of the present application.
In the embodiments of the present application, when the network resource is less than or equal to the minimum resource required to transmit the target data, the characteristic data can be sent to the terminal device. Since the data amount of the characteristic data is smaller than that of the target data of the data to be processed, and the characteristic data can be used to synthesize the target data, the playback quality of the final target data can be guaranteed even when network resources are limited, while service continuity and processing efficiency are improved.
With reference to the first aspect, in some implementations of the first aspect, when the first condition is that the network resource is greater than or equal to a second resource and less than or equal to the first resource, the characteristic data includes first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to a third resource and less than or equal to the second resource, the characteristic data includes second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
Optionally, when the network resource is greater than or equal to the second resource and less than or equal to the first resource, the characteristic data may also include the second characteristic data. However, it should be understood that since the data amount of the first characteristic data is greater than or equal to that of the second characteristic data, the target data synthesized from the first characteristic data is of better quality than that synthesized from the second characteristic data. Therefore, in the embodiments of the present application, when the network resource is greater than or equal to the second resource and less than or equal to the first resource, the first characteristic data can be selected.
In the embodiments of the present application, different characteristic data can be used when the network resource is in different ranges, so that service continuity and processing efficiency, as well as the playback quality of the final target data, can be guaranteed under different network resource conditions.
With reference to the first aspect, in some implementations of the first aspect, the data to be processed includes speech data to be processed, the target data includes target speech, the characteristic data includes acoustic features, and the network resource includes network bandwidth.
Optionally, when applied to the field of speech processing, the data to be processed may include a text sequence or a phoneme sequence (that is, the text sequence or phoneme sequence of the speech to be synthesized), which is not limited in the embodiments of the present application.
Optionally, the acoustic features may include features such as a Mel spectrum or a downsampled short-time Fourier transform spectrogram, which is not limited in the embodiments of the present application.
With reference to the first aspect, in some implementations of the first aspect, the first characteristic data includes Mel spectrum features; the second characteristic data includes a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling an original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram is obtained according to the speech data to be processed.
Optionally, in the embodiments of the present application, the data amount of the Mel spectrum features can be recorded as 1/R of the data amount of the target data, and the data amount of the downsampled short-time Fourier transform spectrogram as 1/M of the data amount of the target data. Correspondingly, the second resource is 1/R of the first resource, and the third resource is 1/M of the first resource, where R and M are both positive numbers greater than 1. It should be understood that the values of R and M can be determined according to the actual situation; see the related descriptions in mode 2 and mode 3 below.
Optionally, the specific process of obtaining the original short-time Fourier transform spectrogram from the speech data to be processed may include: obtaining Mel spectrum features according to the speech data to be processed, synthesizing target speech according to the Mel spectrum features, and finally performing a short-time Fourier transform on the synthesized target speech to obtain the original short-time Fourier transform spectrogram. Optionally, the above Mel spectrum features may also be other acoustic features, without limitation. Optionally, the original short-time Fourier transform spectrogram may also be obtained directly from the speech data to be processed, which is not limited in the embodiments of the present application.
With reference to the first aspect, in some implementations of the first aspect, the second characteristic data further includes residual data, where the residual data is the difference between the original short-time Fourier transform spectrogram and a second short-time Fourier transform spectrogram, and the second short-time Fourier transform spectrogram is obtained by upsampling the first short-time Fourier transform spectrogram.
It should be understood that when original data is downsampled and then upsampled to restore it, the restored data will have an error (namely, the residual data) relative to the original data. Therefore, in the embodiments of the present application, the server can be used to calculate and send the residual data to the terminal device, so that the terminal device can take this error into account when restoring the original data from the downsampled data, thereby improving the synthesis quality of the target data while guaranteeing the playback quality of the final target data.
With reference to the first aspect, in some implementations of the first aspect, adjacent sliding windows of the original short-time Fourier transform spectrogram do not overlap; and/or, the original short-time Fourier transform spectrogram includes only the amplitude part of the spectrogram.
Optionally, the original short-time Fourier transform spectrogram including only the amplitude part of the spectrogram can also be described as the original short-time Fourier transform spectrogram not including the phase part of the spectrogram.
In the embodiments of the present application, having non-overlapping adjacent sliding windows and/or including only the amplitude part of the spectrogram can reduce the data size of the short-time Fourier transform spectrogram.
With reference to the first aspect, in some implementations of the first aspect, when the network resource meets a second condition, where the second condition includes that the network resource is greater than the first resource, the method further includes: receiving another piece of data to be processed from the terminal device; obtaining target data of the other data to be processed according to it; and sending the target data of the other data to be processed to the terminal device.
In the embodiments of the present application, the above data to be processed and the other data to be processed may be the same.
Optionally, when the network resource is greater than the first resource, the characteristic data may also be sent to the terminal device. However, it should be understood that since the server has stronger computing power, the quality of the target data synthesized directly by the server is usually better.
Therefore, in the embodiments of the present application, when the network resource is greater than the first resource, the server can be chosen to synthesize the target data and send it to the terminal device, so as to improve the quality of the target data.
In a second aspect, a data processing method is provided. The method is applied to a terminal device that communicates with a server through a network. The method includes: sending data to be processed to the server; receiving characteristic data of the data to be processed from the server, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed; and generating the target data according to the characteristic data.
Optionally, after generating the target data according to the characteristic data, the method further includes: controlling playback of the target data.
In existing solutions, the server usually synthesizes the target data directly, and the terminal device then receives the target data from the server and plays it. However, in scenarios where network resources are unstable, this approach suffers from poor service continuity and low processing efficiency. Specifically, when network resources are abundant, the target data is transmitted quickly and played smoothly; when network resources are limited, transmission is slow, causing stuttering during playback and poor fluency, which in turn leads to poor service continuity and low processing efficiency.
In the embodiments of the present application, the terminal device mainly receives the characteristic data of the data to be processed from the server and generates the target data according to the characteristic data. Since the data amount of the characteristic data is smaller than that of the target data of the data to be processed, compared with receiving the target data directly, this can effectively reduce the demand on network resources during data transmission, improve the fluency of target data playback, guarantee the playback quality of the final target data, and improve service continuity and processing efficiency.
With reference to the second aspect, in some implementations of the second aspect, receiving the characteristic data of the data to be processed from the server includes: when a network resource meets a first condition, receiving the characteristic data of the data to be processed from the server, where the first condition includes that the network resource is less than or equal to a first resource, and the first resource is the minimum resource required to transmit the target data.
In the embodiments of the present application, when the network resource is less than or equal to the minimum resource required to transmit the target data, the characteristic data of the data to be processed can be received from the server. Since the data amount of the characteristic data is smaller than that of the target data of the data to be processed, the playback quality of the final target data can be guaranteed even when network resources are limited, while service continuity and processing efficiency are improved.
With reference to the second aspect, in some implementations of the second aspect, when the first condition is that the network resource is greater than or equal to a second resource and less than or equal to the first resource, the characteristic data includes first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to a third resource and less than or equal to the second resource, the characteristic data includes second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
In the embodiments of the present application, different characteristic data can be used when the network resource is in different ranges, so that service continuity and processing efficiency, as well as the playback quality of the final target data, can be guaranteed under different network resource conditions.
With reference to the second aspect, in some implementations of the second aspect, the data to be processed includes speech data to be processed, the target data includes target speech, the characteristic data includes acoustic features, and the network resource includes network bandwidth.
With reference to the second aspect, in some implementations of the second aspect, the first characteristic data includes Mel spectrum features; the second characteristic data includes a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling an original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram is obtained according to the speech data to be processed.
With reference to the second aspect, in some implementations of the second aspect, when the characteristic data includes the second characteristic data, and the second characteristic data includes the first short-time Fourier transform spectrogram, generating the target data according to the characteristic data includes: upsampling the first short-time Fourier transform spectrogram to obtain a second short-time Fourier transform spectrogram; and performing an inverse short-time Fourier transform on the second short-time Fourier transform spectrogram to obtain the target speech.
With reference to the second aspect, in some implementations of the second aspect, the second characteristic data further includes residual data, where the residual data is the difference between the original short-time Fourier transform spectrogram and the second short-time Fourier transform spectrogram.
With reference to the second aspect, in some implementations of the second aspect, performing the inverse short-time Fourier transform on the second short-time Fourier transform spectrogram to obtain the target speech includes: performing an inverse short-time Fourier transform on the sum of the second short-time Fourier transform spectrogram and the residual data to obtain the target speech.
It should be understood that when original data is downsampled and then upsampled to restore it, the restored data will have an error (namely, the residual data) relative to the original data. Therefore, in the embodiments of the present application, the server can be used to calculate and send the residual data to the terminal device, so that the terminal device can take this error into account when restoring the original data from the downsampled data, thereby improving the synthesis quality of the target data while guaranteeing the playback quality of the final target data.
With reference to the second aspect, in some implementations of the second aspect, adjacent sliding windows of the original short-time Fourier transform spectrogram do not overlap; and/or, the original short-time Fourier transform spectrogram includes only the amplitude part of the spectrogram.
With reference to the second aspect, in some implementations of the second aspect, when the network resource meets a second condition, where the second condition includes that the network resource is greater than the first resource, the method further includes: sending another piece of data to be processed to the server; and receiving target data of the other data to be processed from the server.
Optionally, when the network resource is greater than the first resource, the characteristic data of the data to be processed may also be received from the server. However, it should be understood that since the server has stronger computing power, the quality of the target data synthesized directly by the server is usually better.
Therefore, in the embodiments of the present application, when the network resource is greater than the first resource, the target data of the data to be processed can be received directly from the server, so as to improve the quality of the target data.
With reference to the second aspect, in some implementations of the second aspect, when the network resource meets a third condition, where the third condition includes that the network resource is less than the third resource, the method further includes: determining yet another piece of data to be processed; and obtaining target data of that data to be processed according to it.
In the embodiments of the present application, the above data to be processed, the other data to be processed, and the yet other data to be processed may be the same.
In the embodiments of the present application, when the network resource is insufficient to transmit the characteristic data, the target data can be synthesized directly on the terminal device, so as to improve service continuity and processing efficiency.
In a third aspect, a data processing device is provided. The device can communicate with a terminal device through a network. Optionally, the device may be a server, or a chip, processor, or module in the server. The device includes: a transceiver module, configured to receive data to be processed from the terminal device; and a processing module, configured to obtain characteristic data of the data to be processed according to the data to be processed, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed, and the characteristic data is used to synthesize the target data; the transceiver module is further configured to send the characteristic data to the terminal device.
The transceiver module has the capability of sending and/or receiving data.
With reference to the third aspect, in some implementations of the third aspect, the transceiver module is further configured to send the characteristic data to the terminal device when a network resource meets a first condition, where the first condition includes that the network resource is less than or equal to a first resource, and the first resource is the minimum resource required to transmit the target data.
With reference to the third aspect, in some implementations of the third aspect, when the first condition is that the network resource is greater than or equal to a second resource and less than or equal to the first resource, the characteristic data includes first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to a third resource and less than or equal to the second resource, the characteristic data includes second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
With reference to the third aspect, in some implementations of the third aspect, the data to be processed includes speech data to be processed, the target data includes target speech, the characteristic data includes acoustic features, and the network resource includes network bandwidth.
With reference to the third aspect, in some implementations of the third aspect, the first characteristic data includes Mel spectrum features; the second characteristic data includes a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling an original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram is obtained according to the speech data to be processed.
With reference to the third aspect, in some implementations of the third aspect, the second characteristic data further includes residual data, where the residual data is the difference between the original short-time Fourier transform spectrogram and a second short-time Fourier transform spectrogram, and the second short-time Fourier transform spectrogram is obtained by upsampling the first short-time Fourier transform spectrogram.
With reference to the third aspect, in some implementations of the third aspect, adjacent sliding windows of the original short-time Fourier transform spectrogram do not overlap; and/or, the original short-time Fourier transform spectrogram includes only the amplitude part of the spectrogram.
With reference to the third aspect, in some implementations of the third aspect, when the network resource meets a second condition, where the second condition includes that the network resource is greater than the first resource, the transceiver module is further configured to receive another piece of data to be processed from the terminal device; the processing module is further configured to obtain target data of the other data to be processed according to it; and the transceiver module is further configured to send the target data of the other data to be processed to the terminal device.
In a fourth aspect, a data processing device is provided. The device can communicate with a server through a network. Optionally, the device may be a terminal device, or a chip, processor, or module in the terminal device. The device includes: a transceiver module, configured to send data to be processed to the server and to receive characteristic data of the data to be processed from the server, where the data amount of the characteristic data is smaller than the data amount of the target data of the data to be processed; and a processing module, configured to generate the target data according to the characteristic data.
The transceiver module has the capability of sending and/or receiving data.
With reference to the fourth aspect, in some implementations of the fourth aspect, the transceiver module is further configured to receive the characteristic data of the data to be processed from the server when a network resource meets a first condition, where the first condition includes that the network resource is less than or equal to a first resource, and the first resource is the minimum resource required to transmit the target data.
With reference to the fourth aspect, in some implementations of the fourth aspect, when the first condition is that the network resource is greater than or equal to a second resource and less than or equal to the first resource, the characteristic data includes first characteristic data; and/or, when the first condition is that the network resource is greater than or equal to a third resource and less than or equal to the second resource, the characteristic data includes second characteristic data, where the data amount of the first characteristic data is greater than or equal to the data amount of the second characteristic data, the second resource is the minimum resource required to transmit the first characteristic data, and the third resource is the minimum resource required to transmit the second characteristic data.
With reference to the fourth aspect, in some implementations of the fourth aspect, the data to be processed includes speech data to be processed, the target data includes target speech, the characteristic data includes acoustic features, and the network resource includes network bandwidth.
With reference to the fourth aspect, in some implementations of the fourth aspect, the first characteristic data includes Mel spectrum features; the second characteristic data includes a first short-time Fourier transform spectrogram, where the first short-time Fourier transform spectrogram is obtained by downsampling an original short-time Fourier transform spectrogram, and the original short-time Fourier transform spectrogram is obtained according to the speech data to be processed.
With reference to the fourth aspect, in some implementations of the fourth aspect, when the characteristic data includes the second characteristic data, and the second characteristic data includes the first short-time Fourier transform spectrogram, the processing module is further configured to upsample the first short-time Fourier transform spectrogram to obtain a second short-time Fourier transform spectrogram, and to perform an inverse short-time Fourier transform on the second short-time Fourier transform spectrogram to obtain the target speech.
With reference to the fourth aspect, in some implementations of the fourth aspect, the second characteristic data further includes residual data, where the residual data is the difference between the original short-time Fourier transform spectrogram and the second short-time Fourier transform spectrogram.
With reference to the fourth aspect, in some implementations of the fourth aspect, the processing module is further configured to perform an inverse short-time Fourier transform on the sum of the second short-time Fourier transform spectrogram and the residual data to obtain the target speech.
With reference to the fourth aspect, in some implementations of the fourth aspect, adjacent sliding windows of the original short-time Fourier transform spectrogram do not overlap; and/or, the original short-time Fourier transform spectrogram includes only the amplitude part of the spectrogram.
With reference to the fourth aspect, in some implementations of the fourth aspect, when the network resource meets a second condition, where the second condition includes that the network resource is greater than the first resource, the transceiver module is further configured to send another piece of data to be processed to the server and to receive target data of the other data to be processed from the server.
With reference to the fourth aspect, in some implementations of the fourth aspect, when the network resource meets a third condition, where the third condition includes that the network resource is less than the third resource, the processing module is further configured to determine yet another piece of data to be processed, and to obtain target data of that data to be processed according to it.
第五方面,提供了一种数据处理系统,包括如第三方面或者第三方面的任一可能的实现方式中的数据处理装置和如第四方面或者第四方面的任一可能的实现方式中的数据处理装置。
第六方面,提供了一种数据处理装置,包括至少一个处理器和接口电路,该至少一个处理器用于通过该接口电路获取待处理数据,且执行如第一方面或者第一方面的任一可能的实现方式中的数据处理方法。
第七方面,提供了一种数据处理装置,包括至少一个处理器和通信接口,该至少一个处理器用于通过该通信接口与服务器通信,且执行如第二方面或者第二方面的任一可能的实现方式中的数据处理方法。
第八方面,提供了一种车辆,包括传感器和数据处理装置,该传感器用于获取舱内用户数据,该舱内用户数据用于生成待处理数据,该数据处理装置用于执行如第二方面或者第二方面的任一可能的实现方式中的数据处理方法。
第九方面,提供了一种计算机可读存储介质,其特征在于,包括指令;所述指令用于实现如第一方面或者第一方面的任一可能的实现方式中的数据处理方法;和/或,实现如第二方面或者第二方面的任一可能的实现方式中的数据处理方法。
第十方面,提供了一种算机程序产品,其特征在于,包括:计算机程序,当计算机程序被运行时,使得计算机执行如第一方面或者第一方面的任一可能的实现方式中的数据处理方法;和/或,执行如第二方面或者第二方面的任一可能的实现方式中的数据处理方法。
第十一方面,提供了一种计算设备,包括:至少一个处理器和存储器,所述至少一个处理器与所述存储器耦合,用于读取并执行所述存储器中的指令,以执行如第一方面或者第一方面的任一可能的实现方式中的数据处理方法;和/或,执行如第二方面或者第二方面的任一可能的实现方式中的数据处理方法。
第十二方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行如第一方面或者第一方面的任一可能的实现方式中的数据处理方法;和/或,执行如第二方面或者第二方面的任一可能的实现方式中的数据处理方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行如第一方面或者第一方面的任一可能的实现方式中的数据处理方法;和/或,执行如第二方面或者第二方面的任一可能的实现方式中的数据处理方法。
第十三方面,提供了一种芯片系统,该芯片系统包括至少一个处理器,用于支持实现上述第一方面或第一方面的某些实现中所涉及的功能,例如,例如接收或处理上述方法中所涉及的数据和/或信息。
在一种可能的设计中,所述芯片系统还包括存储器,所述存储器用于保存程序指令和数据,存储器位于处理器之内或处理器之外。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。
附图说明
图1是本申请实施例提供的一种语音交互的流程示例图。
图2是本申请实施例提供的一种服务器流式语音合成及下发时延示例图。
图3是本申请实施例提供的一种数据处理方法的示例图。
图4是本申请实施例提供的一种语音合成的系统架构示例图。
图5是本申请实施例提供的一种端云结合的流式语音的合成过程示例图。
图6是本申请实施例提供的另一种端云结合的流式语音的合成过程示例图。
图7是本申请实施例提供的又一种端云结合的流式语音的合成过程示例图。
图8是本申请实施例提供的一种车载UI界面的示例图。
图9是本申请实施例提供的另一种车载UI界面的示例图。
图10是本申请实施例提供的一种数据处理装置1000。
图11是本申请实施例提供的一种数据处理装置1100。
图12是本申请实施例提供的一种数据处理系统1200。
图13是本申请实施例提供的数据处理装置的硬件结构示例性框图。
具体实施方式
下面将结合附图,对本申请实施例中的技术方案进行描述。
语音合成技术又被称为TTS技术,是语音处理领域的一个重要方向,旨在让机器生成自然、动听的人类语音。TTS技术既可以单独适用于语音播报(例如咨询播报、订单播报、新闻播报等)、阅读听书(例如读小说、读故事等)等场景中;也可以应用在各种语音交互场景中,例如电子设备的人机交互场景中。其中,电子设备例如可以包括台式电脑、笔记本电脑、智能手机、平板电脑、个人数字助理(personal digital assistant,PDA)、可穿戴设备、智能音箱、电视、无人机、车辆、车载装置(例如车机、车载电脑、车载芯片等)或机器人等等。下面结合图1对TTS技术在语音交互场景中的应用进行示例性介绍。
图1是本申请实施例提供的一种语音交互的流程示例图。应理解,在该示例中,TTS技术主要作为尾部环节(即语音交互的出口)嵌入到语音交互的整体方案中。如图1所示,语音交互的流程100包括:
步骤1,智能终端接收用户发出的语音指令,然后将该语音指令发送给自动语音识别(automatic speech recognition,ASR)模块。
其中,智能终端可以通过麦克风等声音传感器,采集用户发出的语音指令。
可选地,在智能终端将该语音指令发送给ASR模块之前,还可以先对该语音指令进行降噪、回声消除等适当的前处理操作,以降低噪声等带来的干扰。
步骤2,ASR模块可以用于对接收到的语音指令进行识别,输出经过识别后的文本序列给自然语言理解(natural language understanding,NLU)模块。
步骤3,NLU模块对接收到的文本序列中所包含的意图、槽位等语义信息进行提取,并输出提取的语义信息给对话管理(dialogue management,DM)模块。其中,DM模块包含对话状态追踪(dialogue state tracking,DST)和对话策略学习(dialogue policy learning,DPL)。
步骤4,DM模块中的DST根据输入的语义信息更新当前的系统状态,而DPL则根据当前的系统状态决定下一步采取何种动作,然后DM模块将确定的决定动作输出给自然语言生成(natural language generation,NLG)模块。
步骤5,NLG模块根据接收到的决定动作生成对应的文本序列,作为人机语音交互的反馈,然后再将该文本序列(即下文中的待处理数据)输出给TTS模块。
步骤6,TTS模块根据NLG模块输出的文本序列合成语音,并将所合成的语音发送给智能终端,最终通过智能终端的播放设备将合成语音播放给用户,从而实现人机语音交互。
在语音合成中,合成方式主要包括非流式合成和流式合成。其中,非流式合成指的是根据传入的文本一次性合成语音,并一次性返回并播放合成的语音;而流式合成指的是文本传入TTS模块后,TTS模块会分段合成语音,且先合成的语音先播放,在语音播报的同时后续的语音也在合成,不用等到整段语音合成完再进行播报,这样可以减少语音合成的等待时间。
随着语音合成技术的不断进步,以及用户对合成语音的自然度、流畅度、可懂度、甚至定制化程度更高的要求,TTS算法的复杂度也在不断地提高,而复杂度较高的TTS算法不适宜在计算资源受限的终端侧(例如,以上电子设备)中实施。为解决上述问题,传统方案将TTS算法部署在服务器,通过服务器合成语音再经由网络下发给用户。但在网络资源受限的情况下,用户接收到服务器所下发的语音会存在卡顿现象,严重影响着合成语音的质量。下面以网络带宽为例,结合图2对不同网络带宽场景下服务器流式语音合成及下发过程进行介绍。
图2是本申请实施例提供的一种服务器流式语音合成及下发时延示例图。应理解,在该示例中,主要是将需要合成的长度为n秒(s)的语音流以1s为单位间隔进行分段合成,即分段合成0s-1s、1s-2s、2s-3s、3s-4s、…、(n-1)s-ns等长度为1s的语音流,且先合成的语音流先下发并播放,其中n为大于0的正数。其中,长度为1s的语音流是指语音流的播放耗时为1s;在图2中,将采用不同长度的方框来分别表示长度为1s的语音流在不同网络带宽场景下的下发耗时或播放耗时,其中,下发耗时是指下发语音流所需要的时间(即时长)。
如图2中的(a)所示,在网络带宽充裕时,长度为1s的语音流的下发耗时小于1s,终端侧语音播放流畅无卡顿;如图2中的(b)所示,在网络带宽处于临界状态时,长度为1s的语音流的下发耗时等于1s,此时虽短期不卡顿,但存在卡顿的风险;如图2中的(c)所示,在网络带宽不足时,长度为1s的语音流的下发耗时大于1s,导致终端侧每播放完1s的语音流,都还需要额外再等待一段时间,等下一段1s的语音流下发完毕后,再继续播放,正是由于这个等待时间的存在,导致了语音播放的断断续续,造成了卡顿的现象。且网络带宽越小,终端侧等待的时间也就越长,卡顿造成的用户体验也就越差。因此,在网络带宽不稳定的场景下,用户所接收到的语音时而优质,时而卡顿,存在业务连续性较差和处理效率较低的问题。
基于此,本申请实施例提供了一种数据处理方法,该方法可以由服务器和/或终端设备执行,也可以由设置在服务器和/或终端设备内的芯片、模组、或处理器执行。该方法主要利用服务器得到特征数据并发送该特征数据给终端设备,再由终端设备根据该特征数据生成目标数据,由于该特征数据的数据量小于该待处理数据的目标数据的数据量,从而使得相较于直接发送目标数据而言,能够有效降低数据传输过程中对于网络资源的需求,提高目标数据播放的流畅性,进而能够保证最终目标数据的播放质量,同时能够提高业务的连续性和处理效率。
可选地,终端设备例如可以是以上任意一种电子设备;服务器可以是云端服务器、网络服务器、应用服务器或管理服务器等具有数据处理功能的服务器。
可选地,本申请实施例中的数据处理可以应用于流式数据处理。流式数据处理是指,分段合成目标数据,且先合成的目标数据先播放,在先合成数据播放的同时后续的数据也在合成,不用等到整个目标数据合成完再进行播报,这样可以减少数据合成的等待时间。
可选地,本申请实施例中的数据处理方法可以应用于语音处理、图像处理或视频处理等技术领域,但为便于描述,在下文实施例中均以应用于语音处理技术领域中的流式语音合成技术为例。
图3是本申请实施例提供的一种数据处理方法的示例图。应理解,该方法300可以应用于服务器和终端设备组成的系统中,且终端设备和服务器之间通过网络通信。如图3所示,方法300可以包括S310至S340,下面对方法300中的各个步骤进行详细描述。
S310,终端设备向服务器发送待处理数据。相应地,服务器从终端设备接收待处理数据。
S320,服务器根据待处理数据得到待处理数据的特征数据。
其中,该特征数据的数据量小于该待处理数据的目标数据的数据量,该特征数据用于合成该目标数据。
S330,服务器向终端设备发送特征数据。相应地,终端设备从服务器接收待处理数据的特征数据。
S340,终端设备根据特征数据生成目标数据。
可选地,在终端设备根据特征数据生成目标数据之后,方法300还可以包括:控制播放该目标数据。
在本申请实施例中,主要利用服务器得到特征数据并发送该特征数据给终端设备,并由终端设备根据该特征数据生成目标数据,由于该特征数据的数据量小于该待处理数据的目标数据的数据量,从而使得相较于直接发送目标数据而言,能够有效降低数据传输过程中对于网络资源的需求,提高目标数据播放的流畅性,进而能够保证最终目标数据的播放质量,同时能够提高业务的连续性和处理效率。
本申请实施例中的特征数据是指:将待处理数据转化为目标数据的中间数据,其具备待处理数据的特征信息,使得在给定输入待处理数据的条件下,可以先将待处理数据变换为该特征数据,然后再将该特征数据转化为最终的目标数据。示例性地,以语音合成为例,在给定输入文本序列的条件下,可先将文本序列变换为声学特征数据,然后再将声学特征数据变换为最终的目标语音。具体来讲,当变换为声码器的解码操作时,声学特征可以是梅尔谱特征,作为声码器的输入;或者,当变换为短时傅里叶逆变换时,特征数据可以为短时傅里叶图谱;或者,当变换为深度神经网络的模型推理时,特征数据可以是深度神经网络的隐藏层输出的声学特征。
可选地,上述步骤S330可以包括:当网络资源满足第一条件时,服务器向终端设备发送特征数据。相应地,当网络资源满足第一条件时,终端设备从服务器接收特征数据。其中,第一条件包括网络资源小于或等于第一资源,该第一资源为传输目标数据时所需求的最小资源。
应理解,第一资源为传输该目标数据时所需求的最小资源,也可以描述为,第一资源为传输该目标数据时所需求的临界资源。应理解,以语音合成技术为例,临界资源是指能够使目标时长语音的传输时长恰好等于目标时长的资源;当网络资源大于临界资源时,目标时长语音的传输时长会小于目标时长,语音播放较为流畅;当网络资源小于临界资源时,目标时长语音的传输时长会大于目标时长,语音播放会存在卡顿现象。应理解,实际操作中,以上判断条件(即网络资源大于临界资源或小于临界资源)中还可以包括等于临界资源的情况,不做限定。
可选地,在本申请实施例中,网络资源例如可以是网络带宽,且为便于描述,下文实施例中均以网络带宽为例,其它网络资源(或称为网络条件)的情况与之类似,例如反映网络有效性的指标,如时域、频域、或时频资源的大小等;再如反映网络可靠性的指标,如网络信道质量等,对于网络信道质量的评价,可以包括带宽、时延、信噪比、误码率、抖动等。
可选地,在本申请实施例中,可以将临界带宽记为B。作为一个示例,B值可以通过如下公式(1)进行计算:
B=F_s*bitwidth*cost      (1)
其中,F_s为目标数据的采样率,示例性地,F_s的值可以为24000;bitwidth为每个目标数据采样点的位宽,示例性地,bitwidth的值可以为16bit;cost为信号传输过程中的代价,示例性地,当采用8/10bit编码时该值为1.25。应理解,临界带宽的值与公式中各个数据的实际取值有关,以上临界带宽仅作为示例。
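示例性地,公式(1)的计算过程可以用如下示意性的Python代码表示,其中各参数取值仅为上文示例中的假设取值,实际取值不限于此:

```python
def critical_bandwidth(fs: int, bitwidth: int, cost: float) -> float:
    """按公式(1)计算临界带宽B,单位为bit/s。"""
    return fs * bitwidth * cost

# 示例取值来自上文:采样率24000,位宽16bit,8/10bit编码代价1.25
B = critical_bandwidth(24000, 16, 1.25)
print(B)  # 480000.0,即约480kbit/s
```

由此可直观看出,临界带宽随采样率、位宽和传输代价线性变化。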
可选地,网络资源可以由终端设备或网络设备实时确定,也可以由其他网络监测设备确定,本申请实施例对此不做限定。
在本申请实施例中,当网络资源小于或等于传输该目标数据时所需求的最小资源时,服务器可以向终端设备发送特征数据,并由终端设备根据该特征数据生成目标数据,由于特征数据的数据量小于待处理数据的目标数据的数据量,从而使得在网络资源受限时,能够保证最终目标数据的播放质量,同时能够提高业务的连续性和处理效率。
可选地,当第一条件为网络资源大于或等于第二资源且小于或等于第一资源时,特征数据可以包括第一特征数据;和/或,当第一条件为网络资源大于或等于第三资源且小于或等于第二资源时,特征数据可以包括第二特征数据,其中,第一特征数据的数据量大于或等于第二特征数据的数据量,第二资源为传输第一特征数据所需求的最小资源,第三资源为传输该第二特征数据所需求的最小资源。
可选地,当网络资源大于或等于第二资源且小于或等于第一资源时,特征数据也可以包括第二特征数据。但需理解的是,由于第一特征数据的数据量大于或等于第二特征数据的数据量,因而通过第一特征数据合成的目标数据相较于通过第二特征数据合成的目标数据而言质量更优。因而,在网络资源大于或等于第二资源且小于或等于第一资源时,选用第一特征数据,可以提升语音质量,选用第二特征数据可以减少网络资源消耗。
在本申请实施例中,当网络资源处于不同范围时,可以采用不同的特征数据,以使得在不同的网络资源条件下均能够保证业务的连续性和处理效率,同时能够保证最终目标数据的播放质量。
可选地,以应用于语音合成技术领域为例,上述待处理数据可以包括待处理语音数据,上述目标数据可以包括目标语音,上述特征数据可以包括声学特征,上述网络资源可以包括网络带宽,该待处理语音数据可以包括文本序列或音素序列等,本申请实施例对此不做限定。为便于描述,下文实施例中均以应用于语音合成技术领域为例进行介绍。可选地,声学特征可以包括梅尔谱或经过降采样的短时傅里叶变换谱图等特征,本申请实施例对此不做限定。
可选地,以应用于语音合成技术领域为例,上述第一特征数据可以包括梅尔谱特征;上述第二特征数据可以包括第一短时傅里叶变换谱图。其中,第一短时傅里叶变换谱图为对原始短时傅里叶变换谱图进行降采样得到的谱图,原始短时傅里叶变换谱图可以根据待处理语音数据得到。
应理解,短时傅里叶变换(short-time Fourier transform,STFT),就是对短时的信号做傅里叶变换。原理如下:对一段长语音信号,分帧、加窗,再对每一帧做傅里叶变换,之后把每一帧的结果沿另一维度堆叠,得到一张图(类似于二维信号),这张图就是声谱图。
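上述分帧、加窗、逐帧傅里叶变换并沿另一维度堆叠的过程,可以用如下示意性的Python代码(基于numpy)表示,其中窗函数、窗长和帧移等参数均为假设值:

```python
import numpy as np

def stft(signal, n_fft=512, n_hop=512):
    """对一维语音信号做短时傅里叶变换,返回声谱图(频点×帧数)。
    n_hop == n_fft 时相邻滑窗不重叠,对应下文所述的一种可选设置。"""
    window = np.hanning(n_fft)
    n_frames = (len(signal) - n_fft) // n_hop + 1
    # 分帧、加窗
    frames = [signal[i * n_hop: i * n_hop + n_fft] * window for i in range(n_frames)]
    # 对每帧做FFT,只保留非冗余的 n_fft//2+1 个频点,并沿帧维度堆叠
    return np.stack([np.fft.rfft(f) for f in frames], axis=1)

spec = stft(np.random.randn(24000))  # 1秒、24kHz采样的假设信号
print(spec.shape)  # (257, 46)
```

得到的二维复数矩阵即声谱图,取其幅度即可得到只包括幅度部分的谱图。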
可选地,原始短时傅里叶变换谱图根据待处理语音数据得到的具体过程可以包括:根据待处理语音数据得到梅尔谱特征,再根据该梅尔谱特征合成目标语音,最后对所合成的目标语音进行短时傅里叶变换得到该原始短时傅里叶变换谱图。可选地,上述梅尔谱特征也可以为其他声学特征,不做限定。可选地,原始短时傅里叶变换谱图也可以直接根据待处理语音数据得到,本申请实施例对此不做限定。
可选地,在本申请实施例中,可以将梅尔谱特征的数据量记为目标数据的数据量的1/R;将降采样的短时傅里叶变换谱图的数据量记为目标数据的数据量的1/M。那么对应地,第二资源则为第一资源的1/R;第三资源为第一资源的1/M,其中R和M均为大于1的正数。应理解,R与M的值可以根据实际情况确定,具体可参见下文方式b至方式d中的相关描述。
相应地,当特征数据包括第二特征数据,且第二特征数据包括该第一短时傅里叶变换谱图时,上述步骤S340可以包括:终端设备对第一短时傅里叶变换谱图进行升采样得到第二短时傅里叶变换谱图;再对第二短时傅里叶变换谱图进行逆短时傅里叶变换(inverse short-time Fourier transform,ISTFT)得到目标语音。
可选地,第二特征数据还可以包括残差数据,该残差数据为原始短时傅里叶变换谱图与第二短时傅里叶变换谱图之差,其中,第二短时傅里叶变换谱图为对第一短时傅里叶变换谱图进行升采样得到的谱图。
相应地,上述对第二短时傅里叶变换谱图进行ISTFT得到目标语音可以包括:对第二短时傅里叶变换谱图和残差数据之和进行ISTFT得到目标语音。
对原始数据进行降采样,再进行升采样还原数据时,所还原的数据相较于原始数据会存在误差(即残差数据),因而在本申请实施例中,可以利用服务器来计算并发送该残差数据给终端设备,使得终端设备在根据降采样数据还原原始数据时,可以考虑到该误差,进而能够提高目标数据的合成质量,同时能够保证最终目标数据的播放质量。
可选地,上述原始短时傅里叶变换谱图的相邻滑窗之间可以不重叠,从而能够降低短时傅里叶变换谱图的数据大小。
可选地,上述原始短时傅里叶变换谱图可以只包括谱图的幅度部分,从而能够降低短时傅里叶变换谱图的数据大小。可选地,原始短时傅里叶变换谱图只包括谱图的幅度部分也可以描述为,原始短时傅里叶变换谱图不包括谱图的相位部分。
可选地,当网络资源满足第二条件时,第二条件包括网络资源大于第一资源,方法300还可以包括:终端设备向服务器发送待处理数据,相应地,服务器从终端设备接收待处理数据;服务器根据待处理数据得到目标数据;服务器向终端设备发送目标数据,相应地,终端设备从服务器接收目标数据,具体可参见下文方式a中的相关描述。
可选地,当网络资源大于第一资源时,也可以向终端设备发送特征数据。但应理解的是,由于服务器的计算能力更强,所以通常利用服务器直接所合成的目标数据的质量更优,此时语音播放质量更优。
因而,可选地,在网络资源大于第一资源时,可以选择利用服务器来合成并发送目标数据给终端设备,以提高目标数据的质量。
可选地,当网络资源满足第三条件时,第三条件包括网络资源小于该第三资源,或者,处于断网场景(如隧道内、地下车库内等环境)时,方法300还可以包括:确定待处理数据;根据待处理数据得到目标数据,具体可参见下文方式e中的相关描述。
在本申请实施例中,当网络资源不足以传输特征数据时,可以直接在终端设备上合成目标数据,以提高业务的连续性和处理效率。
下面结合图4至图9对本申请实施例方案在语音合成领域中的应用进行详细介绍。应理解,下文中所述的云端服务器和网络带宽等仅为示例,不构成对本申请实施例的限定。
图4是本申请实施例提供的一种语音合成的系统架构示例图。如图4所示,在该系统架构400中,云端侧和终端侧分别部署有一套语音合成系统,其中,终端侧所部署的语音合成系统包括声学模型、声码器、ISTFT、还原算法、残差补偿等模块,应理解,考虑到终端侧有限的计算能力以及有限的存储空间,该声学模型和声码器包括的是轻量级的算法;云端侧所部署的语音合成系统包括声学模型、声码器、STFT、降采样、还原算法、残差计算等模块,其中,该声学模型和声码器可以包括重量级的算法,以提高语音合成质量。应理解,该系统架构400能够实现在不同网络带宽情况下采用不同的语音合成方法。下面结合图4对不同网络带宽情况下的语音合成方法进行介绍。
网络带宽情况1:网络带宽>B,即网络带宽充裕。其中,B表示临界带宽。应理解,在临界带宽下,目标时长的语音的下发耗时等于目标时长。
在该情况下,如图4中的方式a,可以直接在云端服务器完成语音的合成,再下发给终端设备。具体地,部署在云端服务器中的声学模型可以先根据待处理数据生成声学特征,然后声码器根据该声学特征合成语音波形(即合成语音),再通过网络将合成的语音下发给终端设备。
应理解,在网络带宽充裕时,目标时长的语音的下发耗时会小于目标时长,此时直接在云侧执行语音的合成,且实时下发给终端设备后播放,就能实现无卡顿的语音播放,且由于是在云端侧完成语音的合成,使得音质效果也比较好。
网络带宽情况2:B/2≤网络带宽≤B。
在该情况下,如图4中的方式b以及图5所示的一种端云结合的流式语音的合成过程,可以以云端和终端结合的方式完成语音的合成。具体地,部署在云端服务器中的声学模型可以先根据待处理数据生成声学特征。可选地,该声学特征可以是Mel谱特征。然后云端服务器直接下发Mel谱给终端设备,接着部署在终端设备中的声码器实时根据接收到的Mel谱生成语音流,语音流合成完毕后送入播放设备进行播放,完成终端设备对用户的回复。
通常声学模型得到的Mel谱的大小为:
(T/t_shift)*Dim*4(按每个Mel谱特征值4字节存储计)
其所对应的相同时长的语音流的大小为:T*F_S*2。那么,相同时长语音流和对应的Mel谱数据比值为:
R=(T*F_S*2)/((T/t_shift)*Dim*4)=(F_S*t_shift)/(2*Dim)      (2)
其中,T为语音流的持续时长,F_S为合成语音的时域采样率,t_shift为Mel谱计算时对应的帧移,Dim为Mel谱维度,R为相同时长的语音流数据与对应Mel谱数据之间的比值。
相同时长的语音流的数据量通常大于其所对应的Mel谱的数据量,那么Mel谱的数据量可以记为相同时长的语音流的数据量的1/R,意味着如果直接下发Mel谱数据而不是语音流数据,数据量就会降低为原本的1/R倍,此时对带宽的依赖可以降低为原本的1/R,那么在网络带宽≥B/R时,就可以实现无卡顿的语音播放。
由公式(2)可以看出,实际R的大小取决于F_S、t_shift以及Dim的取值。例如,可以令F_S=24KHz,Dim=80,t_shift=12.5ms,得到R=1.9,即相同时长语音流的数据量是其对应的Mel谱的数据量的1.9倍。由此可见,如果直接下发Mel谱数据而不是语音流数据,那么数据量就会降低为原本的1/1.9,此时对带宽的依赖可以近似降低为原本的1/1.9,即在网络带宽≥B/1.9时,就可以实现无卡顿的语音播放。
R值可以随着实际F_S、t_shift以及Dim的取值的不同而不同,但通常都在2附近浮动。因此,为便于描述,在本申请实施例中将以Mel谱的数据量近似等于同时长语音流的数据量的1/2为例进行介绍,对应地,在网络带宽≥B/2时,就可以实现无卡顿的语音播放。且为便于描述,在该实施例中,直接以R=2为例进行描述,但应理解,实际操作中,并不限于此。
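示例性地,R值的计算可以用如下示意性的Python代码表示,其中假设每个Mel谱特征值按4字节存储,语音流采样点按16bit(2字节)存储:

```python
def mel_ratio(fs: float, t_shift: float, dim: int, bytes_per_value: int = 4) -> float:
    """计算相同时长的语音流数据量与对应Mel谱数据量的比值R。"""
    speech_bytes_per_sec = fs * 2                              # 16bit采样,每秒语音流字节数
    mel_bytes_per_sec = (1 / t_shift) * dim * bytes_per_value  # 每秒Mel谱字节数
    return speech_bytes_per_sec / mel_bytes_per_sec

# 上文示例取值:F_S=24kHz,t_shift=12.5ms,Dim=80
print(mel_ratio(24000, 0.0125, 80))  # 1.875,约等于上文的1.9
```

可见在该组假设取值下R约为2,与上文"通常都在2附近浮动"一致。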
应理解,通常通过纯云端合成语音的音质更优,因而,在该语音合成系统400中,在网络带宽>B时,可以直接采用纯云端合成语音;在B/2≤网络带宽≤B时,则采用这种端云结合的方式合成语音,以减少在该网络带宽场景下发生语音播放的卡顿现象,实现近似无损地还原纯云端的合成音质。
网络带宽情况3:B/4≤网络带宽≤B/2。
在该情况下,在一种实现方式中,如图4中的方式d以及图6所示的另一种端云结合的流式语音的合成过程,可以以云端和终端结合的方式完成语音的合成。
具体地,部署在云端服务器中的声学模型可以先根据待处理数据生成声学特征。可选地,该声学特征可以是Mel谱。然后云端服务器所部署的声码器根据该声学特征合成语音波形,STFT模块对该语音波形进行STFT得到原始STFT谱图。降采样模块对所得到的原始STFT谱图进行降采样,并将降采样后的STFT谱图发送给终端设备。接着部署在终端设备中的还原算法模块实时对接收到的降采样STFT谱图进行还原(即升采样),并利用ISTFT模块对还原后的STFT谱图进行ISTFT生成语音流,语音流合成完毕后送入播放设备进行播放,完成终端设备对用户的回复。
应理解,在方式d中,云端服务器对原始STFT谱图进行降采样,然后在终端设备对STFT谱图进行升采样还原STFT谱图时,所还原的STFT谱图相较于原始STFT谱图会存在误差(即残差数据),造成目标语音的合成质量较低。为解决该问题,本申请实施例还提出了另一种实现方式。
在另一种实现方式中,如图4中的方式c以及图7所示的又一种端云结合的流式语音的合成过程,同样可以以云端和终端结合的方式完成语音的合成。
具体地,基于方式d,在云端服务器得到降采样后的STFT谱图后,云端服务器还对降采样后的STFT谱图进行升采样来还原STFT谱图,并根据原始STFT谱图和还原的STFT谱图之间的差异确定残差数据,然后在下发降采样后的STFT谱图的同时将该残差数据也发送给终端设备,使得终端设备在接收到降采样后的STFT谱图时,可以先对降采样后的STFT谱图进行还原,然后再对还原的STFT谱图进行残差补偿,接着利用ISTFT模块对经过残差补偿的STFT谱图进行ISTFT生成语音流,语音流合成完毕后送入播放设备进行播放,完成终端设备对用户的回复。
可选地,上述原始STFT谱图的相邻滑窗之间可以不重叠,且可以只包括谱图的幅度部分。在这种情况下,原始STFT谱图的数据大小可以按照如下公式(3)表示:
(T*F_S/N_hop)*(N_fft/2+1)*4≈T*F_S*2      (3)
其中,(T*F_S/N_hop)*(N_fft/2+1)*4为原始STFT谱图的数据大小(按每个谱值4字节存储计),T*F_S*2为对应语音流的数据量。T为语音流的持续时长,F_S为合成语音的时域采样率,N_hop为帧移点数,当滑窗不重叠时N_hop和N_fft相等,N_fft表示做STFT时的点数。可见,该原始STFT谱图的数据量几乎可以和对应语音流的数据量等大。
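按上述关系,原始STFT谱图与对应语音流的数据量几乎等大这一点,可以用如下示意性的Python代码验证,其中假设每个谱值按4字节存储、滑窗不重叠且只保留幅度部分:

```python
def stft_ratio(fs: int, n_hop: int, n_fft: int, bytes_per_value: int = 4) -> float:
    """计算原始STFT谱图数据量与对应语音流数据量之比。"""
    stft_bytes_per_sec = (fs / n_hop) * (n_fft // 2 + 1) * bytes_per_value
    speech_bytes_per_sec = fs * 2  # 16bit采样
    return stft_bytes_per_sec / speech_bytes_per_sec

# 假设取值:F_S=24kHz,滑窗不重叠即 N_hop = N_fft = 512
print(stft_ratio(24000, 512, 512))  # 1.00390625,即与语音流数据量几乎等大
```

这也解释了为何还需要对原始STFT谱图进一步降采样后再下发。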
残差数据传输时所占网络带宽较小,例如,可以约为B/10,因而在本申请实施例中将忽略其影响。可选地,残差数据可以通过如下公式(4)进行计算:
Res=imgstft-fun_recover(fun_downsample(imgstft))   (4)
其中,Res表示残差数据,imgstft表示原始STFT谱图,fun_downsample(imgstft)表示对原始STFT谱图进行降采样得到的STFT谱图,fun_recover(fun_downsample(imgstft))表示对降采样得到的STFT谱图进行还原得到的谱图。
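示例性地,公式(4)的残差计算以及终端侧的残差补偿过程可以用如下示意性的Python代码表示,其中fun_downsample与fun_recover仅为一种简单的降采样与还原实现,实际算法不限于此:

```python
import numpy as np

def fun_downsample(imgstft: np.ndarray, m: int = 2) -> np.ndarray:
    """沿频率维度每m个点取1个,作为一种简单的降采样示意。"""
    return imgstft[::m, :]

def fun_recover(img_ds: np.ndarray, m: int = 2, out_rows: int = 257) -> np.ndarray:
    """用重复插值将降采样谱图还原到原始尺寸,作为一种简单的还原算法示意。"""
    return np.repeat(img_ds, m, axis=0)[:out_rows, :]

imgstft = np.abs(np.random.randn(257, 46))            # 假设的原始STFT幅度谱
res = imgstft - fun_recover(fun_downsample(imgstft))  # 服务器侧:按公式(4)计算残差
recovered = fun_recover(fun_downsample(imgstft)) + res  # 终端侧:还原并补偿残差
print(np.allclose(recovered, imgstft))  # True,补偿残差后可还原原始谱图
```

可见,终端侧在还原谱图的基础上加上残差数据,即可消除降采样再升采样引入的误差。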
可选地,在本申请实施例中,可以将降采样后的STFT谱图的数据量记为原始STFT谱图的数据量的1/M。如果直接下发降采样后的STFT谱图而不是语音流数据,数据大小就会降低为原本的1/M倍,此时对带宽的依赖可以降低为原本的1/M,那么在网络带宽≥B/M时,就可以实现无卡顿的语音播放。
该M代表STFT图谱的降采样程度,M值越大,表示降采样程度越大,意味着降采样后的STFT谱图的数据量越小。
示例性地,在本申请实施例中,可以令M=4。意味着如果直接下发降采样后的STFT谱图而不是语音流数据,那么数据量就会降低为原本的1/4,此时对带宽的依赖可以近似降低为原本的1/4,即在网络带宽≥B/4时,就可以实现无卡顿的语音播放。应理解M=4仅为示例,实际操作中,可以不限于此。
由于直接下发Mel谱特征相较于降采样后的STFT谱图而言,能够保留更多的语音特征,因而在该语音合成系统400中,可以在B/2≤网络带宽≤B时采用方式b;在B/4≤网络带宽≤B/2时采用方式c或d,以进一步提高语音合成质量。
网络带宽情况4:网络带宽≤B/4。
在该情况下,如图4中的方式e,可以直接在终端设备完成语音的合成并播放。具体地,部署在终端设备的声学模型可以先根据待处理数据生成声学特征,然后声码器根据该声学特征合成语音,并由播放设备进行实时播放。
在网络带宽≤B/4时,即使采用端云结合的方式来合成语音,也会由于网络带宽的限制,造成语音播放时的卡顿现象。因此,在该情况下,待处理数据可以不上传至云端,语音合成过程完全在终端侧执行。但由于终端侧通常所部署的声学模型和声码器包括的是轻量级的TTS算法,因此通过终端侧合成语音的质量不如云端侧的音质效果。
综上,根据上述不同方法合成语音的音质效果由方式a至方式e递减。
可选地,在实际操作中,终端设备可以直接根据所处环境的网络情况来确定合成语音的方式。例如,在网络带宽充裕时,终端设备可以确定在云端合成语音;在网络带宽有限,但能满足上述声学特征数据的传输时,终端设备可以确定采用端云结合的方式合成语音;在网络带宽严重不足或断网时,终端设备可以确定直接在终端设备合成语音。
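示例性地,终端设备根据网络带宽确定合成方式的判断逻辑可以用如下示意性的Python代码表示,其中各临界值处(取等号时)的归属按上文各网络带宽情况的描述处理:

```python
def select_mode(bandwidth: float, B: float) -> str:
    """根据网络带宽与临界带宽B的关系选择语音合成方式。"""
    if bandwidth > B:
        return "a"    # 纯云端合成并下发语音流
    if bandwidth >= B / 2:
        return "b"    # 云端下发Mel谱,终端声码器合成
    if bandwidth >= B / 4:
        return "c/d"  # 云端下发降采样STFT谱图(可附带残差)
    return "e"        # 纯终端侧合成

print(select_mode(500000, 480000))  # a
print(select_mode(300000, 480000))  # b
print(select_mode(150000, 480000))  # c/d
print(select_mode(100000, 480000))  # e
```

该判断逻辑既可以由终端设备自动执行,也可以作为向用户推荐播报模式的依据。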
可选地,也可以由用户根据实际情况自主选择语音合成的方式。例如,在用户对语音播报速度有较高要求时,可以选用在终端设备上合成语音的方式,减少起播延迟;在用户对音质要求很高且环境网络充裕时,可以选用云端合成的方式;在用户对音质要求高但环境网络不是很充裕时,可以选用端云结合的方式。
可选地,终端设备也可以先根据所处环境的网络情况来确定合成语音的方式然后推荐给用户,让用户自主选择是否按照终端设备的推荐进行语音的合成。
下面结合图8和图9以车辆为例,对车载人机交互的用户界面(user interface,UI)进行介绍。
图8是本申请实施例提供的一种车载UI界面的示例图。如图8所示,在人机语音交互过程中,若用户希望能够自由选择语音播报模式,则可以向车辆发出指令,例如,“交互助手,我要选择播报模式!”,车辆接收到该指令后,车辆可以将快速听、流畅听、生动听等不同语音播报模式显示在中控显示屏以供用户选择。其中,快速听是指通过终端侧完成语音合成,实现快速的语音播报;流畅听是指通过端云结合的方式完成语音合成,实现优质不卡的语音播报;生动听是指完全在云端完成语音合成,实现音质更优的语音播报。使得用户可以根据实际需求和网络环境自主选择语音播报模式。例如,在用户对语音播报速度有较高要求时,可以在中控显示屏上选择“快速听”;在用户对音质要求很高且环境网络充裕时,可以选用“生动听”;在用户对音质要求高但环境网络不是很充裕时,可以选用“流畅听”。
图9是本申请实施例提供的另一种车载UI界面的示例图。如图9所示,在人机语音交互过程中,若用户希望由车辆推荐一种播报模式,则可以向车辆发出指令,例如,“交互助手,请推荐一种播报模式!”,车辆接收到该指令后,可以根据所处环境的网络情况来为用户推荐合适的语音播报模式。例如,在车辆检测到网络充裕时,可以为用户推荐生动听,此时,若用户同意该推荐,可以选择“是”;若用户不同意该推荐;可以选择“否”,并按照自己的需求选择一种语音播报模式。
图10是本申请实施例提供的一种数据处理装置1000,该装置1000可以通过网络与终端设备通信。可选地,该装置1000可以为服务器,也可以为服务器中的芯片、处理器或模组等,不做限定。该装置1000包括:收发模块1010和处理模块1020。应理解,该收发模块1010具有数据发送和/或接收的能力,且该收发模块1010在具体实现上可以是接口电路。
其中,收发模块1010,用于从终端设备接收待处理数据;处理模块1020,用于根据待处理数据得到待处理数据的特征数据,该特征数据的数据量小于待处理数据的目标数据的数据量,特征数据用于合成目标数据;收发模块1010还用于,向终端设备发送特征数据。
可选地,收发模块1010还可以用于,当网络资源满足第一条件时,向终端设备发送特征数据,其中,第一条件包括网络资源小于或等于第一资源,第一资源为传输目标数据 时所需求的最小资源。
可选地,当第一条件为网络资源大于或等于第二资源且小于或等于第一资源时,特征数据可以包括第一特征数据;和/或,当第一条件为网络资源大于或等于第三资源且小于或等于第二资源时,特征数据可以包括第二特征数据,其中,第一特征数据的数据量大于或等于第二特征数据的数据量,第二资源为传输第一特征数据所需求的最小资源,第三资源为传输第二特征数据所需求的最小资源。
可选地,待处理数据可以包括待处理语音数据,目标数据可以包括目标语音,特征数据可以包括声学特征,网络资源可以包括网络带宽。
可选地,第一特征数据可以包括梅尔谱特征;第二特征数据可以包括第一短时傅里叶变换谱图,其中,第一短时傅里叶变换谱图为对原始短时傅里叶变换谱图进行降采样得到的谱图,原始短时傅里叶变换谱图可以根据待处理语音数据得到。
可选地,第二特征数据还可以包括残差数据,残差数据为原始短时傅里叶变换谱图与第二短时傅里叶变换谱图之差,第二短时傅里叶变换谱图为对第一短时傅里叶变换谱图进行升采样得到的谱图。
可选地,原始短时傅里叶变换谱图的相邻滑窗之间可以不重叠;和/或,原始短时傅里叶变换谱图可以只包括谱图的幅度部分。
可选地,当网络资源满足第二条件时,该第二条件包括网络资源大于第一资源,收发模块1010还可以用于,从终端设备接收另一待处理数据;处理模块1020还可以用于,根据另一待处理数据得到另一待处理数据的目标数据;收发模块1010还可以用于,向终端设备发送另一待处理数据的目标数据。
在本申请实施例中,上述待处理数据和另一待处理数据可以相同。
图11是本申请实施例提供的一种数据处理装置1100,该装置1100可以通过网络与服务器通信。可选地,该装置1100可以为终端设备,也可以为终端设备中的芯片、处理器或模组等,不做限定。该装置1100包括:收发模块1110和处理模块1120。应理解,该收发模块1110具有数据发送和/或接收的能力,且该收发模块1110在具体实现上可以是接口电路。
其中,收发模块1110,用于向服务器发送待处理数据;从服务器接收待处理数据的特征数据,该特征数据的数据量小于待处理数据的目标数据的数据量;处理模块1120,用于根据特征数据生成目标数据。
可选地,收发模块1110还可以用于,当网络资源满足第一条件时,从服务器接收待处理数据的特征数据,其中,第一条件包括网络资源小于或等于第一资源,第一资源为传输目标数据时所需求的最小资源。
可选地,当第一条件为网络资源大于或等于第二资源且小于或等于第一资源时,特征数据可以包括第一特征数据;和/或,当第一条件为网络资源大于或等于第三资源且小于或等于第二资源时,特征数据可以包括第二特征数据,其中,第一特征数据的数据量大于或等于第二特征数据的数据量,第二资源为传输第一特征数据所需求的最小资源,第三资源为传输第二特征数据所需求的最小资源。
可选地,待处理数据可以包括待处理语音数据,目标数据可以包括目标语音,特征数据可以包括声学特征,网络资源可以包括网络带宽。
可选地,第一特征数据可以包括梅尔谱特征;第二特征数据可以包括第一短时傅里叶变换谱图,其中,第一短时傅里叶变换谱图为对原始短时傅里叶变换谱图进行降采样得到的谱图,原始短时傅里叶变换谱图可以根据待处理语音数据得到。
可选地,当特征数据包括第二特征数据,且第二特征数据包括第一短时傅里叶变换谱图时,处理模块1120还可以用于,对第一短时傅里叶变换谱图进行升采样得到第二短时傅里叶变换谱图;对第二短时傅里叶变换谱图进行ISTFT得到目标语音。
可选地,第二特征数据还可以包括残差数据,残差数据为原始短时傅里叶变换谱图与第二短时傅里叶变换谱图之差。
可选地,处理模块1120还可以用于,对第二短时傅里叶变换谱图和残差数据之和进行ISTFT得到目标语音。
可选地,原始短时傅里叶变换谱图的相邻滑窗之间可以不重叠;和/或,原始短时傅里叶变换谱图可以只包括谱图的幅度部分。
可选地,当网络资源满足第二条件时,该第二条件包括网络资源大于第一资源,收发模块1110还可以用于,向服务器发送另一待处理数据;从服务器接收另一待处理数据的目标数据。
可选地,当网络资源满足第三条件时,该第三条件包括网络资源小于第三资源,处理模块1120还可以用于,确定又一待处理数据;根据又一待处理数据得到又一待处理数据的目标数据。
在本申请实施例中,上述待处理数据、另一待处理数据和又一待处理数据可以相同。
图12是本申请实施例提供的一种数据处理系统1200。如图12所示,该系统1200包括装置1000和装置1100,且该装置1000应用于服务器,可以用于执行本申请方法实施例中服务器所对应的相关操作,该装置1100应用于终端设备,可以用于执行本申请方法实施例中终端设备所对应的相关操作。
图13是本申请实施例提供的数据处理装置的硬件结构示例性框图。可选地,该装置1300具体可以是一种计算机设备。该装置1300包括存储器1310、处理器1320、通信接口1330以及总线1340。其中,存储器1310、处理器1320、通信接口1330通过总线1340实现彼此之间的通信连接。
存储器1310可以是只读存储器(read-only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器1310可以存储程序,当存储器1310中存储的程序被处理器1320执行时,处理器1320和通信接口1330用于执行本申请实施例中数据处理装置1000所对应的相关操作;和/或,用于执行本申请实施例中数据处理装置1100所对应的相关操作。
处理器1320可以采用通用的中央处理器(central processing unit,CPU),微处理器,专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的数据处理装置1000中的处理模块1020所需执行的功能,或者以实现本申请实施例的数据处理装置1100中的处理模块1120所需执行的功能。
处理器1320还可以是一种集成电路芯片,具有信号处理能力。在实现过程中,本申请方法实施例中服务器所对应的相关操作;和/或,本申请方法实施例中终端设备所对应 的相关操作可以通过处理器1320中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器1320还可以是通用处理器、数字信号处理器(digital signal processor,DSP)、ASIC、现场可编程逻辑门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1310,处理器1320读取存储器1310中的信息,结合其硬件完成本申请实施例的数据处理装置中包括的模块所需执行的功能,或者执行本申请方法实施例中服务器所对应的相关操作;和/或,用于执行本申请方法实施例中终端设备所对应的相关操作。例如,处理器1320可以执行上述步骤S320和步骤S340。
通信接口1330使用例如但不限于收发器一类的收发装置,来实现装置1300与其他设备或通信网络之间的通信。通信接口1330可以用于实现图10所示数据处理装置1000中的收发模块1010所需执行的功能;或者,通信接口1330可以用于实现图11所示数据处理装置1100中的收发模块1110所需执行的功能。例如,通信接口1330可以执行上述步骤S310和步骤S330。
总线1340可包括在装置1300各个部件(例如,存储器1310、处理器1320、通信接口1330)之间传送信息的通路。
本申请实施例还提供了一种车辆,包括传感器和数据处理装置,传感器用于获取舱内用户数据,舱内用户数据用于生成待处理数据,数据处理装置用于执行本申请方法实施例中终端设备所对应的相关操作。
本申请实施例还提供了一种计算机可读存储介质,其特征在于,包括指令;所述指令用于实现本申请方法实施例中服务器所对应的相关操作;和/或,实现本申请方法实施例中终端设备所对应的相关操作。
本申请实施例还提供了一种计算机程序产品,其特征在于,包括:计算机程序,当计算机程序被运行时,使得计算机执行本申请方法实施例中服务器所对应的相关操作;和/或,执行本申请方法实施例中终端设备所对应的相关操作。
本申请实施例还提供了一种计算设备,包括:至少一个处理器和存储器,所述至少一个处理器与所述存储器耦合,用于读取并执行所述存储器中的指令,以执行本申请方法实施例中服务器所对应的相关操作;和/或,执行本申请方法实施例中终端设备所对应的相关操作。
本申请实施例还提供了一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行本申请方法实施例中服务器所对应的相关操作;和/或,执行本申请方法实施例中终端设备所对应的相关操作。
在本申请的各个实施例中,如果没有特殊说明以及逻辑冲突,不同的实施例之间的术语和/或描述具有一致性、且可以相互引用,不同的实施例中的技术特征根据其内在的逻辑关系可以组合形成新的实施例。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对传统方案做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。本申请实施例中,“多个”是指两个或两个以上。本申请实施例中,“和/或”用于描述关联对象的关联关系,表示可以独立存在的三种关系,例如,A和/或B,可以表示:单独存在A,单独存在B,或同时存在A和B。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (45)

  1. 一种数据处理方法,其特征在于,所述方法应用于服务器,所述服务器通过网络与终端设备通信,所述方法包括:
    从所述终端设备接收待处理数据;
    根据所述待处理数据得到所述待处理数据的特征数据,所述特征数据的数据量小于所述待处理数据的目标数据的数据量,所述特征数据用于合成所述目标数据;
    向所述终端设备发送所述特征数据。
  2. 如权利要求1所述的方法,其特征在于,所述向所述终端设备发送所述特征数据包括:
    当网络资源满足第一条件时,向所述终端设备发送所述特征数据,其中,所述第一条件包括所述网络资源小于或等于第一资源,所述第一资源为传输所述目标数据时所需求的最小资源。
  3. 如权利要求2所述的方法,其特征在于,
    当所述第一条件为所述网络资源大于或等于第二资源且小于或等于所述第一资源时,所述特征数据包括第一特征数据;和/或,
    当所述第一条件为所述网络资源大于或等于第三资源且小于或等于第二资源时,所述特征数据包括第二特征数据,其中,所述第一特征数据的数据量大于或等于所述第二特征数据的数据量,所述第二资源为传输所述第一特征数据所需求的最小资源,所述第三资源为传输所述第二特征数据所需求的最小资源。
  4. 如权利要求3所述的方法,其特征在于,所述待处理数据包括待处理语音数据,所述目标数据包括目标语音,所述特征数据包括声学特征,所述网络资源包括网络带宽。
  5. 如权利要求4所述的方法,其特征在于,所述第一特征数据包括梅尔谱特征;所述第二特征数据包括第一短时傅里叶变换谱图,其中,所述第一短时傅里叶变换谱图为对原始短时傅里叶变换谱图进行降采样得到的谱图,所述原始短时傅里叶变换谱图根据所述待处理语音数据得到。
  6. 如权利要求5所述的方法,其特征在于,所述第二特征数据还包括残差数据,所述残差数据为所述原始短时傅里叶变换谱图与第二短时傅里叶变换谱图之差,所述第二短时傅里叶变换谱图为对所述第一短时傅里叶变换谱图进行升采样得到的谱图。
  7. 如权利要求5或6所述的方法,其特征在于,所述原始短时傅里叶变换谱图的相邻滑窗之间不重叠;和/或,所述原始短时傅里叶变换谱图只包括谱图的幅度部分。
  8. 如权利要求2所述的方法,其特征在于,当所述网络资源满足第二条件时,所述第二条件包括所述网络资源大于所述第一资源,所述方法还包括:
    从所述终端设备接收另一待处理数据;
    根据所述另一待处理数据得到所述另一待处理数据的目标数据;
    向所述终端设备发送所述另一待处理数据的目标数据。
  9. 一种数据处理方法,其特征在于,所述方法应用于终端设备,所述终端设备通过网络与服务器通信,所述方法包括:
    向所述服务器发送待处理数据;
    从所述服务器接收所述待处理数据的特征数据,所述特征数据的数据量小于所述待处理数据的目标数据的数据量;
    根据所述特征数据生成所述目标数据。
  10. 如权利要求9所述的方法,其特征在于,所述从所述服务器接收所述待处理数据的特征数据包括:
    当网络资源满足第一条件时,从所述服务器接收所述待处理数据的特征数据,其中,所述第一条件包括所述网络资源小于或等于第一资源,所述第一资源为传输所述目标数据时所需求的最小资源。
  11. 如权利要求10所述的方法,其特征在于,
    当所述第一条件为所述网络资源大于或等于第二资源且小于或等于所述第一资源时,所述特征数据包括第一特征数据;和/或,
    当所述第一条件为所述网络资源大于或等于第三资源且小于或等于第二资源时,所述特征数据包括第二特征数据,其中,所述第一特征数据的数据量大于或等于所述第二特征数据的数据量,所述第二资源为传输所述第一特征数据所需求的最小资源,所述第三资源为传输所述第二特征数据所需求的最小资源。
  12. 如权利要求11所述的方法,其特征在于,所述待处理数据包括待处理语音数据,所述目标数据包括目标语音,所述特征数据包括声学特征,所述网络资源包括网络带宽。
  13. 如权利要求12所述的方法,其特征在于,所述第一特征数据包括梅尔谱特征;所述第二特征数据包括第一短时傅里叶变换谱图,其中,所述第一短时傅里叶变换谱图为对原始短时傅里叶变换谱图进行降采样得到的谱图,所述原始短时傅里叶变换谱图根据所述待处理语音数据得到。
  14. 如权利要求13所述的方法,其特征在于,当所述特征数据包括第二特征数据,且所述第二特征数据包括所述第一短时傅里叶变换谱图时,所述根据所述特征数据生成所述目标数据包括:
    对所述第一短时傅里叶变换谱图进行升采样得到第二短时傅里叶变换谱图;
    对所述第二短时傅里叶变换谱图进行逆短时傅里叶变换得到所述目标语音。
  15. 如权利要求14所述的方法,其特征在于,所述第二特征数据还包括残差数据,所述残差数据为原始短时傅里叶变换谱图与所述第二短时傅里叶变换谱图之差。
  16. 如权利要求15所述的方法,其特征在于,所述对所述第二短时傅里叶变换谱图进行逆短时傅里叶变换得到所述目标语音包括:
    对所述第二短时傅里叶变换谱图和所述残差数据之和进行逆短时傅里叶变换得到所述目标语音。
  17. 如权利要求13至16中任一项所述的方法,其特征在于,所述原始短时傅里叶变换谱图的相邻滑窗之间不重叠;和/或,所述原始短时傅里叶变换谱图只包括谱图的幅度部分。
  18. 如权利要求10所述的方法,其特征在于,当所述网络资源满足第二条件时,所述第二条件包括所述网络资源大于所述第一资源,所述方法还包括:
    向所述服务器发送另一待处理数据;
    从所述服务器接收所述另一待处理数据的目标数据。
  19. 如权利要求11所述的方法,其特征在于,当所述网络资源满足第三条件时,所述第三条件包括所述网络资源小于所述第三资源,所述方法还包括:
    确定又一待处理数据;
    根据所述又一待处理数据得到所述又一待处理数据的目标数据。
  20. 一种数据处理装置,其特征在于,所述装置通过网络与终端设备通信,所述装置包括:
    收发模块,用于从所述终端设备接收待处理数据;
    处理模块,用于根据所述待处理数据得到所述待处理数据的特征数据,所述特征数据的数据量小于所述待处理数据的目标数据的数据量,所述特征数据用于合成所述目标数据;
    所述收发模块还用于,向所述终端设备发送所述特征数据。
  21. 如权利要求20所述的装置,其特征在于,所述收发模块还用于,
    当网络资源满足第一条件时,向所述终端设备发送所述特征数据,其中,所述第一条件包括所述网络资源小于或等于第一资源,所述第一资源为传输所述目标数据时所需求的最小资源。
  22. 如权利要求21所述的装置,其特征在于,
    当所述第一条件为所述网络资源大于或等于第二资源且小于或等于所述第一资源时,所述特征数据包括第一特征数据;和/或,
    当所述第一条件为所述网络资源大于或等于第三资源且小于或等于第二资源时,所述特征数据包括第二特征数据,其中,所述第一特征数据的数据量大于或等于所述第二特征数据的数据量,所述第二资源为传输所述第一特征数据所需求的最小资源,所述第三资源为传输所述第二特征数据所需求的最小资源。
  23. 如权利要求22所述的装置,其特征在于,所述待处理数据包括待处理语音数据,所述目标数据包括目标语音,所述特征数据包括声学特征,所述网络资源包括网络带宽。
  24. 如权利要求23所述的装置,其特征在于,所述第一特征数据包括梅尔谱特征;所述第二特征数据包括第一短时傅里叶变换谱图,其中,所述第一短时傅里叶变换谱图为对原始短时傅里叶变换谱图进行降采样得到的谱图,所述原始短时傅里叶变换谱图根据所述待处理语音数据得到。
  25. 如权利要求24所述的装置,其特征在于,所述第二特征数据还包括残差数据,所述残差数据为所述原始短时傅里叶变换谱图与第二短时傅里叶变换谱图之差,所述第二短时傅里叶变换谱图为对所述第一短时傅里叶变换谱图进行升采样得到的谱图。
  26. 如权利要求24或25所述的装置,其特征在于,所述原始短时傅里叶变换谱图的相邻滑窗之间不重叠;和/或,所述原始短时傅里叶变换谱图只包括谱图的幅度部分。
  27. 如权利要求21所述的装置,其特征在于,当所述网络资源满足第二条件时,所述第二条件包括所述网络资源大于所述第一资源,所述收发模块还用于,
    从所述终端设备接收另一待处理数据;
    所述处理模块还用于,根据所述另一待处理数据得到所述另一待处理数据的目标数据;
    所述收发模块还用于,向所述终端设备发送所述另一待处理数据的目标数据。
  28. 一种数据处理装置,其特征在于,所述装置通过网络与服务器通信,所述装置包括:
    收发模块,用于向所述服务器发送待处理数据;从所述服务器接收所述待处理数据的特征数据,所述特征数据的数据量小于所述待处理数据的目标数据的数据量;
    处理模块,用于根据所述特征数据生成所述目标数据。
  29. 如权利要求28所述的装置,其特征在于,所述收发模块还用于,当网络资源满足第一条件时,从所述服务器接收所述待处理数据的特征数据,其中,所述第一条件包括所述网络资源小于或等于第一资源,所述第一资源为传输所述目标数据时所需求的最小资源。
  30. 如权利要求29所述的装置,其特征在于,
    当所述第一条件为所述网络资源大于或等于第二资源且小于或等于所述第一资源时,所述特征数据包括第一特征数据;和/或,
    当所述第一条件为所述网络资源大于或等于第三资源且小于或等于第二资源时,所述特征数据包括第二特征数据,其中,所述第一特征数据的数据量大于或等于所述第二特征数据的数据量,所述第二资源为传输所述第一特征数据所需求的最小资源,所述第三资源为传输所述第二特征数据所需求的最小资源。
  31. 如权利要求30所述的装置,其特征在于,所述待处理数据包括待处理语音数据,所述目标数据包括目标语音,所述特征数据包括声学特征,所述网络资源包括网络带宽。
  32. 如权利要求31所述的装置,其特征在于,所述第一特征数据包括梅尔谱特征;所述第二特征数据包括第一短时傅里叶变换谱图,其中,所述第一短时傅里叶变换谱图为对原始短时傅里叶变换谱图进行降采样得到的谱图,所述原始短时傅里叶变换谱图根据所述待处理语音数据得到。
  33. 如权利要求32所述的装置,其特征在于,当所述特征数据包括第二特征数据,且所述第二特征数据包括所述第一短时傅里叶变换谱图时,所述处理模块还用于,
    对所述第一短时傅里叶变换谱图进行升采样得到第二短时傅里叶变换谱图;
    对所述第二短时傅里叶变换谱图进行逆短时傅里叶变换得到所述目标语音。
  34. 如权利要求33所述的装置,其特征在于,所述第二特征数据还包括残差数据,所述残差数据为原始短时傅里叶变换谱图与所述第二短时傅里叶变换谱图之差。
  35. 如权利要求34所述的装置,其特征在于,所述处理模块还用于,对所述第二短时傅里叶变换谱图和所述残差数据之和进行逆短时傅里叶变换得到所述目标语音。
  36. 如权利要求32至35中任一项所述的装置,其特征在于,所述原始短时傅里叶变换谱图的相邻滑窗之间不重叠;和/或,所述原始短时傅里叶变换谱图只包括谱图的幅度部分。
  37. 如权利要求29所述的装置,其特征在于,当所述网络资源满足第二条件时,所述第二条件包括所述网络资源大于所述第一资源,所述收发模块还用于,
    向所述服务器发送另一待处理数据;
    从所述服务器接收所述另一待处理数据的目标数据。
  38. 如权利要求30所述的装置,其特征在于,当所述网络资源满足第三条件时,所述第三条件包括所述网络资源小于所述第三资源,所述处理模块还用于,
    确定又一待处理数据;
    根据所述又一待处理数据得到所述又一待处理数据的目标数据。
  39. 一种数据处理系统,其特征在于,包括如权利要求20至27中任一项所述的数据处理装置和如权利要求28至38中任一项所述的数据处理装置。
  40. 一种数据处理装置,其特征在于,包括至少一个处理器和接口电路,所述至少一个处理器用于通过所述接口电路获取待处理数据,且执行如权利要求1至8中任一项所述的数据处理方法。
  41. 一种数据处理装置,其特征在于,包括至少一个处理器和通信接口,所述至少一个处理器用于通过所述通信接口与服务器通信,且执行如权利要求9至19中任一项所述的数据处理方法。
  42. 一种车辆,其特征在于,包括传感器和数据处理装置,所述传感器用于获取舱内用户数据,所述舱内用户数据用于生成待处理数据,所述数据处理装置用于执行如权利要求9至19中任一项所述的数据处理方法。
  43. 一种芯片,其特征在于,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行如权利要求1至8中任一项所述的数据处理方法;和/或,执行如权利要求9至19中任一项所述的数据处理方法。
  44. 一种计算机可读存储介质,其特征在于,包括指令;所述指令用于实现如权利要求1至8中任一项所述的数据处理方法;和/或,实现如权利要求9至19中任一项所述的数据处理方法。
  45. 一种计算机程序产品,其特征在于,包括:计算机程序,当计算机程序被运行时,使得计算机执行如权利要求1至8中任一项所述的数据处理方法;和/或,执行如权利要求9至19中任一项所述的数据处理方法。
PCT/CN2022/080823 2022-03-15 2022-03-15 数据处理方法和装置 WO2023173269A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202280008495.5A CN117157705A (zh) 2022-03-15 2022-03-15 数据处理方法和装置
PCT/CN2022/080823 WO2023173269A1 (zh) 2022-03-15 2022-03-15 数据处理方法和装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/080823 WO2023173269A1 (zh) 2022-03-15 2022-03-15 数据处理方法和装置

Publications (1)

Publication Number Publication Date
WO2023173269A1 true WO2023173269A1 (zh) 2023-09-21

Family

ID=88022044

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/080823 WO2023173269A1 (zh) 2022-03-15 2022-03-15 数据处理方法和装置

Country Status (2)

Country Link
CN (1) CN117157705A (zh)
WO (1) WO2023173269A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053821A (zh) * 2017-12-12 2018-05-18 腾讯科技(深圳)有限公司 生成音频数据的方法和装置
US20190363768A1 (en) * 2017-01-13 2019-11-28 Huawei Technologies Co., Ltd. High-Speed Data Transmission Degradation Method, Device, and System
CN111276120A (zh) * 2020-01-21 2020-06-12 华为技术有限公司 语音合成方法、装置和计算机可读存储介质
CN113676404A (zh) * 2021-08-23 2021-11-19 北京字节跳动网络技术有限公司 数据传输方法、装置、设备、存储介质及程序
CN114006890A (zh) * 2021-10-26 2022-02-01 深圳Tcl新技术有限公司 一种数据传输方法、设备及存储介质和终端设备

Also Published As

Publication number Publication date
CN117157705A (zh) 2023-12-01

Similar Documents

Publication Publication Date Title
US10950249B2 (en) Audio watermark encoding/decoding
US10546593B2 (en) Deep learning driven multi-channel filtering for speech enhancement
CN110709924A (zh) 视听语音分离
US20150310858A1 (en) Shared hidden layer combination for speech recognition systems
US10978081B2 (en) Audio watermark encoding/decoding
WO2019233362A1 (zh) 基于深度学习的语音音质增强方法、装置和系统
US11908461B2 (en) Deliberation model-based two-pass end-to-end speech recognition
EP3078022B1 (en) Multi-path audio processing
US20200211540A1 (en) Context-based speech synthesis
WO2021052285A1 (zh) 频带扩展方法、装置、电子设备及计算机可读存储介质
US20230186901A1 (en) Attention-Based Joint Acoustic and Text On-Device End-to-End Model
Kumar et al. Murmured speech recognition using hidden markov model
EP4289129A1 (en) Systems and methods of handling speech audio stream interruptions
WO2023173269A1 (zh) 数据处理方法和装置
US20030033144A1 (en) Integrated sound input system
KR20170052090A (ko) 효율적인 음성 통화를 위한 샘플링 레이트 변환 방법 및 시스템
WO2020068401A1 (en) Audio watermark encoding/decoding
JP7333371B2 (ja) 話者分離基盤の自動通訳方法、話者分離基盤の自動通訳サービスを提供するユーザ端末、及び、話者分離基盤の自動通訳サービス提供システム
US20240127827A1 (en) Matching audio using machine learning based audio representations
CN117795597A (zh) 用于自动语音辨识的联合声学回声消除、语音增强和话音分离
US20230298612A1 (en) Microphone Array Configuration Invariant, Streaming, Multichannel Neural Enhancement Frontend for Automatic Speech Recognition
TW202333144A (zh) 音訊訊號重構
WO2024050192A1 (en) Data reconstruction using machine-learning predictive coding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931309

Country of ref document: EP

Kind code of ref document: A1