WO2021238338A1 - Speech synthesis method and device - Google Patents

Speech synthesis method and device

Info

Publication number
WO2021238338A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
duration
text
converted
pronunciation
Prior art date
Application number
PCT/CN2021/080403
Other languages
French (fr)
Chinese (zh)
Inventor
别凡虎
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021238338A1 publication Critical patent/WO2021238338A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • This application relates to the technical field of terminal artificial intelligence and the technical field of text-to-speech, and in particular to a speech synthesis method and device.
  • A terminal device can not only receive voice information sent by the user, but also play voice information to the user.
  • In this way, the user does not need to read the text displayed by the terminal device, and can obtain the displayed information through hearing alone.
  • The terminal device can obtain the text to be converted, perform feature extraction on the text to be converted to obtain language features, then determine the phoneme duration of each phoneme corresponding to the text to be converted from the language features, and finally generate voice data according to the respective phoneme durations and language features.
  • the embodiments of the present application provide a speech synthesis method and device, which can solve the problem of excessively mechanized speech synthesis.
  • In a first aspect, an embodiment of the present application provides a speech synthesis method, including: determining the duration range of each phoneme corresponding to the text to be converted; determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme; and generating voice data according to the text to be converted and the phoneme duration of each phoneme.
  • In one implementation, the determining of the duration range of each phoneme corresponding to the text to be converted includes: determining the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme corresponding to the text to be converted; and determining the duration range of each phoneme according to the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme.
  • In one implementation, the determining of the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme corresponding to the text to be converted includes: inputting the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model; and inputting the text to be converted into a preset duration model to obtain the average pronunciation duration and the variance of the pronunciation duration of each phoneme output by the duration model.
  • In one implementation, the determining of the duration range of each phoneme according to the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme includes: determining the duration range of each phoneme by a normal distribution algorithm.
  • In one implementation, the determining of any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes: for each phoneme, obtaining the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme; and determining the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
  • In another implementation, the determining of any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes: obtaining user data, the user data including age information and personality information of the user; and determining the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
  • In one implementation, the generating of voice data according to the text to be converted and the phoneme duration of each phoneme includes: generating the voice data through a preset acoustic model and a vocoder.
  • an embodiment of the present application provides a speech synthesis device, including:
  • the range determination module is used to determine the duration range of each phoneme corresponding to the text to be converted
  • a duration determining module configured to determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme
  • the generating module is used to generate voice data according to the text to be converted and the phoneme duration of each phoneme.
  • The range determining module is specifically configured to determine the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme corresponding to the text to be converted, and to determine the duration range of each phoneme according to the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme.
  • Further, the range determination module is specifically configured to input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model, and to input the text to be converted into a preset duration model to obtain the average pronunciation duration and the variance of the pronunciation duration of each phoneme output by the duration model.
  • Further, the range determination module is specifically configured to determine the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme.
  • The duration determining module is specifically configured to obtain, for each phoneme, the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme, and to determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
  • The duration determination module is specifically configured to obtain user data, where the user data includes age information and personality information of the user, and to determine the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
  • The generating module is specifically configured to generate the voice data through a preset acoustic model and a vocoder according to the text to be converted and the phoneme duration of each phoneme.
  • An embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the speech synthesis method described in any one of the above first aspects.
  • An embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the speech synthesis method described in any one of the above first aspects.
  • the embodiments of the present application provide a computer program product, which when the computer program product runs on a terminal device, causes the terminal device to execute the speech synthesis method described in any one of the above-mentioned first aspects.
  • An embodiment of the present application provides a chip system; the chip system includes a processor, the processor is coupled to a memory, and the processor executes a computer program stored in the memory to implement the speech synthesis method described in any one of the above first aspects.
  • the chip system may be a single chip or a chip module composed of multiple chips.
  • The embodiments of this application determine the duration range of each phoneme corresponding to the text to be converted, then determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generate speech data according to the text to be converted and the phoneme duration of each phoneme.
  • When the same text to be converted is synthesized multiple times, the phoneme duration of the same phoneme may take different values within the same duration range, so a variety of different speech data can be synthesized. This avoids obtaining identical speech data each time the same text to be converted is synthesized, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.
  • FIG. 1 is a schematic diagram of a speech synthesis scenario involved in a speech synthesis method provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of another speech synthesis scenario involved in the speech synthesis method provided by an embodiment of the present application;
  • FIG. 3 is a structural block diagram of a terminal device provided by an embodiment of the present application;
  • FIG. 4 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of determining the duration range of a phoneme according to an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a duration range provided by an embodiment of the present application;
  • FIG. 7 is a structural block diagram of a speech synthesis device provided by an embodiment of the present application.
  • The speech synthesis method provided by the embodiments of this application can be applied to mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, personal digital assistants (PDA), and other terminal devices.
  • The terminal device may be a station (STATION, ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device with wireless communication functions, a computing device or other processing device connected to a wireless modem, an in-vehicle device, an Internet-of-Vehicles terminal, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, a wireless modem card, a television set-top box (STB), customer premise equipment (CPE), and/or other equipment used to communicate on a wireless system, as well as a next-generation communication device, for example, a mobile terminal in a 5G network or a mobile terminal in a future evolved Public Land Mobile Network (PLMN), etc.
  • A wearable device may also be a general term for devices developed by applying wearable technology to the intelligent design of everyday wear, such as glasses, gloves, watches, clothing, and shoes.
  • a wearable device is a portable device that is directly worn on the body or integrated into the user's clothes or accessories.
  • A wearable device is not merely a hardware device; it also realizes powerful functions through software support, data interaction, and cloud interaction.
  • Generalized wearable smart devices include full-featured, large-sized devices that can implement complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on only a certain type of application function and need to be used in conjunction with other devices such as smartphones, for example, various smart bracelets and smart jewelry for vital sign monitoring.
  • FIG. 1 is a schematic diagram of a speech synthesis scenario involved in a speech synthesis method provided by an embodiment of the present application. Referring to FIG. 1, the phoneme duration of each phoneme corresponding to the text to be converted can be adjusted, so that different voice data can be generated based on the same text to be converted.
  • The terminal device 110 may obtain the text to be converted and input it into a pre-trained text analysis model and a pre-trained duration model, respectively, to obtain the average pronunciation duration, the variance of the pronunciation duration, and the pronunciation duration distribution density of each phoneme corresponding to the text to be converted, and may then determine the duration range of each phoneme from these statistics based on the normal distribution rule.
  • The terminal device 110 may then, based on the duration range of each phoneme, combine the text semantic information of each phoneme in the text to be converted and/or pre-stored user data including age information and personality information of the user, determine any duration in the duration range as the phoneme duration of each phoneme, and thereby generate speech data according to each phoneme duration and the text to be converted.
  • For the same phoneme, the average pronunciation duration represents the mean of its pronunciation durations; the variance of the pronunciation duration represents the degree to which its individual pronunciation durations differ from the average pronunciation duration; and the distribution density of the pronunciation duration represents the probability of the same phoneme taking different phoneme durations.
  • A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; it is analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme. For example, under the pronunciation rules of Pinyin, the initial of each character's Pinyin can be treated as one phoneme and the final as another phoneme. In “天气” (“weather”), the phonemes corresponding to the character “天” can include “t” and “ian”, and the phonemes corresponding to the character “气” can include “q” and “i”.
  • The speech synthesis scenario may also include a server 120, and the terminal device 110 may be connected to the server 120, so that the server 120 can convert the text to be converted and obtain different voice data based on the same text to be converted.
  • The terminal device 110 may first send the text to be converted to the server 120. The server 120 may determine the duration range of each phoneme according to the text to be converted, then combine the semantic information of the text to be converted with pre-stored user data to determine the phoneme duration of each phoneme from each duration range, generate voice data according to each phoneme duration and the text to be converted, and send the generated voice data to the terminal device 110; the terminal device 110 can then receive and play the voice data generated by the server 120.
  • The following embodiments are described by taking, as an example, a speech synthesis scenario that includes the terminal device 110 and does not include the server 120. In actual applications, the voice data may be obtained by conversion not only by the terminal device 110 but also by the server 120; this is not limited in the embodiments of the present application.
  • Fig. 3 is a structural block diagram of a terminal device provided by an embodiment of the present application.
  • The terminal device may include a processor 310, an external memory interface 320, an internal memory 321, a universal serial bus (USB) interface 330, a charging management module 340, a power management module 341, a battery 342, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 370A, a receiver 370B, a microphone 370C, an earphone interface 370D, a sensor module 380, buttons 390, a motor 391, an indicator 392, a camera 393, a display screen 394, a subscriber identification module (SIM) card interface 395, etc.
  • The sensor module 380 may include a pressure sensor 380A, a gyroscope sensor 380B, an air pressure sensor 380C, a magnetic sensor 380D, an acceleration sensor 380E, a distance sensor 380F, a proximity light sensor 380G, a fingerprint sensor 380H, a temperature sensor 380J, a touch sensor 380K, an ambient light sensor 380L, a bone conduction sensor 380M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the terminal device.
  • the terminal device may include more or fewer components than shown in the figure, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the processor 310 may include one or more processing units.
  • The processor 310 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller can be the nerve center and command center of the terminal device.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 310 to store instructions and data.
  • the memory in the processor 310 is a cache memory.
  • The memory can store instructions or data that the processor 310 has just used or used cyclically. If the processor 310 needs to use the instructions or data again, it can call them directly from the memory, which avoids repeated accesses, reduces the waiting time of the processor 310, and improves system efficiency.
  • the processor 310 may include one or more interfaces.
  • The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus, which includes a serial data line (SDA) and a serial clock line (SCL).
  • the processor 310 may include multiple sets of I2C buses.
  • the processor 310 may couple the touch sensor 380K, charger, flash, camera 393, etc., respectively through different I2C bus interfaces.
  • the processor 310 may couple the touch sensor 380K through an I2C interface, so that the processor 310 and the touch sensor 380K communicate through the I2C bus interface to realize the touch function of the terminal device.
  • the I2S interface can be used for audio communication.
  • the processor 310 may include multiple sets of I2S buses.
  • the processor 310 may be coupled with the audio module 370 through an I2S bus to implement communication between the processor 310 and the audio module 370.
  • the audio module 370 may transmit audio signals to the wireless communication module 360 through the I2S interface, so as to realize the function of answering calls through the Bluetooth headset.
  • the PCM interface can also be used for audio communication to sample, quantize and encode analog signals.
  • the audio module 370 and the wireless communication module 360 may be coupled through a PCM bus interface.
  • the audio module 370 may also transmit audio signals to the wireless communication module 360 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus can be a two-way communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • the UART interface is generally used to connect the processor 310 and the wireless communication module 360.
  • the processor 310 communicates with the Bluetooth module in the wireless communication module 360 through the UART interface to realize the Bluetooth function.
  • the audio module 370 may transmit audio signals to the wireless communication module 360 through the UART interface, so as to realize the function of playing music through the Bluetooth headset.
  • the MIPI interface can be used to connect the processor 310 with the display screen 394, the camera 393 and other peripheral devices.
  • the MIPI interface includes camera serial interface (camera serial interface, CSI), display serial interface (display serial interface, DSI), etc.
  • the processor 310 and the camera 393 communicate through a CSI interface to implement the shooting function of the terminal device.
  • the processor 310 and the display screen 394 communicate through the DSI interface to realize the display function of the terminal device.
  • the GPIO interface can be configured through software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 310 with the camera 393, the display screen 394, the wireless communication module 360, the audio module 370, the sensor module 380, and so on.
  • the GPIO interface can also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, etc.
  • the USB interface 330 is an interface that complies with the USB standard specification, and specifically may be a Mini USB interface, a Micro USB interface, a USB Type C interface, and so on.
  • the USB interface 330 can be used to connect a charger to charge the terminal device, and can also be used to transfer data between the terminal device and peripheral devices. It can also be used to connect earphones and play audio through earphones.
  • the interface can also be used to connect other electronic devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is merely a schematic description, and does not constitute a structural limitation of the terminal device.
  • the terminal device may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the charging management module 340 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 340 may receive the charging input of the wired charger through the USB interface 330.
  • the charging management module 340 may receive the wireless charging input through the wireless charging coil of the terminal device. While the charging management module 340 charges the battery 342, it can also supply power to the electronic device through the power management module 341.
  • the power management module 341 is used to connect the battery 342, the charging management module 340 and the processor 310.
  • the power management module 341 receives input from the battery 342 and/or the charge management module 340, and supplies power to the processor 310, the internal memory 321, the external memory, the display screen 394, the camera 393, and the wireless communication module 360.
  • the power management module 341 can also be used to monitor parameters such as battery capacity, battery cycle times, and battery health status (leakage, impedance).
  • the power management module 341 may also be provided in the processor 310.
  • the power management module 341 and the charging management module 340 may also be provided in the same device.
  • the wireless communication function of the terminal device can be realized by the antenna 1, the antenna 2, the mobile communication module 350, the wireless communication module 360, the modem processor, and the baseband processor.
  • the antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the terminal device can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • antenna 1 can be multiplexed as a diversity antenna of a wireless local area network.
  • the antenna can be used in combination with a tuning switch.
  • the mobile communication module 350 can provide wireless communication solutions including 2G/3G/4G/5G, etc., which are applied to terminal devices.
  • the mobile communication module 350 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
  • the mobile communication module 350 can receive electromagnetic waves by the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 350 can also amplify the signal modulated by the modem processor, and convert it to electromagnetic wave radiation via the antenna 1.
  • at least part of the functional modules of the mobile communication module 350 may be provided in the processor 310.
  • at least part of the functional modules of the mobile communication module 350 and at least part of the modules of the processor 310 may be provided in the same device.
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low-frequency baseband signal is processed by the baseband processor and then passed to the application processor.
  • the application processor outputs sound signals through audio equipment (not limited to the speaker 370A, the receiver 370B, etc.), or displays images or videos through the display screen 394.
  • the modem processor may be an independent device.
  • the modem processor may be independent of the processor 310 and be provided in the same device as the mobile communication module 350 or other functional modules.
  • The wireless communication module 360 can provide wireless communication solutions applied to the terminal device, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 360 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 360 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 310.
  • The wireless communication module 360 can also receive the signal to be sent from the processor 310, perform frequency modulation and amplification on it, and convert it into electromagnetic waves for radiation via the antenna 2.
  • the antenna 1 of the terminal device is coupled with the mobile communication module 350, and the antenna 2 is coupled with the wireless communication module 360, so that the terminal device can communicate with the network and other devices through wireless communication technology.
  • The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the Beidou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
  • the terminal device realizes the display function through the GPU, the display screen 394, and the application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 394 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 310 may include one or more GPUs, which execute program instructions to generate or change display information.
  • the display screen 394 is used to display images, videos, and the like.
  • the display screen 394 includes a display panel.
  • The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like.
  • the terminal device may include one or N display screens 394, and N is a positive integer greater than one.
  • the terminal device can realize the shooting function through ISP, camera 393, video codec, GPU, display screen 394 and application processor.
  • the ISP is used to process the data fed back by the camera 393. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transfers the electrical signal to the ISP for processing and is converted into an image visible to the naked eye.
  • ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 393.
  • the camera 393 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the terminal device may include 1 or N cameras 393, and N is a positive integer greater than 1.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the terminal device selects the frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the terminal device can support one or more video codecs.
  • the terminal device can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • NPU is a neural-network (NN) computing processor.
  • NPU can realize the intelligent cognition of terminal equipment and other applications, such as: image recognition, face recognition, speech recognition, text understanding, etc.
  • the external memory interface 320 may be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the terminal device.
  • the external memory card communicates with the processor 310 through the external memory interface 320 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 321 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 310 executes various functional applications and data processing of the terminal device by running instructions stored in the internal memory 321.
  • the internal memory 321 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, at least one application program (such as a sound playback function, an image playback function, etc.) required by at least one function.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the terminal device.
  • the internal memory 321 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the terminal device can implement audio functions through the audio module 370, the speaker 370A, the receiver 370B, the microphone 370C, the earphone interface 370D, and the application processor. For example, music playback, recording, etc.
  • the audio module 370 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 370 can also be used to encode and decode audio signals.
  • the audio module 370 may be provided in the processor 310, or part of the functional modules of the audio module 370 may be provided in the processor 310.
  • The speaker 370A is used to convert audio electrical signals into sound signals.
  • the terminal device can listen to music through the speaker 370A, or listen to a hands-free call.
  • the receiver 370B also called “earpiece” is used to convert audio electrical signals into sound signals.
  • the terminal device answers a call or voice message, it can receive the voice by bringing the receiver 370B close to the human ear.
  • The microphone 370C is used to convert sound signals into electrical signals.
  • When making a sound, the user can bring the mouth close to the microphone 370C to input the sound signal into the microphone 370C.
  • the terminal device can be provided with at least one microphone 370C.
  • the terminal device can be provided with two microphones 370C, which can implement noise reduction functions in addition to collecting sound signals.
  • the terminal device may also be provided with three, four or more microphones 370C to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions.
  • the earphone interface 370D is used to connect wired earphones.
  • the earphone interface 370D may be a USB interface 330, or a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association (cellular telecommunications industry association of the USA, CTIA) standard interface.
  • the pressure sensor 380A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 380A may be provided on the display screen 394.
  • the capacitive pressure sensor may include at least two parallel plates with conductive materials. When a force is applied to the pressure sensor 380A, the capacitance between the electrodes changes. The terminal equipment determines the strength of the pressure based on the change in capacitance. When a touch operation acts on the display screen 394, the terminal device detects the intensity of the touch operation according to the pressure sensor 380A.
  • the terminal device may also calculate the touched position based on the detection signal of the pressure sensor 380A.
  • touch operations that act on the same touch position but have different touch operation intensities can correspond to different operation instructions. For example: when a touch operation whose intensity is less than the first pressure threshold is applied to the short message application icon, an instruction to view the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, an instruction to create a new short message is executed.
  • the gyroscope sensor 380B can be used to determine the motion posture of the terminal device. In some embodiments, the angular velocity of the terminal device around three axes (ie, x, y, and z axes) can be determined by the gyroscope sensor 380B.
  • the gyro sensor 380B can be used for shooting anti-shake. Exemplarily, when the shutter is pressed, the gyroscope sensor 380B detects the shake angle of the terminal device, calculates the distance that the lens module needs to compensate according to the angle, and allows the lens to counteract the shake of the terminal device through a reverse movement to achieve anti-shake.
  • the gyroscope sensor 380B can also be used for navigation and somatosensory game scenes.
  • the air pressure sensor 380C is used to measure air pressure.
  • the terminal device uses the air pressure value measured by the air pressure sensor 380C to calculate the altitude to assist positioning and navigation.
  • the magnetic sensor 380D includes a Hall sensor.
  • the terminal device can use the magnetic sensor 380D to detect the opening and closing of the flip holster.
  • When the terminal device is a flip phone, the terminal device can detect the opening and closing of the flip cover according to the magnetic sensor 380D, and then set features such as automatic unlocking upon opening according to the detected opening and closing state of the holster or flip cover.
  • the acceleration sensor 380E can detect the magnitude of the acceleration of the terminal device in various directions (generally three axes). When the terminal device is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the posture of electronic devices, and be used in applications such as horizontal and vertical screen switching, pedometers and so on.
  • The distance sensor 380F is used to measure distance; the terminal device can measure distance by infrared or laser. In some embodiments, when shooting a scene, the terminal device can use the distance sensor 380F to measure distance to achieve fast focusing.
  • the proximity light sensor 380G may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode.
  • the light emitting diode may be an infrared light emitting diode.
  • the terminal device emits infrared light to the outside through the light-emitting diode.
  • Terminal equipment uses photodiodes to detect infrared reflected light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device. When insufficient reflected light is detected, the terminal device can determine that there is no object near the terminal device.
  • the terminal device can use the proximity light sensor 380G to detect that the user holds the terminal device close to the ear to talk, so as to automatically turn off the screen to save power.
  • The proximity light sensor 380G can also be used in holster mode and pocket mode to automatically unlock and lock the screen.
  • the ambient light sensor 380L is used to sense the brightness of the ambient light.
  • the terminal device can adaptively adjust the brightness of the display screen 394 according to the perceived brightness of the ambient light.
  • the ambient light sensor 380L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 380L can also cooperate with the proximity light sensor 380G to detect whether the terminal device is in the pocket to prevent accidental touch.
  • the fingerprint sensor 380H is used to collect fingerprints. Terminal devices can use the collected fingerprint characteristics to unlock fingerprints, access application locks, take photos with fingerprints, and answer calls with fingerprints.
  • the temperature sensor 380J is used to detect temperature.
  • The terminal device executes a temperature processing strategy using the temperature detected by the temperature sensor 380J. For example, when the temperature reported by the temperature sensor 380J exceeds a threshold, the terminal device reduces the performance of a processor located near the temperature sensor 380J in order to reduce power consumption and implement thermal protection.
  • In some other embodiments, when the temperature is lower than another threshold, the terminal device heats the battery 342 to avoid abnormal shutdown of the terminal device due to low temperature.
  • In some other embodiments, when the temperature is lower than still another threshold, the terminal device boosts the output voltage of the battery 342 to avoid abnormal shutdown caused by low temperature.
  • The touch sensor 380K is also called a "touch panel". The touch sensor 380K can be arranged on the display screen 394; the touch sensor 380K and the display screen 394 form a touchscreen, which is also called a "touch screen".
  • the touch sensor 380K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • the visual output related to the touch operation can be provided through the display screen 394.
  • the touch sensor 380K may also be disposed on the surface of the terminal device, which is different from the position of the display screen 394.
  • the bone conduction sensor 380M can acquire vibration signals. In some embodiments, the bone conduction sensor 380M can acquire the vibration signal of the vibrating bone mass of the human voice. The bone conduction sensor 380M can also contact the human pulse and receive blood pressure beating signals. In some embodiments, the bone conduction sensor 380M may also be provided in the earphone, combined with the bone conduction earphone.
  • the audio module 370 can parse the voice signal based on the vibration signal of the vibrating bone block of the voice obtained by the bone conduction sensor 380M, and realize the voice function.
  • the application processor can analyze the heart rate information based on the blood pressure beating signal obtained by the bone conduction sensor 380M, and realize the heart rate detection function.
  • the button 390 includes a power-on button, a volume button, and so on.
  • the button 390 may be a mechanical button. It can also be a touch button.
  • the terminal device can receive key input and generate key signal input related to the user settings and function control of the terminal device.
  • the motor 391 can generate vibration prompts.
  • the motor 391 can be used for incoming call vibration notification, and can also be used for touch vibration feedback.
  • touch operations that act on different applications can correspond to different vibration feedback effects.
  • Touch operations acting on different areas of the display screen 394 can also correspond to different vibration feedback effects of the motor 391.
  • Different application scenarios (for example, time reminders, receiving messages, alarm clocks, and games) can also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also support customization.
  • the indicator 392 can be an indicator light, which can be used to indicate the charging status, power change, and can also be used to indicate messages, missed calls, notifications, and so on.
  • the SIM card interface 395 is used to connect to the SIM card.
  • the SIM card can be inserted into the SIM card interface 395 or pulled out from the SIM card interface 395 to achieve contact and separation with the terminal device.
  • the terminal device can support 1 or N SIM card interfaces, and N is a positive integer greater than 1.
  • the SIM card interface 395 can support Nano SIM cards, Micro SIM cards, SIM cards, etc.
  • the same SIM card interface 395 can insert multiple cards at the same time. The types of the multiple cards can be the same or different.
  • the SIM card interface 395 can also be compatible with different types of SIM cards.
  • the SIM card interface 395 can also be compatible with external memory cards.
  • the terminal device interacts with the network through the SIM card to realize functions such as call and data communication.
  • the terminal device adopts an eSIM, that is, an embedded SIM card.
  • the eSIM card can be embedded in the terminal device and cannot be separated from the terminal device.
  • Fig. 4 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application. As an example and not a limitation, the method can be applied to the above-mentioned terminal device. Referring to Fig. 4, the method includes:
  • Step 401 Determine the duration range of each phoneme corresponding to the text to be converted.
  • the terminal device can determine the phoneme duration of each phoneme corresponding to the text to be converted, and combine the language features of the text to be converted to generate voice data through a preset acoustic model and vocoder. Moreover, in the process of determining the phoneme duration, the duration range of each phoneme can be determined first, so that in subsequent steps, different phoneme durations can be selected based on the duration range to improve the naturalness and diversity of the generated speech data.
  • the terminal device can first extract each phoneme in the text to be converted, and then obtain the pronunciation information of each phoneme based on the pre-trained model, and then determine the duration range of each phoneme according to the pronunciation information of each phoneme.
  • this step 401 may include: step 401a and step 401b.
  • The terminal device can input the text to be converted into a pre-trained model, extract each phoneme of the text to be converted through the model, and analyze the text to be converted to determine the pronunciation information of each phoneme, such as the average pronunciation duration, the variance of the pronunciation duration, and the pronunciation duration distribution density, so that in subsequent steps, the duration range of each phoneme can be determined according to the pronunciation information.
  • The terminal device can first split the text to be converted to obtain multiple characters arranged in sequence, and then extract at least one phoneme for each character based on its pronunciation rules. After the phonemes of each character are extracted, the multiple phonemes of the text to be converted are obtained.
  • The terminal device can use the initial and the final corresponding to each character as that character's phonemes. If the text to be converted is "今天天气好晴朗" ("it is sunny today"), the multiple phonemes of the text to be converted can be, in turn, "j", "in", "t", "ian", "t", "ian", "q", "i", "h", "ao", "q", "ing", "l", and "ang"; a sketch of such a split follows below.
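  • As an illustration only, here is a minimal Python sketch of this initial/final split; the INITIALS table and the romanized input are assumptions for the example, and a real front end would use a grapheme-to-phoneme model rather than this toy rule.

```python
# Minimal sketch of a Pinyin initial/final split; the INITIALS table
# is an assumption for illustration, not an exhaustive list.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(syllable: str) -> list:
    """Split one romanized syllable into [initial, final] phonemes."""
    for initial in INITIALS:  # two-letter initials are listed first
        if syllable.startswith(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # zero-initial syllable such as "an"

# "jin tian tian qi hao qing lang" -> j, in, t, ian, t, ian, q, i, ...
text = "jin tian tian qi hao qing lang"
phonemes = [p for s in text.split() for p in split_syllable(s)]
print(phonemes)
```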
  • the terminal device can input the text to be converted into different models to obtain different pronunciation information.
  • The terminal device may input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model, where the text analysis model may be a deep neural network (DNN) model.
  • the terminal device may input the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation duration variance of each phoneme output by the duration model.
  • The terminal device can substitute the different parameters included in the pronunciation information into a preset calculation formula to obtain the duration range of each phoneme. For example, it can be assumed that the phoneme duration of each phoneme obeys a normal distribution, and the duration range of each phoneme can then be determined from the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme by using the normal distribution algorithm and the normal distribution formula.
  • Suppose the average pronunciation duration of the x-th phoneme is t(x), the variance of its pronunciation duration is std²(x), and its pronunciation duration distribution density is p(x). Assuming the duration obeys the normal distribution N(t(x), std²(x)), the equation p(x) = N(t(x), std²(x)) can be solved for the duration, yielding two solutions x1 and x2. If x1 is less than x2, the interval [x1, x2] can be used as the duration range of the x-th phoneme; a numeric sketch follows below.
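  • As a hedged numeric sketch of the normal distribution algorithm (the function name and example values are assumptions, not taken from the patent), the Gaussian density can be inverted in closed form:

```python
import math

def duration_range(mean_t, var_t, density_p):
    """Solve N(x; mean_t, var_t) = density_p for x.

    Assumes the phoneme duration obeys a normal distribution with mean
    mean_t and variance var_t, and that density_p is the pronunciation
    duration distribution density predicted for the phoneme. Returns
    (x1, x2) with x1 < x2, or None when density_p has no real solution
    (i.e. it exceeds the density's peak value at the mean).
    """
    std = math.sqrt(var_t)
    peak = 1.0 / (std * math.sqrt(2.0 * math.pi))  # density at x = mean_t
    if density_p <= 0.0 or density_p > peak:
        return None
    # From density_p = peak * exp(-(x - mean)^2 / (2 * var)):
    offset = std * math.sqrt(-2.0 * math.log(density_p / peak))
    return (mean_t - offset, mean_t + offset)

# Example: mean 0.12 s, variance 0.0004 s^2, target density 10.0
print(duration_range(0.12, 0.0004, 10.0))  # -> roughly (0.097, 0.144)
```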
  • Step 402 Determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme.
  • The terminal device can determine the phoneme duration of each phoneme based on the text semantic information of each phoneme in the text to be converted, based on the user's age and personality, or randomly, so that the terminal device can generate different voice data based on the same text to be converted.
  • For each phoneme, the terminal device can obtain the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme, and then determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
  • Specifically, the terminal device can first determine the text corresponding to the phoneme, locate the sentence containing that text in the text to be converted, analyze the semantics of that sentence, and combine it with the semantics of all sentences in the text to be converted to determine the text semantic information of the phoneme. After that, the terminal device can select the phoneme duration of the phoneme from the multiple durations corresponding to the duration range according to the text semantic information.
  • For example, depending on the text semantic information, a shorter duration can be selected from the duration range as the phoneme duration of the phoneme.
  • The terminal device may obtain user data, which may include the user's age information and personality information, and may determine the phoneme duration of each phoneme from the duration range of each phoneme according to the user's age and personality, thereby generating voice information that matches the user.
  • The terminal device can obtain pre-stored user data, or request user data from the server, determine the voice type that matches the user based on the user data, and then select, from the multiple durations in the duration range, the phoneme duration of the phoneme corresponding to the user's voice type.
  • For example, the voice type that matches the user may be a slow, unhurried type; accordingly, a long duration can be selected as the phoneme duration of the phoneme.
  • The user data may also include other information indicating the user's speech type.
  • For example, the user data may include search data indicating the user's emotions, and may also include shopping data indicating whether the user has recently purchased goods, etc.; the embodiments of the present application do not limit this.
  • If the terminal device can obtain both the text semantic information and the user data, it can further determine the phoneme duration of each phoneme according to the weights corresponding to the text semantic information and the user data. If the terminal device can obtain neither the text semantic information nor the user data, it can determine the phoneme duration of each phoneme according to the normal distribution of each phoneme in the text to be converted.
  • The embodiments of the present application do not limit the manner of determining the phoneme duration; one possible selection scheme is sketched below.
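  • The following Python sketch is an illustration only: the linear blend of normalized cues is an assumption, since the patent leaves the weighting open and requires only that some duration within the range be chosen.

```python
import random

def pick_phoneme_duration(dur_range, semantic_cue=None, user_cue=None):
    """Pick one duration from a phoneme's duration range.

    semantic_cue and user_cue are assumed scores in [0, 1] derived from
    the text semantic information and the user data (age, personality);
    0 favors the short end of the range and 1 the long end. With no cue
    available, sample uniformly at random, matching the "any duration
    in the duration range" wording of the method.
    """
    lo, hi = dur_range
    cues = [c for c in (semantic_cue, user_cue) if c is not None]
    if not cues:
        return random.uniform(lo, hi)
    blend = sum(cues) / len(cues)  # equal weights as a simple default
    return lo + blend * (hi - lo)

# Brisk semantics -> shorter duration; slow, unhurried user -> longer
print(pick_phoneme_duration((0.097, 0.144), semantic_cue=0.2))
print(pick_phoneme_duration((0.097, 0.144), user_cue=0.9))
```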
  • Step 403 Generate voice data according to the text to be converted and the phoneme duration of each phoneme.
  • The terminal device can use different methods to generate voice data based on the phoneme duration of each phoneme. For example, the terminal device can use a parametric method, a concatenative method, or an end-to-end method, and whichever method is used, the phoneme duration of each phoneme corresponding to the text to be converted can be determined in the foregoing manner.
  • The terminal device can first determine the phoneme duration of each phoneme according to the above method, and then input the phoneme durations and the language features extracted from the text to be converted into the acoustic model to obtain the fundamental frequency and other parameters used to generate the voice data.
  • the voice data is generated by the vocoder according to the fundamental frequency and other parameters.
  • the process of generating voice data in a splicing method or an end-to-end manner is similar to the process of generating voice data in the above-mentioned parameter method, and will not be repeated here.
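A deliberately toy sketch of the parameter method is given below: the phoneme durations fixed in step 402 decide how many acoustic frames each phoneme spans, while the acoustic model is reduced to a constant fundamental frequency and the vocoder to a sine-wave generator. The frame length, sample rate, and fundamental frequency are arbitrary assumptions; a real acoustic model and vocoder are far more elaborate.

```python
import math

def synthesize_parametric(phoneme_durations_ms, frame_ms=10, sample_rate=16000, f0=220.0):
    """Render a waveform whose segment lengths follow the phoneme durations.

    Each phoneme contributes round(duration / frame_ms) frames, and every
    frame is filled with sine samples at the 'predicted' fundamental
    frequency f0 (a stand-in for real acoustic-model output).
    """
    samples = []
    for dur_ms in phoneme_durations_ms:
        n_frames = max(1, round(dur_ms / frame_ms))
        n_samples = n_frames * frame_ms * sample_rate // 1000
        samples += [math.sin(2 * math.pi * f0 * t / sample_rate)
                    for t in range(n_samples)]
    return samples

waveform = synthesize_parametric([120.0, 90.0, 100.0, 80.0])  # four phoneme durations in ms
```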
To sum up, the speech synthesis method provided by the embodiments of this application determines the duration range of each phoneme corresponding to the text to be converted, then determines any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generates the voice data according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of voice data of the same text to be converted, the phoneme duration of the same phoneme may take different values from the same duration range, so a variety of different voice data can be synthesized. This avoids synthesizing the same voice data every time for the same text to be converted, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.
Moreover, because each phoneme duration is taken from within its duration range, its value will not deviate excessively, which avoids abnormal voice data caused by a phoneme duration that is too long or too short and improves the stability of speech synthesis.
FIG. 7 is a structural block diagram of a speech synthesis device provided in an embodiment of this application. For ease of description, only the parts related to the embodiments of this application are shown. Referring to FIG. 7, the device includes:
the range determining module 701, used to determine the duration range of each phoneme corresponding to the text to be converted;
the duration determining module 702, configured to determine any duration in the duration range as the phoneme duration of each phoneme;
the generating module 703, configured to generate voice data according to the text to be converted and the phoneme duration of each phoneme.
In a possible implementation, the range determining module 701 is specifically configured to determine the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted, and to determine the duration range of each phoneme according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
Optionally, the range determining module 701 is further specifically configured to input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model, and to input the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation duration variance of each phoneme output by the duration model.
Optionally, the range determining module 701 is further specifically configured to determine the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme; one possible form of such a computation is sketched below.
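The embodiment does not spell out how the normal distribution algorithm turns the three statistics into a range; one plausible reading, offered here purely as an assumption, is to keep all durations whose normal probability density is at least the given distribution density:

```python
import math

def duration_range(mean, variance, density):
    """Durations d with pdf(d) >= density under N(mean, variance).

    Solving exp(-(d - mean)^2 / (2 * variance)) / sqrt(2*pi*variance) >= density
    gives |d - mean| <= std * sqrt(2 * ln(peak / density)), where peak is
    the pdf's maximum value 1 / (std * sqrt(2*pi)).
    """
    std = math.sqrt(variance)
    peak = 1.0 / (std * math.sqrt(2.0 * math.pi))
    if density >= peak:
        return mean, mean  # threshold at or above the peak: degenerate range
    half_width = std * math.sqrt(2.0 * math.log(peak / density))
    return mean - half_width, mean + half_width

low, high = duration_range(mean=110.0, variance=225.0, density=0.01)  # approx. (89.0, 131.0)
```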
Optionally, the duration determining module 702 is specifically configured to obtain, for each phoneme, the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme, and to determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
Optionally, the duration determining module 702 is specifically configured to obtain user data, the user data including the user's age information and personality information, and to determine the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
Optionally, the generating module 703 is specifically configured to generate the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
To sum up, the speech synthesis device provided by the embodiments of this application determines the duration range of each phoneme corresponding to the text to be converted, then determines any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generates the voice data according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of voice data of the same text to be converted, the phoneme duration of the same phoneme may take different values from the same duration range, so a variety of different voice data can be synthesized. This avoids synthesizing the same voice data every time for the same text to be converted, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.
It should be understood that the disclosed device and method may be implemented in other ways. The device embodiment described above is merely illustrative. For example, the division of the modules or units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
The functional units in the various embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can be implemented by instructing the relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and when executed by the processor, the steps of the foregoing method embodiments can be implemented. The computer program includes computer program code, and the computer program code may be in the form of source code, object code, an executable file, or some intermediate form.
The computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a mobile hard disk, a floppy disk, or a CD-ROM. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electric carrier signals and telecommunications signals.

Abstract

A speech synthesis method and device, which are suitable for the fields of terminal artificial intelligence technology and text-to-speech technology. The method comprises: determining the duration range of each phoneme corresponding to text to be converted (401); determining any duration within the duration range of each phoneme to be a phoneme duration of a corresponding phoneme (402); and generating speech data according to the text and the phoneme duration of each phoneme (403). For multiple pieces of speech data of the same text to be converted, the phoneme duration of the same phoneme in the multiple pieces of speech data may have different values on the basis of the same duration range, and then a variety of different speech data may be synthesized, which avoids the synthesizing of the same speech data every time for the same text, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.

Description

Speech synthesis method and device
This application claims priority to the Chinese patent application No. 202010456116.1, filed with the State Intellectual Property Office on May 26, 2020 and entitled "Speech synthesis method and device", the entire content of which is incorporated herein by reference.
Technical field
This application belongs to the technical field of terminal artificial intelligence and the technical field of text-to-speech, and in particular relates to a speech synthesis method and device.
Background
With the continuous development of artificial intelligence technology, terminal devices can not only receive the voice information sent by the user, but also play voice information to the user. The user does not need to read the text displayed by the terminal device, and can learn the information displayed by the terminal device through hearing alone.
In the related art, the terminal device can obtain the text to be converted and perform feature extraction on it to obtain language features, then determine the phoneme duration of each phoneme corresponding to the text to be converted through the language features, and finally generate voice data according to the phoneme durations and the language features.
However, in the process of synthesizing voice data by the terminal device, the voice data generated multiple times for the same text to be converted is always the same, which leads to excessively mechanized speech synthesis.
Summary of the invention
The embodiments of this application provide a speech synthesis method and device, which can solve the problem of excessively mechanized speech synthesis.
In the first aspect, an embodiment of this application provides a speech synthesis method, including:
determining the duration range of each phoneme corresponding to the text to be converted;
determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme;
generating voice data according to the text to be converted and the phoneme duration of each phoneme.
In a first possible implementation of the first aspect, the determining the duration range of each phoneme corresponding to the text to be converted includes:
determining the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted;
determining the duration range of each phoneme according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
Based on the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the determining the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted includes:
inputting the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model;
inputting the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation duration variance of each phoneme output by the duration model.
Based on the first possible implementation of the first aspect, in a third possible implementation of the first aspect, the determining the duration range of each phoneme according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme includes:
determining the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
In a fourth possible implementation of the first aspect, the determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes:
for each phoneme, obtaining the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme;
determining the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
In a fifth possible implementation of the first aspect, the determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes:
acquiring user data, the user data including the user's age information and personality information;
determining the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
Based on any one of the possible implementations of the first aspect, in a sixth possible implementation of the first aspect, the generating voice data according to the text to be converted and the phoneme duration of each phoneme includes:
generating the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
In the second aspect, an embodiment of this application provides a speech synthesis device, including:
a range determining module, used to determine the duration range of each phoneme corresponding to the text to be converted;
a duration determining module, used to determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme;
a generating module, used to generate voice data according to the text to be converted and the phoneme duration of each phoneme.
In a first possible implementation of the second aspect, the range determining module is specifically configured to determine the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted, and to determine the duration range of each phoneme according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
Based on the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the range determining module is further specifically configured to input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model, and to input the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation duration variance of each phoneme output by the duration model.
Based on the first possible implementation of the second aspect, in a third possible implementation of the second aspect, the range determining module is further specifically configured to determine the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
In a fourth possible implementation of the second aspect, the duration determining module is specifically configured to obtain, for each phoneme, the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme, and to determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
In a fifth possible implementation of the second aspect, the duration determining module is specifically configured to acquire user data, the user data including the user's age information and personality information, and to determine the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
Based on any one of the possible implementations of the second aspect, in a sixth possible implementation of the second aspect, the generating module is specifically configured to generate the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
In the third aspect, an embodiment of this application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the speech synthesis method according to any one of the above first aspect.
In the fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the speech synthesis method according to any one of the above first aspect.
In the fifth aspect, an embodiment of this application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the speech synthesis method according to any one of the above first aspect.
In the sixth aspect, an embodiment of this application provides a chip system, the chip system including a processor coupled to a memory, the processor executing a computer program stored in the memory to implement the speech synthesis method according to any one of the above first aspect.
The chip system may be a single chip or a chip module composed of multiple chips.
Compared with the prior art, the embodiments of this application have the following beneficial effects:
The embodiments of this application determine the duration range of each phoneme corresponding to the text to be converted, then determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generate voice data according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of voice data of the same text to be converted, the phoneme duration of the same phoneme may take different values based on the same duration range, so a variety of different voice data can be synthesized, which avoids synthesizing the same voice data every time for the same text to be converted, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.
Description of the drawings
FIG. 1 is a schematic diagram of a speech synthesis scenario involved in the speech synthesis method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of another speech synthesis scenario involved in the speech synthesis method provided by an embodiment of this application;
FIG. 3 is a structural block diagram of a terminal device provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of determining the duration range of a phoneme provided by an embodiment of this application;
FIG. 6 is a schematic diagram of a duration range provided by an embodiment of this application;
FIG. 7 is a structural block diagram of a speech synthesis device provided by an embodiment of this application.
Detailed description of embodiments
In the following description, for the purpose of illustration rather than limitation, specific details such as specific system structures and technologies are set forth for a thorough understanding of the embodiments of this application. However, it should be clear to those skilled in the art that this application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted, so that unnecessary details do not obstruct the description of this application.
The terms used in the following embodiments are only for the purpose of describing specific embodiments, and are not intended to limit this application. As used in the specification and appended claims of this application, the singular expressions "a", "an", "said", "above", "the", and "this" are intended to also include expressions such as "one or more", unless the context clearly indicates the contrary. It should also be understood that in the embodiments of this application, "one or more" refers to one, two, or more than two; "and/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone, where A and B can be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The speech synthesis method provided by the embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA). The embodiments of this application do not impose any restrictions on the specific type of the terminal device.
For example, the terminal device may be a station (ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device with a wireless communication function, a computing device or another processing device connected to a wireless modem, an in-vehicle device, an Internet-of-Vehicles terminal, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, a wireless modem card, a television set top box (STB), customer premises equipment (CPE), and/or another device for communicating on a wireless system, as well as a next-generation communication system, for example, a mobile terminal in a 5G network or a mobile terminal in a future evolved public land mobile network (PLMN), etc.
As an example and not a limitation, when the terminal device is a wearable device, the wearable device may also be a general term for devices developed by applying wearable technology to the intelligent design of daily wear, such as glasses, gloves, watches, clothing, and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothes or accessories. A wearable device is not only a hardware device, but also realizes powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable smart devices include devices that are full-featured and large-sized and can realize complete or partial functions without relying on a smart phone, such as smart watches or smart glasses, as well as devices that only focus on a certain type of application function and need to be used in conjunction with other devices such as smart phones, such as various smart bracelets and smart jewelry for physical sign monitoring.
FIG. 1 is a schematic diagram of a speech synthesis scenario involved in the speech synthesis method provided by an embodiment of this application. Referring to FIG. 1, the speech synthesis scenario may include a terminal device 110. The terminal device 110 can obtain the text to be converted and adjust the phoneme duration of each phoneme corresponding to the text to be converted, so that different voice data can be generated based on the same text to be converted.
In a possible implementation, the terminal device 110 may obtain the text to be converted and input it into a pre-trained text analysis model and a pre-trained duration model respectively, to obtain the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted, so that the duration range of each phoneme can be determined based on the normal distribution rule according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
After that, the terminal device 110 may, based on the duration range of each phoneme, in combination with the text semantic information corresponding to each phoneme in the text to be converted and/or pre-stored user data including the user's age information and personality information, determine any duration in the duration range as the phoneme duration of each phoneme, and then generate voice data according to each phoneme duration and the text to be converted.
The average pronunciation duration represents the average of the phoneme durations of the same phoneme, the pronunciation duration variance represents the degree to which the phoneme durations of the same phoneme deviate from the average pronunciation duration, and the pronunciation duration distribution density represents the probability of the same phoneme taking different phoneme durations.
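For reference, these three quantities can be written in the standard textbook form below, over N observed durations d_1, ..., d_N of the same phoneme; the formulas are the usual definitions, not ones given by this embodiment:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} d_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (d_i - \mu)^2, \qquad
f(d) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(d - \mu)^2}{2\sigma^2}\right)
```

Under this reading, f(d) plays the role of the pronunciation duration distribution density: durations near the average μ are the most probable, and the variance σ² controls how quickly the probability falls off.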
In addition, a phoneme is the smallest phonetic unit divided according to the natural attributes of speech. Analyzed according to the pronunciation actions within a syllable, one action constitutes one phoneme. For example, taking the pronunciation rules of pinyin as an example, the initial corresponding to the pinyin of each character can be regarded as one phoneme and the final of the pinyin as another phoneme. For example, in "天气" (weather), the phonemes corresponding to the character "天" may include "t" and "ian", and the phonemes corresponding to the character "气" may include "q" and "i" (a toy decomposition of this example is sketched below).
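The following sketch writes out this example; the two-entry lookup table is a hypothetical stand-in for a full grapheme-to-pinyin dictionary.

```python
# Hypothetical initial/final table for the two characters of the example.
PINYIN = {"天": ("t", "ian"), "气": ("q", "i")}

def to_phonemes(text):
    """Split each character's pinyin into an initial and a final,
    treating each part as one phoneme."""
    phonemes = []
    for ch in text:
        initial, final = PINYIN[ch]
        phonemes.extend([initial, final])
    return phonemes

print(to_phonemes("天气"))  # ['t', 'ian', 'q', 'i']
```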
It should be noted that, in actual applications, referring to FIG. 2, the speech synthesis scenario may also include a server 120. The terminal device 110 may be connected to the server 120, so that the server 120 can convert the text to be converted and obtain different voice data based on the same text to be converted.
In the process of generating voice data, the terminal device 110 may first send the text to be converted to the server 120. The server 120 may determine the duration range of each phoneme according to the text to be converted, then determine the phoneme duration of each phoneme from each duration range in combination with the semantic information of the text to be converted and the pre-stored user data, generate voice data according to each phoneme duration and the text to be converted, and send the generated voice data to the terminal device 110. The terminal device 110 may then receive and play the voice data generated by the server 120.
For simplicity of description, the following embodiments only take the case where the speech synthesis scenario includes the terminal device 110 and does not include the server 120 as an example. In actual applications, the voice data can be obtained through conversion not only by the terminal device 110 but also by the server 120, which is not limited in the embodiments of this application.
FIG. 3 is a structural block diagram of a terminal device provided by an embodiment of this application. Referring to FIG. 3, the terminal device may include a processor 310, an external memory interface 320, an internal memory 321, a universal serial bus (USB) interface 330, a charging management module 340, a power management module 341, a battery 342, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 370A, a receiver 370B, a microphone 370C, an earphone interface 370D, a sensor module 380, buttons 390, a motor 391, an indicator 392, a camera 393, a display screen 394, a subscriber identification module (SIM) card interface 395, and the like. The sensor module 380 may include a pressure sensor 380A, a gyroscope sensor 380B, an air pressure sensor 380C, a magnetic sensor 380D, an acceleration sensor 380E, a distance sensor 380F, a proximity light sensor 380G, a fingerprint sensor 380H, a temperature sensor 380J, a touch sensor 380K, an ambient light sensor 380L, a bone conduction sensor 380M, and the like.
It can be understood that the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the terminal device. In other embodiments of this application, the terminal device may include more or fewer components than shown in the figure, or combine certain components, or split certain components, or have a different component arrangement. The illustrated components can be implemented in hardware, software, or a combination of software and hardware.
The processor 310 may include one or more processing units. For example, the processor 310 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be independent devices or integrated in one or more processors.
The controller may be the nerve center and command center of the terminal device. The controller can generate operation control signals according to instruction operation codes and timing signals to complete the control of fetching and executing instructions.
A memory may also be provided in the processor 310 to store instructions and data. In some embodiments, the memory in the processor 310 is a cache. The memory can store instructions or data that the processor 310 has just used or used cyclically. If the processor 310 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided, and the waiting time of the processor 310 is reduced, thus improving the efficiency of the system.
In some embodiments, the processor 310 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
The I2C interface is a bidirectional synchronous serial bus, including a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 310 may include multiple sets of I2C buses. The processor 310 may be respectively coupled to the touch sensor 380K, the charger, the flash, the camera 393, etc. through different I2C bus interfaces. For example, the processor 310 may be coupled to the touch sensor 380K through an I2C interface, so that the processor 310 and the touch sensor 380K communicate through the I2C bus interface to realize the touch function of the terminal device.
The I2S interface can be used for audio communication. In some embodiments, the processor 310 may include multiple sets of I2S buses. The processor 310 may be coupled with the audio module 370 through an I2S bus to implement communication between the processor 310 and the audio module 370. In some embodiments, the audio module 370 may transmit audio signals to the wireless communication module 360 through the I2S interface, so as to realize the function of answering calls through a Bluetooth headset.
The PCM interface can also be used for audio communication to sample, quantize, and encode analog signals. In some embodiments, the audio module 370 and the wireless communication module 360 may be coupled through a PCM bus interface. In some embodiments, the audio module 370 may also transmit audio signals to the wireless communication module 360 through the PCM interface, so as to realize the function of answering calls through a Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communication. The bus may be a two-way communication bus that converts the data to be transmitted between serial communication and parallel communication. In some embodiments, the UART interface is generally used to connect the processor 310 and the wireless communication module 360. For example, the processor 310 communicates with the Bluetooth module in the wireless communication module 360 through the UART interface to realize the Bluetooth function. In some embodiments, the audio module 370 may transmit audio signals to the wireless communication module 360 through the UART interface, so as to realize the function of playing music through a Bluetooth headset.
The MIPI interface can be used to connect the processor 310 with peripheral devices such as the display screen 394 and the camera 393. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), etc. In some embodiments, the processor 310 and the camera 393 communicate through the CSI interface to implement the shooting function of the terminal device. The processor 310 and the display screen 394 communicate through the DSI interface to implement the display function of the terminal device.
The GPIO interface can be configured through software. The GPIO interface can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 310 with the camera 393, the display screen 394, the wireless communication module 360, the audio module 370, the sensor module 380, and so on. The GPIO interface can also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, etc.
The USB interface 330 is an interface that complies with the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, etc. The USB interface 330 can be used to connect a charger to charge the terminal device, and can also be used to transfer data between the terminal device and peripheral devices. It can also be used to connect earphones and play audio through the earphones. The interface can also be used to connect other electronic devices, such as AR devices.
It can be understood that the interface connection relationship between the modules illustrated in the embodiment of the present invention is merely a schematic illustration, and does not constitute a structural limitation on the terminal device. In other embodiments of this application, the terminal device may also adopt an interface connection manner different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
The charging management module 340 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 340 may receive the charging input of a wired charger through the USB interface 330. In some wireless charging embodiments, the charging management module 340 may receive wireless charging input through the wireless charging coil of the terminal device. While charging the battery 342, the charging management module 340 can also supply power to the electronic device through the power management module 341.
The power management module 341 is used to connect the battery 342, the charging management module 340, and the processor 310. The power management module 341 receives input from the battery 342 and/or the charging management module 340, and supplies power to the processor 310, the internal memory 321, the external memory, the display screen 394, the camera 393, the wireless communication module 360, etc. The power management module 341 can also be used to monitor parameters such as battery capacity, battery cycle times, and battery health status (leakage, impedance). In some other embodiments, the power management module 341 may also be provided in the processor 310. In other embodiments, the power management module 341 and the charging management module 340 may also be provided in the same device.
The wireless communication function of the terminal device can be realized by the antenna 1, the antenna 2, the mobile communication module 350, the wireless communication module 360, the modem processor, the baseband processor, etc.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the terminal device can be used to cover a single or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, the antenna 1 can be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas can be used in combination with a tuning switch.
The mobile communication module 350 can provide wireless communication solutions including 2G/3G/4G/5G applied to the terminal device. The mobile communication module 350 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), etc. The mobile communication module 350 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 350 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation through the antenna 1. In some embodiments, at least part of the functional modules of the mobile communication module 350 may be provided in the processor 310. In some embodiments, at least part of the functional modules of the mobile communication module 350 and at least part of the modules of the processor 310 may be provided in the same device.
The modem processor may include a modulator and a demodulator. The modulator is used to modulate the low-frequency baseband signal to be sent into a medium- or high-frequency signal. The demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 370A, the receiver 370B, etc.), or displays an image or video through the display screen 394. In some embodiments, the modem processor may be an independent device. In other embodiments, the modem processor may be independent of the processor 310 and provided in the same device as the mobile communication module 350 or other functional modules.
The wireless communication module 360 can provide wireless communication solutions applied to the terminal device, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc. The wireless communication module 360 may be one or more devices integrating at least one communication processing module. The wireless communication module 360 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 310. The wireless communication module 360 can also receive the signal to be sent from the processor 310, perform frequency modulation and amplification on it, and convert it into electromagnetic waves for radiation through the antenna 2.
In some embodiments, the antenna 1 of the terminal device is coupled with the mobile communication module 350, and the antenna 2 is coupled with the wireless communication module 360, so that the terminal device can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
终端设备通过GPU,显示屏394,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏394和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器310可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。The terminal device realizes the display function through the GPU, the display screen 394, and the application processor. The GPU is an image processing microprocessor, which is connected to the display screen 394 and the application processor. The GPU is used to perform mathematical and geometric calculations and is used for graphics rendering. The processor 310 may include one or more GPUs, which execute program instructions to generate or change display information.
显示屏394用于显示图像,视频等。显示屏394包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,终端设备可以包括1个或N个显示屏394,N 为大于1的正整数。The display screen 394 is used to display images, videos, and the like. The display screen 394 includes a display panel. The display panel can adopt liquid crystal display (LCD), organic light-emitting diode (OLED), active matrix organic light-emitting diode or active-matrix organic light-emitting diode (active-matrix organic light-emitting diode). AMOLED, flexible light-emitting diode (FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diode (QLED), etc. In some embodiments, the terminal device may include one or N display screens 394, and N is a positive integer greater than one.
终端设备可以通过ISP,摄像头393,视频编解码器,GPU,显示屏394以及应用处理器等实现拍摄功能。The terminal device can realize the shooting function through ISP, camera 393, video codec, GPU, display screen 394 and application processor.
ISP用于处理摄像头393反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头393中。The ISP is used to process the data fed back by the camera 393. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transfers the electrical signal to the ISP for processing and is converted into an image visible to the naked eye. ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 393.
摄像头393用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,终端设备可以包括1个或N个摄像头393,N为大于1的正整数。The camera 393 is used to capture still images or videos. The object generates an optical image through the lens and is projected to the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal. ISP outputs digital image signals to DSP for processing. DSP converts digital image signals into standard RGB, YUV and other formats of image signals. In some embodiments, the terminal device may include 1 or N cameras 393, and N is a positive integer greater than 1.
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当终端设备在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the terminal device selects the frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
The video codec is used to compress or decompress digital video. The terminal device may support one or more video codecs, so that it can play or record videos in multiple encoding formats, such as Moving Picture Experts Group (MPEG)-1, MPEG-2, MPEG-3, and MPEG-4.

The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer mode between neurons in the human brain, it processes input information quickly and can also learn continuously. Applications involving intelligent cognition of the terminal device, such as image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.

The external memory interface 320 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device. The external memory card communicates with the processor 310 through the external memory interface 320 to implement a data-storage function, for example saving files such as music and videos on the external memory card.

The internal memory 321 may be used to store computer-executable program code, where the executable program code includes instructions. The processor 310 executes the various functional applications and data processing of the terminal device by running the instructions stored in the internal memory 321. The internal memory 321 may include a program-storage area and a data-storage area. The program-storage area may store an operating system and the application programs required by at least one function (for example, a sound-playback function or an image-playback function). The data-storage area may store data created during use of the terminal device (for example, audio data or a phone book). In addition, the internal memory 321 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).

The terminal device can implement audio functions, for example music playback and recording, through the audio module 370, the speaker 370A, the receiver 370B, the microphone 370C, the earphone interface 370D, the application processor, and the like.

The audio module 370 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 370 may also be used to encode and decode audio signals. In some embodiments, the audio module 370 may be disposed in the processor 310, or some functional modules of the audio module 370 may be disposed in the processor 310.

The speaker 370A, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal. The terminal device can play music or take a hands-free call through the speaker 370A.

The receiver 370B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal. When the terminal device answers a call or plays a voice message, the voice can be heard by bringing the receiver 370B close to the ear.

The microphone 370C, also called a "mic" or "mouthpiece", is used to convert a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 370C to input the sound signal. The terminal device may be provided with at least one microphone 370C. In other embodiments, the terminal device may be provided with two microphones 370C, which can implement a noise-reduction function in addition to collecting sound signals. In still other embodiments, the terminal device may be provided with three, four, or more microphones 370C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.

The earphone interface 370D is used to connect wired earphones. The earphone interface 370D may be the USB interface 330, a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.

The pressure sensor 380A is used to sense a pressure signal and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 380A may be disposed on the display screen 394. There are many types of pressure sensor 380A, such as resistive, inductive, and capacitive pressure sensors. A capacitive pressure sensor may include at least two parallel plates of conductive material; when a force acts on the pressure sensor 380A, the capacitance between the electrodes changes, and the terminal device determines the intensity of the pressure from the change in capacitance. When a touch operation acts on the display screen 394, the terminal device detects the intensity of the touch operation through the pressure sensor 380A, and may also calculate the touch position from the detection signal of the pressure sensor 380A. In some embodiments, touch operations that act on the same touch position but with different intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the short-message application icon, an instruction to view the short message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the short-message application icon, an instruction to create a new short message is executed.
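For illustration only, the threshold-based dispatch just described can be sketched as follows; the threshold value and the instruction names are assumptions, not values from the patent:

```python
FIRST_PRESSURE_THRESHOLD = 0.5  # assumed normalized pressure value

def on_message_icon_touch(pressure: float) -> str:
    """Map the touch pressure on the SMS icon to an operation instruction."""
    if pressure < FIRST_PRESSURE_THRESHOLD:
        return "view_short_message"      # light press: open the message
    return "create_short_message"        # firm press: compose a new message

print(on_message_icon_touch(0.3))  # "view_short_message"
```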
The gyroscope sensor 380B may be used to determine the motion posture of the terminal device. In some embodiments, the angular velocities of the terminal device around three axes (that is, the x, y, and z axes) can be determined through the gyroscope sensor 380B. The gyroscope sensor 380B can be used for image stabilization during shooting. For example, when the shutter is pressed, the gyroscope sensor 380B detects the angle at which the terminal device shakes, calculates the distance that the lens module needs to compensate for according to that angle, and lets the lens counteract the shake of the terminal device through a reverse movement, thereby achieving stabilization. The gyroscope sensor 380B can also be used for navigation and motion-sensing game scenarios.

The barometric pressure sensor 380C is used to measure air pressure. In some embodiments, the terminal device calculates altitude from the pressure value measured by the barometric pressure sensor 380C, to assist positioning and navigation.

The magnetic sensor 380D includes a Hall sensor. The terminal device can use the magnetic sensor 380D to detect the opening and closing of a flip holster. In some embodiments, when the terminal device is a flip phone, the terminal device can detect the opening and closing of the flip cover through the magnetic sensor 380D, and then set features such as automatic unlocking upon flip-open according to the detected open or closed state of the holster or of the flip cover.

The acceleration sensor 380E can detect the magnitude of the terminal device's acceleration in various directions (generally along three axes), and can detect the magnitude and direction of gravity when the terminal device is stationary. It can also be used to identify the posture of the electronic device, and is applied in scenarios such as landscape/portrait switching and pedometers.

The distance sensor 380F is used to measure distance. The terminal device can measure distance by infrared or laser. In some embodiments, in a shooting scenario, the terminal device can use the distance sensor 380F to measure distance for fast focusing.

The proximity light sensor 380G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The light-emitting diode may be an infrared LED. The terminal device emits infrared light outward through the LED and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, the terminal device can determine that there is an object near it; when insufficient reflected light is detected, the terminal device can determine that there is no object near it. The terminal device can use the proximity light sensor 380G to detect that the user is holding the device close to the ear during a call, so that the screen is automatically turned off to save power. The proximity light sensor 380G can also be used for automatic unlocking and screen locking in holster mode and pocket mode.
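A minimal sketch of this proximity logic, assuming a normalized photodiode reading and an arbitrary threshold (both are illustrative, not from the patent):

```python
REFLECTION_THRESHOLD = 0.8  # assumed reading above which an object is "near"

def object_nearby(reflected_ir_level: float) -> bool:
    """Decide from the photodiode reading whether an object is close by."""
    return reflected_ir_level >= REFLECTION_THRESHOLD

def screen_action_during_call(reflected_ir_level: float) -> str:
    # Device held to the ear: darken the screen to save power; otherwise wake it.
    return "screen_off" if object_nearby(reflected_ir_level) else "screen_on"

print(screen_action_during_call(0.93))  # "screen_off"
```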
The ambient light sensor 380L is used to sense ambient brightness. The terminal device can adaptively adjust the brightness of the display screen 394 according to the perceived ambient brightness. The ambient light sensor 380L can also be used to automatically adjust the white balance when taking photos, and can cooperate with the proximity light sensor 380G to detect whether the terminal device is in a pocket, to prevent accidental touches.

The fingerprint sensor 380H is used to collect fingerprints. The terminal device can use the collected fingerprint characteristics to implement fingerprint unlocking, application-lock access, fingerprint photographing, fingerprint call answering, and so on.

The temperature sensor 380J is used to detect temperature. In some embodiments, the terminal device executes a temperature-handling policy using the temperature detected by the temperature sensor 380J. For example, when the temperature reported by the temperature sensor 380J exceeds a threshold, the terminal device reduces the performance of a processor located near the temperature sensor 380J, so as to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, the terminal device heats the battery 342 to avoid an abnormal shutdown caused by low temperature. In still other embodiments, when the temperature is below yet another threshold, the terminal device boosts the output voltage of the battery 342 to avoid an abnormal shutdown caused by low temperature.
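The three thresholds above are unspecified in the patent; a minimal sketch of such a policy, with assumed threshold values and action names, might look like:

```python
HIGH_TEMP_C = 45.0       # assumed thermal-protection threshold
LOW_TEMP_C = 0.0         # assumed battery-heating threshold
VERY_LOW_TEMP_C = -10.0  # assumed voltage-boost threshold

def temperature_policy(temp_c: float) -> list:
    """Return the actions the device takes for a reported temperature."""
    actions = []
    if temp_c > HIGH_TEMP_C:
        actions.append("throttle_nearby_processor")   # thermal protection
    if temp_c < LOW_TEMP_C:
        actions.append("heat_battery_342")            # avoid cold shutdown
    if temp_c < VERY_LOW_TEMP_C:
        actions.append("boost_battery_output_voltage")
    return actions

print(temperature_policy(-12.0))  # both low-temperature actions fire
```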
The touch sensor 380K is also called a "touch panel". The touch sensor 380K may be disposed on the display screen 394; together, the touch sensor 380K and the display screen 394 form a touchscreen, also called a "touch screen". The touch sensor 380K is used to detect touch operations on or near it, and may pass a detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation may be provided through the display screen 394. In other embodiments, the touch sensor 380K may also be disposed on the surface of the terminal device, at a position different from that of the display screen 394.

The bone conduction sensor 380M can acquire vibration signals. In some embodiments, the bone conduction sensor 380M can acquire the vibration signal of the vibrating bone mass of the human vocal part. The bone conduction sensor 380M can also contact the human pulse and receive the blood-pressure beating signal. In some embodiments, the bone conduction sensor 380M may also be disposed in an earphone to form a bone-conduction earphone. The audio module 370 can parse a speech signal from the vibration signal of the vocal-part bone mass acquired by the bone conduction sensor 380M, to implement a speech function; the application processor can parse heart-rate information from the blood-pressure beating signal acquired by the bone conduction sensor 380M, to implement a heart-rate detection function.

The buttons 390 include a power button, volume buttons, and so on. The buttons 390 may be mechanical buttons or touch buttons. The terminal device can receive button input and generate button signal input related to user settings and function control of the terminal device.

The motor 391 can generate vibration prompts. The motor 391 can be used for incoming-call vibration prompts as well as for touch vibration feedback. For example, touch operations acting on different applications (such as photographing or audio playback) can correspond to different vibration-feedback effects, and touch operations acting on different areas of the display screen 394 can also correspond to different vibration-feedback effects. Different application scenarios (for example, time reminders, receiving messages, alarm clocks, and games) can likewise correspond to different vibration-feedback effects, and the touch vibration-feedback effect can also be customized.

The indicator 392 may be an indicator light, which can be used to indicate the charging status and battery-level changes, and can also be used to indicate messages, missed calls, notifications, and so on.

The SIM card interface 395 is used to connect a SIM card. A SIM card can be inserted into or pulled out of the SIM card interface 395 to be brought into contact with or separated from the terminal device. The terminal device can support 1 or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 395 can support a Nano-SIM card, a Micro-SIM card, a SIM card, and so on. Multiple cards can be inserted into the same SIM card interface 395 at the same time; the cards may be of the same type or of different types. The SIM card interface 395 is also compatible with different types of SIM cards and with external memory cards. The terminal device interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the terminal device adopts an eSIM, that is, an embedded SIM card; the eSIM card can be embedded in the terminal device and cannot be separated from it.
Fig. 4 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application. As an example and not a limitation, the method can be applied to the terminal device described above. Referring to Fig. 4, the method includes:

Step 401: Determine the duration range of each phoneme corresponding to the text to be converted.

In the process of generating voice data, the terminal device can determine the phoneme duration of each phoneme corresponding to the text to be converted and, combined with the language features of the text to be converted, generate the voice data through a preset acoustic model and vocoder. Moreover, when determining the phoneme durations, the duration range of each phoneme can be determined first, so that in subsequent steps different phoneme durations can be selected within that range, improving the naturalness and diversity of the generated voice data.

In a specific implementation, the terminal device can first extract each phoneme from the text to be converted, then obtain the pronunciation information of each phoneme based on a pre-trained model, and finally determine the duration range of each phoneme from its pronunciation information. Referring to Fig. 5, step 401 may include step 401a and step 401b.

401a. Determine the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme corresponding to the text to be converted.

The terminal device can input the text to be converted into a pre-trained model, extract each phoneme of the text through the model, and also analyze the text to determine the pronunciation information of each phoneme, such as the pronunciation duration, the pronunciation-duration variance, and the pronunciation-duration distribution density, so that in subsequent steps the pronunciation range of each phoneme can be determined from this pronunciation information.

When extracting phonemes through the model, the terminal device can first split the text to be converted into multiple characters arranged in order, and then extract at least one phoneme for each character based on that character's pronunciation rules. Once the phonemes of every character have been extracted, the multiple phonemes of the text to be converted are obtained.

For example, based on the pronunciation rules of Pinyin, the terminal device can take the initial and the final corresponding to each character as that character's phonemes. If the text to be converted is "今天天气好晴朗" ("it is nice and sunny today"), the phonemes of the text to be converted can be "j", "in", "t", "ian", "t", "ian", "q", "i", "h", "ao", "q", "ing", "l", and "ang".
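As a minimal sketch of that initial/final split (an illustration, not the patent's implementation; the initial list and the toy lexicon for the example sentence are assumptions, with tones omitted):

```python
# Pinyin initials, multi-letter ones first so "zh"/"ch"/"sh" match before "z"/"c"/"s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(pinyin: str) -> list:
    """Split one Pinyin syllable into [initial, final] (or [final] alone)."""
    for ini in INITIALS:
        if pinyin.startswith(ini):
            return [ini, pinyin[len(ini):]]
    return [pinyin]  # zero-initial syllable, e.g. "an"

# Toy lexicon for "今天天气好晴朗".
sentence = ["jin", "tian", "tian", "qi", "hao", "qing", "lang"]
phonemes = [p for syll in sentence for p in split_syllable(syll)]
print(phonemes)
# ['j', 'in', 't', 'ian', 't', 'ian', 'q', 'i', 'h', 'ao', 'q', 'ing', 'l', 'ang']
```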
In addition, since the pronunciation information of each phoneme can include different types of information, the terminal device can input the text to be converted into different models to obtain different pronunciation information. For example, the terminal device can input the text to be converted into a preset text analysis model to obtain the pronunciation-duration distribution density of each phoneme output by the text analysis model, where the text analysis model can be a deep neural network (DNN) model. And/or, the terminal device can input the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation-duration variance of each phoneme output by the duration model.

401b. Determine the duration range of each phoneme according to its average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density.

After obtaining the pronunciation information of each phoneme, the terminal device can substitute the different parameters included in the pronunciation information into a preset calculation formula to obtain the duration range of each phoneme. For example, the phoneme duration of each phoneme can be assumed to follow a normal distribution; the duration range of each phoneme can then be determined from its average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density by calculating with the normal-distribution formula.

For example, referring to Fig. 6, suppose the average pronunciation duration of the x-th phoneme is t(x), its pronunciation-duration variance is std²(x), and its pronunciation-duration distribution density is p(x), so that p(x) = N(t(x), std²(x)). Solving this equation for x yields x1 and x2; if x1 is less than x2, the interval [x1, x2] can be taken as the duration range of the x-th phoneme.
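A minimal numeric sketch of this step (an illustration, not the patent's implementation): solving the normal-density equation p = N(x; t, std²) for x gives two points symmetric about the mean, which bound the duration range. The example values of t, std, and p below are assumptions.

```python
import math

def duration_range(t: float, std: float, p: float):
    """Solve p = normal_pdf(x; mean=t, std) for x; return the interval (x1, x2).

    The density level p must satisfy 0 < p <= 1 / (std * sqrt(2*pi)),
    otherwise the equation has no real solution.
    """
    # normal_pdf(x) = exp(-(x - t)^2 / (2*std^2)) / (std * sqrt(2*pi))
    offset_sq = -2.0 * std ** 2 * math.log(p * std * math.sqrt(2.0 * math.pi))
    if offset_sq < 0:
        raise ValueError("density level p exceeds the peak of the distribution")
    offset = math.sqrt(offset_sq)
    return t - offset, t + offset

# Assumed values: mean duration 100 ms, std 20 ms, density level 0.01.
x1, x2 = duration_range(t=100.0, std=20.0, p=0.01)
print(f"duration range ≈ [{x1:.1f} ms, {x2:.1f} ms]")  # ≈ [76.5 ms, 123.5 ms]
```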
Step 402: Determine any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme.

After obtaining the duration range of each phoneme, the terminal device can determine the phoneme duration of each phoneme based on the text semantic information of that phoneme in the text to be converted, based on the user's personality and age, or at random, so that the terminal device can generate different voice data from the same text to be converted.

Optionally, for each phoneme, the terminal device can obtain the phoneme's text semantic information according to the position, in the text to be converted, of the character to which the phoneme corresponds, and then determine the phoneme's duration based on the phoneme's duration range and its text semantic information.

In a possible implementation, for each phoneme, the terminal device can first determine the character to which the phoneme corresponds, then locate the sentence containing that character in the text to be converted, analyze the semantics of that sentence and, combined with the semantics expressed by all sentences in the text to be converted, determine the phoneme's text semantic information. The terminal device can then select the phoneme's duration from the multiple durations within the duration range according to this text semantic information.

For example, if a phoneme's text semantic information expresses a happy mood, a short duration can be selected from the duration range as that phoneme's duration.
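One way to realize this selection, sketched under assumed mappings (the mood labels and their positions within the range are illustrative, not from the patent):

```python
import random

# Assumed mapping from mood to a relative position inside the duration range:
# 0.0 picks the shortest admissible duration, 1.0 the longest.
MOOD_POSITION = {"happy": 0.2, "neutral": 0.5, "calm": 0.8}

def pick_duration(x1: float, x2: float, mood: str = "") -> float:
    """Pick a phoneme duration from [x1, x2], biased by mood when known."""
    if mood in MOOD_POSITION:
        return x1 + MOOD_POSITION[mood] * (x2 - x1)
    return random.uniform(x1, x2)  # no semantic cue: choose at random

print(pick_duration(76.5, 123.5, mood="happy"))  # a short, brisk duration
```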
Alternatively, the terminal device can obtain user data, which can include the user's age information and personality information; the terminal device can then determine the phoneme duration of each phoneme from within its duration range according to the user's age and personality, thereby generating voice information matched to the user.

In a possible implementation, the terminal device can obtain pre-stored user data, or request user data from a server, determine from the user data the voice type that matches the user, and then select each phoneme's duration from the multiple durations within its duration range according to the user's voice type.

For example, if the user data indicates that the user is middle-aged with a calm personality, the voice type matching that user can be an unhurried one, and accordingly a long duration can be selected from the duration range as the phoneme duration.
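A companion sketch for the user-data path; the profile fields, age thresholds, and pace values are all assumptions made for the example:

```python
def voice_pace(age: int, personality: str) -> float:
    """Return a relative position inside the duration range for this user."""
    if age >= 40 and personality == "calm":
        return 0.85   # unhurried: pick durations near the top of the range
    if age < 25:
        return 0.3    # brisk
    return 0.5        # neutral pace

def user_matched_duration(x1: float, x2: float, age: int, personality: str) -> float:
    return x1 + voice_pace(age, personality) * (x2 - x1)

print(user_matched_duration(76.5, 123.5, age=45, personality="calm"))  # long duration
```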
It should be noted that, in practical applications, the user data can also include other information indicating the user's speaking style. For example, the user data can include search data indicating the user's mood, or shopping data indicating whether the user has recently purchased goods. The embodiments of this application do not limit this.

In addition, if the terminal device can obtain both the text semantic information and the user data, it can further determine each phoneme's duration according to the respective weights of the text semantic information and the user data. However, if the terminal device can obtain neither the text semantic information nor the user data, it can treat each phoneme in the text to be converted as following a normal distribution and determine each phoneme's duration according to the rules of the normal distribution. The embodiments of this application do not limit the way the phoneme duration is determined.
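The weighted combination could look like the following sketch, which blends the two relative positions from the previous examples; the weights are assumptions, since the patent leaves them unspecified:

```python
def combined_duration(x1: float, x2: float,
                      semantic_pos: float, user_pos: float,
                      w_semantic: float = 0.6, w_user: float = 0.4) -> float:
    """Blend the semantic and user-profile positions with assumed weights."""
    pos = w_semantic * semantic_pos + w_user * user_pos
    return x1 + pos * (x2 - x1)

# e.g. happy text (position 0.2) spoken for a calm middle-aged user (position 0.85)
print(combined_duration(76.5, 123.5, semantic_pos=0.2, user_pos=0.85))
```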
Step 403: Generate voice data according to the text to be converted and the phoneme duration of each phoneme.

In the process of synthesizing the voice data, the terminal device can generate the voice data from the phoneme durations in different ways. For example, the terminal device can generate the voice data using a parametric method, a concatenative (splicing) method, or an end-to-end method; whichever way is used to generate the voice data, the phoneme duration of each phoneme corresponding to the text to be converted can be determined in the manner described above.

Taking the parametric method as an example, the terminal device can first determine each phoneme's duration as described above, then input the phoneme durations together with the extracted language features of the text to be converted into the acoustic model to obtain parameters, such as the fundamental frequency, used for generating the voice data, and finally generate the voice data through the vocoder from the fundamental frequency and the other parameters.
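The overall parametric pipeline, as a structural sketch: the stub class below stands in for the trained text-analysis, duration, acoustic, and vocoder models, which the patent does not specify; everything about the stubs is assumed.

```python
import random

class StubModel:
    """Trivial stand-ins so the sketch runs end to end."""
    def analyze(self, text):                        # -> (phonemes, language features)
        return list(text), [0.0] * len(text)
    def duration_ranges(self, phonemes):            # step 401: one range per phoneme
        return [(80.0, 120.0) for _ in phonemes]
    def predict(self, features, durations):         # acoustic model: f0 etc.
        return {"f0": 220.0, "durations": durations}
    def generate(self, params):                     # vocoder: params -> waveform
        return b"\x00" * int(sum(params["durations"]))  # fake PCM bytes

def synthesize(text, analyzer, duration_model, acoustic_model, vocoder):
    phonemes, feats = analyzer.analyze(text)                      # feature extraction
    ranges = duration_model.duration_ranges(phonemes)             # step 401
    durations = [random.uniform(lo, hi) for lo, hi in ranges]     # step 402
    params = acoustic_model.predict(feats, durations)             # acoustic model
    return vocoder.generate(params)                               # step 403

m = StubModel()
audio = synthesize("jin tian tian qi hao qing lang", m, m, m, m)
print(len(audio), "bytes of (stub) audio")
```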
The process of generating voice data with the concatenative method or the end-to-end method is similar to the parametric process above and is not repeated here.

In summary, in the speech synthesis method provided by the embodiments of this application, the duration range of each phoneme corresponding to the text to be converted is determined, any duration within each phoneme's duration range is then determined as the corresponding phoneme's duration, and finally the voice data is generated from the text to be converted and each phoneme's duration. Across multiple pieces of voice data generated for the same text to be converted, the duration of the same phoneme may take different values within the same duration range, so a variety of different voice data can be synthesized. This avoids synthesizing identical voice data every time for the same text to be converted, reduces the mechanical quality of the synthesized speech, and improves the naturalness and diversity of speech synthesis.

Moreover, by determining the duration range of each phoneme and selecting the phoneme's duration within that range, the selected value cannot deviate excessively, which avoids abnormal voice data caused by a phoneme duration that is too long or too short, and improves the stability of speech synthesis.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

Corresponding to the speech synthesis method described in the foregoing embodiments, Fig. 7 is a structural block diagram of a speech synthesis apparatus provided by an embodiment of this application. For ease of description, only the parts related to the embodiments of this application are shown.

Referring to Fig. 7, the apparatus includes:

a range determining module 701, configured to determine the duration range of each phoneme corresponding to the text to be converted;

a duration determining module 702, configured to determine any duration within the duration range as the phoneme duration of each phoneme; and

a generating module 703, configured to generate voice data according to the text to be converted and the phoneme duration of each phoneme.

Optionally, the range determining module 701 is specifically configured to determine the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme corresponding to the text to be converted, and to determine the duration range of each phoneme according to its average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density.

Optionally, the range determining module 701 is further specifically configured to input the text to be converted into a preset text analysis model to obtain the pronunciation-duration distribution density of each phoneme output by the text analysis model, and to input the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation-duration variance of each phoneme output by the duration model.

Optionally, the range determining module 701 is further specifically configured to determine the duration range of each phoneme through a normal-distribution algorithm according to the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme.

Optionally, the duration determining module 702 is specifically configured to: for each phoneme, obtain the phoneme's text semantic information according to the position, in the text to be converted, of the character to which the phoneme corresponds, and determine the phoneme's duration based on the phoneme's duration range and its text semantic information.

Optionally, the duration determining module 702 is specifically configured to obtain user data, the user data including the user's age information and personality information, and to determine each phoneme's duration based on the phoneme's duration range and the user data.

Optionally, the generating module 703 is specifically configured to generate the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.

In summary, the speech synthesis apparatus provided by the embodiments of this application determines the duration range of each phoneme corresponding to the text to be converted, then determines any duration within each phoneme's duration range as the corresponding phoneme's duration, and finally generates the voice data from the text to be converted and each phoneme's duration. Across multiple pieces of voice data generated for the same text to be converted, the duration of the same phoneme may take different values within the same duration range, so a variety of different voice data can be synthesized. This avoids synthesizing identical voice data every time for the same text to be converted, reduces the mechanical quality of the synthesized speech, and improves the naturalness and diversity of speech synthesis.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is used only as an example. In practical applications, the above functions can be allocated to different functional units and modules as required; that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for ease of mutual distinction and are not used to limit the protection scope of this application. For the specific working process of the units and modules in the foregoing system, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.

In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered as going beyond the scope of this application.

In the embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the system embodiment described above is merely illustrative. For example, the division into modules or units is only a division by logical function; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the foregoing embodiments of this application may be completed by a computer program instructing relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the foregoing method embodiments. The computer program includes computer program code, which may be in source-code form, object-code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals or telecommunications signals.

The foregoing embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims (10)

  1. A speech synthesis method, characterized in that it comprises:
    determining the duration range of each phoneme corresponding to a text to be converted;
    determining any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme; and
    generating voice data according to the text to be converted and the phoneme duration of each phoneme.
  2. The speech synthesis method according to claim 1, wherein the determining the duration range of each phoneme corresponding to the text to be converted comprises:
    determining the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme corresponding to the text to be converted; and
    determining the duration range of each phoneme according to the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme.
  3. The speech synthesis method according to claim 2, wherein the determining the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme corresponding to the text to be converted comprises:
    inputting the text to be converted into a preset text analysis model to obtain the pronunciation-duration distribution density of each phoneme output by the text analysis model; and
    inputting the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation-duration variance of each phoneme output by the duration model.
  4. The speech synthesis method according to claim 2, wherein the determining the duration range of each phoneme according to the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme comprises:
    determining the duration range of each phoneme through a normal-distribution algorithm according to the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme.
  5. The speech synthesis method according to claim 1, wherein the determining any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme comprises:
    for each phoneme, obtaining the text semantic information of the phoneme according to the position, in the text to be converted, of the character corresponding to the phoneme; and
    determining the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
  6. The speech synthesis method according to claim 1, wherein the determining any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme comprises:
    obtaining user data, the user data comprising age information and personality information of the user; and
    determining the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
  7. The speech synthesis method according to any one of claims 1 to 6, wherein the generating voice data according to the text to be converted and the phoneme duration of each phoneme comprises:
    generating the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
  8. A speech synthesis apparatus, characterized in that it comprises:
    a range determining module, configured to determine the duration range of each phoneme corresponding to a text to be converted;
    a duration determining module, configured to determine any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme; and
    a generating module, configured to generate voice data according to the text to be converted and the phoneme duration of each phoneme.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
PCT/CN2021/080403 2020-05-26 2021-03-12 Speech synthesis method and device WO2021238338A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010456116.1 2020-05-26
CN202010456116.1A CN113793589A (en) 2020-05-26 2020-05-26 Speech synthesis method and device

Publications (1)

Publication Number Publication Date
WO2021238338A1 true WO2021238338A1 (en) 2021-12-02

Family

ID=78745521

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/080403 WO2021238338A1 (en) 2020-05-26 2021-03-12 Speech synthesis method and device

Country Status (2)

Country Link
CN (1) CN113793589A (en)
WO (1) WO2021238338A1 (en)


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100223028B1 (en) * 1996-12-14 1999-10-01 정선종 Apparatus and method for modelling the duration time of speech synthesizer
JP3854713B2 (en) * 1998-03-10 2006-12-06 キヤノン株式会社 Speech synthesis method and apparatus and storage medium
JP4603375B2 (en) * 2005-02-01 2010-12-22 日本放送協会 Duration time length generating device and duration time length generating program
CN107705782B (en) * 2017-09-29 2021-01-05 百度在线网络技术(北京)有限公司 Method and device for determining phoneme pronunciation duration
CN108597492B (en) * 2018-05-02 2019-11-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN109599092B (en) * 2018-12-21 2022-06-10 秒针信息技术有限公司 Audio synthesis method and device
CN109979428B (en) * 2019-04-02 2021-07-23 北京地平线机器人技术研发有限公司 Audio generation method and device, storage medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0580791A (en) * 1991-09-20 1993-04-02 Hitachi Ltd Device and method for speech rule synthesis
US20020138270A1 (en) * 1997-12-18 2002-09-26 Apple Computer, Inc. Method and apparatus for improved duration modeling of phonemes
CN106601226A (en) * 2016-11-18 2017-04-26 中国科学院自动化研究所 Phoneme duration prediction modeling method and phoneme duration prediction method
CN107481715A (en) * 2017-09-29 2017-12-15 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN110992927A (en) * 2019-12-11 2020-04-10 广州酷狗计算机科技有限公司 Audio generation method and device, computer readable storage medium and computing device

Also Published As

Publication number Publication date
CN113793589A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
WO2021244457A1 (en) Video generation method and related apparatus
WO2021036568A1 (en) Fitness-assisted method and electronic apparatus
EP4270382A1 (en) Text data processing method and apparatus
WO2021258814A1 (en) Video synthesis method and apparatus, electronic device, and storage medium
WO2022193989A1 (en) Operation method and apparatus for electronic device and electronic device
CN111742539B (en) Voice control command generation method and terminal
WO2021052139A1 (en) Gesture input method and electronic device
CN114242037A (en) Virtual character generation method and device
WO2021068926A1 (en) Model updating method, working node, and model updating system
WO2021169351A1 (en) Method and apparatus for anaphora resolution, and electronic device
CN112256868A (en) Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment
WO2022062884A1 (en) Text input method, electronic device, and computer-readable storage medium
CN113593567A (en) Method for converting video and sound into text and related equipment
CN114444000A (en) Page layout file generation method and device, electronic equipment and readable storage medium
CN109285563B (en) Voice data processing method and device in online translation process
WO2022022319A1 (en) Image processing method, electronic device, image processing system and chip system
WO2022007757A1 (en) Cross-device voiceprint registration method, electronic device and storage medium
WO2022214004A1 (en) Target user determination method, electronic device and computer-readable storage medium
CN113380240B (en) Voice interaction method and electronic equipment
WO2022078116A1 (en) Brush effect picture generation method, image editing method and device, and storage medium
WO2021238338A1 (en) Speech synthesis method and device
CN114120987B (en) Voice wake-up method, electronic equipment and chip system
CN115393676A (en) Gesture control optimization method and device, terminal and storage medium
CN114822525A (en) Voice control method and electronic equipment
CN114528842A (en) Word vector construction method, device and equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21813421

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21813421

Country of ref document: EP

Kind code of ref document: A1