WO2021238338A1 - Speech synthesis method and device - Google Patents

Speech synthesis method and device

Info

Publication number
WO2021238338A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
duration
text
converted
pronunciation
Prior art date
Application number
PCT/CN2021/080403
Other languages
French (fr)
Chinese (zh)
Inventor
别凡虎
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021238338A1 publication Critical patent/WO2021238338A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • This application relates to the technical field of terminal artificial intelligence and the technical field of text-to-speech, and in particular to a speech synthesis method and device.
  • A terminal device can not only receive voice information sent by the user, but also play voice information to the user.
  • In this way, the user does not need to read the text displayed by the terminal device, and can obtain the displayed information through hearing alone.
  • The terminal device can obtain the text to be converted, perform feature extraction on the text to be converted to obtain language features, then determine the phoneme duration of each phoneme corresponding to the text to be converted from the language features, and finally generate voice data according to the respective phoneme durations and language features.
  • the embodiments of the present application provide a speech synthesis method and device, which can solve the problem of excessively mechanized speech synthesis.
  • In a first aspect, an embodiment of the present application provides a speech synthesis method, including: determining the duration range of each phoneme corresponding to the text to be converted; determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme; and generating voice data according to the text to be converted and the phoneme duration of each phoneme.
  • In one implementation, the determining of the duration range of each phoneme corresponding to the text to be converted includes: determining the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme corresponding to the text to be converted; and determining the duration range of each phoneme according to the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme.
  • In one implementation, the determining of the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme corresponding to the text to be converted includes: inputting the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model; and inputting the text to be converted into a preset duration model to obtain the average pronunciation duration and the variance of the pronunciation duration of each phoneme output by the duration model.
  • In one implementation, the determining of the duration range of each phoneme according to the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme includes: determining the duration range of each phoneme by a normal distribution algorithm.
  • In one implementation, the determining of any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes: for each phoneme, obtaining the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme; and determining the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
  • In another implementation, the determining of any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes: obtaining user data, the user data including age information and personality information of the user; and determining the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
  • In one implementation, the generating of voice data according to the text to be converted and the phoneme duration of each phoneme includes: generating the voice data through a preset acoustic model and a vocoder.
  • an embodiment of the present application provides a speech synthesis device, including:
  • the range determination module is used to determine the duration range of each phoneme corresponding to the text to be converted
  • a duration determining module configured to determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme
  • the generating module is used to generate voice data according to the text to be converted and the phoneme duration of each phoneme.
  • The range determining module is specifically configured to determine the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme corresponding to the text to be converted, and to determine the duration range of each phoneme according to the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme.
  • Further, the range determination module is specifically configured to input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model, and to input the text to be converted into a preset duration model to obtain the average pronunciation duration and the variance of the pronunciation duration of each phoneme output by the duration model.
  • Further, the range determination module is specifically configured to determine the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme.
  • The duration determining module is specifically configured to obtain, for each phoneme, the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme, and to determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
  • The duration determination module is specifically configured to obtain user data, where the user data includes age information and personality information of the user, and to determine the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
  • The generating module is specifically configured to generate the voice data through a preset acoustic model and a vocoder according to the text to be converted and the phoneme duration of each phoneme.
  • An embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the speech synthesis method described in any one of the above first aspects.
  • An embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the speech synthesis method described in any one of the above first aspects.
  • the embodiments of the present application provide a computer program product, which when the computer program product runs on a terminal device, causes the terminal device to execute the speech synthesis method described in any one of the above-mentioned first aspects.
  • An embodiment of the present application provides a chip system; the chip system includes a processor, the processor is coupled to a memory, and the processor executes a computer program stored in the memory to implement the speech synthesis method described in any one of the above first aspects.
  • the chip system may be a single chip or a chip module composed of multiple chips.
  • The embodiments of this application determine the duration range of each phoneme corresponding to the text to be converted, then determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generate speech data according to the text to be converted and the phoneme duration of each phoneme.
  • When the same text to be converted is synthesized multiple times, the phoneme duration of the same phoneme may take different values within the same duration range, so a variety of different speech data can be synthesized. This avoids obtaining identical speech data each time the same text to be converted is synthesized, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.
  • FIG. 1 is a schematic diagram of a speech synthesis scenario involved in a speech synthesis method provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of another speech synthesis scenario involved in the speech synthesis method provided by an embodiment of the present application;
  • FIG. 3 is a structural block diagram of a terminal device provided by an embodiment of the present application;
  • FIG. 4 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of determining the duration range of a phoneme according to an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a duration range provided by an embodiment of the present application;
  • FIG. 7 is a structural block diagram of a speech synthesis device provided by an embodiment of the present application.
  • The speech synthesis method provided by the embodiments of this application can be applied to mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, personal digital assistants (PDA), and other terminal devices.
  • The terminal device may be a station (STATION, ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device with wireless communication functions, a computing device or other processing device connected to a wireless modem, an in-vehicle device, an Internet-of-Vehicles terminal, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, a wireless modem card, a television set-top box (STB), customer premise equipment (CPE), and/or other equipment used to communicate on a wireless system, as well as a next-generation communication device, for example, a mobile terminal in a 5G network or a mobile terminal in a future evolved Public Land Mobile Network (PLMN), etc.
  • A wearable device may also be a general term for devices developed by applying wearable technology to the intelligent design of everyday wear, such as glasses, gloves, watches, clothing, and shoes.
  • a wearable device is a portable device that is directly worn on the body or integrated into the user's clothes or accessories.
  • A wearable device is not merely a hardware device; it also realizes powerful functions through software support, data interaction, and cloud interaction.
  • Generalized wearable smart devices include full-featured, large-sized devices that can implement complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on only a certain type of application function and need to be used in conjunction with other devices such as smartphones, for example, various smart bracelets and smart jewelry for vital sign monitoring.
  • FIG. 1 is a schematic diagram of a speech synthesis scenario involved in a speech synthesis method provided by an embodiment of the present application. Referring to FIG. 1, the phoneme duration of each phoneme corresponding to the text to be converted can be adjusted, so that different voice data can be generated based on the same text to be converted.
  • The terminal device 110 may obtain the text to be converted and input it into a pre-trained text analysis model and a pre-trained duration model, respectively, to obtain the average pronunciation duration, the variance of the pronunciation duration, and the pronunciation duration distribution density of each phoneme corresponding to the text to be converted, and may then determine the duration range of each phoneme from these statistics based on the normal distribution rule.
  • The terminal device 110 may then, based on the duration range of each phoneme, combine the text semantic information of each phoneme in the text to be converted and/or pre-stored user data including age information and personality information of the user, determine any duration in the duration range as the phoneme duration of each phoneme, and thereby generate speech data according to each phoneme duration and the text to be converted.
  • For the same phoneme, the average pronunciation duration represents the mean of its pronunciation durations; the variance of the pronunciation duration represents the degree to which its individual pronunciation durations differ from the average pronunciation duration; and the distribution density of the pronunciation duration represents the probability of the same phoneme taking different phoneme durations.
  • A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; it is analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme. For example, under the pronunciation rules of Pinyin, the initial of each character's Pinyin can be treated as one phoneme and the final as another phoneme. In “天气” (“weather”), the phonemes corresponding to the character “天” can include “t” and “ian”, and the phonemes corresponding to the character “气” can include “q” and “i”.
  • The speech synthesis scenario may also include a server 120, and the terminal device 110 may be connected to the server 120, so that the server 120 can convert the text to be converted and obtain different voice data based on the same text to be converted.
  • The terminal device 110 may first send the text to be converted to the server 120. The server 120 may determine the duration range of each phoneme according to the text to be converted, then combine the semantic information of the text to be converted with pre-stored user data to determine the phoneme duration of each phoneme from each duration range, generate voice data according to each phoneme duration and the text to be converted, and send the generated voice data to the terminal device 110; the terminal device 110 can then receive and play the voice data generated by the server 120.
  • The following embodiments are described by taking, as an example, a speech synthesis scenario that includes the terminal device 110 and does not include the server 120. In actual applications, the voice data may be obtained by conversion not only by the terminal device 110 but also by the server 120; this is not limited in the embodiments of the present application.
  • Fig. 3 is a structural block diagram of a terminal device provided by an embodiment of the present application.
  • The terminal device may include a processor 310, an external memory interface 320, an internal memory 321, a universal serial bus (USB) interface 330, a charging management module 340, a power management module 341, a battery 342, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 370A, a receiver 370B, a microphone 370C, an earphone interface 370D, a sensor module 380, buttons 390, a motor 391, an indicator 392, a camera 393, a display screen 394, a subscriber identification module (SIM) card interface 395, etc.
  • The sensor module 380 may include a pressure sensor 380A, a gyroscope sensor 380B, an air pressure sensor 380C, a magnetic sensor 380D, an acceleration sensor 380E, a distance sensor 380F, a proximity light sensor 380G, a fingerprint sensor 380H, a temperature sensor 380J, a touch sensor 380K, an ambient light sensor 380L, a bone conduction sensor 380M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the terminal device.
  • the terminal device may include more or fewer components than shown in the figure, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the processor 310 may include one or more processing units.
  • The processor 310 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller can be the nerve center and command center of the terminal device.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 310 to store instructions and data.
  • the memory in the processor 310 is a cache memory.
  • The memory can store instructions or data that the processor 310 has just used or used cyclically. If the processor 310 needs to use the instructions or data again, it can call them directly from the memory, which avoids repeated accesses, reduces the waiting time of the processor 310, and improves system efficiency.
  • the processor 310 may include one or more interfaces.
  • The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus, which includes a serial data line (SDA) and a serial clock line (SCL).
  • the processor 310 may include multiple sets of I2C buses.
  • the processor 310 may couple the touch sensor 380K, charger, flash, camera 393, etc., respectively through different I2C bus interfaces.
  • the processor 310 may couple the touch sensor 380K through an I2C interface, so that the processor 310 and the touch sensor 380K communicate through the I2C bus interface to realize the touch function of the terminal device.
  • the I2S interface can be used for audio communication.
  • the processor 310 may include multiple sets of I2S buses.
  • the processor 310 may be coupled with the audio module 370 through an I2S bus to implement communication between the processor 310 and the audio module 370.
  • the audio module 370 may transmit audio signals to the wireless communication module 360 through the I2S interface, so as to realize the function of answering calls through the Bluetooth headset.
  • the PCM interface can also be used for audio communication to sample, quantize and encode analog signals.
  • the audio module 370 and the wireless communication module 360 may be coupled through a PCM bus interface.
  • the audio module 370 may also transmit audio signals to the wireless communication module 360 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus can be a two-way communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • the UART interface is generally used to connect the processor 310 and the wireless communication module 360.
  • the processor 310 communicates with the Bluetooth module in the wireless communication module 360 through the UART interface to realize the Bluetooth function.
  • the audio module 370 may transmit audio signals to the wireless communication module 360 through the UART interface, so as to realize the function of playing music through the Bluetooth headset.
  • the MIPI interface can be used to connect the processor 310 with the display screen 394, the camera 393 and other peripheral devices.
  • the MIPI interface includes camera serial interface (camera serial interface, CSI), display serial interface (display serial interface, DSI), etc.
  • the processor 310 and the camera 393 communicate through a CSI interface to implement the shooting function of the terminal device.
  • the processor 310 and the display screen 394 communicate through the DSI interface to realize the display function of the terminal device.
  • the GPIO interface can be configured through software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 310 with the camera 393, the display screen 394, the wireless communication module 360, the audio module 370, the sensor module 380, and so on.
  • the GPIO interface can also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, etc.
  • the USB interface 330 is an interface that complies with the USB standard specification, and specifically may be a Mini USB interface, a Micro USB interface, a USB Type C interface, and so on.
  • the USB interface 330 can be used to connect a charger to charge the terminal device, and can also be used to transfer data between the terminal device and peripheral devices. It can also be used to connect earphones and play audio through earphones.
  • the interface can also be used to connect other electronic devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is merely a schematic description, and does not constitute a structural limitation of the terminal device.
  • the terminal device may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the charging management module 340 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 340 may receive the charging input of the wired charger through the USB interface 330.
  • the charging management module 340 may receive the wireless charging input through the wireless charging coil of the terminal device. While the charging management module 340 charges the battery 342, it can also supply power to the electronic device through the power management module 341.
  • the power management module 341 is used to connect the battery 342, the charging management module 340 and the processor 310.
  • the power management module 341 receives input from the battery 342 and/or the charge management module 340, and supplies power to the processor 310, the internal memory 321, the external memory, the display screen 394, the camera 393, and the wireless communication module 360.
  • the power management module 341 can also be used to monitor parameters such as battery capacity, battery cycle times, and battery health status (leakage, impedance).
  • the power management module 341 may also be provided in the processor 310.
  • the power management module 341 and the charging management module 340 may also be provided in the same device.
  • the wireless communication function of the terminal device can be realized by the antenna 1, the antenna 2, the mobile communication module 350, the wireless communication module 360, the modem processor, and the baseband processor.
  • the antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the terminal device can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • antenna 1 can be multiplexed as a diversity antenna of a wireless local area network.
  • the antenna can be used in combination with a tuning switch.
  • the mobile communication module 350 can provide wireless communication solutions including 2G/3G/4G/5G, etc., which are applied to terminal devices.
  • the mobile communication module 350 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
  • the mobile communication module 350 can receive electromagnetic waves by the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 350 can also amplify the signal modulated by the modem processor, and convert it to electromagnetic wave radiation via the antenna 1.
  • at least part of the functional modules of the mobile communication module 350 may be provided in the processor 310.
  • at least part of the functional modules of the mobile communication module 350 and at least part of the modules of the processor 310 may be provided in the same device.
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low-frequency baseband signal is processed by the baseband processor and then passed to the application processor.
  • the application processor outputs sound signals through audio equipment (not limited to the speaker 370A, the receiver 370B, etc.), or displays images or videos through the display screen 394.
  • the modem processor may be an independent device.
  • the modem processor may be independent of the processor 310 and be provided in the same device as the mobile communication module 350 or other functional modules.
  • The wireless communication module 360 can provide wireless communication solutions applied to the terminal device, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 360 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 360 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 310.
  • The wireless communication module 360 can also receive the signal to be sent from the processor 310, perform frequency modulation and amplification on it, and convert it into electromagnetic waves for radiation via the antenna 2.
  • the antenna 1 of the terminal device is coupled with the mobile communication module 350, and the antenna 2 is coupled with the wireless communication module 360, so that the terminal device can communicate with the network and other devices through wireless communication technology.
  • The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the Beidou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
  • the terminal device realizes the display function through the GPU, the display screen 394, and the application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 394 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 310 may include one or more GPUs, which execute program instructions to generate or change display information.
  • the display screen 394 is used to display images, videos, and the like.
  • the display screen 394 includes a display panel.
  • The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like.
  • the terminal device may include one or N display screens 394, and N is a positive integer greater than one.
  • the terminal device can realize the shooting function through ISP, camera 393, video codec, GPU, display screen 394 and application processor.
  • the ISP is used to process the data fed back by the camera 393. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transfers the electrical signal to the ISP for processing and is converted into an image visible to the naked eye.
  • ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 393.
  • the camera 393 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the terminal device may include 1 or N cameras 393, and N is a positive integer greater than 1.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the terminal device selects the frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the terminal device can support one or more video codecs.
  • the terminal device can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • NPU is a neural-network (NN) computing processor.
  • NPU can realize the intelligent cognition of terminal equipment and other applications, such as: image recognition, face recognition, speech recognition, text understanding, etc.
  • the external memory interface 320 may be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the terminal device.
  • the external memory card communicates with the processor 310 through the external memory interface 320 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 321 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 310 executes various functional applications and data processing of the terminal device by running instructions stored in the internal memory 321.
  • the internal memory 321 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, at least one application program (such as a sound playback function, an image playback function, etc.) required by at least one function.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the terminal device.
  • the internal memory 321 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the terminal device can implement audio functions through the audio module 370, the speaker 370A, the receiver 370B, the microphone 370C, the earphone interface 370D, and the application processor. For example, music playback, recording, etc.
  • the audio module 370 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 370 can also be used to encode and decode audio signals.
  • the audio module 370 may be provided in the processor 310, or part of the functional modules of the audio module 370 may be provided in the processor 310.
  • The speaker 370A is used to convert audio electrical signals into sound signals.
  • the terminal device can listen to music through the speaker 370A, or listen to a hands-free call.
  • the receiver 370B also called “earpiece” is used to convert audio electrical signals into sound signals.
  • the terminal device answers a call or voice message, it can receive the voice by bringing the receiver 370B close to the human ear.
  • The microphone 370C is used to convert sound signals into electrical signals.
  • When making a sound, the user can bring the mouth close to the microphone 370C to input the sound signal into the microphone 370C.
  • the terminal device can be provided with at least one microphone 370C.
  • the terminal device can be provided with two microphones 370C, which can implement noise reduction functions in addition to collecting sound signals.
  • the terminal device may also be provided with three, four or more microphones 370C to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions.
  • the earphone interface 370D is used to connect wired earphones.
  • the earphone interface 370D may be a USB interface 330, or a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association (cellular telecommunications industry association of the USA, CTIA) standard interface.
  • the pressure sensor 380A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 380A may be provided on the display screen 394.
  • the capacitive pressure sensor may include at least two parallel plates with conductive materials. When a force is applied to the pressure sensor 380A, the capacitance between the electrodes changes. The terminal equipment determines the strength of the pressure based on the change in capacitance. When a touch operation acts on the display screen 394, the terminal device detects the intensity of the touch operation according to the pressure sensor 380A.
  • the terminal device may also calculate the touched position based on the detection signal of the pressure sensor 380A.
  • touch operations that act on the same touch position but have different touch operation intensities can correspond to different operation instructions. For example: when a touch operation whose intensity is less than the first pressure threshold is applied to the short message application icon, an instruction to view the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, an instruction to create a new short message is executed.
  • the gyroscope sensor 380B can be used to determine the motion posture of the terminal device. In some embodiments, the angular velocity of the terminal device around three axes (ie, x, y, and z axes) can be determined by the gyroscope sensor 380B.
  • the gyro sensor 380B can be used for shooting anti-shake. Exemplarily, when the shutter is pressed, the gyroscope sensor 380B detects the shake angle of the terminal device, calculates the distance that the lens module needs to compensate according to the angle, and allows the lens to counteract the shake of the terminal device through a reverse movement to achieve anti-shake.
  • the gyroscope sensor 380B can also be used for navigation and somatosensory game scenes.
  • the air pressure sensor 380C is used to measure air pressure.
  • the terminal device uses the air pressure value measured by the air pressure sensor 380C to calculate the altitude to assist positioning and navigation.
  • the magnetic sensor 380D includes a Hall sensor.
  • the terminal device can use the magnetic sensor 380D to detect the opening and closing of the flip holster.
  • When the terminal device is a flip phone, the terminal device can detect the opening and closing of the flip cover according to the magnetic sensor 380D, and then set features such as automatic unlocking upon opening according to the detected opening and closing state of the holster or flip cover.
  • the acceleration sensor 380E can detect the magnitude of the acceleration of the terminal device in various directions (generally three axes). When the terminal device is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the posture of electronic devices, and be used in applications such as horizontal and vertical screen switching, pedometers and so on.
  • The distance sensor 380F is used to measure distance; the terminal device can measure distance by infrared or laser. In some embodiments, when shooting a scene, the terminal device can use the distance sensor 380F to measure distance to achieve fast focusing.
  • the proximity light sensor 380G may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode.
  • the light emitting diode may be an infrared light emitting diode.
  • the terminal device emits infrared light to the outside through the light-emitting diode.
  • Terminal equipment uses photodiodes to detect infrared reflected light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device. When insufficient reflected light is detected, the terminal device can determine that there is no object near the terminal device.
  • the terminal device can use the proximity light sensor 380G to detect that the user holds the terminal device close to the ear to talk, so as to automatically turn off the screen to save power.
  • The proximity light sensor 380G can also be used in holster mode and pocket mode to automatically unlock and lock the screen.
  • the ambient light sensor 380L is used to sense the brightness of the ambient light.
  • the terminal device can adaptively adjust the brightness of the display screen 394 according to the perceived brightness of the ambient light.
  • the ambient light sensor 380L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 380L can also cooperate with the proximity light sensor 380G to detect whether the terminal device is in the pocket to prevent accidental touch.
  • the fingerprint sensor 380H is used to collect fingerprints. Terminal devices can use the collected fingerprint characteristics to unlock fingerprints, access application locks, take photos with fingerprints, and answer calls with fingerprints.
  • the temperature sensor 380J is used to detect temperature.
  • The terminal device executes a temperature processing strategy using the temperature detected by the temperature sensor 380J. For example, when the temperature reported by the temperature sensor 380J exceeds a threshold, the terminal device reduces the performance of a processor located near the temperature sensor 380J in order to reduce power consumption and implement thermal protection.
  • In some other embodiments, when the temperature is lower than another threshold, the terminal device heats the battery 342 to avoid abnormal shutdown of the terminal device due to low temperature.
  • In some other embodiments, when the temperature is lower than still another threshold, the terminal device boosts the output voltage of the battery 342 to avoid abnormal shutdown caused by low temperature.
  • The touch sensor 380K is also called a "touch panel". The touch sensor 380K can be arranged on the display screen 394; the touch sensor 380K and the display screen 394 form a touchscreen, which is also called a "touch screen".
  • the touch sensor 380K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • the visual output related to the touch operation can be provided through the display screen 394.
  • the touch sensor 380K may also be disposed on the surface of the terminal device, which is different from the position of the display screen 394.
  • the bone conduction sensor 380M can acquire vibration signals. In some embodiments, the bone conduction sensor 380M can acquire the vibration signal of the vibrating bone mass of the human voice. The bone conduction sensor 380M can also contact the human pulse and receive blood pressure beating signals. In some embodiments, the bone conduction sensor 380M may also be provided in the earphone, combined with the bone conduction earphone.
  • the audio module 370 can parse the voice signal based on the vibration signal of the vibrating bone block of the voice obtained by the bone conduction sensor 380M, and realize the voice function.
  • the application processor can analyze the heart rate information based on the blood pressure beating signal obtained by the bone conduction sensor 380M, and realize the heart rate detection function.
  • the button 390 includes a power-on button, a volume button, and so on.
  • the button 390 may be a mechanical button. It can also be a touch button.
  • the terminal device can receive key input and generate key signal input related to the user settings and function control of the terminal device.
  • the motor 391 can generate vibration prompts.
  • the motor 391 can be used for incoming call vibration notification, and can also be used for touch vibration feedback.
  • touch operations that act on different applications can correspond to different vibration feedback effects.
  • Touch operations acting on different areas of the display screen 394 can also correspond to different vibration feedback effects of the motor 391.
  • Different application scenarios (for example, time reminders, receiving messages, alarm clocks, and games) can also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also support customization.
  • the indicator 392 can be an indicator light, which can be used to indicate the charging status, power change, and can also be used to indicate messages, missed calls, notifications, and so on.
  • the SIM card interface 395 is used to connect to the SIM card.
  • the SIM card can be inserted into the SIM card interface 395 or pulled out from the SIM card interface 395 to achieve contact and separation with the terminal device.
  • the terminal device can support 1 or N SIM card interfaces, and N is a positive integer greater than 1.
  • the SIM card interface 395 can support Nano SIM cards, Micro SIM cards, SIM cards, etc.
  • the same SIM card interface 395 can insert multiple cards at the same time. The types of the multiple cards can be the same or different.
  • the SIM card interface 395 can also be compatible with different types of SIM cards.
  • the SIM card interface 395 can also be compatible with external memory cards.
  • the terminal device interacts with the network through the SIM card to realize functions such as call and data communication.
  • the terminal device adopts an eSIM, that is, an embedded SIM card.
  • the eSIM card can be embedded in the terminal device and cannot be separated from the terminal device.
  • Fig. 4 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application. As an example and not a limitation, the method can be applied to the above-mentioned terminal device. Referring to Fig. 4, the method includes:
  • Step 401 Determine the duration range of each phoneme corresponding to the text to be converted.
  • the terminal device can determine the phoneme duration of each phoneme corresponding to the text to be converted, and combine the language features of the text to be converted to generate voice data through a preset acoustic model and vocoder. Moreover, in the process of determining the phoneme duration, the duration range of each phoneme can be determined first, so that in subsequent steps, different phoneme durations can be selected based on the duration range to improve the naturalness and diversity of the generated speech data.
  • the terminal device can first extract each phoneme in the text to be converted, and then obtain the pronunciation information of each phoneme based on the pre-trained model, and then determine the duration range of each phoneme according to the pronunciation information of each phoneme.
  • this step 401 may include: step 401a and step 401b.
  • The terminal device can input the text to be converted into a pre-trained model, extract each phoneme of the text to be converted through the model, and analyze the text to be converted to determine the pronunciation information of each phoneme, such as the average pronunciation duration, the variance of the pronunciation duration, and the pronunciation duration distribution density, so that in subsequent steps, the duration range of each phoneme can be determined according to the pronunciation information.
  • The terminal device can first split the text to be converted to obtain multiple characters arranged in sequence, and then extract at least one phoneme for each character based on its pronunciation rules. After the phonemes of each character are extracted, the multiple phonemes of the text to be converted are obtained.
  • The terminal device can use the initial and the final corresponding to each character as that character's phonemes. If the text to be converted is "今天天气好晴朗" ("it is sunny today"), the multiple phonemes of the text to be converted can be, in turn, "j", "in", "t", "ian", "t", "ian", "q", "i", "h", "ao", "q", "ing", "l", and "ang"; a sketch of such a split follows below.
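  • As an illustration only, here is a minimal Python sketch of this initial/final split; the INITIALS table and the romanized input are assumptions for the example, and a real front end would use a grapheme-to-phoneme model rather than this toy rule.

```python
# Minimal sketch of a Pinyin initial/final split; the INITIALS table
# is an assumption for illustration, not an exhaustive list.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(syllable: str) -> list:
    """Split one romanized syllable into [initial, final] phonemes."""
    for initial in INITIALS:  # two-letter initials are listed first
        if syllable.startswith(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # zero-initial syllable such as "an"

# "jin tian tian qi hao qing lang" -> j, in, t, ian, t, ian, q, i, ...
text = "jin tian tian qi hao qing lang"
phonemes = [p for s in text.split() for p in split_syllable(s)]
print(phonemes)
```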
  • the terminal device can input the text to be converted into different models to obtain different pronunciation information.
  • The terminal device may input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model, where the text analysis model may be a deep neural network (DNN) model.
  • the terminal device may input the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation duration variance of each phoneme output by the duration model.
  • The terminal device can substitute the different parameters included in the pronunciation information into a preset calculation formula to obtain the duration range of each phoneme. For example, it can be assumed that the phoneme duration of each phoneme obeys a normal distribution, and the duration range of each phoneme can then be determined from the average pronunciation duration, the variance of the pronunciation duration, and the distribution density of the pronunciation duration of each phoneme by using the normal distribution algorithm and the normal distribution formula.
  • Suppose the average pronunciation duration of the x-th phoneme is t(x), the variance of its pronunciation duration is std²(x), and its pronunciation duration distribution density is p(x). Assuming the duration obeys the normal distribution N(t(x), std²(x)), the equation p(x) = N(t(x), std²(x)) can be solved for the duration, yielding two solutions x1 and x2. If x1 is less than x2, the interval [x1, x2] can be used as the duration range of the x-th phoneme; a numeric sketch follows below.
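  • As a hedged numeric sketch of the normal distribution algorithm (the function name and example values are assumptions, not taken from the patent), the Gaussian density can be inverted in closed form:

```python
import math

def duration_range(mean_t, var_t, density_p):
    """Solve N(x; mean_t, var_t) = density_p for x.

    Assumes the phoneme duration obeys a normal distribution with mean
    mean_t and variance var_t, and that density_p is the pronunciation
    duration distribution density predicted for the phoneme. Returns
    (x1, x2) with x1 < x2, or None when density_p has no real solution
    (i.e. it exceeds the density's peak value at the mean).
    """
    std = math.sqrt(var_t)
    peak = 1.0 / (std * math.sqrt(2.0 * math.pi))  # density at x = mean_t
    if density_p <= 0.0 or density_p > peak:
        return None
    # From density_p = peak * exp(-(x - mean)^2 / (2 * var)):
    offset = std * math.sqrt(-2.0 * math.log(density_p / peak))
    return (mean_t - offset, mean_t + offset)

# Example: mean 0.12 s, variance 0.0004 s^2, target density 10.0
print(duration_range(0.12, 0.0004, 10.0))  # -> roughly (0.097, 0.144)
```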
  • Step 402 Determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme.
  • The terminal device can determine the phoneme duration of each phoneme based on the text semantic information of each phoneme in the text to be converted, based on the user's age and personality, or randomly, so that the terminal device can generate different voice data based on the same text to be converted.
  • For each phoneme, the terminal device can obtain the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme, and then determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
  • Specifically, the terminal device can first determine the text corresponding to the phoneme, locate the sentence containing that text in the text to be converted, analyze the semantics of that sentence, and combine it with the semantics of all sentences in the text to be converted to determine the text semantic information of the phoneme. After that, the terminal device can select the phoneme duration of the phoneme from the multiple durations corresponding to the duration range according to the text semantic information.
  • For example, depending on the text semantic information, a shorter duration can be selected from the duration range as the phoneme duration of the phoneme.
  • The terminal device may obtain user data, which may include the user's age information and personality information, and may determine the phoneme duration of each phoneme from the duration range of each phoneme according to the user's age and personality, thereby generating voice information that matches the user.
  • The terminal device can obtain pre-stored user data, or request user data from the server, determine the voice type that matches the user based on the user data, and then select, from the multiple durations in the duration range, the phoneme duration of the phoneme corresponding to the user's voice type.
  • For example, the voice type that matches the user may be a slow, unhurried type; accordingly, a long duration can be selected as the phoneme duration of the phoneme.
  • The user data may also include other information indicating the user's speech type.
  • For example, the user data may include search data indicating the user's emotions, and may also include shopping data indicating whether the user has recently purchased goods, etc.; the embodiments of the present application do not limit this.
  • If the terminal device can obtain both the text semantic information and the user data, it can further determine the phoneme duration of each phoneme according to the weights corresponding to the text semantic information and the user data. If the terminal device can obtain neither the text semantic information nor the user data, it can determine the phoneme duration of each phoneme according to the normal distribution of each phoneme in the text to be converted.
  • The embodiments of the present application do not limit the manner of determining the phoneme duration; one possible selection scheme is sketched below.
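  • The following Python sketch is an illustration only: the linear blend of normalized cues is an assumption, since the patent leaves the weighting open and requires only that some duration within the range be chosen.

```python
import random

def pick_phoneme_duration(dur_range, semantic_cue=None, user_cue=None):
    """Pick one duration from a phoneme's duration range.

    semantic_cue and user_cue are assumed scores in [0, 1] derived from
    the text semantic information and the user data (age, personality);
    0 favors the short end of the range and 1 the long end. With no cue
    available, sample uniformly at random, matching the "any duration
    in the duration range" wording of the method.
    """
    lo, hi = dur_range
    cues = [c for c in (semantic_cue, user_cue) if c is not None]
    if not cues:
        return random.uniform(lo, hi)
    blend = sum(cues) / len(cues)  # equal weights as a simple default
    return lo + blend * (hi - lo)

# Brisk semantics -> shorter duration; slow, unhurried user -> longer
print(pick_phoneme_duration((0.097, 0.144), semantic_cue=0.2))
print(pick_phoneme_duration((0.097, 0.144), user_cue=0.9))
```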
  • Step 403 Generate voice data according to the text to be converted and the phoneme duration of each phoneme.
  • The terminal device can use different methods to generate voice data based on the phoneme duration of each phoneme. For example, the terminal device can use a parametric method, a concatenative method, or an end-to-end method, and whichever method is used, the phoneme duration of each phoneme corresponding to the text to be converted can be determined in the foregoing manner.
  • The terminal device can first determine the phoneme duration of each phoneme according to the above method, and then input the phoneme durations and the language features extracted from the text to be converted into the acoustic model to obtain the fundamental frequency and other parameters used to generate the voice data.
  • the voice data is generated by the vocoder according to the fundamental frequency and other parameters.
  • the process of generating voice data in a splicing method or an end-to-end manner is similar to the process of generating voice data in the above-mentioned parameter method, and will not be repeated here.
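A deliberately toy sketch of the parameter method is given below: the phoneme durations fixed in step 402 decide how many acoustic frames each phoneme spans, while the acoustic model is reduced to a constant fundamental frequency and the vocoder to a sine-wave generator. The frame length, sample rate, and fundamental frequency are arbitrary assumptions; a real acoustic model and vocoder are far more elaborate.

```python
import math

def synthesize_parametric(phoneme_durations_ms, frame_ms=10, sample_rate=16000, f0=220.0):
    """Render a waveform whose segment lengths follow the phoneme durations.

    Each phoneme contributes round(duration / frame_ms) frames, and every
    frame is filled with sine samples at the 'predicted' fundamental
    frequency f0 (a stand-in for real acoustic-model output).
    """
    samples = []
    for dur_ms in phoneme_durations_ms:
        n_frames = max(1, round(dur_ms / frame_ms))
        n_samples = n_frames * frame_ms * sample_rate // 1000
        samples += [math.sin(2 * math.pi * f0 * t / sample_rate)
                    for t in range(n_samples)]
    return samples

waveform = synthesize_parametric([120.0, 90.0, 100.0, 80.0])  # four phoneme durations in ms
```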
To sum up, the speech synthesis method provided by the embodiments of this application determines the duration range of each phoneme corresponding to the text to be converted, then determines any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generates the voice data according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of voice data of the same text to be converted, the phoneme duration of the same phoneme may take different values from the same duration range, so a variety of different voice data can be synthesized. This avoids synthesizing the same voice data every time for the same text to be converted, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.
Moreover, because each phoneme duration is taken from within its duration range, its value will not deviate excessively, which avoids abnormal voice data caused by a phoneme duration that is too long or too short and improves the stability of speech synthesis.
FIG. 7 is a structural block diagram of a speech synthesis device provided in an embodiment of this application. For ease of description, only the parts related to the embodiments of this application are shown. Referring to FIG. 7, the device includes:
the range determining module 701, used to determine the duration range of each phoneme corresponding to the text to be converted;
the duration determining module 702, configured to determine any duration in the duration range as the phoneme duration of each phoneme;
the generating module 703, configured to generate voice data according to the text to be converted and the phoneme duration of each phoneme.
In a possible implementation, the range determining module 701 is specifically configured to determine the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted, and to determine the duration range of each phoneme according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
Optionally, the range determining module 701 is further specifically configured to input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model, and to input the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation duration variance of each phoneme output by the duration model.
Optionally, the range determining module 701 is further specifically configured to determine the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme; one possible form of such a computation is sketched below.
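The embodiment does not spell out how the normal distribution algorithm turns the three statistics into a range; one plausible reading, offered here purely as an assumption, is to keep all durations whose normal probability density is at least the given distribution density:

```python
import math

def duration_range(mean, variance, density):
    """Durations d with pdf(d) >= density under N(mean, variance).

    Solving exp(-(d - mean)^2 / (2 * variance)) / sqrt(2*pi*variance) >= density
    gives |d - mean| <= std * sqrt(2 * ln(peak / density)), where peak is
    the pdf's maximum value 1 / (std * sqrt(2*pi)).
    """
    std = math.sqrt(variance)
    peak = 1.0 / (std * math.sqrt(2.0 * math.pi))
    if density >= peak:
        return mean, mean  # threshold at or above the peak: degenerate range
    half_width = std * math.sqrt(2.0 * math.log(peak / density))
    return mean - half_width, mean + half_width

low, high = duration_range(mean=110.0, variance=225.0, density=0.01)  # approx. (89.0, 131.0)
```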
Optionally, the duration determining module 702 is specifically configured to obtain, for each phoneme, the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme, and to determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
Optionally, the duration determining module 702 is specifically configured to obtain user data, the user data including the user's age information and personality information, and to determine the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
Optionally, the generating module 703 is specifically configured to generate the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
To sum up, the speech synthesis device provided by the embodiments of this application determines the duration range of each phoneme corresponding to the text to be converted, then determines any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generates the voice data according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of voice data of the same text to be converted, the phoneme duration of the same phoneme may take different values from the same duration range, so a variety of different voice data can be synthesized. This avoids synthesizing the same voice data every time for the same text to be converted, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.
It should be understood that the disclosed device and method may be implemented in other ways. The device embodiment described above is merely illustrative. For example, the division of the modules or units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
The functional units in the various embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can be implemented by instructing the relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and when executed by the processor, the steps of the foregoing method embodiments can be implemented. The computer program includes computer program code, and the computer program code may be in the form of source code, object code, an executable file, or some intermediate form.
The computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a mobile hard disk, a floppy disk, or a CD-ROM. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electric carrier signals and telecommunications signals.

Abstract

A speech synthesis method and device, which are suitable for the fields of terminal artificial intelligence technology and text-to-speech technology. The method comprises: determining the duration range of each phoneme corresponding to text to be converted (401); determining any duration within the duration range of each phoneme to be a phoneme duration of a corresponding phoneme (402); and generating speech data according to the text and the phoneme duration of each phoneme (403). For multiple pieces of speech data of the same text to be converted, the phoneme duration of the same phoneme in the multiple pieces of speech data may have different values on the basis of the same duration range, and then a variety of different speech data may be synthesized, which avoids the synthesizing of the same speech data every time for the same text, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.

Description

Speech synthesis method and device
This application claims priority to the Chinese patent application No. 202010456116.1, filed with the State Intellectual Property Office on May 26, 2020 and entitled "Speech synthesis method and device", the entire content of which is incorporated herein by reference.
Technical field
This application belongs to the technical field of terminal artificial intelligence and the technical field of text-to-speech, and in particular relates to a speech synthesis method and device.
Background
With the continuous development of artificial intelligence technology, terminal devices can not only receive the voice information sent by the user, but also play voice information to the user. The user does not need to read the text displayed by the terminal device, and can learn the information displayed by the terminal device through hearing alone.
In the related art, the terminal device can obtain the text to be converted and perform feature extraction on it to obtain language features, then determine the phoneme duration of each phoneme corresponding to the text to be converted through the language features, and finally generate voice data according to the phoneme durations and the language features.
However, in the process of synthesizing voice data by the terminal device, the voice data generated multiple times for the same text to be converted is always the same, which leads to excessively mechanized speech synthesis.
Summary of the invention
The embodiments of this application provide a speech synthesis method and device, which can solve the problem of excessively mechanized speech synthesis.
In the first aspect, an embodiment of this application provides a speech synthesis method, including:
determining the duration range of each phoneme corresponding to the text to be converted;
determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme;
generating voice data according to the text to be converted and the phoneme duration of each phoneme.
In a first possible implementation of the first aspect, the determining the duration range of each phoneme corresponding to the text to be converted includes:
determining the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted;
determining the duration range of each phoneme according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
Based on the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the determining the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted includes:
inputting the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model;
inputting the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation duration variance of each phoneme output by the duration model.
Based on the first possible implementation of the first aspect, in a third possible implementation of the first aspect, the determining the duration range of each phoneme according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme includes:
determining the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
In a fourth possible implementation of the first aspect, the determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes:
for each phoneme, obtaining the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme;
determining the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
In a fifth possible implementation of the first aspect, the determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes:
acquiring user data, the user data including the user's age information and personality information;
determining the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
Based on any one of the possible implementations of the first aspect, in a sixth possible implementation of the first aspect, the generating voice data according to the text to be converted and the phoneme duration of each phoneme includes:
generating the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
In the second aspect, an embodiment of this application provides a speech synthesis device, including:
a range determining module, used to determine the duration range of each phoneme corresponding to the text to be converted;
a duration determining module, used to determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme;
a generating module, used to generate voice data according to the text to be converted and the phoneme duration of each phoneme.
In a first possible implementation of the second aspect, the range determining module is specifically configured to determine the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted, and to determine the duration range of each phoneme according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
Based on the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the range determining module is further specifically configured to input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model, and to input the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation duration variance of each phoneme output by the duration model.
Based on the first possible implementation of the second aspect, in a third possible implementation of the second aspect, the range determining module is further specifically configured to determine the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
In a fourth possible implementation of the second aspect, the duration determining module is specifically configured to obtain, for each phoneme, the text semantic information of the phoneme according to the position, in the text to be converted, of the text corresponding to the phoneme, and to determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
In a fifth possible implementation of the second aspect, the duration determining module is specifically configured to acquire user data, the user data including the user's age information and personality information, and to determine the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
Based on any one of the possible implementations of the second aspect, in a sixth possible implementation of the second aspect, the generating module is specifically configured to generate the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
In the third aspect, an embodiment of this application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the speech synthesis method according to any one of the above first aspect.
In the fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the speech synthesis method according to any one of the above first aspect.
In the fifth aspect, an embodiment of this application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the speech synthesis method according to any one of the above first aspect.
In the sixth aspect, an embodiment of this application provides a chip system, the chip system including a processor coupled to a memory, the processor executing a computer program stored in the memory to implement the speech synthesis method according to any one of the above first aspect.
The chip system may be a single chip or a chip module composed of multiple chips.
Compared with the prior art, the embodiments of this application have the following beneficial effects:
The embodiments of this application determine the duration range of each phoneme corresponding to the text to be converted, then determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generate voice data according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of voice data of the same text to be converted, the phoneme duration of the same phoneme may take different values based on the same duration range, so a variety of different voice data can be synthesized, which avoids synthesizing the same voice data every time for the same text to be converted, reduces the mechanical nature of speech synthesis, and improves the naturalness and diversity of speech synthesis.
Description of the drawings
FIG. 1 is a schematic diagram of a speech synthesis scenario involved in the speech synthesis method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of another speech synthesis scenario involved in the speech synthesis method provided by an embodiment of this application;
FIG. 3 is a structural block diagram of a terminal device provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of determining the duration range of a phoneme provided by an embodiment of this application;
FIG. 6 is a schematic diagram of a duration range provided by an embodiment of this application;
FIG. 7 is a structural block diagram of a speech synthesis device provided by an embodiment of this application.
Detailed description of embodiments
In the following description, for the purpose of illustration rather than limitation, specific details such as specific system structures and technologies are set forth for a thorough understanding of the embodiments of this application. However, it should be clear to those skilled in the art that this application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted, so that unnecessary details do not obstruct the description of this application.
The terms used in the following embodiments are only for the purpose of describing specific embodiments, and are not intended to limit this application. As used in the specification and appended claims of this application, the singular expressions "a", "an", "said", "above", "the", and "this" are intended to also include expressions such as "one or more", unless the context clearly indicates the contrary. It should also be understood that in the embodiments of this application, "one or more" refers to one, two, or more than two; "and/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone, where A and B can be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The speech synthesis method provided by the embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA). The embodiments of this application do not impose any restrictions on the specific type of the terminal device.
For example, the terminal device may be a station (ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device with a wireless communication function, a computing device or another processing device connected to a wireless modem, an in-vehicle device, an Internet-of-Vehicles terminal, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, a wireless modem card, a television set top box (STB), customer premises equipment (CPE), and/or another device for communicating on a wireless system, as well as a next-generation communication system, for example, a mobile terminal in a 5G network or a mobile terminal in a future evolved public land mobile network (PLMN), etc.
As an example and not a limitation, when the terminal device is a wearable device, the wearable device may also be a general term for devices developed by applying wearable technology to the intelligent design of daily wear, such as glasses, gloves, watches, clothing, and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothes or accessories. A wearable device is not only a hardware device, but also realizes powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable smart devices include devices that are full-featured and large-sized and can realize complete or partial functions without relying on a smart phone, such as smart watches or smart glasses, as well as devices that only focus on a certain type of application function and need to be used in conjunction with other devices such as smart phones, such as various smart bracelets and smart jewelry for physical sign monitoring.
FIG. 1 is a schematic diagram of a speech synthesis scenario involved in the speech synthesis method provided by an embodiment of this application. Referring to FIG. 1, the speech synthesis scenario may include a terminal device 110. The terminal device 110 can obtain the text to be converted and adjust the phoneme duration of each phoneme corresponding to the text to be converted, so that different voice data can be generated based on the same text to be converted.
In a possible implementation, the terminal device 110 may obtain the text to be converted and input it into a pre-trained text analysis model and a pre-trained duration model respectively, to obtain the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme corresponding to the text to be converted, so that the duration range of each phoneme can be determined based on the normal distribution rule according to the average pronunciation duration, pronunciation duration variance, and pronunciation duration distribution density of each phoneme.
After that, the terminal device 110 may, based on the duration range of each phoneme, in combination with the text semantic information corresponding to each phoneme in the text to be converted and/or pre-stored user data including the user's age information and personality information, determine any duration in the duration range as the phoneme duration of each phoneme, and then generate voice data according to each phoneme duration and the text to be converted.
The average pronunciation duration represents the average of the phoneme durations of the same phoneme, the pronunciation duration variance represents the degree to which the phoneme durations of the same phoneme deviate from the average pronunciation duration, and the pronunciation duration distribution density represents the probability of the same phoneme taking different phoneme durations.
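For reference, these three quantities can be written in the standard textbook form below, over N observed durations d_1, ..., d_N of the same phoneme; the formulas are the usual definitions, not ones given by this embodiment:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} d_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (d_i - \mu)^2, \qquad
f(d) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(d - \mu)^2}{2\sigma^2}\right)
```

Under this reading, f(d) plays the role of the pronunciation duration distribution density: durations near the average μ are the most probable, and the variance σ² controls how quickly the probability falls off.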
In addition, a phoneme is the smallest phonetic unit divided according to the natural attributes of speech. Analyzed according to the pronunciation actions within a syllable, one action constitutes one phoneme. For example, taking the pronunciation rules of pinyin as an example, the initial corresponding to the pinyin of each character can be regarded as one phoneme and the final of the pinyin as another phoneme. For example, in "天气" (weather), the phonemes corresponding to the character "天" may include "t" and "ian", and the phonemes corresponding to the character "气" may include "q" and "i" (a toy decomposition of this example is sketched below).
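The following sketch writes out this example; the two-entry lookup table is a hypothetical stand-in for a full grapheme-to-pinyin dictionary.

```python
# Hypothetical initial/final table for the two characters of the example.
PINYIN = {"天": ("t", "ian"), "气": ("q", "i")}

def to_phonemes(text):
    """Split each character's pinyin into an initial and a final,
    treating each part as one phoneme."""
    phonemes = []
    for ch in text:
        initial, final = PINYIN[ch]
        phonemes.extend([initial, final])
    return phonemes

print(to_phonemes("天气"))  # ['t', 'ian', 'q', 'i']
```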
It should be noted that, in actual applications, referring to FIG. 2, the speech synthesis scenario may also include a server 120. The terminal device 110 may be connected to the server 120, so that the server 120 can convert the text to be converted and obtain different voice data based on the same text to be converted.
In the process of generating voice data, the terminal device 110 may first send the text to be converted to the server 120. The server 120 may determine the duration range of each phoneme according to the text to be converted, then determine the phoneme duration of each phoneme from each duration range in combination with the semantic information of the text to be converted and the pre-stored user data, generate voice data according to each phoneme duration and the text to be converted, and send the generated voice data to the terminal device 110. The terminal device 110 may then receive and play the voice data generated by the server 120.
For simplicity of description, the following embodiments only take the case where the speech synthesis scenario includes the terminal device 110 and does not include the server 120 as an example. In actual applications, the voice data can be obtained through conversion not only by the terminal device 110 but also by the server 120, which is not limited in the embodiments of this application.
FIG. 3 is a structural block diagram of a terminal device provided by an embodiment of this application. Referring to FIG. 3, the terminal device may include a processor 310, an external memory interface 320, an internal memory 321, a universal serial bus (USB) interface 330, a charging management module 340, a power management module 341, a battery 342, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 370A, a receiver 370B, a microphone 370C, an earphone interface 370D, a sensor module 380, buttons 390, a motor 391, an indicator 392, a camera 393, a display screen 394, a subscriber identification module (SIM) card interface 395, and the like. The sensor module 380 may include a pressure sensor 380A, a gyroscope sensor 380B, an air pressure sensor 380C, a magnetic sensor 380D, an acceleration sensor 380E, a distance sensor 380F, a proximity light sensor 380G, a fingerprint sensor 380H, a temperature sensor 380J, a touch sensor 380K, an ambient light sensor 380L, a bone conduction sensor 380M, and the like.
It can be understood that the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the terminal device. In other embodiments of this application, the terminal device may include more or fewer components than shown in the figure, or combine certain components, or split certain components, or have a different component arrangement. The illustrated components can be implemented in hardware, software, or a combination of software and hardware.
The processor 310 may include one or more processing units. For example, the processor 310 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be independent devices or integrated in one or more processors.
The controller may be the nerve center and command center of the terminal device. The controller can generate operation control signals according to instruction operation codes and timing signals to complete the control of fetching and executing instructions.
A memory may also be provided in the processor 310 to store instructions and data. In some embodiments, the memory in the processor 310 is a cache. The memory can store instructions or data that the processor 310 has just used or used cyclically. If the processor 310 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided, and the waiting time of the processor 310 is reduced, thus improving the efficiency of the system.
In some embodiments, the processor 310 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
The I2C interface is a bidirectional synchronous serial bus, including a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 310 may include multiple sets of I2C buses. The processor 310 may be respectively coupled to the touch sensor 380K, the charger, the flash, the camera 393, etc. through different I2C bus interfaces. For example, the processor 310 may be coupled to the touch sensor 380K through an I2C interface, so that the processor 310 and the touch sensor 380K communicate through the I2C bus interface to realize the touch function of the terminal device.
The I2S interface can be used for audio communication. In some embodiments, the processor 310 may include multiple sets of I2S buses. The processor 310 may be coupled with the audio module 370 through an I2S bus to implement communication between the processor 310 and the audio module 370. In some embodiments, the audio module 370 may transmit audio signals to the wireless communication module 360 through the I2S interface, so as to realize the function of answering calls through a Bluetooth headset.
The PCM interface can also be used for audio communication to sample, quantize, and encode analog signals. In some embodiments, the audio module 370 and the wireless communication module 360 may be coupled through a PCM bus interface. In some embodiments, the audio module 370 may also transmit audio signals to the wireless communication module 360 through the PCM interface, so as to realize the function of answering calls through a Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communication. The bus may be a two-way communication bus that converts the data to be transmitted between serial communication and parallel communication. In some embodiments, the UART interface is generally used to connect the processor 310 and the wireless communication module 360. For example, the processor 310 communicates with the Bluetooth module in the wireless communication module 360 through the UART interface to realize the Bluetooth function. In some embodiments, the audio module 370 may transmit audio signals to the wireless communication module 360 through the UART interface, so as to realize the function of playing music through a Bluetooth headset.
The MIPI interface can be used to connect the processor 310 with peripheral devices such as the display screen 394 and the camera 393. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), etc. In some embodiments, the processor 310 and the camera 393 communicate through the CSI interface to implement the shooting function of the terminal device. The processor 310 and the display screen 394 communicate through the DSI interface to implement the display function of the terminal device.
The GPIO interface can be configured through software. The GPIO interface can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 310 with the camera 393, the display screen 394, the wireless communication module 360, the audio module 370, the sensor module 380, and so on. The GPIO interface can also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, etc.
The USB interface 330 is an interface that complies with the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, etc. The USB interface 330 can be used to connect a charger to charge the terminal device, and can also be used to transfer data between the terminal device and peripheral devices. It can also be used to connect earphones and play audio through the earphones. The interface can also be used to connect other electronic devices, such as AR devices.
It can be understood that the interface connection relationship between the modules illustrated in the embodiment of the present invention is merely a schematic illustration, and does not constitute a structural limitation on the terminal device. In other embodiments of this application, the terminal device may also adopt an interface connection manner different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
The charging management module 340 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 340 may receive the charging input of a wired charger through the USB interface 330. In some wireless charging embodiments, the charging management module 340 may receive wireless charging input through the wireless charging coil of the terminal device. While charging the battery 342, the charging management module 340 can also supply power to the electronic device through the power management module 341.
The power management module 341 is used to connect the battery 342, the charging management module 340, and the processor 310. The power management module 341 receives input from the battery 342 and/or the charging management module 340, and supplies power to the processor 310, the internal memory 321, the external memory, the display screen 394, the camera 393, the wireless communication module 360, etc. The power management module 341 can also be used to monitor parameters such as battery capacity, battery cycle times, and battery health status (leakage, impedance). In some other embodiments, the power management module 341 may also be provided in the processor 310. In other embodiments, the power management module 341 and the charging management module 340 may also be provided in the same device.
The wireless communication function of the terminal device can be realized by the antenna 1, the antenna 2, the mobile communication module 350, the wireless communication module 360, the modem processor, the baseband processor, etc.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the terminal device can be used to cover a single or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, the antenna 1 can be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas can be used in combination with a tuning switch.
The mobile communication module 350 can provide wireless communication solutions including 2G/3G/4G/5G applied to the terminal device. The mobile communication module 350 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), etc. The mobile communication module 350 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 350 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation through the antenna 1. In some embodiments, at least part of the functional modules of the mobile communication module 350 may be provided in the processor 310. In some embodiments, at least part of the functional modules of the mobile communication module 350 and at least part of the modules of the processor 310 may be provided in the same device.
The modem processor may include a modulator and a demodulator. The modulator is used to modulate the low-frequency baseband signal to be sent into a medium- or high-frequency signal. The demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 370A, the receiver 370B, etc.), or displays an image or video through the display screen 394. In some embodiments, the modem processor may be an independent device. In other embodiments, the modem processor may be independent of the processor 310 and provided in the same device as the mobile communication module 350 or other functional modules.
The wireless communication module 360 can provide wireless communication solutions applied to the terminal device, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc. The wireless communication module 360 may be one or more devices integrating at least one communication processing module. The wireless communication module 360 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 310. The wireless communication module 360 can also receive the signal to be sent from the processor 310, perform frequency modulation and amplification on it, and convert it into electromagnetic waves for radiation through the antenna 2.
In some embodiments, the antenna 1 of the terminal device is coupled with the mobile communication module 350, and the antenna 2 is coupled with the wireless communication module 360, so that the terminal device can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
终端设备通过GPU,显示屏394,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏394和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器310可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。The terminal device realizes the display function through the GPU, the display screen 394, and the application processor. The GPU is an image processing microprocessor, which is connected to the display screen 394 and the application processor. The GPU is used to perform mathematical and geometric calculations and is used for graphics rendering. The processor 310 may include one or more GPUs, which execute program instructions to generate or change display information.
显示屏394用于显示图像,视频等。显示屏394包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,终端设备可以包括1个或N个显示屏394,N 为大于1的正整数。The display screen 394 is used to display images, videos, and the like. The display screen 394 includes a display panel. The display panel can adopt liquid crystal display (LCD), organic light-emitting diode (OLED), active matrix organic light-emitting diode or active-matrix organic light-emitting diode (active-matrix organic light-emitting diode). AMOLED, flexible light-emitting diode (FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diode (QLED), etc. In some embodiments, the terminal device may include one or N display screens 394, and N is a positive integer greater than one.
终端设备可以通过ISP,摄像头393,视频编解码器,GPU,显示屏394以及应用处理器等实现拍摄功能。The terminal device can realize the shooting function through ISP, camera 393, video codec, GPU, display screen 394 and application processor.
ISP用于处理摄像头393反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头393中。The ISP is used to process the data fed back by the camera 393. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transfers the electrical signal to the ISP for processing and is converted into an image visible to the naked eye. ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 393.
摄像头393用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,终端设备可以包括1个或N个摄像头393,N为大于1的正整数。The camera 393 is used to capture still images or videos. The object generates an optical image through the lens and is projected to the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal. ISP outputs digital image signals to DSP for processing. DSP converts digital image signals into standard RGB, YUV and other formats of image signals. In some embodiments, the terminal device may include 1 or N cameras 393, and N is a positive integer greater than 1.
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当终端设备在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the terminal device selects the frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
The video codec is used to compress or decompress digital video. The terminal device may support one or more video codecs, so that it can play or record videos in multiple encoding formats, such as Moving Picture Experts Group (MPEG)-1, MPEG-2, MPEG-3, and MPEG-4.

The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer mode between neurons in the human brain, it processes input information quickly and can also learn continuously. Applications involving intelligent cognition of the terminal device, such as image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.

The external memory interface 320 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device. The external memory card communicates with the processor 310 through the external memory interface 320 to implement a data-storage function, for example saving files such as music and videos on the external memory card.

The internal memory 321 may be used to store computer-executable program code, where the executable program code includes instructions. The processor 310 executes the various functional applications and data processing of the terminal device by running the instructions stored in the internal memory 321. The internal memory 321 may include a program-storage area and a data-storage area. The program-storage area may store an operating system and the application programs required by at least one function (for example, a sound-playback function or an image-playback function). The data-storage area may store data created during use of the terminal device (for example, audio data or a phone book). In addition, the internal memory 321 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).

The terminal device can implement audio functions, for example music playback and recording, through the audio module 370, the speaker 370A, the receiver 370B, the microphone 370C, the earphone interface 370D, the application processor, and the like.

The audio module 370 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 370 may also be used to encode and decode audio signals. In some embodiments, the audio module 370 may be disposed in the processor 310, or some functional modules of the audio module 370 may be disposed in the processor 310.

The speaker 370A, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal. The terminal device can play music or take a hands-free call through the speaker 370A.

The receiver 370B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal. When the terminal device answers a call or plays a voice message, the voice can be heard by bringing the receiver 370B close to the ear.

The microphone 370C, also called a "mic" or "mouthpiece", is used to convert a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 370C to input the sound signal. The terminal device may be provided with at least one microphone 370C. In other embodiments, the terminal device may be provided with two microphones 370C, which can implement a noise-reduction function in addition to collecting sound signals. In still other embodiments, the terminal device may be provided with three, four, or more microphones 370C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.

The earphone interface 370D is used to connect wired earphones. The earphone interface 370D may be the USB interface 330, a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.

The pressure sensor 380A is used to sense a pressure signal and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 380A may be disposed on the display screen 394. There are many types of pressure sensor 380A, such as resistive, inductive, and capacitive pressure sensors. A capacitive pressure sensor may include at least two parallel plates of conductive material; when a force acts on the pressure sensor 380A, the capacitance between the electrodes changes, and the terminal device determines the intensity of the pressure from the change in capacitance. When a touch operation acts on the display screen 394, the terminal device detects the intensity of the touch operation through the pressure sensor 380A, and may also calculate the touch position from the detection signal of the pressure sensor 380A. In some embodiments, touch operations that act on the same touch position but with different intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the short-message application icon, an instruction to view the short message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the short-message application icon, an instruction to create a new short message is executed.
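For illustration only, the threshold-based dispatch just described can be sketched as follows; the threshold value and the instruction names are assumptions, not values from the patent:

```python
FIRST_PRESSURE_THRESHOLD = 0.5  # assumed normalized pressure value

def on_message_icon_touch(pressure: float) -> str:
    """Map the touch pressure on the SMS icon to an operation instruction."""
    if pressure < FIRST_PRESSURE_THRESHOLD:
        return "view_short_message"      # light press: open the message
    return "create_short_message"        # firm press: compose a new message

print(on_message_icon_touch(0.3))  # "view_short_message"
```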
The gyroscope sensor 380B may be used to determine the motion posture of the terminal device. In some embodiments, the angular velocities of the terminal device around three axes (that is, the x, y, and z axes) can be determined through the gyroscope sensor 380B. The gyroscope sensor 380B can be used for image stabilization during shooting. For example, when the shutter is pressed, the gyroscope sensor 380B detects the angle at which the terminal device shakes, calculates the distance that the lens module needs to compensate for according to that angle, and lets the lens counteract the shake of the terminal device through a reverse movement, thereby achieving stabilization. The gyroscope sensor 380B can also be used for navigation and motion-sensing game scenarios.

The barometric pressure sensor 380C is used to measure air pressure. In some embodiments, the terminal device calculates altitude from the pressure value measured by the barometric pressure sensor 380C, to assist positioning and navigation.

The magnetic sensor 380D includes a Hall sensor. The terminal device can use the magnetic sensor 380D to detect the opening and closing of a flip holster. In some embodiments, when the terminal device is a flip phone, the terminal device can detect the opening and closing of the flip cover through the magnetic sensor 380D, and then set features such as automatic unlocking upon flip-open according to the detected open or closed state of the holster or of the flip cover.

The acceleration sensor 380E can detect the magnitude of the terminal device's acceleration in various directions (generally along three axes), and can detect the magnitude and direction of gravity when the terminal device is stationary. It can also be used to identify the posture of the electronic device, and is applied in scenarios such as landscape/portrait switching and pedometers.

The distance sensor 380F is used to measure distance. The terminal device can measure distance by infrared or laser. In some embodiments, in a shooting scenario, the terminal device can use the distance sensor 380F to measure distance for fast focusing.

The proximity light sensor 380G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The light-emitting diode may be an infrared LED. The terminal device emits infrared light outward through the LED and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, the terminal device can determine that there is an object near it; when insufficient reflected light is detected, the terminal device can determine that there is no object near it. The terminal device can use the proximity light sensor 380G to detect that the user is holding the device close to the ear during a call, so that the screen is automatically turned off to save power. The proximity light sensor 380G can also be used for automatic unlocking and screen locking in holster mode and pocket mode.
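A minimal sketch of this proximity logic, assuming a normalized photodiode reading and an arbitrary threshold (both are illustrative, not from the patent):

```python
REFLECTION_THRESHOLD = 0.8  # assumed reading above which an object is "near"

def object_nearby(reflected_ir_level: float) -> bool:
    """Decide from the photodiode reading whether an object is close by."""
    return reflected_ir_level >= REFLECTION_THRESHOLD

def screen_action_during_call(reflected_ir_level: float) -> str:
    # Device held to the ear: darken the screen to save power; otherwise wake it.
    return "screen_off" if object_nearby(reflected_ir_level) else "screen_on"

print(screen_action_during_call(0.93))  # "screen_off"
```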
The ambient light sensor 380L is used to sense ambient brightness. The terminal device can adaptively adjust the brightness of the display screen 394 according to the perceived ambient brightness. The ambient light sensor 380L can also be used to automatically adjust the white balance when taking photos, and can cooperate with the proximity light sensor 380G to detect whether the terminal device is in a pocket, to prevent accidental touches.

The fingerprint sensor 380H is used to collect fingerprints. The terminal device can use the collected fingerprint characteristics to implement fingerprint unlocking, application-lock access, fingerprint photographing, fingerprint call answering, and so on.

The temperature sensor 380J is used to detect temperature. In some embodiments, the terminal device executes a temperature-handling policy using the temperature detected by the temperature sensor 380J. For example, when the temperature reported by the temperature sensor 380J exceeds a threshold, the terminal device reduces the performance of a processor located near the temperature sensor 380J, so as to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, the terminal device heats the battery 342 to avoid an abnormal shutdown caused by low temperature. In still other embodiments, when the temperature is below yet another threshold, the terminal device boosts the output voltage of the battery 342 to avoid an abnormal shutdown caused by low temperature.
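The three thresholds above are unspecified in the patent; a minimal sketch of such a policy, with assumed threshold values and action names, might look like:

```python
HIGH_TEMP_C = 45.0       # assumed thermal-protection threshold
LOW_TEMP_C = 0.0         # assumed battery-heating threshold
VERY_LOW_TEMP_C = -10.0  # assumed voltage-boost threshold

def temperature_policy(temp_c: float) -> list:
    """Return the actions the device takes for a reported temperature."""
    actions = []
    if temp_c > HIGH_TEMP_C:
        actions.append("throttle_nearby_processor")   # thermal protection
    if temp_c < LOW_TEMP_C:
        actions.append("heat_battery_342")            # avoid cold shutdown
    if temp_c < VERY_LOW_TEMP_C:
        actions.append("boost_battery_output_voltage")
    return actions

print(temperature_policy(-12.0))  # both low-temperature actions fire
```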
The touch sensor 380K is also called a "touch panel". The touch sensor 380K may be disposed on the display screen 394; together, the touch sensor 380K and the display screen 394 form a touchscreen, also called a "touch screen". The touch sensor 380K is used to detect touch operations on or near it, and may pass a detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation may be provided through the display screen 394. In other embodiments, the touch sensor 380K may also be disposed on the surface of the terminal device, at a position different from that of the display screen 394.

The bone conduction sensor 380M can acquire vibration signals. In some embodiments, the bone conduction sensor 380M can acquire the vibration signal of the vibrating bone mass of the human vocal part. The bone conduction sensor 380M can also contact the human pulse and receive the blood-pressure beating signal. In some embodiments, the bone conduction sensor 380M may also be disposed in an earphone to form a bone-conduction earphone. The audio module 370 can parse a speech signal from the vibration signal of the vocal-part bone mass acquired by the bone conduction sensor 380M, to implement a speech function; the application processor can parse heart-rate information from the blood-pressure beating signal acquired by the bone conduction sensor 380M, to implement a heart-rate detection function.

The buttons 390 include a power button, volume buttons, and so on. The buttons 390 may be mechanical buttons or touch buttons. The terminal device can receive button input and generate button signal input related to user settings and function control of the terminal device.

The motor 391 can generate vibration prompts. The motor 391 can be used for incoming-call vibration prompts as well as for touch vibration feedback. For example, touch operations acting on different applications (such as photographing or audio playback) can correspond to different vibration-feedback effects, and touch operations acting on different areas of the display screen 394 can also correspond to different vibration-feedback effects. Different application scenarios (for example, time reminders, receiving messages, alarm clocks, and games) can likewise correspond to different vibration-feedback effects, and the touch vibration-feedback effect can also be customized.

The indicator 392 may be an indicator light, which can be used to indicate the charging status and battery-level changes, and can also be used to indicate messages, missed calls, notifications, and so on.

The SIM card interface 395 is used to connect a SIM card. A SIM card can be inserted into or pulled out of the SIM card interface 395 to be brought into contact with or separated from the terminal device. The terminal device can support 1 or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 395 can support a Nano-SIM card, a Micro-SIM card, a SIM card, and so on. Multiple cards can be inserted into the same SIM card interface 395 at the same time; the cards may be of the same type or of different types. The SIM card interface 395 is also compatible with different types of SIM cards and with external memory cards. The terminal device interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the terminal device adopts an eSIM, that is, an embedded SIM card; the eSIM card can be embedded in the terminal device and cannot be separated from it.
Fig. 4 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application. As an example and not a limitation, the method can be applied to the terminal device described above. Referring to Fig. 4, the method includes:

Step 401: Determine the duration range of each phoneme corresponding to the text to be converted.

In the process of generating voice data, the terminal device can determine the phoneme duration of each phoneme corresponding to the text to be converted and, combined with the language features of the text to be converted, generate the voice data through a preset acoustic model and vocoder. Moreover, when determining the phoneme durations, the duration range of each phoneme can be determined first, so that in subsequent steps different phoneme durations can be selected within that range, improving the naturalness and diversity of the generated voice data.

In a specific implementation, the terminal device can first extract each phoneme from the text to be converted, then obtain the pronunciation information of each phoneme based on a pre-trained model, and finally determine the duration range of each phoneme from its pronunciation information. Referring to Fig. 5, step 401 may include step 401a and step 401b.

401a. Determine the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme corresponding to the text to be converted.

The terminal device can input the text to be converted into a pre-trained model, extract each phoneme of the text through the model, and also analyze the text to determine the pronunciation information of each phoneme, such as the pronunciation duration, the pronunciation-duration variance, and the pronunciation-duration distribution density, so that in subsequent steps the pronunciation range of each phoneme can be determined from this pronunciation information.

When extracting phonemes through the model, the terminal device can first split the text to be converted into multiple characters arranged in order, and then extract at least one phoneme for each character based on that character's pronunciation rules. Once the phonemes of every character have been extracted, the multiple phonemes of the text to be converted are obtained.

For example, based on the pronunciation rules of Pinyin, the terminal device can take the initial and the final corresponding to each character as that character's phonemes. If the text to be converted is "今天天气好晴朗" ("it is nice and sunny today"), the phonemes of the text to be converted can be "j", "in", "t", "ian", "t", "ian", "q", "i", "h", "ao", "q", "ing", "l", and "ang".
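As a minimal sketch of that initial/final split (an illustration, not the patent's implementation; the initial list and the toy lexicon for the example sentence are assumptions, with tones omitted):

```python
# Pinyin initials, multi-letter ones first so "zh"/"ch"/"sh" match before "z"/"c"/"s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(pinyin: str) -> list:
    """Split one Pinyin syllable into [initial, final] (or [final] alone)."""
    for ini in INITIALS:
        if pinyin.startswith(ini):
            return [ini, pinyin[len(ini):]]
    return [pinyin]  # zero-initial syllable, e.g. "an"

# Toy lexicon for "今天天气好晴朗".
sentence = ["jin", "tian", "tian", "qi", "hao", "qing", "lang"]
phonemes = [p for syll in sentence for p in split_syllable(syll)]
print(phonemes)
# ['j', 'in', 't', 'ian', 't', 'ian', 'q', 'i', 'h', 'ao', 'q', 'ing', 'l', 'ang']
```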
In addition, since the pronunciation information of each phoneme can include different types of information, the terminal device can input the text to be converted into different models to obtain different pronunciation information. For example, the terminal device can input the text to be converted into a preset text analysis model to obtain the pronunciation-duration distribution density of each phoneme output by the text analysis model, where the text analysis model can be a deep neural network (DNN) model. And/or, the terminal device can input the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation-duration variance of each phoneme output by the duration model.

401b. Determine the duration range of each phoneme according to its average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density.

After obtaining the pronunciation information of each phoneme, the terminal device can substitute the different parameters included in the pronunciation information into a preset calculation formula to obtain the duration range of each phoneme. For example, the phoneme duration of each phoneme can be assumed to follow a normal distribution; the duration range of each phoneme can then be determined from its average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density by calculating with the normal-distribution formula.

For example, referring to Fig. 6, suppose the average pronunciation duration of the x-th phoneme is t(x), its pronunciation-duration variance is std²(x), and its pronunciation-duration distribution density is p(x), so that p(x) = N(t(x), std²(x)). Solving this equation for x yields x1 and x2; if x1 is less than x2, the interval [x1, x2] can be taken as the duration range of the x-th phoneme.
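A minimal numeric sketch of this step (an illustration, not the patent's implementation): solving the normal-density equation p = N(x; t, std²) for x gives two points symmetric about the mean, which bound the duration range. The example values of t, std, and p below are assumptions.

```python
import math

def duration_range(t: float, std: float, p: float):
    """Solve p = normal_pdf(x; mean=t, std) for x; return the interval (x1, x2).

    The density level p must satisfy 0 < p <= 1 / (std * sqrt(2*pi)),
    otherwise the equation has no real solution.
    """
    # normal_pdf(x) = exp(-(x - t)^2 / (2*std^2)) / (std * sqrt(2*pi))
    offset_sq = -2.0 * std ** 2 * math.log(p * std * math.sqrt(2.0 * math.pi))
    if offset_sq < 0:
        raise ValueError("density level p exceeds the peak of the distribution")
    offset = math.sqrt(offset_sq)
    return t - offset, t + offset

# Assumed values: mean duration 100 ms, std 20 ms, density level 0.01.
x1, x2 = duration_range(t=100.0, std=20.0, p=0.01)
print(f"duration range ≈ [{x1:.1f} ms, {x2:.1f} ms]")  # ≈ [76.5 ms, 123.5 ms]
```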
Step 402: Determine any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme.

After obtaining the duration range of each phoneme, the terminal device can determine the phoneme duration of each phoneme based on the text semantic information of that phoneme in the text to be converted, based on the user's personality and age, or at random, so that the terminal device can generate different voice data from the same text to be converted.

Optionally, for each phoneme, the terminal device can obtain the phoneme's text semantic information according to the position, in the text to be converted, of the character to which the phoneme corresponds, and then determine the phoneme's duration based on the phoneme's duration range and its text semantic information.

In a possible implementation, for each phoneme, the terminal device can first determine the character to which the phoneme corresponds, then locate the sentence containing that character in the text to be converted, analyze the semantics of that sentence and, combined with the semantics expressed by all sentences in the text to be converted, determine the phoneme's text semantic information. The terminal device can then select the phoneme's duration from the multiple durations within the duration range according to this text semantic information.

For example, if a phoneme's text semantic information expresses a happy mood, a short duration can be selected from the duration range as that phoneme's duration.
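One way to realize this selection, sketched under assumed mappings (the mood labels and their positions within the range are illustrative, not from the patent):

```python
import random

# Assumed mapping from mood to a relative position inside the duration range:
# 0.0 picks the shortest admissible duration, 1.0 the longest.
MOOD_POSITION = {"happy": 0.2, "neutral": 0.5, "calm": 0.8}

def pick_duration(x1: float, x2: float, mood: str = "") -> float:
    """Pick a phoneme duration from [x1, x2], biased by mood when known."""
    if mood in MOOD_POSITION:
        return x1 + MOOD_POSITION[mood] * (x2 - x1)
    return random.uniform(x1, x2)  # no semantic cue: choose at random

print(pick_duration(76.5, 123.5, mood="happy"))  # a short, brisk duration
```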
Alternatively, the terminal device can obtain user data, which can include the user's age information and personality information; the terminal device can then determine the phoneme duration of each phoneme from within its duration range according to the user's age and personality, thereby generating voice information matched to the user.

In a possible implementation, the terminal device can obtain pre-stored user data, or request user data from a server, determine from the user data the voice type that matches the user, and then select each phoneme's duration from the multiple durations within its duration range according to the user's voice type.

For example, if the user data indicates that the user is middle-aged with a calm personality, the voice type matching that user can be an unhurried one, and accordingly a long duration can be selected from the duration range as the phoneme duration.
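A companion sketch for the user-data path; the profile fields, age thresholds, and pace values are all assumptions made for the example:

```python
def voice_pace(age: int, personality: str) -> float:
    """Return a relative position inside the duration range for this user."""
    if age >= 40 and personality == "calm":
        return 0.85   # unhurried: pick durations near the top of the range
    if age < 25:
        return 0.3    # brisk
    return 0.5        # neutral pace

def user_matched_duration(x1: float, x2: float, age: int, personality: str) -> float:
    return x1 + voice_pace(age, personality) * (x2 - x1)

print(user_matched_duration(76.5, 123.5, age=45, personality="calm"))  # long duration
```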
It should be noted that, in practical applications, the user data can also include other information indicating the user's speaking style. For example, the user data can include search data indicating the user's mood, or shopping data indicating whether the user has recently purchased goods. The embodiments of this application do not limit this.

In addition, if the terminal device can obtain both the text semantic information and the user data, it can further determine each phoneme's duration according to the respective weights of the text semantic information and the user data. However, if the terminal device can obtain neither the text semantic information nor the user data, it can treat each phoneme in the text to be converted as following a normal distribution and determine each phoneme's duration according to the rules of the normal distribution. The embodiments of this application do not limit the way the phoneme duration is determined.
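The weighted combination could look like the following sketch, which blends the two relative positions from the previous examples; the weights are assumptions, since the patent leaves them unspecified:

```python
def combined_duration(x1: float, x2: float,
                      semantic_pos: float, user_pos: float,
                      w_semantic: float = 0.6, w_user: float = 0.4) -> float:
    """Blend the semantic and user-profile positions with assumed weights."""
    pos = w_semantic * semantic_pos + w_user * user_pos
    return x1 + pos * (x2 - x1)

# e.g. happy text (position 0.2) spoken for a calm middle-aged user (position 0.85)
print(combined_duration(76.5, 123.5, semantic_pos=0.2, user_pos=0.85))
```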
Step 403: Generate voice data according to the text to be converted and the phoneme duration of each phoneme.

In the process of synthesizing the voice data, the terminal device can generate the voice data from the phoneme durations in different ways. For example, the terminal device can generate the voice data using a parametric method, a concatenative (splicing) method, or an end-to-end method; whichever way is used to generate the voice data, the phoneme duration of each phoneme corresponding to the text to be converted can be determined in the manner described above.

Taking the parametric method as an example, the terminal device can first determine each phoneme's duration as described above, then input the phoneme durations together with the extracted language features of the text to be converted into the acoustic model to obtain parameters, such as the fundamental frequency, used for generating the voice data, and finally generate the voice data through the vocoder from the fundamental frequency and the other parameters.
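The overall parametric pipeline, as a structural sketch: the stub class below stands in for the trained text-analysis, duration, acoustic, and vocoder models, which the patent does not specify; everything about the stubs is assumed.

```python
import random

class StubModel:
    """Trivial stand-ins so the sketch runs end to end."""
    def analyze(self, text):                        # -> (phonemes, language features)
        return list(text), [0.0] * len(text)
    def duration_ranges(self, phonemes):            # step 401: one range per phoneme
        return [(80.0, 120.0) for _ in phonemes]
    def predict(self, features, durations):         # acoustic model: f0 etc.
        return {"f0": 220.0, "durations": durations}
    def generate(self, params):                     # vocoder: params -> waveform
        return b"\x00" * int(sum(params["durations"]))  # fake PCM bytes

def synthesize(text, analyzer, duration_model, acoustic_model, vocoder):
    phonemes, feats = analyzer.analyze(text)                      # feature extraction
    ranges = duration_model.duration_ranges(phonemes)             # step 401
    durations = [random.uniform(lo, hi) for lo, hi in ranges]     # step 402
    params = acoustic_model.predict(feats, durations)             # acoustic model
    return vocoder.generate(params)                               # step 403

m = StubModel()
audio = synthesize("jin tian tian qi hao qing lang", m, m, m, m)
print(len(audio), "bytes of (stub) audio")
```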
The process of generating voice data with the concatenative method or the end-to-end method is similar to the parametric process above and is not repeated here.

In summary, in the speech synthesis method provided by the embodiments of this application, the duration range of each phoneme corresponding to the text to be converted is determined, any duration within each phoneme's duration range is then determined as the corresponding phoneme's duration, and finally the voice data is generated from the text to be converted and each phoneme's duration. Across multiple pieces of voice data generated for the same text to be converted, the duration of the same phoneme may take different values within the same duration range, so a variety of different voice data can be synthesized. This avoids synthesizing identical voice data every time for the same text to be converted, reduces the mechanical quality of the synthesized speech, and improves the naturalness and diversity of speech synthesis.

Moreover, by determining the duration range of each phoneme and selecting the phoneme's duration within that range, the selected value cannot deviate excessively, which avoids abnormal voice data caused by a phoneme duration that is too long or too short, and improves the stability of speech synthesis.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

Corresponding to the speech synthesis method described in the foregoing embodiments, Fig. 7 is a structural block diagram of a speech synthesis apparatus provided by an embodiment of this application. For ease of description, only the parts related to the embodiments of this application are shown.

Referring to Fig. 7, the apparatus includes:

a range determining module 701, configured to determine the duration range of each phoneme corresponding to the text to be converted;

a duration determining module 702, configured to determine any duration within the duration range as the phoneme duration of each phoneme; and

a generating module 703, configured to generate voice data according to the text to be converted and the phoneme duration of each phoneme.

Optionally, the range determining module 701 is specifically configured to determine the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme corresponding to the text to be converted, and to determine the duration range of each phoneme according to its average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density.

Optionally, the range determining module 701 is further specifically configured to input the text to be converted into a preset text analysis model to obtain the pronunciation-duration distribution density of each phoneme output by the text analysis model, and to input the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation-duration variance of each phoneme output by the duration model.

Optionally, the range determining module 701 is further specifically configured to determine the duration range of each phoneme through a normal-distribution algorithm according to the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme.

Optionally, the duration determining module 702 is specifically configured to: for each phoneme, obtain the phoneme's text semantic information according to the position, in the text to be converted, of the character to which the phoneme corresponds, and determine the phoneme's duration based on the phoneme's duration range and its text semantic information.

Optionally, the duration determining module 702 is specifically configured to obtain user data, the user data including the user's age information and personality information, and to determine each phoneme's duration based on the phoneme's duration range and the user data.

Optionally, the generating module 703 is specifically configured to generate the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.

In summary, the speech synthesis apparatus provided by the embodiments of this application determines the duration range of each phoneme corresponding to the text to be converted, then determines any duration within each phoneme's duration range as the corresponding phoneme's duration, and finally generates the voice data from the text to be converted and each phoneme's duration. Across multiple pieces of voice data generated for the same text to be converted, the duration of the same phoneme may take different values within the same duration range, so a variety of different voice data can be synthesized. This avoids synthesizing identical voice data every time for the same text to be converted, reduces the mechanical quality of the synthesized speech, and improves the naturalness and diversity of speech synthesis.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is used only as an example. In practical applications, the above functions can be allocated to different functional units and modules as required; that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for ease of mutual distinction and are not used to limit the protection scope of this application. For the specific working process of the units and modules in the foregoing system, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.

In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered as going beyond the scope of this application.

In the embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the system embodiment described above is merely illustrative. For example, the division into modules or units is only a division by logical function; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the foregoing embodiments of this application may be completed by a computer program instructing relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the foregoing method embodiments. The computer program includes computer program code, which may be in source-code form, object-code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals or telecommunications signals.

The foregoing embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims (10)

  1. A speech synthesis method, characterized in that it comprises:
    determining the duration range of each phoneme corresponding to a text to be converted;
    determining any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme; and
    generating voice data according to the text to be converted and the phoneme duration of each phoneme.
  2. The speech synthesis method according to claim 1, wherein the determining the duration range of each phoneme corresponding to the text to be converted comprises:
    determining the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme corresponding to the text to be converted; and
    determining the duration range of each phoneme according to the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme.
  3. The speech synthesis method according to claim 2, wherein the determining the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme corresponding to the text to be converted comprises:
    inputting the text to be converted into a preset text analysis model to obtain the pronunciation-duration distribution density of each phoneme output by the text analysis model; and
    inputting the text to be converted into a preset duration model to obtain the average pronunciation duration and pronunciation-duration variance of each phoneme output by the duration model.
  4. The speech synthesis method according to claim 2, wherein the determining the duration range of each phoneme according to the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme comprises:
    determining the duration range of each phoneme through a normal-distribution algorithm according to the average pronunciation duration, pronunciation-duration variance, and pronunciation-duration distribution density of each phoneme.
  5. The speech synthesis method according to claim 1, wherein the determining any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme comprises:
    for each phoneme, obtaining the text semantic information of the phoneme according to the position, in the text to be converted, of the character corresponding to the phoneme; and
    determining the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
  6. The speech synthesis method according to claim 1, wherein the determining any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme comprises:
    obtaining user data, the user data comprising age information and personality information of the user; and
    determining the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
  7. The speech synthesis method according to any one of claims 1 to 6, wherein the generating voice data according to the text to be converted and the phoneme duration of each phoneme comprises:
    generating the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
  8. A speech synthesis apparatus, characterized in that it comprises:
    a range determining module, configured to determine the duration range of each phoneme corresponding to a text to be converted;
    a duration determining module, configured to determine any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme; and
    a generating module, configured to generate voice data according to the text to be converted and the phoneme duration of each phoneme.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
PCT/CN2021/080403 2020-05-26 2021-03-12 Speech synthesis method and device WO2021238338A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010456116.1 2020-05-26
CN202010456116.1A CN113793589A (en) 2020-05-26 2020-05-26 Speech synthesis method and device

Publications (1)

Publication Number Publication Date
WO2021238338A1 true WO2021238338A1 (en) 2021-12-02

Family

ID=78745521

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/080403 WO2021238338A1 (en) 2020-05-26 2021-03-12 Speech synthesis method and device

Country Status (2)

Country Link
CN (1) CN113793589A (en)
WO (1) WO2021238338A1 (en)


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100223028B1 (en) * 1996-12-14 1999-10-01 정선종 Apparatus and method for modelling the duration time of speech synthesizer
JP3854713B2 (en) * 1998-03-10 2006-12-06 キヤノン株式会社 Speech synthesis method and apparatus and storage medium
JP4603375B2 (en) * 2005-02-01 2010-12-22 日本放送協会 Duration time length generating device and duration time length generating program
CN107705782B (en) * 2017-09-29 2021-01-05 百度在线网络技术(北京)有限公司 Method and device for determining phoneme pronunciation duration
CN108597492B (en) * 2018-05-02 2019-11-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN109599092B (en) * 2018-12-21 2022-06-10 秒针信息技术有限公司 Audio synthesis method and device
CN109979428B (en) * 2019-04-02 2021-07-23 北京地平线机器人技术研发有限公司 Audio generation method and device, storage medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0580791A (en) * 1991-09-20 1993-04-02 Hitachi Ltd Device and method for speech rule synthesis
US20020138270A1 (en) * 1997-12-18 2002-09-26 Apple Computer, Inc. Method and apparatus for improved duration modeling of phonemes
CN106601226A (en) * 2016-11-18 2017-04-26 中国科学院自动化研究所 Phoneme duration prediction modeling method and phoneme duration prediction method
CN107481715A (en) * 2017-09-29 2017-12-15 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN110992927A (en) * 2019-12-11 2020-04-10 广州酷狗计算机科技有限公司 Audio generation method and device, computer readable storage medium and computing device

Also Published As

Publication number Publication date
CN113793589A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
WO2021244457A1 (en) Video generation method and related apparatus
WO2021036568A1 (en) Fitness-assisted method and electronic apparatus
EP4270382A1 (en) Text data processing method and apparatus
WO2021258814A1 (en) Video synthesis method and apparatus, electronic device, and storage medium
WO2022193989A1 (en) Operation method and apparatus for electronic device and electronic device
CN111742539B (en) Voice control command generation method and terminal
WO2021052139A1 (en) Gesture input method and electronic device
CN114242037A (en) Virtual character generation method and device
WO2021068926A1 (en) Model updating method, working node, and model updating system
WO2021169351A1 (en) Method and apparatus for anaphora resolution, and electronic device
CN112256868A (en) Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment
WO2022062884A1 (en) Text input method, electronic device, and computer-readable storage medium
CN113593567A (en) Method for converting video and sound into text and related equipment
CN114444000A (en) Page layout file generation method and device, electronic equipment and readable storage medium
CN109285563B (en) Voice data processing method and device in online translation process
WO2022022319A1 (en) Image processing method, electronic device, image processing system and chip system
WO2022007757A1 (en) Cross-device voiceprint registration method, electronic device and storage medium
WO2022214004A1 (en) Target user determination method, electronic device and computer-readable storage medium
CN113380240B (en) Voice interaction method and electronic equipment
WO2022078116A1 (en) Brush effect picture generation method, image editing method and device, and storage medium
WO2021238338A1 (en) Speech synthesis method and device
CN114120987B (en) Voice wake-up method, electronic equipment and chip system
CN115393676A (en) Gesture control optimization method and device, terminal and storage medium
CN114822525A (en) Voice control method and electronic equipment
CN114528842A (en) Word vector construction method, device and equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21813421

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21813421

Country of ref document: EP

Kind code of ref document: A1